Logistic regression, alongside linear regression, is one of the most widely used machine learning algorithms in real production settings. Here, we present a comprehensive analysis of logistic regression, which can be used as a guide for beginners and advanced data scientists alike.
Logistic regression is a popular machine learning approach for classification tasks, which is one reason it is so widely adopted in production settings.
Logistic regression is a machine learning algorithm used to predict the probability that an observation belongs to one of two possible classes.
What does that mean in practice?
We could use the logistic regression algorithm to predict the following:
How does logistic regression make predictions?
We train the model by feeding it input data and a binary class to which this data belongs.
For example, we would input the email subject line (“A Nigerian prince needs your help”) into the model with the accompanying class (“spam”). The model learns the patterns between the incoming data and the desired output as a mapping (i.e., when the input is “x”, predict “y”).
The trained logistic regression model can then be applied to new input data that it did not see during training.
Let’s look at a concrete example.
Imagine that you’re tasked with predicting whether or not a client of your bank will default on their loan repayments. The first thing to do is construct a dataset of historical client defaults. The data would contain client demographic information (e.g. age, gender, location), their financial information (e.g. loan size, number of times a payment was overdue), and whether they ended up defaulting on a loan or repaying it.
The “Yes” and “No” categories can be recoded into 1 and 0 for the target variable (computers deal better with numbers than words):
After this, we would train a logistic regression model, which would learn a mapping between the input variables (age, gender, loan size) and the expected output (defaulted). We could use the logistic regression model to predict the default probability on three new customers:
So, what does the new column Predicted default tell us? It states the probability of each of the new customers belonging to class 1 (defaulted on loan). We could come up with a threshold value (let’s say 0.5) and anything above that decision threshold would be default behavior (i.e. Customer 5 would be predicted to default on their loan payments, while Customers 4 and 6 would be predicted to repay them).
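The workflow described above can be sketched in a few lines with Scikit-learn. This is a minimal illustration, not the original dataset: the feature columns, the six training rows, and the three new customers are all made-up values chosen only to show the mechanics of training, predicting probabilities, and applying a 0.5 threshold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data (invented for illustration).
# Columns: age, loan size in thousands of dollars, number of late payments.
X_train = np.array([
    [25, 50, 4],
    [47, 20, 0],
    [35, 35, 3],
    [52, 10, 0],
    [29, 45, 5],
    [41, 15, 1],
])
# Target: 1 = defaulted, 0 = repaid.
y_train = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Three hypothetical new customers the model has not seen before.
X_new = np.array([
    [45, 12, 0],
    [27, 48, 5],
    [50, 18, 1],
])

# predict_proba returns one column per class; column 1 is P(class 1 = default).
default_probability = model.predict_proba(X_new)[:, 1]

# Apply the 0.5 decision threshold to turn probabilities into classes.
predicted_class = (default_probability >= 0.5).astype(int)

for prob, cls in zip(default_probability, predicted_class):
    print(f"predicted default probability: {prob:.2f} -> class {cls}")
```

The threshold does not have to be 0.5. A bank that wants to be more cautious could lower it and flag more customers as likely to default.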
Business applications for logistic regression involve predicting future membership in a certain category. Because logistic regression is so popular, it has been used in a wide variety of business settings:
The machine learning model is favored in real-life production settings for several reasons:
The benefits of logistic regression from an engineering perspective make it more favorable than other, more advanced machine learning algorithms.
Logistic regression is a supervised machine learning classification algorithm. Let’s break it down a little:
Logistic regression is just one of the many classification algorithms. There are several other classification techniques that we have at our disposal when predicting class membership:
As well as being a machine learning model, logistic regression is a well-established and widely used statistical model. Although we will be focusing on the machine learning side of things, we will also draw some parallels to its statistical background to provide you with a complete picture. No need to worry, though - you won’t need to brush up on calculus or linear algebra to follow along!
Once trained, the model takes the form of a logistic regression equation:
In this equation:
Let’s break down the entire model into the linear model and the accompanying sigmoid function in order to understand how logistic regression predicts probabilities of an example belonging to the default class.
The linear model is part of the logistic regression. It represents a linear relationship between the input features and the predicted output. The linear part of the entire model can be summarized with the equation:
What does each component mean here?
So, why wouldn’t we just use the linear model to make predictions about class membership, as we did with linear regression? Let’s look at an example.
Imagine that we have the following table for the number of late payments made by a customer (x) and whether the customer later defaulted on their loan (y).
We could model the data with a linear regression in the following way:
There are a couple of problems here:
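One problem with a straight-line fit can be reproduced with a few lines of NumPy. The sketch below uses invented data for late payments and defaults (not the table from this article) and fits an ordinary least-squares line with `np.polyfit`. The resulting "probabilities" fall outside the 0 to 1 range.

```python
import numpy as np

# Made-up data for illustration: number of late payments (x)
# and whether the customer later defaulted (y).
late_payments = np.array([0, 1, 2, 3, 4, 5, 6, 7])
defaulted = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Fit a straight line y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(late_payments, defaulted, deg=1)


def linear_prediction(x):
    """Predict the 'probability' of default with the fitted straight line."""
    return slope * x + intercept


# A customer with no late payments gets a negative "probability",
# and a customer with 12 late payments gets one greater than 1.
print(linear_prediction(0))
print(linear_prediction(12))
```

Neither value can be interpreted as a probability, which is why the linear output needs to be transformed before it is used for classification.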
A better approach would be to model the probability of default using a sigmoid function.
The sigmoid function is a function that produces an s-shaped curve. It takes any real value as an argument and maps it to a range between 0 and 1 (exclusive). For the problem above, the sigmoid curve would look like this:
In logistic regression, the sigmoid function is used to map the predictions of the linear model to outcome probabilities (bounded between 0 and 1), which are easier to interpret for class membership.
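As a minimal sketch, the sigmoid function can be written directly in NumPy. The intercept and weight below (`b0 = -4.0`, `b1 = 1.0`) are hypothetical values chosen for illustration, not coefficients fitted to any data in this article.

```python
import numpy as np


def sigmoid(z):
    """Map any real value (or array of values) to the open interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical linear model: z = b0 + b1 * x, where x is the number of late payments.
b0, b1 = -4.0, 1.0
late_payments = np.array([0, 2, 4, 6, 8])

linear_output = b0 + b1 * late_payments   # unbounded real values
probabilities = sigmoid(linear_output)    # values strictly between 0 and 1

print(sigmoid(0))  # 0.5: a linear output of zero maps to a probability of one half
print(probabilities)
```

Larger linear outputs map to probabilities closer to 1, and smaller ones to probabilities closer to 0, which produces the s-shaped curve.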
We still have a problem, though. How do we map class membership probability to predicted class? We need a decision boundary to disambiguate between different probabilities.
A decision boundary is a threshold that we use to categorize the probabilities of logistic regression into discrete classes. A decision boundary could take the form:
y = 0 if predicted probability < 0.5
y = 1 if predicted probability ≥ 0.5
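A decision rule like this is a one-liner in code. The helper name `to_class` is made up for this sketch, and assigning a probability of exactly the threshold value to class 1 is just one possible convention.

```python
import numpy as np


def to_class(probabilities, threshold=0.5):
    """Convert predicted probabilities into 0/1 class labels."""
    return (np.asarray(probabilities) >= threshold).astype(int)


probs = np.array([0.12, 0.5, 0.87])

print(to_class(probs))        # [0 1 1]
print(to_class(probs, 0.9))   # [0 0 0]
```

Raising the threshold makes the model more conservative about predicting class 1, and lowering it has the opposite effect.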
Above, we presented the classical logistic regression, which predicts one of two classes. But based on the number and data type of the classes, there are different forms of logistic regression:
Irrespective of the type of logistic regression that we choose, training the logistic regression model follows a similar process in all cases.
The aim of training the logistic regression model is to figure out the best weights for our linear model within the logistic regression. In machine learning, we compute the optimal weights by optimizing the cost function.
The cost function J(Θ) is a formal representation of an objective that the algorithm is trying to achieve. In the case of logistic regression, the cost function is called LogLoss (or Cross-Entropy) and the goal is to minimize the following cost function equation:
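For reference, log loss over a training set is commonly written in the form below. The notation beyond J(Θ) is an assumption made here: m is the number of training examples, y⁽ⁱ⁾ is the true class (0 or 1) of example i, and h_Θ(x⁽ⁱ⁾) is the probability of class 1 that the model predicts for that example.

```latex
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m}
  \left[ y^{(i)} \log h_\Theta\!\left(x^{(i)}\right)
       + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\Theta\!\left(x^{(i)}\right)\right) \right]
```

Only one of the two terms inside the brackets is active for any given example: the first when the true class is 1, the second when it is 0.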
The mathematics might look a bit intimidating, but you do not need to compute the cost function by hand. Python machine learning libraries like Scikit-learn do the hard work for you, so you just need to understand the principles behind it:
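One principle is easy to check numerically: confident correct predictions give a low cost, and confident wrong predictions give a high one. The sketch below is a plain NumPy version of log loss written for illustration, with invented labels and probabilities. The small `eps` clipping is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np


def log_loss(y_true, y_prob, eps=1e-15):
    """Average log loss (binary cross-entropy) for 0/1 labels."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from exactly 0 and 1 so the logarithm stays finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    per_example = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(per_example))


y_true = [1, 0, 1, 0]
good_predictions = [0.9, 0.1, 0.8, 0.2]   # confident and correct
bad_predictions = [0.1, 0.9, 0.2, 0.8]    # confident and wrong

good_loss = log_loss(y_true, good_predictions)
bad_loss = log_loss(y_true, bad_predictions)

print(f"loss for good predictions: {good_loss:.3f}")
print(f"loss for bad predictions:  {bad_loss:.3f}")
```

In practice you would typically rely on the library implementation (for example, `sklearn.metrics.log_loss`) rather than writing your own.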
So, how do we achieve a low value for our cost function (that is, a model with good predictions)? We use gradient descent.
Gradient descent is a method of changing weights based on the loss function for each data point. We calculate the LogLoss cost function at each input-output data point.
We take the partial derivatives of the cost function with respect to each weight and the bias to get the slope of the cost function at the current point. (No need to brush up on linear algebra and calculus right now. Python libraries such as Scikit-learn have several matrix optimizations built in, which allow data science enthusiasts to unlock the power of advanced artificial intelligence without coding the answers themselves.)
Based on the slope, gradient descent updates the values for the bias and the set of weights, then reiterates the training loop over new values (moving a step closer to the desired goal).
This iterative approach is repeated until a minimum error is reached, and gradient descent cannot minimize the cost function any further.
We can change the speed at which we reach the optimal minimum by adjusting the learning rate. A high learning rate changes the weights more drastically, while a low learning rate changes them more slowly.
There is a trade-off in the size of the learning rate. Too low, and you might be waiting forever for your model to converge on the best set of weights; too high, and the updates can overshoot the minimum, so the model may never converge on the best set of weights.
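The training loop described above can be sketched from scratch in NumPy. This is an illustrative batch gradient descent under simplifying assumptions (a single feature, invented data, a fixed number of iterations instead of a convergence check), not how Scikit-learn trains its models internally. The function name `fit_gradient_descent` and its parameters are made up for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(y, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def fit_gradient_descent(X, y, learning_rate=0.1, n_iterations=1000):
    """Fit logistic regression weights with batch gradient descent."""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    history = []  # cost after each iteration, to inspect convergence

    for _ in range(n_iterations):
        predictions = sigmoid(X @ weights + bias)
        history.append(log_loss(y, predictions))

        # Partial derivatives of the log loss with respect to weights and bias.
        error = predictions - y
        grad_weights = X.T @ error / n_samples
        grad_bias = error.mean()

        # Step against the slope, scaled by the learning rate.
        weights -= learning_rate * grad_weights
        bias -= learning_rate * grad_bias

    return weights, bias, history


# Invented data: number of late payments and default outcome (not perfectly separable).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])

_, _, history_fast = fit_gradient_descent(X, y, learning_rate=0.1)
_, _, history_slow = fit_gradient_descent(X, y, learning_rate=0.001)

print(f"starting cost:                  {history_fast[0]:.3f}")
print(f"final cost, learning rate 0.1:   {history_fast[-1]:.3f}")
print(f"final cost, learning rate 0.001: {history_slow[-1]:.3f}")
```

With the same number of iterations, the smaller learning rate ends at a higher cost because each step moves the weights only slightly.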
There are two main metrics for evaluating how well our model functions after we’ve trained it:
P.S. We are assuming that you’ve trained and evaluated your model correctly. In other words, you need to make sure that you’ve trained the model on the training dataset and computed the evaluation metrics on the test dataset, so that an overfitted model does not look better than it really is.
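A minimal sketch of this train-then-evaluate discipline with Scikit-learn is shown below. The data is synthetic (generated with `make_classification`), and accuracy and the confusion matrix are used here simply as two common choices of evaluation metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Hold out 20% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate only on the held-out test set.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"test accuracy: {test_accuracy:.2f}")
print("confusion matrix (rows = true class, columns = predicted class):")
print(cm)
```

A large gap between training and test performance is a typical sign of overfitting.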
There are multiple methods that can be used to improve your logistic regression model.
The greatest improvements are usually achieved with a proper data cleaning process. Logistic regression uses a linear model, so it suffers from the same issues that linear regression does. To properly prepare the data for logistic regression modeling, you need to:
Logistic regression has additional assumptions and needs for cleaning:
Feature values can differ by orders of magnitude. For instance, loan size is in the tens of thousands ($50,000), while “number of months late” is in single digits (0, 1, 2, …).
Features on different scales converge more slowly (or not at all) with gradient descent.
Normalize or standardize your features to speed up and improve model training.
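One common way to do this in Scikit-learn is to put a `StandardScaler` in front of the model in a pipeline, so that the same scaling learned from the training data is applied at prediction time. The loan sizes and late-month counts below are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data. Columns: loan size in dollars, number of months late.
X = np.array([
    [50000, 0],
    [42000, 3],
    [15000, 1],
    [60000, 5],
    [20000, 0],
    [55000, 4],
], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1])

# StandardScaler rescales each feature to zero mean and unit variance
# before the data reaches the logistic regression step.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X, y)

# Inspect the scaled features: both columns are now on a comparable scale.
scaled = pipeline.named_steps["standardscaler"].transform(X)
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

New customers can be passed to `pipeline.predict_proba` in their original units, and the pipeline applies the scaling for you.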
Regularization is particularly useful in settings with multiple features (or independent variables). Regularization takes a complex model (with multiple predictors) and either sets some of the weights to zero (L1 regularization), which effectively removes those predictors from the linear equation, or shrinks the weights towards zero (L2 regularization), which makes the features less impactful on the final logistic regression equation.
Both of these approaches work great when you have an overly complex model which overfits.
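The difference between the two penalties can be seen in a short sketch on synthetic data, where only 3 of 20 features carry signal. The regularization strength `C=0.05` is an arbitrary value chosen for this illustration (in Scikit-learn, a smaller `C` means stronger regularization), and the `liblinear` solver is used because it supports both penalties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, of which only 3 are informative.
X, y = make_classification(
    n_samples=300,
    n_features=20,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)
X = StandardScaler().fit_transform(X)

# Same regularization strength, different penalty type.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X, y)

# Count weights that were driven to exactly zero.
n_zero_l1 = int((l1_model.coef_ == 0).sum())
n_zero_l2 = int((l2_model.coef_ == 0).sum())

print(f"weights set to zero with L1: {n_zero_l1} of {l1_model.coef_.size}")
print(f"weights set to zero with L2: {n_zero_l2} of {l2_model.coef_.size}")
```

L1 tends to zero out uninformative features entirely, while L2 keeps every feature but with smaller weights.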
The way in which you use logistic regression in practice depends on how much you know about the entire data science process.
We recommend that beginners start by modeling on datasets that have already been collected and cleaned, while experienced data scientists can scale their operations by choosing the right software for the task at hand.
There are over 45 different datasets that allow you to practice logistic regression for yourself. Among the best ones are:
Production data science means spending more than 80% of your time on data collection and cleaning. If you want to speed up the entire data pipeline, use software that automates tasks to give you more time for data modeling.
Keboola offers a platform for data scientists who want to build their own machine learning models. It comes with one-click deployed Jupyter Notebooks, through which all of the modeling can be done via Julia, R, or Python.
Deep dive into the data science process with this Jupyter Notebook:
Want to take things a step further? Keboola can assist you with instrumentalizing your entire data operations pipeline.
Being a data-centric platform, Keboola also allows you to build your ETL pipelines and orchestrate tasks to get your data ready for machine learning algorithms. You can deploy multiple models with different algorithms to version your work and determine which ones perform best. Start building models today with our free trial.