The Ultimate Guide to Logistic Regression for Machine Learning

Logistic regression, alongside linear regression, is one of the most widely used machine learning algorithms in real production settings. Here, we present a comprehensive analysis of logistic regression, which can be used as a guide for beginners and advanced data scientists alike.

Logistic regression is an extremely popular artificial intelligence approach that is used for classification tasks. It is widely adopted in real-life machine learning production settings.

Logistic regression is a machine learning algorithm used to predict the probability that an observation belongs to one of two possible classes.

*What does that mean in practice?*

We could use the logistic regression algorithm to do the following:

- Build an email classifier to tell us whether an incoming email should be marked as “spam” or “not spam”.
- Check radiological images to predict whether a tumor is benign or malignant.
- Pore through historic bank records to predict whether a customer will default on their loan repayments or repay the loan.

*How does logistic regression make predictions?*

We train the model by feeding it input data and a binary class to which this data belongs.

For example, we would input the email subject line (“A Nigerian prince needs your help”) into the model with the accompanying class (“spam”). The model learns the patterns between the incoming data and the desired output as a mapping (aka, when input is “x”, predict “y”).

The trained logistic regression model can then be applied to new input data that it did not see during training.

*Let’s look at a concrete example*.

Imagine that you’re tasked with predicting whether or not a client of your bank will default on their loan repayments. The first thing to do is construct a dataset of historic client defaults. The data would contain client demographic information (e.g. age, gender, location, etc.), their financial information (loan size, times that payment was overdue, etc.), and whether they ended up defaulting on a loan or repaying it.

The “Yes” and “No” categories can be recoded into 1 and 0 for the target variable (computers deal better with numbers than words):
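
A minimal sketch of this recoding step in Python. The `defaulted_labels` list is made-up illustrative data, not the dataset described above.

```python
# Hypothetical raw labels for the "defaulted" column (illustrative only).
defaulted_labels = ["Yes", "No", "No", "Yes"]

# Map each category to the numeric class used as the target variable.
label_to_class = {"Yes": 1, "No": 0}
defaulted = [label_to_class[label] for label in defaulted_labels]

print(defaulted)  # [1, 0, 0, 1]
```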

After this, we would train a logistic regression model, which would learn a mapping between the input variables (age, gender, loan size) and the expected output (defaulted). We could use the logistic regression model to predict the default probability on three new customers:

So, what does the new column *Predicted default* tell us? It states the probability of each of the new customers belonging to class 1 (defaulted on loan). We could come up with a threshold value (let’s say 0.5) and anything above that decision threshold would be default behavior (i.e. Customer 5 would be predicted to default on their loan payments, while Customers 4 and 6 would be predicted to repay them).

Business applications for logistic regression involve predicting future membership to a certain category. Logistic regression is extremely popular, so it has been used in a wide variety of business settings:

- **Qualify leads**. Logistic regression has been used to segment users into distinct categories for business intelligence. For example, it allows you to predict which of your users will convert from a freemium user to a paid subscriber (or from a lead to a customer). You can use this prediction to streamline your sales operations, shorten lead nurturing time, and focus on the prospects who have a higher likelihood of converting.
- **Recommend products**. For digital businesses in particular, the algorithm can be used for recommendations. For instance, you could use it to calculate the likelihood of a customer purchasing a specific product (from 0 to 1), and if the likelihood is greater than a certain threshold (for example, 80%), recommend the product for upselling on your e-commerce site. This can increase your overall sales because you are recommending products which are more likely to be purchased.
- **Anticipate rare customer behavior**. Loan defaulting or churning are rare behaviors, which are difficult to predict due to their low incidence. Logistic regression gives you a probability for each customer, which helps you flag those most at risk. You can use this information to get ahead of the competition and prevent rare negative events from affecting your bottom line.

The machine learning model is favored in real-life production settings for several reasons:

- **Ease of use**. Training the model and using it for predictions is very simple, and it does not require a lot of engineering overhead for maintenance.
- **Interpretability**. Unlike deep learning models (neural networks), logistic regression is straightforward to interpret. Although it is not as interpretable as linear regression, logistic regression can help us to assess which input variable is responsible for the greatest change in predicted value.
- **Scalability**. The algorithm is extremely efficient. Fast training times combined with low computational requirements make logistic regression easy to scale, even when the data volume and speed increase.
- **Real-time predictions**. Because of the ease of computation, logistic regression can be used in online settings, meaning that the model can be retrained with each new example and generate predictions in near real-time. This contrasts with computationally heavy approaches, such as neural networks or support vector machines, which require a lot of computing resources or long waiting times while the model is retrained on new data. Ultimately, this makes them unsuitable for real-time applications (or at least very expensive).

The benefits of logistic regression from an engineering perspective make it more favorable than other, more advanced machine learning algorithms.

Logistic regression is a supervised machine learning classification algorithm. Let’s break it down a little:

- Supervised machine learning: supervised learning techniques train the model by providing it with pairs of input-output examples from which it can learn. For example, the logistic regression would learn from a specific example to associate three missed loan repayments with future default (class membership = 1).
- Classification algorithm: the purpose of the machine learning model is to classify examples into distinct (binary) classes. For instance, defaulting on vs. repaying loans, email classification as spam or not spam, or a computer vision algorithm to predict whether the picture contains a dog or a ‘non-dog’ animal.

Logistic regression is just one of the many classification algorithms. There are several other classification techniques that we have at our disposal when predicting class membership:

- Support Vector Machines (SVM)
- Classification decision trees
- Random forest classification
- K-nearest neighbors (k-NN)
- Naive Bayes classifier

As well as being a machine learning model, logistic regression is a well-established and widely used statistical model. Although we will be focusing on the machine learning side of things, we will also draw some parallels to its statistical background to provide you with a complete picture. No need to worry, though - you won’t need to brush up on calculus or linear algebra to follow along!

Once trained, the model takes the form of a logistic regression equation:

$$y = \frac{1}{1 + e^{-z}}, \qquad z = w_0 + w_1 x$$

In this equation:

- y is the predicted probability of belonging to the default class. In binary classification, we mark the default class with 1 and the other class with 0. y states the probability of an example belonging to the default class on a scale from 0 to 1 (exclusive). So y=0.99 would mean that the model predicts the example belonging to class 1. During training, y is also called the *target variable* in machine learning, or the *dependent variable* in statistical modeling. It represents the categorical value that we are trying to predict.
- 1/(1+e^-z) is the sigmoid function.
- w0 + w1x is the linear model within logistic regression.

Let’s break down the entire model into the linear model and the accompanying sigmoid function in order to understand how logistic regression predicts probabilities of an example belonging to the default class.

The linear model is part of the logistic regression. It represents a linear relationship between the input features and the predicted output. The linear part of the entire model can be summarized with the equation:

$$z = w_0 + w_1 x$$

What does each component mean here?

- x is the input variable. In statistics, x is referred to as an *independent variable*, while machine learning calls it a *feature*. It is the information that is given to us at any time, both during training and predictions.
- w0 is the *bias term*.
- w1 is the *weight* for the input variable x.
- In machine learning, we call wi *weights* in general.

So, why wouldn’t we just use the linear model to make predictions about class membership, as we did with linear regression? Let’s look at an example.

Imagine that we have the following table for the number of late payments made by a customer (x) and whether the customer later defaulted on their loan (y).

We could model the data with a linear regression in the following way:

There are a couple of problems here:

- Linear regression predicts probabilities outside of the 0-1 range (so someone can have a -140% probability of default, which does not make sense).
- For a certain number of late payments (two in this example), it is unclear whether we should categorize them under non-defaulting or defaulting behavior.
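
The first problem can be reproduced with a few lines of Python. This is only a sketch: the `late_payments` and `defaulted` lists are hypothetical values invented for illustration, and `fit_line` is a helper written here to fit a one-feature least-squares line.

```python
# Hypothetical data: number of late payments (x) and default outcome (y).
late_payments = [0, 1, 2, 3, 4, 5, 6, 7]
defaulted = [0, 0, 0, 0, 1, 1, 1, 1]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


w0, w1 = fit_line(late_payments, defaulted)

# The straight line yields a "probability" below 0 at x=0 and above 1 at x=7.
for x in (0, 7):
    print(x, round(w0 + w1 * x, 2))
```

With this toy data the fitted line dips below 0 for customers with no late payments and rises above 1 for customers with seven, so its output cannot be read as a probability.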

A better approach would be to model the probability of default using a sigmoid function.

The sigmoid function is a function that produces an s-shaped curve. It takes any real value as an argument and maps it to a range between 0 and 1 (exclusive). For the problem above, the sigmoid curve would look like this:

In machine learning, the sigmoid function is used in logistic regression to map the linear model’s predictions to outcome probabilities (bounded between 0 and 1), which are easier to interpret for class membership.
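
The mapping can be sketched in plain Python as follows. The weights `w0` and `w1` are hypothetical values chosen only to illustrate the shape of the curve, and `predict_probability` is a helper name introduced here.

```python
import math


def sigmoid(z):
    """Map any real number z to a value between 0 and 1."""
    # Branch on the sign of z so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_probability(x, w0, w1):
    """Logistic regression with one feature: sigmoid of the linear model."""
    return sigmoid(w0 + w1 * x)


# Hypothetical weights, chosen only for illustration.
w0, w1 = -4.0, 2.0
for late_payments in (0, 2, 5):
    print(late_payments, round(predict_probability(late_payments, w0, w1), 3))
```

However large or small the linear score becomes, the output stays between 0 and 1, and a linear score of exactly 0 maps to a probability of 0.5.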

We still have a problem, though. How do we map class membership probability to predicted class? We need a decision boundary to disambiguate between different probabilities.

A decision boundary is a threshold that we use to categorize the probabilities of logistic regression into discrete classes. A decision boundary could take the form:

y = 0 if predicted probability < 0.5

y = 1 if predicted probability ≥ 0.5
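
As a small sketch, a decision boundary of this kind might be applied as below. The probabilities in `predicted` are hypothetical, and treating a probability exactly equal to the threshold as class 1 is an assumption of this example.

```python
def classify(probability, threshold=0.5):
    """Return class 1 when the predicted probability reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0


# Hypothetical predicted default probabilities for three customers.
predicted = {"Customer 4": 0.12, "Customer 5": 0.87, "Customer 6": 0.31}
classes = {name: classify(p) for name, p in predicted.items()}
print(classes)  # {'Customer 4': 0, 'Customer 5': 1, 'Customer 6': 0}
```

The threshold does not have to be 0.5. Raising it makes the model more conservative about predicting class 1, and lowering it does the opposite.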

Above, we presented the classical logistic regression, which predicts one of two classes. But based on the number and data type of the classes, there are different forms of logistic regression:

- Binary logistic regression. The target variable takes one of two possible categorical values. For example, spam vs. not spam, 0 vs. 1, dog vs. not dog, etc.
- Multinomial logistic regression. The target variable takes one of three or more possible categorical values. For example, vote Republican vs. vote Democratic vs. No vote, or “buy product A” vs. “try product A” vs. “not buy or try product A”. We can train this type of logistic regression much as we would a binary classifier, using a method called ‘one vs. all’. We choose a target class (let’s say A) and calculate the probability of A versus all of the other classes (B and C and…). We repeat the method for each class and predict the class with the highest probability.
- Ordinal logistic regression. This is similar to multinomial logistic regression, except that the target categories are ordered (for example, the medal won at the Olympics).
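
The ‘one vs. all’ idea can be sketched as follows, assuming that one binary model per class has already been trained. The class names and the (bias, weight) pairs in `one_vs_all_models` are hypothetical values for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical (bias, weight) pairs: one binary "this class vs. the rest" model per class.
one_vs_all_models = {
    "buy": (-3.0, 1.5),
    "try": (0.5, -0.2),
    "neither": (1.0, -1.0),
}


def predict_class(x, models):
    """Score x with every binary model and return the most probable class."""
    scores = {label: sigmoid(w0 + w1 * x) for label, (w0, w1) in models.items()}
    return max(scores, key=scores.get), scores


label, scores = predict_class(4.0, one_vs_all_models)
print(label)  # buy
```

Note that the per-class scores come from independent binary models, so they do not need to sum to 1.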

Irrespective of the type of logistic regression that we choose, training the logistic regression model follows a similar process in all cases.

The aim of training the logistic regression model is to figure out the best weights for our linear model within the logistic regression. In machine learning, we compute the optimal weights by optimizing the cost function.

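The cost function J(Θ) is a formal representation of the objective that the algorithm is trying to achieve, where Θ denotes the model’s weights. In the case of logistic regression, the cost function is called LogLoss (or Cross-Entropy) and the goal is to minimize the following cost function equation:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

Here m is the number of training examples, y_i is the actual class (0 or 1) of example i, and p_i is the probability of class 1 that the model predicts for that example.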

The mathematics might look a bit intimidating, but you do not need to compute the cost function by hand. Python machine learning libraries like Scikit-learn do the hard work for you, so you just need to understand the principles behind it:

- The cost function measures the average error between actual class membership and predicted class membership. The size of this error depends on the specific selection of weights within our linear model.
- The cost function penalizes confident mistakes most heavily: a predicted probability close to 0 or 1 for an example that actually belongs to the other class produces a very large loss. This discourages the model from making overconfident predictions.
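
To make these two principles concrete, here is a minimal sketch of LogLoss in plain Python. The `eps` clipping parameter is an assumption added here to avoid taking the logarithm of 0, and the example probabilities are made up.

```python
import math


def log_loss(actual, predicted, eps=1e-15):
    """Average cross-entropy between actual classes (0/1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(actual, predicted):
        p = min(max(p, eps), 1 - eps)  # keep p away from exactly 0 or 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(actual)


actual = [1, 0, 1]
good = log_loss(actual, [0.9, 0.1, 0.8])
# Same predictions, except the last example is confidently wrong.
confident_wrong = log_loss(actual, [0.9, 0.1, 0.01])
print(round(good, 3), round(confident_wrong, 3))
```

The single confident mistake in the second call raises the average loss by roughly an order of magnitude, which is the behavior described in the second bullet.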

So, how do we achieve a low value for our cost function (aka, a model with good predictions)? We use gradient descent.

Gradient descent is a method of changing weights based on the loss function for each data point. We calculate the LogLoss cost function at each input-output data point.

We then take the partial derivative of the cost function with respect to each weight and the bias to get the slope of the cost function at each point. (No need to brush up on linear algebra and calculus right now. Python libraries such as Scikit-learn have several matrix optimizations built in, which allow data science enthusiasts to unlock the power of advanced artificial intelligence without coding the answers themselves.)

Based on the slope, gradient descent updates the values for the bias and the set of weights, then reiterates the training loop over new values (moving a step closer to the desired goal).

This iterative approach is repeated until a minimum error is reached, and gradient descent cannot minimize the cost function any further.

We can change the speed at which we reach the optimal minimum by adjusting the learning rate. A high learning rate changes the weights more drastically, while a low learning rate changes them more slowly.

There is a trade-off in the size of the learning rate. Too low, and you might be waiting forever for your model to converge on the best set of weights; too high, and you risk missing the best set of weights because the model would not converge.
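
The training loop described above can be sketched in plain Python for a single feature. This is an illustrative implementation rather than what any particular library does internally; the data, the default `learning_rate` and the number of `epochs` are assumptions chosen for the example. For LogLoss, the partial derivatives reduce to the average of (p - y) for the bias and the average of (p - y)·x for the weight, where p is the predicted probability.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit w0 (bias) and w1 (weight) by batch gradient descent on LogLoss."""
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error (p - y) for every example under the current weights.
        errors = [sigmoid(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        # Gradient of the average LogLoss with respect to w0 and w1.
        grad_w0 = sum(errors) / n
        grad_w1 = sum(e * x for e, x in zip(errors, xs)) / n
        # Step against the gradient; the learning rate sets the step size.
        w0 -= learning_rate * grad_w0
        w1 -= learning_rate * grad_w1
    return w0, w1


# Hypothetical data: late payments (x) and whether the customer defaulted (y).
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w0, w1 = train(xs, ys)
print(round(sigmoid(w0 + w1 * 0), 3), round(sigmoid(w0 + w1 * 7), 3))
```

Rerunning `train` with a much smaller `learning_rate` and the same number of epochs leaves the weights further from their optimum, which illustrates the trade-off described above.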

There are two main metrics for evaluating how well our model functions after we’ve trained it:

- **Accuracy**. Represents the percentage of correctly classified samples. An accuracy score of 90% would tell us that our logistic regression model correctly classified 90% of all examples.
- **ROC AUC**. Area Under the Receiver Operating Characteristic Curve (ROC AUC) summarizes the relationship between the true positive rate (TPR), that is, the share of actual positives that we correctly predicted as positive, and the false positive rate (FPR), that is, the share of actual negatives that we incorrectly predicted as positive, across all decision thresholds. ROC AUC is preferable to accuracy, especially in multiclass prediction settings or when we have a class imbalance problem.
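
Both metrics can be computed by hand, as in the sketch below. The `actual` and `probs` lists are made-up values, and `roc_auc` uses the pairwise interpretation of the metric: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, which equals the area under the ROC curve.

```python
def accuracy(actual, predicted_probs, threshold=0.5):
    """Share of examples whose thresholded prediction matches the actual class."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(actual, predicted_probs))
    return hits / len(actual)


def roc_auc(actual, predicted_probs):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win.
    """
    positives = [p for y, p in zip(actual, predicted_probs) if y == 1]
    negatives = [p for y, p in zip(actual, predicted_probs) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical actual classes and predicted probabilities.
actual = [0, 0, 1, 1, 0, 1]
probs = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
print(round(accuracy(actual, probs), 3), round(roc_auc(actual, probs), 3))
```

Unlike accuracy, ROC AUC does not depend on a single threshold, which is one reason it is often more informative when the classes are imbalanced.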

P. S. We are making the assumption that you’ve trained and evaluated your model correctly. In other words, you need to make sure that you’ve trained the model on the training dataset and built evaluation metrics on the test dataset to avoid overfitting.

There are multiple methods that can be used to improve your logistic regression model.

The greatest improvements are usually achieved with a proper data cleaning process. Logistic regression uses a linear model, so it suffers from the same issues that linear regression does. To properly prepare the data for logistic regression modeling, you need to:

- **Remove outliers**. Outliers can skew the fitted weights and make the model perform worse.
- **Remove multicollinearity**. Logistic regression assumes that the predictor variables (features) are not highly correlated with one another. Check their pairwise correlations and remove variables that are highly correlated with others.
- **Assert linear assumption**. Logistic regression assumes a linear relationship between the independent variables and the log-odds of the target variable. If that does not hold, transform the variables (for example, with a log transform) to bring the relationship closer to linear.
- **Check the distribution**. Logistic regression does not strictly require the independent variables to follow a Gaussian distribution, but heavily skewed variables can make the model harder to fit. Consider a log or Box-Cox transform for variables that are far from normally distributed.
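
A pairwise correlation check of the kind mentioned for multicollinearity might look like the sketch below. The feature values are hypothetical, and the 0.9 cut-off in `THRESHOLD` is an assumed rule of thumb rather than a fixed standard.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical features; loan_size_k is nearly a multiple of income_k.
features = {
    "age": [52, 31, 45, 38, 27],
    "income_k": [30, 60, 45, 80, 50],
    "loan_size_k": [61, 119, 92, 160, 99],
}

THRESHOLD = 0.9  # assumed cut-off for "highly correlated"
names = list(features)
flagged = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(pearson(features[a], features[b])) > THRESHOLD
]
print(flagged)  # [('income_k', 'loan_size_k')]
```

For each flagged pair, one of the two variables would typically be dropped before training.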

Logistic regression has additional assumptions and needs for cleaning:

- **Binary output variable**. Transform your output variable into 0 or 1.
- **Failure to converge**. The maximum likelihood estimation model (the ‘maths’) behind logistic regression assumes that no single variable will perfectly predict class membership. In the event that you have a feature that perfectly predicts the target class, the algorithm will try to assign it infinite weights (because it is so important) and thus will fail to converge to a solution. If you have a perfect predictor, simply remove it from the feature set... or just don’t model your data at all. At the end of the day, you do not need a machine learning model if you have a perfect predictor.

Feature values can differ by orders of magnitude. For instance, loan size is in the tens of thousands ($50,000), while “number of months late” is in single digits (0, 1, 2, …).

Features on very different scales make gradient descent converge more slowly (or not at all).

Normalize and standardize your features to speed up and improve model training.
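
A minimal sketch of standardization (z-scores) in plain Python follows. The `loan_size` and `months_late` values are hypothetical. In practice, the mean and standard deviation would be computed on the training set only and then reused for new data.

```python
import math


def standardize(values):
    """Rescale values to zero mean and unit standard deviation (z-scores)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


loan_size = [50_000, 20_000, 35_000, 80_000]  # hypothetical, in dollars
months_late = [0, 1, 2, 5]                    # hypothetical

scaled_loan = standardize(loan_size)
scaled_late = standardize(months_late)
print([round(v, 2) for v in scaled_loan])
print([round(v, 2) for v in scaled_late])
```

After scaling, both features have a mean of 0 and a standard deviation of 1, so neither dominates the gradient simply because of its units.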

Regularization is particularly useful in settings with multiple features (or independent variables). It takes a complex model (with multiple predictors) and either sets some of the weights to zero (L1 regularization), which effectively removes those predictors from the linear equation, or shrinks the weights towards zero (L2 regularization), which makes the affected features less impactful on the final logistic regression equation.

Both of these approaches work great when you have an overly complex model which overfits.
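
As a rough sketch, the two penalties can be written as extra terms added to the cost function. The function names and the `strength` parameter below are hypothetical, introduced only for illustration, and the bias term is assumed to be left out of the penalty, as is common practice.

```python
def l1_penalty(weights, strength):
    """L1 term: strength times the sum of absolute weights (bias excluded)."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """L2 term: strength times the sum of squared weights (bias excluded)."""
    return strength * sum(w * w for w in weights)


def l2_gradient_step(weights, gradients, learning_rate, strength):
    """One gradient descent update on the cost function plus the L2 penalty."""
    return [
        w - learning_rate * (g + 2 * strength * w)
        for w, g in zip(weights, gradients)
    ]


weights = [2.0, -0.5, 0.0]
# With a zero data gradient, the L2 term alone shrinks every weight towards zero.
shrunk = l2_gradient_step(weights, [0.0, 0.0, 0.0], learning_rate=0.1, strength=0.5)
print(shrunk)
```

The larger the `strength`, the harder the weights are pulled towards zero, so the value is usually tuned on held-out data.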

The way in which you use logistic regression in practice depends on how much you know about the entire data science process.

We recommend that beginners start by modeling on datasets that have already been collected and cleaned, while experienced data scientists can scale their operations by choosing the right software for the task at hand.

There are over 45 different datasets that allow you to practice logistic regression for yourself. Among the best ones are:

- The classic Titanic survival dataset. Predict whether a passenger or a crew member would have survived the Titanic’s collision with the iceberg.
- Predict whether a telecommunications customer will churn.
- Analyze which marketing approaches and demographic information can be used to predict whether a bank client will subscribe to a Portuguese bank’s term deposit.
- Model the probability of an employee leaving their company.

Production data science means spending more than 80% of your time on data collection and cleaning. If you want to speed up the entire data pipeline, use software that automates tasks to give you more time for data modeling.

Keboola offers a platform for data scientists who want to build their own machine learning models. It comes with one-click deployed Jupyter Notebooks, through which all of the modeling can be done via Julia, R, or Python.

Deep dive into the data science process with this Jupyter Notebook:

- Collect the relevant data.
- Explore and clean the data to discover patterns.
- Train your logistic regression model.
- Evaluate the model with a variety of metrics.

Want to take things a step further? Keboola can assist you with operationalizing your entire data operations pipeline.

Being a data-centric platform, Keboola also allows you to build your ETL pipelines and orchestrate tasks to get your data ready for machine learning algorithms. You can deploy multiple models with different algorithms to version your work and determine which ones perform best. Start building models today with our free trial.