
The Ultimate Guide to Linear Regression for Machine Learning

Linear regression, alongside logistic regression, is one of the most widely used machine learning algorithms in real production settings. Here, we present a comprehensive analysis of linear regression, which can be used as a guide by beginners and advanced data scientists alike.

Thanks to its simplicity, interpretability and speed, linear regression is one of the most widely used artificial intelligence algorithms in real-life machine learning problems.

You can think of linear regression as the answer to the question “How can I use X to predict Y?”, where X is some information that you have and Y is some information that you want to know.

Let’s look at a concrete example. You might be wondering how much you can sell your house for. You have information about your house, for instance, the number of bedrooms is 2 - this is your X. And you want to know how much your estate could be worth on the market. This is Y - the price in $ that you can sell your house for.

Linear regression creates an equation in which you input your given numbers (X) and it outputs the target variable that you want to find out (Y).

We obtain the equation by training it on pairs of (X, Y) values. In this case, we would use a dataset containing historic records of house purchases in the form of (“number of bedrooms”, “selling price”):

We would then visualize the data points on a scatter plot to see if there are any trends. A scatter plot is a two-dimensional plot, with each data point representing a house. On the x-axis, we would have values for “Number of bedrooms”, while on the y-axis, we would have the “Selling price” for the same houses:

Looking at the scatter plot, it seems that there is a trend: the more bedrooms that a house has, the higher its selling price (which is not surprising, to be honest). Now, let’s say that we trained a linear regression model to get an equation in the form:

*Selling price = $77,143 × (Number of bedrooms) - $74,286*

The equation acts as a prediction. If you input the number of bedrooms, you get the predicted value for the price at which the house is sold. For the specific example above:

*Your selling price = $77,143 × 2 bedrooms - $74,286 = $80,000*

In other words, you could sell your 2-bedroom house for approximately $80,000. But linear regression does more than just that. We can also visualize it graphically to see what the price would be for houses with a different number of bedrooms:
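The same equation can also be evaluated in code for several bedroom counts. This is only a minimal sketch: the slope and intercept are the ones quoted in the equation above, while the function name `predict_price` is made up for illustration.

```python
def predict_price(bedrooms: int) -> int:
    """Apply the fitted equation from the text: price = 77,143 * bedrooms - 74,286."""
    slope = 77_143        # dollars per additional bedroom (from the equation above)
    intercept = -74_286   # dollars (from the equation above)
    return slope * bedrooms + intercept


# Evaluate the same straight line for houses with 1 to 5 bedrooms.
for bedrooms in range(1, 6):
    print(f"{bedrooms} bedroom(s): ${predict_price(bedrooms):,}")
```

For 2 bedrooms this reproduces the $80,000 figure from the worked example.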

This is because linear regression tries to find a straight line that best fits the data. Linear regression is not limited to real-estate problems: it can also be applied to a variety of business use cases.

Linear regression is used for a wide array of business prediction problems:

- **Predict future prices/costs**. If your business is buying items or services (e.g. raw materials expenses, stock prices, labor costs, etc.), you can use linear regression to predict what the prices of these items are going to be in the future.
- **Predict future revenue**. You can use linear regression to model your advertising data, discover the relation between advertising data and your revenue, and predict how much revenue your business will generate depending on how much you spend on ads in a given month.
- **Compare performance**. You just launched a new product line, but it’s unclear whether it is attracting more (and higher-spending) customers than your existing ones. Use linear regression to determine how your new product compares to the ones that you already have.

Linear regression is extremely useful in answering hard business questions, but there are other reasons why it is one of the most used machine learning algorithms...

When it comes to production data science settings, linear regression is the popular choice due to its many benefits:

- **Ease of use**. The model is simple to implement computationally. It does not require a lot of engineering overhead, either before launch or during maintenance.
- **Interpretability**. Unlike deep learning models (neural networks), linear regression is straightforward to interpret. This positions the algorithm ahead of black-box models, which do not explain which input variable causes the output variable to change.
- **Scalability**. The algorithm is not computationally heavy, which means that linear regression is perfect for use cases where scaling is expected. It scales well with increases in data volume (big data) and data velocity too.
- **Performs well in online settings**. Because of the ease of computation, linear regression can be used in online settings, meaning that the model can be retrained with each new example and generate predictions in near real-time. This contrasts with computationally heavy approaches like neural networks or support vector machines: these require a lot of computing resources or lengthy waiting times to retrain on new data, making them unsuitable for real-time applications (or at least very expensive).

These specific features explain why linear regression is one of the best models for making predictions using machine learning.

Linear regression is a supervised machine-learning regression algorithm. That’s a mouthful! Let’s break it down:

- Supervised machine learning: supervised learning techniques train the model by providing it with pairs of input-output examples from which it can learn. For example, we can say that when the input was $10,000 in marketing spend, we got the output (target) of $15,000 in revenue.
- Regression: The ‘regression’ part means that the model is suited to regression problems, aka, predicting continuous or quantitative values. An example could be predicting the number of new customers that a product will bring (43) or how much revenue will be generated next month ($34,000). This differs from classification algorithms which, instead of predicting continuous outputs, predict whether an input belongs to a class (e.g. the classification problem in which you need to predict whether the picture in front of you belongs to the class “dog” or the class “cat”).
- Algorithm: think of it as a recipe. The result of training the linear regression model on training data is an equation (recipe), which can be applied to new (previously unseen) data. It’s a bit like applying a cooking recipe to a fresh batch of ingredients!

Keep in mind that linear regression is just one of the many regression techniques that we have at our disposal. There are several types of these techniques in the field of predictive modeling:

- Simple and multiple linear regression
- Polynomial regression
- Ridge regression and Lasso regression (upgrades to linear regression)
- Decision trees regression
- Support Vector Machines (SVM)

Linear regression is such a useful and established algorithm that it is both a statistical model and a machine learning model. Here, we will focus mainly on the machine learning side, but we will also draw some parallels to statistics in order to paint a complete picture.

Once trained, the model takes the form of a linear regression equation of this type:

*y = w0 + w1x*

In this equation:

- y is the output variable. It is also called the *target variable* in machine learning, or the *dependent variable* in statistical modeling. It represents the continuous value that we are trying to predict.
- x is the input variable. In machine learning, x is referred to as the *feature*, while in statistics, it is called the *independent variable*. It represents the information given to us at any given time.
- w0 is the *bias term* or y-axis intercept.
- w1 is the *regression coefficient* or scale factor. In classical statistics, it is the equivalent of the slope on the best-fit straight line that is produced after the linear regression model has been fitted.
- wi are called *weights* in general.

The goal of the regression analysis (modeling) is to find the values for the unknown parameters of the equation; that is, to find the values for the weights w0 and w1.
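For a single feature, the weights that minimize the squared error can be written down directly. The sketch below uses the standard closed-form least-squares formulas (slope = covariance of x and y divided by the variance of x) rather than an iterative procedure. The data points and the function name `fit_simple_linear_regression` are hypothetical and chosen only so that the result is easy to check by hand.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (w0, w1) that minimize the sum of squared residuals for y = w0 + w1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    w1 = s_xy / s_xx
    # Intercept: the best-fit line passes through the point (mean_x, mean_y).
    w0 = mean_y - w1 * mean_x
    return w0, w1


# Hypothetical data lying exactly on the line y = 1 + 2x (not taken from the article).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
w0, w1 = fit_simple_linear_regression(xs, ys)
print(f"bias w0 = {w0}, weight w1 = {w1}")
```

Because the sample points sit exactly on a line, the fit recovers that line's intercept and slope.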

Both simple and multiple linear regressions assume that there is a linear relationship between the input variable(s) and the output target variable.

The main difference is the number of independent variables that they take as inputs. Simple linear regression just takes a single feature, while multiple linear regression takes multiple x values. The above formula can be rewritten for a model with n input variables as:

*y = w0 + w1·x1 + w2·x2 + ... + wn·xn*

Where xi is the i-th feature with its own weight wi.
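In code, a multiple regression prediction is simply the bias plus a weighted sum of the features. The following sketch assumes made-up weights for two hypothetical features (number of bedrooms and area in square feet). None of these numbers come from a trained model.

```python
def predict(bias, weights, features):
    """Compute y = w0 + w1*x1 + ... + wn*xn for one observation."""
    if len(weights) != len(features):
        raise ValueError("each feature needs exactly one weight")
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical parameters, for illustration only.
bias = 50_000                # w0
weights = [30_000, 100]      # w1 (per bedroom), w2 (per square foot)
house = [2, 1_000]           # x1 = 2 bedrooms, x2 = 1,000 square feet

print(predict(bias, weights, house))  # 50,000 + 60,000 + 100,000 = 210000
```

With a single weight and a single feature, the same function reduces to the simple linear regression equation.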

The simple linear regression model can be represented graphically as a best-fit line between the data points, while the multiple linear regression model can be represented as a plane (with two input variables) or a hyperplane (with more than two).

Despite their differences, both the simple and multiple regression models are linear models - they adopt the form of a *linear* equation. This is called the linear assumption. Quite simply, it means that we assume that the relationship between the set of independent variables (features) and the dependent variable (target) is linear.

We train the linear regression algorithm with a method named Ordinary Least Squares (or just Least Squares). The goal of training is to find the weights wi in the linear equation y = w0 + w1x.

The Ordinary Least Squares procedure has four main steps in machine learning:

1. Random weight initialization. In practice, w0 and w1 are unknown at the beginning. The goal of the procedure is to find the appropriate values for these model parameters. To start the process, we set the values of the weights at random.

2. Input the initialized weights into the linear equation and generate a prediction for each observation point. To continue with the example that we’ve already used:

3. Calculate the Residual Sum of Squares (RSS).

a) Residuals, or error terms, are the difference between each actual output and the predicted output. They are a point-by-point estimate of how well our regression function predicts outputs in comparison to true values. We obtain residuals by calculating *actual values - predicted values* for each observation.

b) We square the residuals (in other words, we compute residual² for each observation point).

c) We sum the squared residuals to reach our RSS: 1,600,000,000 + 293,882,449 + 2,946,969,796 + 987,719,184 = 5,828,571,429.

d) The basis here is that a lower RSS means that our line of best fit comes closer to each data point. The further away the trend line is from actual observations, the higher the RSS. So, the closer the actual values are (blue points) to the regression line (red line), the better (the green lines representing residuals will be shorter).

4. Model parameter selection to minimize RSS. Machine learning approaches find the best parameters for the linear model by defining a cost function and minimizing it via gradient descent. By doing so, we obtain the best possible values for the weights.
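The arithmetic in step 3 can be checked in a few lines. The residual sizes below are not quoted in the text. They are the square roots of the four squared values listed in step 3c, and their signs (and the underlying actual and predicted prices) are unknown. The helper `residual_sum_of_squares` is a hypothetical name for the general computation.

```python
def residual_sum_of_squares(actual, predicted):
    """Sum of squared (actual - predicted) differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


# Residual sizes recovered as square roots of the squared values in step 3c.
residual_magnitudes = [40_000, 17_143, 54_286, 31_428]

squared_residuals = [r ** 2 for r in residual_magnitudes]   # step 3b
rss = sum(squared_residuals)                                # step 3c

print(squared_residuals)  # [1600000000, 293882449, 2946969796, 987719184]
print(rss)                # 5828571429
```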

The cost function is a formal representation of an objective that the algorithm is trying to achieve. In the case of linear regression, the cost function is the residual sum of squares. The algorithm solves the minimization problem - it tries to minimize the cost function in order to achieve the best fitting line with the lowest residual errors.

This is achieved through gradient descent.

Gradient descent is a method of adjusting the weights step by step, based on the value of the cost function. We first calculate the squared error at each input-output data point and sum these errors up.

We then take the partial derivatives of the cost function with respect to the weight and the bias to get the slope of the cost function at the current point. (No need to brush up on linear algebra and calculus right now. Python libraries such as scikit-learn have several matrix optimizations built in, which allow data science enthusiasts to unlock the power of advanced artificial intelligence without coding the answer themselves).

Based on the slope, gradient descent updates the values for the set of weights and the bias and re-iterates the training loop over new values (moving a step closer to the desired goal).

This iterative approach is repeated until a minimum error is reached, and gradient descent cannot minimize the cost function any further.

The results are optimal weights for the problem at hand.
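The training loop described above can be sketched in plain Python. This is an illustration under a few assumptions, not a production implementation. The data is hypothetical, the cost is the mean of the squared residuals (RSS divided by the number of points, which has the same minimum as RSS), and the learning rate and iteration count were picked by hand for this tiny dataset.

```python
import random


def gradient_descent(xs, ys, learning_rate=0.05, iterations=5000, seed=0):
    """Fit y = w0 + w1 * x by repeatedly stepping against the gradient of the cost."""
    rng = random.Random(seed)
    # Step 1: random weight initialization.
    w0 = rng.uniform(-1, 1)
    w1 = rng.uniform(-1, 1)
    n = len(xs)
    for _ in range(iterations):
        # Step 2 and 3: predict, then compute residuals (actual - predicted).
        residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w0 and w1.
        grad_w0 = -2 / n * sum(residuals)
        grad_w1 = -2 / n * sum(r * x for r, x in zip(residuals, xs))
        # Step 4: move the weights a small step downhill.
        w0 -= learning_rate * grad_w0
        w1 -= learning_rate * grad_w1
    return w0, w1


# Hypothetical data lying exactly on y = 1 + 2x (not taken from the article).
w0, w1 = gradient_descent([1, 2, 3, 4], [3, 5, 7, 9])
print(f"w0 = {w0:.4f}, w1 = {w1:.4f}")
```

On this dataset the loop settles very close to a bias of 1 and a weight of 2, the line the points were generated from.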

There is, however, one consideration to bear in mind when using gradient descent: the learning rate hyperparameter. The learning rate refers to how much the parameters are changed at each iteration. If the learning rate is too high, the updates overshoot the minimum, so the cost bounces around (or even grows) and the model fails to converge. If the learning rate is too low, the model will take too long to converge to the minimum error.
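The effect of the learning rate is easy to see on a toy problem. The sketch below runs the same number of gradient descent steps with three hand-picked learning rates on hypothetical data and reports the remaining mean squared error. The specific rates (0.0001, 0.05, 0.2) are assumptions that only make sense for this dataset.

```python
def cost_after_training(xs, ys, learning_rate, iterations=50):
    """Run gradient descent from w0 = w1 = 0 and return the final mean squared error."""
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
        # Step against the gradient of the mean squared error.
        w0 += learning_rate * 2 / n * sum(residuals)
        w1 += learning_rate * 2 / n * sum(r * x for r, x in zip(residuals, xs))
    residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
    return sum(r ** 2 for r in residuals) / n


# Hypothetical data lying exactly on y = 1 + 2x.
xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]
initial_cost = sum(y ** 2 for y in ys) / len(ys)  # cost with both weights at zero

for rate in (0.0001, 0.05, 0.2):
    print(f"learning rate {rate}: cost after 50 steps = {cost_after_training(xs, ys, rate):.4g}")
```

Here the smallest rate barely reduces the cost, the middle one gets close to zero, and the largest one makes the cost grow far beyond its starting value.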

How do we evaluate the accuracy of our model?

First of all, you need to make sure that you train the model on the training dataset and compute evaluation metrics on a separate test set, so that an overfitted model is not mistaken for a good one. Afterward, you can check several evaluation metrics to determine how well your model performed.
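A minimal sketch of holding out a test set is shown below. It assumes the data is a list of hypothetical (x, y) pairs. The 80/20 split and the helper name `split_train_test` are illustrative choices. Libraries such as scikit-learn ship a ready-made `train_test_split` function for the same purpose.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (x, y) observations.
rows = [(x, 2 * x + 1) for x in range(10)]
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

The model is then fitted on `train_rows` only, and the metrics are computed on `test_rows`.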

There are various metrics to evaluate the goodness of fit:

- **Mean Squared Error (MSE)**. MSE is computed as RSS divided by the total number of data points, i.e. the total number of observations or examples in our given dataset. MSE tells us the average squared residual per data point.
- **Root Mean Squared Error (RMSE)**. RMSE takes the MSE value and applies a square root over it. It is similar to MSE, but much more intuitive for error interpretation, because it is expressed in the same units as the target variable (for example, dollars instead of squared dollars). Unlike MSE and RSS (which use squared values), RMSE can be read directly as the typical size of the error that our prediction model makes.
- **R² (R-squared, or R2 score)**. R-squared is a measure of how much of the variance in the dependent variable our linear function accounts for. This measure is more technical than the other two, so it’s less intuitive for a non-statistician. As a rule of thumb, an R-squared value that is closer to 1 is better, because it accounts for more variance.
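The three metrics follow directly from their definitions. The sketch below computes them for a handful of hypothetical actual and predicted values. R-squared is computed as 1 minus RSS divided by the total sum of squares around the mean of the actual values.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE and R-squared for paired actual and predicted values."""
    n = len(actual)
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    mse = rss / n                      # average squared residual
    rmse = math.sqrt(mse)              # back in the units of the target
    mean_actual = sum(actual) / n
    tss = sum((a - mean_actual) ** 2 for a in actual)  # total variation in the target
    r2 = 1 - rss / tss                 # share of variation explained by the model
    return {"mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical values, for illustration only.
actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 9]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

For these values the residuals are 1, 0, -1 and 0, so RSS is 2, MSE is 0.5 and R-squared is 1 - 2/20 = 0.9.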

Once we have trained and evaluated our model, we improve it to make more accurate predictions.

There are multiple methods to improve your linear regression model.

The biggest improvement in your modeling will result from properly cleaning your data. Linear regression makes several assumptions about the structure of the underlying data. When these assumptions are violated, they skew the model or even prevent it from making accurate predictions. Make sure to:

- **Remove outliers**. Outliers in the quantitative response y skew the slope of the line disproportionately. Remove them to have a better-fitted line.
- **Remove multicollinearity**. Linear regression assumes that there is little or no correlation between the input variables - otherwise, the estimated weights become unstable and the model tends to overfit the data. Create a correlation matrix for all of your features to check which pairs of features suffer from high correlation. From each highly correlated pair, keep just one feature and remove the other.
- **Assert normal distribution**. Strictly speaking, the classical assumption is that the residuals follow a Gaussian distribution, but heavily skewed variables often make this harder to satisfy. Transform your variables with a log transform or Box-Cox if they are far from normally distributed.
- **Assert linear assumption**. If your independent variables do not have a linear relationship with your target variable, transform them to bring the relationship closer to linear (for example, a log transform can turn an exponential relationship into a linear one).
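The multicollinearity check can be sketched with a hand-written Pearson correlation over every pair of features. The feature values and the 0.9 cut-off below are hypothetical. What counts as "too correlated" depends on the problem.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical features for five houses.
features = {
    "bedrooms": [1, 2, 3, 4, 5],
    "area_sqft": [600, 1100, 1500, 2100, 2500],
    "age_years": [30, 5, 42, 12, 20],
}
threshold = 0.9  # illustrative cut-off, not a universal rule

# Compare every pair of features once and flag the strongly correlated ones.
names = list(features)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson(features[first], features[second])
        flag = "  <- consider dropping one" if abs(r) > threshold else ""
        print(f"{first} vs {second}: r = {r:.2f}{flag}")
```

In this made-up data, bedrooms and area move almost in lockstep, so keeping both adds little information.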

Features can come in different orders of magnitude. Using our example of the housing price prediction, the number of bedrooms would be on a scale from 1 - 10 (approximately), while the housing area in square feet would be 100-1000x bigger (1000-10,000 square feet).

Features of different scales converge more slowly (or not at all) with gradient descent.

Normalize or standardize your features to speed up and improve model training.
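Both rescaling options can be written in a few lines. In this sketch, min-max normalization maps a feature onto the range 0 to 1, while standardization shifts it to mean 0 and scales it to unit standard deviation (the population standard deviation is used here, which is an assumption of this example). The square-footage values are hypothetical.

```python
import math


def min_max_normalize(values):
    """Rescale values linearly so that the smallest becomes 0 and the largest 1."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


# Hypothetical house areas in square feet.
area_sqft = [1_000, 2_500, 4_000, 10_000]
print(min_max_normalize(area_sqft))
print(standardize(area_sqft))
```

After either transformation, a feature such as area ends up on a scale comparable to a feature such as the number of bedrooms treated the same way.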

Regularization is not useful for the simple regression problem with one input variable. Instead, it is commonly used in multiple regression settings to lower the complexity of the model. The complexity relates to the number of coefficients or weights (or features) that a model uses for its predictions.

Regularization can be thought of as a feature selection method, whereby features with lower contributions to the goodness of fit are removed and/or diminished in their effects, while the important features are emphasized.

There are two regularization techniques that are frequently used in linear regression settings:

- Lasso (L1) regression - uses a penalty term to remove predictor variables that have low contributions to overall model performance
- Ridge (L2) regression - uses a penalty term to lower the influence of predictor variables (but does not remove features)
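The difference between the two penalties is easiest to see in the cost that each method minimizes. The sketch below adds an L1 penalty (sum of absolute weights) or an L2 penalty (sum of squared weights) to the residual sum of squares. The penalty strength `alpha`, the data and the weights are all hypothetical, and the bias is left unpenalized, which is a common convention assumed here. Library implementations may scale these terms differently.

```python
def rss(bias, weights, rows, targets):
    """Residual sum of squares of a linear model over all observations."""
    total = 0.0
    for features, target in zip(rows, targets):
        prediction = bias + sum(w * x for w, x in zip(weights, features))
        total += (target - prediction) ** 2
    return total


def lasso_cost(bias, weights, rows, targets, alpha):
    """RSS plus an L1 penalty on the absolute size of the weights."""
    return rss(bias, weights, rows, targets) + alpha * sum(abs(w) for w in weights)


def ridge_cost(bias, weights, rows, targets, alpha):
    """RSS plus an L2 penalty on the squared size of the weights."""
    return rss(bias, weights, rows, targets) + alpha * sum(w ** 2 for w in weights)


# Hypothetical observations with two features each, and hypothetical model parameters.
rows = [[1, 2], [2, 1], [3, 3]]
targets = [2.0, 4.5, 6.5]
bias, weights = 1.0, [2.0, -0.5]

print(lasso_cost(bias, weights, rows, targets, alpha=1.0))  # 1.0 + (2.0 + 0.5) = 3.5
print(ridge_cost(bias, weights, rows, targets, alpha=1.0))  # 1.0 + (4.0 + 0.25) = 5.25
```

Both penalties grow with the size of the weights, so minimizing either cost pushes the weights toward zero. The L1 form is the one that tends to set some of them exactly to zero.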

The way in which you use linear regression in practice depends on how much you know about the entire data science process.

We recommend that beginners start with modeling data on already collected and cleaned datasets, while experienced data scientists can scale their operations by choosing the right software for the task at hand.

There are over 84 datasets that can be used to try out linear regression in practice. Among the best ones are:

- Predict the salary based on years of work experience
- Build an economic prediction of US food imports based on historical data
- Compare how different advertising channels affect total sales
- Predict the number of upvotes a social media post will get
- Predict the price at which the house will sell on a market given the real estate description

Production data science means spending more than 80% of your time on data collection and cleaning. If you want to speed up the entire data pipeline, use software that automates tasks to give you more time for data modeling.

Keboola offers a platform for data scientists who want to build their own machine learning models. It comes with one-click deployed Jupyter Notebooks, in which all of the modeling can be done using Julia, R, or Python.

Deep dive into the data science process with this Jupyter Notebook:

- Collect the relevant data
- Explore and clean the data to discover patterns
- Train your linear regression model
- Evaluate the model with a variety of metrics

Want to take it a step further? Keboola can assist you with operationalizing your entire data operations pipeline.

Being a data-centric platform, Keboola also allows you to build your ETL pipelines and orchestrate tasks to get your data ready for machine learning algorithms. You can deploy multiple models with different algorithms to version your work and compare which ones perform the best. Start building models today with our free trial.