Linear regression, alongside logistic regression, is one of the most widely used machine learning algorithms in real production settings. Here, we present a comprehensive analysis of linear regression, which can be used as a guide for both beginners and advanced data scientists alike.
Its popularity in real-life machine learning problems comes down to three things: simplicity, interpretability and speed.
You can think of linear regression as the answer to the question “How can I use X to predict Y?”, where X is some information that you have and Y is some information that you want to know.
Let’s look at a concrete example. You might be wondering how much you can sell your house for. You have information about your house, for instance, the number of bedrooms is 2 - this is your X. And you want to know how much your estate could be worth on the market. This is Y - the price in $ that you can sell your house for.
Linear regression creates an equation in which you input your given numbers (X) and it outputs the target variable that you want to find out (Y).
We obtain the equation by training it on pairs of (X, Y) values. In this case, we would use a dataset containing historic records of house purchases in the form of (“number of bedrooms”, “selling price”):
We would then visualize the data points on a scatter plot to see if there are any trends. A scatter plot is a two-dimensional plot, with each data point representing a house. On the x-axis, we would have values for “Number of bedrooms”, while on the y-axis, we would have the “Selling price” for the same houses:
Looking at the scatter plot, it seems that there is a trend: the more bedrooms that a house has, the higher its selling price (which is not surprising, to be honest). Now, let’s say that we trained a linear regression model to get an equation in the form:
Selling price = $77,143 * (Number of bedrooms) - $74,286
The equation acts as a prediction. If you input the number of bedrooms, you get the predicted value for the price at which the house is sold. For the specific example above:
Your selling price = $77,143 * 2 bedrooms - $74,286 = $80,000
In other words, you could sell your 2-bedroom house for approximately $80,000. But linear regression does more than just that. We can also visualize it graphically to see what the price would be for houses with a different number of bedrooms:
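As a minimal sketch, the equation above can be turned into a small Python function. The coefficients are the rounded values quoted in the text; the function name `predict_price` is just an illustrative choice.

```python
# Coefficients quoted in the text above (rounded to whole dollars).
SLOPE = 77_143       # dollars per additional bedroom
INTERCEPT = -74_286  # dollars


def predict_price(bedrooms: int) -> int:
    """Return the predicted selling price in dollars."""
    return SLOPE * bedrooms + INTERCEPT


# Predicted prices for houses with 1 to 5 bedrooms.
for bedrooms in range(1, 6):
    print(f"{bedrooms} bedroom(s): ${predict_price(bedrooms):,}")
```

For 2 bedrooms, this returns the same $80,000 that we calculated by hand, and each additional bedroom adds $77,143 to the prediction.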
This is because linear regression tries to find a straight line that best fits the data. Linear regression is not limited to real-estate problems: it can also be applied to a variety of business use cases.
Linear regression is used for a wide array of business prediction problems:
Linear regression is extremely useful in answering hard business questions, but there are other reasons why it is one of the most used machine learning algorithms...
When it comes to production data science settings, linear regression is the popular choice due to its many benefits:
These specific features explain why linear regression is one of the best models for making predictions using machine learning.
Linear regression is a supervised machine-learning regression algorithm. That’s a mouthful! Let’s break it down:
Keep in mind that linear regression is just one of the many regression techniques that we have at our disposal. There are several types of these techniques in the field of predictive modeling:
Linear regression is such a useful and established algorithm that it is both a statistical model and a machine learning model. Here, we will focus mainly on the machine learning side, but we will also draw some parallels to statistics in order to paint a complete picture.
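Once trained, the model takes the form of a linear regression equation of this type:
y = w0 + w1x
In this equation:
y is the output (target) variable that we want to predict.
x is the input variable (feature).
w0 is the bias term (the y-intercept of the line).
w1 is the weight (the slope of the line) assigned to the input variable.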
The goal of the regression analysis (modeling) is to find the values for the unknown parameters of the equation; that is, to find the values for the weights w0 and w1.
Both simple and multiple linear regressions assume that there is a linear relationship between the input variable(s) and the output target variable.
The main difference is the number of independent variables that they take as inputs. Simple linear regression just takes a single feature, while multiple linear regression takes multiple x values. The above formula can be rewritten for a model with n input variables as:
y = w0 + w1x1 + w2x2 + ... + wnxn
Where xi is the i-th feature with its own wi weight.
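To make the relationship between the two forms concrete, here is a short sketch of a prediction function for y = w0 + w1x1 + ... + wnxn. The function name and the two-feature weights are hypothetical values chosen for illustration; only the one-feature weights come from the house price example.

```python
def predict(bias: float, weights: list[float], features: list[float]) -> float:
    """Compute y = w0 + w1*x1 + ... + wn*xn."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# Simple linear regression is the special case with a single feature.
simple = predict(-74_286, [77_143], [2])

# Hypothetical two-feature model: number of bedrooms and area in square feet.
multiple = predict(10_000.0, [50_000.0, 100.0], [2, 1_200])

print(simple, multiple)
```

The same function covers both cases. The only thing that changes is the length of the weight and feature lists.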
The simple linear regression model can be represented graphically as a best-fit line between the data points, while the multiple linear regression model can be represented as a plane (with two input variables) or a hyperplane (with more than two).
Despite their differences, both the simple and multiple regression models are linear models - they adopt the form of a linear equation. This is called the linear assumption. Quite simply, it means that we assume that the type of relationship between the set of independent variables and the dependent (target) variable is linear.
We train the linear regression algorithm with a method named Ordinary Least Squares (or just Least Squares). The goal of training is to find the weights wi in the linear equation y = w0 + w1x.
The Ordinary Least Squares procedure has four main steps in machine learning:
1. Random weight initialization. In practice, w0 and w1 are unknown at the beginning. The goal of the procedure is to find the appropriate values for these model parameters. To start the process, we set the values of the weights at random.
2. Input the initialized weights into the linear equation and generate a prediction for each observation point. To continue with the example that we’ve already used:
3. Calculate the Residual Sum of Squares (RSS).
a) Residuals, or error terms, are the difference between each actual output and the predicted output. They are a point-by-point estimate of how well our regression function predicts outputs in comparison to true values. We obtain residuals by calculating actual values - predicted values for each observation.
b) We square the residuals (in other words, we compute residual² for each observation point).
c) We sum the squared residuals to reach our RSS: 1,600,000,000 + 293,882,449 + 2,946,969,796 + 987,719,184 = 5,828,571,429.
d) The basis here is that a lower RSS means that our line of best fit comes closer to each data point. The further away the trend line is from actual observations, the higher the RSS. So, the closer the actual values are (blue points) to the regression line (red line), the better (the green lines representing residuals will be shorter).
4. Model parameter selection to minimize RSS. Machine learning approaches find the best parameters for the linear model by defining a cost function and minimizing it via gradient descent. By doing so, we obtain the best possible values for the weights.
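Steps 2 and 3 can be sketched in a few lines of Python. The four (bedrooms, price) records below are illustrative values that we assume for this example; they were picked so that their squared residuals match the numbers in step 3c, and they are not guaranteed to be the records of the original dataset. The weights are the rounded coefficients from the house price equation.

```python
# Illustrative (bedrooms, selling price in $) records. These are assumed
# values, chosen to be consistent with the squared residuals in step 3c.
houses = [(2, 120_000), (3, 140_000), (4, 180_000), (6, 420_000)]

W0, W1 = -74_286, 77_143  # rounded weights from the house price equation


def residual_sum_of_squares(data, w0, w1):
    """Sum the squared differences between actual and predicted values."""
    total = 0
    for x, actual in data:
        predicted = w0 + w1 * x        # step 2: generate a prediction
        residual = actual - predicted  # step 3a: actual - predicted
        total += residual ** 2         # steps 3b and 3c: square and sum
    return total


rss = residual_sum_of_squares(houses, W0, W1)
print(f"RSS = {rss:,}")  # RSS = 5,828,571,429
```

Trying other values for the two weights with this function gives a larger RSS, which is exactly what step 4 relies on.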
The cost function is a formal representation of an objective that the algorithm is trying to achieve. In the case of linear regression, the cost function is the same as the residual sum of squares. The algorithm solves the minimization problem - it tries to minimize the cost function in order to achieve the best fitting line with the lowest residual errors.
This is achieved through gradient descent.
Gradient descent is a method of changing weights based on the loss function for each data point. We calculate the squared error at each input-output data point and sum these errors.
We take the partial derivatives of the cost function with respect to the weight and the bias to get the slope of the cost function at each point. (No need to brush up on linear algebra and calculus right now. Python libraries such as Scikit-learn come with several matrix optimizations built in, which allow data science enthusiasts to unlock the power of advanced artificial intelligence without coding the answer themselves.)
Based on the slope, gradient descent updates the values for the set of weights and the bias and re-iterates the training loop over new values (moving a step closer to the desired goal).
This iterative approach is repeated until a minimum error is reached, and gradient descent cannot minimize the cost function any further.
The results are optimal weights for the problem at hand.
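The training loop described above could look roughly like the sketch below. It makes a few assumptions of its own: the data is the same illustrative four-house sample (with prices expressed in thousands of dollars to keep the numbers small), the cost is the mean of the squared errors (RSS divided by the number of observations, which has the same minimum as RSS), and the learning rate and the fixed number of iterations are arbitrary choices rather than recommended settings.

```python
import random

# Illustrative data: (bedrooms, price in thousands of dollars). Assumed values.
houses = [(2, 120.0), (3, 140.0), (4, 180.0), (6, 420.0)]


def gradient_descent(data, learning_rate=0.05, iterations=5_000, seed=0):
    """Fit y = w0 + w1 * x by minimizing the mean squared error."""
    rng = random.Random(seed)
    # Start from random weights.
    w0 = rng.uniform(-1.0, 1.0)
    w1 = rng.uniform(-1.0, 1.0)
    n = len(data)
    for _ in range(iterations):
        # Prediction error for each observation with the current weights.
        errors = [(w0 + w1 * x) - y for x, y in data]
        # Partial derivatives of the cost with respect to w0 and w1.
        grad_w0 = 2.0 / n * sum(errors)
        grad_w1 = 2.0 / n * sum(e * x for e, (x, y) in zip(errors, data))
        # Move a small step against the slope.
        w0 -= learning_rate * grad_w0
        w1 -= learning_rate * grad_w1
    return w0, w1


w0, w1 = gradient_descent(houses)
print(f"price (in $1000s) = {w1:.3f} * bedrooms + ({w0:.3f})")
```

On this sample, the weights settle close to 77.143 and -74.286, the same coefficients as in the house price equation when expressed in thousands of dollars. A production implementation would usually stop once the cost stops improving instead of running a fixed number of iterations.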
There is, however, one consideration to bear in mind when using gradient descent: the learning rate hyperparameter. The learning rate refers to how much the parameters are changed at each iteration. If the learning rate is too high, the model fails to converge: each update overshoots the minimum, and the cost jumps between better and worse values. If the learning rate is too low, the model will take too long to converge to the minimum error.
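The effect of the learning rate is easy to see in a small experiment. The sketch below reuses the illustrative four-house sample (prices in thousands of dollars, assumed values) and runs 100 gradient descent iterations from the same starting point with three hand-picked learning rates. The helper names and the rates are assumptions made for this example only.

```python
# Illustrative data: (bedrooms, price in thousands of dollars). Assumed values.
houses = [(2, 120.0), (3, 140.0), (4, 180.0), (6, 420.0)]


def cost(data, w0, w1):
    """Mean of the squared errors for the given weights."""
    errors = [(w0 + w1 * x) - y for x, y in data]
    return sum(e * e for e in errors) / len(data)


def cost_after_training(data, learning_rate, iterations=100):
    """Run gradient descent from w0 = w1 = 0 and return the final cost."""
    w0, w1 = 0.0, 0.0  # fixed starting point so that the runs are comparable
    n = len(data)
    for _ in range(iterations):
        errors = [(w0 + w1 * x) - y for x, y in data]
        grad_w0 = 2.0 / n * sum(errors)
        grad_w1 = 2.0 / n * sum(e * x for e, (x, y) in zip(errors, data))
        w0 -= learning_rate * grad_w0
        w1 -= learning_rate * grad_w1
    return cost(data, w0, w1)


starting_cost = cost(houses, 0.0, 0.0)
print(f"starting cost: {starting_cost:.4g}")
for rate in (0.0001, 0.05, 0.1):
    print(f"learning rate {rate}: cost {cost_after_training(houses, rate):.4g}")
```

With these settings, the lowest rate barely improves on the starting cost, the middle rate gets close to the minimum, and the highest rate ends with a cost far above the starting value because every step overshoots.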
How do we evaluate the accuracy of our model?
First of all, make sure that you train the model on the training dataset and compute the evaluation metrics on the test set. Metrics computed on data that the model has already seen can hide overfitting. Afterward, you can check several evaluation metrics to determine how well your model performed.
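A holdout split can be sketched as follows. The helper `holdout_split`, the 80/20 ratio, the seed and the toy records are all assumptions made for illustration; libraries such as Scikit-learn ship their own splitting utilities.

```python
import random


def holdout_split(records, test_fraction=0.2, seed=42):
    """Shuffle the records and set aside a fraction of them for testing."""
    shuffled = list(records)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


# Toy (bedrooms, price) records, made up for this example.
records = [(bedrooms, 50_000 * bedrooms) for bedrooms in range(1, 11)]
train, test = holdout_split(records)
print(len(train), "training records,", len(test), "test records")
```

The model is then fitted on `train` only, and the metrics are computed on `test`.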
There are various metrics to evaluate the goodness of fit:
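As a hedged illustration, here is how four commonly used regression metrics can be computed by hand: mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE) and the coefficient of determination (R²). We assume these four for the example; they are not necessarily the exact metrics the article had in mind. The actual and predicted values are made up.

```python
import math


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(actual, predicted))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of the variance in the target that the model explains."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up test set: prices in thousands of dollars.
actual = [100, 150, 200, 250]
predicted = [110, 140, 210, 240]

print("MSE: ", mse(actual, predicted))        # 100.0
print("RMSE:", rmse(actual, predicted))       # 10.0
print("MAE: ", mae(actual, predicted))        # 10.0
print("R^2: ", r_squared(actual, predicted))  # approximately 0.968
```

Lower MSE, RMSE and MAE mean a better fit, while R² gets closer to 1 as the fit improves.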
Once we have trained and evaluated our model, we improve it to make more accurate predictions.
There are multiple methods to improve your linear regression model.
The biggest improvement in your modeling will result from properly cleaning your data. Linear regression makes several assumptions about the structure of the underlying data. When these assumptions are violated, the model's predictions become skewed, or the model cannot make accurate predictions at all. Make sure to:
Features can come in different orders of magnitude. Using our example of the housing price prediction, the number of bedrooms would be on a scale from 1 - 10 (approximately), while the housing area in square feet would be 100-1000x bigger (1000-10,000 square feet).
Features of different scales converge more slowly (or not at all) with gradient descent.
Normalize or standardize your features to speed up and improve model training.
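Both rescaling options fit in a few lines. In this sketch, normalization means min-max scaling to the range 0 to 1, and standardization means subtracting the mean and dividing by the standard deviation. The feature values are made up, and the function names are illustrative.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values to the range 0 to 1 (fails if all values are equal)."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up features on very different scales.
bedrooms = [1, 2, 3, 4, 5]
area_sqft = [1_000, 2_500, 4_000, 6_000, 10_000]

print(min_max_normalize(bedrooms))
print(min_max_normalize(area_sqft))
print(standardize(bedrooms))
print(standardize(area_sqft))
```

After either transformation, the two features are on a comparable scale, so one learning rate works reasonably well for both weights.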
Regularization is not useful for the simple regression problem with one input variable. Instead, it is commonly used in multiple regression settings to lower the complexity of the model. The complexity relates to the number of coefficients or weights (or features) that a model uses for its predictions.
Regularization can be thought of as a feature selection method, whereby features with lower contributions to the goodness of fit are removed and/or diminished in their effects, while the important features are emphasized.
There are two regularization techniques that are frequently used in linear regression settings:
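To show what regularization changes, the sketch below adds a penalty term to the residual sum of squares. It assumes that the two techniques in question are Ridge (a penalty on the squared weights, also called L2) and Lasso (a penalty on the absolute weights, also called L1). The data, the weights and the penalty strength `alpha` are made-up values.

```python
def rss(data, bias, weights):
    """Residual sum of squares for a list of (features, target) records."""
    total = 0.0
    for features, target in data:
        predicted = bias + sum(w * x for w, x in zip(weights, features))
        total += (target - predicted) ** 2
    return total


def ridge_cost(data, bias, weights, alpha):
    """RSS plus an L2 penalty: alpha times the sum of squared weights."""
    return rss(data, bias, weights) + alpha * sum(w * w for w in weights)


def lasso_cost(data, bias, weights, alpha):
    """RSS plus an L1 penalty: alpha times the sum of absolute weights."""
    return rss(data, bias, weights) + alpha * sum(abs(w) for w in weights)


# One made-up record that the weights below fit perfectly (RSS is 0),
# so that the printed costs show the penalty terms on their own.
data = [([1.0, 2.0], 2.0)]
bias, weights = 0.0, [3.0, -0.5]

print(ridge_cost(data, bias, weights, alpha=10.0))  # 92.5
print(lasso_cost(data, bias, weights, alpha=10.0))  # 35.0
```

In both cases the bias is left out of the penalty, which is the usual convention. Larger values of `alpha` push the weights toward zero, trading a slightly worse fit for a simpler model.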
The way in which you use linear regression in practice depends on how much you know about the entire data science process.
We recommend that beginners start with modeling data on already collected and cleaned datasets, while experienced data scientists can scale their operations by choosing the right software for the task at hand.
There are over 84 datasets that can be used to try out linear regression in practice. Among the best ones are:
Production data science means spending more than 80% of your time on data collection and cleaning. If you want to speed up the entire data pipeline, use software that automates tasks to give you more time for data modeling.
Keboola offers a platform for data scientists who want to build their own machine learning models. It comes with one-click deployed Jupyter Notebooks, in which all of the modeling can be done using Julia, R, or Python.
Deep dive into the data science process with this Jupyter Notebook:
Want to take it a step further? Keboola can assist you with operationalizing your entire data operations pipeline.
Being a data-centric platform, Keboola also allows you to build your ETL pipelines and orchestrate tasks to get your data ready for machine learning algorithms. You can deploy multiple models with different algorithms to version your work and compare which ones perform the best. Start building models today with our free trial.