The Ultimate Guide to Random Forest Regression

Random forest is one of the most widely used machine learning algorithms in real production settings.

1. Introduction to random forest regression

Random forest is one of the most popular algorithms for regression problems (i.e. predicting continuous outcomes) because of its simplicity and high accuracy. In this guide, we’ll give you a gentle introduction to random forest and the reasons behind its popularity.

1.1 How would random forest be described in layman’s terms?

Let’s start with an actual problem. Imagine you want to buy real estate, and you want to figure out what constitutes a good deal so that you don’t get taken advantage of.

The obvious thing to do would be to look at historic prices of properties sold in the area, then create some kind of decision criteria to summarize the average selling price for a given property specification. You can use the decision chart to evaluate whether the listed price for the property you are considering is a bargain or not. It could look like this:


The chart represents a decision tree through a series of yes/no questions, which lead you from the real-estate description (“3 bedrooms”) to its historic average price. You can use the decision tree to predict the expected price of a property, given its attributes.
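
Such a chart boils down to nested yes/no questions. The sketch below is purely illustrative: the features (`bedrooms`, `has_garden`), the thresholds, and the prices are hypothetical and are not taken from the chart in the original article.

```python
def predict_price(bedrooms: int, has_garden: bool) -> int:
    """Toy decision tree; every threshold and price here is made up."""
    if bedrooms >= 3:
        if has_garden:
            return 300_000
        return 250_000
    if has_garden:
        return 200_000
    return 150_000


# Follow the yes/no questions for one hypothetical listing.
listing = {"bedrooms": 3, "has_garden": False}
print(predict_price(**listing))  # 250000
```

Each `return` plays the role of a leaf in the chart: the historic average price for properties that answer the questions the same way.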

However, you could come up with a distinctly different decision tree structure:

This would also be a valid decision chart, but with totally different decision criteria. These decisions are just as well-founded and show you information that was absent in the first decision tree.

The random forest regression algorithm takes advantage of the ‘wisdom of the crowds’. It takes multiple (but different) regression decision trees and makes them ‘vote’. Each tree predicts the expected price of the property based on the decision criteria it picked. Random forest regression then averages all of the predictions to produce a more robust estimate of the expected price than any single tree would give.
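
As a rough sketch of this ‘voting’, assume three hand-written trees that each look at a different attribute. The tree functions, feature names, and prices below are all hypothetical; a real random forest learns its trees from data.

```python
from statistics import mean


def tree_by_size(home: dict) -> float:
    # Hypothetical tree that splits on the number of bedrooms.
    return 250_000.0 if home["bedrooms"] >= 3 else 160_000.0


def tree_by_location(home: dict) -> float:
    # Hypothetical tree that splits on distance to the city centre.
    return 280_000.0 if home["km_to_centre"] < 5 else 190_000.0


def tree_by_age(home: dict) -> float:
    # Hypothetical tree that splits on the construction year.
    return 240_000.0 if home["year_built"] >= 2000 else 220_000.0


def forest_predict(trees, home: dict) -> float:
    """Average the individual tree predictions."""
    return mean(tree(home) for tree in trees)


home = {"bedrooms": 3, "km_to_centre": 8, "year_built": 1995}
trees = [tree_by_size, tree_by_location, tree_by_age]
print(forest_predict(trees, home))  # 220000.0
```

The three trees disagree (250,000, 190,000 and 220,000), and the forest reports their average.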

1.2 What are the business use cases of random forest?

Random forest regression is used to solve a variety of business problems where the company needs to predict a continuous value:

  1. Predict future prices/costs. Whenever your business is trading products or services (e.g. raw materials, stocks, labor, service offerings, etc.), you can use random forest regression to predict what the prices of these products and services will be in the future.
  2. Predict future revenue. Use random forest regression to model your operations. For example, you can input your investment data (advertisement, sales materials, cost of hours worked on long-term enterprise deals, etc.) and your revenue data, and random forest will discover the connection between the input and output. This connection can be used to predict how much revenue you will generate based on the growth activity that you pick (marketing, direct to customer sales, enterprise sales, etc.) and how much you are willing to spend on it. 
  3. Compare performance. Imagine that you’ve just launched a new product line. The problem is, it’s unclear whether the new product is attracting more (and higher spending) customers than the existing product line. Use random forest regression to determine how your new product compares to your existing ones.

Random forest regression is extremely useful in answering interesting and valuable business questions, but there are additional reasons why it is one of the most used machine learning algorithms.

1.3 What are the advantages of random forest for real production applications?

Random forest regression is a popular algorithm due to its many benefits in production settings:

  1. High accuracy. Thanks to its ‘wisdom of the crowds’ approach, random forest regression often achieves high accuracy. It frequently produces better results than linear models such as linear regression, particularly when the relationship between the inputs and the output is non-linear.
  2. Scales well. Computationally, the algorithm scales well when new features or samples are added to the dataset. 
  3. Interpretable. Although it is not as easily explainable as its underlying algorithm, decision tree regression, a random forest can be inspected to output the decision trees that contributed to the final prediction. The individual trees can be used to understand what the important decision nodes were, and to prompt questions about what led to the final prediction.
  4. Easy to use. Random forest can work with both categorical and numerical input variables, so you may spend less time one-hot encoding or labeling data, although some implementations (scikit-learn among them) expect categorical features to be encoded as numbers first. It is fairly tolerant of missing data in implementations that support it, and it can handle outliers to a certain extent. Overall, it saves you time that would otherwise be spent cleaning data, which is usually the biggest step in any data science pipeline. This doesn’t mean that you should skip the cleaning stage entirely: you will often obtain better performance by working the data into an appropriate shape. But random forest does make it easier and faster to reach a base model.

2. Machine learning approaches to random forest

Random forest is both a supervised learning algorithm and an ensemble algorithm. 

It is supervised in the sense that during training, it learns the mappings between inputs and outputs. For example, an input feature (or independent variable) in the training dataset would specify that an apartment has “3 bedrooms” (feature: number of bedrooms) and this maps to the output feature (or target) that the apartment will be sold for “$200,000” (target: price sold). 

Ensemble algorithms combine multiple other machine learning algorithms, in order to make more accurate predictions than any underlying algorithm could on its own. In the case of random forest, it ensembles multiple decision trees into its final decision.

Random forest can be used on both regression tasks (predicting continuous outputs, such as price) and classification tasks (predicting categorical or discrete outputs). Here, we will take a deeper look at using random forest for regression predictions.

2.1 The random forest regression model

The random forest algorithm follows a two-step process:

  1. Build n decision tree regressors (estimators). The number of estimators n defaults to 100 in scikit-learn (the machine learning Python library), where it is called n_estimators. The trees are built following the specified hyperparameters (e.g. minimum number of samples at the leaf nodes, maximum depth that a tree can grow, etc.).
  2. Average the predictions across estimators. Each decision tree regressor predicts a number as an output for a given input. Random forest regression takes the average of those predictions as its ‘final’ output.
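
Written as a formula, the averaging step looks as follows. The notation is introduced here for illustration only: T_i denotes the i-th fitted tree, n the number of estimators, and x the input whose target is being predicted.

```latex
\hat{y}(x) = \frac{1}{n} \sum_{i=1}^{n} T_i(x)
```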


Let’s delve deeper into how random forest regression builds regression trees. 

2.1.1 Decision tree regression

Regression using decision trees follows the same pattern as any decision tree algorithm:

1. Attribute selection. The decision tree regression algorithm looks at all attributes and their values to determine which attribute value would lead to the ‘best split’. For regression problems, the algorithm uses MSE (mean squared error) as its objective or cost function, which needs to be minimized. This is equivalent to using variance reduction as the feature selection criterion. Note: scikit-learn also has an MAE (mean absolute error) implementation.

2. Splitting. Once it finds the best split point candidate, it splits the dataset at that value (the first split forms the root node) and repeats the process of attribute selection on each of the resulting subsets.

3. The algorithm continues iteratively until either:

a) The tree is fully grown, so that each terminal (leaf) node holds a single sample (no stopping criteria were set).

b) We reached some stopping criteria. For example, we might have set a maximum depth, which only allows a certain number of splits from the root node to the terminal nodes. Or we might have set a minimum number of samples in each terminal node to prevent them from splitting beyond a certain point.
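
The three steps above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the function names and the dictionary-based tree layout are assumptions made for this example, the search is brute force, and the sketch also stops when no split lowers the error. The parameters `max_depth` and `min_samples_leaf` mirror the stopping criteria in step 3b.

```python
from statistics import mean


def mse(values):
    """Mean squared error of the values around their own mean."""
    if not values:
        return 0.0
    m = mean(values)
    return mean((v - m) ** 2 for v in values)


def best_split(rows, targets, min_samples_leaf=1):
    """Return (feature_index, threshold) minimising weighted MSE, or None."""
    best, best_score = None, mse(targets)
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if len(left) < min_samples_leaf or len(right) < min_samples_leaf:
                continue
            score = (len(left) * mse(left) + len(right) * mse(right)) / n
            if score < best_score:  # keep only splits that lower the error
                best, best_score = (f, threshold), score
    return best


def build_tree(rows, targets, depth=0, max_depth=None, min_samples_leaf=1):
    """Grow a regression tree recursively; leaves store the mean target."""
    split = None
    if max_depth is None or depth < max_depth:
        split = best_split(rows, targets, min_samples_leaf)
    if split is None:  # stopping criterion reached or nothing left to gain
        return {"value": mean(targets)}
    f, threshold = split
    left = [(r, t) for r, t in zip(rows, targets) if r[f] <= threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] > threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree(*zip(*left), depth + 1, max_depth, min_samples_leaf),
        "right": build_tree(*zip(*right), depth + 1, max_depth, min_samples_leaf),
    }


def predict(tree, row):
    """Walk from the root node down to a leaf and return its value."""
    while "value" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["value"]


# Hypothetical data: (bedrooms, square metres) -> price in thousands.
rows = [(1, 40), (2, 60), (3, 90), (4, 120)]
targets = [100, 150, 220, 300]

full_tree = build_tree(rows, targets)          # case a): no stopping criteria
stump = build_tree(rows, targets, max_depth=1)  # case b): a single split
print(predict(full_tree, (3, 90)), predict(stump, (3, 90)))
```

The fully grown tree reproduces each training target exactly, while the depth-limited tree returns the mean of a whole group of samples.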

So, why is a single tree not enough? Why do we need a forest of trees?

2.1.2 Why are decision trees not enough?

Decision trees have a couple of problems:

  1. The main problem is that they tend to overfit very easily. This causes high variance, which shows up as high error on the test dataset despite high accuracy on the training dataset. In other words, decision trees do not generalize well to novel data.
  2. Decision trees are easily swayed by data that splits the attributes well. Imagine that there’s a single feature whose values almost deterministically split the dataset. Let's say that when X1 = 0, then Y will always be below 10, while when X1 = 1, then Y will always be greater than or equal to 10. Almost every decision tree will use this feature in its split criteria, making the trees overly correlated with each other.

The ensemble of decision trees introduces randomness, which mitigates the issues above. So how does random forest impose randomness? And how does this help make better predictions?

2.1.3 What is random about random forest regression?

The ensemble of decision trees has high accuracy because it uses randomness on two levels:

  1. The algorithm randomly selects a subset of features, which can be used as candidates at each split. This prevents the multitude of decision trees from relying on the same set of features, which addresses Problem 2 above and decorrelates the individual trees.
  2. Each tree is trained on a random sample of the training dataset, typically drawn with replacement (a bootstrap sample). This introduces a further element of randomness: each tree may still overfit its own sample, but because the trees see different data, their errors differ and partly cancel out when the predictions are averaged.
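
Both sources of randomness can be sketched with the standard library. The helper names, the example data, and the choice of two candidate features are assumptions for illustration; actual libraries implement this differently and more efficiently.

```python
import random


def bootstrap_sample(rows, targets, rng):
    """Draw len(rows) samples with replacement (a bootstrap sample)."""
    indices = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in indices], [targets[i] for i in indices]


def random_feature_subset(n_features, max_features, rng):
    """Pick max_features distinct feature indices as split candidates."""
    return sorted(rng.sample(range(n_features), max_features))


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical data: (bedrooms, square metres, year built) -> price.
rows = [(1, 40, 1990), (2, 60, 2005), (3, 90, 2010), (4, 120, 1985)]
targets = [100, 150, 220, 300]

# Randomness level 2: each tree gets its own resampled training set.
sample_rows, sample_targets = bootstrap_sample(rows, targets, rng)

# Randomness level 1: each split only considers some of the features.
candidates = random_feature_subset(n_features=3, max_features=2, rng=rng)

print(sample_rows, sample_targets, candidates)
```

Because sampling is done with replacement, some rows typically appear more than once in a tree's sample while others are left out entirely.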

Ensembling decision trees allows us to compensate for the weaknesses of each individual tree.
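
A standard result makes the benefit of decorrelation concrete. It holds under the simplifying assumption that the n trees are identically distributed, each with prediction variance σ² and pairwise correlation ρ (symbols introduced here for illustration):

```latex
\operatorname{Var}\!\left( \frac{1}{n} \sum_{i=1}^{n} T_i(x) \right)
  = \rho \sigma^2 + \frac{1 - \rho}{n} \sigma^2
```

Adding trees shrinks only the second term, so the lower the correlation ρ between trees, the more the averaging reduces the variance.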

2.3 Beyond random forest: how to improve the model

The base model can be improved in a couple of ways by tuning the parameters of the random forest regressor:

  1. Specify the maximum depth of the trees. By default, trees are expanded until all leaves are either pure or contain fewer than the minimum samples for the split. Trees grown this deep can overfit, while trees that are too shallow can underfit. Play with the hyperparameter to find an optimal number for max_depth.
  2. Increase or decrease the number of estimators. How does changing the number of trees affect performance? More trees usually mean higher accuracy at the cost of slower training and prediction, although the gains tend to level off beyond a certain number of trees. If you wish to speed up your random forest, lower the number of estimators. If you want to increase the accuracy of your model, increase the number of trees.
  3. Specify the maximum number of features to be included at each node split. This depends very heavily on your dataset. If your independent variables are highly correlated, you’ll want to decrease the maximum number of features. If your input attributes are not correlated and your model is suffering from low accuracy, increase the number of features to be included.
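
One hedged way to organize this tuning is to enumerate every combination of candidate values and evaluate each one on held-out data. The parameter names below (`n_estimators`, `max_depth`, `max_features`) follow scikit-learn's random forest regressor; the candidate values and the `expand_grid` helper are arbitrary examples, not recommendations.

```python
from itertools import product

# Candidate values are arbitrary examples, not recommendations.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "max_features": [0.3, 1.0],  # fraction of features tried at each split
}


def expand_grid(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(expand_grid(param_grid))
print(len(candidates))  # 12
print(candidates[0])

# Each dict could then be passed to scikit-learn, for example as
# RandomForestRegressor(**params), and scored on a validation set.
```

The number of combinations grows multiplicatively with every value added, so keep the grid small for a first pass.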

3. Random forest regression in practice

The way in which you use random forest regression in practice depends on how much you know about the entire data science process.

We recommend that beginners start by modeling data on datasets that have already been collected and cleaned, while experienced data scientists can scale their operations by choosing the right software for the task at hand.

3.1 Beginner projects to try out random forest regression

There are over 84 datasets to try out random forest regression in practice. Among the best ones are:

  1. Predict the salary based on years of work experience.
  2. Build an economic prediction of US food imports based on historical data.
  3. Compare how different advertising channels affect total sales.
  4. Predict the number of upvotes a social media post will get.
  5. Predict the price at which a house will sell on the market, given the real estate description.

3.2 Production software for advanced data science

Data scientists are often estimated to spend as much as 80% of their time on data collection and cleaning. If you want to speed up the entire data pipeline, use software that automates tasks to give you more time for data modeling.

Keboola offers a platform for data scientists who want to build their own machine learning models. It comes with one-click deployed Jupyter Notebooks, through which all of the modeling can be done using Julia, R, or Python. 

Deep dive into the data science process with this Jupyter Notebook:

  1. Collect the relevant data.
  2. Explore and clean the data to discover patterns.
  3. Train your random forest regression model.
  4. Evaluate the model with a variety of metrics.
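
For the evaluation step, common regression metrics are easy to compute by hand. The sketch below shows four of them (MAE, MSE, RMSE and R²); which metrics the notebook itself uses is not stated here, and the prices are hypothetical.

```python
from math import sqrt
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def mse(y_true, y_pred):
    """Mean squared error."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical prices (in thousands) and model predictions.
y_true = [100, 150, 220, 300]
y_pred = [110, 140, 230, 290]

print(mae(y_true, y_pred))   # 10
print(mse(y_true, y_pred))   # 100
print(rmse(y_true, y_pred))  # 10.0
print(round(r2(y_true, y_pred), 3))
```

MAE and RMSE are expressed in the units of the target, which makes them easier to explain to stakeholders than MSE.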

Want to take it a step further? Keboola can help you operationalize your entire data operations pipeline.

Being a data-centric platform, Keboola also allows you to build your ETL pipelines and orchestrate tasks to get your data ready for machine learning algorithms. Deploy multiple models with different algorithms to version your work and compare which ones perform best. Start building models today with our free trial. 

