Introduction to Random Forest Regression

Random forest is one of the most popular algorithms for regression problems (predicting continuous outcomes) because of its simplicity and high accuracy. It is an ensemble machine learning algorithm, and it is perhaps the most widely used one given its good or excellent performance across a wide range of classification and regression predictive modeling problems.

A Random Forest is an ensemble technique capable of performing both regression and classification tasks. It combines multiple decision trees using a technique called Bootstrap Aggregation, commonly known as bagging: each decision tree is trained on a different sample of the data, drawn with replacement. Random forest is simple to use because it has few key hyperparameters and reasonable heuristics for setting them. It is a supervised learning algorithm, and the "forest" it builds is an ensemble of decision trees, usually trained with the bagging method. The general idea of bagging is that combining several learning models improves the overall result.
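To make this concrete, here is a minimal sketch of training a random forest regressor with scikit-learn. The dataset and hyperparameter values are purely illustrative; in practice you would tune them on your own data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 3))                     # 200 samples, 3 features
y = X[:, 0] ** 2 + 3 * X[:, 1] + rng.normal(0, 1, 200)    # noisy non-linear target

# n_estimators is the number of trees; bootstrap=True enables bagging
model = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=0)
model.fit(X, y)

print(model.predict(X[:5]))  # each prediction is an average over all trees
```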

How does it work?

Random Forests produce strong results, work well on large datasets, and some implementations can handle missing data by estimating the missing values. However, they pose one major challenge: they cannot extrapolate beyond the range of the training data. We'll dive deeper into this limitation in a minute.

The ensemble of decision trees has high accuracy because it uses randomness on two levels:

  1. The algorithm randomly selects a subset of features to consider as candidates at each split. This prevents the individual decision trees from all relying on the same dominant features and decorrelates the trees.
  2. Each tree draws a random sample of the training data (a bootstrap sample) when generating its splits. This introduces a further element of randomness that reduces the risk of individual trees overfitting: because no single tree sees all of the data, it is less likely to fit the noise in it.

This is to say that many trees, constructed in a certain “random” way, form a Random Forest (a simplified sketch follows the list below).

  • Each tree is created from a different sample of rows and at each node, a different sample of features is selected for splitting. 
  • Each of the trees makes its own individual prediction. 
  • These predictions are then averaged to produce a single result. 
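The following is a simplified, from-scratch sketch of that procedure, assuming scikit-learn's DecisionTreeRegressor as the base learner. It is meant to illustrate the two levels of randomness and the final averaging, not to replicate how production libraries implement the algorithm internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=50, max_features="sqrt", seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    n_samples = X.shape[0]
    for _ in range(n_trees):
        # Level 2 randomness: a bootstrap sample of rows (drawn with replacement)
        idx = rng.integers(0, n_samples, size=n_samples)
        # Level 1 randomness: each split considers only a random subset of
        # features, controlled here by max_features
        tree = DecisionTreeRegressor(max_features=max_features,
                                     random_state=int(rng.integers(1_000_000)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    # Each tree makes its own prediction; the forest averages them
    return np.mean([tree.predict(X) for tree in trees], axis=0)
```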

Points to note:

We can use Random Forest Regression when the data has a non-linear trend and extrapolation beyond the training data is not important.
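As an illustrative example (with synthetic data), a random forest can capture a non-linear, sine-like trend well within the range it was trained on:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 6, size=(300, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 300)   # non-linear trend plus noise

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(forest.predict([[1.5], [4.0]]))  # close to sin(1.5) and sin(4.0)
```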

We should not use Random Forest Regression when the data is a time series with a growing or declining trend, because a Random Forest Regressor cannot extrapolate that trend beyond the range of values it saw during training.
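A small, synthetic demonstration of this limitation: when a forest is trained on a steadily growing series, its predictions outside the observed range flatten out near the largest value it has seen, rather than continuing the trend.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

t = np.arange(100).reshape(-1, 1)   # "time" index 0..99
y = 2.0 * t.ravel()                 # a steadily growing trend

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(t, y)

print(forest.predict([[50]]))   # inside the training range: close to 100
print(forest.predict([[150]]))  # outside the range: stuck near y.max() (~198),
                                # far from the true continuation of 300
```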