Part 2: Tutorial - Build a Random Forest Model in Orange

Ashish Agarwal
Dec 31, 2020

This is Part 2 of the ‘What is Orange’ series. I strongly encourage you to read Part 1, ‘What is Orange’, before proceeding with this tutorial.

About the Tutorial

We will use the Kaggle Housing Prices dataset to predict house prices with a Random Forest model. Download the train.csv and test.csv files from the Kaggle site. Note that the main objective of this tutorial is to learn Orange and quickly churn out a basic model for prediction, so it does not use the best imputation or feature engineering techniques.

Before we delve further, it’s important to understand a few basic conventions of Orange. Fig 1 shows a File loading widget along with 2 Data Table widgets.

Fig 1: Widgets and Connections
  1. Every widget has an input side and an output side. E.g. in Fig 1, the input side of the Data Table (1) widget is connected to the output side of the File widget.
  2. Any widget can be placed on the canvas area either by double-clicking it or by dragging and dropping it.
  3. Two widgets can be connected by clicking the output side of one widget and dragging to the input side of the other.

Step 1: Load the train.csv data

Drag the File widget onto the canvas area (or double-click it), then double-click the widget to open its properties and select train.csv. That’s it: the data is loaded and you can see the features at the bottom. The variable we want to predict is SalePrice. Scroll down and set its role to target. Also, set the role of Id to meta, since it is just a tag for identification and plays no part in prediction.

You can see that 1460 records are loaded and 5.9% of the values are missing.

Fig 2: Load the train.csv file, select the target
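
For readers who like to see the same idea in code, here is a minimal Python sketch of what Step 1 amounts to: read the rows, set aside the target (SalePrice) and meta (Id) columns, and measure the share of missing feature cells. It is only an illustration, not how Orange works internally. It uses a tiny made-up sample instead of the real train.csv, assumes missing cells are encoded as an empty string or "NA", and the helper name `summarize` is hypothetical.

```python
import csv
import io

# Tiny inline stand-in for train.csv (made-up values, not real Kaggle rows).
SAMPLE = """Id,LotFrontage,Alley,SalePrice
1,65,NA,208500
2,,Pave,181500
3,68,NA,223500
"""

MISSING = {"", "NA"}  # assumption: how missing cells are encoded


def summarize(text, target="SalePrice", meta="Id"):
    """Return the rows, the feature names, and the share of missing feature cells."""
    rows = list(csv.DictReader(io.StringIO(text)))
    # Everything that is neither the target nor a meta column is a feature.
    features = [name for name in rows[0] if name not in (target, meta)]
    cells = [row[name] for row in rows for name in features]
    missing_share = sum(value in MISSING for value in cells) / len(cells)
    return rows, features, missing_share


rows, features, missing_share = summarize(SAMPLE)
print(len(rows), "records loaded")
print("features:", features)
print(f"{missing_share:.1%} of feature cells are missing")
```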

Step 2: Explore the loaded data

Drag 2 more widgets, Data Table and Feature Statistics, onto the canvas and connect them to the loaded data. Data Table shows the data you loaded, and Feature Statistics shows handy details on each column. We observe that a lot of data is missing.

Fig 3: Data Table and Feature Statistics widgets
Fig 4: Feature Statistics widget properties for loaded data

Step 3: Imputation

Since 5.9% of the data is missing, we will fill the gaps using a technique known in statistics as imputation. Drag the Impute widget and connect it to the data. In the properties of the Impute widget, select Average/Most frequent as the method: missing numeric values are replaced with the column average, and missing categorical values with the most frequent value.

Fig 5: Imputation
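
The Average/Most frequent rule is easy to picture in code. The sketch below is a plain-Python illustration of the rule, not Orange's implementation; the function name `impute_column`, the use of `None` for missing entries, and the sample values are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean


def impute_column(values, numeric):
    """Fill None entries with the mean (numeric) or the most frequent value (categorical)."""
    present = [v for v in values if v is not None]
    if numeric:
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Made-up columns, each with one gap.
lot_frontage = [65.0, None, 68.0, 60.0]
garage_type = ["Attchd", "Detchd", None, "Attchd"]

filled_numeric = impute_column(lot_frontage, numeric=True)
filled_categorical = impute_column(garage_type, numeric=False)
print(filled_numeric)
print(filled_categorical)
```

Averaging is a quick default; it keeps the column mean unchanged but shrinks its variance, which is one reason this tutorial treats it as a convenience rather than the best technique.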

To check that the missing values are gone, drag another Feature Statistics widget and connect it to Impute.

Fig 6: Verify the missing values are gone

Step 4: Build the model

Before we build a model, we need to set aside part of the training data for validation. For this, we add a new widget, Data Sampler, and connect it to Impute. For the sake of simplicity, we stick to the default values.

Fig 7: Data Sampler widget
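
Conceptually, the sampler shuffles the rows and cuts them into two parts: a data sample for training and the remaining data for validation. Here is a small sketch of that idea in Python. The 0.7 proportion and the seed are illustrative assumptions, not values read from the widget, and `sample_split` is a hypothetical helper name.

```python
import random


def sample_split(rows, proportion=0.7, seed=0):
    """Shuffle row indices and return (data sample, remaining data)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    cut = int(round(proportion * len(rows)))
    sample = [rows[i] for i in indices[:cut]]
    remaining = [rows[i] for i in indices[cut:]]
    return sample, remaining


rows = list(range(10))  # stand-in for 10 records
sample, remaining = sample_split(rows)
print(len(sample), "rows for training,", len(remaining), "rows for validation")
```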

Now, we drag the Random Forest widget onto the canvas and connect it to Data Sampler.

Fig 8: Random Forest widget
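
If you are curious what a random forest does behind the widget, the toy sketch below shows the core idea for regression: grow several small trees, each on a bootstrap sample of the rows, let each split look at only a random subset of the features, and average the trees' predictions. This is a simplified illustration under my own assumptions (tiny made-up data, depth-limited trees, hypothetical function names), not the code Orange runs.

```python
import random
from statistics import mean


def sum_sq(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(X, y, rng, depth=0, max_depth=3):
    """Grow a small regression tree; each split tries a random subset of features."""
    if depth >= max_depth or len(y) < 2 or len(set(y)) == 1:
        return mean(y)  # leaf: predict the average target
    n_features = len(X[0])
    candidates = rng.sample(range(n_features), max(1, n_features // 2))
    best = None
    for f in candidates:
        for threshold in sorted({row[f] for row in X}):
            left = [i for i, row in enumerate(X) if row[f] <= threshold]
            right = [i for i, row in enumerate(X) if row[f] > threshold]
            if not left or not right:
                continue
            error = sum_sq([y[i] for i in left]) + sum_sq([y[i] for i in right])
            if best is None or error < best[0]:
                best = (error, f, threshold, left, right)
    if best is None:
        return mean(y)  # no usable split among the candidate features
    _, f, threshold, left, right = best
    left_node = build_tree([X[i] for i in left], [y[i] for i in left], rng, depth + 1, max_depth)
    right_node = build_tree([X[i] for i in right], [y[i] for i in right], rng, depth + 1, max_depth)
    return (f, threshold, left_node, right_node)


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf's value."""
    while isinstance(node, tuple):
        f, threshold, left_node, right_node = node
        node = left_node if row[f] <= threshold else right_node
    return node


def fit_forest(X, y, n_trees=10, seed=0):
    """Train each tree on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean([predict_tree(tree, row) for tree in forest])


# Made-up data: [area, rooms] -> price.
X = [[50, 1], [60, 2], [80, 2], [100, 3], [120, 3], [150, 4]]
y = [100.0, 120.0, 160.0, 200.0, 240.0, 300.0]

forest = fit_forest(X, y)
prediction = predict_forest(forest, [110, 3])
print(f"predicted price: {prediction:.1f}")
```

Averaging many trees trained on slightly different data is what makes the forest less sensitive to noise than any single tree.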

Next, we add the Test and Score widget, which evaluates the performance of our learner (i.e. Random Forest) against the validation dataset created by Data Sampler. In the widget, the learner goes in as the model input, the data sample as the train data, and the remaining data as the test data (validation set).

Fig 9: Test and Score widget (1)

Open the properties of the Test and Score widget to see evaluation metrics such as RMSE and R-squared.

Fig 10: Test and Score widget (2)
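
To make the two metrics concrete: RMSE is the square root of the average squared prediction error (lower is better, in the same units as the price), and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares (closer to 1 is better). The sketch below computes both with the standard formulas on made-up numbers; the sample values and function names are assumptions for illustration only.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up validation prices and model predictions.
actual = [200.0, 150.0, 300.0, 250.0]
predicted = [210.0, 140.0, 290.0, 260.0]

print(f"RMSE: {rmse(actual, predicted):.3f}")
print(f"R-squared: {r_squared(actual, predicted):.3f}")
```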

At this stage, we have created a decent model for our training dataset.

Step 5: Prediction

Equipped with a model, let us now use the test data to predict the SalePrice.

Load the test.csv.

Fig 11: Load the test.csv

Drag the Predictions widget onto the canvas and connect both Random Forest and the test data to it as inputs.

Fig 12: Predictions widget

Add another widget, Select Columns, to keep only the Id and the predicted values, and use Data Table to view the results as below.

Fig 13: Final predicted Sale Price for houses

Conclusion

As can be seen, we created a Random Forest ML model from scratch without writing a single line of code. Once you become familiar with Orange, creating data visualisations or predictions for data mining projects does not take much time. This tutorial took only about 50 minutes from start to finish.

If you liked my article, consider giving it multiple thumbs-up and following me on LinkedIn. Share your experiences using Orange in the comments section.
