Part 2: Tutorial - Build a Random Forest Model in Orange
This is Part 2 of the ‘What is Orange’ series. It is highly recommended to read Part 1, ‘What is Orange’, before proceeding with this tutorial.
About the Tutorial
We will be using the Kaggle Housing Prices dataset to predict house prices with a Random Forest model. Download the train.csv and test.csv files from the Kaggle site. Note that the main objective of this tutorial is to learn Orange and quickly churn out a basic model for prediction, so it does not necessarily use the best imputation or feature-engineering techniques.
Before we delve further, it’s important to understand a few basic conventions of Orange. Fig 1 shows a File widget along with two Data Table widgets.
- Every widget has an input side and an output side. For example, in Fig 1 the input side of the Data Table (1) widget is connected to the output side of the File widget
- A widget can be placed on the canvas either by double-clicking it or by dragging and dropping it
- Two widgets can be connected by simply clicking the output side of one and the input side of the other
Step 1: Load the train.csv data
Drag (or double-click) the File widget onto the canvas, then double-click it to open its properties and select train.csv. That’s it: the data is loaded and you can see the features at the bottom. The variable we want to predict is SalePrice, so scroll down and set its role to target. Also set Id to meta, since it is just an identification tag and plays no role in prediction.
You can see that 1460 records are loaded and 5.9% of the values are missing.
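Orange does all of this through the GUI, but the roles it assigns can be sketched in plain Python. The sketch below is a hypothetical stand-in, not part of the Orange workflow: the three inline rows are made up, and only a few of the real dataset’s 81 columns are shown.

```python
import csv
import io

# A tiny made-up sample standing in for the Kaggle train.csv
# (the real file has 1460 rows and many more columns).
sample = """Id,LotArea,Neighborhood,SalePrice
1,8450,CollgCr,208500
2,9600,Veenker,181500
3,11250,CollgCr,223500
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Mirror the roles set in the File widget: SalePrice is the target,
# Id is meta, and everything else is a feature.
target = "SalePrice"
meta = {"Id"}
features = [c for c in rows[0] if c != target and c not in meta]

print(features)         # ['LotArea', 'Neighborhood']
print(rows[0][target])  # '208500'
```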
Step 2: Explore the loaded data
Drag two more widgets, Data Table and Feature Statistics, and connect them to the loaded data. Data Table shows the data you loaded, and Feature Statistics shows handy details on each column. We observe that a lot of data is missing.
Step 3: Imputation
Since 5.9% of the data is missing, we will do what statisticians call imputation. Drag the Impute widget and connect it to the data. We impute by taking the column average when the missing value is numeric and the most frequent value when it is categorical. In the Impute widget’s properties, select Average/Most frequent as the method.
To check that the missing values are gone, drag another Feature Statistics widget and connect it to Impute.
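The Average/Most frequent strategy can be sketched in a few lines of Python. This is a minimal illustration of the idea, not Orange’s implementation; the rows and columns are made up.

```python
from statistics import mean, mode

# Made-up rows with missing values (None), standing in for train.csv.
rows = [
    {"LotArea": 8450, "Neighborhood": "CollgCr"},
    {"LotArea": None, "Neighborhood": "CollgCr"},
    {"LotArea": 9600, "Neighborhood": None},
    {"LotArea": 11250, "Neighborhood": "Veenker"},
]

def impute(rows, column, numeric):
    """Fill missing values with the column average (numeric)
    or the most frequent value (categorical)."""
    present = [r[column] for r in rows if r[column] is not None]
    fill = mean(present) if numeric else mode(present)
    for r in rows:
        if r[column] is None:
            r[column] = fill

impute(rows, "LotArea", numeric=True)
impute(rows, "Neighborhood", numeric=False)

print(rows[1]["LotArea"])       # the average of 8450, 9600 and 11250
print(rows[2]["Neighborhood"])  # 'CollgCr', the most frequent value
```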
Step 4: Build the model
Before we build a model, we need to split the training data so that part of it can be held out for validation. For this we add a new widget, Data Sampler, and connect it to Impute. For the sake of simplicity, we stick to the default values.
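What the Data Sampler does can be sketched as a shuffle-and-cut in plain Python. A 70% sampling proportion is assumed here for illustration; check the widget’s properties for the actual default.

```python
import random

random.seed(0)  # reproducible shuffle for the example

# Stand-in records; the real workflow feeds in the imputed train.csv rows.
records = list(range(100))

# Sample a fixed proportion, as the Data Sampler widget does
# (70% assumed here).
random.shuffle(records)
cut = int(len(records) * 0.7)
data_sample, remaining_data = records[:cut], records[cut:]

print(len(data_sample), len(remaining_data))  # 70 30
```

The “data sample” output becomes the training data and the “remaining data” output becomes the validation set in the next step.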
Now, we drag the Random Forest model and connect it to Data Sampler.
Next, we add the Test & Score widget, which evaluates our learner’s (i.e. Random Forest’s) performance against the validation dataset we created with the Data Sampler. In this widget, the learner goes in as the model input, the data sample as the training data, and the remaining data as the test/validation set.
Open the properties of the Test and Score widget to see evaluation metrics such as RMSE and R-squared.
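For readers curious about what happens behind the widgets, here is a scikit-learn equivalent of the train/validate/score pipeline. The features and prices are synthetic stand-ins, and the exact metric values will differ from what Test & Score reports on the real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the imputed housing features and SalePrice.
X = rng.uniform(0, 1, size=(500, 5))
y = 100_000 + 50_000 * X[:, 0] + 20_000 * X[:, 1] + rng.normal(0, 5_000, 500)

# 70/30 split, mirroring the Data Sampler outputs.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# The same metrics Test & Score displays.
pred = model.predict(X_val)
rmse = mean_squared_error(y_val, pred) ** 0.5
r2 = r2_score(y_val, pred)
print(f"RMSE: {rmse:.0f}, R2: {r2:.3f}")
```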
At this stage, we have created a decent model for our training dataset.
Step 5: Prediction
Equipped with a model, let us now use the test data to predict the SalePrice.
Load test.csv using another File widget.
Drag the Predictions widget and provide the Random Forest model and the test data as its inputs.
Add a Select Columns widget to keep only the predictions against Id, and use a Data Table to view the results, as shown below.
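The prediction step above can be sketched with scikit-learn as well. Everything here is synthetic except the Id values, which follow the Kaggle convention of continuing after the 1460 training rows; treat the whole block as an illustration, not the Orange workflow itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic stand-ins for the imputed train data and the test.csv features.
X_train = rng.uniform(0, 1, size=(200, 4))
y_train = 150_000 + 40_000 * X_train[:, 0]
X_test = rng.uniform(0, 1, size=(5, 4))
test_ids = np.arange(1461, 1466)  # Kaggle test Ids start after the train rows

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Select Columns: keep only Id and the predicted SalePrice.
results = list(zip(test_ids.tolist(), model.predict(X_test).tolist()))
for row_id, price in results:
    print(row_id, round(price))
```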
As you can see, we created a random forest ML model from scratch without writing a single line of code. Once you become familiar with the Orange tool, creating data visualisations or predictions for data mining projects does not take much time; this tutorial took only about 50 minutes from ground zero.
If you liked my article, consider giving it multiple thumbs-up and following me on LinkedIn. Share your experiences using Orange in the comments section.