In this lesson we will compare three regression models: linear regression, K-nearest neighbors (KNN) and random forest. We will judge each model by its Mean Absolute Error, Mean Squared Error and Root Mean Squared Error.

The models we will be designing are for an income property management company that manages a six-unit apartment building. They want to make predictions as to how much the apartments will earn in the future.

The dataset is in the form of an Excel spreadsheet and it summarizes rents collected, necessary repairs, expenses, capital improvements and the net profit each month.

Day 1: Looking at the spreadsheet

Income property spreadsheet
As you can see by looking at the spreadsheet headings, there is one form of income (rents) and eleven types of expenditures.

There is also a column that totals the expenses for the month, and a last column that subtracts the total expenditures from the rents to show the net profit for each month.

Save the spreadsheet in your working folder. Call the file "Oceano3.xlsx".

We are going to use Python to construct these models.

We will work on the linear regression model first. We need to import the libraries, load the data from the spreadsheet and then print the first five rows (the head).

Open a new Python project, click the copy text button and paste the contents into the first frame.

Run the code in the first frame.

This code imports libraries from Python, reads in the spreadsheet file and prints the first five lines of it.

You will have to adjust the location of the file in the code to your folder.
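A minimal sketch of what this first frame might look like, assuming pandas is installed together with an Excel reader such as openpyxl. The path is an assumption, and the fallback rows use hypothetical column names so the sketch still runs when the file is not found.

```python
from pathlib import Path

import pandas as pd

# Assumed location: change this to the folder where you saved the file.
DATA_FILE = Path("Oceano3.xlsx")


def load_spreadsheet(path):
    """Read the first worksheet of the workbook into a DataFrame."""
    return pd.read_excel(path)


if DATA_FILE.exists():
    df = load_spreadsheet(DATA_FILE)
else:
    # Stand-in rows with hypothetical column names, used only when the
    # spreadsheet is not in the working folder.
    df = pd.DataFrame(
        {
            "Rents": [12500, 12500, 13000, 13000, 12000, 8500],
            "TotalExpenses": [900, 1000, 1200, 900, 4800, 1600],
        }
    )
    df["NetProfit"] = df["Rents"] - df["TotalExpenses"]

# head() returns the first five rows by default.
print(df.head())
```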

Click the + to add a new frame and paste in your code. Run the model.

This code divides the data into features and labels and then prints the first five rows of the features. Notice that the Net Profit column has been dropped.
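The feature and label split can be sketched as follows. The small DataFrame and its column names (including "NetProfit") are hypothetical stand-ins for the spreadsheet, so adjust them to match your own headings.

```python
import pandas as pd

# Stand-in for the spreadsheet; the column names are hypothetical.
df = pd.DataFrame(
    {
        "Rents": [12500, 12500, 13000, 13000, 12000, 8500],
        "Repairs": [200, 0, 350, 0, 3900, 700],
        "TotalExpenses": [900, 1000, 1200, 900, 4800, 1600],
    }
)
df["NetProfit"] = df["Rents"] - df["TotalExpenses"]

# Features: every column except the one we want to predict.
X = df.drop(columns=["NetProfit"])

# Label: the column we want to predict.
y = df["NetProfit"]

print(X.head())
```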

Now let's print the label: the Net Profit. Add a new frame to your project and key in the following: y.head(). Save your project and run it to see the first 5 net profit amounts.

Your screen should look like:

Click the + to add a new frame and paste in your code. Run the model.

This code divides our dataset into training and test sets. First we import the train_test_split function from sklearn's model_selection module.

"After a machine learning algorithm has been trained, it needs to be evaluated to see how well it performs on unseen data. Therefore, we divide the dataset into two sets, i.e., train set and test set. To split the data into training and test sets, you use the train_test_split() function from the Sklearn library."¹

The train set contains 80% of the items and the test set contains 20%. Setting random_state fixes the random seed, so the same rows are selected each time you run your code.

Your screen should look like:

Twelve months were randomly selected. That is 20% of 60 months contained in the file.
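To see where the twelve months come from, here is a small sketch of the split on 60 stand-in rows. The data values, the column names and the random_state value of 0 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 60 stand-in monthly rows; values and column names are hypothetical.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "Rents": rng.choice([8500, 10500, 11000, 12000, 12500, 13000], size=60),
        "TotalExpenses": rng.choice([900, 1000, 1200, 1600, 4800, 5500], size=60),
    }
)
y = X["Rents"] - X["TotalExpenses"]

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
# so the same rows are held out on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(len(X_train), len(X_test))
```

With 60 rows, the training set holds 48 months and the test set holds 12.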

Create a new frame and key in this code to see the net profit:
print(y_test).
Save and run your model. You should see the following.

Before the data is passed to the machine learning algorithms, we need to scale it. Looking at our dataset, you can see that some columns contain small values, while others contain very large values. If you do not scale the data, the columns with larger numbers will have a larger impact on the algorithm.

It is best to convert all values to a uniform scale. Sklearn has a class called StandardScaler that we will use.

Here is the code for this.
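A minimal sketch of the scaling step, using a few hypothetical feature rows. The scaler learns each column's mean and standard deviation from the training set and then applies the same transformation to the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature rows: [rents, repairs]. Note the different scales.
X_train = np.array(
    [[12500.0, 200.0], [13000.0, 0.0], [8500.0, 700.0], [12000.0, 3900.0]]
)
X_test = np.array([[12500.0, 350.0]])

scaler = StandardScaler()

# Learn the column means and standard deviations from the training set only.
X_train_scaled = scaler.fit_transform(X_train)

# Reuse those training statistics to transform the test set.
X_test_scaled = scaler.transform(X_test)
```

After scaling, each training column has a mean of 0 and a standard deviation of 1.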

Click the + to add a new frame and paste in your code. Run the model. There is no output from these lines of code.

Click the + to add a new frame and paste in your code. Run the model.

These lines import the linear regression model from sklearn, train the model and then make predictions.
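The training and prediction steps can be sketched as shown here. The data is a hypothetical stand-in in which net profit is exactly rents minus expenses, so the predictions match the actual values almost perfectly; scaling is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 60 stand-in monthly rows; values and column names are hypothetical.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "Rents": rng.choice([8500, 10500, 11000, 12000, 12500, 13000], size=60),
        "TotalExpenses": rng.choice([900, 1000, 1200, 1600, 4800, 5500], size=60),
    }
)
y = X["Rents"] - X["TotalExpenses"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train on the 48 training months, then predict the 12 held-out months.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Put actual and predicted values side by side for comparison.
comparison = pd.DataFrame(
    {"Actual": y_test.to_numpy(), "Predicted": y_pred}, index=y_test.index
)
print(comparison)
```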

Here is the output from these lines.

Here you can see the 12 months that were randomly selected for the test set out of the 60 months, with the actual amounts for those months next to the model's predictions. As you can see, the actual and predicted amounts are quite close.

Now we need to see how well our model performed on the unseen test data. There are three metrics designed to show how well our model performed. They are Mean Absolute Error, Mean Squared Error and Root Mean Squared Error. The lower the number, the better our model is.

Here is the code for calculating the error metrics.

Click the + to add a new frame and paste in your code. Run the model. The output from these lines of code appears below.

194.23: Mean Absolute Error
40891.79: Mean Squared Error
202.21: Root Mean Squared Error

The Mean Absolute Error (MAE) is the average size of the prediction errors: on average, the predicted net profit is $194.23 more or less than the actual net profit. The Mean Squared Error (MSE) is the average of the squared errors. It is never negative, and it is 0 only if the predictions are completely accurate. It incorporates the variance of the estimator (how widely spread the estimates are) and its bias (how different the estimated values are from their true values). Because the errors are squared, large errors count more heavily than small ones. The Root Mean Squared Error (RMSE) is the square root of the MSE, which puts the error back into the same units as the net profit (dollars).
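To make the three metrics concrete, here is a small worked sketch on three made-up months. The actual and predicted amounts are invented for illustration and are not taken from the spreadsheet.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual and predicted net profits for three months.
actual = np.array([11600.0, 11500.0, 7200.0])
predicted = np.array([11500.0, 11700.0, 7500.0])

# Errors are -100, 200 and 300 dollars.
errors = predicted - actual

mae = np.mean(np.abs(errors))  # (100 + 200 + 300) / 3
mse = np.mean(errors ** 2)     # (10000 + 40000 + 90000) / 3
rmse = np.sqrt(mse)            # square root of the MSE, back in dollars

# The same metrics computed with sklearn.
mae_sk = mean_absolute_error(actual, predicted)
mse_sk = mean_squared_error(actual, predicted)
rmse_sk = np.sqrt(mse_sk)

print(f"{mae:.2f}: Mean Absolute Error")       # 200.00
print(f"{mse:.2f}: Mean Squared Error")        # 46666.67
print(f"{rmse:.2f}: Root Mean Squared Error")  # 216.02
```

The RMSE is larger than the MAE here because squaring gives the 300 dollar miss more weight than the 100 dollar miss.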

Day 2: K-nearest neighbor

KNN stands for K-nearest neighbors. KNN is a lazy learning algorithm that bases its predictions on the Euclidean distance between data points. To predict a value, it averages the labels of the K closest training points. It does not assume any particular form for the relationship between the features and the label.

Open a new Python project, click the copy text button and paste the contents into the first frame.
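A tiny sketch of the idea, using made-up points and an assumed K of 2. It computes the prediction once by hand from Euclidean distances and once with sklearn's KNeighborsRegressor, whose default distance metric is Euclidean.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up training points with two features each, and their labels.
X_train = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [30.0, 40.0]])
y_train = np.array([0.0, 10.0, 20.0, 100.0])
query = np.array([[3.0, 5.0]])

# By hand: Euclidean distance from the query to every training point.
distances = np.linalg.norm(X_train - query, axis=1)

# Indices of the two closest points, then the average of their labels.
nearest = np.argsort(distances)[:2]
manual_prediction = y_train[nearest].mean()

# With sklearn: n_neighbors=2 is an assumed value for K.
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_train, y_train)  # "lazy": fitting only stores the training data
prediction = knn.predict(query)[0]

print(manual_prediction, prediction)
```

The two closest points are (3, 4) and (6, 8), so both approaches predict the average of their labels, 15.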

Run the code in the first frame.

This code imports the Python libraries, reads in the spreadsheet file and prints the first five lines of it. Remember this is the same dataset that we evaluated using the linear regression model.

You will have to adjust the location of the file in the code to your folder.

Your screen should look like the image above.

Click the + to add a new frame and paste in your code. Run the model. There is no output from these lines of code.

Click the + to add a new frame and paste in your code. Run the model. The output from these lines of code appears below.

Click the + to add a new frame and paste in your code. Run the model. The output from these lines of code appears below. It represents the randomly selected items and their associated projected net profits.

Click the + to add a new frame and paste in your code. Run the model. The output from these lines of code appears below.

These are the net profit predictions for the 12 randomly selected records.

Click the + to add a new frame and paste in your code. Run the model. The output from these lines of code appears below.

This shows the actual data and the predicted amounts of net income. Take a look at the difference between actual and predicted amounts. How close are they? How do they compare with the linear regression model?

Click the + to add a new frame and paste in your code. Save and run the model. The output from these lines of code appears below.

Mean Absolute Error should be 59.07
Mean Squared Error should be 6249.23
Root Mean Squared Error should be 79.05

Day 3: Graphing the data

Let's see how we can visualize some of the data from our spreadsheet. First we are going to create a bar graph showing rents and their frequency.

The code appears below.

Click the + to add a new frame and paste in your code. Save and run the model. The output from these lines of code appears below.

Click the + to add a new frame and paste in your code. Save and run the model. The output from these lines of code appears below.

How do we interpret this graph?

For five months the monthly rents totaled $8,500
For five months the monthly rents totaled $10,500
For five months the monthly rents totaled $11,000
For five months the monthly rents totaled $12,000
For twenty months the monthly rents totaled $12,500
For twenty months the monthly rents totaled $13,000
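A rough sketch of how a rents histogram like this could be drawn with matplotlib. The rent values are rebuilt from the counts listed above, and the number of bins and the axis labels are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Monthly rent totals rebuilt from the counts listed above.
rent_levels = [8500, 10500, 11000, 12000, 12500, 13000]
months_at_level = [5, 5, 5, 5, 20, 20]
rents = np.repeat(rent_levels, months_at_level)

fig, ax = plt.subplots()

# hist() returns the count in each bin and the bin edges.
counts, bin_edges, _ = ax.hist(rents, bins=10, edgecolor="black")
ax.set_xlabel("Monthly rents ($)")
ax.set_ylabel("Number of months")
ax.set_title("Monthly rents over 60 months")

# In a notebook frame the figure is drawn when the frame finishes running.
print(int(counts.sum()))
```

The bar heights add up to 60, one for each month in the spreadsheet.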

Now we are going to look at our expenses over a 60 month period and graph the results. The code appears in the box below.

Click the + to add a new frame and paste in your code. Save and run the model. The output from these lines of code appears below.

How do we interpret this graph of expenses?

For 18 months expenses were $900
For 10 months expenses were $1,000
For 11 months expenses were $1,200
For 5 months expenses were $4,800
For 5 months expenses were $5,500
For 5 months expenses were $1,600
For 6 months expenses were $1,783

If you total up the months, you will get 60 months.

The expenses shown in the graph total $109,598, which is very close to the Excel total of $109,593.
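The arithmetic behind these totals can be checked with a few lines of Python. The month counts and amounts are the ones read from the expenses graph above.

```python
# (number of months, monthly expense) pairs read from the expenses graph.
expense_groups = [
    (18, 900),
    (10, 1000),
    (11, 1200),
    (5, 4800),
    (5, 5500),
    (5, 1600),
    (6, 1783),
]

total_months = sum(months for months, _ in expense_groups)
graph_total = sum(months * amount for months, amount in expense_groups)

excel_total = 109_593
difference = graph_total - excel_total

print(total_months)  # 60
print(graph_total)   # 109598
print(difference)    # 5
```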

Now let's look at the net profit generated over the 60 months.

Click the + to add a new frame and paste in your code. Save and run the model. The output from these lines of code appears below.

Here is how you can interpret the net profit graph.

5 months at $4,000
5 months at $4,800
11 months at $8,800
5 months at $10,800
6 months at $11,500
10 months at $11,800
2 months at $11,950
15 months at $12,000

A histogram is a graph that shows the distribution of numerical data. It is a type of bar chart that shows the frequency or number of observations within different numerical ranges, called bins. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The histogram provides a visual representation of the distribution of the data, showing the number of observations that fall within each bin. This can be useful for identifying patterns and trends in the data, and for making comparisons between different datasets.²

You do not get exact values because data is grouped into categories.
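To see why the exact values are lost, here is a small sketch using NumPy's histogram function. The net profit values are the approximate readings listed earlier in this section, and the choice of four bins is an assumption.

```python
import numpy as np

# Approximate net profit readings and how many months fell at each one.
profit_levels = [4000, 4800, 8800, 10800, 11500, 11800, 11950, 12000]
months_at_level = [5, 5, 11, 5, 6, 10, 2, 15]
net_profit = np.repeat(profit_levels, months_at_level)

# Four equal-width bins between the smallest and largest value.
counts, bin_edges = np.histogram(net_profit, bins=4)

print(bin_edges)  # edges at 4000, 6000, 8000, 10000 and 12000
print(counts)     # months per bin: 10, 0, 11 and 38
```

Five different profit levels, from $10,800 to $12,000, all land in the last bin. The histogram shows only that 38 months fell somewhere in that range.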

Examples

Now we are going to make a pie chart of all of the expenses. Here is the code.

Click the + to add a new frame and paste in your code. Save and run the model. The output from these lines of code appears below.

The numbers in the np.array are the totals of these expenses from the Excel spreadsheet. When running the graphs, remember to start with the frame that loads in the libraries and the dataset.

If you want a legend for the pie chart, just delete the # in front of the commented-out legend code.
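A minimal sketch of such a pie chart with matplotlib. The expense labels and totals here are placeholders, not the real figures, so replace them with the totals from your own spreadsheet.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder labels and totals: replace with your spreadsheet's totals.
labels = ["Repairs", "Utilities", "Insurance", "Taxes"]
totals = np.array([30000, 20000, 10000, 40000])

fig, ax = plt.subplots()

# autopct prints each slice's share of the whole as a percentage.
pie_parts = ax.pie(totals, labels=labels, autopct="%1.1f%%")
wedges = pie_parts[0]
ax.set_title("Share of total expenses")

# Delete the # on the next line to add a legend beside the chart.
# ax.legend(wedges, labels, title="Expenses", loc="center left", bbox_to_anchor=(1, 0.5))
```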

Day 4: Random Forest

Now let's see how the Random Forest Regression algorithm works on our dataset and whether the model outperforms the other two.

Start a new Python project. Click on the copy text button and paste the contents into the first frame. Change the folder to match where you have your spreadsheet file. Save and run the model. Your output from this frame should look like the image below.

Click the + icon to add a new frame. Click on the copy text button and paste the contents into the frame. Your output should look like the image below.

Now let's see if we successfully isolated the NetProfit variable. Click on the + icon to add a new frame and key in the following line:

y.head()

The output is the first five months' net profit.

Click the + icon to add a new frame. Click on the copy text button and paste the contents into the frame. Your output should look like the image below.

Click the + icon to add a new frame. Click on the copy text button and paste the contents into the frame. Save and run. There is no output.

Now it is time to make some predictions.
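A minimal sketch of training a random forest and making predictions. The data is a hypothetical stand-in for the spreadsheet, and the n_estimators and random_state values are assumed settings rather than the ones used in the lesson.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# 60 stand-in monthly rows; values and column names are hypothetical.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "Rents": rng.choice([8500, 10500, 11000, 12000, 12500, 13000], size=60),
        "TotalExpenses": rng.choice([900, 1000, 1200, 1600, 4800, 5500], size=60),
    }
)
y = X["Rents"] - X["TotalExpenses"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A forest of 100 decision trees; each prediction is the average of the
# individual trees' predictions.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

print(y_pred)
```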

Click the + icon to add a new frame. Click on the copy button and paste the contents into the frame. Save and run. The predictions from the random forest model are shown below.

We can also print out the model's actual and predicted values. Let's do that in the next frame.

Click the + icon to add a new frame. Click on the copy text button and paste the contents into the frame. Save and run. The actual amounts and the predictions from the random forest model are shown below.

Compare the actual numbers with the predictions. How close are the numbers?

Now let's see how well our model performed.

Click the + icon to add a new frame. Click on the copy text button and paste the contents into the frame. Save and run. The errors are shown below.

Mean Absolute Error 60.59
Mean Squared Error 5059.46
Root Mean Squared Error 71.12

If you compare these numbers with those from the KNN and linear regression models, you will find that the random forest model has the lowest Mean Squared Error and Root Mean Squared Error, while KNN has a slightly lower Mean Absolute Error (59.07 versus 60.59). Overall, the random forest model is probably the best one to use for our net profit projections.

The choice of algorithm depends on your dataset. It is probably a good idea to try a number of algorithms, then compare them and see which one yields the best result. If you are on a tight timeline, it might be best to just use Random Forest.
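One way to compare several algorithms side by side is sketched here, assuming the same train and test split is used for every model. The data is a noisy hypothetical stand-in and the model settings are assumptions, so the numbers it prints will not match the ones in this lesson.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# 60 stand-in monthly rows with some noise; values and names are hypothetical.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "Rents": rng.choice([8500, 10500, 11000, 12000, 12500, 13000], size=60),
        "TotalExpenses": rng.choice([900, 1000, 1200, 1600, 4800, 5500], size=60),
    }
)
y = X["Rents"] - X["TotalExpenses"] + rng.normal(0, 100, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale with statistics learned from the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Assumed settings for each model.
models = {
    "Linear regression": LinearRegression(),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    results[name] = {
        "MAE": mean_absolute_error(y_test, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
    }

# One row per model, one column per metric; lower is better.
print(pd.DataFrame(results).T.round(2))
```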