Data Sets with Random Forest

Frame 8

Click the plus sign to create a new frame and type in the following line:

    from sklearn.model_selection import train_test_split

This imports the train_test_split function from the model_selection module of sklearn (scikit-learn).

Sklearn contains tools for machine learning and statistical modeling.

Frame 9

Click the plus sign to create a new frame and type in the following line:

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

This line divides the data into training and test sets. Eighty percent is used for training, and the remaining 20 percent makes up the test set, which is used for testing the algorithm.
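As a rough illustration of the 80/20 split, here is a minimal sketch. The data is a made-up stand-in (110 rows, four feature columns), not the tutorial's grades file, so only the sizes of the resulting sets are meaningful.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 110 students, 4 feature columns,
# and a pass (1) / fail (0) label. This is NOT the tutorial's data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(110, 4))
y = (X[:, 2] > 2).astype(int)

# Same call as in Frame 9: 20 percent of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(len(X_train), len(X_test))  # 88 22
```

With 110 rows, 22 land in the test set, which matches the 22 records examined later in the confusion matrix.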

Frame 10



  1. Copy the code that creates a variable clf and fits it to our training data.
  2. Click on the Plus sign on the menu bar (insert cell below) to add a new frame to the project.
  3. Click in that frame.
  4. Press CTRL V to paste text into Python.
  5. Click on file and save it as gradesExcel.ipynb
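The code for this frame might look something like the sketch below. It assumes clf is a scikit-learn RandomForestClassifier; the number of trees and the stand-in data are assumptions made for illustration, not values taken from the tutorial.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data (NOT the tutorial's grades file).
rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(110, 4))
y = (X[:, 2] > 2).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Create the classifier and fit it to the training data.
# n_estimators=100 is an assumed setting, not one from the tutorial.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
```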

Now it is time to make some predictions from the test data.

Frame 11



  1. Copy code that will make predictions about which students should pass or fail.

  2. Click on the Plus sign on the menu bar (insert cell below) to add a new frame to the project.
  3. Click in that frame.
  4. Press CTRL V to paste text into Python.
  5. Click on file and save it as gradesExcel.ipynb
  6. Now go to frame one, click in it, and run it.
  7. Each click advances one frame. Continue running each frame until you get prediction output that looks like the image below.
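A minimal sketch of the prediction step, assuming a fitted RandomForestClassifier named clf as in the previous frame. The variable name y_pred and the stand-in data are illustrative assumptions, so the printed values will not match the tutorial's image.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data (NOT the tutorial's grades file).
rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(110, 4))
y = (X[:, 2] > 2).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Predict pass (1) or fail (0) for every student in the test set.
y_pred = clf.predict(X_test)
print(y_pred)
```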



Let's look at our results.

If you do not specify the random state in the code, then every time you execute your program a new random value is generated, and the training and test data sets will contain different records.
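The effect of random_state can be checked with a small sketch like the one below. The array of record numbers is a made-up stand-in used only to show which rows end up in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: 110 record numbers and dummy labels.
X = np.arange(110).reshape(-1, 1)
y = np.zeros(110, dtype=int)

# Two splits with the same random_state pick the same test records.
_, first_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
_, second_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
same_with_seed = np.array_equal(first_test, second_test)
print(same_with_seed)  # True

# Without random_state the test records usually differ from run to run.
_, unseeded_test, _, _ = train_test_split(X, y, test_size=0.2)
print(np.array_equal(first_test, unseeded_test))
```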

Confusion Matrix

To evaluate the performance of a random forest classifier, we will use a confusion matrix.

A confusion matrix is a table that lets you visualize the performance of a classification machine learning model by comparing its predictions with the actual values. With this visualization, you can get a better idea of how your model is performing.

A matrix, in mathematics, is a rectangular array of quantities or expressions set out by rows and columns.

A confusion matrix shows true positives, true negatives, false positives and false negatives.

In our model, a true positive is a student who is passing in the actual data and also passing in the prediction.

A true negative, in our example, is one that shows a student failing in the actual data and also failing in the prediction.

A false positive is one in which the actual data showed the student failing and the prediction had them passing.

In our model, a false negative means that the actual data shows a student passing and the prediction shows them failing.


Here is the confusion matrix for our Random Forest Classifier model.

  1. There are four True Negatives.
  2. There are 15 True Positives.
  3. There are two False Positives.
  4. There is one False Negative.

The table below shows how these numbers were determined.

Record #   Prediction   Actual   Type
84         1            1        TP
10         1            1        TP
75         1            1        TP
2          0            0        TN
24         1            0        FP
100        1            1        TP
107        0            0        TN
7          1            1        TP
16         1            1        TP
86         0            0        TN
68         0            0        TN
22         1            0        FP
45         1            1        TP
60         1            1        TP
76         1            1        TP
52         1            1        TP
13         1            1        TP
73         1            1        TP
85         1            1        TP
54         1            1        TP
103        1            1        TP
8          0            1        FN

Here is the code that produces the confusion matrix.

Frame 12
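One way the counts could be reproduced is sketched below. It assumes scikit-learn's confusion_matrix function and takes the Prediction and Actual values straight from the table of 22 test records; the variable names are illustrative, not the tutorial's own.

```python
from sklearn.metrics import confusion_matrix

# (record number, prediction, actual) for the 22 test records in the table.
rows = [
    (84, 1, 1), (10, 1, 1), (75, 1, 1), (2, 0, 0), (24, 1, 0), (100, 1, 1),
    (107, 0, 0), (7, 1, 1), (16, 1, 1), (86, 0, 0), (68, 0, 0), (22, 1, 0),
    (45, 1, 1), (60, 1, 1), (76, 1, 1), (52, 1, 1), (13, 1, 1), (73, 1, 1),
    (85, 1, 1), (54, 1, 1), (103, 1, 1), (8, 0, 1),
]
y_pred = [prediction for _, prediction, _ in rows]
y_true = [actual for _, _, actual in rows]

# Rows of the matrix are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)  # 4 2 1 15
```

In the notebook itself, the same call would normally receive the actual test labels and the model's predictions (for example y_test and the output of clf.predict(X_test)) instead of hand-typed lists.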



The accuracy of the model is determined by adding the True Positives (15) and True Negatives (4) and dividing by the total number of randomly selected items (22), which gives an accuracy of 19/22, or about 0.86.

Here is the code to find the accuracy of the model.

Frame 13


  1. Copy code that will determine the accuracy of the model
  2. Click on the Plus sign on the menu bar (insert cell below) to add a new frame to the project.
  3. Click in that frame.
  4. Press CTRL V to paste text into python.
  5. Click on file and save it as gradesExcel.ipynb
  6. Now go to frame one, click in it, and run it.
  7. Each click advances one frame. Continue running each frame until you get the accuracy.
  8. You can also get a single predicted result by entering a student's scores in the prediction = clf.predict([[0,0,0,0]]) line of code.
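The accuracy calculation can be sketched as follows, using the four counts listed for the confusion matrix (15 True Positives, 4 True Negatives, 2 False Positives, 1 False Negative). The use of scikit-learn's accuracy_score is an assumption about how the frame computes it; the label lists are rebuilt from the counts purely for illustration.

```python
from sklearn.metrics import accuracy_score

# Counts taken from the confusion matrix list.
tp, tn, fp, fn = 15, 4, 2, 1

# Accuracy = correct predictions / all predictions.
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy, 3))  # 0.864

# The same figure from accuracy_score, using label lists
# rebuilt from the counts (TP, TN, FP, FN in that order).
y_true = [1] * tp + [0] * tn + [0] * fp + [1] * fn
y_pred = [1] * tp + [0] * tn + [1] * fp + [0] * fn
print(round(accuracy_score(y_true, y_pred), 3))  # 0.864
```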

Next we need to see which of the independent variables (the features) contribute the most to grade determination. These are called the features of importance.

The variables responsible for determining the students' grades are attendance, citizenship, tests and homework. Which of these is the most important?

Here is the code needed to make the prediction.

The first line of this code allows the user to enter one student's grades to see if they should pass or fail.

Frame 14


  1. Copy code that will determine the features of importance
  2. Click on the Plus sign on the menu bar (insert cell below) to add a new frame to the project.
  3. Click in that frame.
  4. Press CTRL V to paste text into Python.
  5. Click on file and save it as gradesExcel.ipynb
  6. Now go to frame one, click in it, and run it.
  7. Each click advances one frame. Continue running each frame until you get feature importances output that looks like the list below.
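A minimal sketch of how the features of importance could be listed, assuming clf is a fitted RandomForestClassifier and reading its feature_importances_ attribute. The feature order follows the array numbers used in this tutorial; the stand-in data is made up, so the scores will differ from the tutorial's.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Column order follows the tutorial's array numbers 0 to 3.
feature_names = ["Attendance", "Citizenship", "Tests", "Homework"]

# Hypothetical stand-in data (NOT the tutorial's grades file).
rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(110, 4))
y = (X[:, 2] > 2).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Pair each feature with its importance and rank from most to least important.
ranked = sorted(
    zip(feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.6f}")
```

The importances of a fitted forest sum to 1, so each score can be read as that feature's share of the total.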

Your results should be similar to these.

Most important criteria for DETERMINING PASS OR FAIL


As you can see, they are ranked from most important to least important.

Now let's analyze what this model has predicted here.

Remember the array numbers for the independent variables.

Array #   Variable      Importance
0         Attendance    0.117716
1         Citizenship   0.405681
2         Tests         0.464878
3         Homework      0.011725

According to our results, the most important feature influencing grades awarded is tests.

The least important variable is homework.

The model also provides the capability of entering just one set of independent variables and when the model is run, it will predict if those values mean pass or fail.

The line is:

    prediction = clf.predict([[4.0,3.0,1,0]])

These numbers should produce a passing grade of 1.
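The single prediction can be sketched as below. The model here is trained on made-up stand-in data, so the printed value will not necessarily be the passing grade of 1 that the tutorial's own data produces; only the shape of the call is the point.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data (NOT the tutorial's grades file).
rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(110, 4))
y = (X[:, 2] > 2).astype(int)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# One student: attendance, citizenship, tests, homework (array numbers 0 to 3).
# Note the double brackets: predict expects a list of rows, even for one row.
prediction = clf.predict([[4.0, 3.0, 1, 0]])
print(prediction)  # an array holding a single 0 (fail) or 1 (pass)
```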

Frame 15

  1. Copy code that will determine the features of importance to make a graph
  2. Click on the Plus sign on the menu bar (insert cell below) to add a new frame to the project.
  3. Click in that frame.
  4. Press CTRL V to paste text into Python.
  5. Click on file and save it as gradesExcel.ipynb
  6. Now go to frame one, click in it, and run it.
  7. Each click advances one frame. Continue running each frame until you get the graph of the features of importance.

After our model has determined the most important variables influencing the grade awarded, a visualization of that information would be helpful.

The bar graph below does just that.

  1. The red bar shows the most important feature: tests.
  2. The green bar shows how citizenship affects our grade.
  3. The yellow bar shows the influence of attendance on the grade.
  4. The blue bar shows the influence of homework.
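A bar graph like the one described could be drawn with a sketch such as this. It assumes matplotlib is used for plotting, and it takes the importance values and bar colors from this tutorial's results rather than from a freshly trained model; the axis label and title are illustrative.

```python
import matplotlib.pyplot as plt

# Importance values from the tutorial's results, most important first.
features = ["Tests", "Citizenship", "Attendance", "Homework"]
importances = [0.464878, 0.405681, 0.117716, 0.011725]
colors = ["red", "green", "yellow", "blue"]

# One bar per feature, colored as described in the list above.
fig, ax = plt.subplots()
bars = ax.bar(features, importances, color=colors)
ax.set_ylabel("Importance")
ax.set_title("Features of importance")
```

In a notebook the figure appears when the cell is run. In a plain Python script, add plt.show() as the last line.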