Predictive analysis is the field of data science that involves making predictions about future events.

We have been given the task of predicting a future payroll after giving all of our managers a raise in pay.

We have also been asked to see the impact of hiring additional employees for the upcoming holiday season.

In this lesson you will see how machine learning can help us make intelligent accounting decisions. Our dataset is the past month's payroll for Bromley's Mens' Wear.

We first need to build the dataset and create a csv (comma separated value) file.

In real life our dataset would be much larger, perhaps a year or more worth of payroll records.

We will create a linear regression model to analyze the payroll data.

A linear regression model is a model that presumes a linear relationship between inputs and outputs.

An example of a linear relationship is hours studied and grades achieved: the more hours a student studies, the better their grade.

We hypothesize that there is a linear relationship between hours worked, pay rate and commission with gross pay.

If we increase or decrease any of these independent variables, the gross pay for an individual employee will change.
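As a quick check of this hypothesis, the first record of the payroll data used in this lesson (Bill James, 40 hours at $17.25 plus $50.00 commission, gross pay 740.00) can be recomputed by hand. This is a minimal sketch that assumes gross pay is simply hours times rate plus commission; the function name gross_pay is ours and is not part of the lesson's code.

```python
def gross_pay(hours_worked, pay_rate, commission):
    """Gross pay for one week: hourly earnings plus commission."""
    return hours_worked * pay_rate + commission

# Bill James, first week of the payroll data
base = gross_pay(40.00, 17.25, 50.00)

# Same employee with the pay rate raised by one dollar
raised = gross_pay(40.00, 18.25, 50.00)

print(base)           # 740.0
print(raised - base)  # 40.0
```

Raising the pay rate by one dollar changes gross pay by the number of hours worked, which is the kind of dependence the model is expected to pick up.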

We are contemplating adding more employees for the holiday season.

We believe that our managers deserve a raise in pay.

We pay a minimum wage of $16.00 and commissions to our sales persons. Dawn Dunbar is our CEO. Hal Teague is the CFO. Dave Couch heads up our IT department and Paul Parsons is the head of HR.

Day 1 & 2: Prepare the data

First Name	Last Name	Hours worked	Pay Rate	Comm	Gross Pay
Bill	James	40.00	17.25	50.00	740.00
Jim	Kingman	38.00	18.25	25.00	718.50
Aaron	Clark	39.20	17.25	50.00	740.00
Dawn	Dunbar	40.00	16.50	0.00	660.00
Julie	Velasquez	40.00	18.75	100.00	850.00
Amiee	Lin	37.00	16.35	45.00	649.95
Ralph	Duncan	40.00	19.00	65.00	825.00
Shelley	White	39.70	19.25	45.00	813.20
Hal	Teague	38.00	16.00	0.00	608.00
Sam	Moser	39.00	17.00	35.00	698.00
Ruth	Baylor	40.00	18.00	68.50	788.50
Dave	Couch	35.00	17.00	0.00	595.00
Jerry	Meister	36.20	18.35	88.00	752.27
Lawrence	Thomas	40.00	19.00	100.00	860.00
Jane	Clyde	40.00	18.35	45.00	779.00
Dick	Weatherby	39.00	19.00	57.95	798.95
Steve	Lewis	40.00	20.00	150.00	950.00
Paco	Hinson	40.00	16.88	34.00	709.20
Blake	Herman	38.00	22.50	75.00	930.00
Paul	Parsons	40.00	17.00	0.00	680.00
Charles	Tilden	40.00	17.25	10.00	700.00
Bill	James	39.60	17.25	66.00	749.10
Jim	Kingman	36.00	18.25	36.00	693.00
Aaron	Clark	39.20	22.50	68.00	950.00
Dawn	Dunbar	39.00	16.50	0.00	687.50
Julie	Velasquez	37.00	18.75	110.00	803.75
Amiee	Lin	35.00	16.35	56.00	628.25
Ralph	Duncan	39.00	19.00	68.00	809.00
Shelley	White	38.70	19.35	44.00	792.85
Hal	Teague	37.50	16.00	0.00	600.00
Sam	Moser	38.00	17.00	44.00	690.00
Ruth	Baylor	39.50	18.00	99.00	810.00
Dave	Couch	36.00	17.00	0.00	634.50
Jerry	Meister	37.00	18.35	98.00	776.95
Lawrence	Thomas	39.00	19.00	110.00	851.00
Jane	Clyde	40.00	18.35	47.00	781.00
Dick	Weatherby	38.00	19.00	57.95	779.95
Steve	Lewis	35.90	20.00	166.00	884.00
Paco	Hinson	40.00	16.88	44.00	719.20
Blake	Herman	37.00	22.50	68.00	900.50
Paul	Parsons	38.00	17.00	0.00	679.00
Charles	Tilden	39.00	17.25	24.00	696.75
Bill	James	40.00	17.25	55.00	745.00
Jim	Kingman	39.00	18.25	25.00	736.75
Aaron	Clark	40.00	22.50	58.00	958.00
Dawn	Dunbar	40.00	16.50	0.00	693.00
Julie	Velasquez	40.00	18.75	150.00	900.00
Amiee	Lin	37.00	16.35	55.00	659.95
Ralph	Duncan	39.50	19.00	65.00	815.50
Shelley	White	39.70	19.35	44.00	812.20
Hal	Teague	38.00	16.00	0.00	608.00
Sam	Moser	39.00	17.00	35.00	698.00
Ruth	Baylor	40.00	18.00	77.00	797.00
Dave	Couch	35.00	17.00	0.00	639.00
Jerry	Meister	40.00	18.35	88.00	822.00
Lawrence	Thomas	40.00	19.00	98.00	858.00
Jane	Clyde	40.00	18.35	75.00	809.00
Dick	Weatherby	39.00	19.00	67.00	808.00
Steve	Lewis	40.00	20.00	125.00	925.00
Paco	Hinson	40.00	16.88	44.00	719.20
Blake	Herman	40.00	22.50	85.00	985.00
Paul	Parsons	40.00	17.00	0.00	692.00
Charles	Tilden	40.00	17.25	36.00	726.00
Bill	James	40.00	17.25	50.00	740.00
Jim	Kingman	38.00	18.25	25.00	718.50
Aaron	Clark	39.20	22.50	65.00	947.00
Dawn	Dunbar	40.00	16.50	0.00	660.00
Julie	Velasquez	40.00	18.75	100.00	850.00
Amiee	Lin	37.00	16.35	45.00	649.95
Ralph	Duncan	40.00	19.00	65.00	825.00
Shelley	White	39.70	19.25	45.00	813.20
Hal	Teague	38.00	16.00	0.00	608.00
Sam	Moser	39.00	17.00	35.00	698.00
Ruth	Baylor	40.00	18.00	68.50	788.50
Dave	Couch	35.00	17.00	0.00	595.00
Jerry	Meister	36.20	18.35	88.00	752.27
Lawrence	Thomas	40.00	19.00	100.00	860.00
Jane	Clyde	40.00	18.35	45.00	779.00
Dick	Weatherby	39.00	19.00	57.95	798.95
Steve	Lewis	40.00	20.00	150.00	950.00
Paco	Hinson	40.00	16.88	34.00	709.20
Blake	Herman	38.00	22.50	75.00	930.00
Paul	Parsons	40.00	17.00	0.00	680.00
Charles	Tilden	40.00	17.25	10.00	700.00

Open a text editor like Notepad.
Key in the heading for the data set: First Name,Last Name,hrsWorked,payRate,comm,grossPay
Each horizontal row of the table is a record in the data set. It represents a single employee's hours worked, pay rate, commissions and gross pay for one week.
Key in each line, separating each piece of data with a comma: Bill,James,40.00,17.25,50.00,740.00
Press the return key at the end of each row.
When finished, save the file in a folder of your choosing. Use "payroll.csv" as the file name.
Write down the location of the file, as you will need to know where it is when loading it into your Python model.
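If you would rather not key in every line by hand, the same file can be produced with Python's built-in csv module. The sketch below is only an illustration: it writes three of the records, and it saves into a temporary folder (an assumption on our part) where you would use the folder you chose.

```python
import csv
import tempfile
from pathlib import Path

header = ["First Name", "Last Name", "hrsWorked", "payRate", "comm", "grossPay"]

# Three records from the table; the real file would list every row
rows = [
    ["Bill", "James", "40.00", "17.25", "50.00", "740.00"],
    ["Jim", "Kingman", "38.00", "18.25", "25.00", "718.50"],
    ["Dawn", "Dunbar", "40.00", "16.50", "0.00", "660.00"],
]

# A temporary folder is used here; substitute the folder you chose
csv_path = Path(tempfile.mkdtemp()) / "payroll.csv"

with csv_path.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)   # heading line
    writer.writerows(rows)    # one line per record

print(csv_path)  # this is the location to write down

first_data_line = csv_path.read_text().splitlines()[1]
print(first_data_line)
```

The second print shows the first record exactly as it would be typed in the text editor.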

Day 3: Creating the Python code to decide which type of algorithm to use

We need to determine what type of model we should construct.

We have assumed that a linear regression model is the right fit.

It is important to know how the values on the X axis relate to the values on the Y axis. We need to see if there is any relationship.

If there is no relationship, we cannot use linear regression as a model.

We believe that there is a correlation between pay rate and gross pay.

Let's find out.

r = the coefficient of correlation.

r values range from -1 to 1. 0 means no correlation; 1 and -1 mean perfect correlation.
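To make the definition of r concrete, here is a minimal sketch that computes it with plain Python. The helper name pearson_r is ours, and only eight pay rate and gross pay pairs from the lesson's payroll data are used, so the number it prints will not match the value reported for the first 20 records.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Coefficient of correlation between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(var_x * var_y)

# Eight records from the payroll data (X axis: pay rate, Y axis: gross pay)
pay_rate = [17.25, 18.25, 22.50, 16.50, 18.75, 16.35, 19.00, 19.35]
gross = [740.00, 718.50, 947.00, 660.00, 850.00, 649.95, 825.00, 813.20]

r = pearson_r(pay_rate, gross)
print(r)

# Flipping the sign of one variable flips the sign of r
print(pearson_r(pay_rate, [-g for g in gross]))
```

A value near 1 means gross pay rises with pay rate; the second print shows the same strength of relationship with the opposite sign.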

We have listed the first 20 rates of pay from our dataset on the X axis.

The first 20 amounts of gross pay have been put on the Y axis.

Start a new Python project and paste the contents into the first frame.

Run the code in the first frame.

The r value is 0.9025005300016853, indicating a strong positive correlation between rate of pay and gross pay; therefore we can use linear regression as our model to make predictions.

We entered 16.35 into our function to predict what gross pay should be with this payrate.

Our model predicted that a pay rate of 16.35 should produce a gross pay of 663.2600816333731. Try entering other pay rates in our model.

One final test: We should graph this data to see if it will produce a straight line, indicating a strong linear relationship.

Day 4: Visualizing the data in our model

Open a new Python project and paste the contents into the first frame.

Run the code in the first frame.

In the first lines of code we imported matplotlib and scipy.

Next we created an array of the numbers for pay rate from our dataset.

We created another array for the y axis consisting of gross pay.

Next, we executed a method that returns some key values of linear regression.

Next we created a function that uses the slope and intercept values to calculate where on the y axis each corresponding x value is to be placed.

Then we drew the scatterplot.

The last line draws the linear regression line.

Our graph shows a strong relationship between gross pay and rate of pay. We can use a linear regression model to make predictions.
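The slope and intercept step described above can be sketched without any libraries. The lesson's own code imports scipy and matplotlib; the version below is a stand-in that uses the ordinary least squares formulas directly and skips the plotting. The names fit_line and predict_gross are ours, and only eight records from the payroll data are used, so the printed prediction will differ from the one quoted earlier.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Eight records from the payroll data (x: pay rate, y: gross pay)
pay_rate = [17.25, 18.25, 22.50, 16.50, 18.75, 16.35, 19.00, 19.35]
gross = [740.00, 718.50, 947.00, 660.00, 850.00, 649.95, 825.00, 813.20]

slope, intercept = fit_line(pay_rate, gross)

def predict_gross(rate):
    """Where on the y axis a given pay rate falls on the fitted line."""
    return slope * rate + intercept

# The y values of the regression line, one per x value
fitted = [predict_gross(x) for x in pay_rate]

print(slope, intercept)
print(predict_gross(16.35))
```

Plotting pay_rate against gross as a scatterplot and pay_rate against fitted as a line gives the kind of graph described in this section.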

Day 5: Analyzing the payroll using a Python linear regression model

Click on the link to get a copy of the worksheet.

Worksheet for this assignment
Importing Python libraries into our project

Open a new Python project and paste the contents into the first frame.

This code imports the libraries needed for this model.

Click on the + to add a new frame.

Paste the code to add the libraries to the project.

Getting and printing the dataset file

Press the + sign to add a new frame.

Paste the code into a new frame.

The dataset represents last month's payroll. It covers four weeks for 21 employees. In real life, the dataset would be much larger, but we can still see how it works with a much smaller dataset.

In this case, the whole dataset is displayed. Normally, with large datasets, the head, which consists of the first 5 lines, is displayed.

The location of the file is up to you. Remember where you saved your dataset file of employees.

You can see the folder that I used is C:\\Users\\jerry\\OneDrive\\Documents\\LinearRegressionAccounting.
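The doubled backslashes in that folder name are how a Windows path is written inside an ordinary Python string, because a single backslash starts an escape sequence. A short sketch of the two equivalent spellings follows; the variable names are ours, and the file name payroll.csv is the one chosen earlier in the lesson.

```python
from pathlib import PureWindowsPath

# Ordinary string: every backslash is doubled
escaped = "C:\\Users\\jerry\\OneDrive\\Documents\\LinearRegressionAccounting"

# Raw string: the r prefix lets you type the path as it appears in Windows
raw = r"C:\Users\jerry\OneDrive\Documents\LinearRegressionAccounting"

print(escaped == raw)  # True

# Join the folder and the file name without worrying about separators
folder = PureWindowsPath(raw)
csv_file = folder / "payroll.csv"
print(csv_file)
print(csv_file.name)   # payroll.csv
```

Either spelling can be passed to the line that loads the file; replace the folder with the one you wrote down.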

Now we are going to put in some code that will describe our dataset.

Describing the dataset

Click on the + to add a new frame.

Paste the clipboard content into the frame.

Splitting the data set into independent variables and dependent variable

Click on the + to add a new frame.

Paste the clipboard content into the frame.

This frame separates the independent variables, (X) from the dependent variable (y).

Click the + key for a new frame and key in: from sklearn.model_selection import train_test_split

Click the + key for a new frame and type: X_train, X_test, y_train, y_test = train_test_split(X,y, test_size= 0.20, random_state=0)

Here is where we begin the process of training the model. We are holding out a test sample of 20%, which results in 17 records. The random state setting ensures that the same items are selected each time we run the model.
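The arithmetic behind the 17 records can be sketched as follows. This assumes, as far as we can tell from scikit-learn's behavior, that a fractional test size is rounded up to a whole number of rows. The second half only illustrates what a fixed random state buys you; it uses Python's random module as a stand-in, so the row numbers it picks are not the ones train_test_split would pick.

```python
import math
import random

n_records = 84      # four weeks of payroll for 21 employees
test_size = 0.20

n_test = math.ceil(n_records * test_size)   # 16.8 rounded up
n_train = n_records - n_test
print(n_test, n_train)  # 17 67

def pick_test_rows(seed):
    """Illustrative only: choose n_test row numbers with a fixed seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_records), n_test))

# Same seed, same rows, every time the frame is run
print(pick_test_rows(0) == pick_test_rows(0))  # True
```

The remaining 67 records are the ones the model is trained on.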

Click on the + key for a new frame and type: from sklearn.linear_model import LinearRegression

Save your program.

Click on the + and add these two lines in the frame:

linear_regressor = LinearRegression()
linear_regressor.fit(X_train, y_train)

Click on the + key for a new frame and key in:

coeff = pd.DataFrame(linear_regressor.coef_,X.columns, columns=['Coefficient'])
print(coeff)

This block of code shows how strongly each independent variable influences gross pay: each coefficient is the change in predicted gross pay for a one-unit increase in that variable. Save and run. You should see the following:

Coefficient
hrsWorked 18.777041
payRate 39.534752
comm 0.975115
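One way to read these coefficients is to ask how much the prediction moves when one input changes. The sketch below uses the three coefficients printed above; the helper predicted_change is ours. The model's intercept is not shown in this lesson, so the sketch works out changes in predicted gross pay rather than full predictions.

```python
# Coefficients as printed by the model
coefficients = {"hrsWorked": 18.777041, "payRate": 39.534752, "comm": 0.975115}

def predicted_change(**changes):
    """Change in predicted gross pay for the given changes in the inputs."""
    return sum(coefficients[name] * delta for name, delta in changes.items())

print(predicted_change(payRate=1.00))    # one dollar more per hour
print(predicted_change(hrsWorked=1.0))   # one more hour worked
print(predicted_change(comm=10.00))      # ten dollars more commission
```

Keep the units in mind when comparing the numbers: a dollar of pay rate applies to every hour worked, while a dollar of commission is just a dollar.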

Create a new frame and key in the following three lines of code. Save and run. These lines make predictions as to gross pay and also print out the actual numbers.

pred_y = linear_regressor.predict(X_test)
df= pd.DataFrame({'Actual':y_test,'Predicted':pred_y})
print(df)

Actual Predicted
30 690.00 687.925756
40 1000.00 998.852801
43 736.75 737.594054
50 1000.00 998.852801
22 693.00 691.989194
54 822.00 821.756810
2 947.00 948.376756
56 809.00 809.080316
26 628.25 617.598422
8 1000.00 998.852801
69 825.00 825.026756
13 860.00 859.155778
66 1000.00 998.852801
77 779.00 779.826869
16 950.00 947.446276
27 809.00 809.175060
75 752.27 750.404053

Seventeen records, or rows, were selected at random. Our data set has 84 rows (84 * 0.20 = 16.8, rounded up to 17).

The actual data from each of these records is displayed along with the model's predictions.

If you compare the actual values with the predictions, you will see that the numbers are very close. Our model did an excellent job in predicting future values for each of these records.

print('Predicted gross pay for 20% of dataset')
sum(pred_y)

Create a new frame and key in the above two lines to get a total of the predicted value of seventeen gross pay amounts.

print('Actual gross pay for 20% of dataset')
sum(y_test)

Create a new frame and key in the above two lines to get the actual total of the seventeen gross pay amounts.

Save and run. Record your answers to actual and predicted gross pay totals on your worksheet.

Click on the + to add a new frame.

Paste the clipboard content into the frame.

Save and run to get the result of a person working 40 hrs per week and being paid $25.00 per hour and no commission.

Our model's prediction is very close to the actual pay of 40 times 25 = 1000.

This feature of our model is very helpful in determining what to pay employees. Try some other numbers.

Click on the + to add a new frame.

Paste the clipboard content into the frame.

These lines test the accuracy of our model.

Mean absolute error (MAE) is the average absolute difference between the values predicted by the model and the actual values. It is known as a scale-dependent accuracy measure because the error is expressed on the same scale as the observations. It is commonly used as an evaluation metric for regression models in machine learning.¹
The mean squared error (MSE) is the average of the squares of the errors, that is, the average squared difference between each estimated value and the true value.²
The root mean square error (RMSE) is the square root of the MSE. It tells us how far apart our predicted values are from our observed values, on average.
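The three measures can be worked out by hand from the Actual and Predicted columns printed earlier in this lesson. The sketch below uses plain Python rather than the metrics functions the lesson's code calls, so treat it as an illustration of the definitions; the variable names are ours.

```python
from math import sqrt

# (actual, predicted) pairs from the seventeen test records shown earlier
results = [
    (690.00, 687.925756), (1000.00, 998.852801), (736.75, 737.594054),
    (1000.00, 998.852801), (693.00, 691.989194), (822.00, 821.756810),
    (947.00, 948.376756), (809.00, 809.080316), (628.25, 617.598422),
    (1000.00, 998.852801), (825.00, 825.026756), (860.00, 859.155778),
    (1000.00, 998.852801), (779.00, 779.826869), (950.00, 947.446276),
    (809.00, 809.175060), (752.27, 750.404053),
]

errors = [actual - predicted for actual, predicted in results]

mae = sum(abs(e) for e in errors) / len(errors)   # average size of the miss
mse = sum(e ** 2 for e in errors) / len(errors)   # average squared miss
rmse = sqrt(mse)                                  # back on the dollar scale

print('Mean Absolute Error', mae)
print('Mean Squared Error', mse)
print('Root Mean Squared Error', rmse)
```

Because the errors are squared before averaging, one large miss (such as the 628.25 record) raises the MSE and RMSE more than it raises the MAE.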

Day 6: Giving the managers a raise

We need to modify the csv dataset file. We are going to give all managers a raise. They all work 40 hours a week.

Dawn Dunbar CEO $30.00 per hour, 1200 gross pay
Hal Teague CFO $25.00 per hour, 1000 gross pay
Dave Couch IT $22.75 per hour, 910 gross pay
Paul Parsons HR $21.00 per hour, 840 gross pay

The original file looks like this.

First Name,Last Name,hrsWorked,payRate,comm,grossPay
Bill,James,40,17.25,50,740
Jim,Kingman,38,18.25,25,718.5
Aaron,Clark,39.2,22.5,65,947
Dawn,Dunbar,40,16.5,0,660
Julie,Velasquez,40,18.75,100,850
Aimee,Lin,37,16.35,45,649.95
Ralph,Dugan,40,19,65,825
Shelley,White,39.7,19.35,45,813.2
Hal,Teague,38,16,0,608
Sam,Moser,39,17,35,698
Ruth,Baylor,40,18,68.5,788.5
Dave,Couch,35,17,0,595
Jerry,Meister,36.2,18.35,88,752.27
Lawrence,Thomas,40,19,100,860
Jane,Clyde,40,18.35,45,779
Dick,Weatherby,39,19,57.95,798.95
Steve,Lewis,40,20,150,950
Paco,Hinson,40,16.88,34,709.2
Herman,Blake,38,22.5,75,930
Paul,Parsons,40,17,0,680
Charles,Tilden,40,17.25,0,690
Bill,James,39.6,17.25,66,749.1
Jim,Kingman,36,18.25,36,693
Aaron,Clark,39.2,22.5,68,950
Dawn,Dunbar,39,16.5,0,643.5
Julie,Velasquez,37,18.75,110,803.75
Aimee,Lin,35,16.35,56,628.25
Ralph,Dugan,39,19,68,809
Shelley,White,38.7,19.35,44,792.85
Hal,Teague,37.5,16,0,600
Sam,Moser,38,17,44,690
Ruth,Baylor,39.5,18,99,810
Dave,Couch,36,17,0,612
Jerry,Meister,37,18.35,98,776.95
Lawrence,Thomas,39,19,110,851
Jane,Clyde,40,18.35,47,781
Dick,Weatherby,38,19,57.95,779.95
Steve,Lewis,35.9,20,166,884
Paco,Hinson,40,16.88,44,719.2
Herman,Blake,37,22.5,68,900.5
Paul,Parsons,38,17,0,646
Charles,Tilden,39,17.25,24,696.75
Bill,James,40,17.25,55,745
Jim,Kingman,39,18.25,25,736.75
Aaron,Clark,40,22.5,58,958
Dawn,Dunbar,40,16.5,0,660
Julie,Velasquez,40,18.75,150,900
Aimee,Lin,37,16.35,55,659.95
Ralph,Dugan,39.5,19,65,815.5
Shelley,White,39.7,19.35,44,812.2
Hal,Teague,38,16,0,608
Sam,Moser,39,17,35,698
Ruth,Baylor,40,18,77,797
Dave,Couch,35,17,0,595
Jerry,Meister,40,18.35,88,822
Lawrence,Thomas,40,19,98,858
Jane,Clyde,40,18.35,75,809
Dick,Weatherby,39,19,67,808
Steve,Lewis,40,20,125,925
Paco,Hinson,40,16.88,44,719.2
Herman,Blake,40,22.5,85,985
Paul,Parsons,40,17,0,680
Charles,Tilden,40,17.25,36,726
Bill,James,40,17.25,50,740
Jim,Kingman,38,18.25,25,718.5
Aaron,Clark,39.2,22.5,65,947
Dawn,Dunbar,40,16.5,0,660
Julie,Velasquez,40,18.75,100,850
Aimee,Lin,37,16.35,45,649.95
Ralph,Dugan,40,19,65,825
Shelley,White,39.7,19.35,45,813.2
Hal,Teague,38,16,0,608
Sam,Moser,39,17,35,698
Ruth,Baylor,40,18,68.5,788.5
Dave,Couch,35,17,0,595
Jerry,Meister,36.2,18.35,88,752.27
Lawrence,Thomas,40,19,100,860
Jane,Clyde,40,18.35,45,779
Dick,Weatherby,39,19,57.95,798.95
Steve,Lewis,40,20,150,950
Paco,Hinson,40,16.88,34,709.2
Herman,Blake,38,22.5,75,930
Paul,Parsons,40,17,0,680
Charles,Tilden,40,17.25,10,700

Open Notepad or another text editor.

Paste the clipboard content into the text editor.

Make the changes

Change the hourly rate and gross pay for each manager. Remember there are four weeks of data, so sixteen changes are needed.

For example, Dawn Dunbar, the CEO's lines should look like "Dawn,Dunbar,40,30.00,0,1200".

Save your file as payrollMgr.csv.
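The sixteen changes can also be made with a short script instead of by hand. The sketch below is one possible approach, not part of the lesson's code: it works on a five-line sample held in memory (an assumption made to keep the example short), sets each manager to 40 hours at the new rate, and recomputes gross pay. The names new_rates and HOURS are ours.

```python
import csv
import io

# New hourly rates from the list of raises
new_rates = {
    ("Dawn", "Dunbar"): "30.00",
    ("Hal", "Teague"): "25.00",
    ("Dave", "Couch"): "22.75",
    ("Paul", "Parsons"): "21.00",
}
HOURS = 40  # all managers work 40 hours a week

# A small sample of the original file; the real file has 84 records
original = """First Name,Last Name,hrsWorked,payRate,comm,grossPay
Bill,James,40,17.25,50,740
Dawn,Dunbar,40,16.5,0,660
Hal,Teague,38,16,0,608
Dave,Couch,35,17,0,595
Paul,Parsons,40,17,0,680
"""

reader = csv.DictReader(io.StringIO(original))
updated = []
for row in reader:
    key = (row["First Name"], row["Last Name"])
    if key in new_rates:
        rate = new_rates[key]
        row["hrsWorked"] = str(HOURS)
        row["payRate"] = rate
        row["grossPay"] = f"{HOURS * float(rate):g}"  # e.g. 40 * 30.00 -> 1200
    updated.append(row)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(updated)
print(out.getvalue())
```

To use it on the real data, read from payroll.csv and write to payrollMgr.csv instead of the in-memory strings.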

In your python project, change the line that loads the file to reflect the new name for your comma separated file.

Run your Python project and answer any questions on the worksheet.

Let's see how accurately our model can predict. We know for example that Teague is being paid $25.00 for 40 hours and that totals $1,000.00.

Now change our individual prediction line "predictGrossPay = regr.predict([[40,25.00,0.00]])" to reflect 40,25,0.00 and run that frame. What does our prediction say he should make? Write your answer on the worksheet.

Change our individual prediction line for each of the remaining managers and see how close actual and predictions are and record your answers on the worksheet.

Let's see what our model will predict for four new employees. Change the individual prediction line to reflect 40 hours, 16.00 per hour and 25.00 in bonuses each.

Write your answers for the total predicted gross pay with their bonuses on the worksheet.
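Before recording the model's answer, it helps to have the straight arithmetic for comparison. A minimal sketch, assuming each new employee's pay is hours times rate plus the bonus; the variable names are ours.

```python
hours, rate, bonus = 40, 16.00, 25.00
new_hires = 4

per_employee = hours * rate + bonus   # one week's gross pay for one new hire
total = new_hires * per_employee      # added weekly payroll for all four

print(per_employee)  # 665.0
print(total)         # 2660.0
```

The model's prediction for one employee should land close to the first figure, and four times that prediction close to the second.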

Day 7: Random Forest Regression Model

I wanted to see if a random forest model could do any better than our linear regression model.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import linear_model
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt
import numpy as np
import sys
sys.__stdout__ = sys.stdout

df = pd.read_csv('C:\\Users\\jerry\\OneDrive\\Documents\\LinearRegressionAccounting\\payRandomForest.csv')
print(df)

print(df.describe())

# Inputs: every row, columns 0, 1 and 2 (hours worked, pay rate, commission)
X = df.iloc[:, 0:3].values
# Output: every row, column 3 (gross pay)
y = df.iloc[:, 3].values
print(X)

print(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,random_state=0)

from sklearn.preprocessing import StandardScaler
rf_scaler = StandardScaler()
X_train = rf_scaler.fit_transform(X_train)
X_test = rf_scaler.transform(X_test)

from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators=500, random_state=0)
rf_regressor.fit(X_train, y_train)
pred_y = rf_regressor.predict(X_test)

df= pd.DataFrame({'Actual':y_test,'Predicted':pred_y})
print(df)

regr = linear_model.LinearRegression()
regr.fit(X,y)
predictGrossPay = regr.predict([[40,30,0.00]])
print("Prediction of an individual's gross pay")
print(predictGrossPay)

from sklearn import metrics
print('Mean Absolute Error',metrics.mean_absolute_error(y_test, pred_y))
print('Mean squared Error:',metrics.mean_squared_error(y_test,pred_y))
print('Root Mean Squared Error ',np.sqrt(metrics.mean_squared_error(y_test, pred_y)))

Start a new Python model

I constructed my model using a series of frames. I double spaced the code to show where I separated the frames.

You can paste all of the code into one frame if you wish.

Let's see how this model compares to the linear regression one.

Random Forest & Linear Regression Predictions

#	RF Actual	RF Predicted	#	LR Actual	LR Predicted
0	690.00	679.32600	30	690.00	687.664874
1	646.00	622.10600	40	840.00	840.585687
2	736.75	734.70270	43	736.75	737.345147
3	608.00	613.42750	50	1000.00	999.065242
4	693.00	730.01858	22	693.00	691.878898
5	822.00	791.11088	54	822.00	821.679961
6	947.00	947.29180	2	947.00	948.608595
7	809.00	786.00324	56	809.00	808.96256
8	628.25	662.91018	26	628.25	617.423958
9	608.00	613.42570	8	1000.00	999.065242
10	825.00	823.15650	69	825.00	824.932883
11	860.00	856.38200	13	860.00	859.172022
12	660.00	660.30770	66	1200.00	1197.164686
13	779.00	772.65750	77	779.00	779.614732
14	950.00	940.29400	16	950.00	947.704968
15	809.00	805.91200	27	809.00	809.125292
16	752.27	763.38664	75	752.27	750.458941

The table represents the random sample created by each model and its predictions compared with actual amounts in the dataset.

If you look at the difference between actual and prediction amounts, you will see that the linear regression model does a much better job.

For example, the first row of the table shows an actual amount of 690.00 for both models. The random forest prediction is about 11 dollars off (690.00 - 679.33), while the linear regression prediction is about 2 dollars off (690.00 - 687.66).
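The same comparison can be made for several rows at once. The sketch below takes the first five rows of the table in which both models were tested against the same actual amount (rows where the two actual amounts differ are left out) and averages the absolute differences. The variable names are ours, and five rows are only a sample, so the averages will not equal the error figures reported for the full test sets.

```python
# (actual, random forest prediction, linear regression prediction)
rows = [
    (690.00, 679.32600, 687.664874),
    (736.75, 734.70270, 737.345147),
    (693.00, 730.01858, 691.878898),
    (822.00, 791.11088, 821.679961),
    (947.00, 947.29180, 948.608595),
]

for actual, rf_pred, lr_pred in rows:
    rf_off = abs(actual - rf_pred)
    lr_off = abs(actual - lr_pred)
    print(actual, round(rf_off, 2), round(lr_off, 2))

rf_mae = sum(abs(a - rf) for a, rf, _ in rows) / len(rows)
lr_mae = sum(abs(a - lr) for a, _, lr in rows) / len(rows)
print('Random forest average difference:', round(rf_mae, 2))
print('Linear regression average difference:', round(lr_mae, 2))
```

Individual rows can go either way (the 947.00 row is closer for the random forest), which is why the averages are the fairer basis for comparison.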

Linear Regression Errors
Mean Absolute Error 1.6397182547768674
Mean Squared Error: 8.630517131270052
Root Mean Squared Error 2.937774179760938

Random Forest Errors
Mean Absolute Error 12.31444000000001
Mean Squared Error: 299.2255795241577
Root Mean Squared Error 17.298138036336677

Looking at the mean absolute error for the linear regression model, we can conclude that on average the predicted values are about $1.64 more or less than the actual gross pay values.

This is a very good number for the linear regression model.

On the other hand, this error is $12.31 for the random forest model.

The other errors are also much higher for the random forest model.

We can conclude that the linear regression model is the better choice for our payroll data.