The purpose of this unit is to give you some tools to analyze your company's internal data and use it to make business decisions to improve your company's sales and revenue.

We will be using Python and its libraries to accomplish these tasks.

You will need to have Python, Numpy, Pandas and Matplotlib installed on your computer. See previous Python lessons on our website for instructions on how to install these.

First printout the MS word file worksheet.

Data Analysis Worksheet

Day 1: Working with Numpy arrays for customers' orders

Copy the following code onto the clipboard and paste it into Spyder. Save the file with a .py extension in your working directory.

Now copy the customerOrders.txt file and save it in your working folder. It needs to be in the samme folder as your python code.

There 5 columns in the orders.txt file:a,b,c,d,e.

Numpy is imported

The file is read into the array.

Row 11, which contains the titles of the columns is skipped.

The array is unpacked, which means that each column is a separate array.

Numpy arrays only work with numbers so each customer's name has been given a number like 12, 13, 15, etc.

The invoice numbers are in the next column. 45609 is the invoice number or our first item in the file.

Product code numbers comprise the numbers in the next column.

The price of the product sold is the next column

Sales tax at 8% comprise the items in the next column.

The total of the price and the sales tax is found by adding colums d and e

You can also isolate individual items in the array.

For example using Mosberg 12-guage 8 shot pump shotgunprint(a[3]) prints the third item in column a, the first column. Remember start counting with 0.

Experiment by looking at other individual items.

Printing d+e adds the price and sales tax together.

Try adding this line of code to calculate the 8% sales tax for the 6th item, print (d[6]*.08)

What are the totals of sales tax and product price for each item?

The file is simple to make. It was created in a text-editor program. The tab key was pressed for the beginning of a new column. The file was saved with a .txt extension.

Day 2: Creating a text file using Notepad to be used with Python data analysis

Open Notepad or Notepad++ and key in the following information. Tab between columns except for the column headings which are separated by commas.

Item,Cost,Price,Markup%

2271 18.30 .92

2272 14.64 .91

2274 17.78 .85

2275 13.07 .93

2276 14.62 .90

2288 17.78 .89

Save your file and call it markup.txt

Modify your previous program so that it will will work with this new file.

Hints:You only have three columns, a,b,c

print (b*c) #multiple purchase price times markup

print(b*c +b)# muliple cost time markup percentage and then add markup amount and cost to obtain retail price

What is the selling price of all items?

Day 3:Payroll using csv file

Put the above code on the clipboard and paste it into Spyder text editor.

Save it in your working folder with a .py extension.

The file above is formatted in a csv mode. It contains payroll information for a company. The fields are title, first name, last name, yearly salay, payrate, hours worked in a month and gross pay.

Each employee is listed on a row, separating each field by a comma.

Put the payroll information on the clipboard and paste it into Notepad.

Save it as payroll.csv

Run the payroll.py program in Spyder. Your results should look like this.

Day 4: Isolating employees and giving them a pay raise

Add this line to your Python code at the end of the existing code.data[data.PayRate == 23.44] When you run this, it should give you a listing of all employees making 23.44 per hour.

Once you have isolated these employees, open your text file in Notepad and change 23.44 to 25.00 for all employees making 23.44 per hour. Save your file and run it through your Python code.

How many employees were affected, who are they and what is their new monthly gross pay?

Add this line at the end of the code to printout title and last name of all employees. print(data.loc[:,['Title','LastName']])

Day 5: Importing Excel Data and working with Pandas

With Pandas you can work with files in various formats: text files, comma separated value (csv) files and MS Excel files. The elements in a Pandas series are assigned indexes. The first element is assigned an index of 0, the second one 1 the third one 2 and so on. The last element is assigned an index of N-1 with N denoting the total number of items.

Most of you have probably worked with Microsoft's Excel spreadsheet. We are able to import an Excel spreadsheet into a Pandas dataframe. Many business records are stored in Excel sheets: Customer information, Accounting and financial records, etc.

We are going to work with some customer information for a sporting goods store.

First you need to download the Excel spreadsheet and save it into your working folder.

This spreadsheet is quite small. Normally customer records would have thousands of records. I am using this small sample, just to give you an idea what it would be like to use Python's Pandas library to do data analysis.

Customer Orders

Now open Spyder text editor and key in the following lines. Save the code using a file name with a .py extension. Run the file using F5 or F9

Answer the worksheet questions relating to this section.

import pandas as pd

Import sys

sys.__stdout__= sys.stdout

data = pd.read_excel('CustomerOrdersDept.xlsx')

print(data)

Add the following line to the end of your code. data.describe() Save and run.

This line describes the data base and gives very important information: Number of records, the mean or average sale, tax and total, the lowest and highest sales, tax and total.

Suppose you were looking for best sellers?

Add this line. data[data.ProductDescription == 'Nike Shorts']# search for individual items purchased

Save and run. What did you find out?

Suppose you wanted to see buying habits of a particular customer, add this line at the end of your code.

data[data.CustomerLastName == 'Hall']# find all purchases by Hall

When you run the code, you will find two orders by a customer named Hall. Check the record numbers and you can then look up the entire transaction on the spreadsheet.

To see all records by Last name, description and department, key in the following line.

print(data.loc[:,['CustomerLastName','ProductDescription','Department']])# printout all data customer last name, product description and department of the item

To see customer last names, items purchased and departments for records 0,3,6, add the following line.

print(data.loc[[0,3,5],['CustomerLastName','Department']])# select 0,3 5 array elements and display customer last name and department purchased item belongs to

Knowing department sales is an important part of data analysis. Python's Pandas library has a functioon to give us that information.

Add these lines of code, one for each department. Save and run.

data[data.Department == 'Apparel'] # show all apparel sales

data[data.Department == 'RollerSports'] # show all roller sports sales

data[data.Department == 'Baseball'] # show all baseball department sales

data[data.Department == 'SportingArms'] # show all gun sales

data[data.Department == 'Football'] # show all footballl sales

data[data.Department == 'Basketball'] # show all basketballl sales

data[data.Department == 'Soccer'] # show all soccer sales

data[data.Department == 'Outdoors'] # show alloutdoors sales

data[data.Department == 'Fitness'] # show all fitness sales

data[data.Department == 'WaterSports'] # show all water sports sales

data[data.Department == 'Golf'] # show all golf sales

data[data.Department == 'Knives'] # show all knives sales

data[data.Department == 'Tennis'] # show all tennis sales

The rows and columns information at the end of the list gives the total number of records or rows in that given department.

Answer the questions on the worksheet regarding department sales.

You can select very specific parts of each record using the following code examples.

print(data.Price[15]) print(data.Total[50])

Add these lines of code and try then out. Answer the worksheet questions pertaining to these two lines of code.

Add this line of code next.

print(data.Department)# prints all invoices and the departments associated with the sale.

To get customer last name, product description and price add this line of code.

print(data.loc[:,['CustomerLastName','ProductDescription','Price']])# printout all data customer last name, product description and amount spent on the item

You can also use the Pandas library to sort things. Let's say you want to see a sorted list form low to high of all prices of your products. Try this code.

PriceList=sorted(data.Price)

Suppose you are looking for regular and repeat customers, try sorting customners' last names with the following two lines of code.

If their name appears more than once they are repeat customers and you should reward tem for their repeat business.

CustomerList=sorted(data.CustomerLastName) print(CustomerList)

import pandas as pd
import sys
sys.__stdout__= sys.stdout
data = pd.read_excel('customerOrdersDept.xlsx')
print(data)
data.describe()               
data[data.ProductDescription == 'Nike Shorts']# search for individual items purchased
data[data.CustomerLastName == 'Hall']# find all purchases by Hall
print(data.loc[:,['CustomerLastName','ProductDescription','Department']])# printou all data customer last name, product description and department of the item
print(data.loc[[0,3,5],['CustomerLastName','Department']])# select 0,3 5 array elements and display customer last name and department purchased item belongs to
data[data.Department == 'Apparel'] # show all apparel sales
data[data.Department == 'RollerSports'] # show all roller sports sales
data[data.Department == 'Baseball'] # show all baseball department sales
data[data.Department == 'SportingArms'] # show all gun sales
data[data.Department == 'Football'] # show all footballl sales
data[data.Department == 'Basketball'] # show all basketballl sales  
data[data.Department == 'Soccer'] # show all soccer sales
data[data.Department == 'Outdoors'] # show alloutdoors sales
data[data.Department == 'Fitness'] # show all fitness sales
data[data.Department == 'WaterSports'] # show all water sports sales
data[data.Department == 'Golf'] # show all golf sales
data[data.Department == 'Knives'] # show all knives sales
data[data.Department == 'Tennis'] # show all tennis sales
#select certain price in the array.  Remember start counting with 0
print(data.Price[15])
print(data.Total[50])

print(data.Department)# prints all invoices andd the departments associated with the sale

print(data.loc[:,['CustomerLastName','ProductDescription','Price']])# printou all data customer last name, product description and amount spent on the item

PriceList=sorted(data.Price)
print(PriceList)

#sort list alphabetically by last name
CustomerList=sorted(data.CustomerLastName)
print(CustomerList)

The full-listing of the code appears in the box above.

Day 6: Data Visualization

Many times it is easier to analyze data if you can see it in a graphic form.

Pandas has some tools that make it quite easy to turn an Excel spreadsheet and csv files into graphs.

Let's take an example. One of the lessons on our web page.

Starting Your Own Business

In starting your own business, examining price is an important thing that needs to be addressed in the business plan.

When determining what price to put on an item, you need to know the costs involved. In this particular case, the product is a beef slider sold from a food truck.

The costs involved are the cost of the meat, the bun, the other ingredeients.

The graphs below were generated from the data in the Food Costs Spreadsheet.

Food Costs Excel spreadsheet

Save this file using the same name in your working folder.

This line graph shows the increase in the cost of ingredients to make a beef slider over the past year.

This graph is an unstacked bar graph showing the same infomation as the graph above.

This graph also show the same data in a slightly different form of bar graph.

Here is the code that creates these graphs.

Key the above information into Spyder, save it and execute the code.

What can you tell about the price of beef?, buns?, other ingredients?, total overall cost?

Experiment with changing the data in the spreadsheet to see how the visualization changes.

Now its your turn to create a spreadsheet and graph the information.

Copy and paste the above code into Spyder and save it with a .py extension in your working folder.

Now create your own Excel Spreadsheet. Open the Excel template.xls file and enter the following data into Excel and Save it as insideSlider.xlsx

The numbers represent sales of food truck items by the day of the week.

Template
300.00 200.00 100.00 75.00 150.00 150.00 166.00 136.00 78.00 136.00 175.00 165.00 88.00 67.00 136.00 110.00 167.00 90.00 50.00 100.00 169.00 150.00 99.00 65.00 110.00 190.00 175.00 125.00 71.00 135.00 325.00 210.00 146.00 80.00 200.00

Each row represents a day of the week. There are five food items represented by the colums:

Run your Python code and answer the following questions after examining the graph.

What day are the sales the best? How much?

Worst Day for sales?

Most consistent seller?

Second best sale day?

Best sales day for pork sliders?

Best day for beef sliders?

By now you can see the benefits of using graphs for analyzing data.

It is a useful tool in presenting the information to others in the organization.

Day 7: Data Wrangling

Many times the data that you receive to analyze is not in any useable form. One method to help arrange data in a meaningful format is sorting. Below is a Python program to help you you sort a data frame for a sporting goods store.

# -*- coding: utf-8 -*-
"""
Created on Thu Oct 10 12:18:01 2019

@author: jerrybelch
"""

Apparel=['Nike workout pants', 'volleyball short','Nike shorts','Nike workout pants','Adidas jacket and pants','Nike womens swim suits','Nike volleyball shorts','Charged cotton socks','Blitsing mens cap','Thompson Hill short sleve shirts']
Outdoor=['Rugged Exposure Prospector Tent','Coleman arch rock tent','Westport slant leg canopy','Lifetime volt 101 kayak','Trailhead folding cot','Shakespeare freshwater fishing set','Glowman flashlight','Sierra treking pole set']
Baseball=['Rawlings fast pitch glove','Wilson 150 series glove','Easton youth bat','Rawlings mach Youth baseball bat','Rip It batting helmut','Rawlings baseball glove','Easton baseball cleats','Rawlings batting gloves']
Basketball=['Lifetime portable basketball systems','Lifetime steel framed portable basketball systems','Wilson NCAA replica ball', 'Wilson: pro ball','Spaulding: pro ball','Steph Curry ball','Wilson jet evolution ball','Spaulding NBA ball']
Football=['Franklin sports junior footballs','Wilson composite football','Rawlings pro5 leather football','Shock Dr. mouthgard','Nike mouthgard','Riddell rib protector','Rawlings integrated football pants','Tiffrl speed flex helmut']
Soccer=['Adidas world cup soccer ball','Goletto soccer cleats' ,'Copa FXG soccer cleats','Metro soccer socks','Nike goalkeeper gloves','Neymar soccer cleats','SuperFly soccer cleats','Adidas shinguards']
SportingArms=['Mosberg 12-guage 8 shot pump shotgun','Ruger semi-auto ranch rifle','Ruger .22LR semi-auto rifle','Howard electronic hearing protection','Federal American Eagle .223 Remington Ammo box of 1000','USA Forged 9mm box of 150','Stack On 22 Gun Security Cabinet','Real Avid gun cleaning kit']
Fitness=['Pro-Form treadmill','Competition weight bench','Champ cardio dual trainer','Lab execise ball','Everlast heavy bag','Pro-style training gloves','Marcy utility weight bench','GoFit foam massage roller']
RollerSports=['Razor cruiser skooter','Sports Elite skateboards','Kore helmut','Bladerunner inline skates','Bullet speed skates','CRs69 inline skates','Moxi suede roller skates','Chicago womens rink skates']
AirGuns=['Gamo .177-cal pellet air rifle','Benjamin .177 or .22 break-barrel air rifle','Benjamin trail NP break-barrel pellet pistol','Rosman C11 semi-automatic CO2 bb pistol','CZ sp-01 shadow CO2 bb air pistol','Sag Sauer spartan 1911 CO2 bb pistol','Gamo whisper Fusion mach 1 .177 caliber air rifle with scope','Rosman American classic .177 or .22 pump']
Knives=['Femme Fatale folding Knive','Precision Point 3.5 inch assisted open folding knive','Kershaw select fire knive multi-tool','Marines survival knife','Kershaw select assisted open knives volt','Kershaw select assisted open knives zing','Kershaw select assisted open knives portal','Kershaw select assisted open knives oso']
Golf=['Wilson k-28 + golf package','Nike dura feel golf glove','CXR power golf balls box of 12','Addidas tech response golf shoes','Footloy spikeless golf shoes','Nike golf hat','Topflite game cart bag','Bushnell tour range finder']
Tennis=['Head graphite lite tennis racquet','Head rewards tennis racquet','Dunlop championship tennis balls 4-ball tube','Wilson hammer 6 hybred tennis racquet','Gamma progression II Pro 3 tennis stringer','Head cone Pro 3 pack tennis bag','Nike swoosh wristband','Nike headband']
WaterSports=['Body Glove full wetsuit','Lifetime dash junior kayak','Volt kayak','Body Glove neoprene ski vest','Seaview snorkling mask','Peak inflatible stand up paddle board','Gold Coast surfboard soft','Ronix wakeboard']

To sort the Apparel merchandise, all you have to do is add the following code at the end of the code listing.

Apparel.sort()
print(Apparel)

Add lines of code to sort all the other categories of merchandise.

Day 8: Auto Data Base Sort

In many of cases, the dataset has multiple fields, so how can you sort the data of a particular field?

Our example is an auto dealership that sells Toyota automobiles and trucks,

We are going to loook at data that contains the salesperson's name, the car sold and the value of that car.

Each line in the Dataframe represents a column.

We will sort the data by value first, then by salesperson.

We can sort in ascending or descending order.

"""
Created on Sat Oct 12 13:08:12 2019

@author: jerrybelch
"""

from pandas import DataFrame
Cars = {'SalesPerson': ['Susan','Jerry','Janet','Jane','Jacob','Dan','Pete','Erin','Emilie','Dick','Bruce','Aileen','Camy','Denise','Jeremy','Bill','Ron','James','Ward','Pam','Carole','Tony'],
        'CarSold': ['Toyota Camry','Land Cruiser','Toyota Prius','Corolla XlE','Camry L','Tacoma','Highlander','Land Cruiser','Tacoma RD Sport','Avalon','Highlander','Camry XLE','Prius','Rav 4','Camry LE','Tacoma Limited','Carolla SE', 'Camry XSE', 'Tundra ST','Sequoa SR5','Sequoia TRD','Highlander SE'],
        'Price':[25000,84000,17000,19700,23045,25400,40240,84000,32095,35500,36870,28650,23475,24600,24150,37350,18700,35000,35400,48600,51315,40240]

}
df=DataFrame(Cars,columns = ['SalesPerson','CarSold', 'Price'])
# print in regular order
print(df)
#print in ascending order

df.sort_values(by=['Price'], inplace=True,ascending=False)
print(df)

df.sort_values(by=['SalesPerson'], inplace=True,ascending=True)
print(df)

Put the information above on the clipboard and paste it into Spyder, save and run the code.

How many salespersones are there?

What is the most expensive vehicle sold? How much?

What is the least expensive vechicle?

Sorted by salesperson, who is the first on the list? the last?

Adding the condition of the car or truck to the dataset

Now add an additional feature to the existing dataset: the condition of the car, new or used. You can choose which ones are new or used.

Save your file and run it. You can see it is much easier to analyse the data after it has been sorted.

Automotive Data Base

Day 9: Dealing with missing data

Let's take a look at a company that deals with large amounts of data: a computer dating commpany.

First let's examine their questionnaire which is the source of the data.

Computer Dating Survey

The survey asks for last name, first name, sex, age and then 45 questions. The responses for all 45 questions are enterered into the computer dating software to find matches for the respondents. You can imagine the size of the file for thousands of clients and for 45 responses for each person. If the file is missing any data the results will not be accurate.

To combat errors, the data has been tabulated using an Excel spreadsheet. The responses were coded with a 4 for strongley agree, 3 for agree moderately, 2 for disagree moderately and 1 for disagree strongly.

Let's look at the spreadsheet. Excel spreadsheet with computer dating responses

For demonstration purposes, there are only 10 respondents and their responses to the 45 questions. A few pieces of data are missing. We will use Python's Pandas library to isolate these missing pieces of data.

import pandas as pd
import sys
sys.__stdout__= sys.stdout
Data = pd.read_excel('datingSurvey2.xlsx') #for an earlier version of Excel use 'xls'
df = pd.DataFrame(Data, columns= ['ID','LastName','FirstName','Sex','Age','Answer1','Answer2', 'Answer3', 'Answer4','Answer5','Answer6','Answer7','Answer8','Answer9','Answer10','Answer11','Answer12','Answer13','Answer14','Answer15','Answer16','Answer17','Answer18', 'Answer19', 'Answer20','Answer21','Answer22','Answer23','Answer24','Answer25','Answer26','Answer27','Answer28','Answer29','Answer30','Answer31','Answer32', 'Answer33','Answer34', 'Answer35', 'Answer36','Answer37','Answer38','Answer39','Answer40','Answer41','Answer42','Answer43','Answer44','Answer45' ])
df.LastName.notnull()
df.FirstName.notnull()
df.Sex.notnull()
df.Age.notnull()
df.Answer1.notnull()
df.Answer2.notnull()
df.Answer3.notnull()
df.Answer4.notnull()
df.Answer5.notnull()
df.Answer6.notnull()
df.Answer7.notnull()
df.Answer8.notnull()
df.Answer9.notnull()
df.Answer10.notnull()
df.Answer11.notnull()
df.Answer12.notnull()
df.Answer13.notnull()
df.Answer14.notnull()
df.Answer15.notnull()
df.Answer16.notnull()
df.Answer17.notnull()
df.Answer18.notnull()
df.Answer19.notnull()
df.Answer20.notnull()
df.Answer21.notnull()
df.Answer22.notnull()
df.Answer23.notnull()
df.Answer24.notnull()
df.Answer25.notnull()
df.Answer26.notnull()
df.Answer27.notnull()
df.Answer28.notnull()
df.Answer29.notnull()
df.Answer30.notnull()
df.Answer31.notnull()
df.Answer32.notnull()
df.Answer33.notnull()
df.Answer34.notnull()
df.Answer35.notnull()
df.Answer36.notnull()
df.Answer37.notnull()
df.Answer38.notnull()
df.Answer39.notnull()
df.Answer40.notnull()
df.Answer41.notnull()
df.Answer42.notnull()
df.Answer43.notnull()
df.Answer44.notnull()
df.Answer45.notnull()
df.head(5)
df.tail(5)
print(df.loc[0:44,['LastName','FirstName','Sex','Age']])
#Filter out rows with missing data and delete person with missing values
#df.dropna()

First copy and paste the Python code into Spyder and save the file using a .py extension.

Download the Excel spreadsheet file and save it in your working folder. Call it datingSurvey2.xlsx

Execute the Python code using F9 key one line at a time.

The first line imports the Pandas library

The next two lines help reduce errors in the loading of the file.

The next line opens the spreadsheet file and reads it into memory.

The next line creates a data frame with all of the column headings. These are the same column headings contained in the worksheet.

The next 49 lines are the ones looking for missing data in the Excel spreadsheet.

They check for missing data in the last name, first name, sex and age fields and for each of the 45 question responses.

If there is no missing data in that row, all answers are True. If there is a missing piece of data, it returns a False value.

print(df.loc[0:44,['LastName','FirstName','Sex','Age']]) line prints out the first four columns of the spreadsheet for all all repondents.

Now you know that you are also missing answers to some of the questions. import pandas as pd import sys sys.__stdout__= sys.stdout Data = pd.read_excel('datingSurvey2.xlsx') #for an earlier version of Excel use 'xls' df = pd.DataFrame(Data, columns= ['ID','LastName','FirstName','Sex','Age','Answer1','Answer2', 'Answer3', 'Answer4','Answer5','Answer6','Answer7','Answer8','Answer9','Answer10','Answer11','Answer12','Answer13','Answer14','Answer15','Answer16','Answer17','Answer18', 'Answer19', 'Answer20','Answer21','Answer22','Answer23','Answer24','Answer25','Answer26','Answer27','Answer28','Answer29','Answer30','Answer31','Answer32', 'Answer33','Answer34', 'Answer35', 'Answer36','Answer37','Answer38','Answer39','Answer40','Answer41','Answer42','Answer43','Answer44','Answer45' ]) df.LastName.notnull() Out[68]: 0 True 1 True 2 False 3 True 4 True 5 True 6 True 7 True 8 True 9 True Name: LastName, dtype: bool

Above are partial results of the execution of the code.

Pay close attention to the line that states df.LastName.notnull()

It is looking at all the information in the column that contains all the last names of the respondents.

Remeber that arrays start with the number 0 as the first one in the index.

You can see that the first two respondents' last names are there,

The third one states False, which means that there is no last name in the dataset for the 3rd person: Janet Female age 19.

Make note of that information, since we would have to contact Janet to get her last name before trying to get her a date using our computer dating software.

Continue Pressing F9 to check for additional missing data.

You should see that Shelley Jones has failed to provide her sex,(m/f).

Continue pressing F9 to check that all answers are there and that there is no missing data.

If you look carefully, you will note that Dan Abbot did not answer question 15

Since our dataset is small, it is much easier to see missing data on the spreadsheet.

By adding this line to the end of your code, you can see last name, first name, sex, age and the answers to question 15 print(df.loc[0:44,['LastName','FirstName','Sex','Age','Answer15']])

There are a number of things we could do to handle the missing data.

Contact the individual by email and request it.

Guess at the information.

Put in a middle of the road score to missing questions.

Delete the column where the missing data appears. This means that for all respondents, you would not know the answer to question 15 which states "Making a lot of money is a very important goal in my life".

Delete the entire row containing the repondents name. This means that the person with any missing data is dropped from the computer dating survey.

Adding "df.dropna()" at the end of the code deletes all rows with missing data.

Adding "df.dropna(axis=1)" delete the entire column with missing data.

Add the lines to delete missing data and re-run your code.