The purpose of this unit is to give you some tools to analyze your company's
internal data and use it to make business decisions to improve your company's sales and revenue.
We will be using Python and its libraries to accomplish these tasks.
You will need to have Python, Numpy, Pandas and Matplotlib installed on your computer.
See previous Python lessons on our website for instructions on how to install these.
First printout the MS word file worksheet.
Data Analysis Worksheet
Day 1: Working with Numpy arrays for customers' orders
Copy the following code onto the clipboard and paste it into Spyder. Save the file with a .py extension in your working directory.
Now copy the customerOrders.txt file and save it in your working folder. It needs to be in the samme folder as your python code.
There 5 columns in the orders.txt file:a,b,c,d,e.
Numpy is imported
The file is read into the array.
Row 11, which contains the titles of the columns is skipped.
The array is unpacked, which means that each column is a separate array.
Numpy arrays only work with numbers so each customer's name has been given a number like 12, 13, 15, etc.
The invoice numbers are in the next column. 45609 is the invoice number or our first item in the file.
Product code numbers comprise the numbers in the next column.
The price of the product sold is the next column
Sales tax at 8% comprise the items in the next column.
The total of the price and the sales tax is found by adding colums d and e
You can also isolate individual items in the array.
For example using Mosberg 12-guage 8 shot pump shotgunprint(a) prints the third item in column a, the first column. Remember start counting with 0.
Experiment by looking at other individual items.
Printing d+e adds the price and sales tax together.
Try adding this line of code to calculate the 8% sales tax for the 6th item, print (d*.08)
What are the totals of sales tax and product price for each item?
The file is simple to make. It was created in a text-editor program. The tab key was pressed for the beginning of a new column. The file was saved with a .txt extension.
Day 2: Creating a text file using Notepad to be used with Python data analysis
Open Notepad or Notepad++ and key in the following information. Tab between columns except for the column headings which are separated by commas.
2271 18.30 .92
2272 14.64 .91
2274 17.78 .85
2275 13.07 .93
2276 14.62 .90
2288 17.78 .89
Save your file and call it
Modify your previous program so that it will will work with this new file.
Hints:You only have three columns, a,b,c
print (b*c) #multiple purchase price times markup
print(b*c +b)# muliple cost time markup percentage and then add markup amount and cost to obtain retail price
What is the selling price of all items?
Day 3:Payroll using csv file
Put the above code on the clipboard and paste it into Spyder text editor.
Save it in your working folder with a .py extension.
The file above is formatted in a csv mode. It contains payroll information for a company. The fields are title, first name, last name, yearly salay, payrate, hours worked in a month and gross pay.
Each employee is listed on a row, separating each field by a comma.
Put the payroll information on the clipboard and paste it into Notepad.
Save it as
Run the payroll.py program in Spyder. Your results should look like this.
Day 4: Isolating employees and giving them a pay raise
Add this line to your Python code at the end of the existing code.
data[data.PayRate == 23.44] When you run this, it should give you a listing of all employees making 23.44 per hour.
Once you have isolated these employees, open your text file in Notepad and change 23.44 to 25.00 for all employees making 23.44 per hour. Save your file and run it through your Python code.
How many employees were affected, who are they and what is their new monthly gross pay?
Add this line at the end of the code to printout title and last name of all employees.
Day 5: Importing Excel Data and working with Pandas
With Pandas you can work with files in various formats: text files, comma separated value (csv) files and MS Excel files. The elements in a Pandas series are assigned indexes. The first element is assigned an index of 0, the second one 1 the third one 2 and so on. The last element is assigned an index of N-1 with N denoting the total number of items.
Most of you have probably worked with Microsoft's Excel spreadsheet. We are able to import an Excel spreadsheet into a Pandas dataframe. Many business records are stored in Excel sheets: Customer information, Accounting and financial records, etc.
We are going to work with some customer information for a sporting goods store.
First you need to download the Excel spreadsheet and save it into your working folder.
This spreadsheet is quite small. Normally customer records would have thousands of records. I am using this small sample, just to give you an idea what it would be like to use Python's Pandas library to do data analysis.
Now open Spyder text editor and key in the following lines. Save the code using a file name with a .py extension. Run the file using F5 or F9
Answer the worksheet questions relating to this section.
import pandas as pd
data = pd.read_excel('CustomerOrdersDept.xlsx')
Add the following line to the end of your code.
data.describe() Save and run.
This line describes the data base and gives very important information: Number of records, the mean or average sale, tax and total, the lowest and highest sales, tax and total.
Suppose you were looking for best sellers?
Add this line.
data[data.ProductDescription == 'Nike Shorts']# search for individual items purchased
Save and run. What did you find out?
Suppose you wanted to see buying habits of a particular customer, add this line at the end of your code.
data[data.CustomerLastName == 'Hall']# find all purchases by Hall
When you run the code, you will find two orders by a customer named Hall. Check the record numbers and you can then look up the entire transaction on the spreadsheet.
To see all records by Last name, description and department, key in the following line.
print(data.loc[:,['CustomerLastName','ProductDescription','Department']])# printout all data customer last name, product description and department of the item
To see customer last names, items purchased and departments for records 0,3,6, add the following line.
print(data.loc[[0,3,5],['CustomerLastName','Department']])# select 0,3 5 array elements and display customer last name and department purchased item belongs to
Knowing department sales is an important part of data analysis. Python's Pandas library has a functioon to give us that information.
Add these lines of code, one for each department. Save and run.
data[data.Department == 'Apparel'] # show all apparel sales
data[data.Department == 'RollerSports'] # show all roller sports sales
data[data.Department == 'Baseball'] # show all baseball department sales
data[data.Department == 'SportingArms'] # show all gun sales
data[data.Department == 'Football'] # show all footballl sales
data[data.Department == 'Basketball'] # show all basketballl sales
data[data.Department == 'Soccer'] # show all soccer sales
data[data.Department == 'Outdoors'] # show alloutdoors sales
data[data.Department == 'Fitness'] # show all fitness sales
data[data.Department == 'WaterSports'] # show all water sports sales
data[data.Department == 'Golf'] # show all golf sales
data[data.Department == 'Knives'] # show all knives sales
data[data.Department == 'Tennis'] # show all tennis sales
The rows and columns information at the end of the list gives the total number of records or rows in that given department.
Answer the questions on the worksheet regarding department sales.
You can select very specific parts of each record using the following code examples.
Add these lines of code and try then out. Answer the worksheet questions pertaining to these two lines of code.
Add this line of code next.
print(data.Department)# prints all invoices and the departments associated with the sale.
To get customer last name, product description and price add this line of code.
print(data.loc[:,['CustomerLastName','ProductDescription','Price']])# printout all data customer last name, product description and amount spent on the item
You can also use the Pandas library to sort things. Let's say you want to see a sorted list form low to high of all prices of your products. Try this code.
Suppose you are looking for regular and repeat customers, try sorting customners' last names with the following two lines of code.
If their name appears more than once they are repeat customers and you should reward tem for their repeat business.
The full-listing of the code appears in the box above.
Day 6: Data Visualization
Many times it is easier to analyze data if you can see it in a graphic form.
Pandas has some tools that make it quite easy to turn an Excel spreadsheet and csv files into graphs.
Let's take an example. One of the lessons on our web page.
Starting Your Own Business
In starting your own business, examining price is an important thing that needs to be addressed in the business plan.
When determining what price to put on an item, you need to know the costs involved. In this particular case, the product is a beef slider sold from a food truck.
The costs involved are the cost of the meat, the bun, the other ingredeients.
The graphs below were generated from the data in the Food Costs Spreadsheet.
Food Costs Excel spreadsheet
Save this file using the same name in your working folder.
This line graph shows the increase in the cost of ingredients to make a beef slider over the past year.
This graph is an unstacked bar graph showing the same infomation as the graph above.
This graph also show the same data in a slightly different form of bar graph.
Here is the code that creates these graphs.
Key the above information into Spyder, save it and execute the code.
What can you tell about the price of beef?, buns?, other ingredients?, total overall cost?
Experiment with changing the data in the spreadsheet to see how the visualization changes.
Now its your turn to create a spreadsheet and graph the information.
Copy and paste the above code into Spyder and save it with a .py extension in your working folder.
Now create your own Excel Spreadsheet. Open the Excel template.xls file and enter the following data into Excel and Save it as
The numbers represent sales of food truck items by the day of the week.
300.00 200.00 100.00 75.00 150.00
150.00 166.00 136.00 78.00 136.00
175.00 165.00 88.00 67.00 136.00
110.00 167.00 90.00 50.00 100.00
169.00 150.00 99.00 65.00 110.00
190.00 175.00 125.00 71.00 135.00
325.00 210.00 146.00 80.00 200.00
Each row represents a day of the week. There are five food items represented by the colums:
Run your Python code and answer the following questions after examining the graph.
What day are the sales the best? How much?
Worst Day for sales?
Most consistent seller?
Second best sale day?
Best sales day for pork sliders?
Best day for beef sliders?
By now you can see the benefits of using graphs for analyzing data.
It is a useful tool in presenting the information to others in the organization.
Day 7: Data Wrangling
Many times the data that you receive to analyze is not in any useable form. One method to help arrange data in a meaningful format is sorting. Below is a Python program to help you you sort a data frame for a sporting goods store.
To sort the Apparel merchandise, all you have to do is add the following code at the end of the code listing.
Add lines of code to sort all the other categories of merchandise.
Day 8: Auto Data Base Sort
In many of cases, the dataset has multiple fields, so how can you sort the data of a particular field?
Our example is an auto dealership that sells Toyota automobiles and trucks,
We are going to loook at data that contains the salesperson's name, the car sold and the value of that car.
Each line in the Dataframe represents a column.
We will sort the data by value first, then by salesperson.
We can sort in ascending or descending order.
Put the information above on the clipboard and paste it into Spyder, save and run the code.
How many salespersones are there?
What is the most expensive vehicle sold? How much?
What is the least expensive vechicle?
Sorted by salesperson, who is the first on the list? the last?
Adding the condition of the car or truck to the dataset
Now add an additional feature to the existing dataset: the condition of the car, new or used. You can choose which ones are new or used.
Save your file and run it. You can see it is much easier to analyse the data after it has been sorted.
Automotive Data Base
Day 9: Dealing with missing data
Let's take a look at a company that deals with large amounts of data: a computer dating commpany.
First let's examine their questionnaire which is the source of the data.
Computer Dating Survey
The survey asks for last name, first name, sex, age and then 45 questions. The responses for all 45 questions are enterered into the computer dating software to find matches for the respondents. You can imagine the size of the file for thousands of clients and for 45 responses for each person. If the file is missing any data the results will not be accurate.
To combat errors, the data has been tabulated using an Excel spreadsheet. The responses were coded with a 4 for strongley agree, 3 for agree moderately, 2 for disagree moderately and 1 for disagree strongly.
Let's look at the spreadsheet.
Excel spreadsheet with computer dating responses
For demonstration purposes, there are only 10 respondents and their responses to the 45 questions. A few pieces of data are missing. We will use Python's Pandas library to isolate these missing pieces of data.
First copy and paste the Python code into Spyder and save the file using a .py extension.
Download the Excel spreadsheet file and save it in your working folder. Call it
Execute the Python code using F9 key one line at a time.
The first line imports the Pandas library
The next two lines help reduce errors in the loading of the file.
The next line opens the spreadsheet file and reads it into memory.
The next line creates a data frame with all of the column headings. These are the same column headings contained in the worksheet.
The next 49 lines are the ones looking for missing data in the Excel spreadsheet.
They check for missing data in the last name, first name, sex and age fields and for each of the 45 question responses.
If there is no missing data in that row, all answers are True. If there is a missing piece of data, it returns a False value.
print(df.loc[0:44,['LastName','FirstName','Sex','Age']]) line prints out the first four columns of the spreadsheet for all all repondents.
Now you know that you are also missing answers to some of the questions.
Above are partial results of the execution of the code.
Pay close attention to the line that states
It is looking at all the information in the column that contains all the last names of the respondents.
Remeber that arrays start with the number 0 as the first one in the index.
You can see that the first two respondents' last names are there,
The third one states False, which means that there is no last name in the dataset for the 3rd person: Janet Female age 19.
Make note of that information, since we would have to contact Janet to get her last name before trying to get her a date using our computer dating software.
Continue Pressing F9 to check for additional missing data.
You should see that Shelley Jones has failed to provide her sex,(m/f).
Continue pressing F9 to check that all answers are there and that there is no missing data.
If you look carefully, you will note that Dan Abbot did not answer question 15
Since our dataset is small, it is much easier to see missing data on the spreadsheet.
By adding this line to the end of your code, you can see last name, first name, sex, age and the answers to question 15
There are a number of things we could do to handle the missing data.
Contact the individual by email and request it.
Guess at the information.
Put in a middle of the road score to missing questions.
Delete the column where the missing data appears. This means that for all respondents, you would not know the answer to question 15 which states "Making a lot of money is a very important goal in my life".
Delete the entire row containing the repondents name. This means that the person with any missing data is dropped from the computer dating survey.
Adding "df.dropna()" at the end of the code deletes all rows with missing data.
Adding "df.dropna(axis=1)" delete the entire column with missing data.
Add the lines to delete missing data and re-run your code.