Print out a copy of the worksheet as you work through this project. Answer the questions as you go through the lesson.

Predictive analytics encompasses a variety of statistical techniques from predictive modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future or otherwise unknown events. - Wikipedia

In this lesson we will explore some of the concepts above.

Day 1: Downloading Python and Geany

Let's check to see if Python is installed on your machine. Open a terminal window by typing Command in the search box and then clicking on Command Prompt. Your screen should look like this.

To see if Python is installed, type Python

If it is installed, you will see a screen showing the version that is installed. Your screen should look like the one pictured below.

Windows does not always come with Python installed, so you may need to download it to the Downloads folder and then install it. To begin, hold down the shift key and right-click the desktop to open the command window. In the terminal window, enter python in lowercase letters. If it is not installed, you will probably get an error message telling you that it is not a recognized command.

To download Python go to Python downloads

Go to your computer's Downloads folder and click on the application to install Python. Follow the on-screen instructions to complete the process. Make sure that you check the Add Python to PATH box.

For more detailed information, you may want to purchase the textbook "Python Crash Course" by Eric Matthes. It is an excellent, well-written and well-edited textbook on Python.

To download a text-editor program that this textbook suggests for beginners, go to geany text editor

Under Downloads click on Releases and then on the latest version for Windows, which as of this writing was geany-1.33 setup.exe

In your downloads folder click on the setup.exe file for Geany and follow on-screen instructions to install the text-editor to your computer.

Configuring Geany

To configure the Geany text editor, check out the video link below.

Configuring Geany text editor video

Day 2: Chi-square Analysis

Suppose you are the marketing specialist for a jewelry store. You have a brick and mortar business as well as an online presence. Your main target markets are Millennials, ages 18-34, and Generation X, ages 35-50. You want to find out how best to appeal to each group. You design a simple survey to determine if there is an association between consumers' age demographic and their preference for online or face to face purchases of fine jewelry.

The survey asks:

Are you between the ages of 18 and 34? Yes No
Are you between the ages of 35 and 50? Yes No
Are you over 50 years of age? Yes No
How would you like to purchase a piece of fine jewelry? Online, in store

Determining Sample Size

Our dilemma in doing the survey is this: how many people do we need to ask so that the results are representative of the entire population of fine jewelry buyers?
It would be impractical and costly to ask all buyers.
The answer to that question is influenced by a number of factors, including the purpose of our study, the population size of customers, and the risk of getting an unrepresentative sample.
We also need to account for sampling error.
In addition to the above factors, we need to consider three criteria to determine the appropriate sample size.
Level of precision - also called sampling error, this is the range in which the true value of the population is estimated to be. It is often expressed as a percentage, for example ± 5 percent in a survey of who should be president. If our findings are that one candidate is favored by 70% of the voters, then we can conclude that between 65% and 75% of all voters favor that candidate.
Level of confidence or risk - when a population is repeatedly sampled, the average value of the attribute obtained by those samples is equal to the true population value. The sample values are distributed normally about the true value, called the mean. Some samples will have a higher value and some will have a lower value. In a normal distribution, about 95% of the sample values are within two standard deviations of the mean. This means that if a 95% confidence level is chosen, 95 out of 100 samples will have the true population value within the specified range of precision. Raising the confidence level to 99% makes the study more rigorous but requires a larger sample; lowering it to 90% allows a smaller sample.

Degree of variability - this refers to the distribution of attributes in the population. The more heterogeneous a population, the larger the sample size required to obtain a given level of precision. The more homogeneous a population, the smaller the sample can be. A proportion of .5 indicates the maximum variability in a population, so it is often used to determine a more conservative sample size.
To come up with an estimate of variability, take a reasonable guess at the share of the population that has the attribute you are trying to measure. For example, if you estimate that 30% of the population buys their dog food at a pet store and 70% does not, then your estimate of variability would be 30%.
If the degree of variability is too difficult to determine, use 50% as the degree of variability. This is a conservative approach.
There are a number of ways to determine sample size:
Use a census - look at all members of a population.
Use the sample size of a similar study.
Use published tables that give the sample size for a given confidence level and level of precision.
Use a formula to calculate the sample size.
Our study is attempting to decide if age and method of purchasing our products are related.
For our study we will use the formula approach. The variables in the formula are included in the sample size calculator.
Use a confidence level of 95% (z = 1.96).
Use a ± 5% level of precision.
Use 50% as the estimated percentage.
Enter these values into the sample size calculator to determine the sample size.

Sample Size Calculator
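
For reference, calculators of this kind typically apply the standard formula for estimating a proportion, n = z² × p × (1 - p) / e². The sketch below assumes that formula and a large population (no finite-population correction); the function name sample_size is only illustrative and is not taken from the calculator.

```python
import math


def sample_size(z, p, e):
    """Sample size for estimating a proportion: n = z^2 * p * (1 - p) / e^2."""
    return (z ** 2) * p * (1 - p) / (e ** 2)


# Values used in this lesson: 95% confidence (z = 1.96),
# 50% estimated percentage, and +/- 5% precision.
n = sample_size(z=1.96, p=0.5, e=0.05)
print(round(n, 2))    # raw value from the formula
print(math.floor(n))  # whole respondents; this lesson's survey totals 384

# A less variable attribute (30% instead of 50%) needs a smaller sample.
print(round(sample_size(z=1.96, p=0.3, e=0.05), 2))
```

Using 50% gives the largest sample for a given confidence level and precision, which is why it is the conservative choice.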

To determine whether the association between two qualitative variables is statistically significant, we conduct a chi-square analysis.

In order to compare categorical variables, the data can be summarized in a table which lists the options for one variable as the rows and the options for the other variable as the columns. This approach is called a crosstab because two variables are being tabulated at the same time and the frequency or the percentage of individuals in each subcategory is being counted.
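
A crosstab can be built by counting how often each pair of answers occurs. Here is a minimal sketch that uses a handful of made-up survey responses (these are not the survey results used in this lesson):

```python
from collections import Counter

# Hypothetical responses: (age group, purchase preference)
responses = [
    ("Millennial", "Online"),
    ("Millennial", "Online"),
    ("Millennial", "Face to Face"),
    ("Generation X", "Face to Face"),
    ("Generation X", "Face to Face"),
    ("Generation X", "Online"),
]

# Count each (age group, preference) combination.
counts = Counter(responses)

age_groups = ["Millennial", "Generation X"]
preferences = ["Face to Face", "Online"]

# Print one row per preference and one column per age group.
print("Preference".ljust(14) + "".join(g.ljust(14) for g in age_groups))
for pref in preferences:
    row = [str(counts[(group, pref)]).ljust(14) for group in age_groups]
    print(pref.ljust(14) + "".join(row))
```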

For our example let us assume that after doing the survey we determined that 60% of the consumers were Millennials and 40% were Generation X.

Let's assume that you completed your survey. There were 230 Millennials and 154 Generation X. 212 or 55% preferred face to face purchases and 172 or 45% preferred online purchases.

Here is a table summarizing the results.

Purchasing preferences	Millennials	Generation X	Totals
Face to Face	?	?	212 (55%)
Online	?	?	172 (45%)
Totals	230 (60%)	154 (40%)	384

We first have to formulate a null hypothesis. It is represented as H0. Our null hypothesis stipulates that there is no significant association between customers' age group and their preference for online or face to face purchases.

Stated another way: If there is no association between the two variables, age and buying preference, then the individuals would be distributed across the cells of the table in proportion to the row and column totals.

Next, we formulate an alternative hypothesis, H1. The alternative hypothesis for a chi-square test is always two sided. It is technically multi-sided because the differences may occur in both directions in each cell of the table. We will state in the alternative hypothesis that there is a significant difference in the distribution of purchasing preference between Millennial and Gen X consumers.

Next we need to specify the expected value for each cell of the table when our null hypothesis is true. The expected value is what the value of each cell of the table would be if there was no association between the two variables.

The formula for computing the expected value of a cell is the row total times the column total divided by the table total.

Expected Values

Purchasing preferences	Millennials	Gen X	Totals
Face to Face	230*212/384 = 127	154*212/384 = 85	212 (55%)
Online	230*172/384 = 103	154*172/384 = 69	172 (45%)
Totals	230 (60%)	154 (40%)	384 (100%)
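
The expected values in the table above can be reproduced with a short loop. This is only a sketch; the dictionary names are illustrative, and the totals are the ones shown in the expected-values table.

```python
# Row and column totals from the expected-values table.
row_totals = {"Face to Face": 212, "Online": 172}
col_totals = {"Millennials": 230, "Gen X": 154}
table_total = 384

# Expected count = row total * column total / table total.
expected = {}
for pref, row_total in row_totals.items():
    for group, col_total in col_totals.items():
        expected[(pref, group)] = row_total * col_total / table_total

# Show each cell rounded to a whole number, as in the table.
for cell, value in expected.items():
    print(cell, round(value))
```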

To see if the data gives convincing evidence against the null hypothesis, we need to compare the observed (actual) counts with the expected counts, assuming that H0 is true.

Let's assume that you completed your survey. There were 230 Millennials and 154 Generation X. Here is a table summarizing the results.

Observed Values

Purchasing preferences	Millennials	Gen X	Totals
Face to Face	77	135	212
Online	153	19	172
Totals	230	154	384

Purchasing Preferences Age level Crosstabulation

Purchasing Pref.	Millennials	Gen X	Total
Face to Face/Observed	77	135	212(55%)
Face to Face/Expected	127	85
Online/Observed	153	19	172(45%)
Online/Expected	103	69
Total	60%	40%	100%

Χ² = Σ (Observed - Expected)² / Expected

Χ² = (77 - 127)²/127 + (153 - 103)²/103 + (135 - 85)²/85 + (19 - 69)²/69

Χ² = 2500/127 + 2500/103 + 2500/85 + 2500/69

Χ² = 19.69 + 24.27 + 29.41 + 36.23

Χ² = 109.60
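
The arithmetic for the chi-square statistic can be checked with a few lines of plain Python. This sketch uses the observed counts and the rounded expected counts from the crosstabulation above; the variable names are illustrative.

```python
observed = [77, 153, 135, 19]
expected = [127, 103, 85, 69]

# Each cell contributes (observed - expected)^2 / expected.
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(terms)

for o, e, t in zip(observed, expected, terms):
    print(f"({o} - {e})^2 / {e} = {t:.2f}")
print(f"Chi-square = {chi_square:.2f}")
```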

The chi-square statistic compares the observed values to the expected values. This test statistic is used to ascertain whether the difference between the observed and expected values is statistically significant.

The final step is to determine if the value of chi-square, 109.60, is large enough to reject the null hypothesis. Next we need to look up a chi-square distribution table of critical values. These tables are available in most statistics books.

Determine your experiment's degrees of freedom. Degrees of freedom are a measure of the amount of variability involved in the research, which is determined by the number of categories you are examining. For a crosstab, the equation is Degrees of freedom = (r - 1) × (c - 1), where "r" is the number of rows and "c" is the number of columns in the table.

Our table has two rows (online and face to face purchases) and two columns (Millennials and Generation X), therefore our degrees of freedom is (2 - 1) × (2 - 1), or 1 degree of freedom.

P value is a statistical measure that helps scientists determine whether or not their hypotheses are correct. P values are used to determine whether the results of their experiment are within the normal range of values for the events being observed. Usually, if the P value of a data set is below a certain pre-determined amount (like, for instance, 0.05), scientists will reject the "null hypothesis" of their experiment - in other words, they'll rule out the hypothesis that the variables of their experiment had no meaningful effect on the results. Today, p values are usually found on a reference table by first calculating a chi square value.¹

Use .05 as the significance level, the cutoff for the p value. This corresponds to the 95% confidence level we used in calculating sample size. Use 1 degree of freedom.

Chi-square Distribution table

v	0.90	0.95	0.975	0.99	0.999
1	2.706	3.841	5.024	6.635	10.828
2	4.605	5.991	7.378	9.210	13.816
3	6.251	7.815	9.348	11.345	16.266
4	7.779	9.488	11.143	13.277	18.467
5	9.236	11.070	12.833	15.086	20.515
6	10.645	12.592	14.449	16.812	22.458
7	12.017	14.067	16.013	18.475	24.322
8	13.362	15.507	17.535	20.090	26.125
9	14.684	16.919	19.023	21.666	27.877
10	15.987	18.307	20.483	23.209	29.588

Next we look at the table above for 1 degree of freedom and the .05 significance level (the 0.95 column), and we see that the chi-square critical value is 3.841.

Our chi-square value of about 109.6 is much larger than the critical value of 3.841, so the null hypothesis can be rejected.
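
As a cross-check on the table lookup: for 1 degree of freedom, the chi-square tail probability can be computed with the standard library alone, because a chi-square variable with 1 degree of freedom is the square of a standard normal variable. The sketch below relies on that identity, so it does not apply to other degrees of freedom, and the function name is illustrative.

```python
import math


def chi_square_p_value_df1(statistic):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


critical_value = 3.841  # from the table: 1 degree of freedom, 0.95 column
chi_square = 109.6      # approximate statistic from the observed and expected counts

print(chi_square_p_value_df1(critical_value))  # close to 0.05
print(chi_square_p_value_df1(chi_square))      # far below 0.05
print(chi_square > critical_value)             # True, so H0 is rejected
```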

We will accept the alternative hypothesis: There is a significant difference in the distribution of purchasing preference between Millennial and Gen X consumers.

Now let's use a Python program to calculate chi-square. Copy the code listed below and paste it into Spyder. Save and execute. Does Python give you the same chi-square number as your manual calculations?

# -*- coding: utf-8 -*-
"""
Created on Tue May 28 16:47:58 2019

@author: jerrybelch
"""

from scipy.stats import chisquare

# Observed counts first, then the expected counts under H0.
result = chisquare([77, 153, 135, 19], f_exp=[127, 103, 85, 69])
print(result)

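The first group of numbers are the observed or actual results. The second group of numbers represents the expected frequencies. The first value in the Python output is the chi-square statistic. The second value is a p value, which chisquare calculates by default with 3 degrees of freedom (the number of values minus 1), so for this 2 x 2 table rely on the critical value for 1 degree of freedom found above.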

Using Microsoft's Excel to calculate Chi-Square

You can also use a spreadsheet to calculate chi-square. I used Microsoft's Excel 365. Download the template and enter the observed frequencies in the appropriate cells. Chi-Square spreadsheet

What were your results? Did they agree with the manual and Python methods?

Interpretation of the results

Since the test is significant, we should examine the data to learn the nature of the relationship.

Generation X preferred face to face sales over online sales 135 to 19, over 7 to 1.
Millennials preferred online sales over face to face sales 153 to 77, about 2 to 1.

What can you do with this information?

Design your website's online order cart with the younger group in mind, making it responsive to different devices (cell phones and tablets).
Have an older, experienced sales staff in your brick and mortar store and give your walk-in customers lots of personal attention.

Day 3: Decision Trees

A decision tree is used to help a person make a prediction by asking a series of questions. Each question can only have two possible responses, such as yes or no. You begin with the part of the decision tree called the root node. This is the problem that you are trying to solve. Some examples might include:

What type of business to start
Whether a new customer might buy a given product from an online store
What items might you suggest to a returning customer, that they might be interested in purchasing based on past purchases.

An example of a decision tree is pictured below.

The tree pictured here is designed to help us decide what type of business we should invest our money in: a food truck, a restaurant or a yogurt shop. You can see that based on information obtained from an outside source, we could make $290,500 by investing in a food truck business and we might lose $45,000 if things do not go well. The restaurant business potentially would give us $600,000 in profit if we do well, but poses a significant loss of $120,000. The yogurt shop could earn $400,000 or lose $75,000. A decision tree can help us determine which of the three investments is the best one. The main idea of decision trees is to find those descriptive features which contain the most information regarding the target feature (type of business to invest in).

expectedValueFoodTruck = (.60*290500) +(.40*-45000)

expectedValueRestaurant = (.52*600000) +(.48*-120000)

expectedValueYogurtShop = (.50*400000) +(.50*-75000)


print ("Expected value of food truck business is $" +str(expectedValueFoodTruck))
print ("Expected value of a restaurant business is $" + str(expectedValueRestaurant))
print ("Expected value of a yogurt shop business is $" +str(expectedValueYogurtShop))

Copy and paste the code above into the Geany text editor. Here is how the algorithm works.

The expected value is calculated in such a way that it includes all possible outcomes for a decision. The success and failure rates below were obtained from an outside source, such as a company that sells this type of information. (These numbers are approximations only.)

Business success and failure rates for a food truck business are 60 percent and 40 percent.
Business success and failure rates for a restaurant are 52 percent and 48 percent.
Business success and failure rates for a yogurt shop are 50 percent and 50 percent.
We first declare a variable for expected value for each.
Next we multiply the success rate of each by the amount of expected income.
Next we multiply the failure rate by the amount of expected loss.
Now we add these amounts together to come up with our final calculation.
Lastly, we print out the results.
When we execute the program, we can see the results.

Food truck $156,300
Restaurant $254,400
Yogurt shop $162,500

The best investment, based on our decision tree calculations, would be to invest our money in a restaurant.
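
The same calculation can be written once as a function and applied to each option. This is a minimal sketch using the rates and dollar amounts from the program above; the function and variable names are illustrative.

```python
def expected_value(p_success, gain, loss):
    """Weight the gain by the success rate and the loss by the failure rate."""
    p_failure = 1 - p_success
    return p_success * gain - p_failure * loss


# Success rate, potential gain, potential loss for each business.
options = {
    "food truck": (0.60, 290500, 45000),
    "restaurant": (0.52, 600000, 120000),
    "yogurt shop": (0.50, 400000, 75000),
}

values = {name: expected_value(*figures) for name, figures in options.items()}
best = max(values, key=values.get)

for name, value in values.items():
    print(f"Expected value of {name}: ${value:,.0f}")
print("Best option by expected value:", best)
```

Keeping the figures in one dictionary makes it easy to add another business or to try different success rates.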

Day 4: Decision Tree: Gini Index

Let's suppose that you are a company that sells video games to adult gamers. You want to find out what factors influence the purchase and playing of video games. You obtain information from a company that sells data about the purchasing habits and usage of video gamers. You want to examine how age, gender, and education affect video gaming time so as to direct your marketing efforts.

An algorithm called the Gini Index is a useful tool to help us decide whether gender, age, or education is the biggest influence on playing video games. The Gini Index says that if we select two items from a population at random, they must be of the same class, and the probability of this is 1 if the population is pure. It works with a categorical target variable, "Success" or "Failure". Success equates with playing video games and failure equates with not playing video games. It performs binary splits. The higher the value of the Gini Index, the higher the homogeneity and the more important the variable. The formula is the sum of the squares of the probabilities of success and failure (p^2 + q^2). We want to segregate our users based on the target variable (playing video games or not). Here is the information that we will work with.

Our sample size is 300 individuals: 100 females and 200 males.
50 percent play video games.
50 percent of the females play video games.
55 percent of the males play video games
The age groups we are examining are 18-29 year olds and 30-49 year olds.
140 individuals identified themselves as between 18 and 29 years of age.
160 identified themselves as between 30 and 49 years of age.
80 identified themselves as not having a high school diploma.
120 identified themselves as having a high school diploma.
100 said that they had a four year college degree or more.

#300 respondents in random sample
#100 females in sample and only 50% or 50 out of 100 of them play video games
#200 males in sample 110 out of 200 or 55% play video games
#140 in 18- 29 year olds
#160 in 30-49 year old group
#80 less than high school
#120 high school graduates no college in group
#100 college grads 4 or more years of college in study
#Split on Gender
female= (0.5)*(0.5) +(0.5) * (0.5)
female = round(female,3)
male = (0.55)*(0.55) + (0.45) * (0.45)
male = round(male,3)
weightedGenderSplit = (100/300)*0.5+(200/300)*0.505
weightedGenderSplit = round(weightedGenderSplit,3)

print("Gender Split")
print("Gini for sub-node Female " + str(female))
print("Gini for sub-node male " + str(male))
print("Weighted Gini for Split Gender " + str(weightedGenderSplit))
print("")

print("Age Split")
#age1 - 18-29 year olds 
#age2 - 30-49 year olds 
age1 = (0.81) * (0.81) + (0.19) * (0.19)
age2 = (0.60) * (0.60) + (0.40) * (0.40)
age1 = round(age1,4)
age2 = round(age2,4)

weightedAgeSplit = (140/300) * 0.6922 + (160/300) *(0.52)
weightedAgeSplit = round(weightedAgeSplit,4)

print("Gini for sub-node age1(18-29 years old) = " + str(age1))
print("Gini for sub-node age2(30-49 years old) = "  + str(age2))

print("Weighted Gini for split age = " + str(weightedAgeSplit))
print("")
#Education split
lessThanHighSchool = (.40) * (.40) + (.60) * (.60)
lessThanHighSchool = round(lessThanHighSchool,3)
highSchool = (.51) * (.51) + (.49) * (.49)
highSchool = round(highSchool,4)
collegeGrad = (.57) * (.57) + (.43) * (.43)
collegeGrad= round(collegeGrad,4)
weightedEducationSplit = (80/300) * .52 + (120/300) *.50 + (100/300)*.5098
weightedEducationSplit  = round(weightedEducationSplit,4)
print("Education split")
print("Gini split of non high school graduates " + str(lessThanHighSchool))
print("Gini split of high school graduates " + str(highSchool))
print("Gini split for college graduates " + str(collegeGrad))
print("Weighted Education Split " + str(weightedEducationSplit))
print("")
print("")

if weightedGenderSplit > weightedAgeSplit and weightedGenderSplit > weightedEducationSplit:
    print("The node split will take place on Gender")
elif weightedAgeSplit > weightedGenderSplit and weightedAgeSplit > weightedEducationSplit:
    print("The node split will take place on age")
elif weightedEducationSplit > weightedAgeSplit and weightedEducationSplit > weightedGenderSplit:
    print("The node split will take place on education")
else:
    print("no data available")

Put the above code onto the clipboard and then paste it into Geany. Save using a .py extension. Let us examine the code. The # allows us to make comments about the code. Here is where we show what the variables are.

#Split on Gender

female= (0.5)*(0.5) +(0.5)*(0.5)

This line represents the sum of square of probability for success and failure (p^2 + q^2)

The success part of the equation represents the share of females in our sample who say that they play video games, 50%.

The failure part of the equation is the share of females who do not play video games. It is also 50%.

Each probability is squared (multiplied by itself) and then the squares are added together.

This line equates to .50

female = round(female,3) This line rounds off the variable female to three decimal places.

male = (0.55)*(0.55) + (0.45) * (0.45) This line shows that 55% of the males in our random sample say they play video games. This is the success part of the formula. If 55% say they play, then 45% do not play video games (1.00 - .55 = .45).

The result of this line is (.3025 + .2025 = .505)

male = round(male,3) This line rounds off the answer to three decimals.

weightedGenderSplit = (100/300)*0.5+(200/300)*0.505 This line produces a weighted average. There are 100 females in our study and 200 males. Each group's share of the sample is multiplied by that group's Gini value (0.5 for females, 0.505 for males), and the two results are added together.

weightedGenderSplit = round(weightedGenderSplit,3) This line rounds off the answer to three decimals.

The next lines print the results from the lines above.

print("Age Split") This line prints the title of the next group Age

#age1 - 18-29 year olds This comment line describes the variable age1

#age2 - 30-49 year olds This comment line describes the variable age2

age1 = (0.81) * (0.81) + (0.19) * (0.19) This line contains the algorithm and shows that 81% of the respondents aged 18-29 in our survey indicated that they play video games. Conversely, we can calculate that 19% do not play video games.

If you calculate the results of this equation, you will get .81 * .81 = .6561, .19 * .19 = .0361, and .6561 + .0361 = .6922

age2 = (0.60) * (0.60) + (0.40) * (0.40) This line also contains the algorithm and shows that 60% of 30-49 year olds play video games and 40% do not play.

If you calculate the results of this equation, you will get .60 * .60 = .36, .40 * .40 = .16, and .36 + .16 = .52

The next two lines round off the results for age1 and age2.

weightedAgeSplit = (140/300) * 0.6922 + (160/300) *(0.52) This line calculates a weighted age split. 140 out of the 300 in the sample stated that they were in age group 18-29.

We first calculate the share of 18-29 year olds (140 divided by 300) and then multiply by the amount we calculated for the age1 split, .6922.

We then calculate the share of 30-49 year olds (160/300) and multiply it by the value obtained for the age2 split, 0.52.

Next we add the two results together (.32303 + .27733 = .60036)

The next line rounds off the weightedAgeSplit to four places.

#Education split

Our survey contains three groups based on education level. The first group was non-high school graduates. Eighty out of our sample of 300 indicated that they did not graduate from high school. Forty percent of this group play video games and 60% do not play.

lessThanHighSchool = (.40) * (.40) + (.60) * (.60)

highSchool = (.51) * (.51) + (.49) * (.49) This line is for the high school graduates. One hundred twenty indicated this on the survey and 51% indicated that they played video games while 49% did not play.

collegeGrad = (.57) * (.57) + (.43) * (.43) This line is for those completing 4 or more years of college. One hundred responded that they had a 4-year degree or more and 57% of those play video games.

.57 * .57 = .3249 .43 * .43 = .1849 .3249 + .1849 = .5098

weightedEducationSplit = (80/300) * .52 + (120/300) *.50 + (100/300)*.5098

This line calculates the weightedEducationSplit. Remember that 80 were non-grads, and their Gini value is .52. 120 were HS grads, and their Gini value is .5002, entered here as .50. There were 100 college grads, and their Gini value is .5098. Each group's share of the sample is multiplied by its Gini value and the results are added together.

The next few lines print the results of these calculations.

The final lines compare results and based on the results, a recommended group is displayed (gender, age or education)

Execute the program. Let's look at the results.

Category	Gini value
Female	0.500
Male	0.505
Weighted Gender	0.503
18-29 year olds	0.6922
30-49 year olds	0.52
Weighted Age	0.6004
Non HS grads	0.52
High school grads	0.5002
College grads	0.5098
Weighted Education	0.5086

As you can see the most significant variable of the three is age. The 18-29 year olds play more video games than the older group.

This information can be used to redefine our target market, design video games for this segment of our market, and direct advertisements and email campaigns toward this group.

We also found out that gender does not matter that much and neither does the educational level attained by the respondents in the study.
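
The Gini program repeats the same two calculations for every split, so it can also be written with two small helper functions. This is a sketch rather than a replacement for the program above: the function names are illustrative, and it keeps full precision instead of rounding the sub-node values before weighting them.

```python
def gini_score(p_success):
    """Sum of squared probabilities for success and failure: p^2 + q^2."""
    q_failure = 1 - p_success
    return p_success ** 2 + q_failure ** 2


def weighted_gini(groups):
    """Weight each sub-node's score by its share of the sample.

    groups is a list of (group size, share of the group that plays) pairs.
    """
    total = sum(size for size, _ in groups)
    return sum(size / total * gini_score(p) for size, p in groups)


# Group sizes and shares of players, taken from this lesson.
splits = {
    "gender": [(100, 0.50), (200, 0.55)],
    "age": [(140, 0.81), (160, 0.60)],
    "education": [(80, 0.40), (120, 0.51), (100, 0.57)],
}

scores = {name: weighted_gini(groups) for name, groups in splits.items()}
for name, score in scores.items():
    print(f"Weighted Gini for split on {name}: {score:.4f}")

print("The node split will take place on", max(scores, key=scores.get))
```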

Day 5: Frequent Itemsets via Apriori Algorithm

The market basket analysis approach uses frequent itemsets to find groups of items that occur together frequently in a dataset.

Let's look at an example of a dataset. It is for a sporting goods store. Each line is called an itemset, and represents a sale to an individual.

SPORTS = [
   ['skates', 'Rawlings glove', 'survival knife','Spaulding ball'  , 'Nike shorts', 'Nike headband', 'flashlight'],
   ['treadmill', 'Rawlings glove', 'Spaulding ball', 'Nike shorts', 'Nike headband', 'survival knife','flashlight'],
   ['skates', 'range finder', 'Nike shorts', 'Nike headband','survival knife', 'wakeboard'],
   ['skates', 'survival knife', 'wakeboard', 'Nike shorts', 'flashlight'],
   ['wakeboard', 'Rawlings glove',  'Nike shorts', 'shotgun','survival knife', 'Nike headband'],
   ['survival knife','wakeboard', 'Rawlings glove',  'Nike shorts', 'shotgun', 'Nike headband'],
   ['skates','junior football','mouthguard', 'heavy bag', 'survival knife','Nike shorts'],
   ['shinguards','survival knife',],
   ['baseball cleats', 'Rawlings glove','survival knife', 'Nike shorts', 'Nike headband'],
   ['Razor skooter', 'Kore helmut', 'survival knife',],
   ['weight bench', 'training gloves','survival knife', 'Nike headband'],
   ['skates', 'Rawlings glove', 'Spaulding ball','survival knife', 'Nike shorts', 'Nike headband', 'flashlight']
]

The name of the dataset is SPORTS. There are a total of 12 customers (numbered 0-11 in Python). The first customer purchased the following items: skates, Rawlings glove, survival knife, Spaulding ball, Nike shorts, Nike headband, and a flashlight.

The goal is to find items that appear together across the individual sales. If you count how often each item appears in the 12 sales, you will see the following.

Nike shorts were purchased by 9 out of the 12 customers. 9/12 = 75%
Nike headbands were purchased by 8 out of the 12 customers, which equals 67%.
Flashlight: 4 out of 12 = 33%
Rawlings glove: 6 out of 12 = 50%
Skates were purchased by 5 out of the 12 customers, which equals 42%.
The survival knife was the most popular item. It appears in all 12 sales, or 100% of the dataset.
Spaulding ball: purchased by 3 out of the 12 = 25%
We are most interested in itemsets with two or more items that appear in at least 50% of the sales.
Some itemsets contain several items that customers purchased together.
Let's look at one itemset that contains 2 different items: Nike shorts and survival knife.
Look at the dataset above. Both items appear for customers 1, 2, 3, 4, 5, 6, 7, 9, and 12.
That is 9 out of 12, or 75% of our customers, who bought these two items together.
One cannot easily see why a buyer would buy these two items together, but it is certainly worth investigating.
Even knowing just that these two items were purchased together is important. We might want to bundle these two items together on the shelf or on our web site, or offer discounts for this dual purchase.
If we look at the dataset more closely, we can see other dual purchases for multiple customers, such as Nike headbands and Nike shorts, Rawlings glove and Nike headband, Nike shorts and skates, and others.
We could also count the number of times each of these combinations occurred to find the percentage. That is a lot of time-consuming work.
With only 12 transactions in the dataset, we can just visually examine it for the information we are looking for.
Suppose our dataset had 10,000 transactions. It could take forever.
That is why we need the Python programming language, an algorithm, and the computer to help us find these relationships in our customers' data.
The items in each sale are separated by commas. The comma is known as the delimiter. It may be another character such as a semicolon. Each itemset in the whole dataset is enclosed in square brackets. A comma is put at the end of each transaction or row. Comma Separated Value files can be created in any text editor. They can also be created using the Save As option in Microsoft Excel. The entire dataset is enclosed in square brackets. The last row of the dataset does not have a comma after the square bracket.
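
The counting described above can be sketched in plain Python. To keep the example short it uses a made-up list of four sales rather than the SPORTS dataset, and the variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

# A short, made-up list of sales (not the full SPORTS dataset).
sales = [
    ["skates", "Nike shorts", "Nike headband"],
    ["Rawlings glove", "Nike shorts", "Nike headband"],
    ["skates", "Nike shorts"],
    ["Rawlings glove", "flashlight"],
]

item_counts = Counter()
pair_counts = Counter()
for sale in sales:
    items = sorted(set(sale))  # ignore duplicates within one sale
    item_counts.update(items)
    pair_counts.update(combinations(items, 2))

total = len(sales)

# Share of sales that contain each single item.
for item, count in item_counts.most_common():
    print(f"{item}: {count} out of {total} = {count / total:.0%}")

# Pairs that appear together in at least 50% of the sales.
for pair, count in pair_counts.most_common():
    if count / total >= 0.5:
        print(f"{pair}: {count} out of {total} = {count / total:.0%}")
```

Sorting the items in each sale means a pair is always stored in the same order, so the same two items are never counted under two different keys.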

Data mining is a process used by companies to turn raw data into useful information. By using software to look for patterns in large batches of data, businesses can learn more about their customers, develop more effective marketing strategies, increase sales and decrease costs.

There are a number of algorithms available to us. We will focus our attention on the Apriori algorithm.

The apriori algorithm has been designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered as "frequent" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database.

It makes several passes over the dataset, increasing the size of itemsets that are being counted each time. It filters out irrelevant itemsets by using the knowledge gained in the previous passes.
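
The pass-by-pass idea can be illustrated with a small, simplified version written in plain Python. This is a teaching sketch, not the implementation in the mlxtend library used later in this lesson; the function name apriori_sketch and the four made-up sales are assumptions for illustration.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    total = len(transactions)

    def support(itemset):
        # Share of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / total

    # Pass 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    size = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        size += 1
        # Build larger candidates only from itemsets that were frequent.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Drop candidates with an infrequent subset, then count the rest.
        current = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, size - 1))
            and support(c) >= min_support
        }
    return frequent


# Made-up sales, not the SPORTS dataset.
sales = [
    ["skates", "Nike shorts", "Nike headband"],
    ["Rawlings glove", "Nike shorts", "Nike headband"],
    ["skates", "Nike shorts"],
    ["Rawlings glove", "flashlight"],
]
result = apriori_sketch(sales, min_support=0.5)
for itemset, value in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

The pruning step is what uses the knowledge gained in the previous pass: a three-item candidate is never counted if one of its two-item subsets was already found to be infrequent.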

The code in this lesson uses the Apriori implementation in the mlxtend library, which works alongside the libraries in the Anaconda package. If mlxtend is not already part of your installation, install it first, for example with pip install mlxtend.

The Apriori algorithm produces the same results as other algorithms. With only 12 sales and a few items in each sale, this task is quite simple. Imagine examining 100,000 sales.

What does knowing this information do to help us determine how to better market our products? Knowing what to do with related products is simple. If a customer buys a baseball glove, you might recommend that they also buy a baseball and a baseball hat. These associations would be simple to see in a sales transaction.

What is the association between rink skates and Nike shorts? That is a good question. If we see that there are a large number of these sales, we need to investigate.

We might make the assumption that a baseball pitcher that bought a glove might need a headband to soak up the sweat.

Day 6: Anaconda - includes all libraries in download

Anaconda Distribution. With over 6 million users, the open source Anaconda Distribution is the fastest and easiest way to do Python and R data science and machine learning on Linux, Windows, and Mac OS X. It's the industry standard for developing, testing, and training on a single machine. Anaconda is a Python-based data processing and scientific computing platform. It comes with many very useful third-party libraries built in. Installing Anaconda is equivalent to automatically installing Python and some commonly used libraries such as NumPy, Pandas, SciPy, and Matplotlib, so it makes installation much easier than a regular Python installation.

Look on-line to see how to install Anaconda for your operating system. One of the packages that comes with Anaconda is Spyder, which is the IDE text editor that I used for this lesson.
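Once Anaconda is installed, a quick check like the sketch below can confirm that the libraries used in this lesson are available. The helper name is_installed is made up for illustration. The list includes mlxtend because the lesson's code imports it; depending on your setup, it may need to be installed separately.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return find_spec(module_name) is not None


# Libraries named in this lesson; mlxtend is the one the program imports apriori from.
for name in ['numpy', 'pandas', 'scipy', 'matplotlib', 'mlxtend']:
    status = 'found' if is_installed(name) else 'missing'
    print(f'{name}: {status}')
```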

# -*- coding: utf-8 -*-
"""
Created on Fri Apr 12 15:36:47 2019

@author: jerrybelch
"""

# -*- coding: utf-8 -*-
"""
Created on Thu Mar 14 11:00:16 2019

@author: jerrybelch
"""
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
SPORTS = [
    ['skates', 'Rawlings glove', 'survival knife', 'Spaulding ball', 'Nike shorts', 'Nike headband', 'flashlight'],
    ['treadmill', 'Rawlings glove', 'Spaulding ball', 'Nike shorts', 'Nike headband', 'survival knife', 'flashlight'],
    ['skates', 'range finder', 'Nike shorts', 'Nike headband', 'survival knife', 'wakeboard'],
    ['skates', 'survival knife', 'wakeboard', 'Nike shorts', 'flashlight'],
    ['wakeboard', 'Rawlings glove',  'Nike shorts', 'shotgun', 'survival knife', 'Nike headband'],
    ['survival knife','wakeboard', 'Rawlings glove',  'Nike shorts', 'shotgun', 'Nike headband'],
    ['skates','junior football',' mouthguard', 'heavy bag', 'survival knife','Nike shorts'],
    ['shinguards', 'survival knife'],
    ['baseball cleats', 'Rawlings glove', 'suvival knife', 'Nike shorts', 'Nike headband'],
    ['Razor skooter', 'Kore helmut', 'survival knife'],
    ['weight bench', 'training gloves', 'survival knife', 'Nike headband'],
    ['skates', 'Rawlings glove', 'Spaulding ball', 'survival knife', 'Nike shorts', 'Nike headband', 'flashlight']
]


te = TransactionEncoder()
te_ary = te.fit(SPORTS).transform(SPORTS)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

apriori(df, min_support= 0.40, use_colnames=True)
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets
frequent_itemsets[(frequent_itemsets['length'] == 2) & (frequent_itemsets['support'] >=0.5)]
frequent_itemsets[(frequent_itemsets['length'] == 2) & (frequent_itemsets['support'] >=0.6)]

Load Spyder. Copy and paste the code above into Spyder. Save the program in your working directory. Make sure you add a .py extension to identify your work as a Python file. Spyder Tutorial

The sports dataset is a Python list. The list is one of the most versatile data types in Python; it is written as comma-separated values (items) between square brackets.
Next we import the pandas library, which provides the data frame, a two-dimensional structure in which the data is aligned in rows and columns.
The following lines of code came from an article by Sebastian Raschka http://rasbt.github.io/mlxtend/user_guide/preprocessing/TransactionEncoder/
The TransactionEncoder object transforms this dataset into an array format suitable for typical machine learning APIs.
By changing min_support, we can fine-tune the information we are looking for.
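To get a feel for the array format, here is a pure-Python sketch of the same idea: one column per distinct item and one True/False row per transaction. The function one_hot and the two-sale sample are made up for illustration; this is not the mlxtend implementation, only an approximation of what it produces.

```python
def one_hot(transactions):
    """Return (columns, rows): sorted item names and one True/False row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[col in t for col in columns] for t in transactions]
    return columns, rows


# Hypothetical two-sale sample using item names from the SPORTS list.
sample = [
    ['skates', 'Nike shorts'],
    ['Nike shorts', 'survival knife'],
]

columns, rows = one_hot(sample)
print(columns)
for row in rows:
    print(row)
```

Each row answers the question "was this item in the sale?" for every column, which is the kind of table the apriori function expects.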

Now let's go through the code a few lines at a time. First highlight the first three lines of code (lines 14, 15, and 16), then the entire SPORTS dataset, and then press the F9 key.

The F9 key runs the current line or the highlighted selection, which makes it easier to see your mistakes.

It should look like this in the console window:

Now move the cursor down and highlight lines 33,34,35,36 and press F9 again. Your console will contain the following information.

Press F9 key again.

Your console should look like the above.


apriori(df, min_support = 0.40, use_colnames=True)
Out[10]: 
     support                                           itemsets
0   0.666667                                    (Nike headband)
1   0.750000                                      (Nike shorts)
2   0.500000                                   (Rawlings glove)
3   0.416667                                           (skates)
4   0.833333                                   (survival knife)
5   0.583333                       (Nike headband, Nike shorts)
6   0.500000                    (Nike headband, Rawlings glove)
7   0.583333                    (Nike headband, survival knife)
8   0.500000                      (Nike shorts, Rawlings glove)
9   0.416667                              (skates, Nike shorts)
10  0.666667                      (Nike shorts, survival knife)
11  0.416667                   (survival knife, Rawlings glove)
12  0.416667                           (skates, survival knife)
13  0.500000       (Nike headband, Nike shorts, Rawlings glove)
14  0.500000       (Nike headband, Nike shorts, survival knife)
15  0.416667    (Nike headband, survival knife, Rawlings glove)
16  0.416667      (survival knife, Nike shorts, Rawlings glove)
17  0.416667              (skates, Nike shorts, survival knife)
18  0.416667  (Nike headband, survival knife, Nike shorts, R...

This is where the Apriori algorithm is called. It looks at the dataset and shows only those itemsets that have a minimum support of 40%. That means that each itemset shown appears in at least 40% of the transactions in the whole dataset.

For example, Nike headbands are in 67% of all transactions. Nike shorts come in at 75%. Nike shorts and the survival knife together register 67%.
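The support decimals in the console can be turned back into a number of sales. The sketch below assumes the 12 sales in the SPORTS list; the helper name support_to_count is made up for illustration.

```python
def support_to_count(fraction, total_transactions):
    """Convert a support fraction back into a whole number of transactions."""
    return round(fraction * total_transactions)


total_sales = 12  # number of sales in the SPORTS list

# Support values copied from the console output above.
for name, fraction in [('Nike headband', 0.666667), ('Nike shorts', 0.750000)]:
    count = support_to_count(fraction, total_sales)
    print(f'{name}: {count} of {total_sales} sales ({fraction:.0%})')
```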

Press F9 key once more to see the following.

      support                                           itemsets  length
0   0.666667                                    (Nike headband)       1
1   0.750000                                      (Nike shorts)       1
2   0.500000                                   (Rawlings glove)       1
3   0.416667                                           (skates)       1
4   0.833333                                   (survival knife)       1
5   0.583333                       (Nike headband, Nike shorts)       2
6   0.500000                    (Nike headband, Rawlings glove)       2
7   0.583333                    (Nike headband, survival knife)       2
8   0.500000                      (Nike shorts, Rawlings glove)       2
9   0.416667                              (skates, Nike shorts)       2
10  0.666667                      (Nike shorts, survival knife)       2
11  0.416667                   (survival knife, Rawlings glove)       2
12  0.416667                           (skates, survival knife)       2
13  0.500000       (Nike headband, Nike shorts, Rawlings glove)       3
14  0.500000       (Nike headband, Nike shorts, survival knife)       3
15  0.416667    (Nike headband, survival knife, Rawlings glove)       3
16  0.416667      (survival knife, Nike shorts, Rawlings glove)       3
17  0.416667              (skates, Nike shorts, survival knife)       3
18  0.416667  (Nike headband, survival knife, Nike shorts, R...       4

The display above just adds the number of items in each itemset, shown in the length column. For example, itemset 17 has 3 items: skates, Nike shorts, and a survival knife.

Press F9 key once more and the information in our dataset is filtered down to those with 50% support, length 2.

     support                         itemsets  length
5   0.583333     (Nike headband, Nike shorts)       2
6   0.500000  (Nike headband, Rawlings glove)       2
7   0.583333  (Nike headband, survival knife)       2
8   0.500000    (Nike shorts, Rawlings glove)       2
10  0.666667    (Nike shorts, survival knife)       2

frequent_itemsets[ (frequent_itemsets['length'] == 2) & (frequent_itemsets['support'] >=0.6)]
Out[15]: 
     support                       itemsets  length
10  0.666667  (Nike shorts, survival knife)       2

Let's analyze the information obtained in this last filtering. We are zeroing in on what we are looking for.

Nike shorts and Nike headbands appear together in 58% of transactions. These two purchases together seem logical.
Nike headbands and Rawlings gloves were purchased together by 50% of our customers. This association does not seem logical.
Nike headbands and the survival knife were purchased together 58% of the time. This association does not just jump out at us either.
Nike shorts and Rawlings gloves were found together in 50% of all transactions.
Nike shorts and the survival knife were purchased together by 67% of our customers. Why? This association is not logical and needs investigating. We probably need to talk to some of the customers that made these purchases. The information gained from this can help us direct our marketing efforts to promote these items together.
Press F9 again to see 60% support with 2 items.
Nike shorts and the survival knife, two items, were purchased together 67% of the time.
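The two filters used in this step can be tried on a small hand-built table. The sketch below assumes pandas is installed; the five rows are copied from the console output above, and the variable names table, pairs_50, and pairs_60 are made up for illustration.

```python
import pandas as pd

# A hand-built stand-in for frequent_itemsets, using five rows from the output above.
table = pd.DataFrame({
    'support': [0.750000, 0.583333, 0.500000, 0.666667, 0.500000],
    'itemsets': [
        frozenset({'Nike shorts'}),
        frozenset({'Nike headband', 'Nike shorts'}),
        frozenset({'Nike headband', 'Rawlings glove'}),
        frozenset({'Nike shorts', 'survival knife'}),
        frozenset({'Nike headband', 'Nike shorts', 'Rawlings glove'}),
    ],
})
table['length'] = table['itemsets'].apply(len)  # number of items in each itemset

# Each condition needs its own parentheses before the two are combined with &.
pairs_50 = table[(table['length'] == 2) & (table['support'] >= 0.5)]
pairs_60 = table[(table['length'] == 2) & (table['support'] >= 0.6)]
print(pairs_50)
print(pairs_60)
```

Raising the support cutoff from 0.5 to 0.6 leaves only the Nike shorts and survival knife pair.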

Day 7: Creating your own data set

McDonald's has online ordering for most of its stores now. You will create a dataset out of the orders listed below.

Put the dataset below on your clipboard.
Open up Spyder and paste the code into the editor.
Save your file with a .py extension in your working folder.

# -*- coding: utf-8 -*-
"""
Created on Sun Apr 14 13:01:55 2019

@author: jerrybelch
"""
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
FOOD = [
        ['Apple Slices','Baked Apple Pie', 'Coffee'],
        ['Apple Slices', 'Oatmeal Raisen Cookie'],
        ['Coke', 'Quarter Pounder', 'Hot Cakes',  'Fries', 'Big Mac', 'Fries','Chicken McNuggets'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Hamburger', 'Quarter Pounder', 'Egg McMuffin', 'Hot Cakes', 'Fries', 'Big Mac', 'Chicken McNuggets'],
        ['Coke', 'Cheeseburger', 'Fries', 'Big Mac', 'Egg McMuffin', 'Hash Browns'],
        ['Coke', 'Filet-O-Fish', 'Egg McMuffin', 'Hash Browns', 'Fries', 'Chicken McNuggets', 'Big Mac'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Egg McMuffin','Hash Browns', 'Quarter Pounder', 'Egg McMuffin', 'Quarter Pounder', 'Fries', 'Hot Fudge Sundae', 'Big Mac'],        
        ['Coke', 'Quarter Pounder', 'Hot Cakes', 'Egg McMuffin', 'Hash Browns', 'Fries', 'Big Mac', 'Chicken McNuggets'],
        ['Hamburger', 'Quarter Pounder', 'Egg McMuffin', 'Hot Cakes', 'Fries', 'Big Mac', 'Hash Browns', 'Chicken McNuggets'],
        ['Coke', 'Cheeseburger', 'Fries','Hash Browns', 'Egg McMuffin', 'Big Mac'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Coke', 'Filet-O-Fish', 'Egg McMuffin', 'Hash Browns', 'Fries', 'Chicken McNuggets', 'Big Mac'],
        ['Egg McMuffin', 'Quarter Pounder', 'Quarter Pounder', 'Egg McMuffin', 'Fries', 'Hot Fudge Sundae', 'Big Mac', 'Coke'],
        ['Big Mac', 'Chicken McNuggets', 'Quarter Pounder', 'Egg McMuffin', 'Hash Browns', 'Fries', 'Coke'],
        ['Big Mac', 'Fries','Egg McMuffin', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Mac', 'Fries', 'Egg McMuffin', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Mac', 'Fries', 'Egg McMuffin', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Mac', 'Fries', 'Egg McMuffin', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Mac', 'Fries',' Egg McMuffin', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Bacon Ranch Grilled Chicken Salad', 'Egg McMuffin', 'Sprite', 'Big Mac', 'Fries', 'Hash Browns', 'Sausage Burrito'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Double Cheeseburger', 'Hamburger', 'Egg McMuffin', 'Happy Meal', 'Hash Browns', 'Fries', 'Baked Apple Pie', 'Coke'],
        ['Bacon Ranch Chicken Salad', 'Egg McMuffin', 'Fries', 'Sausage McGriddles', 'Hash Browns', 'Diet Dr.Pepper'],
        ['Sausage McMuffin with Egg', 'Hash Browns', 'Egg McMuffin', 'Coffee','Big Mac', 'Fries', 'Coke', 'Big Mac'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Filet-O-Fish', 'Big Mac', 'Fries', 'Hash Browns', 'Egg McMuffin', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Hot Cakes', 'Hash Browns', 'Happy Meal Cheesburger', 'Egg McMuffin', 'Big Mac', 'Fries', 'Happy Meal Cheesburger', 'Baked Apple Pie'],
        ['Big Breakfast', 'Hash Browns', 'Egg McMuffin', 'Coffee'], 
        ['Egg McMuffin',  'Coffee', 'Hash Browns', 'Egg McMuffin',  'Coffee', 'Hash Browns'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Chicken McNuggets', 'Egg McMuffin', 'Quarter Pounder', 'Big Mac', 'Coke', 'Big Mac', 'Fries', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Sausage McMuffin with Egg', 'Big Mac', 'Fries', 'Coke', 'Hash Browns', 'Coffee'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Double Cheesburger', 'Fries', 'Big Mac', 'Quarter Pounder', 'Big Mac', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Double Big Mac','Egg McMuffin', 'Hash Browns', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Hot Fudge Sundae', 'Oatmeal Raisen Cookie'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Kidde Cone', 'Vanilla Cone', 'Fries', 'Hot Carmel Sundae'],
        ['Bacon Ranch Grilled Chicken Salad', 'Big Mac', 'Fat Free Chocolate Milk Jug'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Double Cheeseburger', 'Hamburger', 'Fries', 'Happy Meal', 'Baked Apple Pie', 'Coke'],
        ['Big Mac', 'Chicken McNuggets', 'Quarter Pounder', 'Big Mac', 'Fries', 'Coke', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Sausage McMuffin with Egg', 'Big Mac', 'Fries', 'Coke', 'Coffee'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Quarter Pounder', 'Big Mac','Fries', 'Big Mac', 'Coke', 'Coke'],
        ['Double Big Mac', 'Fries', 'CheeseBurger', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Chicken McNuggets', 'Quarter Pounder', 'Big Mac', 'Fries', 'Coke', 'Coke'],
        ['Bacon Ranch Grilled Chicken Salad', 'Sprite', 'Big Mac', 'Fries', 'Coke', 'Sausage Burrito'],
        ['Double Cheeseburger', 'Hamburger', 'Big Mac', 'Happy Meal', 'Baked Apple Pie', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Bacon Ranch Chicken Salad', 'Egg McMuffin', 'Sausage McGriddles', 'Diet Dr.Pepper'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Sausage McMuffin with Egg', 'Big Mac', 'Fries', 'Coke', 'Coffee'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Filet-O-Fish', 'Big Mac', 'Fries', 'Big Mac', 'Fries', 'Coke', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Hot Cakes', 'Happy Meal Cheesburger', 'Big Mac', 'Happy Meal Cheesburger', 'Baked Apple Pie'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Breakfast', 'Hash Browns', 'Coffee'], 
        ['Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Chicken McNuggets', 'Quarter Pounder', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Sausage McMuffin with Egg', 'Coffee'],
        ['Big Mac', 'Fries', 'Big Mac', 'Fries', 'Coke', 'Coke'],
        ['Double Cheesburger', 'Fries', 'Big Mac', 'Quarter Pounder', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Double Big Mac', 'Egg McMuffin', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Hot Fudge Sundae', 'Oatmeal Raisen Cookie'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Kidde Cone', 'Vanilla Cone', 'Hot Carmel Sundae'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Bacon Ranch Grilled Chicken Salad', 'Fat Free Chocolate Milk Jug'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Double Cheeseburger', 'Hamburger', 'Happy Meal', 'Baked Apple Pie', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Mac', 'Chicken McNuggets', 'Quarter Pounder', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Sausage McMuffin with Egg',' Coffee'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Mac',  'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns','Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Quarter Pounder', 'Big Mac', 'Coke', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Double Big Mac', 'Fries', 'CheeseBurger', 'Big Mac', 'Coke'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Chicken McNuggets', 'Quarter Pounder', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Bacon Ranch Grilled Chicken Salad', 'Sprite', 'Big Mac', 'Sausage Burrito'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Double Cheeseburger', 'Hamburger', 'Happy Meal', 'Baked Apple Pie', 'Coke'],
        ['Bacon Ranch Chicken Salad', 'Egg McMuffin', 'Sausage McGriddles', 'Diet Dr.Pepper'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Sausage McMuffin with Egg', 'Coffee', 'Big Mac'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Filet-O-Fish', 'Big Mac', 'Fries','Coke'],
        ['Hot Cakes', 'Happy Meal Cheesburger', 'Big Mac', 'Happy Meal Cheesburger', 'Baked Apple Pie'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Breakfast', 'Hash Browns', 'Coffee'], 
        ['Big Mac', 'Chicken McNuggets', 'Quarter Pounder', 'Big Mac', 'Coke'],
        ['Sausage McMuffin with Egg', 'Coffee'],
        ['Double Cheesburger', 'Fries', 'Big Mac', 'Quarter Pounder', 'Big Mac', 'Coke'],
        ['Double Big Mac', 'Egg McMuffin', 'Coke'],
        ['Hot Fudge Sundae', 'Oatmeal Raisen Cookie'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Kidde Cone', 'Vanilla Cone', 'Hot Carmel Sundae'],
        ['Bacon Ranch Grilled Chicken Salad', 'Big Mac', 'Fat Free Chocolate Milk Jug'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Double Cheeseburger', 'Hamburger', 'Happy Meal', 'Baked Apple Pie', 'Coke'],
        ['Big Mac', 'Chicken McNuggets', 'Quarter Pounder', 'Coke'],
        ['Sausage McMuffin with Egg', 'Coffee'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Quarter Pounder', 'Big Mac', 'Big Mac', 'Coke', 'Coke'],
        ['Double Big Mac', 'Fries', 'CheeseBurger', 'Coke'],
        ['Big Mac', 'Chicken McNuggets', 'Quarter Pounder', 'Coke'],
        ['Bacon Ranch Grilled Chicken Salad', 'Sprite', 'Sausage Burrito'],
        ['Double Cheeseburger', 'Hamburger', 'Big Mac', 'Happy Meal', 'Baked Apple Pie', 'Coke'],
        ['Bacon Ranch Chicken Salad', 'Egg McMuffin', 'Sausage McGriddles', 'Diet Dr.Pepper'],
        ['Egg McMuffin', 'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Sausage McMuffin with Egg', 'Fries',  'Coffee'],
        ['Big Mac', 'Fries'],
        ['Filet-O-Fish', 'Big Mac', 'Fries','Coke'],
        ['Hot Cakes', 'Fries', 'Happy Meal Cheesburger', 'Big Mac', 'Happy Meal Cheesburger', 'Baked Apple Pie'],
        ['Big Breakfast', 'Fries', 'Hash Browns', 'Coffee'], 
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Mac', 'Big Mac', 'Fries'],
        ['Egg McMuffin', 'Fries',  'Coffee', 'Hash Browns'],
        ['Big Mac', 'Fries', 'Big Mac', 'Chicken McNuggets', 'Quarter Pounder', 'Coke'],
        ['Sausage McMuffin with Egg', 'Fries', 'Coffee'],
        ['Double Cheesburger', 'Fries', 'Big Mac', 'Quarter Pounder', 'Coke'],
        ['Double Big Mac', 'Egg McMuffin', 'Fries', 'Coke'],
        ['Big Mac', 'Fries'],
        ['Hot Fudge Sundae', 'Fries', 'Oatmeal Raisen Cookie'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Kidde Cone', 'Vanilla Cone', 'Fries',  'Hot Carmel Sundae'],
        ['Bacon Ranch Grilled Chicken Salad', 'Big Mac', 'Fries',  'Fat Free Chocolate Milk Jug'],
        ['Egg McMuffin', 'Coffee', 'Fries', 'Hash Browns'],
        ['Big Mac', 'Fries'],
        ['Double Cheeseburger', 'Hamburger', 'Fries',  'Happy Meal', 'Baked Apple Pie', 'Coke'],
        ['Big Mac', 'Chicken McNuggets', 'Fries', 'Quarter Pounder', 'Coke'],
        ['Sausage McMuffin with Egg', 'Coffee', 'Fries'],
        ['Big Mac', 'Big Mac', 'Fries', 'Coke'],
        ['Quarter Pounder', 'Fries', 'Big Mac', 'Coke', 'Coke'],
        ['Double Big Mac', 'Fries', 'CheeseBurger', 'Big Mac', 'Coke'],
        ['Big Mac','Big Mac', 'Fries', 'Fries', 'Coke'],
        ['Big Mac', 'Fries', 'Coke'],
        ['Big Mac', 'Fries'],
        ['Big Mac', 'Big Mac', 'Fries', 'Fries', 'Coke']
]

Day 8: Adding more code to the project and adjusting support

Now it's time to add more code to the project.

Highlight the text in the box below.
Use CTRL C to put the code onto the clipboard.
Paste the code just below the dataset information.
Save your program with a .py extension.
Highlight all the way down to the end of the dataset lines.
Press F9 to run selection or current line.
Click on the next line and press F9 again.
Press F9 four more times until you see the first output information in the console.
This looks a lot like a spreadsheet. It is called a data frame.
It shows that the data frame contains 217 rows (individual orders) and 36 columns (distinct item types, or different products).
te = TransactionEncoder()
te_ary = te.fit(FOOD).transform(FOOD)
df = pd.DataFrame(te_ary, columns=te.columns_)
df
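The row and column counts can be predicted before encoding. The sketch below uses three made-up orders and assumes the encoder gives every distinct text string its own column. Note two consequences: an item listed twice in one order is only counted once, and a name with a stray leading space or a different spelling becomes a separate column.

```python
# Hypothetical three-order sample in the same format as FOOD.
orders = [
    ['Egg McMuffin', 'Coffee', 'Hash Browns', 'Egg McMuffin', 'Coffee', 'Hash Browns'],
    ['Big Mac', 'Fries', 'Coke'],
    ['Big Mac', 'Fries', ' Egg McMuffin', 'Coke'],  # note the stray leading space
]

n_rows = len(orders)  # one row per order
distinct_items = sorted({item for order in orders for item in order})
n_columns = len(distinct_items)  # one column per distinct string

print(n_rows, 'rows x', n_columns, 'columns')
print(distinct_items)
```

Here 'Egg McMuffin' and ' Egg McMuffin' are counted as two different products, so it pays to check spelling and spacing when typing a dataset.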

Day 9: Isolating top sellers and frequent data sets

Highlight the lines of code below and put them on the clipboard. Paste them just below the existing code and save the project. Select the same lines as before and press the F9 key. Look at the results in the console window.

te = TransactionEncoder()
te_ary = te.fit(FOOD).transform(FOOD)
df = pd.DataFrame(te_ary, columns=te.columns_)
df
frequent_itemsets = apriori(df, min_support=0.25,use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x :len(x))
frequent_itemsets[ (frequent_itemsets['length'] == 1) & (frequent_itemsets['support'] >=0.25)]
frequent_itemsets[ (frequent_itemsets['length'] == 2) & (frequent_itemsets['support'] >=0.25)]
frequent_itemsets[ (frequent_itemsets['length'] == 3) & (frequent_itemsets['support'] >=0.25)]

Let's analyze the meaning of the code.

Apriori is the algorithm that we are using from the mlxtend library. We want to use it with df, our data frame, and we want a minimum support of 25%.

We are looking for all items in the data set that appear at least 25 percent of the time.
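It can help to translate the percentage into a number of orders. The sketch below assumes the 217 orders counted on Day 8; the helper name min_orders_needed is made up for illustration.

```python
import math
from fractions import Fraction


def min_orders_needed(min_support, total_orders):
    """Smallest whole number of orders whose share of the total reaches min_support."""
    # Fraction(str(...)) avoids floating-point rounding at exact boundaries.
    return math.ceil(Fraction(str(min_support)) * total_orders)


needed = min_orders_needed(0.25, 217)
print(needed, 'of 217 orders are needed to reach 25% support')
```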

If you look in the console window you should see the following results.

[218 rows x 37 columns]

frequent_itemsets = apriori(df, min_support=0.25,use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x :len(x))
#frequent_itemsets
frequent_itemsets[ (frequent_itemsets['length'] == 1) & (frequent_itemsets['support'] >=0.25)]
Out[149]: 
    support        itemsets  length
0  0.573394       (Big Mac)       1
1  0.316514        (Coffee)       1
2  0.545872          (Coke)       1
3  0.376147  (Egg McMuffin)       1
4  0.573394         (Fries)       1
5  0.334862   (Hash Browns)       1

frequent_itemsets[ (frequent_itemsets['length'] == 2) & (frequent_itemsets['support'] >=0.25)]
Out[150]: 
     support                     itemsets  length
6   0.490826              (Coke, Big Mac)       2
7   0.500000             (Fries, Big Mac)       2
8   0.252294       (Coffee, Egg McMuffin)       2
9   0.270642        (Hash Browns, Coffee)       2
10  0.472477                (Fries, Coke)       2
11  0.316514  (Hash Browns, Egg McMuffin)       2

frequent_itemsets[ (frequent_itemsets['length'] == 3) & (frequent_itemsets['support'] >=0.25)]
Out[151]: 
     support                             itemsets  length
12  0.444954               (Fries, Coke, Big Mac)       3
13  0.252294  (Hash Browns, Coffee, Egg McMuffin)       3

Finding frequent itemsets
The way to find frequent itemsets is the Apriori algorithm.
1. The Apriori algorithm needs a minimum support level as an input and a data set. The algorithm will generate a list of all candidate itemsets with one item.
2. The transaction data set will then be scanned to see which sets meet the minimum support level.
3. Sets that don’t meet the minimum support level will get tossed out.
4. The remaining sets will then be combined to make itemsets with two elements.
5. Again, the transaction dataset will be scanned and itemsets not meeting the minimum support level will get tossed.
6. This procedure will be repeated until all sets are tossed out.
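The six steps above can be sketched in plain Python. This is a teaching sketch under stated assumptions, not the mlxtend implementation: the function name apriori_sketch and the four sample sales are made up for illustration, and the code leaves out the speed-ups a real library uses.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset that meets min_support."""
    baskets = [set(t) for t in transactions]
    total = len(baskets)

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / total

    # Step 1: candidate itemsets with one item each.
    candidates = {frozenset([item]) for b in baskets for item in b}
    frequent = {}
    size = 1
    while candidates:
        # Steps 2 and 3: scan the data and toss candidates below the threshold.
        survivors = {}
        for candidate in candidates:
            s = support(candidate)
            if s >= min_support:
                survivors[candidate] = s
        frequent.update(survivors)
        # Step 4: combine the survivors into itemsets that are one item larger.
        size += 1
        candidates = {a | b for a, b in combinations(survivors, 2) if len(a | b) == size}
        # Steps 5 and 6: the loop repeats until no candidates are left.
    return frequent


# Hypothetical four-sale sample (not the lesson's dataset).
sales = [
    ['shorts', 'headband', 'knife'],
    ['shorts', 'headband'],
    ['shorts', 'knife'],
    ['glove'],
]

result = apriori_sketch(sales, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda pair: (len(pair[0]), sorted(pair[0]))):
    print(sorted(itemset), s)
```

In this sample, 'glove' is tossed in the first pass, so no pair containing it is ever counted. That is how knowledge from earlier passes cuts down the work in later ones.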
Day 10: Deciding on sales strategy

Now that you have displayed the results, what are you going to do with the information?
If we look at single-item best sellers, we can see that
- Big Macs appear in 57% of your orders.
- Fries also appear in 57% of your orders.
- Coke appears in 55%.
- Coffee appears in 32% of orders.
- Hash Browns appear in 33%.
- Egg McMuffins appear in 38% of your orders.
You might want to pair these best-selling items with lower-volume items offered at a discount. For example, you could offer Sprite, Dr. Pepper, or fat free chocolate milk at a reduced price when a customer buys a larger size of fries. You could also offer a salad at a reduced price to those who purchase a Big Mac.

For the breakfast crowd, you could pair up Egg McMuffins with a McGriddle item, or a Happy Meal for the kids.
Looking at sales of two items:
- Fries and Big Mac appear together in 50% of your orders.
- Coke and Big Mac account for 49%.
- Fries and Coke account for 47%.
With two items, you might consider a promotion that supersizes one of the items. For example, with every Big Mac purchased, supersize the drink or the fries.
Looking at the sales of three items, you can see that:
- Big Mac, Coke, and Fries together are a significant part of your orders, 44%.
Pairing food from your menu selections with drinks and side dishes is a proven way to improve sales and profits. Fifty-three percent of fast food restaurants consider combo meals very attractive.
You might want to feature a combo meal in your menus. Offer upgrade items from combo meals menus like larger portions of french fries or larger drinks.

Go to your worksheet and answer the questions relating to frequent itemsets.

What about creating a combo meal with these three items and promoting it with coupons or a discount?
Day 11: Adding to the dataset

Create a new dataset using the following information.
1. Big Mac, Coke, Fries
2. Egg McMuffin, Coffee, Hash Browns
3. Sausage McMuffin with Egg, Coffee, Hash Browns
4. Sausage Biscuit, Coffee
5. Sausage McGriddles, Coffee
6. Sausage Burrito, Coffee
7. Big Mac, Coke, Fries
8. Egg McMuffin, Coffee, Hash Browns
9. Sausage McMuffin with Egg, Coffee, Hash Browns
10. Sausage Biscuit, Coffee
11. Egg McMuffin, Coffee, Hash Browns
12. Egg McMuffin, Coffee, Hash Browns
13. Sausage McGriddles, Coffee
14. Sausage Burrito, Coffee
15. Hotcakes and Sausage, Coffee
16. Egg McMuffin, Coffee, Hash Browns
17. Sausage McMuffin with Egg, Coffee, Hash Browns
18. Sausage Biscuit, Coffee
19. Egg McMuffin, Coffee, Hash Browns
20. Sausage McGriddles, Coffee
21. Sausage Burrito, Coffee
22. Egg McMuffin, Coffee, Hash Browns
23. Cheeseburger, Coke, Fries, Baked Apple Pie
24. Double Cheeseburger, Diet Sprite, Fries, Vanilla Cone
25. Big Mac, Coke, Fries
26. Egg McMuffin, Coffee, Hash Browns
27. Side Salad, Dr. Pepper
28. Chicken McNuggets, Fries, Coke
29. Happy Meal
30. Egg McMuffin, Coffee, Hash Browns
31. Filet-O-Fish, Fries, Dr. Pepper
32. Quarter Pounder, Fries, Coke
33. Double Cheeseburger, Fries, Coke
34. Sausage Burrito, Vanilla Cone
35. Happy Meal
36. Egg McMuffin, Coffee, Hash Browns
37. Big Mac, Coke, Fries
38. Big Mac, Coke, Fries
39. Egg McMuffin, Coffee, Hash Browns
40. Sausage McMuffin with Egg, Coffee, Hash Browns
41. Big Mac, Coke, Fries
42. Sausage Biscuit, Coffee
43. Big Mac, Coke, Fries
44. Big Mac, Coke, Fries
Remember that each item is enclosed in single quotes and items are separated by commas.
First name the dataset FOOD = [
Each transaction is enclosed in square brackets followed by a comma.
Each item is enclosed in single quotes.
Each transaction is indented 4 spaces. Use the Tab key in Spyder to indent.
The last transaction does not have a comma following the ]
Do not forget the ] to end the dataset.
Check your work against the original format of the dataset.
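As a guide, the first three orders from the list would look like the sketch below. The helper check_dataset is made up for illustration; it is one possible way to catch common typing mistakes, such as an empty order or a stray space inside the quotes, before running the rest of the program.

```python
# The first three orders from the list above, typed in the required format.
FOOD = [
    ['Big Mac', 'Coke', 'Fries'],
    ['Egg McMuffin', 'Coffee', 'Hash Browns'],
    ['Sausage McMuffin with Egg', 'Coffee', 'Hash Browns']
]


def check_dataset(dataset):
    """Return a list of messages describing formatting problems (empty if none)."""
    problems = []
    for number, order in enumerate(dataset, start=1):
        if not isinstance(order, list) or not order:
            problems.append(f'order {number}: should be a non-empty list')
            continue
        for item in order:
            if not isinstance(item, str) or not item.strip():
                problems.append(f'order {number}: item {item!r} should be non-empty text')
            elif item != item.strip():
                problems.append(f'order {number}: item {item!r} has extra spaces')
    return problems


print(check_dataset(FOOD))
```

An empty list printed in the console means no problems were found.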

The existing code needs to be included. Just substitute your dataset for the current one.
Save and execute.
Check for errors and correct.
Answer the questions on the worksheet.