Predictive Analytics Tutorial

Day 3: Decision Trees

A decision tree helps a person make a prediction by asking a series of questions. Each question can have only two possible responses, such as yes or no. You begin with the part of the decision tree called the root node. This is the problem that you are trying to solve. Some examples might include.

An example of a decision tree is pictured below.

The tree pictured here is designed to help us decide what type of business we should invest our money in: a food truck, a restaurant or a yogurt shop. You can see that, based on information obtained from an outside source, we could make $290,500 by investing in a food truck business and we might lose $45,000 if things do not go well. The restaurant business could give us $600,000 in profit if we do well, but carries the risk of a significant loss of $128,000. The yogurt shop could earn $400,000 or lose $75,000. A decision tree can help us determine which of the three investments is the best one. The main idea of decision trees is to find the descriptive features that contain the most information about the target feature (the type of business to invest in).

Copy and paste the information above into the Geany text editor. Here is how the algorithm works.

The expected value is calculated so that it includes all possible outcomes of a decision. The figures below were determined by an outside source, such as a company that sells this type of information. (These numbers are approximations only.)

  1. Food truck $156,300

  2. Restaurant $254,400

  3. Yogurt shop $162,500

Based on our decision tree calculations, the best investment would be to put our money into a restaurant.
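The expected value figures above can be reproduced with a short calculation. The sketch below is only an illustration: the tutorial does not state the success probabilities, so the values 0.6 (food truck) and 0.5 (yogurt shop) are assumptions, chosen because they reproduce the listed figures, and the function name expected_value is hypothetical. The restaurant is left out because its probability is not given.

```python
def expected_value(p_success, gain, loss):
    """Weight the possible gain and the possible loss by their probabilities."""
    return p_success * gain - (1 - p_success) * loss

# Assumed probabilities of success (not stated in the tutorial).
food_truck = expected_value(0.6, 290_500, 45_000)
yogurt_shop = expected_value(0.5, 400_000, 75_000)

print(f"Food truck:  ${food_truck:,.0f}")
print(f"Yogurt shop: ${yogurt_shop:,.0f}")
```

Under those assumed probabilities the two results match the food truck and yogurt shop figures in the list.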

Day 4: Decision Tree: Gini Index

Let's suppose that you are a company that sells video games to adult gamers. You want to find out what factors influence the purchase and playing of video games. You obtain information from a company that sells data about the purchasing habits and usage of video gamers. You want to examine how age, gender and education affect video gaming time so that you can direct your marketing efforts.

A measure called the Gini Index is a useful tool to help us decide whether gender, age or education is the biggest influence on playing video games. The Gini Index says that if we select two items from a population at random, they must be of the same class, and the probability of this is 1 if the population is pure. It works with a categorical target variable, "Success" or "Failure". Success equates with playing video games and failure equates with not playing video games. It performs binary splits. The higher the value of the Gini Index, the higher the homogeneity and the more important the variable. The formula is the sum of the squares of the probabilities of success and failure (p^2 + q^2).

We want to segregate our users based on the target variable (playing video games or not). Here is the information that we will work with.

Put the above code onto the clipboard and then paste it into Geany. Save the file using a .py extension. Let us examine the code. The # character allows us to make comments about the code. Here is where we show what the variables are.

#Split on gender

female= (0.5)*(0.5) +(0.5)*(0.5)

This line represents the sum of the squares of the probabilities of success and failure (p^2 + q^2).

The success part of the equation represents the proportion of females in our sample who say that they play video games, which is 50%.

The failure part of the equation is the proportion of females who do not play video games. It is also 50%.

Each probability is squared (multiplied by itself) and the two squares are then added together.

This line evaluates to .25 + .25 = .50

female = round(female,4) This line rounds off the variable female to four decimal places.

male = (0.55)*(0.55) + (0.45) * (0.45) This line shows that 55% of the males in our random sample say they play video games. This is the success part of the formula. If 55% say they play, then 45% do not play video games (1.00 - .55 = .45).

The result of this line is .3025 + .2025 = .505

male = round(male,4) This line rounds off the answer to four decimal places.
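The two calculations above follow the same pattern, so they can be wrapped in one small function. This is a minimal sketch rather than the tutorial's own listing; the function name gini_score is hypothetical.

```python
def gini_score(p_success):
    """Return p^2 + q^2, where q = 1 - p is the probability of failure."""
    q_failure = 1 - p_success
    return p_success ** 2 + q_failure ** 2

# 50% of females and 55% of males in the sample play video games.
female = round(gini_score(0.50), 4)
male = round(gini_score(0.55), 4)

print("Female", female)
print("Male", male)
```

A group that is split evenly between players and non-players gives the lowest possible value, 0.5, while a group where everyone gives the same answer gives 1.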

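weightedGenderSplit = (100/300)*0.55+(200/300)*0.505 This line produces a weighted average. There are 100 females in our study and 200 males. Each group's share of the sample is multiplied by the value for that group, and the two products are then added together: .1833 + .3367 = .52. Note that the female term in this line uses 0.55 rather than the .50 calculated for female above; with .50 the weighted result would be .5033.

weightedGenderSplit = round(weightedGenderSplit,4) This line rounds off the answer to four decimal places.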

The next lines print the results from the lines above.

print("Age Split") This line prints the title of the next group, Age.

#age1 - 18-29 year olds This comment line describes the variable age1

#age2 - 30-49 year olds This comment line describes the variable age2

age1 = (0.81) * (0.81) + (0.19) * (0.19) This line applies the formula and shows that 81% of the respondents aged 18-29 in our survey indicated that they play video games. Conversely, we can calculate that 19% do not play video games.

If you calculate the results of this equation, you will get:

.81 * .81 = .6561
.19 * .19 = .0361
.6561 + .0361 = .6922

age2 = (0.60) * (0.60) + (0.40) * (0.40) This line applies the same formula and shows that 60% of 30-49 year olds play video games and 40% do not play.

If you calculate the results of this equation, you will get:

.60 * .60 = .36
.40 * .40 = .16
.36 + .16 = .52

The next two lines round off the results for age1 and age2

weightedAgeSplit = (140/300) * 0.6922 + (160/300) *(0.52) This line calculates a weighted age split. 140 out of the 300 in the sample stated that they were in age group 18-29, and the remaining 160 were in age group 30-49.

We first calculate the proportion of 18-29 year olds (140 divided by 300) and then multiply it by the value we calculated for the age1 split, .6922.

We then calculate the proportion of 30-49 year olds (160/300) and multiply it by the value obtained for the age2 split, 0.52.

Next we add the two results together (.32303 + .27733 = .60036).

The next line rounds off weightedAgeSplit to four decimal places, giving .6004.
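The weighting step works the same way for any number of groups: each group's value is multiplied by that group's share of the sample. A minimal sketch, assuming each group is passed as a (count, value) pair; the function name weighted_split is hypothetical and not part of the tutorial's listing.

```python
def weighted_split(groups):
    """Average the group values, weighting each by its share of the sample.

    groups is a list of (count, value) pairs, one pair per group.
    """
    total = sum(count for count, _ in groups)
    return sum((count / total) * value for count, value in groups)

# 140 respondents aged 18-29 (value .6922) and 160 aged 30-49 (value .52).
weighted_age = round(weighted_split([(140, 0.6922), (160, 0.52)]), 4)
print("Weighted Age", weighted_age)
```

Because the function takes a list, the same call handles the three education groups without any changes.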

#Education split

Our survey contains three groups based on education level. The first group were non-high school graduates. Eighty of the 300 people in our sample indicated that they did not graduate from high school. Forty percent of this group play video games and 60% do not play.

lessThanHighSchool = (.40) * (.40) + (.60) * (.60)

highSchool = (.51) * (.51) + (.49) * (.49) This line is for the high school graduates. One hundred twenty indicated this on the survey and 51% indicated that they played video games while 49% did not play.

collegeGrad = (.57) * (.57) + (.43) * (.43) This line is for those completing 4 or more years of college. One hundred responded that they had a 4-year degree or more and 57% of those play video games.

.57 * .57 = .3249
.43 * .43 = .1849
.3249 + .1849 = .5098

weightedEducationSplit = (80/300) * .52 + (120/300) *.50 + (100/300)*.5098

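This line calculates the weightedEducationSplit. Remember that 80 were non-grads, and the value calculated for that group was .52. 120 were HS grads, with a value of .5002 (entered here as .50). There were 100 college grads, with a value of .5098. These values are the results of the formula for each group, not the share of the group that plays video games. The result is .1387 + .2000 + .1699 = .5086.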

The next few lines print the results of these calculations.

The final lines compare the results and, based on that comparison, display a recommended group (gender, age or education).
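The comparison lines themselves are not reproduced in this tutorial. One way such a comparison might look is sketched below, using the three weighted values that the program reports; the dictionary and variable names are hypothetical.

```python
# Weighted values reported by the tutorial's program for each split.
weighted_splits = {
    "Gender": 0.52,
    "Age": 0.6004,
    "Education": 0.5086,
}

# The split with the highest weighted value is the recommended one.
recommended = max(weighted_splits, key=weighted_splits.get)
print("Recommended group:", recommended)
```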

Execute the program. Let's look at the results.

Category               Gini score
Female                 0.500
Male                   0.505
Weighted Gender        0.52
18-29 year olds        0.6922
30-49 year olds        0.52
Weighted Age           0.6004
Non HS grads           0.52
High school grads      0.5002
College grads          0.5098
Weighted Education     0.5086

As you can see, the most significant variable of the three is age, because its weighted value (0.6004) is the highest. The 18-29 year olds play more video games than the older group.

This information can be used to redefine our target market, design video games for this segment of our market, and direct advertisements and email campaigns toward this group.

We also found out that gender does not matter that much and neither does the educational level attained by the respondents in the study.


Day 5: Frequent Itemsets via Apriori Algorithm

Market basket analysis uses frequent itemsets to find groups of items that occur together frequently in a data set.

Let's look at an example of a dataset. It is for a sporting goods store. Each line is called an itemset, and represents a sale to an individual.

The name of the dataset is SPORTS. There are a total of 12 customers (0-11). The first customer purchased the following items: skates, Rawlings glove, survival knife, Spaulding ball, Nike shorts, Nike headband, and a flashlight.

The goal is to find the same items in each of the individual sales. If you look for the same item in each of the 12 sales, you will see that.

Data mining is a process used by companies to turn raw data into useful information. By using software to look for patterns in large batches of data, businesses can learn more about their customers, develop more effective marketing strategies, increase sales and decrease costs.

There are a number of algorithms available to us. We will focus our attention on the Apriori algorithm.

The Apriori algorithm has been designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered "frequent" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database.
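Support can be checked by hand on a small example. In the sketch below, only the first transaction comes from the SPORTS dataset described earlier; the other three transactions are hypothetical and are there purely for illustration, as is the function name support.

```python
transactions = [
    # First customer, as described in the tutorial.
    {"skates", "Rawlings glove", "survival knife", "Spaulding ball",
     "Nike shorts", "Nike headband", "flashlight"},
    # Hypothetical transactions, invented for this illustration.
    {"Rawlings glove", "Spaulding ball", "Nike headband"},
    {"skates", "Nike shorts"},
    {"Rawlings glove", "Spaulding ball", "flashlight"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for sale in transactions if itemset <= sale)
    return hits / len(transactions)

glove_and_ball = support({"Rawlings glove", "Spaulding ball"}, transactions)
knife = support({"survival knife"}, transactions)
print(glove_and_ball, knife)
```

With a support threshold of 0.5, the glove and ball pair would count as frequent in this toy data and the survival knife would not.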

It makes several passes over the dataset, increasing the size of itemsets that are being counted each time. It filters out irrelevant itemsets by using the knowledge gained in the previous passes.
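The pass-by-pass idea can be sketched in plain Python. This is a simplified illustration, not the implementation from any particular package: the function name apriori_sketch is hypothetical, and every transaction except the first (which is the first customer described earlier) is invented for the example.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    all_items = {item for sale in transactions for item in sale}
    candidates = [frozenset([item]) for item in all_items]
    frequent = {}
    size = 1
    while candidates:
        # One pass over the data: count each candidate of the current size.
        level = {}
        for cand in candidates:
            share = sum(1 for sale in transactions if cand <= sale) / n
            if share >= min_support:
                level[cand] = share
        frequent.update(level)
        # Build larger candidates from the itemsets that survived this pass.
        size += 1
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Drop any candidate that has a smaller subset which was not frequent.
        candidates = [c for c in joined
                      if all(frozenset(sub) in level
                             for sub in combinations(c, size - 1))]
    return frequent

transactions = [
    {"skates", "Rawlings glove", "survival knife", "Spaulding ball",
     "Nike shorts", "Nike headband", "flashlight"},    # from the tutorial
    {"Rawlings glove", "Spaulding ball", "Nike headband"},   # hypothetical
    {"skates", "Nike shorts"},                               # hypothetical
    {"Rawlings glove", "Spaulding ball", "flashlight"},      # hypothetical
]

frequent = apriori_sketch(transactions, min_support=0.5)
for itemset, share in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), share)
```

The pruning step is where the knowledge from earlier passes is used: a larger itemset is only counted if all of its smaller subsets were already found to be frequent.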

The Apriori algorithm can be downloaded along with many other algorithms in the Anaconda package.

The Apriori algorithm produces the same results as other algorithms. With only 12 sales and a few items in each sale, it is easy to see that this task is quite simple. Imagine examining 100,000 sales.

How does knowing this information help us determine how to better market our products? Knowing what to do with related products is simple. If a customer buys a baseball glove and a ball, you might recommend that they buy a baseball and baseball hat. These associations would be simple to see in a sales transaction.

What is the association between rink skates and Nike shorts? That is a good question. If we see that there are a large number of these sales, we need to investigate.

We might make the assumption that a baseball pitcher who bought a glove might need a headband to soak up the sweat.