Creating Your Own Datasets

Day 2: Creating your own dataset in Python

Now it is time for you to create your own dataset in Python.

First, decide on a category of products that people buy either in a store or online.

Here are a few suggestions.

Once you have decided on a category, open Spyder, key in the name of your dataset, type an equal sign, key in a [, and press Enter.

Begin each customer's order with [ and end with ],

Put each item ordered in single quotes.
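
As a rough sketch of the layout described above, a dataset with three orders might look like the following. The name dataset and the grocery items are made up for illustration; use your own category and items.

```python
# Hypothetical example: each inner list is one customer's order.
dataset = [
    ['bread', 'milk', 'eggs'],
    ['milk', 'butter'],
    ['bread', 'milk', 'butter', 'jam'],
]

# The dataset is a list of lists, so it can be counted and indexed.
print(len(dataset))   # number of orders
print(dataset[0])     # the first customer's order
```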

Keep saving the project as you go.

Once you have a representative number of items for the itemset, use CTRL + C to copy a section to the clipboard, then paste it at the end of the dataset.

Create at least 200 individual customer orders with multiple purchases.

When creating the dataset, make sure that multiple customers purchase the same items so that you get usable results.
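
One way to check that items really do repeat across customers is to count how many orders each item appears in. This is only a sketch: the dataset below is a made-up three-order example, and Counter comes from Python's standard collections module rather than from the lesson's code.

```python
from collections import Counter

# Hypothetical orders, in the same list-of-lists layout as your dataset.
dataset = [
    ['bread', 'milk', 'eggs'],
    ['milk', 'butter'],
    ['bread', 'milk', 'butter', 'jam'],
]

# set(order) makes an item count once per order, even if it was typed twice.
item_counts = Counter(item for order in dataset for item in set(order))

# Items bought by only one customer will not produce useful results.
for item, count in item_counts.most_common():
    print(item, count)
```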

Add the additional Python code to the project.

The code is the same as the oldies music code.

Run your program.

Day 3: Pandas DataFrames, creating the dataset in csv format using Notepad

Instead of coding the dataset directly into the Python code, the dataset can be read into the program from a comma-separated values (csv) file.

A csv file can easily be created using Spyder, Notepad, Notepad++, or any other text editor.

I used the same subject matter, Billboard's popular songs, and created the dataset using the Notepad text editor.

The file appears below. Put it on the clipboard and make a copy of it for yourself. Remember to use the .csv extension when saving the file.

Notice that the file has a heading: Song, Artist, Year, Genres, Number.

The file is arranged in columns: Song, Artist, Year, Genres, and Number. After the column headings, each song is listed on a line of its own, followed by the name of the Artist, the Year, the Genres, and the Number, which is how high it ranked on Billboard's chart for 1961.

Save the file and call it "oldies.csv".
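
To make the layout concrete, here is a small sketch of how pandas reads a file shaped like this. It assumes the header has no spaces after the commas, and it uses a three-line stand-in for the file (the Chubby Checker rows quoted later in this lesson) held in a string so the example runs on its own.

```python
import io
import pandas as pd

# Stand-in for the contents of oldies.csv: a header line, then one song per line.
csv_text = """Song,Artist,Year,Genres,Number
Pony Time,Chubby Checker,1961,Rock and Roll,Number 7
The Fly,Chubby Checker,1961,Rock and Roll,Number 70
Let's Twist Again,Chubby Checker,1961,Rock and Roll,Number 94
"""

# With the real file you would write pd.read_csv("oldies.csv") instead.
dataset = pd.read_csv(io.StringIO(csv_text))
print(dataset)
```

The first line of the file becomes the column names, and every later line becomes one row.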

The next information needed is the Python code, which you can see below. Copy it to the clipboard, paste it into Spyder, and save it with a .py extension.

In Python there are a few terms that are helpful in understanding how the program works. We will use different code for this next example. Here are some key terms.

Sets, or itemsets, are unordered collections of items. Every element is unique; there are no duplicates. A set is created by placing all the items inside curly braces {}, separated by commas, or by reading the items into the program from a csv or Excel file. An example is the collection used above: Song, Artist, Year, Genres, and Number.
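
A short sketch of the "no duplicates" rule, using genre and artist names that appear in this lesson. The variable names genres and artists are only for illustration.

```python
# 'Rock and Roll' is typed twice but stored only once.
genres = {'Rock and Roll', 'Doo-Wap', 'Rock and Roll'}
print(len(genres))

# set() builds a set from a list and drops the repeats.
artists = set(['Chubby Checker', 'Chubby Checker', 'Chubby Checker'])
print(artists)
```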

DataFrames are like tables: they are organized in rows and columns. A DataFrame can be loaded from a number of different data structures and files, including lists and dictionaries, csv files, Excel files, and database records.
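
As a minimal sketch of loading from lists and dictionaries, the following builds a small DataFrame from a list of dictionaries, one dictionary per row. Only three columns are used to keep it short, and the names rows and frame are assumptions for this example.

```python
import pandas as pd

# One dictionary per row; the keys become the column names.
rows = [
    {'Song': 'Pony Time', 'Artist': 'Chubby Checker', 'Number': 'Number 7'},
    {'Song': 'The Fly', 'Artist': 'Chubby Checker', 'Number': 'Number 70'},
    {'Song': "Let's Twist Again", 'Artist': 'Chubby Checker', 'Number': 'Number 94'},
]

frame = pd.DataFrame(rows)
print(frame.shape)             # (rows, columns)
print(frame.columns.tolist())  # column names
```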

Running the Code

Highlight all the code and press F5. After running the code, you can also highlight a section of code and press F9 to see the output for just that section.

Look at the first results in the console. These results are the output of the line that says print(dataset).

Answer the following questions.

  1. What was the number one song in 1961?

  2. How many songs are in the dataset?

  3. What song was number 21?

You will notice that the entire dataset does not appear in the console, just rows 0-29 and 70-99.

print(dataset.loc[30:69])  # prints rows 30 to 69, the ones missing from above
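
One detail worth seeing for yourself: a .loc slice uses row labels and includes both ends, unlike an ordinary Python list slice. The sketch below uses a hypothetical 100-row stand-in with only a Number column, not the real song file.

```python
import pandas as pd

# Hypothetical stand-in: 100 rows with the default index 0..99.
dataset = pd.DataFrame({'Number': [f'Number {n}' for n in range(1, 101)]})

middle = dataset.loc[30:69]   # label-based slice: rows 30 AND 69 are both included
print(len(middle))
print(middle.index[0], middle.index[-1])
```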

Answer these questions for rows 30 to 69.

  1. What number was Little Sister?

  2. What song was hit number 49?

dataset.sample(10) prints ten rows chosen at random.
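
Because the rows are picked at random, your ten songs will differ from a classmate's. If you want the same rows every time, pandas accepts a random_state value. The sketch below uses a hypothetical 100-row stand-in rather than the song file.

```python
import pandas as pd

# Hypothetical stand-in: 100 rows with the default index 0..99.
dataset = pd.DataFrame({'Number': [f'Number {n}' for n in range(1, 101)]})

picked = dataset.sample(10)                      # ten different rows on each run
repeatable = dataset.sample(10, random_state=1)  # the same ten rows on every run

print(picked)
print(repeatable)
```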

What ten songs did you get when you looked at this block of output in the console?

You can also search the dataset for specific information, such as a particular song.

The line dataset[dataset.Song=='Traveling Man'] searches the dataset and outputs the row on which Traveling Man appears.

Who sang the song, and what genre was it?

dataset[dataset.Artist=='Chubby Checker'] produces the following output.

dataset[dataset.Artist=='Chubby Checker']  # print out an artist
Out[4]:
Song               Artist          Year  Genres         Number
Pony Time          Chubby Checker  1961  Rock and Roll  Number 7
The Fly            Chubby Checker  1961  Rock and Roll  Number 70
Let's Twist Again  Chubby Checker  1961  Rock and Roll  Number 94
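
It may help to see what happens inside the square brackets. The comparison produces a column of True/False values, one per row, and the DataFrame keeps only the rows marked True. This is a sketch on a four-row stand-in; the artist names for the first two songs come from general chart knowledge, not from the lesson's file, and mask is a name chosen for this example.

```python
import pandas as pd

# Four-row stand-in with only the Song and Artist columns.
dataset = pd.DataFrame({
    'Song':   ['Tossing and Turning', 'Runaway', 'Pony Time', 'The Fly'],
    'Artist': ['Bobby Lewis', 'Del Shannon', 'Chubby Checker', 'Chubby Checker'],
})

mask = dataset.Artist == 'Chubby Checker'   # one True/False value per row
print(mask.tolist())
print(dataset[mask])                        # only the rows where mask is True
```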

The .loc indexer slices out part of your data for additional scrutiny.

print(dataset.loc[:,['Song','Artist']]) outputs the songs and artists: the first 30 rows (0-29), then rows 70-99.

The next two lines show that you can likewise slice out the songs with their chart numbers, and the artists with their genres.
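
In .loc[rows, columns], the colon before the comma means "all rows", and the list after the comma names the columns to keep. Here is a sketch using the three Chubby Checker rows as a stand-in for the full file; the names songs_and_artists and songs_and_numbers are made up for this example.

```python
import pandas as pd

# Three-row stand-in with all five columns.
dataset = pd.DataFrame({
    'Song': ['Pony Time', 'The Fly', "Let's Twist Again"],
    'Artist': ['Chubby Checker'] * 3,
    'Year': [1961] * 3,
    'Genres': ['Rock and Roll'] * 3,
    'Number': ['Number 7', 'Number 70', 'Number 94'],
})

songs_and_artists = dataset.loc[:, ['Song', 'Artist']]   # all rows, two columns
songs_and_numbers = dataset.loc[:, ['Song', 'Number']]

print(songs_and_artists)
print(songs_and_numbers.columns.tolist())
```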

You can additionally look for all songs whose genre is classified as Doo-Wap by using the following code: dataset[dataset.Genres=='Doo-Wap']

Similarly, you can obtain output for other genres using the next six lines of code.

The last line of Python code prints out the top ten songs which happen to be the first ten rows of data in the dataset.

print(dataset.loc[[0,1,2,3,4,5,6,7,8,9]])  # print specific rows: the top ten
Index  Song                 Number
0      Tossing and Turning  Number 1
1      I Fall to Pieces     Number 2
2      Michael              Number 3
3      Crying               Number 4
4      Runaway              Number 5
5      My True Story        Number 6
6      Pony Time            Number 7
7      Wheels               Number 8
8      Raindrops            Number 9
9      Wooden Heart         Number 10

[10 rows x 5 columns]
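
Typing out every row number works, but there are shorter ways to get the same rows. The sketch below compares three of them on a stand-in that holds only the Song and Number columns of the ten rows listed above. The shortcuts are a suggestion, not part of the lesson's code.

```python
import pandas as pd

# Song and Number columns for the top ten rows, as listed in the output above.
dataset = pd.DataFrame({
    'Song': ['Tossing and Turning', 'I Fall to Pieces', 'Michael', 'Crying', 'Runaway',
             'My True Story', 'Pony Time', 'Wheels', 'Raindrops', 'Wooden Heart'],
    'Number': [f'Number {n}' for n in range(1, 11)],
})

by_list = dataset.loc[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]  # explicit list of row labels
by_slice = dataset.loc[0:9]                             # label slice, 9 is included
by_head = dataset.head(10)                              # first ten rows by position

# All three give the same songs in the same order.
print(by_list['Song'].tolist() == by_slice['Song'].tolist() == by_head['Song'].tolist())

# A list is still the way to pick rows that are not next to each other.
picked = dataset.loc[[0, 6]]
print(picked)
```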

Day 4: Creating Your Own Dataset using csv format.

Pick a topic and create a dataset containing at least 100 rows and five columns.

Change the Python code to:

  1. Read your csv file and print the dataset.
  2. Print any missing rows in the above printout.
  3. Print a random sample of 5 items.
  4. Search for a particular item in the first column.
  5. Search for a particular item in the second column.
  6. Print out the first two columns.
  7. Print out the first and last columns of the dataset.
  8. Search for an item in the fourth column.
  9. Print out a series of rows.