In this lesson you will learn how to deal with missing data in datasets and how to encode categorical data.
With the huge amount of data now available, more and more data scientists are finding ways to put it to business use. Most data is too raw to use as it is; it needs to be processed before it can be used to train statistical models that make predictions. We will be working with a dataset of pleasure boats. If you have been using other modules on our web page, you might want to look at one of the previous lessons on searching and sorting data sets.
Missing values are entries in the dataset that do not contain any value. There can be a number of reasons why a dataset is missing values. Customers filling in a survey are apt to skip some of the questions. Calculation issues can also lead to missing data. Many machine learning algorithms, such as those in scikit-learn, do not work with missing values. These values either need to be removed or converted into numbers.

There are a number of methods that we can use to deal with missing values in a dataset.

Check the source to recover the missing data
Drop the missing values by
1. Removing the rows with the missing data
2. Removing the whole column with the missing data
Replace missing values with the mean, median, or mode
Convert missing categorical data to numbers

Sorting and searching data sets

We will use the boats dataset. Let's get started by importing the libraries that we need for this project.

Day 1: Libraries, importing the dataset, finding missing values

Finding missing length

Open a new Python project, click the copy text button, and paste the contents into the first frame.

Run the code in the first frame.

This code imports libraries from Python.

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

Run the code in the first two frames.

The code reads in the spreadsheet file and prints the first 35 records in the dataset.

You will have to adjust the location of the file in the code to the folder where you saved the CSV spreadsheet file.
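The code frames themselves are not reproduced in this text, so here is a minimal sketch of what the first two frames might look like. It is an assumption, not the lesson's exact code: the column names (make, length, power, type) and the rows are made up, and an in-memory string stands in for the CSV file so the example runs anywhere. With the real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the boats CSV file (made-up rows and columns).
csv_text = """make,length,power,type
Bayliner,18.0,IO,runabout
SeaRay,,IO,bow_rider
Tracker,16.5,outboard,fishing
Malibu,23.0,inboard,ski_wakeboard
Boston Whaler,,outboard,center_console
"""

# With the real dataset, replace io.StringIO(csv_text) with the file path.
boats_data = pd.read_csv(io.StringIO(csv_text))

# Show the first 35 records (this tiny sample only has 5).
print(boats_data.head(35))
```

Empty fields in the CSV are read as NaN, which is how the missing lengths show up in the printout.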

Your output should look like the image below.

If you look carefully, you will notice that for records 3, 12 and 34 the length of the boat says NaN, which stands for "not a number". When the dataset was created, for some reason, the length was not entered for these three vessels.

Our dataset is very small; it would not be practical to locate missing data in this manner in a large dataset. The next line of code is a better choice for finding missing data.

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

Run the code in the first three frames.

The output from this line is below.

The output shows that only the length column contains missing values.
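One common way to find missing data in every column at once is to combine isnull() with sum(). The sketch below assumes that approach; the lesson's own line is not shown here, and the small DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing length.
boats_data = pd.DataFrame({
    "make": ["Bayliner", "SeaRay", "Tracker", "Malibu"],
    "length": [18.0, np.nan, 16.5, 23.0],
    "power": ["IO", "IO", "outboard", "inboard"],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_counts = boats_data.isnull().sum()
print(missing_counts)
```

A column with a count of 0 has no missing values.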

Now let's find the mean and median of the values that are present in the length column.

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

Run the code in the first four frames.

The output from this line tells us that the median for all of the boats in the dataset is 22.0 feet. The mean for the length is 21.44685106383 feet.
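As a sketch of how these two numbers can be computed (the column name length and the values are assumptions made for illustration), pandas skips NaN entries by default when calculating the median and the mean:

```python
import numpy as np
import pandas as pd

# Made-up lengths with two missing entries.
boats_data = pd.DataFrame({"length": [18.0, np.nan, 22.0, 24.0, np.nan, 26.0]})

# Both methods ignore NaN values by default (skipna=True).
median_length = boats_data["length"].median()
mean_length = boats_data["length"].mean()

print(median_length)
print(mean_length)
```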

Now we are going to add two columns to our Pandas DataFrame: one for the median length and the other for the mean length.

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

Run the code in the first five frames.

The first 35 records in the dataset are printed. The original length column is still in the dataset as well as the median and mean length columns.
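A minimal sketch of adding the two columns follows. The column names Median_length and Mean_length are hypothetical, as is the sample data; the idea is simply that fillna() returns a filled copy, so the original length column is left unchanged.

```python
import numpy as np
import pandas as pd

# Made-up lengths with two missing entries.
boats_data = pd.DataFrame({"length": [18.0, np.nan, 22.0, 24.0, np.nan, 26.0]})

median_length = boats_data["length"].median()
mean_length = boats_data["length"].mean()

# fillna() returns a new Series, so the original length column keeps its NaNs.
boats_data["Median_length"] = boats_data["length"].fillna(median_length)
boats_data["Mean_length"] = boats_data["length"].fillna(mean_length)

print(boats_data.head(35))
```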

Before using the dataset in a machine learning application, we need to determine whether we should use the mean or the median in place of the missing length values.

To accomplish this we are going to graph the length, median and mean length variables.

Day 2: Graphing length, median and mean lengths

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

Run the code in all of the frames.

Your output should look like the graph below.

As you can see, the graph is shaped like a bell. This is also called a normal distribution curve. The mean, 21.44, is in the center of the curve. Larger values for length fall on the right side of the mean and smaller values fall on the left side. If the graph has a bell shape and is not skewed, then it is best to replace the missing values with the mean. The graph is also quite narrow, which means that the values are not very far from the mean. If the bell were wider at the base, then the values would be farther from the mean.

The code plots the figure and graphs length, median length and mean length. The legend is printed as well. The median line is red, the mean line is green, and the original length line is blue.
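The plotting code is not reproduced in this text, so the following is only a sketch of one way to draw the three distributions with the colors described above. It uses step histograms from Matplotlib as a stand-in; a smooth density curve (for example with plot(kind='kde')) would look closer to a bell but additionally requires SciPy. The data and the column names Median_length and Mean_length are made up.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up lengths with two missing entries.
boats_data = pd.DataFrame(
    {"length": [16.0, 18.0, np.nan, 20.0, 21.0, 22.0, np.nan, 22.0, 23.0, 24.0, 26.0]}
)
boats_data["Median_length"] = boats_data["length"].fillna(boats_data["length"].median())
boats_data["Mean_length"] = boats_data["length"].fillna(boats_data["length"].mean())

fig, ax = plt.subplots()

# Blue for the original lengths, red for the median fill, green for the mean fill.
colors = {"length": "blue", "Median_length": "red", "Mean_length": "green"}
for column, color in colors.items():
    ax.hist(boats_data[column].dropna(), bins=5, histtype="step",
            color=color, label=column)

ax.set_xlabel("length (feet)")
ax.set_ylabel("Number of boats")
ax.legend(loc="best")
plt.show()
```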

Mean or median values can be used on numerical data in cases where the missing data is randomly omitted.

If the graph is skewed (not symmetrical), then we should use the median to replace missing values.

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.
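Judging skew by eye can be backed up with a number. The sketch below is an addition to the lesson's approach, not part of it: it uses the pandas skew() method, and the helper name choose_fill_value and the cutoff of 0.5 are assumptions (a common rule of thumb, not a fixed standard).

```python
import numpy as np
import pandas as pd


def choose_fill_value(series, threshold=0.5):
    """Return the mean for roughly symmetric data, otherwise the median."""
    # skew() is close to 0 for symmetric data; NaN values are ignored.
    if abs(series.skew()) < threshold:
        return series.mean()
    return series.median()


# Made-up examples: one symmetric column, one with a long right tail.
symmetric = pd.Series([18.0, 20.0, np.nan, 22.0, 24.0, 26.0])
skewed = pd.Series([16.0, 17.0, 18.0, 19.0, 40.0])

print(choose_fill_value(symmetric))  # mean is used
print(choose_fill_value(skewed))     # median is used
```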

Run the code in all of the frames.

The length column is brought back. Original lengths are still in there, but the NaN values are replaced with the mean of 21.

The median length and mean length columns are no longer needed, so they are dropped. axis=1 indicates dropping a column; axis=0 means dropping a row. Then the first 35 records are printed out.
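A minimal sketch of this step, assuming the helper columns are named Median_length and Mean_length (hypothetical names) and using made-up data:

```python
import numpy as np
import pandas as pd

# Made-up lengths with two missing entries, plus the two helper columns.
boats_data = pd.DataFrame({"length": [18.0, np.nan, 22.0, 24.0, np.nan, 26.0]})
mean_length = boats_data["length"].mean()
boats_data["Median_length"] = boats_data["length"].fillna(boats_data["length"].median())
boats_data["Mean_length"] = boats_data["length"].fillna(mean_length)

# Replace the NaN values in the original column with the mean.
boats_data["length"] = boats_data["length"].fillna(mean_length)

# axis=1 drops columns; axis=0 would drop rows.
boats_data = boats_data.drop(["Median_length", "Mean_length"], axis=1)

print(boats_data.head(35))
```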

Day 3: Replacing missing categorical data: power, type

Categorical data that is missing cannot be replaced with the mean or median, since categorical data is not numeric. We can use the mode, however. The mode is the most frequently occurring value.

Missing power of boats

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

Run the code in all of the frames.

We are loading a different version of our boats dataset. It contains missing categorical data instead of missing numerical data.

Look at the printout and see if you can locate the NaN values in categorical columns.

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

Run the code in all of the frames.

Here is the output from this frame.

Now let's graph the data to find the mode for power and type of boat.

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

Run the code in all of the frames.

Here is the output from this frame.

From our graph we can see that the most frequently appearing boat is an inboard/outboard, followed by outboards, inboards and jet boats. The inboard/outboard is the mode of this categorical distribution.
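A bar graph like this can be produced from value_counts(), which counts how often each category appears and ignores NaN by default. The sketch below is an assumption about how the frame might look; the column name power and the values are made up.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up power values with one missing entry.
boats_data = pd.DataFrame(
    {"power": ["IO", "IO", "IO", np.nan, "outboard", "outboard", "inboard", "jet"]}
)

# value_counts() sorts from most to least frequent and skips NaN.
power_counts = boats_data["power"].value_counts()

ax = power_counts.plot.bar()
ax.set_xlabel("power")
ax.set_ylabel("Number of boats")
plt.show()
```

The tallest bar is the mode of the column.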

Now we need to replace the NaN values in the power column with "IO".

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

Run the code in all of the frames.

Here is the output from this frame.

Look at the output from the lines of code and verify that the NaN in the power column has been replaced with "IO".
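As a sketch of the replacement (the column name power and the rows are assumptions made for illustration), fillna() with a string fills every NaN in the column with that category:

```python
import numpy as np
import pandas as pd

# Made-up power values with one missing entry at index 2.
boats_data = pd.DataFrame({"power": ["IO", "outboard", np.nan, "IO", "inboard"]})

# Replace the missing value with the mode found from the graph.
boats_data["power"] = boats_data["power"].fillna("IO")

print(boats_data.head(35))
```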

Now we need to do the same thing for the type of boat. We have runabouts, fishing boats, wake boats, etc.


Type of boats

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

Run the code in all of the frames.

Here is the output from this frame.

runabout
dtype: object

As you can see from the output, runabouts are the largest group of boats in the dataset.
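Instead of reading the mode off a graph, pandas can return it directly with mode(). The sketch below assumes that is what this frame does, which matches the shape of the output shown; the sample rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up boat types with one missing entry.
boats_data = pd.DataFrame(
    {"type": ["runabout", "fishing", "runabout", np.nan, "bow_rider", "runabout"]}
)

# mode() returns a Series because several values can tie for most frequent.
type_mode = boats_data["type"].mode()
print(type_mode)
```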

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

Run the code in all of the frames.

In the Python program, you should see that all of the NaNs in the type column were replaced with 'runabout', which is the mode.
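A minimal sketch of filling the type column with its mode, using made-up rows. Taking element 0 of the mode() result picks the first mode if there is a tie.

```python
import numpy as np
import pandas as pd

# Made-up boat types with two missing entries.
boats_data = pd.DataFrame(
    {"type": ["runabout", np.nan, "fishing", "runabout", np.nan, "bow_rider"]}
)

# mode() returns a Series, so take its first element as the fill value.
most_common_type = boats_data["type"].mode()[0]
boats_data["type"] = boats_data["type"].fillna(most_common_type)

print(boats_data.head(35))
```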

Missing boat dealers

In some cases when you have missing categorical data in a dataset, you can create an arbitrary category. Let's call it "Missing". If, for example, we are missing a few boat dealers, we could fill in the NaN with "Missing".

First we need to find out which boats are missing information about the dealer. We are going to use another copy of the boats dataset.

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

You will have to adjust where the file comes from.

Run the code in all of the frames and look for the boats that have NaN for the name of the dealer.

Now let's look at the dataset for missing data in all columns. Create a new frame and key in this line:

boats_data.isnull().mean()

The output should look like the following.

You can see that about 2% of the dealer names are missing from our dataset.
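To see why isnull().mean() gives a percentage, note that isnull() produces True/False values and the mean of those is the fraction that are True. Here is a small sketch with made-up data in which 1 dealer name out of 50 is missing:

```python
import numpy as np
import pandas as pd

# Made-up data: 49 boats with a dealer name and 1 without.
boats_data = pd.DataFrame({"dealer": ["Dealer A"] * 49 + [np.nan]})

# The mean of the True/False flags is the share of missing values per column.
missing_share = boats_data.isnull().mean()
print(missing_share)
```

A value of 0.02 corresponds to 2% of the rows.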

We are going to convert these NaNs to the word "Missing".

Create a new frame and type the following two lines, save and run this block of code:

boats_data['dealer'] = boats_data['dealer'].fillna('Missing')
boats_data.head(30)

You can see that the NaN values for dealers have been replaced with the word "Missing".

Now let's create a bar graph of this data. Use the following lines:

boats_data.dealer.value_counts().sort_values(ascending=False).plot.bar()
plt.xlabel('dealer')
plt.ylabel('Number of boats')

You should get the following result.

Day 4: Categorical data encoding schemes

Models based on statistical algorithms, such as machine learning models, work with numbers. Most datasets contain both numerical and categorical data. In many cases, you cannot just drop the column containing categorical data.

The techniques used to convert categorical data into numbers are called categorical data encoding.

One Hot Encoding

Let's get started. We need to import the necessary libraries and then load the dataset.

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

You will have to adjust for the folder.

Run the code in all of the frames.

Your output should look like the text below:

We need to assign an integer to each category value. Let's look at the type of boat. Previously we assigned the missing data to the runabout boat type. However, that variable is important when considering buying a boat.

We have five types of boats:

runabout
center_console
ski_wakeboard
bow_rider
fishing

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

Run the code in all of the frames.

Your output should look like the text below:

Now create a new frame and key in this line to print the unique values of each type of boat.

print(boats_data['type'].unique())

The output should be:

['runabout' 'center_console' 'ski_wakeboard' 'bow_rider' 'fishing']

The easiest way to convert a column into a one-hot encoded column is to use the Pandas get_dummies() function.

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

Run the code in all of the frames.

Your output should look like the text below:

Looking at the output, you will see that the first 5 boats are runabouts, so they are assigned a value of 1 in the runabout column. All of the other boat type columns in those rows are assigned a 0.
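A minimal sketch of one-hot encoding the type column with pd.get_dummies() follows; the rows are made up. Recent pandas versions return True/False by default, so dtype=int is passed here as an assumption to get the 1 and 0 values described above.

```python
import pandas as pd

# Made-up boat types covering the five categories.
boats_data = pd.DataFrame({
    "type": ["runabout", "runabout", "center_console",
             "ski_wakeboard", "bow_rider", "fishing"],
})

# One new column per category; dtype=int gives 1/0 instead of True/False.
type_dummies = pd.get_dummies(boats_data["type"], dtype=int)
print(type_dummies.head())
```

Each row has exactly one 1, in the column that matches its boat type.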

Now we are going to display the type name and the one-hot encoded version of the type column in the same dataframe.

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

Run the code in all of the frames.

Your output should look like the text below:
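One way to show the original column next to its encoded columns is pd.concat() with axis=1, which places the pieces side by side. This is a sketch with made-up rows, not necessarily the lesson's exact code.

```python
import pandas as pd

# Made-up boat types covering the five categories.
boats_data = pd.DataFrame({
    "type": ["runabout", "runabout", "center_console",
             "ski_wakeboard", "bow_rider", "fishing"],
})

# axis=1 joins column-wise: the type name first, then one column per category.
combined = pd.concat(
    [boats_data["type"], pd.get_dummies(boats_data["type"], dtype=int)],
    axis=1,
)
print(combined.head())
```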

One-hot encoding for missing values

It is also possible to use one-hot encoding if the dataset is missing values in the column that you are trying to convert.

Here is the code needed. We are using a new dataset, boats4.csv, for this part of the lesson.

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

Adjust for the location of the file.

Run the code in all of the frames.

Your output should look like the text below:

You can see that we are missing the type of boat for the SeaRay. There is also another missing value in a row that is not visible on the screen. If you put the number 20 in the boats_data.head() function you will be able to see the other missing value.

Click the + key to create a new frame, then click the copy text button and paste the contents into the frame.

Run the code in all of the frames.

Your output should look like the text below:

You will notice that the SeaRay boat (index number 1, the second line in the image above) has been put in the NaN column.
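The NaN column comes from the dummy_na parameter of pd.get_dummies(), which adds an extra indicator column for missing values. The sketch below assumes that is the option used here; the rows are made up, with the missing type placed at index 1 to mirror the SeaRay example.

```python
import numpy as np
import pandas as pd

# Made-up rows; the boat at index 1 has no type recorded.
boats_data = pd.DataFrame({
    "make": ["Bayliner", "SeaRay", "Tracker", "Glastron"],
    "type": ["runabout", np.nan, "fishing", "runabout"],
})

# dummy_na=True adds a final column, labeled NaN, that flags missing values.
type_dummies = pd.get_dummies(boats_data["type"], dummy_na=True, dtype=int)
print(type_dummies)
```

Without dummy_na=True, the row with the missing type would simply contain 0 in every column.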