Cleaning the text means removing special characters, numbers, and multiple spaces from the text.
Add a new script frame, click the copy text button, and paste the contents into the frame.
There is no output from this script.
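The frame's script isn't reproduced here, but a cleaning pass like the one described can be sketched with Python's re module (the function name and sample text are illustrative, not from the course):

```python
import re

def clean_text(text):
    # Replace anything that is not a letter or whitespace
    # (this drops numbers and special characters)
    text = re.sub(r'[^A-Za-z\s]', ' ', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

print(clean_text("The  plan was 100% ready!!  Great   work."))
# → The plan was ready Great work
```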
Before Cleaning
The text that appears below contains numbers, special characters, and extra spaces. Pay particular attention to reviews 73, 78, and 79, which originally had extra spaces, numbers, and special characters.
Add a new script frame, click the copy text button, and paste the contents into the frame.
The output shows the dataset before it is cleaned.
After Cleaning
Add a new script frame, click the copy text button, and paste the contents into the frame.
The output shows that numbers, special characters, and extra spaces have been removed.
Stop Words
Stop words are words that are very common but don't provide useful information for most text analysis procedures. While they help define the structure of a sentence, they do not help you understand the semantics of the sentence itself.1
Here's the code to generate a list of the most commonly used stop words.
Add a new script frame, click the copy text button, and paste the contents into the frame.
Your output should look like:
Stopwords script
The following script shows you how this function works. I made a new Python project and put the data into a dictionary format.
Start a new Python project, then click the copy text button and paste the contents into the frame. Save it as "stopWords" and run it.
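The stopWords script isn't reproduced here. One way such a script might look, using a list of numbered (id, sentence) tuples and a small hand-written stop-word set for illustration (the real script presumably uses the full NLTK list):

```python
# Small illustrative stop-word set; the course script
# presumably uses the full NLTK English list instead.
stop_words = {'the', 'was', 'it', 'in', 'my', 'a', 'of'}

reviews = [
    (1, "The executive summary was incomplete."),
    (2, "It was lacking financial information."),
]

# Drop any word whose lowercase, punctuation-stripped form is a stop word
for rid, sentence in reviews:
    kept = [w for w in sentence.split()
            if w.lower().strip('.,') not in stop_words]
    print(rid, ' '.join(kept))
```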
Your output should look like:
Python provides a number of data structures, such as:
lists []
tuples ()
sets {}
dictionaries { :}
They are all used to store and organize data efficiently.
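A quick illustration of the four structures (the values are made up):

```python
judges = ['Ada', 'Ben']                      # list: ordered, mutable
review = (1, "The summary was incomplete")   # tuple: ordered, immutable
ratings = {0, 1}                             # set: unique values, unordered
scores = {'Ada': 1, 'Ben': 0}                # dictionary: key/value pairs

print(judges[0], review[1], 1 in ratings, scores['Ada'])
# → Ada The summary was incomplete True 1
```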
Let's modify our list of tuples. Make the first sentence read "The Executive summary was incomplete. It was lacking financial information."
Modify line 5, the tuple about the mission statement. Make it read "In my opinion, the mission statement was dreadful."
Save it and run it. You should see that the stop words were removed.
If you want to see more than the first few lines of print(test), change the line to the following: print(test.head(20))
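For example, with a small stand-in DataFrame (the variable name test matches the course script; the contents here are made up):

```python
import pandas as pd

# Hypothetical stand-in for the course's `test` DataFrame
test = pd.DataFrame({'review': [f'review {i}' for i in range(30)]})

print(test.head())     # first 5 rows (the default)
print(test.head(20))   # first 20 rows
```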
Day 4: Sentiment Analysis with the Natural Language Toolkit (NLTK)
Sentiment Analysis is a technique to extract emotions from textual data.
We are trying to capture the sentiment of the judges as they evaluate our students' business plans.
Studies show that VADER performs as well as individual human raters at matching ground truth. Further inspecting the F1 scores (classification accuracy), we see that VADER (0.96) outperforms individual human raters (0.84) at correctly labelling the sentiment of tweets into positive, neutral, or negative classes.3
Now go back to your initial Python project, "businessPlan" for this section.
Add the following script to begin our sentiment analysis.
Start a new frame, then click the copy text button and paste the contents into the frame.
VADER stands for "Valence Aware Dictionary and sEntiment Reasoner." It can be used on unsupervised data. Our current dataset is labeled with the rating, 0 or 1, given by each judge for each comment.
The variable "sia" is created by making an instance of the SentimentIntensityAnalyzer.
There is no output from this script.
Start a new frame, then click the copy text button and paste the contents into the frame.
Here we are converting the text to numbers.
The max_features attribute specifies that a maximum of the 2000 most frequently occurring words should be used to create the feature dictionary.
The min_df attribute specifies to only include words that appear in at least five documents.
The max_df attribute specifies not to include words that occur in more than 70 percent of the documents.
There is no output from this script.
Start a new frame, then click the copy text button and paste the contents into the frame.
Here we split the data: 80% is used for training, and the remaining 20% makes up the test set (X_test).
There is no output from this script.
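This split can be sketched with scikit-learn's train_test_split (the toy X and y stand in for the vectorized reviews and their ratings):

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the vectorized reviews (X) and ratings (y)
X = [[i] for i in range(80)]
y = [i % 2 for i in range(80)]

# 80% training, 20% test -> 64 training rows, 16 test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))  # → 64 16
```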
Start a new frame, then click the copy text button and paste the contents into the frame.
Ours is a supervised text classification model, since the dataset contains the judges' reviews and their ratings: 0 for a poor business plan and 1 for a good one.
Using the Random Forest classifier, the model will learn from the training data the relationship between the text of a review and its rating of 0 or 1.
There is no output from this script.
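A sketch of the training step, assuming scikit-learn's RandomForestClassifier on made-up features and labels (the course fits it on the vectorized reviews instead):

```python
from sklearn.ensemble import RandomForestClassifier

# Toy features/labels standing in for the vectorized reviews and ratings:
# label is 1 (good plan) when the second feature exceeds the first
X_train = [[0, 1], [1, 0], [0, 2], [2, 0], [0, 3], [3, 0]]
y_train = [1, 0, 1, 0, 1, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict([[0, 4], [4, 0]]))  # → [1 0]
```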
Start a new frame, then click the copy text button and paste the contents into the frame.
The output is a confusion matrix, a classification report, and an accuracy score.
The confusion matrix shows 9 true positives: plans the model predicted to be good that actually were good.
It also shows 5 true negatives: plans the model predicted to be poor that actually were poor.
There were 2 false negatives: plans the model predicted to be poor that were actually good.
There were no false positives.
Our test set consisted of 16 reviews (80 × 0.20). Our model was quite accurate, with 14 correct predictions out of 16 (0.875).
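The three outputs can be reproduced with scikit-learn on hypothetical labels that mirror the counts above:

```python
from sklearn.metrics import (confusion_matrix, classification_report,
                             accuracy_score)

# Made-up labels matching the counts described above:
# 5 true negatives, 9 true positives, 2 false negatives, 0 false positives
y_test = [0]*5 + [1]*9 + [1]*2
y_pred = [0]*5 + [1]*9 + [0]*2

print(confusion_matrix(y_test, y_pred))   # rows: actual, columns: predicted
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))     # → 0.875
```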
Making predictions with unseen plan reviews.
Add the following line in a new frame. Here we are going to see if our model can predict a bad review.
print(clf.predict(vectorizer.transform(['The whole plan was poorly done.'])))
Save and run the code. Did you get a [0] for your output indicating that the above text is a negative review?
Now type some positive-sounding text, save, and run. Did our model predict that this was a positive review?
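If you want a self-contained version to experiment with, here is a tiny end-to-end sketch on made-up reviews (it is not the course's dataset or model, and the parameters are simplified):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up training set standing in for the judges' reviews
reviews = [
    "the plan was poorly done",
    "the plan was badly written and poorly organized",
    "a poorly researched and weak plan",
    "an excellent and well researched plan",
    "the plan was excellent and well written",
    "a well organized excellent business plan",
]
ratings = [0, 0, 0, 1, 1, 1]  # 0 = poor plan, 1 = good plan

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, ratings)

# Classify an unseen, positive-sounding review
print(clf.predict(vectorizer.transform(
    ["The financial section was excellent and well researched."])))
```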