Sentiment analysis using n-grams
In today’s data science world, sentiment analysis has become a fast-growing and popular topic. But what is sentiment analysis?
A simple way to define it: sentiment analysis is the process of extracting polarity (e.g. a positive or negative opinion) from textual data. Polarity is not fixed; it depends on the domain in which the text is used.
Understanding customers’ emotions and reviews about a product or service is very important for any growing business. In addition, customers now share their feelings and experiences about products on social media more than ever before.
Let’s try to understand it with a very basic example. Suppose a company has launched a new product and we have collected reviews of it from different websites. To a casual reader this is just a file with a lot of reviews, but these reviews contain a lot of information about what users think of the product. That information is called sentiment, and extracting it from the text and making sense of it is called sentiment analysis.
This information can be used in various ways. It can help us understand:
• Current standing of product in the market.
• User’s view of the product.
• Areas of improvement to generate more revenue.
• Target marketing/selling.
• Reputation management.
• Market research.
Social media monitoring tools like Brandwatch Analytics make the analysis process quicker and easier than ever before, thanks to real-time monitoring capabilities.
There are various approaches to sentiment analysis, but in this post we will look at one specific approach: n-grams. In word n-grams, n denotes the number of consecutive words that the model treats as a single unit for sentiment analysis.
Let’s try to understand the concept of n-gram with some examples.
1. This book is bad.
2. This book is not bad.
If n=1, it is called a unigram, and the analysis is done on single words.
The first sentence is split into “This”, “book”, “is”, “bad”, and we can say it has negative polarity because of the word “bad” (we can create lists of positive and negative words to check word-wise polarity). The second sentence has positive polarity because of the phrase “not bad”, but a unigram model sees “not” and “bad” as two separate words. We can understand that it is a positive review, but the model treats the two words separately and can misclassify it.
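The unigram problem described above can be reproduced with a minimal sketch. The word lists and the function name unigram_score are hypothetical and exist only for illustration; a real lexicon would be much larger.

```python
# Hypothetical, tiny word lists used only for this illustration.
POSITIVE_WORDS = {"good", "great", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "terrible"}


def unigram_score(sentence: str) -> int:
    """Return (#positive words - #negative words), looking at one word at a time."""
    tokens = sentence.lower().replace(".", "").split()
    score = 0
    for token in tokens:
        if token in POSITIVE_WORDS:
            score += 1
        elif token in NEGATIVE_WORDS:
            score -= 1
    return score


print(unigram_score("This book is bad."))      # -1
print(unigram_score("This book is not bad."))  # -1 as well: "not" is ignored
```

Both sentences get the same negative score, because a word-by-word lookup has no way to see that “not” changes the meaning of “bad”.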
Things become more interesting when we increase n. If n=2 it is called a bigram, if n=3 a trigram, and so on.
In the same way, a trigram model forms groups of three consecutive words. For the second sentence these are “This book is”, “book is not”, “is not bad”. For sentiment analysis, we can treat “is not bad” as a single unit and give it a score, which gives more accurate results.
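To make the grouping concrete, here is a small sketch in plain Python that builds word n-grams with a sliding window. The helper name word_ngrams is my own and is not part of any library.

```python
from typing import List


def word_ngrams(sentence: str, n: int) -> List[str]:
    """Return all groups of n consecutive words in the sentence."""
    tokens = sentence.rstrip(".").split()
    # Slide a window of size n over the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "This book is not bad."
print(word_ngrams(sentence, 2))
# ['This book', 'book is', 'is not', 'not bad']
print(word_ngrams(sentence, 3))
# ['This book is', 'book is not', 'is not bad']
```

With n=2 the phrase “not bad” already appears as one unit, and with n=3 we get “is not bad”.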
These are the basics of n-grams and how they can help with sentiment analysis. Here I used the Movie review dataset, which contains 25,000 movie reviews: 12,500 are positive and the remaining 12,500 are negative. To create the n-grams I used the Python library sklearn, but other options such as nltk are available.
First I did some basic preprocessing: removing stop words, converting to lower case, removing extra spaces, and removing any special characters. Preprocessing is very important because it removes unnecessary data from the corpus. After that I lemmatized the reviews, which converts words to their root form. Stemming can also be used here, but I found that lemmatization works better with this data.
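The cleaning steps could look roughly like the sketch below. It uses only the standard library; the stop word list is a tiny hypothetical one (a real list, for example from nltk, is much longer), and lemmatization is left out because it needs an external library and its data files. In this sketch “not” is kept out of the stop word list on purpose, so that n-grams like “not bad” survive; that is my assumption, not a statement about the original pipeline.

```python
import re

# Hypothetical, very small stop word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "this", "of", "and", "to", "it"}


def preprocess(text: str) -> str:
    """Lowercase, strip special characters and extra spaces, drop stop words."""
    text = text.lower()
    # Replace everything except letters, digits and whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # split() with no arguments also collapses repeated whitespace.
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


print(preprocess("This   book is NOT bad!!!"))  # book not bad
```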
With the processed data in hand, we can train a model to classify the reviews. I used SVM and logistic regression. There are many other ways to build the model, but I chose these two to begin with. With these models I was able to achieve 89–90% accuracy.
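A minimal sketch of this training step with sklearn is shown below. Everything specific in it is an assumption: the toy reviews are made up, CountVectorizer with ngram_range=(1, 2) stands in for the feature extraction, and LinearSVC stands in for the SVM. It only shows the shape of the code and will not reproduce the accuracy reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy data: 1 = positive review, 0 = negative review.
train_texts = [
    "great movie loved it",
    "not bad at all",
    "wonderful acting and story",
    "bad movie hated it",
    "not good at all",
    "terrible acting and story",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["not bad", "not good"]

# One pipeline per classifier: n-gram counts followed by the model.
models = {
    "logistic_regression": make_pipeline(
        CountVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000)
    ),
    "linear_svm": make_pipeline(
        CountVectorizer(ngram_range=(1, 2)), LinearSVC()
    ),
}

predictions = {}
for name, model in models.items():
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(test_texts)
    print(name, predictions[name])
```

On the real dataset, the reviews would first go through the preprocessing step and be split into training and test sets before computing accuracy.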