CSE 5334 Naive Bayes Sentiment Classifier

This is a blog post for Assignment 3 from my Data Mining class.

The Jupyter Notebook can be found here.

Introduction

The goal of this assignment is to learn about the Naive Bayes Classifier (NBC).

I don’t think I can explain it better than that. With that being said, let’s figure out the Naive Bayes Classifier. To be more specific, we are using a Naive (or Naïve, sorry!) Bayes Classifier to classify reviews by their labeled sentiment: 1 for positive and 0 for negative.

Naive Bayes Classifier

To understand the Naive Bayes Classifier, we can simply look at why it’s called the Naive Bayes Classifier.

There are three parts, as suggested by the name:

  • Naive - the assumption (keyword alert!) that the classifier makes so it can apply Bayes Theorem.
  • Bayes - as in Bayes Theorem, a method of using prior knowledge to predict posterior probabilities.
  • Classifier - an algorithm that organizes data into appropriate categories depending on their “features”.

Bayes Theorem

To understand Bayes Theorem (briefly), you first need to understand conditional probability, and how it behaves when events are independent.

\[P(A \vert B)P(B) = P(A \cap B) = P(B \cap A) = P(B \vert A)P(A)\]

Rearranging, we get Bayes Theorem!

\[P(A \vert B) = \frac{P(B \vert A)P(A)}{P(B)}\]

But I’ll replace A and B with H and E, for Hypothesis and Evidence:

\[P(H \vert E) = \frac{P(E \vert H)P(H)}{P(E)}\]

where

  • \(P(H \vert E)\) is the posterior, the probability of hypothesis H given evidence E
  • \(P(E \vert H)\) is the likelihood, the probability of evidence E given hypothesis H
  • \(P(H)\) is the prior, the probability of hypothesis H
  • \(P(E)\) is the probability of evidence E (also known as marginal probability, which \(P(H)\) is as well)
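
As a quick sanity check with made-up numbers: suppose \(P(H) = 0.3\), \(P(E \vert H) = 0.8\), and \(P(E) = 0.6\). Then

\[P(H \vert E) = \frac{0.8 \times 0.3}{0.6} = 0.4\]

so observing the evidence raises our belief in the hypothesis from 0.3 to 0.4.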

The extended case allows us to incorporate multiple pieces of evidence, which is what makes this scale.

\[P(H \vert E_1, E_2, \dots, E_n) = \frac{P(E_1, E_2, \dots, E_n \vert H)P(H)}{P(E_1, E_2, \dots, E_n)}\]

We can get the prior, but how do we get the probability of multiple pieces of evidence appearing (or being true) at the same time?

Naive Assumption

The assumption is that all pieces of evidence are conditionally independent given the hypothesis, which greatly simplifies the calculation of our joint evidence. It is naive simply because this assumption may or may not actually hold, but we take it to be true anyway. Mathematically, it does this:

\[P(E_1, E_2, \dots, E_n \vert H) = P(E_1 \vert H)P(E_2 \vert H) \dots P(E_n \vert H)\]

Substituting this into the numerator gives us

\[P(H \vert E_1, E_2, \dots, E_n) = \frac{P(E_1 \vert H)P(E_2 \vert H) \dots P(E_n \vert H)P(H)}{P(E_1, E_2, \dots, E_n)}\]

But that doesn’t solve the denominator, does it? Well, we have a way around that too, which we’ll come to later.

Putting It All Together

Let’s apply it to our actual movie review classification task. Given a review R, classify it into a label \(\in \{0, 1\}\) for negative and positive, respectively. For the math, I’ll write \((+)\) for positive and \((-)\) for negative for readability, and call the label S (short for sentiment).

Review R: “this movie sucks! hate it!” (we’ll be ignoring unnecessary stuff like punctuation; more on this later)

Probabilities:

\[P(\, + \vert \, \text{this movie sucks hate it}) = \frac{P(\text{this}\vert +)P(\text{movie}\vert +)P(\text{sucks}\vert +)P(\text{hate}\vert +)P(\text{it}\vert +)P(+)}{P(\text{this movie sucks hate it})}\] \[P(\, - \vert \, \text{this movie sucks hate it}) = \frac{P(\text{this}\vert -)P(\text{movie}\vert -)P(\text{sucks}\vert -)P(\text{hate}\vert -)P(\text{it}\vert -)P(-)}{P(\text{this movie sucks hate it})}\]

Notice that the denominator is the same for both, so, mathematically speaking, we only care about which numerator is larger. Hence, applying Bayes in our case simplifies to

\[S = \underset{S \in \{+, -\}}{\text{argmax}} \, P(S) \prod_{i \in \text{sentence}} P(\mathcal{w}_i \vert S)\]
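
A minimal sketch of this decision rule in Python; the priors and per-word likelihoods here are made-up numbers just to demonstrate the argmax, not values from the actual notebook:

prior = {'+': 0.5, '-': 0.5}    # P(S), assumed equal here
likelihood = {                  # P(w | S), made-up numbers
    '+': {'this': 0.05, 'movie': 0.06, 'sucks': 0.001, 'hate': 0.002, 'it': 0.05},
    '-': {'this': 0.05, 'movie': 0.06, 'sucks': 0.010, 'hate': 0.020, 'it': 0.05},
}

def classify(words):
    scores = {}
    for s in ('+', '-'):
        score = prior[s]
        for w in words:
            # multiply in P(w | S); words missing from the table are skipped here,
            # since smoothing (below) is what really handles unseen words
            score *= likelihood[s].get(w, 1.0)
        scores[s] = score
    # the denominator P(words) is the same for both, so comparing numerators is enough
    return max(scores, key=scores.get)

print(classify("this movie sucks hate it".split()))    # prints '-'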

But Wait, What About The Zero?

Q: What zero?

A: The zero when a word didn’t appear in the reviews.

Basically, if a word never appeared in one of the categories during training, then \(P(\mathcal{w}_i \vert \text{category}) = 0\) for that category, and since the numerator is a product, the entire probability collapses to 0 no matter what other evidence is there. So we don’t want zeros popping up in our calculation. To solve this, we introduce smoothing.

Smoothing

Calculating the likelihood \(P(\mathcal{w}_i \vert \text{category})\) is just counting the number of times \(\mathcal{w}_i\) appears in the category and dividing by the total number of words in that category. But if the word never appears, this probability is zero.

\[P(\mathcal{w}_i \vert \text{category}) = \frac{count(\mathcal{w}_i, \text{category})}{\sum_w count(\mathcal{w}, \text{category})} = 0 \enspace \text{if} \enspace count(\mathcal{w}_i, \text{category}) = 0\]

Smoothing adds a small count \(\alpha\) to every word, so that even if a word never appeared, its probability becomes a small nonzero number instead of zero.

\[P(\mathcal{w}_i \vert \text{category}) = \frac{count(\mathcal{w}_i, \text{category}) + \alpha}{\sum_w count(\mathcal{w}, \text{category}) + \alpha \cdot \Vert V \Vert} \neq 0 \enspace \text{if} \enspace count(\mathcal{w}_i, \text{category}) = 0\]

where \(\alpha\) is our smoothing factor, and \(\Vert V \Vert\) is the size of our vocabulary, i.e. the number of unique words across all of the reviews.
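
A minimal sketch of the smoothed likelihood, assuming we keep a word-count dictionary per category (the counts and vocabulary size below are toy values, not the real ones):

from collections import Counter

def smoothed_likelihood(word, counts, vocab_size, alpha=1.0):
    # counts: word -> number of occurrences within one category;
    # alpha is added to every count, so unseen words never get probability zero
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * vocab_size)

positive_counts = Counter({'wonderful': 3, 'movie': 5})
vocab_size = 6    # unique words across all reviews (toy value)
print(smoothed_likelihood('sucks', positive_counts, vocab_size))    # 1/14, small but not zero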

Results

As discussed before, we strip the reviews down to just words (no numbers, no punctuation), filter out some common words, derive a set of highly predictive words, apply smoothing, and test our Naive Bayes Classifier with train-dev-test cross-validation.
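
The cleaning step might look something like this; the stop-word set here is just a tiny illustrative sample, not the filter actually used in the notebook:

import re

STOP_WORDS = {'the', 'a', 'an', 'this', 'it', 'and'}    # tiny illustrative sample

def tokenize(review):
    # lowercase, keep only alphabetic words, drop the common filler words
    words = re.findall(r'[a-z]+', review.lower())
    return [w for w in words if w not in STOP_WORDS]

print(tokenize("This movie sucks! Hate it! 10/10 would not watch."))
# ['movie', 'sucks', 'hate', 'would', 'not', 'watch']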

Assignment Requirement

The assignment asks for \(P(\text{the})\) and \(P(\text{the} \vert \text{positive})\). I assumed they meant this over the entire vocabulary list, since I don’t see much need for the probability of a single word on its own.

# depending on the randomization of train_set these values can differ,
# but both always come out to approximately 0.5
P['the'] = 0.49333
P['the' | Positive] = 0.51556

Naive Bayes 5-fold accuracy without smoothing on an 80% train split gives the following:

[Figure: 5-fold cross-validation accuracy]

Applying smoothing pushes the accuracy to about 81%. Finding the best \(\alpha\) takes a few runs.

[Figure: trial runs (testing \(\alpha > 1\) and \(\alpha < 1\)) to determine the best \(\alpha\) for the highest accuracy]
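
The search itself is just a loop over candidate values, keeping whichever scores best on the dev set. A rough sketch, where score_fn is a placeholder for the actual train-and-evaluate step in the notebook:

def sweep_alpha(score_fn, candidates=(0.1, 0.25, 0.5, 1.0, 2.0, 5.0)):
    # score_fn(alpha) stands in for "train with this smoothing value and
    # return dev-set accuracy"; keep the alpha that scores best
    return max(candidates, key=score_fn)

# toy stand-in scorer: pretend accuracy peaks near alpha = 1
print(sweep_alpha(lambda a: -abs(a - 1.0)))    # 1.0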

I also derived the lists of top positive and top negative words, per the requirement. This is done by counting the occurrences of each word in both the positive and negative categories, then dividing to get the ratio of appearances between the two (a sketch of this computation follows the lists below). For example, the word “wonderful” appears about 17 times more often in the positive category than it does in the negative category.

# positive words, the number is the occurrence ratio

[('wonderful', 17.0),
 ('excellent', 14.0),
 ('beautiful', 11.0),
 ('loved', 10.0),
 ('liked', 10.0),
 ('played', 8.0),
 ('job', 8.0),
 ('game', 7.0),
 ('subtle', 7.0),
 ('makes', 7.0)]

# negative words

[('bad', 21.0),
 ('stupid', 12.0),
 ('awful', 11.0),
 ('worst', 10.0),
 ('avoid', 8.0),
 ('mess', 8.0),
 ('sucks', 8.0),
 ('horrible', 6.0),
 ('holes', 6.0),
 ('crap', 6.0)]
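
A minimal sketch of how such a ratio could be computed; the counts below are toy stand-ins for the real per-category counters:

from collections import Counter

positive_counts = Counter({'wonderful': 9, 'movie': 40, 'bad': 2})
negative_counts = Counter({'wonderful': 1, 'movie': 38, 'bad': 15})

def top_words(numer, denom, n=3, alpha=1.0):
    # ratio of how often each word appears in one category vs. the other;
    # alpha keeps the denominator nonzero for words missing from `denom`
    ratios = {w: count / (denom[w] + alpha) for w, count in numer.items()}
    return Counter(ratios).most_common(n)

print(top_words(positive_counts, negative_counts))    # 'wonderful' comes out on top
print(top_words(negative_counts, positive_counts))    # 'bad' comes out on top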

Hyper-parameter Tuning

In this case it’s just finding the best smoothing \(\alpha\) and setting the likelihood of the highly predictive words to a high value (I used 1) for their respective category, while reducing it for the opposing category. In short:

# force the highly predictive words to dominate the product:
# a top positive word gets likelihood 1 under the positive class and a
# tiny likelihood under the negative class, and vice versa
for word in positive_words:
    self.likelihood_positive[word] = 1
    self.likelihood_negative[word] = 1e-5

for word in negative_words:
    self.likelihood_positive[word] = 1e-5
    self.likelihood_negative[word] = 1

Conclusion

After tuning the probabilities using the top predictive words, and a few runs to find the best smoothing \(\alpha\), we arrive at a final accuracy of about 85%. Note that this is on a randomized train/test split. In some of the earlier trial runs I got it to spit out 92-94% accuracy, so I’d say our model is pretty good!
