CSE 5334 Main Project: General Sentiment Classifier


This is a blog post for the Main Project of my Data Mining class.

Kaggle Notebook can be found @Kaggle Notebook

Jupyter Notebook can be found @GitHub repository

Proposal can be found @GitHub repository

Previously, in Assignment 3, we implemented a basic Naive Bayes classifier that achieved a base accuracy of around 70% on IMDB movie reviews only. We did not really establish a baseline for comparison, nor did we try it on other sentiment datasets such as Amazon and Yelp.

This time, we will see how our Naive Bayes classifier reached a base accuracy of about 80% on mixed reviews. The classifier does not care what kind of review it is (food review, movie review, product review, etc.); it only focuses on whether the review is positive or negative.

We will also compare our classifier with SKLearn’s classifiers, such as SVM, Logistic Regression, Gaussian and Multinomial Naive Bayes.

Mixed Dataset

I simply concatenated all the reviews (from Amazon, IMDB, and Yelp) into one combined dataset of 3000 reviews. The label distribution, remarkably, is still almost balanced, even after splitting into randomized train and test sets.
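The concatenate-then-split step can be sketched roughly as follows. This is only an illustration: the toy reviews, the `split` helper, and the seed are assumptions of mine, not the notebook's actual code. The 80/20 ratio is chosen to match the 600-review test set reported later.

```python
import random
from collections import Counter

# Toy stand-ins for the three sources; each item is (review, label), 1 = positive.
amazon = [("great phone", 1), ("broke in a week", 0)]
imdb = [("wonderful film", 1), ("boring plot", 0)]
yelp = [("tasty food", 1), ("rude staff", 0)]

def split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

mixed = (amazon + imdb + yelp) * 50       # 300 rows, perfectly balanced
train, test = split(mixed)

# Count labels in each part to check that the split stayed roughly balanced.
train_counts = Counter(label for _, label in train)
test_counts = Counter(label for _, label in test)
print(train_counts, test_counts)
```

A plain shuffle does not guarantee identical label ratios in both parts, which is why the counts are worth printing after every split.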

[Figure: train set label distribution]

Naive Bayes with Expanded Lexicon

From [1], we expanded our list of positive and negative words: instead of the 20 words from the previous assignment that we were absolutely certain were positive, we now have a few thousand. This makes our naive likelihood estimate much more accurate, resulting in a much better base accuracy of almost exactly 80% (averaged over about 20 runs).

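from sklearn.metrics import accuracy_score

# Train on the mixed training set, then predict a label for each test review.
bnbc = BetterNaiveBayesClassifier(train_set)
predictions = bnbc.predict(test_set['review'])
# accuracy_score expects (y_true, y_pred); accuracy itself is symmetric.
acc = accuracy_score(test_set['label'], predictions)
print(f'Accuracy of Better Naive Bayes is {acc*100.0:.3f}%')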

# OUTPUT - of this particular run
Accuracy of Better Naive Bayes is 77.500%
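The internals of BetterNaiveBayesClassifier are not shown in this post, so the following is only a sketch of one way a sentiment lexicon could tilt a word-count Naive Bayes. The class name `LexiconNaiveBayes`, the `boost` pseudo-count, and the toy reviews are all hypothetical.

```python
import math
from collections import Counter

class LexiconNaiveBayes:
    """Word-count Naive Bayes whose counts are seeded by a sentiment lexicon (hypothetical)."""

    def __init__(self, pos_words, neg_words, boost=5):
        # Lexicon words start with `boost` pseudo-counts in their own class.
        self.counts = {1: Counter({w: boost for w in pos_words}),
                       0: Counter({w: boost for w in neg_words})}
        self.docs = Counter()

    def fit(self, reviews, labels):
        for review, label in zip(reviews, labels):
            self.docs[label] += 1
            self.counts[label].update(review.lower().split())
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def _log_score(self, words, label):
        total = sum(self.counts[label].values())
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for w in words:
            # Laplace (add-one) smoothing so unseen words never give log(0).
            score += math.log((self.counts[label][w] + 1) / (total + len(self.vocab)))
        return score

    def predict(self, reviews):
        out = []
        for review in reviews:
            words = review.lower().split()
            out.append(max((0, 1), key=lambda c: self._log_score(words, c)))
        return out

clf = LexiconNaiveBayes({"great", "love"}, {"awful", "terrible"})
clf.fit(["i love this great film", "awful and boring food"], [1, 0])
print(clf.predict(["great service", "terrible screen"]))  # [1, 0]
```

Seeding the counts means a lexicon word pulls a review toward its class even when the training data rarely contains it, which is the "tilting" idea in its simplest form.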

SKLearn Models

Next are the models implemented in SKLearn, to see how ours compares to theirs.

Vectorized Preprocessing

To simplify the process (and follow their TF-IDF tutorial, really), I vectorized the input. Of course, I also tried to “improve” it.

# Original Vectorizer
vectorizer = TfidfVectorizer(min_df = 5,
                             max_df = 0.8,
                             sublinear_tf = True,
                             use_idf = True)
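To make the vectorizer options more concrete, here is a simplified pure-Python sketch of the weighting. It follows the formulas documented for TfidfVectorizer with its default smoothed IDF and L2 normalisation, but it uses plain whitespace tokenization and is not the library's implementation. The function name `tfidf` and the toy documents are my own.

```python
import math
from collections import Counter

def tfidf(docs, min_df=1, max_df=1.0):
    """L2-normalised TF-IDF with sublinear tf, one {term: weight} dict per doc."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Keep terms seen in at least min_df documents and at most max_df * n of them.
    vocab = {t for t, d in df.items() if d >= min_df and d <= max_df * n}
    rows = []
    for tokens in tokenized:
        tf = Counter(t for t in tokens if t in vocab)
        # Sublinear tf: 1 + ln(tf); smoothed idf: ln((1 + n) / (1 + df)) + 1.
        row = {t: (1 + math.log(c)) * (math.log((1 + n) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        rows.append({t: w / norm for t, w in row.items()})
    return rows

docs = ["good good movie", "bad movie", "good food", "bad bad bad service"]
weights = tfidf(docs, min_df=2, max_df=0.8)
print(weights[0])
```

With `min_df=2`, the words "food" and "service" are dropped because each appears in only one document. With sublinear tf, the repeated "good" in the first review counts for 1 + ln(2) rather than 2.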

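The tokenizer also applies stemming and lemmatization (from nltk).

from string import punctuation, digits

# stemmer and lemmatizer are assumed to be nltk instances defined elsewhere.
def tokenizer(review):
    # Strip punctuation and digits, lowercase, and keep alphabetic tokens only.
    words = [w for w in review.translate(str.maketrans('', '', punctuation + digits)).lower().split() if w.isalpha()]
    stems = [stemmer.stem(word) for word in words]
    lemms = [lemmatizer.lemmatize(word) for word in stems]
    # Return the stemmed and lemmatized tokens.
    return lemms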

And the new vectorizer looked like this

my_vectorizer = TfidfVectorizer(use_idf=True, 

Support Vector Machine (SVM)

Using the vectorized inputs, I trained an SVM with both the base vectorizer and my own vectorizer.

SVM_linear = svm.SVC(kernel='linear')
SVM_linear.fit(train_vectors, train_set['label'])
preds_svm_linear = SVM_linear.predict(test_vectors)

report = classification_report(test_set['label'], preds_svm_linear, output_dict=True)
print('positive: ', report['1'])
print('negative: ', report['0'])

SKLearn's classification report gives the following:

positive:  {'precision': 0.7601351351351351, 'recall': 0.78125, 'f1-score': 0.7705479452054795, 'support': 288}
negative:  {'precision': 0.7927631578947368, 'recall': 0.7724358974358975, 'f1-score': 0.7824675324675324, 'support': 312}

When run again using my version of the vectorized inputs, the results are:

positive:  {'precision': 0.7763157894736842, 'recall': 0.8194444444444444, 'f1-score': 0.7972972972972973, 'support': 288}
negative:  {'precision': 0.8243243243243243, 'recall': 0.782051282051282, 'f1-score': 0.8026315789473683, 'support': 312}

So while it is slightly better, it is hard to say whether there is any real improvement. So far, the clearest advantage I can see is the training time of about 1 second. Compared to about a minute for my Naive Bayes implementation, accuracy that is a few percentage points lower is definitely worth the short training time.
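As a rough sanity check on whether the gap is meaningful, overall accuracy can be recovered from the per-class recall and support in the two SVM reports above. The helper names below are illustrative. The binomial standard error is a back-of-the-envelope assumption that ignores the fact that both runs share one test set.

```python
import math

def correct_count(report_row):
    """Correctly classified reviews for one class: recall * support, rounded."""
    return round(report_row["recall"] * report_row["support"])

def accuracy(pos_row, neg_row):
    """Overall accuracy from the positive and negative rows of a report."""
    total = pos_row["support"] + neg_row["support"]
    return (correct_count(pos_row) + correct_count(neg_row)) / total

# Recall and support copied from the two SVM reports above.
base = accuracy({"recall": 0.78125, "support": 288},
                {"recall": 0.7724358974358975, "support": 312})
mine = accuracy({"recall": 0.8194444444444444, "support": 288},
                {"recall": 0.782051282051282, "support": 312})

# Rough binomial standard error of an accuracy measured on n test reviews.
n = 600
std_err = math.sqrt(mine * (1 - mine) / n)
print(f"base={base:.4f} mine={mine:.4f} diff={mine - base:.4f} se={std_err:.4f}")
```

This works out to roughly 77.7% versus 80.0%, a gap of about 2.3 points against a standard error of about 1.6 points, which fits the feeling that the improvement is hard to call.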

Logistic Regression

In a similar manner to SVM, I also wanted to see if Logistic Regression is any better. (Spoiler: NO.)

regressor = LogisticRegression(random_state = 0)
regressor.fit(train_vectors, train_set['label'])
preds_logis_regres = regressor.predict(test_vectors)

report = classification_report(test_set['label'], preds_logis_regres, output_dict=True)
print('positive: ', report['1'])
print('negative: ', report['0'])

positive:  {'precision': 0.7985865724381626, 'recall': 0.7847222222222222, 'f1-score': 0.7915936952714535, 'support': 288}
negative:  {'precision': 0.804416403785489, 'recall': 0.8173076923076923, 'f1-score': 0.8108108108108107, 'support': 312}

The numbers look suspiciously close to the SVM's, but I am not sure why. I was hoping for some discrepancy, but apparently, in practice, the two are quite similar for our use case.
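One possible explanation, offered tentatively: a linear-kernel SVM and logistic regression both learn a linear decision function over the same TF-IDF features, and they differ mainly in the loss they minimise (hinge loss versus logistic loss). The sketch below compares the two losses at a few margins. The function names are mine, and it says nothing about the regularisation settings used in the notebook.

```python
import math

def hinge_loss(margin):
    """Loss used by a linear SVM; zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Loss used by logistic regression; never exactly zero."""
    return math.log(1.0 + math.exp(-margin))

# margin = y * (w . x + b), with y in {-1, +1}; positive means correctly classified.
for margin in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"margin={margin:+.1f}  hinge={hinge_loss(margin):.3f}  "
          f"logistic={logistic_loss(margin):.3f}")
```

Both losses penalise misclassified reviews heavily and confident correct ones very little, so on the same features it is not too surprising that the two models end up with similar decision boundaries.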

SKLearn Naive Bayes

Finally, the question is

Is our Naive Bayes Classifier Better?

And the short answer is


Gaussian NB

Same training manner. Different results. Actually worse than ours.

gnb = GaussianNB()
gnb.fit(train_vectors.toarray(), train_set['label'])
preds = gnb.predict(test_vectors.toarray())

report = classification_report(test_set['label'], preds, output_dict=True)
print('positive: ', report['1'])
print('negative: ', report['0'])

positive:  {'precision': 0.6863636363636364, 'recall': 0.5243055555555556, 'f1-score': 0.594488188976378, 'support': 288}
negative:  {'precision': 0.6394736842105263, 'recall': 0.7788461538461539, 'f1-score': 0.7023121387283237, 'support': 312}

Multinomial NB

Actually quite close to both SVM and Logistic Regression.

mnb = MultinomialNB()
mnb.fit(train_vectors.toarray(), train_set['label'])
preds = mnb.predict(test_vectors.toarray())

report = classification_report(test_set['label'], preds, output_dict=True)
print('positive: ', report['1'])
print('negative: ', report['0'])

positive:  {'precision': 0.7678018575851393, 'recall': 0.8611111111111112, 'f1-score': 0.8117839607201309, 'support': 288}
negative:  {'precision': 0.855595667870036, 'recall': 0.7596153846153846, 'f1-score': 0.8047538200339558, 'support': 312}

I don’t quite understand how my method of counting words and “tilting” the probability in favor of sentiment-labeled (positive and negative) words compares or relates to vectorizing with TF-IDF. But it’s still fairly accurate for its purpose.

There is also a live demo where you can paste external text for the Naive Bayes classifier to predict. For the most part, it is quite accurate (100% accuracy so far) on some copy-pasted Amazon reviews.

Performance Comparison and Consideration

While my model’s base performance is relatively good (80% accuracy on the test set), it took a while to train. In comparison, the SKLearn models took about 1 second each, for roughly similar results.

There is something to be said about the shuffling of the data, how limited the original dataset is, and how much more limited it becomes after filtering. There is also a problem with numeric reviews like “10/10”, which my method completely ignores.
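The "10/10" problem is easy to reproduce, assuming the Naive Bayes preprocessing strips digits and punctuation the same way as the tokenizer shown earlier. The helper name `strip_and_split` is mine.

```python
from string import digits, punctuation

def strip_and_split(review):
    """Same cleaning as the tokenizer's first step: drop punctuation and digits."""
    table = str.maketrans('', '', punctuation + digits)
    return [w for w in review.translate(table).lower().split() if w.isalpha()]

print(strip_and_split("10/10"))                    # []
print(strip_and_split("10/10, would buy again!"))  # ['would', 'buy', 'again']
```

A review that consists only of a score leaves no tokens at all, so the classifier has nothing but the class prior to go on.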

Pickling for Later

The best models (mine and the SVM) are pickled and relatively easy to use. Mine just takes a list of sentences and outputs 0 for negative and 1 for positive. If you want to try it yourself, just load it with pickle.

import pickle

# Use context managers so the files are flushed and closed after writing.
with open('better_naive_bayes_classifier.pickle', 'wb') as f:
    pickle.dump(bnbc, f)
# SVM_linear is the linear SVC trained above.
with open('SVM_80_acc.pickle', 'wb') as f:
    pickle.dump(SVM_linear, f)

This way, to deploy it, you can just load it using pickle, and pass any sentence to it.
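The loading side can be sketched as a round trip. `TinyClassifier` is a hypothetical stand-in for the real classifier so that the example runs on its own, and the temporary file path is also just for illustration.

```python
import os
import pickle
import tempfile

class TinyClassifier:
    """Hypothetical stand-in: predicts 1 if a review contains 'good', else 0."""
    def predict(self, reviews):
        return [1 if "good" in r.lower() else 0 for r in reviews]

path = os.path.join(tempfile.mkdtemp(), "tiny_classifier.pickle")

# Save: the context manager closes the file even if dump() raises.
with open(path, "wb") as f:
    pickle.dump(TinyClassifier(), f)

# Load: the class definition must be importable where load() runs.
with open(path, "rb") as f:
    model = pickle.load(f)

print(model.predict(["Really good phone", "Stopped working"]))  # [1, 0]
```

Two things to keep in mind: the class that was pickled has to be defined or importable in the program that loads it, and pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.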


[1] Opinion Mining, Sentiment Analysis, and Opinion Spam Detection
[2] Sentiment Classification
[3] SKLearn Feature Extraction
[4] SKLearn Naive Bayes
[5] SKLearn Logistic Regression
[6] SKLearn SVM
[7] Python Pickle