Sentiment analysis of customer reviews

In this project, I perform sentiment analysis on customer reviews of an e-commerce women's clothing brand. After cleaning and pre-processing the reviews, I vectorised them using two approaches, bag-of-words and TF-IDF. I then trained four classification models on the vectorised reviews to determine the best-performing model.

Data

  • The data for this project was taken from Kaggle.
  • The dataset is labelled with two classes: "Recommended: 1" and "Not Recommended: 0".

Pre-processing

    Each available feature was cleaned and analysed, and its distribution was examined closely using histograms split by the recommended and non-recommended labels.
    The data was then checked for duplicates and missing values.

    To clean each review, the content of the review was:

  • Tokenized
  • Stripped of any usernames (starting with @) and punctuation
  • Lemmatized
  • Stripped of rare words and stop words
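
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the stop-word set, the `LEMMAS` lookup table (a stand-in for a real lemmatizer), the `min_count` threshold, and the sample reviews are all assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative stand-ins (not from the project): a tiny stop-word set and a
# lookup-table "lemmatizer". A real pipeline would use an NLP library for both.
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "this", "i", "but", "are"}
LEMMAS = {"dresses": "dress", "fits": "fit", "loved": "love", "colors": "color"}


def clean_review(text):
    """Tokenize, drop @usernames and punctuation, lemmatize, drop stop words."""
    text = re.sub(r"@\w+", " ", text.lower())     # remove usernames
    tokens = re.findall(r"[a-z']+", text)          # tokenize, dropping punctuation
    tokens = [LEMMAS.get(t, t) for t in tokens]    # lemmatize via the lookup table
    return [t for t in tokens if t not in STOP_WORDS]


def remove_rare_words(docs, min_count=2):
    """Drop tokens that occur fewer than min_count times across all reviews."""
    counts = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs]


# Hypothetical sample reviews
reviews = [
    "@anna I loved this dress, it fits!",
    "The dresses are lovely but the colors fade.",
    "This dress is a great fit.",
]
cleaned = remove_rare_words([clean_review(r) for r in reviews])
print(cleaned)
```

Rare words are removed last because the counts are only meaningful once every review has been tokenized and lemmatized.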

I then generated word clouds for both recommended and non-recommended reviews to check for words carrying strong emotion.
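
A word cloud is essentially a picture of per-class word frequencies, so the underlying step can be sketched as below. The sample tokens and the helper name `word_frequencies` are hypothetical and not taken from the project. The resulting counts could then be passed to a plotting library (for example, the `wordcloud` package accepts a frequency mapping); which library the project used is not stated here.

```python
from collections import Counter

# Hypothetical cleaned reviews paired with their label
# (1 = recommended, 0 = not recommended); not taken from the dataset.
labelled_reviews = [
    (["love", "dress", "perfect", "fit"], 1),
    (["perfect", "colour", "love"], 1),
    (["poor", "fabric", "return"], 0),
    (["return", "dress", "small"], 0),
]


def word_frequencies(reviews, label):
    """Count how often each word appears in reviews carrying the given label."""
    counts = Counter()
    for tokens, review_label in reviews:
        if review_label == label:
            counts.update(tokens)
    return counts


recommended = word_frequencies(labelled_reviews, 1)
not_recommended = word_frequencies(labelled_reviews, 0)
print(recommended.most_common(3))
print(not_recommended.most_common(3))
```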

Application of classification models

  • To train and evaluate the ML models, the data was split into training and test sets.
  • The training data was converted into a set of vectors using two different methods:
    1. Count Vectorizer (bag-of-words model)
    2. TF-IDF Vectorizer
  • Each of these sets of training data was then used to train four ML classification models: Logistic Regression, Multinomial Naive Bayes, Support Vector Machine (SVM), and Random Forest.
  • K-fold cross-validation was performed on each model, with several metrics evaluated on the test data.
  • F1 score, recall, and average precision were calculated for each test set.
  • The means of the F1 score, recall, and average precision were used to determine the best-performing model, which turned out to be Logistic Regression on all three metrics.
  • The average scores for each of these models were visualised.
  • Prediction for some sample reviews was performed using these models.
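
To make the difference between the two vectorisation methods concrete, here is a small from-scratch sketch. It is an illustration under stated assumptions rather than the project's code: the three token lists are made up, and the TF-IDF weighting is the plain textbook form, count × log(N / df). Library implementations differ in detail; scikit-learn's `TfidfVectorizer`, for instance, applies a smoothed IDF and L2 normalisation by default.

```python
import math
from collections import Counter

# Hypothetical cleaned reviews (lists of tokens)
docs = [
    ["dress", "fit", "love"],
    ["dress", "return"],
    ["dress", "fit", "fit"],
]

vocab = sorted({t for doc in docs for t in doc})  # fixed column order


def bag_of_words(doc):
    """Raw term counts, one column per vocabulary word."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]


def tf_idf(doc):
    """Term count times log(N / document frequency), the textbook weighting."""
    counts = Counter(doc)
    n_docs = len(docs)
    row = []
    for w in vocab:
        df = sum(1 for d in docs if w in d)       # number of documents containing w
        row.append(counts[w] * math.log(n_docs / df))
    return row


bow = [bag_of_words(d) for d in docs]
tfidf = [tf_idf(d) for d in docs]
print(vocab)
print(bow)
print(tfidf)
```

In this toy corpus "dress" appears in every review, so its TF-IDF weight is zero, while bag-of-words still counts it. That down-weighting of ubiquitous words is the main practical difference between the two representations.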

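The evaluation loop described above (K-fold splits, per-fold scores, then the mean) can be sketched as follows. This is a simplified, standard-library illustration: the labels are hypothetical, the splits are contiguous and unshuffled, and `majority_class` is only a placeholder standing in for the four real classifiers. Average precision is left out because it needs ranked prediction scores rather than hard labels.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def recall_and_f1(y_true, y_pred):
    """Recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1


def majority_class(train_labels):
    """Placeholder 'model': always predicts the most common training label."""
    return max(set(train_labels), key=train_labels.count)


labels = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]  # hypothetical labels
scores = []
for train, test in k_fold_indices(len(labels), k=5):
    prediction = majority_class([labels[i] for i in train])
    y_true = [labels[i] for i in test]
    scores.append(recall_and_f1(y_true, [prediction] * len(test)))

# Average the per-fold scores, as done when comparing models
mean_recall = sum(r for r, _ in scores) / len(scores)
mean_f1 = sum(f for _, f in scores) / len(scores)
print(mean_recall, mean_f1)
```

The placeholder reaches perfect recall simply by always predicting "recommended", which is one reason to compare models on several metrics rather than on recall alone.
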
  • Link to GitHub Repository