Sentiment Analysis Using BERT

AmbitiousSoul
4 min read · Mar 2, 2022

Nearly 9 in 10 consumers rely on online reviews, and positive reviews help boost companies’ sales. Ensuring that customers are satisfied with a product or service is one of the most important tasks for any business, which makes reading reviews as valuable as writing them. However, many businesses lack the budget and time to assess every review and respond promptly. In this post, we introduce and apply some techniques that help companies analyze reviews efficiently.

The data is from TripAdvisor, the world’s largest travel website. We use a Kaggle dataset of hotel reviews from 2020. You can download the data from Kaggle and follow this post step by step.

The dataset consists of two columns: Review (the review text) and Rating (1–5).

The goal is to use Bidirectional Encoder Representations from Transformers (BERT) for the sentiment analysis.

Step 1: We first install the transformers library and import all the necessary packages.
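The import figure did not survive, but a minimal set of packages covering the rest of this walkthrough might look like this (exact versions may differ in your environment):

```python
# Install the transformers library first (run once):
#   pip install transformers

import numpy as np
import pandas as pd
import torch
import transformers as ppb  # DistilBERT model and tokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
```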

Figure 1: importing packages

Step 2: We then import the data. We can use df.head() to look at the first five rows of the data frame and see what the data looks like. In this step, we also create a binary variable called label by splitting the reviews into two classes based on their ratings: ratings of 3 and above get label 1, and ratings below 3 get label 0.
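A small sketch of this step (the Kaggle file name is an assumption; a tiny stand-in frame with the same two columns is used here so the labeling rule is easy to see):

```python
import pandas as pd

# Load the Kaggle CSV (file name is an assumption -- adjust to your download):
# df = pd.read_csv("tripadvisor_hotel_reviews.csv")

# A tiny stand-in with the same two columns, to illustrate the labeling rule:
df = pd.DataFrame({
    "Review": ["great stay, friendly staff", "dirty room, very noisy"],
    "Rating": [5, 1],
})

# Binary label: ratings of 3 and above -> 1, below 3 -> 0
df["label"] = (df["Rating"] >= 3).astype(int)
print(df.head())
```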

To ease our work, we start with the first 500 reviews and call this subset batch_1.
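Slicing the subset and checking its class balance could look like this (a sketch; the stand-in ratings below are illustrative, standing in for the labeled frame from Step 2):

```python
import pandas as pd

# Stand-in for the labeled data frame built in Step 2 (illustrative ratings)
df = pd.DataFrame({"Rating": [5, 1, 4, 2] * 250})
df["label"] = (df["Rating"] >= 3).astype(int)

# Work with the first 500 reviews to keep things fast
batch_1 = df[:500]

# Check the class balance of the binary label
print(batch_1["label"].value_counts())
```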

The result above shows that the classes are balanced, so we do not need to rebalance the data before modeling.

Step 3: In this step, we import the pre-trained DistilBERT model together with its tokenizer.

Step 4: Here, we tokenize and process all sentences together as a batch. I have printed the tokenized values; as you can see, the output is a list of token IDs for each review.

Step 5: Do the padding. The dataset is currently a list of lists of token IDs, each with a different length. Before DistilBERT can process this as input, we need to make all the vectors the same size by padding the shorter sentences with 0s. We set the maximum review length to 250 tokens: reviews longer than this are truncated, and shorter ones are padded up to a common length (the padding is shown as 0).
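The truncate-then-pad step can be sketched with plain NumPy (the token IDs below are illustrative stand-ins for the tokenizer output):

```python
import numpy as np

# Token-ID lists of different lengths, as produced by the tokenizer
tokenized = [[101, 2307, 2994, 102],
             [101, 6530, 2282, 2200, 13459, 102]]

# Truncate to the chosen maximum length, then pad shorter rows with 0
max_len = 250
tokenized = [ids[:max_len] for ids in tokenized]
longest = max(len(ids) for ids in tokenized)
padded = np.array([ids + [0] * (longest - len(ids)) for ids in tokenized])
print(padded.shape)  # one fixed-size row per review
```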

Step 6: Build the attention mask. The attention mask has the same shape as the padded matrix: it is an array of 1s (real tokens) and 0s (pad tokens), so the model knows to ignore the padding.
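A minimal sketch, reusing the illustrative padded matrix from the previous step:

```python
import numpy as np

# Padded token-ID matrix from the previous step (0 = padding)
padded = np.array([[101, 2307, 2994, 102, 0, 0],
                   [101, 6530, 2282, 2200, 13459, 102]])

# 1 wherever there is a real token, 0 wherever there is padding
attention_mask = np.where(padded != 0, 1, 0)
print(attention_mask)
```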

Step 7: We now create an input tensor out of the padded token matrix and send it, together with the attention mask, through DistilBERT.
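A sketch of the forward pass, again with the illustrative padded matrix (the output for the first token of each review, the [CLS] position, serves as a 768-dimensional sentence embedding):

```python
import numpy as np
import torch
import transformers as ppb

model = ppb.DistilBertModel.from_pretrained("distilbert-base-uncased")

# Padded IDs and attention mask from the previous steps (illustrative values)
padded = np.array([[101, 2307, 2994, 102, 0, 0],
                   [101, 6530, 2282, 2200, 13459, 102]])
attention_mask = np.where(padded != 0, 1, 0)

input_ids = torch.tensor(padded)
mask = torch.tensor(attention_mask)

# No gradients needed: we only use DistilBERT as a feature extractor
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=mask)

# Keep only the [CLS] output for each review as its sentence embedding
features = last_hidden_states[0][:, 0, :].numpy()
print(features.shape)
```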

Step 8: Divide the data into train and test sets, then fit a logistic regression on the DistilBERT features to evaluate the model’s performance.
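A sketch of the final classifier. Synthetic features and labels stand in here for the DistilBERT embeddings from Step 7 and the labels from Step 2, so the snippet runs on its own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-ins for the DistilBERT [CLS] features and the binary labels
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 768))
labels = (features[:, 0] > 0).astype(int)  # synthetic, separable labels

# Hold out a test set to evaluate generalization
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, random_state=0
)

lr_clf = LogisticRegression(max_iter=1000)
lr_clf.fit(train_features, train_labels)
print("accuracy:", lr_clf.score(test_features, test_labels))
```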

Conclusion: the accuracy of our model is 0.8, which is pretty good. Please comment under this post if you have feedback.

In my next post, I will run this model on a bigger dataset and add more detail to the model.
