Fine-Tuning Language Models for Sentiment Analysis

Scott Duda

Sentiment analysis refers to the classification of a sample of text based on the sentiment or opinion it expresses. Any text we write carries encoded information that conveys the attitude or feelings of the writer to the reader. That information is “encoded” in our shared understanding of how language is used as a tool of expression, and while this may seem like a very subjective concept, the goal of applying machine learning and natural language processing to sentiment categorization is to turn it into an objective task.

At its most basic level, sentiment analysis is used to classify text into one of two sentiments: positive or negative. A positive statement is associated with positive feelings or happiness, while a negative statement indicates the presence of negativity or anger/sadness. Often, a third class (neutral) may be used as well to classify statements that do not convey positive or negative emotions. Some statements simply convey information and do not necessarily contain any strong emotional context.

Sentiment analysis can be used in a wide array of applications. Companies across a range of industries use text product reviews and survey responses to analyze the sentiment of their customer base, using this data to improve company performance. Investors rely heavily on objective metrics to determine how to allocate funds, but they also take subjective data provided by financial news articles and statements into consideration. Sentiment analysis can be used in the financial sector to help predict market performance using textual data, improving investment performance.

In natural language processing, sentiment analysis is often treated as a supervised learning problem that requires sets of statements whose sentiment has been identified and hand-labeled by human annotators. Labeled data is then used to train a machine learning algorithm or a neural network, such as an LSTM or GRU. Unlabeled text is fed to the model, and a sentiment classification is produced as output. We will be taking a similar approach, but rather than training a model from the ground up we will be fine-tuning a pre-trained language model using our project-specific data. Code used for this project can be found at the accompanying GitHub repository.

Project Goal & Data

This project focuses on sentiment classification for a dataset consisting of 4,840 financial phrases. The dataset was originally compiled for detecting semantic orientations within the phrases, and the resulting publication can be found here: Good Debt or Bad Debt: Detecting Semantic Orientations in Economic Texts. A copy of the dataset can be found here.

Statements included in the dataset were hand-annotated by a team of 16 researchers and master’s students at the Aalto University School of Business in Helsinki, Finland. Annotators classified each statement according to its interpreted sentiment using three possible sentiment labels: positive, negative, or neutral.

Raw text data were cleaned by removing extraneous spaces. Punctuation was maintained since the pre-trained transformer models that we will be fine-tuning are able to handle punctuation during the tokenization process. Removal of punctuation was evaluated, but it did not have a significant impact on prediction accuracy. The length of each statement ranged from a minimum of one word to a maximum of 52 words.

The graph below illustrates the number of examples for each sample class in our dataset. The dataset has over five times as many neutral statements as negative statements and twice as many neutral statements as positive statements. This class imbalance will be accounted for by using stratification when splitting our data into train/validation/test sets or into folds for cross validation. Stratification attempts to mitigate the effects of class imbalance by sampling to preserve class frequency.

Transformers and Language Modeling

The Transformer model architecture was initially proposed in the paper Attention is All You Need as a method for performing machine translation tasks, and it has since been adapted to create high-power language models for other applications. Prior to the development of the Transformer architecture, language models typically implemented recurrent mechanisms such as LSTMs. Transformers eliminate the use of recurrent methods and rely instead on attention mechanisms for modeling dependencies between input and output sequences.

Attention mechanisms allow a model to relate different positions within a sequence to one another. The Transformer architecture implements attention mechanisms in the absence of recurrent mechanisms to establish dependencies between input and output sequences. Attention mechanisms use an alignment score function to quantify the relevance of each item in a list to another item. A higher score indicates a greater degree of relevance, and the model can use this score to take context information from sequence items with higher relevance when developing predictions. There are a number of different types of attention mechanisms that have been developed, but Transformers make use of scaled dot-product attention.
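As a rough illustration, the sketch below implements scaled dot-product attention for a toy batch of query, key, and value tensors, following the formulation from Attention is All You Need; it is a standalone example rather than code from any of the models discussed here.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: tensors of shape (batch, seq_len, d_k)."""
    d_k = q.size(-1)
    # Alignment scores quantify the relevance of each key to each query
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    # Softmax converts scores into weights that sum to 1 over the sequence
    weights = F.softmax(scores, dim=-1)
    # Each output position is a relevance-weighted sum of the values
    return torch.matmul(weights, v), weights

# Toy example: one sequence of 5 tokens with 64-dimensional representations
q = k = v = torch.randn(1, 5, 64)
output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape, weights.shape)  # torch.Size([1, 5, 64]) torch.Size([1, 5, 5])
```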

BERT

The BERT model was developed by a Google research team and was originally described in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. BERT (Bidirectional Encoder Representations from Transformers) is designed to be used as a pre-trained model that can be fine-tuned. By applying additional output layers to the pre-trained model, users can create models that perform a variety of language modeling tasks, including question answering, language inference and sentiment analysis. The model can be trained using problem-specific data to fine-tune its performance for specific applications.

Prior to BERT, pre-trained language models were built with a constraint of unidirectionality: attention could only be evaluated in one direction, since each token could only attend to the tokens preceding it in the self-attention layers of the Transformer. BERT removes this constraint by incorporating a bidirectional Transformer.

BERT’s bidirectionality is achieved through a masked language model pre-training objective. This objective randomly masks tokens in a text example and attempts to predict their original IDs based on the surrounding context. In doing so, the model learns context from both the left and right sides of the masked token, merging forward and reverse context into a single representation and adding bidirectionality to the Transformer. This type of learning task has been described previously and is known as a Cloze task.
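The masked language modeling objective is easy to see in action using the fill-mask pipeline from the transformers library; in the snippet below, the bert-base-uncased checkpoint and the example sentence are just illustrative choices, and the pipeline asks a pre-trained BERT model to predict a masked token from its left and right context.

```python
from transformers import pipeline

# BERT was pre-trained to recover the token hidden behind [MASK]
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

predictions = fill_mask("The company reported a [MASK] increase in quarterly revenue.")
for p in predictions:
    print(f"{p['token_str']:>12}  {p['score']:.3f}")
```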

BERT was pre-trained using unlabeled data and two unsupervised learning tasks: masked language modeling and next sentence prediction. The masked language modeling task was described in the previous paragraph. The next sentence prediction task was incorporated to provide the model with more information about the relationships between sentences. While training on this task, the model sees an even mix of examples in which the sentence that actually follows the first sentence in the training data is kept and examples in which a random sentence has been substituted. Training on this task increases model performance on Question Answering (QA) and Natural Language Inference (NLI) tasks.

The BERT model architecture consists of a multilayer bidirectional Transformer encoder. The encoder architecture used is identical to the encoder described in Attention is All You Need. These encoders are simply stacked on top of one another to increase model complexity.

Two sizes of BERT were pre-trained: BERT-Base (110M parameters) and BERT-Large (340M parameters). BERT-Base consists of 12 Transformer encoder blocks, a hidden size of 768, and 12 attention heads, while BERT-Large consists of 24 encoder blocks, a hidden size of 1024, and 16 attention heads. For this project, we will be fine-tuning BERT-Base.

Text used for training BERT was acquired from BookCorpus and English Wikipedia (text passages only). BookCorpus is no longer distributed, but homemade versions have been put together based on the original dataset creators’ published methods (scraping 11,308 e-books from Smashwords, each over 20,000 words).

DistilBERT

Following the release of the BERT model, researchers at Hugging Face began looking at ways of reducing the overall size of the model and increasing its speed while maintaining as much of its original language modeling capability as possible. The method that was selected for reducing overall model size is known as knowledge distillation. Knowledge distillation is performed by training a smaller model (the “student”) to reproduce the behavior of a larger model or ensemble of models (the “teacher”). The model developed using this approach is described in DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.

The architecture used for the DistilBERT “student” model differs slightly from that of the BERT “teacher” model. Token-type embeddings and pooling are removed, and the total number of layers is halved relative to BERT. Dimensionality is preserved between the two models, allowing DistilBERT model weights to be initialized directly from BERT layers.

By training the DistilBERT “student” model using a BERT “teacher” model, researchers were able to create a powerful model with a reduced file size and increased speed.

RoBERTa

Researchers at Facebook revisited the pre-training process that was used to develop BERT and identified a number of areas that could be improved. Their efforts resulted in the development of a “robustly optimized” version of BERT that they named RoBERTa. Details of how RoBERTa was developed can be found in RoBERTa: A Robustly Optimized BERT Pretraining Approach.

Modifications to the BERT pre-training process that were used to train RoBERTa included:

  • Longer model training times using larger batches and more data
  • Elimination of the next sentence prediction objective task
  • Longer sequences for training
  • Dynamically changing the masking pattern applied to the training data

Fine-Tuning a Pre-Trained Language Model

To fine-tune a pre-trained language model, a user loads the pre-trained model weights and then adds one or more task-specific layers on top of the pre-trained model to produce output suited to the task at hand. Since we are performing sentiment classification, we will be inserting a linear layer on top of the pre-trained model that outputs a score for each of the classes we are trying to predict. We will also add dropout to improve generalization.
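A minimal sketch of this setup is shown below: a pre-trained BERT encoder topped with dropout and a linear classification head that outputs one score per sentiment class. The use of the pooled [CLS] output and the default dropout probability are assumptions here, not necessarily the exact configuration used in the project code.

```python
import torch.nn as nn
from transformers import BertModel

class SentimentClassifier(nn.Module):
    def __init__(self, n_classes=3, dropout_prob=0.1):
        super().__init__()
        # Load the pre-trained BERT encoder weights
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout_prob)
        # Linear head maps the pooled [CLS] representation to class logits
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.pooler_output  # (batch_size, hidden_size)
        return self.classifier(self.dropout(pooled))
```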

Import and Clean up Data, Set Device, Set Random Seeds

First, we will import and clean up our data, set the device we will be using for model training (GPU if available, CPU if not), label encode our sentiment class labels and set values for random seeds to reduce the variability of the results each time the code is run. The documentation for PyTorch indicates that complete reproducibility is not always achievable, even using random seeds. However, setting random seed values will help limit variability.

The language models we will be fine-tuning are available as part of the transformers library from Hugging Face.
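A sketch of this setup step is shown below; the seed value, file name, and column names are illustrative assumptions rather than values taken from the project code.

```python
import random
import numpy as np
import pandas as pd
import torch
from sklearn.preprocessing import LabelEncoder

SEED = 42  # assumed value

# Seed the Python, NumPy, and PyTorch RNGs to limit run-to-run variability
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Use a GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the cleaned statements and label encode the sentiment classes
# (file name and column names are assumptions)
df = pd.read_csv("financial_phrases.csv")
df["sentiment"] = LabelEncoder().fit_transform(df["sentiment"])
```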

Set Hyperparameters, Split Data, Define Datasets & Dataloaders

Next, we will define hyperparameters that will be used for model training. These constant values will be shared between each of the three models we will be fine-tuning (BERT, DistilBERT, and RoBERTa). The longest statement in our dataset contains 52 words, so the maximum input sequence length (MAX_LENGTH) was set to 64.

Hyperparameters such as batch size, dropout probability, weight decay and learning rate were selected based on guidance provided in the paper, On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines. In addition to training a single model using a training and validation dataset, we will also be performing k-fold validation using a total of 10 folds.
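The constants below sketch how these shared settings might be defined. MAX_LENGTH follows from the 52-word maximum noted above; the remaining values are typical fine-tuning choices and are assumptions rather than the project's exact settings.

```python
MAX_LENGTH = 64       # longest statement is 52 words, rounded up to 64 tokens
BATCH_SIZE = 16       # assumed value
EPOCHS = 4            # assumed value
LEARNING_RATE = 2e-5  # assumed value
WEIGHT_DECAY = 0.01   # assumed value
DROPOUT_PROB = 0.1    # assumed value
N_FOLDS = 10          # 10-fold cross validation
```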

The full dataset was split into training (80%), validation (10%), and test (10%) datasets, and the splits were stratified based on the sentiment class frequencies observed. Stratifying the splits helps ensure that the frequency with which each class appears in the full dataset is preserved in the training/validation/test datasets.
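Using the assumed variable names from the earlier sketch, a stratified 80/10/10 split can be produced with two calls to train_test_split from scikit-learn:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data, stratifying on the sentiment label
train_df, temp_df = train_test_split(
    df, test_size=0.2, stratify=df["sentiment"], random_state=SEED
)
# Split the held-out 20% evenly into validation and test sets
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, stratify=temp_df["sentiment"], random_state=SEED
)
```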

To train our models, we will need to create PyTorch Datasets and Dataloaders to act as containers for our train/validation/test datasets. Our StatementDataset inherits from PyTorch’s Dataset module and is designed to tokenize each statement as it is accessed. Our Dataloader will be used to retrieve data from our StatementDataset and divide it into batches based on the specified batch size.
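A minimal version of such a Dataset and DataLoader might look like the sketch below; the project's actual StatementDataset may differ in its details, and the text column name is an assumption.

```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer

class StatementDataset(Dataset):
    """Tokenizes each financial statement as it is accessed."""

    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )
        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "label": torch.tensor(self.labels[idx], dtype=torch.long),
        }

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
train_dataset = StatementDataset(
    train_df["text"].tolist(), train_df["sentiment"].tolist(), tokenizer, MAX_LENGTH
)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
```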

Define Functions for Fine-Tuning and Evaluating Models

The functions presented below will be used to fine-tune and evaluate our language models. These functions have been designed to accommodate two scenarios: training a single model and training several models using k-fold cross validation. When performing k-fold cross validation, we combine our training and validation sets to create a larger dataset from which separate hold-out sets and training folds can be generated.
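A simplified sketch of the two core pieces, one training epoch and one evaluation pass, is shown below; the project's actual functions add the k-fold bookkeeping described above, so treat this as an outline rather than the exact implementation.

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer, device):
    """Run one pass over the training data and return the mean loss."""
    model.train()
    loss_fn = nn.CrossEntropyLoss()
    total_loss = 0.0
    for batch in loader:
        optimizer.zero_grad()
        logits = model(batch["input_ids"].to(device),
                       batch["attention_mask"].to(device))
        loss = loss_fn(logits, batch["label"].to(device))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

@torch.no_grad()
def evaluate(model, loader, device):
    """Return classification accuracy on the provided data."""
    model.eval()
    correct = 0
    for batch in loader:
        logits = model(batch["input_ids"].to(device),
                       batch["attention_mask"].to(device))
        correct += (logits.argmax(dim=1) == batch["label"].to(device)).sum().item()
    return correct / len(loader.dataset)
```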

Each model will also be evaluated in terms of prediction accuracy. For models trained using k-fold cross validation, the reported accuracy represents the accuracy of an ensemble model created using the models trained on each fold.

Fine-Tune BERT Model for Sentiment Classification

As an example, the code used to fine-tune the BERT language model is sketched below. Similar code was developed for fine-tuning the DistilBERT and RoBERTa models. The complete code set can be found at the GitHub repository for this project.
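Stringing the earlier sketches together, a bare-bones version of the BERT fine-tuning loop might look like the following; the optimizer choice and loop structure are assumptions, and the repository contains the full implementation.

```python
from torch.optim import AdamW

model = SentimentClassifier(n_classes=3, dropout_prob=DROPOUT_PROB).to(device)
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)

val_loader = DataLoader(
    StatementDataset(val_df["text"].tolist(), val_df["sentiment"].tolist(),
                     tokenizer, MAX_LENGTH),
    batch_size=BATCH_SIZE,
)

for epoch in range(EPOCHS):
    train_loss = train_epoch(model, train_loader, optimizer, device)
    val_acc = evaluate(model, val_loader, device)
    print(f"epoch {epoch + 1}: train loss {train_loss:.3f}, val accuracy {val_acc:.3f}")
```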

Evaluation Metrics

To evaluate how well each model performed, we will look at a summary of performance metrics created using classification_report from sklearn.metrics. This summary includes the calculated precision, recall, and F1 score for each predicted class as well as the overall accuracy of the classifier. The summary also reports the number of samples for each class in a column titled “support.” Additionally, the summary provides macro and weighted averages of the precision, recall, and F1 scores, since these values are calculated for each class when dealing with multi-class classification.
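Once test-set predictions have been collected, producing this summary takes only a couple of lines, as sketched below using the objects defined in the earlier sketches (the label ordering matches the alphabetical encoding of negative/neutral/positive).

```python
import torch
from sklearn.metrics import classification_report, confusion_matrix

test_loader = DataLoader(
    StatementDataset(test_df["text"].tolist(), test_df["sentiment"].tolist(),
                     tokenizer, MAX_LENGTH),
    batch_size=BATCH_SIZE,
)

# Collect model predictions on the held-out test set
y_true, y_pred = [], []
model.eval()
with torch.no_grad():
    for batch in test_loader:
        logits = model(batch["input_ids"].to(device),
                       batch["attention_mask"].to(device))
        y_pred.extend(logits.argmax(dim=1).cpu().tolist())
        y_true.extend(batch["label"].tolist())

print(classification_report(y_true, y_pred,
                            target_names=["negative", "neutral", "positive"]))
print(confusion_matrix(y_true, y_pred))
```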

When performing multi-class classification, precision is determined by looking at the proportion of correct classifications for each class out of the total number of predictions of that class. For example, if our classifier predicted that 10 statements in a subset of financial statement data are neutral but only 5 of those statements are correctly classified, the classifier’s precision for the neutral class would be 50% (5/10).

Recall for multi-class classification problems represents the proportion of correctly classified items within a class out of the true total number of times that class appears in a dataset. For example, if our classifier correctly predicted 5 statements as neutral but the dataset contained a total of 8 neutral statements, the recall for the neutral class would be 62.5% (5/8).

The F1 score represents the harmonic mean of the precision and recall values calculated for each class in a multi-class classification problem. The F1 score can be calculated using the following equation:
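F1 = 2 × (precision × recall) / (precision + recall)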

Results

The metrics described above were calculated for each fine-tuned language model. This includes both a single model created using each pre-trained language model as well as an ensemble created using each model trained during 10-fold cross validation.

The screenshot below shows the output of classification_report from sklearn.metrics for a single BERT model as well as a confusion matrix showing how many predictions (both correct and incorrect) fell into each class.

Model fine-tuning was repeated using DistilBERT and RoBERTa to compare performance. A summary of results from fine-tuning a single language model and from creating a 10-fold cross validation ensemble of fine-tuned models using each of the three models described in this article (BERT, DistilBERT, and RoBERTa) is shown below. The maximum value of each column is bolded.

Observed accuracies for each language model were comparable, ranging from 0.866 to 0.890. The highest observed accuracy (0.890) was produced using a 10-fold cross validation ensemble of fine-tuned BERT models. Performance of the other language models did not improve when 10-fold cross validation was implemented. However, the predictions produced by the 10-fold cross validation RoBERTa ensemble achieved the highest observed precision for negative and neutral statements as well as the highest observed recall for negative and positive statements.

Training times for the BERT and RoBERTa models were comparable, while DistilBERT trained in roughly half the time. Since the DistilBERT models performed nearly as well as their slower-training counterparts, faster training and a lighter-weight model may be a crucial advantage for some applications.

Thanks for Reading!

Please give this article some claps if you found it interesting or useful! Code used for this project can be found at this GitHub repository.

References

  1. Devlin, J., Chang, M., Lee, K., & Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
  2. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  3. Malo, P., Sinha, A., Korhonen, P., Wallenius, J., & Takala, P. 2014. Good debt or bad debt: Detecting semantic orientations in economic texts. Journal of the Association for Information Science and Technology, 65(4), 782–796.
  4. Sanh, V., Debut, L., Chaumond, J., & Wolf, T. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, Vol. abs/1910.01108.
  5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L., & Polosukhin, I. 2017. Attention Is All You Need. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
