
Extracting Service Quality from Online User Reviews:
A Comparison of NLP Techniques

Recently, we were tasked with an interesting Machine Learning problem. As part of a larger research project, we wanted to know whether state-level Occupational Licensing laws actually improve the quality of service as perceived by customers. To measure this effect, we collected 745,468 full-text Google reviews for 328,964 unique establishments and analyzed whether the content of those reviews could be traced back to the state laws. In particular, our task was to determine whether additional educational and training requirements are actually reflected in customers' online reviews. Answering this question required a challenging Natural Language Processing (NLP) task: classifying reviews as either technically related or service related.

Review | Classification
"Good quality work on all levels of the electrical field." | Technical
"Nice and hospitable, not to mention accommodating." | Service
"Knowledgeable and professional. When you want it done right the first time!" | Technical
"This is a great friendly place to get your hair done." | Service
"They do fast and amazing work. Would recommend anyone who needs a/c repair to use them." | Technical
"A wonderful atmosphere to have yourself pampered with awesome prices! Highly recommend!" | Service

To start with, we manually labeled a dataset of 3,971 reviews as referring to the technical or the service-oriented aspects of the business (or both, or neither). As shown below, the distribution of these labels is not severely imbalanced, so we can confidently use standard metrics such as accuracy. The reviews are relatively short (49.4 words on average) and written mostly in informal English.

Split | Technical (1) | Service (2) | Both (3) | Neither (4) | Total
Training | 955 | 400 | 1626 | 205 | 3186
Test | 207 | 109 | 419 | 50 | 785

Traditional NLP Models Perform Moderately:
Before going into Deep Learning models, we first establish a baseline performance with simpler and more traditional NLP models. When it comes to using traditional ML models such as SVMs, Logistic Regression or Random Forests on text data, the first hurdle is choosing an appropriate embedding algorithm. For instance, Logistic Regression expects a fixed-size vector for each piece of text (a review, in our case), and a naive approach of simply feeding in the characters or words usually does not work well, since those representations are neither fixed-size nor meaningful as numeric features.

A common solution is to use bag-of-words embeddings, which generate vectors based on whether words from a predefined dictionary appear in the text. A more advanced version of the bag-of-words approach is TF-IDF (Term Frequency – Inverse Document Frequency), which weights each word by its frequency both in the text and across all the samples. Using TF-IDF embeddings, we were able to try out various ML models in Scikit-Learn and measure their performance. Overall, Complement Naive Bayes and Logistic Regression performed best, with over 87% accuracy. The best results we achieved from each type of model are summarized below.

# | Model | Accuracy
1 | Complement Naive Bayes | 87.66%
2 | Logistic Regression | 87.03%
4 | Multinomial Naive Bayes | 84.49%
5 | Support Vector Classifier | 84.18%
6 | Random Forest | 81.96%
7 | Count Based | 76.90%
8 | Gaussian Naive Bayes | 67.09%
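
As a rough illustration of this baseline, a minimal sketch of the TF-IDF pipeline in Scikit-Learn might look like the following. The file name, column names and vectorizer settings are placeholders, not our exact configuration:

```python
# Minimal sketch of the TF-IDF + classical-model baseline described above.
# The CSV file and column names ("review_text", "label") are placeholders;
# our labeled dataset is not public.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

df = pd.read_csv("labeled_reviews.csv")          # hypothetical file
X, y = df["review_text"], df["label"]            # label: technical vs. service

models = {
    "Complement Naive Bayes": ComplementNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    # TF-IDF turns each review into a fixed-size sparse vector,
    # which the downstream classifier can consume directly.
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2), model)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.4f}")
```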

Hyperparameter selection is crucial:
Note, however, that these models involve many hyperparameter choices. To make sure we were giving each model a fair chance and not using a subpar set of hyperparameters, we conducted a relatively extensive grid search of the hyperparameter space. For Random Forests, using a 16-core 3950X processor, the grid search over 63,000 sets of hyperparameters took 6 hours (315,000 fits in total with 5-fold cross-validation). We found that Support Vector Classifiers and Multinomial Naive Bayes were more sensitive to hyperparameter choices for this task than the other models, and that overall most models performed similarly.
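
A hedged sketch of what such a search can look like with Scikit-Learn's GridSearchCV is shown below; the grid here is deliberately tiny and only illustrative, whereas the actual search covered roughly 63,000 combinations:

```python
# Illustrative grid search over TF-IDF + Random Forest hyperparameters.
# The parameter values below are examples, not the exact grid we searched.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=0)),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [1, 2, 5],
    "clf__n_estimators": [100, 300, 500],
    "clf__max_depth": [None, 20, 50],
}

# n_jobs=-1 uses all available cores; 5-fold CV mirrors the setup above.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X, y)   # X, y as in the previous snippet
print(search.best_params_, search.best_score_)
```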

Modern NLP Models Perform Better:
The next step was building a more modern NLP model. One model that has been successfully applied to NLP problems like ours is the Long Short-Term Memory (LSTM) model, an improved variant of recurrent neural networks (RNNs). Originally proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997, LSTMs became the state of the art in many NLP tasks in the early 2010s thanks to the availability of much cheaper computing resources and general architectural innovations in neural networks.

For our task, we used a 2-layer Bidirectional LSTM with 64 hidden units and LayerNorm after the LSTM layers. The classification layer was a single 128 → 1 linear map whose output passes through a sigmoid to produce the final prediction. The inputs were sequences of 512 tokens, each represented by a 768-dimensional BERT embedding; sequences with fewer than 512 tokens were padded and longer ones were truncated.
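
A minimal PyTorch sketch of this architecture is given below. The pooling of the LSTM outputs and the exact placement of the normalization layer are assumptions made for illustration, not necessarily our exact implementation:

```python
# Minimal sketch of the bidirectional-LSTM classifier described above.
# Pooling of the LSTM outputs and the normalization placement are assumptions.
import torch
import torch.nn as nn

class ReviewClassifier(nn.Module):
    def __init__(self, embed_dim=768, hidden=64, layers=2):
        super().__init__()
        # 2-layer bidirectional LSTM over the 768-dim BERT token embeddings.
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=layers,
                            batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(2 * hidden)     # normalize the 128-dim output
        self.head = nn.Linear(2 * hidden, 1)     # 128 -> 1 classification map

    def forward(self, x):                        # x: (batch, 512, 768)
        out, _ = self.lstm(x)
        pooled = self.norm(out[:, -1, :])        # last time step (assumption)
        # Sigmoid gives the probability of one class; the label mapping
        # (technical vs. service) is an assumption for this sketch.
        return torch.sigmoid(self.head(pooled))

model = ReviewClassifier()
dummy = torch.randn(4, 512, 768)                 # batch of padded BERT embeddings
print(model(dummy).shape)                        # torch.Size([4, 1])
```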

The architectural choices came from both prior experience and some hyperparameter search. First was the choice of embeddings. Due to the small size of the dataset, we suspected that training our own embeddings would not be a good idea, so we instead fed pretrained BERT embeddings into the model. These embeddings came from the HuggingFace implementation of BERT, which made generating them quite easy, albeit a bit slow. They worked well for this problem because they are not only trained on a large corpus of English text but are also subword-level and context-aware.
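
For reference, a sketch of how such token-level embeddings can be generated with the HuggingFace transformers library is shown below; the bert-base-uncased checkpoint is an assumption used for illustration:

```python
# Sketch of generating fixed-length BERT embeddings with HuggingFace
# transformers; the checkpoint name is an assumption, the 512/768 sizes
# mirror the setup described above.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

reviews = ["Good quality work on all levels of the electrical field."]

with torch.no_grad():
    # Pad or truncate every review to 512 subword tokens.
    enc = tokenizer(reviews, padding="max_length", truncation=True,
                    max_length=512, return_tensors="pt")
    # last_hidden_state: (batch, 512, 768) contextual token embeddings.
    embeddings = bert(**enc).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 512, 768])
```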

We fight overfitting from multiple angles:
Due to the size of the dataset (n=3186) and the relatively large size of the embedding vectors (768 dimensions), overfitting is a challenge in this problem. One initial observation was that even small networks overfit, and larger ones do not improve generalization. After trying networks with up to 1,024 hidden units and 40M parameters, we found LSTM layers with 64 hidden units to be the sweet spot. Deeper networks do not help validation performance either; two layers give the best results.

To further fight overfitting and improve accuracy, we also tried three different regularization layers: BatchNorm, LayerNorm and Dropout. Of the three, only BatchNorm after the LSTM layers and LayerNorm in between the layers increased the validation accuracy. Dropout after or in between the LSTM layers decreased both the training and validation accuracy. In addition, the training procedure used an L2-regularization factor of 0.01.
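
As an illustration of that last point, a hedged sketch of a training setup that applies this L2 factor as weight decay follows; the optimizer choice and learning rate are assumptions, not our exact training configuration:

```python
# Illustrative training setup using the L2-regularization factor of 0.01
# mentioned above, applied here as weight decay in Adam. Optimizer and
# learning rate are assumptions for this sketch.
import torch
import torch.nn as nn

model = ReviewClassifier()                         # from the sketch above
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)

def train_epoch(loader):
    """One pass over a DataLoader yielding (embeddings, labels) batches."""
    model.train()
    for embeddings, labels in loader:              # (batch, 512, 768), (batch,)
        optimizer.zero_grad()
        preds = model(embeddings).squeeze(1)       # (batch,) probabilities
        loss = criterion(preds, labels.float())
        loss.backward()
        optimizer.step()
```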

Transformer Models do not improve performance for our dataset:
Finally, we also tried the more cutting-edge Transformer models to see if they gave any improvement. Specifically, we tried multiple configurations of BERT, ALBERT and RoBERTa models from the HuggingFace library. In general, the transformers were a bit harder to work with, and our implementation did not perform any better than our best LSTM model. Further work is needed to evaluate the performance of transformers on this task more comprehensively.
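
For completeness, a sketch of how such a transformer can be fine-tuned for this binary classification task with the HuggingFace Trainer API follows; the checkpoint, hyperparameters and dataset variables are illustrative assumptions rather than the exact configurations we tried:

```python
# Sketch of fine-tuning a HuggingFace transformer for binary review
# classification. Checkpoint and hyperparameters are illustrative only.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base",
                                                           num_labels=2)

def tokenize(batch):
    # "text" is an assumed column name holding the raw review strings.
    return tokenizer(batch["text"], padding="max_length", truncation=True,
                     max_length=512)

# train_dataset / eval_dataset are assumed to be HuggingFace Datasets with
# "text" and "label" columns built from our labeled reviews (not public).
args = TrainingArguments(output_dir="review_clf", num_train_epochs=3,
                         per_device_train_batch_size=16, weight_decay=0.01)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset.map(tokenize, batched=True),
                  eval_dataset=eval_dataset.map(tokenize, batched=True))
trainer.train()
```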

Overall, by using Deep Learning models we were able to obtain very satisfactory performance on this task, which will in turn help us measure the impact of Occupational Licensing on the perceived quality of services. Stay tuned to learn more about our final results soon!

Disclaimer
An earlier version of this study was funded by the Small Business Administration (SBA). This discussion is based on analysis undertaken after the completion of the research project in conjunction with the SBA, with the purpose of extending the existing work.