Feeling Good about your Drugs?

A Machine Learning Study by Afraz Khan and Sukanya Nair

About Us

Afraz
During my undergraduate years I took a lot of courses in Electrical Engineering, but as I progressed through my degree I realised my affinity for Computer Science and, in particular, Machine Learning. After graduating with a Master’s degree in Electrical Engineering from Imperial College London, I came to UBC to pursue an MSc in Finance and learn more about how the two fields can mix. As a Data Scientist, and through this course, I have learnt the most important challenge facing data scientists: more often than not, they do not know what they are looking for. They know it’s ‘something’, but it is not precise or, more specifically, interpretable. I have always imagined doing Machine Learning, and I believe in being able to explain what I report to the world. Hopefully someday I can realise my dream of being a Financial Data Scientist. For now, here’s a pet project I did with my group partner on what people think about the medicinal drugs they use.

Sukanya
Machine Learning is a fast-growing industry that has been drawing other fields into its orbit. I guess that’s what brings me here as well. With a Master’s in Economics, and currently a graduate student in Finance, I have been fascinated by ML because of its capacity to help humans understand huge datasets through both supervised and unsupervised learning. Pharmaceuticals seem to be the centre of hope in 2020, so why not use ML to understand the role of medical reviews and their impact on ratings!

Background

Trying to determine how people will rate ‘anything’ has been a fundamental problem in Data Science since its inception. Notable examples include how people rate movies or performances, which shows how prevalent the problem is in the entertainment industry. A more niche area surrounding ratings is pharmaceuticals. Especially with the state of the world today, customer satisfaction with healthcare is given top priority. Data scientists take the reviews that people leave for these medicines and use them to try to predict how someone will rate a drug. Classifying sentiment from text data is a core task in Natural Language Processing (NLP), which is a very useful set of techniques in Machine Learning. The full dataset is available on Kaggle. Here is a screenshot of a sample review, showing the kind of text from which someone could infer the rating:

Figure 1: Sample Review from WebMD

Components of our Project

Data Wrangling
We first look for any ratings that fall outside the range 1–5. There were 3 such cases in total across Effectiveness, Ease of Use and Satisfaction (2 ratings of ‘6’ and 1 rating of ‘10’). We also note that our data is heavily imbalanced with respect to Gender: there is a 3:1 Female to Male ratio in our training data.
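The range check above can be sketched in a few lines. This is a minimal illustration in plain Python, not our actual notebook code: the rows are made up, the field names (`Effectiveness`, `EaseofUse`, `Satisfaction`, `Sex`) are assumed, and setting the flagged rows aside is just one possible way to handle them.

```python
from collections import Counter

# Made-up rows for illustration; the field names are assumptions.
reviews = [
    {"Effectiveness": 5, "EaseofUse": 4, "Satisfaction": 5, "Sex": "Female"},
    {"Effectiveness": 6, "EaseofUse": 5, "Satisfaction": 1, "Sex": "Female"},
    {"Effectiveness": 3, "EaseofUse": 10, "Satisfaction": 3, "Sex": "Male"},
    {"Effectiveness": 1, "EaseofUse": 2, "Satisfaction": 1, "Sex": "Female"},
]

RATING_FIELDS = ("Effectiveness", "EaseofUse", "Satisfaction")


def in_range(row, low=1, high=5):
    """Return True when every rating field of the row lies within [low, high]."""
    return all(low <= row[field] <= high for field in RATING_FIELDS)


# Split the rows into those that pass the check and those that are flagged.
clean = [row for row in reviews if in_range(row)]
flagged = [row for row in reviews if not in_range(row)]

# Count the remaining reviews per gender to check for imbalance.
gender_counts = Counter(row["Sex"] for row in clean)
```

On the toy rows, the reviews containing a ‘6’ and a ‘10’ end up in `flagged`, and `gender_counts` gives a quick view of the Female to Male split in what remains.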

Exploratory Data Analysis (EDA)

Figure 2: Satisfaction over Age Bracket

Figure 2 shows how each age bracket rates drugs. We see a clear pattern: the middle-aged group most often rates drugs either 1 or 5 (almost one-third each), while ratings of 2, 3 and 4 together make up the remaining third.

Figure 3: Satisfaction over Gender

Figure 3 clearly shows that we have more female reviews than male reviews in our data. Both male and female reviews are skewed heavily towards 1 and 5 (the extremes of satisfaction). For the remaining ratings, female counts are also much higher than male counts, which is expected given that one gender writes far more of the reviews.

Figure 4: Number of Reviews over Age Brackets

We can clearly see in Figure 4 that the most reviews are written by people aged 35–64. Strangely, there are reviews by people below the age of 12. We expect the reviews to be skewed towards the middle-aged and above, as they tend to take a lot of medication and to leave a review when they are satisfied or dissatisfied.

Figure 5: Number of Reviews across Gender

Figure 5 shows the frequency of reviews across Gender. Here we clearly see that Females review more than their Male counterparts, which we attribute to our dataset containing more ‘female’ drugs than ‘male’ ones.

Figure 6: Number of Reviews across the top 15 conditions

We visualize the top 15 Conditions and their corresponding frequencies in Figure 6. Surprisingly, ‘Other’ is the most reviewed condition, but we need to include it in our model since this option is available to the reviewer. These reviews are from the US, where obesity is a prevalent issue, so it is not surprising that High Blood Pressure is one of the top 3 conditions in our data. Depression is in the top 5 as well, which surprised us since it has only recently become a trending issue.

Predictive Modelling
Since our target was a rating from 1–5, we decided to redefine our problem as a binary one: ratings of 1–2 are classified as Negative and ratings of 4–5 as Positive. We drop our Neutral (3) reviews to help train a more interpretable model.
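As a rough sketch of this relabelling (the function name `to_sentiment` and the sample reviews are ours for illustration only), the mapping from a 1–5 rating to a binary label could look like this:

```python
def to_sentiment(rating):
    """Map a 1-5 rating to a binary label; Neutral (3) maps to None."""
    if rating in (1, 2):
        return "Negative"
    if rating in (4, 5):
        return "Positive"
    return None


# Made-up (review text, rating) pairs for illustration.
rated_reviews = [
    ("Did nothing for me", 1),
    ("Bad side effects", 2),
    ("It was okay", 3),
    ("Helped a lot", 4),
    ("Life changing", 5),
]

# Keep only reviews with a Negative or Positive label, dropping Neutral ones.
labelled = [
    (text, to_sentiment(rating))
    for text, rating in rated_reviews
    if to_sentiment(rating) is not None
]
```

The Neutral review is dropped, leaving two Negative and two Positive examples in `labelled`.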

We used 3 models for our prediction:

  1. Logistic Regression
  2. Random Forest Classifier
  3. LightGBM (LGBM) Classifier

The idea behind using 3 different models is to identify which one has the best performance (yields the highest accuracy). We focused on the Reviews, so our models are highly dependent on the frequency of words, the type of word (positive/negative) and its magnitude.

By experimenting with multiple models, we identified the one that gave the best results. Logistic Regression was the best model: it gave us an easily interpretable contribution (a positive or negative coefficient) of each word to the rating, and it had a reasonably high accuracy (about 77%). Our Random Forest Classifier was too complex and overfit our training data (almost 100% accuracy) while performing no better than Logistic Regression on our test data. This is a consequence of using a complicated decision-tree-based classifier, which tends to ‘over-learn’ relationships that aren’t there. Our LGBM Classifier exhibited similar performance, but we opted for Logistic Regression since it’s more interpretable.

Table 1: Performance across our models

We can clearly see this performance in the table above.
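To show what we mean by an interpretable, word-level contribution in Logistic Regression, here is a small sketch. The weights below are hand-picked for illustration and are not coefficients from our fitted model; the function name `positive_probability` is also just ours. Each word count is multiplied by its coefficient, the results are summed with the intercept, and the sigmoid turns that score into a probability of a Positive review.

```python
import math

# Hypothetical coefficients, NOT values from our fitted model.
weights = {"great": 1.2, "relief": 0.8, "pain": -0.5, "terrible": -1.5}
intercept = 0.0


def positive_probability(word_counts):
    """Return the probability of 'Positive' given a dict of word counts."""
    # Linear score: intercept plus coefficient times count for each word.
    score = intercept + sum(
        weights.get(word, 0.0) * count for word, count in word_counts.items()
    )
    # The sigmoid squashes the score into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-score))


p_good = positive_probability({"great": 1, "relief": 1})
p_bad = positive_probability({"terrible": 1, "pain": 2})
```

Words with positive coefficients push the probability above 0.5 and words with negative coefficients pull it below, which is why reading the sign and size of each coefficient tells us how a word relates to the rating.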

Feature Engineering
For our Logistic Regression model, we needed to transform our dataset to represent the frequency of words in each review. This is a necessary step in determining how satisfied customers are with these medicines.

We created our features by counting the number of times each word appeared in a Review. This is a very common approach in NLP (often called bag-of-words), where text is represented in numeric form. We chose to keep words that occur 1000 times or more, while eliminating commonly used words (such as “the”, “a”, “an”, “in”).
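A minimal sketch of this word-count featurization is below. It uses only the Python standard library and is not our actual pipeline: the helper names (`tokenize`, `build_vocabulary`, `count_features`) and the toy reviews are assumptions for illustration, the stop-word list holds only the four examples mentioned above, and the frequency threshold is lowered from 1000 to 2 so that it does something on three short reviews.

```python
import re
from collections import Counter

# Only the four example stop words from the text; a real list would be longer.
STOP_WORDS = {"the", "a", "an", "in"}


def tokenize(text):
    """Lowercase the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def build_vocabulary(reviews, min_count):
    """Keep words whose total count across all reviews is at least min_count."""
    totals = Counter(word for review in reviews for word in tokenize(review))
    return sorted(word for word, count in totals.items() if count >= min_count)


def count_features(review, vocabulary):
    """Represent one review as a list of counts, one per vocabulary word."""
    counts = Counter(tokenize(review))
    return [counts[word] for word in vocabulary]


# Made-up reviews for illustration.
reviews = [
    "The pain went away in a day",
    "Terrible headache and more pain",
    "No pain, no headache, great relief",
]

vocabulary = build_vocabulary(reviews, min_count=2)
features = [count_features(review, vocabulary) for review in reviews]
```

With the toy threshold, only “headache”, “no” and “pain” survive, and each review becomes a row of three counts that a model such as Logistic Regression can consume.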

What we learned

The Medical Review dataset taken from WebMD is very relevant in this day and age (Covid-19). The lessons we learnt in class were vital in making sense of thousands of reviews and analyzing how much what you write matters. We also learnt that no matter how well you design your problem, there are always more creative approaches (using more features such as Medical Condition and Age, or embedding words in a different way) that could yield higher predictive power and greater model accuracy. While we took a very small step in understanding customer satisfaction with medicines, we realize there is a long way to go before we can definitively conclude how people feel about their medicines.

Acknowledgements

We would like to thank Prof. Mike Gelbart and all the TAs for all their motivation and guidance throughout this course.