Feeling Good about your Drugs?

A Machine Learning Study by Afraz Khan and Sukanya Nair

About Us

Sukanya
Machine Learning is a fast-growing field that keeps pulling other disciplines into its orbit, and I guess that's what brought me here as well. With a Master's in Economics and as a current graduate student in Finance, I have been fascinated by ML's capacity to help us make sense of huge datasets through both supervised and unsupervised learning. Pharmaceuticals seem to be the centre of hope in 2020, so why not use ML to understand the role of medical reviews and their impact on drug ratings!

Background

Figure 1: Sample Review from WebMD

Components of our Project

Exploratory Data Analysis (EDA)

Figure 2: Satisfaction over Age Bracket

Figure 2 shows how each age bracket rates drugs. A clear pattern emerges: the middle-aged group most often rates drugs either 1 or 5 (roughly one-third each), while ratings of 2, 3, and 4 together account for the remaining third.

Figure 3: Satisfaction over Gender

Figure 3 shows that our data contains noticeably more female reviews than male reviews. For both genders, ratings are heavily skewed toward 1 and 5 (the extremes of satisfaction). The remaining ratings also lean heavily toward female reviewers, which is expected given that one gender contributes far more reviews overall.

Figure 4: Number of Reviews over Age Brackets

Figure 4 shows that most reviews are written by people between the ages of 35 and 64. Strangely, there are also reviews from people under the age of 12. We expect reviews to skew toward middle-aged and older reviewers, as they tend to take more medication and are more likely to write a review when they are particularly satisfied or dissatisfied.

Figure 5: Number of Reviews across Gender

Figure 5 shows the frequency of reviews across genders. Females review more than their male counterparts, which is largely because our dataset contains more 'female' drugs than 'male' ones.

Figure 6: Number of Reviews across the top 15 conditions

Figure 6 visualizes the top 15 conditions and their corresponding review counts. Surprisingly, 'Other' is the most-reviewed condition, but we keep it in our model since it is an option available to reviewers. These reviews come from the US, where obesity is prevalent, so it is not surprising that High Blood Pressure is among the top three conditions. More notably, depression also appears in the top five, reflecting how much attention it has received recently.
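As a rough illustration, a chart like Figure 6 could be produced with pandas as sketched below (the file name and the `Condition` column name are assumptions about the WebMD dataset's schema):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the WebMD review dataset (file and column names are assumptions).
df = pd.read_csv("webmd.csv")

# Count reviews per condition and keep the 15 most frequent.
top_conditions = df["Condition"].value_counts().head(15)

# Horizontal bar chart of review counts for the top 15 conditions.
top_conditions.sort_values().plot(kind="barh")
plt.xlabel("Number of Reviews")
plt.ylabel("Condition")
plt.title("Number of Reviews across the Top 15 Conditions")
plt.tight_layout()
plt.show()
```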

Predictive Modelling
Since our target was a rating from 1–5, we redefined the problem as binary classification: ratings of 1–2 are labelled Negative and ratings of 4–5 Positive. We drop the neutral (3) reviews to help train a more interpretable model.
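A minimal sketch of this relabelling in pandas, assuming the rating lives in a `Satisfaction` column (column names are assumptions about the dataset):

```python
import pandas as pd

# "Satisfaction" is assumed to hold the 1-5 rating.
df = pd.read_csv("webmd.csv")

# Drop neutral reviews (rating == 3).
df = df[df["Satisfaction"] != 3]

# Map ratings 1-2 to Negative (0) and 4-5 to Positive (1).
df["Sentiment"] = (df["Satisfaction"] >= 4).astype(int)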

We used 3 models for our prediction:

  1. Logistic Regression
  2. Random Forest Classifier
  3. Light GBM

The idea behind using three different models is to identify which one performs best (yields the highest accuracy). Since we focus on the review text, our predictions depend heavily on word frequency, word sentiment (positive or negative), and magnitude.
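A rough sketch of how this comparison could be run with scikit-learn and LightGBM, assuming a bag-of-words feature matrix `X` and binary labels `y` like those constructed in the Feature Engineering section below:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score

# X: bag-of-words feature matrix, y: binary sentiment labels (see Feature Engineering below).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "LightGBM": LGBMClassifier(random_state=42),
}

# Fit each model and compare train vs. test accuracy to spot overfitting.
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
```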

By experimenting with multiple models, we identified the one that produced the best results. Logistic Regression came out on top: its coefficients give an easily interpretable contribution (positive or negative) of each word to the rating, and it achieved reasonably high accuracy (about 77%). Our Random Forest Classifier was too complex and overfit the training data (almost 100% accuracy) while performing no better than Logistic Regression on the test data; this is a common consequence of a complicated tree-based classifier, which tends to 'over-learn' relationships that are not really there. Our LGBM Classifier showed similar performance, but we opted for Logistic Regression since it is more interpretable.

Table 1: Performance across our models

We can clearly see this performance in the table above.
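To illustrate the interpretability point, here is a minimal sketch of how per-word contributions could be read off a fitted scikit-learn model (assuming the fitted `vectorizer` and `log_reg` objects from the sketches in this post):

```python
import pandas as pd

# Pair each word in the vocabulary with its logistic regression coefficient.
coefs = pd.Series(log_reg.coef_[0], index=vectorizer.get_feature_names_out())

# Words pushing a review toward a Positive prediction...
print(coefs.sort_values(ascending=False).head(10))
# ...and words pushing it toward Negative.
print(coefs.sort_values().head(10))
```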

Feature Engineering
For our Logistic Regression model, we needed to transform our dataset so that each review is represented by the frequency of the words it contains. This step is necessary for the model to learn from the text how satisfied customers are with these medicines.

We created our features by counting the number of times each word appears in a review (a bag-of-words representation), a very common NLP approach for turning text into numeric form. We kept only words that occur 1,000 times or more and removed commonly used words (such as "the", "a", "an", "in").
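A minimal sketch of this feature construction with scikit-learn, under the assumption that the free-text review column is named `Reviews` and that the 1,000-occurrence cut-off is approximated with `min_df`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words: count word occurrences, drop English stop words,
# and keep only words appearing in at least 1000 reviews
# (an approximation of the "1000 occurrences or more" rule).
vectorizer = CountVectorizer(stop_words="english", min_df=1000)

# df["Reviews"] is the free-text review column (column name is an assumption).
X = vectorizer.fit_transform(df["Reviews"])
y = df["Sentiment"]
```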

What we learned

Acknowledgements
