Feeling Good about your Drugs?

About Us

During my undergraduate years I took many courses in Electrical Engineering, but as I progressed through my degree I realised my affinity for Computer Science and, in particular, Machine Learning. After graduating with a Master's degree in Electrical Engineering from Imperial College London, I came to UBC to pursue an MSc in Finance to learn more about how the two fields can mix. As a Data Scientist, and through this course, I have learnt the most important challenge facing data scientists: more often than not, they do not know what they are looking for. They know it’s ‘something’, but it is not precise or, more specifically, interpretable. I have always imagined doing Machine Learning, but I also believe in being able to speak in depth about what I report to the world. Hopefully someday I can pursue my dream of being a Financial Data Scientist; for now, here’s a pet project I did with my group partner on what people think about the medicinal drugs they use.


Trying to determine how people will rate ‘anything’ has been a fundamental problem in Data Science since its inception. Notable examples include how people rate movies or performances, which shows how prevalent the problem is in the entertainment industry. A more niche area for ratings is pharmaceuticals. Especially with the state of the world today, customer satisfaction with healthcare is given top priority. Data scientists use the reviews that people leave for medicines to try to predict how someone will rate a drug. Classifying sentiment from text data is at the core of Natural Language Processing (NLP), which is a very useful technique in Machine Learning. The full dataset is available on Kaggle. Here is a screenshot showing how, analytically, someone could infer the rating from a review:

Figure 1: Sample Review from WebMD

Components of our Project

Data Wrangling
We first identify any ratings that fall outside the range 1–5. There were 3 cases in total where the ratings for Effectiveness, Ease of Use and Satisfaction fell outside this range (2 ratings of ‘6’ and 1 rating of ‘10’). We also note that our data is heavily imbalanced with respect to gender: there is a 3:1 female-to-male ratio in our training data.
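The range check and the gender count described above could be sketched roughly as follows. This is a minimal illustration, not our actual notebook: the rows are made up, and the field names (`Effectiveness`, `EaseOfUse`, `Satisfaction`, `Sex`) are assumptions rather than the exact column names in the Kaggle file. It only separates out-of-range rows from the rest; whether to drop, clip or correct them is a separate decision.

```python
from collections import Counter

# Hypothetical rows for illustration; field names are assumed, not the real schema.
reviews = [
    {"Effectiveness": 5, "EaseOfUse": 4, "Satisfaction": 5, "Sex": "Female"},
    {"Effectiveness": 6, "EaseOfUse": 3, "Satisfaction": 2, "Sex": "Female"},
    {"Effectiveness": 3, "EaseOfUse": 10, "Satisfaction": 1, "Sex": "Male"},
    {"Effectiveness": 2, "EaseOfUse": 2, "Satisfaction": 3, "Sex": "Female"},
    {"Effectiveness": 4, "EaseOfUse": 5, "Satisfaction": 4, "Sex": "Male"},
]

RATING_FIELDS = ("Effectiveness", "EaseOfUse", "Satisfaction")


def in_range(row, low=1, high=5):
    """Return True if every rating field of the row lies within [low, high]."""
    return all(low <= row[field] <= high for field in RATING_FIELDS)


# Separate valid rows from rows with at least one out-of-range rating.
clean = [row for row in reviews if in_range(row)]
out_of_range = [row for row in reviews if not in_range(row)]

# Count reviews per gender among the valid rows to check for imbalance.
gender_counts = Counter(row["Sex"] for row in clean)

print(f"{len(out_of_range)} rows with out-of-range ratings")
print(dict(gender_counts))
```

On the real data, the same check would flag the three out-of-range cases mentioned above, and the gender counts would show the roughly 3:1 imbalance.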

Figure 2: Satisfaction over Age Bracket
Figure 3: Satisfaction over Gender
Figure 4: Number of Reviews over Age Brackets
Figure 5: Number of Reviews across Gender
Figure 6: Number of Reviews across the top 15 conditions
  1. Logistic Regression
  2. Random Forest Regressor
  3. LightGBM
Table 1: Performance across our models

What we learned

The Medical Review dataset taken from WebMD feels especially relevant in this day and age (Covid-19). The lessons we learnt in class were vital for making sense of thousands of reviews and for analyzing how much what you write matters. We also learnt that no matter how well you design your problem, there are always more creative approaches (using more features such as medical condition and age, or embedding words in a different way) that could yield higher predictive power and greater model accuracy. While we took a very small step in understanding customer satisfaction with medicines, we realize there is a long way to go before we can definitively conclude how people feel about their medicines.


We would like to thank Prof. Mike Gelbart and all the TAs for all their motivation and guidance throughout this course.


