Monday, November 25, 2019

Extracting Patient Sentiment for Pharmaceutical Drugs from Twitter

A common problem that pharmaceutical companies face is predicting if a patient will switch pharmaceutical drugs based on their patient journey. Information indicative of patients seeking to switch prescription drugs may be present in social media posts. Analyzing patient sentiment as it is expressed in tweets about prescription drugs may be a step in the right direction for solving this problem.
In this post, we will extract patient sentiment for some of the most prescribed drugs in the US. We will be looking at 5 of the top 10 drugs prescribed in the US:
  1. Vicodin: A drug used to treat moderate to severe pain
  2. Simvastatin: An HMG-CoA reductase inhibitor (statin) used for reducing the risk of stroke and heart attack
  3. Lisinopril: An angiotensin converting enzyme (ACE) inhibitor used to reduce high blood pressure and prevent kidney failure caused by diabetes
  4. Lipitor: A statin used to prevent stroke and heart attack in people with coronary heart disease
  5. Metformin: A drug used for treating type 2 diabetes
To get started you need to apply for a Twitter developer account:
After your developer account has been approved you need to create a Twitter application:
The steps for applying for a Twitter developer account and creating a Twitter application are outlined here.
We will be using the free Python library tweepy to access the Twitter API. Documentation for tweepy can be found here.
First, make sure you have tweepy installed. Open up a command line and type:
pip install tweepy
Next, open up your favorite editor and import the tweepy and pandas libraries:
import tweepy
import pandas as pd
Next, we need our consumer key and access token:
Notice that the site suggests that you keep your key and token private! Here we define a fake key and token but you should use your real key and token upon creating the Twitter application as shown above:
consumer_key = '5GBi0dCerYpy2jJtkkU3UwqYtgJpRd' 
consumer_secret = 'Q88B4BDDAX0dCerYy2jJtkkU3UpwqY'
access_token = 'X0dCerYpwi0dCerYpwy2jJtkkU3U'
access_token_secret = 'kly2pwi0dCerYpjJtdCerYkkU3Um'
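Since the key and token should stay private, one option is to read them from environment variables instead of typing them into the script. The following is only a minimal sketch: the variable names (TWITTER_CONSUMER_KEY and so on) and the helper load_twitter_credentials are hypothetical and not part of tweepy or the Twitter developer site.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values can then be assigned to consumer_key, consumer_secret, access_token and access_token_secret in place of the hard-coded strings.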
The next step is creating an OAuthHandler instance. We pass in the consumer key and consumer secret, then set the access token and access token secret which we defined above:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
Next, we pass the OAuthHandler instance into the API method:
api = tweepy.API(auth)
Next, we initialize lists for the fields we are interested in analyzing. For now, we can look at the tweet strings, users, and the time of the tweet. We then write a for loop over a tweepy ‘Cursor’ object. Within the ‘Cursor’ object we pass the ‘api.search’ method (renamed ‘api.search_tweets’ in tweepy 4.0 and later), set the query string for what we would like to search for, and set ‘count’ = 1000. Note that the standard search API returns at most 100 tweets per request regardless of ‘count’, so it is the ‘items(1000)’ call that caps the total at 1,000 tweets and helps us stay within the Twitter rate limit. Here we will search for tweets about Vicodin, a painkiller used to treat moderate to severe pain. The ‘items()’ method also converts the ‘Cursor’ object into an iterable.
To simplify the results, we can drop retweets and keep only tweets in English. To get a sense of what this request returns, we can also print the values being appended to each list:
twitter_users = []
tweet_time = []
tweet_string = []
for tweet in tweepy.Cursor(api.search, q='Vicodin', count=1000).items(1000):
    # Skip retweets and keep only English-language tweets
    if (not tweet.retweeted) and ('RT @' not in tweet.text):
        if tweet.lang == "en":
            twitter_users.append(tweet.user.name)
            tweet_time.append(tweet.created_at)
            tweet_string.append(tweet.text)
            print([tweet.user.name, tweet.created_at, tweet.text])
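The search method has a different name depending on the installed tweepy version: ‘api.search’ in tweepy 3.x and ‘api.search_tweets’ in tweepy 4.0 and later. A small sketch of one way to handle both is shown here; the helper name get_search_method is our own and is not part of tweepy.

```python
def get_search_method(api):
    """Return the search method under whichever name this tweepy version uses."""
    # tweepy 4.x exposes API.search_tweets; tweepy 3.x exposes API.search.
    for name in ("search_tweets", "search"):
        method = getattr(api, name, None)
        if callable(method):
            return method
    raise AttributeError("api object has neither search_tweets nor search")
```

The result can be passed to the ‘Cursor’ in place of ‘api.search’, for example tweepy.Cursor(get_search_method(api), q='Vicodin', count=100).items(1000).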
We can also look at tweets for Simvastatin:
twitter_users = []
tweet_time = []
tweet_string = []
for tweet in tweepy.Cursor(api.search, q='Simvastatin', count=1000).items(1000):
    # Skip retweets and keep only English-language tweets
    if (not tweet.retweeted) and ('RT @' not in tweet.text):
        if tweet.lang == "en":
            twitter_users.append(tweet.user.name)
            tweet_time.append(tweet.created_at)
            tweet_string.append(tweet.text)
            print([tweet.user.name, tweet.created_at, tweet.text])
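The retweet and language checks can also be pulled into a small predicate, which makes it possible to try the filter without calling the Twitter API. This is only a sketch: is_original_english is a name we made up, and the sample objects are hypothetical stand-ins that carry just the three attributes the filter reads, not real tweepy objects.

```python
from types import SimpleNamespace


def is_original_english(tweet):
    """Mirror the filter used in the loops above: English and not a retweet."""
    return (not tweet.retweeted) and ('RT @' not in tweet.text) and tweet.lang == "en"


# Hypothetical stand-ins for tweets, for offline experimentation only.
sample_tweets = [
    SimpleNamespace(text="Lipitor has worked well for me", lang="en", retweeted=False),
    SimpleNamespace(text="RT @someone: Lipitor news", lang="en", retweeted=False),
    SimpleNamespace(text="Lipitor me ha ayudado", lang="es", retweeted=False),
]

kept = [t.text for t in sample_tweets if is_original_english(t)]
print(kept)
```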
For reusability we can wrap it all up in a function that takes the drug keyword as input. We can also store the results in a dataframe and return it:
def get_related_tweets(key_word):
    twitter_users = []
    tweet_time = []
    tweet_string = []
    for tweet in tweepy.Cursor(api.search, q=key_word, count=1000).items(1000):
        # Skip retweets and keep only English-language tweets
        if (not tweet.retweeted) and ('RT @' not in tweet.text):
            if tweet.lang == "en":
                twitter_users.append(tweet.user.name)
                tweet_time.append(tweet.created_at)
                tweet_string.append(tweet.text)
                print([tweet.user.name, tweet.created_at, tweet.text])
    # Collect the three lists into one dataframe, one row per tweet
    df = pd.DataFrame({'name': twitter_users, 'time': tweet_time, 'tweet': tweet_string})
    return df
When we call the function with the drug name ‘Lisinopril’, we get:
get_related_tweets('Lisinopril')
And for ‘Lipitor’:
get_related_tweets('Lipitor')
Finally for ‘Metformin’:
get_related_tweets('Metformin')
To get sentiment scores we need a Python package called textblob. The documentation for textblob can be found here. To install textblob, open a command line and type:
pip install textblob
Next import textblob:
from textblob import TextBlob
We will use the polarity score as our measure of positive or negative sentiment. The polarity score is a float with values from -1 to +1.
For example, if we define a textblob object and pass in the sentence “I love my health insurance plan with Aetna” we should get a polarity score with a positive value:
sentiment_score = TextBlob("I love my health insurance plan with Aetna").sentiment.polarity
print("Sentiment Polarity Score:", sentiment_score)
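To make the -1 to +1 scale concrete, here is a small sketch that turns a polarity score into a label. The function name label_polarity and the optional threshold are our own assumptions for illustration; textblob itself only returns the number.

```python
def label_polarity(score, threshold=0.0):
    """Map a polarity score in [-1, 1] to 'positive', 'negative' or 'neutral'."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    # Scores within the threshold band (exactly 0.0 by default) count as neutral
    return "neutral"
```

With the default threshold, any score above zero is labeled positive and any score below zero is labeled negative, which matches the cut-off used for the counts later in this post.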
Let’s get sentiment polarity scores for tweets about ‘Vicodin’:
df = get_related_tweets("Vicodin")
df['sentiment'] = df['tweet'].apply(lambda tweet: TextBlob(tweet).sentiment.polarity)
print(df.head())
We can also count the number of positive and negative sentiments:
df_pos = df[df['sentiment'] > 0.0]
df_neg = df[df['sentiment'] < 0.0]
print("Number of Positive Tweets", len(df_pos))
print("Number of Negative Tweets", len(df_neg))
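Note that tweets with a polarity of exactly 0.0 fall into neither group. As a rough sketch, the counts can be gathered in one place, including the neutral tweets and the share of positive tweets among the non-neutral ones. The function summarize_polarity and the scores passed to it are made up for illustration.

```python
from collections import Counter


def summarize_polarity(scores):
    """Count positive, negative and neutral scores and report the positive share."""
    counts = Counter(
        "positive" if s > 0.0 else "negative" if s < 0.0 else "neutral"
        for s in scores
    )
    decided = counts["positive"] + counts["negative"]
    return {
        "positive": counts["positive"],
        "negative": counts["negative"],
        "neutral": counts["neutral"],
        # Share of positive tweets among those with a non-zero polarity
        "positive_share": counts["positive"] / decided if decided else None,
    }


# Made-up polarity scores, for illustration only
print(summarize_polarity([0.5, -0.2, 0.0, 0.8, 0.0]))
```

With the dataframe from this post, the input would be df['sentiment'].tolist().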
Again, for code reuse we can wrap it all up in a function:
def get_sentiment(key_word):
    df = get_related_tweets(key_word)
    df['sentiment'] = df['tweet'].apply(lambda tweet: TextBlob(tweet).sentiment.polarity)
    df_pos = df[df['sentiment'] > 0.0]
    df_neg = df[df['sentiment'] < 0.0]
    print("Number of Positive Tweets about {}".format(key_word), len(df_pos))
    print("Number of Negative Tweets about {}".format(key_word), len(df_neg))
If we call this function with “Lipitor” we get:
get_sentiment("Lipitor")
It would be convenient if we could visualize these results programmatically. Let’s import seaborn and matplotlib and modify our get_sentiment function:
import seaborn as sns
import matplotlib.pyplot as plt

def get_sentiment(key_word):
    df = get_related_tweets(key_word)
    df['sentiment'] = df['tweet'].apply(lambda tweet: TextBlob(tweet).sentiment.polarity)
    df_pos = df[df['sentiment'] > 0.0]
    df_neg = df[df['sentiment'] < 0.0]
    print("Number of Positive Tweets about {}".format(key_word), len(df_pos))
    print("Number of Negative Tweets about {}".format(key_word), len(df_neg))
    # Draw a bar chart of the positive and negative tweet counts
    sns.set()
    labels = ['Positive', 'Negative']
    heights = [len(df_pos), len(df_neg)]
    plt.bar(labels, heights, color='navy')
    plt.title(key_word)
    plt.show()

get_sentiment("Lipitor")
And the results for the other four drugs are:
As you can see, Vicodin, Simvastatin, Metformin and Lipitor have more positive sentiment than negative, and Lisinopril has slightly more negative sentiment than positive. I encourage the reader to perform the same analysis on other drugs and see what the general sentiment for each drug is based on tweets. It would be interesting to collect a few years of data to see if there is any time dependence (seasonality) in the sentiment scores for certain drugs. Maybe I will save that for a future post!
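As a starting point for the time-dependence idea, here is a rough sketch that averages polarity scores by calendar month. It is not part of the analysis above: monthly_mean_sentiment is a hypothetical helper, and it assumes you have already collected (tweet time, polarity) pairs over a long enough period.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def monthly_mean_sentiment(records):
    """Map (year, month) to the mean polarity of the tweets posted in that month.

    records is an iterable of (datetime, polarity) pairs.
    """
    buckets = defaultdict(list)
    for created_at, polarity in records:
        buckets[(created_at.year, created_at.month)].append(polarity)
    return {month: mean(scores) for month, scores in sorted(buckets.items())}


# Made-up records, for illustration only
records = [
    (datetime(2019, 1, 5), 0.25),
    (datetime(2019, 1, 20), 0.75),
    (datetime(2019, 2, 3), -0.5),
    (datetime(2019, 2, 9), 0.5),
]
print(monthly_mean_sentiment(records))
```

With the dataframe from this post, the pairs could be built with zip(df['time'], df['sentiment']).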
Thank you for reading. The code from this post is available on GitHub. Good luck and Happy Machine Learning!

Towards Data Science
