Twitter data collection tutorial using Python
Over the past year, I’ve become more active on Twitter, and with the growing number of interactions, I needed to answer basic questions like:
- Where are my followers from?
- How many likes do my tweets get on average?
- What’s the distribution of the accounts I am following?
So I thought this could be a fun programming exercise. But before performing any analysis, we need to collect the data.
In this tutorial, we’ll learn how to use Twitter’s API and some Python libraries to collect Twitter data. We will cover setting up the development environment, connecting to Twitter’s API, and collecting data.
For the “Just show me the code” folks, here’s the notebook:
Tools and Python libraries
Here’s the list of tools we’ll use:
- Google Colab for the development environment
- Google Drive to store the data
They’re free with a basic Google account and will help keep things simple.
As for Python libraries, here’s what we’ll need:
- tweepy for accessing the Twitter API using Python.
- google.colab to link Google Drive to the Colab notebook
- json for loading and saving json files
- csv for loading and saving csv files
- datetime for handling date data
- time for timing code execution
We’ll import all the libraries we need as follows:
# Import all needed libraries
import tweepy                   # Python wrapper around Twitter API
from google.colab import drive  # to mount Drive to Colab notebook
import json
import csv
from datetime import date
from datetime import datetime
import time
Connecting Google Drive to Colab
To connect Google Drive (where the data lives) to a Colab notebook (where the data is processed), run the following commands.
# Connect Google Drive to Colab
drive.mount('/content/gdrive')

# Create a variable to store the data path on your drive
# (keep the trailing slash so file names can be appended to it directly)
path = './gdrive/My Drive/path/to/data/'
Executing the code block above will prompt you to follow a URL to authenticate your account and allow data streaming between Google Drive and Colab. Simply click through the prompts, and you’ll receive a message in your notebook when the drive is mounted successfully.
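If you want to double-check that the mount worked, an optional quick check like the sketch below lists the contents of the data folder (it assumes the folder that path points to already exists on your Drive).

import os  # only needed for this quick check

# List the contents of the data folder to confirm Drive is mounted
print(os.listdir(path))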
Authenticating to Twitter’s API
First, apply for a developer account to access the API. The Standard APIs are sufficient for this tutorial. They’re free, but come with some limitations that we’ll learn to work around.
Once your developer account is set up, create an app that will make use of the API by clicking on your username in the top right corner to open the drop-down menu, and then clicking “Apps” as shown below. Then select “Create an app” and fill out the form. For the purposes of this tutorial, use the URL of the Google Colab notebook as the URL of the app.
Now that you have created a developer account and an app, you should have a set of keys to connect to the Twitter API. Specifically, you’ll have an
- API key
- API secret key
- Access token
- Access token secret
These could be inserted directly into your code, or loaded from an external file to connect to the Twitter API, as shown below.
# Load Twitter API secrets from an external JSON file
secrets = json.loads(open(path + 'secrets.json').read())
api_key = secrets['api_key']
api_secret_key = secrets['api_secret_key']
access_token = secrets['access_token']
access_token_secret = secrets['access_token_secret']

# Connect to Twitter API using the secrets
auth = tweepy.OAuthHandler(api_key, api_secret_key)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
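The structure of secrets.json isn’t shown above, but the keys the code reads imply a minimal layout. Here’s a one-off sketch (with placeholder values) that writes such a file; you can just as easily create it by hand with the same key names.

# One-off sketch to create secrets.json with placeholder values
# (the key names match what the loading code above expects)
secrets_template = {
    "api_key": "YOUR_API_KEY",
    "api_secret_key": "YOUR_API_SECRET_KEY",
    "access_token": "YOUR_ACCESS_TOKEN",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET"
}
with open(path + 'secrets.json', 'w') as f:
    json.dump(secrets_template, f, indent=4)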
Twitter Data Collection
Overview
We’ll create functions to collect:
- Tweets: this also includes retweets and replies, collected as Tweet objects.
- Followers: all follower information, collected as User objects.
- Following: information on all the accounts I’m following (a.k.a. friends), collected as User objects.
- Today’s Stats: the follower and following counts for that day.
Also, we’ll create two helper functions to make our job easier:
- Save JSON: to save the collected data in a json file on Google Drive
- Rate Limit Handling: to manage the Twitter API rate limits that come with the free version, mainly the number of API calls permitted in a 15-minute period.
Helper Functions
Save JSON
# Helper function to save data into a JSON file
# file_name: the file name of the data on Google Drive
# file_content: the data you want to save
def save_json(file_name, file_content):
    with open(path + file_name, 'w', encoding='utf-8') as f:
        json.dump(file_content, f, ensure_ascii=False, indent=4)
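As a quick, throwaway usage example (the file name and contents here are just placeholders):

# Example usage of save_json: writes example_data.json to the Drive data folder
save_json('example_data.json', {"collected_at": str(datetime.now()), "items": []})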
Rate Limit Handling
# Helper function to handle twitter API rate limit
def limit_handled(cursor, list_name):
    while True:
        try:
            yield cursor.next()
        # Catch Twitter API rate limit exception and wait for 15 minutes
        except tweepy.RateLimitError:
            print("\nData points in list = {}".format(len(list_name)))
            print('Hit Twitter API rate limit.')
            for i in range(3, 0, -1):
                print("Wait for {} mins.".format(i * 5))
                time.sleep(5 * 60)
        # Catch any other Twitter API exceptions
        except tweepy.error.TweepError:
            print('\nCaught TweepError exception')
To unpack this code, let’s start by defining what a cursor is. Here’s the introduction from Tweepy’s documentation:
We use pagination a lot in Twitter API development. Iterating through timelines, user lists, direct messages, etc. In order to perform pagination, we must supply a page/cursor parameter with each of our requests. The problem here is this requires a lot of boiler plate code just to manage the pagination loop. To help make pagination easier and require less code, Tweepy has the Cursor object.
My explanation is that a Cursor object is Tweepy’s way of managing and passing data that spans multiple pages, the same way the contents of your favorite book are distributed over multiple pages.
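If you want to see a Cursor in action on its own, here’s a minimal sketch (separate from the helper above) that pages through the authenticated account’s timeline; pages() yields one page of results per iteration, which is exactly what limit_handled consumes.

# Minimal Cursor example: fetch two pages of the authenticated user's tweets,
# up to 200 tweets per page
for page in tweepy.Cursor(api.user_timeline, count=200).pages(2):
    print("Retrieved a page with {} tweets".format(len(page)))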
With that in mind, the function above first requests the next cursor (or page) of data. If the amount of data collected within the last 15 minutes exceeds the API limits, a tweepy.RateLimitError exception is raised, in which case the code waits for 15 minutes before trying again. The last except block is meant to catch any other tweepy.error.TweepError that could come up during execution, like connection errors to the Twitter API.

Data Collection Functions
Tweets
We’ll reuse an implementation from GitHub with slight modifications:
# Helper function to get all tweets of a specified user
# NOTE: This method only allows access to the most recent 3200 tweets
# Source: https://gist.github.com/yanofsky/5436496
def get_all_tweets(screen_name):
    # initialize a list to hold all the Tweets
    alltweets = []
    # make initial request for most recent tweets
    # (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)
    # save most recent tweets
    alltweets.extend(new_tweets)
    # save the id of the oldest tweet less one to avoid duplication
    # (guard against users with no tweets at all)
    if alltweets:
        oldest = alltweets[-1].id - 1
    # keep grabbing tweets until there are no tweets left
    while len(new_tweets) > 0:
        print("getting tweets before %s" % (oldest))
        # all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)
        # save most recent tweets
        alltweets.extend(new_tweets)
        # update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1
        print("...%s tweets downloaded so far" % (len(alltweets)))
    ### END OF WHILE LOOP ###
    # transform the tweepy tweets into a 2D array that will populate the csv
    outtweets = [[tweet.id_str, tweet.created_at, tweet.text, tweet.favorite_count,
                  tweet.in_reply_to_screen_name, tweet.retweeted] for tweet in alltweets]
    # write the csv
    with open(path + '%s_tweets.csv' % screen_name, 'w') as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "text", "likes", "in reply to", "retweeted"])
        writer.writerows(outtweets)
The code block above is essentially made up of two parts: a while-loop to collect all tweets in a list, and commands to save the tweets in a csv file.
Before we explain what’s going on in the while-loop, let’s first understand the two key methods that are used:
- api.user_timeline([,count][,max_id]) returns the most recent tweets of the specified user. The count parameter specifies the number of tweets we care to retrieve at a time, 200 being the maximum. The max_id parameter tells the method to only return tweets with an ID less than (that is, older than) or equal to the specified ID.
- list.extend(iterable) adds all the items in iterable to the list, unlike append, which adds only a single element to the end of the list (see the quick illustration below).
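Here’s a tiny standalone snippet (not part of the tutorial code) showing the difference:

# extend adds each element of the batch; append would add the whole list as one element
batch = [1, 2, 3]
a = [0]
a.extend(batch)   # a is now [0, 1, 2, 3]
b = [0]
b.append(batch)   # b is now [0, [1, 2, 3]]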
Now, let’s break down what’s going on in the while-loop:
- There are three variables in play: alltweets is a list to store all the collected tweets, new_tweets is a list to store the latest batch of collected tweets since we can only retrieve 200 tweets at a time, and oldest stores the ID of the oldest tweet we have retrieved so far, so the next batch of retrieved tweets comes before it.
- The variables are initialized before the loop starts. Note that if the specified user doesn’t have any tweets, new_tweets will be empty, and the loop won’t execute.
- In each iteration, a new list of up to 200 tweets that were posted before oldest is retrieved and added to alltweets.
- The while-loop keeps iterating until no tweets are found before oldest or the limit of 3,200 tweets is reached.
Now, to write the tweet data into a csv file, we first extract the information we care about from each tweet. This is done using a list comprehension that captures information like the tweet ID, text, and number of likes into a new list called outtweets. Finally, we open a csv file, write a row with the header names of our table, and then write all the data in outtweets in the following rows.
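As a quick sanity check, and a first step toward the “average likes” question from the introduction, here’s a sketch that reads the csv back and computes the mean of the likes column. It assumes the file was written by get_all_tweets for a placeholder screen name, some_user, and that at least one tweet was collected.

# Sketch: read the tweets csv back and compute the average number of likes
with open(path + 'some_user_tweets.csv', 'r') as f:
    rows = list(csv.DictReader(f))

likes = [int(row['likes']) for row in rows]
print("Average likes per tweet: {:.2f}".format(sum(likes) / len(likes)))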
Followers
# Function to save follower objects in a JSON file
def get_followers():
    # Create a list to store follower data
    followers_list = []
    # For-loop to iterate over tweepy cursors
    cursor = tweepy.Cursor(api.followers, count=200).pages()
    for i, page in enumerate(limit_handled(cursor, followers_list)):
        print("\r" + "Loading" + i % 5 * ".", end='')
        # Add latest batch of follower data to the list
        followers_list += page
    # Extract the follower information
    followers_list = [x._json for x in followers_list]
    # Save the data in a JSON file
    save_json('followers_data.json', followers_list)
As you can see, we use the helper functions we created above. In addition, tweepy.Cursor(api.followers, count=200).pages() creates a Cursor object that will return the data of 200 followers one page at a time. We can now pass this cursor to our limit_handled function along with followers_list. Note that the retrieved User objects contain two keys, _api and _json, so we simply extract the data we care about using the list comprehension [x._json for x in followers_list].
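To connect this back to the “where are my followers from?” question, a short sketch like the one below loads the saved JSON and tallies each follower’s location field (location is free-form profile text on Twitter, so expect messy values).

# Sketch: load the saved follower data and count self-reported locations
from collections import Counter  # only needed for this example

with open(path + 'followers_data.json') as f:
    followers = json.load(f)

locations = Counter(u['location'] for u in followers if u.get('location'))
print(locations.most_common(10))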
Following
# Function to save friend objects in a JSON file
def get_friends():
    # Create a list to store friends data
    friends_list = []
    # For-loop to iterate over tweepy cursors
    cursor = tweepy.Cursor(api.friends, count=200).pages()
    for i, page in enumerate(limit_handled(cursor, friends_list)):
        print("\r" + "Loading" + i % 5 * ".", end='')
        # Add latest batch of friend data to the list
        friends_list += page
    # Extract the friends information
    friends_list = [x._json for x in friends_list]
    # Save the data in a JSON file
    save_json('friends_data.json', friends_list)
You can see that this is exactly like our get_followers() function, except that we use api.friends to define our Cursor object, so we can retrieve the data of the users we’re following.
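Similarly, once friends_data.json is saved, you can start exploring the “distribution of the accounts I’m following” question. For example, this small sketch lists the most-followed accounts among your friends, using the standard followers_count and screen_name fields of the saved User objects.

# Sketch: load the saved friend data and list the most-followed accounts
with open(path + 'friends_data.json') as f:
    friends = json.load(f)

top = sorted(friends, key=lambda u: u['followers_count'], reverse=True)[:10]
for u in top:
    print("{:<20} {:>12,} followers".format(u['screen_name'], u['followers_count']))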
Today’s Stats
# Function to save daily follower and following counts in a JSON file
def todays_stats(dict_name):
    # Get my account information
    info = api.me()
    # Get follower and following counts
    followers_cnt = info.followers_count
    following_cnt = info.friends_count
    # Get today's date
    today = date.today()
    d = today.strftime("%b %d, %Y")
    # Save today's stats only if they haven't been collected before
    if d not in dict_name:
        dict_name[d] = {"followers": followers_cnt, "following": following_cnt}
        save_json("follower_history.json", dict_name)
    else:
        print('Today\'s stats already exist')
api.me() returns the authenticating user’s information, in this case, me. From there, collecting the follower and following counts is straightforward. The date format I specified, %b %d, %Y, will return dates in a format that looks like Nov 11, 2019, for example. There are many formats to choose from.
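To tie everything together, here’s a rough sketch of how you might run all four collection functions in one go. The screen name is a placeholder for your own handle, and the follower history dictionary is loaded from the JSON file if it already exists.

# Sketch: run the whole collection pipeline
# ('your_screen_name' is a placeholder for your own Twitter handle)
get_all_tweets('your_screen_name')
get_followers()
get_friends()

# Load the existing follower history if it exists, otherwise start fresh
try:
    follower_history = json.loads(open(path + 'follower_history.json').read())
except FileNotFoundError:
    follower_history = {}
todays_stats(follower_history)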
Closing Thoughts

I hope that you’ve enjoyed this tutorial, which covered Twitter data collection. Writing this post was very helpful in clarifying my understanding of my own code. For example, I better understood tweepy Cursor objects. It reminded me of the quote:
“If you want to learn something, teach it”
I’m always looking for ways to improve my writing, so if you have any feedback or thoughts, please feel free to share. Thanks for reading!