Saturday, December 14, 2019

Twitter data collection tutorial using Python

Over the past year, I’ve become more active on Twitter, and with the growing number of interactions, I needed to answer basic questions like:
So I thought this could be a fun programming exercise. But before we can perform any analysis, we need to collect the data.
In this tutorial, we’ll learn how to use Twitter’s API and some Python libraries to collect Twitter data. We will cover setting up the development environment, connecting to Twitter’s API, and collecting data.
For the “Just show me the code” folks, here’s the notebook:

Tools and Python libraries

Here’s the list of tools we’ll use: Google Colab, to run the notebook, and Google Drive, to store the data.
They’re free with a basic Google account and will help keep things simple.
As for Python libraries, here’s what we’ll need: tweepy (a Python wrapper around the Twitter API), json, csv, datetime, and time.
We’ll import all the libraries we need as follows:
# Import all needed libraries
import tweepy                   # Python wrapper around Twitter API
from google.colab import drive  # to mount Drive to Colab notebook
import json
import csv
from datetime import date
from datetime import datetime
import time

Connecting Google Drive to Colab

To connect Google Drive (where the data lives) to a Colab notebook (where the data is processed) run the following commands.
# Connect Google Drive to Colab
drive.mount('/content/gdrive')
# Create a variable to store the data path on your drive
# (the trailing slash matters because file names are appended to it)
path = './gdrive/My Drive/path/to/data/'
Executing the code block above will prompt you to follow a URL to authenticate your account and allow data to flow between Google Drive and Colab. Simply click through the prompts, and you’ll receive a message in your notebook when the drive is mounted successfully.

Authenticating to Twitter’s API

First, apply for a developer account to access the API. The Standard APIs are sufficient for this tutorial. They’re free, but have some limitations that we’ll learn to work around in this tutorial.
Once your developer account is set up, create an app that will make use of the API: click on your username in the top right corner to open the drop-down menu, and click “Apps” as shown below. Then select “Create an app” and fill out the form. For the purposes of this tutorial, use the URL of the Google Colab notebook as the URL of the app.
Select “Apps” from the top right corner once you log into your developer account
Now that you have created a developer account and an app, you should have a set of keys to connect to the Twitter API. Specifically, you’ll have an API key, an API secret key, an access token, and an access token secret.
These could be inserted directly into your code, or loaded from an external file to connect to the Twitter API, as shown below.
# Load Twitter API secrets from an external JSON file
with open(path + 'secrets.json') as f:
  secrets = json.load(f)
api_key = secrets['api_key']
api_secret_key = secrets['api_secret_key']
access_token = secrets['access_token']
access_token_secret = secrets['access_token_secret']
# Connect to Twitter API using the secrets
auth = tweepy.OAuthHandler(api_key, api_secret_key)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
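A missing or misspelled key in secrets.json only shows up as a KeyError at lookup time. As a minimal sketch, assuming the file uses the same four key names read above, a small loader can check them up front. The names load_secrets and REQUIRED_KEYS are hypothetical and introduced only for illustration.

```python
import json

# Key names assumed to match the ones read from secrets.json above
REQUIRED_KEYS = ("api_key", "api_secret_key", "access_token", "access_token_secret")

def load_secrets(file_path):
    """Load API secrets from a JSON file and fail early if any key is missing."""
    with open(file_path, encoding="utf-8") as f:
        secrets = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in secrets]
    if missing:
        raise KeyError("secrets file is missing: " + ", ".join(missing))
    return secrets
```

Calling load_secrets(path + 'secrets.json') would then return the same dictionary as before, or stop with a message naming the missing keys.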

Twitter Data Collection

Overview

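We’ll create functions to collect tweets, followers, the accounts we follow, and today’s follower and following counts.
Also, we’ll create two helper functions to make our job easier: one to save data to a JSON file, and one to handle the API rate limit.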

Helper Functions

Save JSON
# Helper function to save data into a JSON file
# file_name: the file name of the data on Google Drive
# file_content: the data you want to save
def save_json(file_name, file_content):
  with open(path + file_name, 'w', encoding='utf-8') as f:
    json.dump(file_content, f, ensure_ascii=False, indent=4)
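A quick round trip is enough to check that the data comes back intact. The sketch below restates save_json so it runs on its own, with path pointed at a temporary folder instead of Google Drive (an assumption made for illustration). load_json is a hypothetical counterpart that is not part of the original notebook, and the sample record is made up.

```python
import json
import os
import tempfile

# Stand-in for the Drive folder; note the trailing separator
path = tempfile.mkdtemp() + os.sep

def save_json(file_name, file_content):
    with open(path + file_name, 'w', encoding='utf-8') as f:
        json.dump(file_content, f, ensure_ascii=False, indent=4)

def load_json(file_name):
    # Hypothetical helper: read back what save_json wrote
    with open(path + file_name, encoding='utf-8') as f:
        return json.load(f)

# Made-up sample record with non-ASCII characters
sample = [{"screen_name": "example_user", "name": "Ünïcode Näme"}]
save_json('sample.json', sample)
restored = load_json('sample.json')
```

Because ensure_ascii=False is set, the accented characters are written to the file as-is rather than as \u escape sequences, which keeps the saved file readable.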
Rate Limit Handling
# Helper function to handle twitter API rate limit
def limit_handled(cursor, list_name):
  while True:
    try:
      yield cursor.next()
    # Stop cleanly once the cursor has no more pages
    except StopIteration:
      return
    # Catch Twitter API rate limit exception and wait for 15 minutes
    except tweepy.RateLimitError:
      print("\nData points in list = {}".format(len(list_name)))
      print('Hit Twitter API rate limit.')
      for i in range(3, 0, -1):
        print("Wait for {} mins.".format(i * 5))
        time.sleep(5 * 60)
    # Catch any other Twitter API exceptions
    except tweepy.error.TweepError:
      print('\nCaught TweepError exception')
To unpack this code, let’s start by defining what a cursor is. Here’s the introduction from Tweepy’s documentation:
We use pagination a lot in Twitter API development. Iterating through timelines, user lists, direct messages, etc. In order to perform pagination, we must supply a page/cursor parameter with each of our requests. The problem here is this requires a lot of boiler plate code just to manage the pagination loop. To help make pagination easier and require less code, Tweepy has the Cursor object.
My explanation is that a Cursor object is Tweepy’s way of managing and passing data that spans multiple pages, the same way the contents of your favorite book are distributed over multiple pages.
With that in mind, the function above first requests the next cursor (or page) of data. If the amount of data collected within the last 15 minutes exceeds the API limits, a tweepy.RateLimitError exception is raised, in which case the code will wait for 15 minutes. The last exception is meant to catch any other tweepy.error.TweepError that could come up during execution, like connection errors to the Twitter API.
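The wait-and-retry behavior is easier to see without calling the API. In the sketch below, FakeRateLimitError and FakeCursor are hypothetical stand-ins for tweepy.RateLimitError and a Tweepy page iterator, and the 15-minute wait is shortened to a fraction of a second. It is a simplified illustration of the pattern, not the function above.

```python
import time

class FakeRateLimitError(Exception):
    """Hypothetical stand-in for tweepy.RateLimitError."""

class FakeCursor:
    """Hypothetical page iterator that hits a rate limit once, before page 2."""
    def __init__(self, pages):
        self._pages = list(pages)
        self._index = 0
        self._limited = False

    def next(self):
        if self._index == 1 and not self._limited:
            self._limited = True
            raise FakeRateLimitError()
        if self._index >= len(self._pages):
            raise StopIteration
        page = self._pages[self._index]
        self._index += 1
        return page

def limit_handled_demo(cursor, wait_seconds=0.01):
    while True:
        try:
            page = cursor.next()
        except StopIteration:
            return                      # no more pages
        except FakeRateLimitError:
            time.sleep(wait_seconds)    # wait, then retry the same page
            continue
        yield page

collected = []
for page in limit_handled_demo(FakeCursor([["a", "b"], ["c"], ["d", "e"]])):
    collected += page
```

The rate limit interrupts the request for the second page, yet no page is lost or duplicated, because the cursor only advances after a successful request.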

Data Collection Functions

Tweets
We’ll reuse an implementation from GitHub with slight modifications:
# Helper function to get all tweets of a specified user
# NOTE: This method only allows access to the most recent 3200 tweets
# Source: https://gist.github.com/yanofsky/5436496
def get_all_tweets(screen_name):
  # initialize a list to hold all the Tweets
  alltweets = []
  # make initial request for most recent tweets
  # (200 is the maximum allowed count)
  new_tweets = api.user_timeline(screen_name=screen_name, count=200)
  # save most recent tweets
  alltweets.extend(new_tweets)
  # save the id of the oldest tweet less one to avoid duplication
  # (skipped when the user has no tweets, so the loop below never runs)
  if alltweets:
    oldest = alltweets[-1].id - 1
  # keep grabbing tweets until there are no tweets left
  while len(new_tweets) > 0:
    print("getting tweets before %s" % (oldest))
    # all subsequent requests use the max_id param to prevent duplicates
    new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)
    # save most recent tweets
    alltweets.extend(new_tweets)
    # update the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1
    print("...%s tweets downloaded so far" % (len(alltweets)))
  # transform the tweepy tweets into a 2D array that will populate the csv
  outtweets = [[tweet.id_str, tweet.created_at, tweet.text, tweet.favorite_count,
                tweet.in_reply_to_screen_name, tweet.retweeted] for tweet in alltweets]
  # write the csv (newline='' avoids blank rows on some platforms)
  with open(path + '%s_tweets.csv' % screen_name, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(["id","created_at","text","likes","in reply to","retweeted"])
    writer.writerows(outtweets)
The code block above is essentially made up of two parts: a while-loop that collects all tweets in a list, and commands that save the tweets in a CSV file.
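Before we explain what’s going on in the while-loop, let’s first understand two key methods that are used: api.user_timeline(), which returns the most recent tweets posted by the specified user (up to 200 per request), and extend(), which appends every item of one list to the end of another.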
Now, let’s break down what’s going on in the while-loop:
  1. There are three variables in play: alltweets is a list that stores all the collected tweets; new_tweets is a list that stores the latest batch, since we can only retrieve 200 tweets at a time; and oldest stores the ID of the oldest tweet retrieved so far minus one, so the next batch contains only tweets that come before it.
  2. The variables are initialized before the loop starts. Note that if the specified user doesn’t have any tweets, new_tweets will be empty, and the loop won’t execute.
  3. In each iteration, a new list of up to 200 tweets that were posted before oldest is retrieved and added to alltweets.
  4. The while-loop will keep iterating until no tweets are found before oldest or the limit of 3,200 tweets is reached.
Now, to write the tweet data into a CSV file, we first extract the information we care about from each tweet. This is done using a list comprehension that captures information like tweet ID, text, and number of likes in a new list called outtweets. Finally, we open a CSV file, write a header row with the column names of our table, and then write all the data in outtweets in the rows that follow.
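The max_id loop can be tried offline. In the sketch below, fake_user_timeline is a hypothetical stand-in for api.user_timeline that serves made-up tweet IDs newest first, and the page size is 3 instead of 200 so the paging is visible. Like the real endpoint, it treats max_id as inclusive, which is why the loop subtracts one.

```python
# Hypothetical data: ten tweet IDs, newest first, as a timeline would return them
ALL_IDS = list(range(110, 100, -1))   # 110, 109, ..., 101

def fake_user_timeline(count, max_id=None):
    """Stand-in for api.user_timeline: return up to `count` IDs <= max_id."""
    eligible = [i for i in ALL_IDS if max_id is None or i <= max_id]
    return eligible[:count]

def get_all_ids(count=3):
    all_ids = []
    new_ids = fake_user_timeline(count)
    while len(new_ids) > 0:
        all_ids.extend(new_ids)
        oldest = all_ids[-1] - 1          # one below the oldest ID seen so far
        new_ids = fake_user_timeline(count, max_id=oldest)
    return all_ids

ids = get_all_ids()
```

Without the "- 1", every request after the first would return the oldest tweet of the previous batch again, and the loop would never reach an empty batch.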
Followers
# Function to save follower objects in a JSON file.
def get_followers():

  # Create a list to store follower data
  followers_list = []
  # For-loop to iterate over tweepy cursors
  cursor = tweepy.Cursor(api.followers, count=200).pages()
  for i, page in enumerate(limit_handled(cursor, followers_list)):
    print("\r"+"Loading"+ i % 5 *".", end='')

    # Add latest batch of follower data to the list
    followers_list += page

  # Extract the follower information
  followers_list = [x._json for x in followers_list]
  # Save the data in a JSON file
  save_json('followers_data.json', followers_list)
As you can see, we use the helper functions we created above. In addition, tweepy.Cursor(api.followers, count=200).pages() creates a Cursor object that returns the follower data one page at a time, with up to 200 followers per page. We pass this cursor to our limit_handled function along with followers_list. Note that each retrieved User object carries an _api attribute and a _json attribute, and the latter holds the raw data, so we simply extract the data we care about using the list comprehension [x._json for x in followers_list].
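The _json extraction step can be illustrated without the API. The objects below are hypothetical stand-ins for Tweepy User objects, built with SimpleNamespace and made-up values; only the _api and _json attribute names come from the text above.

```python
import json
from types import SimpleNamespace

# Hypothetical stand-ins for Tweepy User objects: the raw payload sits in `_json`
followers_list = [
    SimpleNamespace(_api=None, _json={"screen_name": "alice", "followers_count": 10}),
    SimpleNamespace(_api=None, _json={"screen_name": "bob", "followers_count": 25}),
]

# Same list comprehension as in get_followers()
followers_data = [x._json for x in followers_list]

# Plain dictionaries serialize to JSON without any custom encoder
serialized = json.dumps(followers_data, ensure_ascii=False)
```

Passing the objects themselves to json.dumps would fail, since the json module does not know how to serialize arbitrary objects. Extracting the dictionaries first is what makes save_json work.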
Following
# Function to save friend objects in a JSON file.
def get_friends():

  # Create a list to store friends data
  friends_list = []
  # For-loop to iterate over tweepy cursors
  cursor = tweepy.Cursor(api.friends, count=200).pages()
  for i, page in enumerate(limit_handled(cursor, friends_list)):
    print("\r"+"Loading"+ i % 5 *".", end='')

    # Add latest batch of friend data to the list
    friends_list += page

  # Extract the friends information
  friends_list = [x._json for x in friends_list]
  # Save the data in a JSON file
  save_json('friends_data.json', friends_list)
You can see that this is exactly like our get_followers() function, except that we use api.friends to define our Cursor object, so we can retrieve the data of the users we’re following.
Today’s Stats
# Function to save daily follower and following counts in a JSON file
def todays_stats(dict_name):
  # Get my account information
  info = api.me()
  # Get follower and following counts
  followers_cnt = info.followers_count
  following_cnt = info.friends_count
  # Get today's date
  today = date.today()
  d = today.strftime("%b %d, %Y")
  # Save today's stats only if they haven't been collected before
  if d not in dict_name:
    dict_name[d] = {"followers":followers_cnt, "following":following_cnt}
    save_json("follower_history.json", dict_name)
  else:
    print('Today\'s stats already exist')
api.me() returns the authenticating user’s information, in this case, mine. From there, collecting the follower and following counts is straightforward. The date format I specified, %b %d, %Y, returns dates that look like Nov 11, 2019. There are many other formats to choose from.
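The once-per-day logic and the date format can be checked without the API. In this sketch, record_stats is a hypothetical variant of todays_stats that takes the date and the counts as arguments instead of reading them from Twitter; the counts are made up.

```python
from datetime import date

def record_stats(history, day, followers_cnt, following_cnt):
    """Add one entry per day; return True if a new entry was written."""
    key = day.strftime("%b %d, %Y")
    if key in history:
        return False
    history[key] = {"followers": followers_cnt, "following": following_cnt}
    return True

history = {}
first = record_stats(history, date(2019, 11, 11), 120, 80)
second = record_stats(history, date(2019, 11, 11), 125, 80)   # same day, ignored
```

After both calls, history holds a single entry under the key "Nov 11, 2019" with the counts from the first call, assuming the default English month names.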

Closing Thoughts

I hope you’ve enjoyed this tutorial, which covered Twitter data collection. Writing this post helped clarify my understanding of my own code. For example, I now understand Tweepy Cursor objects better. It reminded me of the quote:
“If you want to learn something, teach it”

I’m always looking for ways to improve my writing, so if you have any feedback or thoughts, please feel free to share. Thanks for reading!
