Zero-Shot Text Classification with Hugging Face


This post is about detecting text sentiment in an unsupervised way, using the Hugging Face zero-shot text classification model.


A few weeks ago I was implementing a POC with one of the requirements being the ability to detect text sentiment in an unsupervised way (without having training data in advance and building a model). More specifically, it was about data extraction: based on some predefined topics, my task was to automate information extraction from text data. While researching the best ways to solve this problem, I found out that Hugging Face supports zero-shot text classification.

What is zero-shot text classification? Check this post: Zero-Shot Learning in Modern NLP. There is a live demo from the Hugging Face team, along with a sample Colab notebook. In simple words, a zero-shot model allows us to classify data that wasn't used to build the model. What I mean here is that the model was built by someone else, and we are running it against our own data.

I thought it would be a useful example to fetch Twitter messages and run classification to group them into topics. This can be used as a starting point for more complex use cases.

I’m using the GetOldTweets3 library to scrape Twitter messages. Zero-shot classification with transformers is straightforward; I was following the Colab example provided by Hugging Face.

List of imports:

import GetOldTweets3 as got
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns

from transformers import pipeline

Getting classifier from transformers pipeline:

classifier = pipeline("zero-shot-classification")

I scrape the 500 latest messages from Twitter, based on a predefined query, “climate fight”. We are going to fetch messages related to the climate change fight into a Pandas data frame and then try to split them into topics using zero-shot classification:

txt = 'climate fight'
max_recs = 500

tweets_df = text_query_to_df(txt, max_recs)
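
The text_query_to_df helper isn't defined in this post; a minimal sketch of what it could look like with GetOldTweets3 (the column names here are my assumption):

def text_query_to_df(text_query, count):
    # Build search criteria for the given query and tweet count
    tweet_criteria = got.manager.TweetCriteria() \
        .setQuerySearch(text_query) \
        .setMaxTweets(count)

    # Fetch the tweets and keep date and text in a data frame
    tweets = got.manager.TweetManager.getTweets(tweet_criteria)
    return pd.DataFrame(
        [[tweet.date, tweet.text] for tweet in tweets],
        columns=['datetime', 'text'])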

In zero-shot classification, you can define your own labels and then run the classifier to assign a probability to each label. There is an option to do multi-class classification too; in that case, the scores are independent and each falls between 0 and 1. I’m going to use the default option, where the pipeline assumes that only one of the candidate labels is true and returns a list of scores, one per label, which add up to 1.
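
A quick illustration of what the classifier returns. The independent-scores call uses the multi_class flag from the transformers version current at the time of writing (newer releases rename it to multi_label), and the printed values are only examples:

example = "Wind and solar will power the grid of the future"
labels = ["renewable", "politics", "emission"]

# Default: labels compete, scores add up to 1
res = classifier(example, labels)
print(res['labels'])   # labels sorted by score, highest first
print(res['scores'])   # e.g. [0.85, 0.09, 0.06]

# Independent scores per label, each between 0 and 1
res_multi = classifier(example, labels, multi_class=True)
print(res_multi['scores'])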

Candidate labels for topics: these let us understand what people are actually talking about in the climate change fight. Some messages are simple adverts that we would like to ignore. Zero-shot classification is able to detect adverts pretty well, which helps to clean the data:

candidate_labels = ["renewable", "politics", "emission", "temperature", "emergency", "advertisement"]

I loop through the messages and classify each one:

res = classifier(sent, candidate_labels)

Then I check the classification result. It is enough to check the first label, as I’m using the default option where the pipeline assumes only one of the candidate labels is true. If the classification score is greater than 0.5, I log it for further processing:

if res['labels'][0] == 'renewable' and res['scores'][0] > 0.5:
    candidate_results[0] = candidate_results[0] + 1
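
Put together, the loop could look roughly like this (the text column name and the counting structure are my assumptions):

candidate_results = [0] * len(candidate_labels)

for sent in tqdm(tweets_df['text']):
    res = classifier(sent, candidate_labels)
    # With the default option only the top label matters
    if res['scores'][0] > 0.5:
        candidate_results[candidate_labels.index(res['labels'][0])] += 1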

From the result, we can see that the political topic dominates the climate change fight discussion, perhaps as expected. Topics related to emission and emergency are close to each other in popularity. There were around 20 adverts among the 500 scraped messages:

Chart: number of messages per topic (Author: Andrej Baranovskij)
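
The chart can be reproduced with the seaborn and matplotlib imports from the beginning; a minimal sketch, assuming the candidate_results counts from the loop above:

plt.figure(figsize=(10, 5))
sns.barplot(x=candidate_labels, y=candidate_results)
plt.title('Topics in "climate fight" tweets')
plt.ylabel('Messages with score > 0.5')
plt.show()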

Let’s see some examples for each topic.

  • renewable
Eco-friendly Hydrogen: The clean fuel of the future Germany is promoting the use of #eco-friendly hydrogen in the fight against climate change. Hydrogen can replace fossil fuels in virtually every situation, in an engine or fuel cell!
  • politics
This is so crazy and wrong. It’s as if the ACA isn’t better than what we had before, that the fight for voting rights doesn’t matter, or equal pay for women, or marriage equality, or the Paris climate agreement. Just because Biden isn’t what we want doesn’t mean Dems = GOP
  • emission
A simpler, more useful way to tax carbon to fight climate change - Vox
  • temperature
I've noticed any time someone tries to tell me global warming is not a big deal and how climate change has happened before, my body goes into fight or flight.
  • emergency
(+ the next few years are CRUCIAL in the fight against climate change. if we don't address it, we'll pass the point of IRREVERSIBLE damage. biden supports the green new deal. trump... well, ya know.)
  • advertisement
What is your favorite party game? Have a look on @ClumsyRush https://www.nintendo.com/games/detail/clumsy-rush-switch/ #party #game #NintendoSwitch

The classification results are very good; I think the Hugging Face zero-shot model does a really good job. The sample sentences above don't mention the topic label directly, and still they were classified correctly.

Conclusion

Unsupervised text classification with a zero-shot model allows us to solve text sentiment detection tasks when there is no training data to train a model. Instead, you rely on a large pretrained model from transformers. For specialized use cases, when the text is based on specific words or terms, it is better to go with a supervised classification model trained on your own data. But for general topics, the zero-shot model works amazingly well.

Source code
