Distributed Llama 2 on CPUs

A toy example of bulk inference on commodity hardware using Python, via llama.cpp and PySpark.


Why?

This exercise is about using Llama 2, an LLM (Large Language Model) from Meta AI, to summarize many documents at once. The scalable summarization of unstructured, semi-structured, and structured text can exist as a feature by itself, and also be part of data pipelines that feed into downstream machine learning models.

Specifically, we want to prove the simultaneous feasibility of:

  • Running Llama 2 on CPUs (i.e., removing GPU capacity constraints)
  • Smooth integration of an LLM with Apache Spark (a key part of Big Data ecosystems)
  • No usage of third-party endpoints (i.e., models must run locally due to air-gapped infrastructure or confidentiality requirements)

How?

A lot of the hard work has already been done for us!

The llama.cpp project makes it possible to run LLMs on CPUs by reducing the precision (“quantization”) of their numeric weights. Ready-to-use quantized model files are easily available for download.
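
As rough arithmetic for why this matters: 7 billion weights at 16-bit precision need about 14 GB of memory, while the 8-bit “q8_0” quantization used below needs roughly 7 GB, which fits comfortably in the RAM of a commodity VM.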

Next, the llama-cpp-python bindings provide simple access to llama.cpp from within Python.

Finally, Spark’s applyInPandas() enables splitting giant data sources into Pandas-sized chunks and processing them independently. Note that this approach can be an anti-pattern when vectorized Spark functions can accomplish the same result, but in our case we’re essentially using Spark as a simple orchestrator to scale out our llama.cpp usage. There are likely more efficient ways to use llama.cpp in batch processing, but this one is attractive given its simplicity and the automatic benefits of Spark’s fault tolerance and scalability. A minimal sketch of the pattern follows.
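
To make the mechanics concrete, here is a minimal, self-contained applyInPandas example with toy data and illustrative names (unrelated to the summarization pipeline): each group arrives as a Pandas DataFrame, and the function returns a Pandas DataFrame matching the declared schema.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pd.DataFrame({'id': [1, 1, 2],
                                         'value': [1.0, 2.0, 3.0]}))

# called once per group; receives and returns a *Pandas* DataFrame
def mean_per_group(pdf):
    return pd.DataFrame({'id': [pdf['id'].iloc[0]],
                         'mean': [pdf['value'].mean()]})

df.groupby('id').applyInPandas(mean_per_group, schema='id long, mean double').show()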

Plan

[Image: “War and Peace” by Leo Tolstoy (Creative Commons, CC BY-SA 3.0)]

As a fun test, we’ll use Llama 2 to summarize Leo Tolstoy’s War and Peace, a 1,200+ page novel with over 360 chapters, treating each chapter as a document. Note that Llama 2 already “knows” about the novel; asking it about a key character generates the following output (using llama-2-7b-chat.ggmlv3.q8_0.bin):

Prompt:
Briefly describe the character Anna Pavlovna from 'War and Peace'

Response:
Anna Pavlovna is a major character in Leo Tolstoy's novel "War and Peace".
She is a wealthy, intelligent, and charming socialite who is known for her beauty and wit.
Despite her beauty and charm, she is also portrayed as being manipulative and cunning, often using her intelligence and wit to get what she wants.
Throughout the novel, Anna Pavlovna is shown to be a complex and multifaceted character, with both admirable and flawed qualities.
She is particularly close to the main character Pierre Bezukhov, and plays an important role in his personal and emotional development.

Steps:

  1. Install the 7B quantized chat model and llama-cpp-python.
  2. Download the novel, split by chapter, create a Spark DataFrame.
  3. Partition by chapter and generate summaries.

Installation

Configuring a Spark cluster is outside our scope; I’ll assume you have Spark running locally, through a managed service (like Synapse or Elastic MapReduce), or via a custom deployment (like Spark on Kubernetes).
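
If you just want to follow along on a single machine, a minimal local session might look like the sketch below (assuming a pip-installed pyspark; later snippets assume a SparkSession named spark):

# minimal local SparkSession; later snippets assume a session named `spark`
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('local[*]')  # use all local cores
         .appName('llama2-summarize')
         .getOrCreate())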

There are two artifacts that need to be installed on all worker nodes, whether those nodes are physical machines, VMs, or pods in a serverless pool:

  • The Llama 2 model in GGML format (located in /models)
  • The llama-cpp-python module (installed via pip)

We’re using the 7B chat “Q8” version of Llama 2, available from the TheBloke/Llama-2-7B-Chat-GGML repository on Hugging Face. The download links might change, but a single-node, “bare metal” setup looks similar to this:

# download Llama 2 model
mkdir -p /models && cd /models
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin
# install llama.cpp Python bindings
pip install llama-cpp-python

Ensure you can load and use the model from python3; a quick smoke test is shown below. To recap, every Spark worker must be able to read the model from /models and import the llama-cpp-python module.
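
Something like the following (adapted from the llama-cpp-python README; the prompt is arbitrary) should run without errors on each node:

# smoke test: load the model and generate a few tokens
from llama_cpp import Llama

llm = Llama(model_path="/models/llama-2-7b-chat.ggmlv3.q8_0.bin")
output = llm("Q: Name the planets in the solar system. A: ", max_tokens=32)
print(output['choices'][0]['text'])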

Processing the Novel Text

The Bash commands below download the novel from Project Gutenberg and print line, word, and character counts.

# download "War and Peace" from Project Gutenberg
mkdir -p ~/data
curl "https://gutenberg.org/cache/epub/2600/pg2600.txt" -o ~/data/war_and_peace.txt
# print lines, words, characters
echo "$(cat ~/data/war_and_peace.txt | wc -l) lines"
echo "$(cat ~/data/war_and_peace.txt | wc -w) words"
echo "$(cat ~/data/war_and_peace.txt | wc -c) characters"

Next, we read the text file in Python, removing the Project Gutenberg header and footer. We’ll split on the regex CHAPTER .+ to create a list of chapter strings and create a Spark DataFrame from them (this code assumes a SparkSession named spark).

import os
import re
import pandas as pd

# read book, remove Project Gutenberg header/footer
# (note: open() does not expand '~', so we expand it explicitly)
text = open(os.path.expanduser('~/data/war_and_peace.txt'), 'r').read()
text = text.split('PROJECT GUTENBERG EBOOK WAR AND PEACE')[1]

# get list of chapter strings
chapter_list = [x for x in re.split('CHAPTER .+', text) if len(x) > 100]

# print stats
print('number of chapters = ' + str(len(chapter_list)))
print('max words per chapter = ' + str(max(len(c.split(' ')) for c in chapter_list)))

# create Spark DataFrame, show it
df = spark.createDataFrame(pd.DataFrame({'text': chapter_list,
                                         'chapter': range(1, len(chapter_list) + 1)}))
df.show(10, 60)

The code should produce the following output:

number of chapters = 365
max words per chapter = 3636

+------------------------------------------------------------+-------+
| text|chapter|
+------------------------------------------------------------+-------+
|\n\n“Well, Prince, so Genoa and Lucca are now just family...| 1|
|\n\nAnna Pávlovna’s drawing room was gradually filling. T...| 2|
|\n\nAnna Pávlovna’s reception was in full swing. The spin...| 3|
|\n\nJust then another visitor entered the drawing room: P...| 4|
|\n\n“And what do you think of this latest comedy, the cor...| 5|
|\n\nHaving thanked Anna Pávlovna for her charming soiree,...| 6|
|\n\nThe rustle of a woman’s dress was heard in the next r...| 7|
|\n\nThe friends were silent. Neither cared to begin talki...| 8|
|\n\nIt was past one o’clock when Pierre left his friend. ...| 9|
|\n\nPrince Vasíli kept the promise he had given to Prince...| 10|
+------------------------------------------------------------+-------+

Great! Now we have a DataFrame with 365 rows, each containing the full chapter text and number. The final step is creating a new DataFrame with summaries of each chapter.

Spark Processing

Below is the Python code for generating a single chapter summary (note the call to limit(1), which restricts the input to a single row); an explanation follows the snippet:

import pandas as pd

# this is the function applied per group by Spark;
# the df it receives is a *Pandas* DataFrame!
def llama2_summarize(df):
    # load the model (once per call; see the note below)
    from llama_cpp import Llama
    llm = Llama(model_path="/models/llama-2-7b-chat.ggmlv3.q8_0.bin",
                n_ctx=8192,
                n_batch=512)

    # prompt template for this model version, see:
    # https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML#prompt-template-llama-2-chat
    template = """
[INST] <<SYS>>
You are a helpful, respectful and honest assistant.
Always answer as helpfully as possible, while being safe.
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.
<</SYS>>
{INSERT_PROMPT_HERE} [/INST]
"""

    # create the prompt from the single row in this group
    chapter_text = df.iloc[0]['text']
    chapter_num = df.iloc[0]['chapter']
    prompt = 'Summarize the following novel chapter in a single sentence (less than 100 words):' + chapter_text
    prompt = template.replace('{INSERT_PROMPT_HERE}', prompt)

    # run inference and return a one-row Pandas DataFrame
    output = llm(prompt,
                 max_tokens=-1,
                 echo=False,
                 temperature=0.2,
                 top_p=0.1)
    return pd.DataFrame({'summary': [output['choices'][0]['text']],
                         'chapter': [int(chapter_num)]})

# create a summary for a single chapter via Spark
summaries = (df
             .limit(1)  # a single chapter, for testing
             .groupby('chapter')
             .applyInPandas(llama2_summarize, schema='summary string, chapter int'))
summaries.show(vertical=True, truncate=False)

The llama2_summarize() function is the code that Spark applies per group. Since we’re grouping by the chapter column, the function is called once per chapter; its df argument is simply a Pandas DataFrame with a single row. Note that the model is loaded on every call of llama2_summarize(); this is a shortcut we take for simplicity, but it’s not very efficient.
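
One straightforward mitigation, sketched below, is to cache the model in a module-level variable so that each Spark Python worker process loads it at most once. Note that get_llm() is a hypothetical helper (not used in the code above), and that this only pays off when a worker process handles multiple groups:

# sketch: cache the Llama instance per Python worker process
_LLM = None

def get_llm():
    global _LLM
    if _LLM is None:
        from llama_cpp import Llama
        _LLM = Llama(model_path="/models/llama-2-7b-chat.ggmlv3.q8_0.bin",
                     n_ctx=8192,
                     n_batch=512)
    return _LLM

llama2_summarize() would then call get_llm() instead of constructing a new Llama instance on every call.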

On the Spark side, we do the groupby() and call applyInPandas(), setting the schema to include the chapter summary and number.

The output (reformatted for readability) looks like this:

summary
The chapter is about a conversation between Prince Vasíli Kurágin and
Anna Pávlovna Schérer, a well-known socialite and favorite
of Empress Márya Fëdorovna.
They are discussing various political matters, including the possibility
of war with France and Austria's role in the conflict.
Prince Vasíli is hoping to secure a post for his son through
the Dowager Empress, while Anna Pávlovna is enthusiastic
about Russia's potential to save Europe from Napoleon's tyranny.
The conversation also touches on personal matters,
such as Prince Vasíli's dissatisfaction with his younger son
and Anna Pávlovna's suggestion that he marry off
his profligate son Anatole to a wealthy heiress.

chapter
1

(Note the use of “Napoleon” despite the fact that it doesn’t occur in the chapter! Again, this is a fun exercise rather than a realistic example using truly unseen documents.)

The runtime for this single-chapter test is about two minutes on a 64-core VM. Many choices we glossed over affect runtime, such as model size/quantization and model parameters. The key result is that by scaling out our Spark cluster appropriately, we can summarize all chapters in a handful of minutes. Processing hundreds of thousands (or even millions!) of documents daily is thus possible using large Spark clusters made up of cheap virtual machines.
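
Concretely, the full run just drops the limit(1) and persists the result instead of showing it; a sketch (the Parquet output path is illustrative):

# full run: summarize all chapters and write the results out
summaries = (df
             .groupby('chapter')
             .applyInPandas(llama2_summarize, schema='summary string, chapter int'))
summaries.write.mode('overwrite').parquet('/data/war_and_peace_summaries')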

Summary

We haven’t even mentioned adjusting the standard LLM parameters like temperature and top_p, which control the “creativity” and randomness of results, or prompt engineering, which is practically a discipline of its own. We also chose the Llama 2 7B model without justification; there might be smaller or more performant models, or other model families, better suited to our particular use case.

Instead, we’ve shown how to easily distribute (quantized) LLM workloads using Spark with fairly minimal effort. Next steps might include:

  • More efficient load/caching of models
  • Parameter optimization for different use cases
  • Custom prompts
