Monday, August 28, 2023

LLM Fine Tuning Guide for Enterprises in 2023

 The widespread adoption of large language models (LLMs) has improved our ability to process human language (Figure 1). However, their generic training often results in suboptimal performance for specific tasks. To overcome this limitation, fine-tuning methods are employed to tailor LLMs to the unique requirements of different application areas.

Figure 1. Search volume for “large language model” over the last year


Source: Google Trends

This article explains the reasons, methods, and processes behind LLM fine-tuning, so that enterprises can refine these tools to better suit the intricacies and needs of their specific tasks.

What is a large language model (LLM)?

A large language model is an advanced artificial intelligence (AI) system designed to process, understand, and generate human-like text based on massive amounts of data. These models are typically built using deep learning techniques, such as neural networks, and are trained on extensive datasets that include text from a broad range of sources, such as books and websites, for natural language processing.

One of the key aspects of a large language model is its ability to understand context and generate coherent, relevant responses based on the input provided. The size of the model, in terms of the number of parameters and layers, allows it to capture intricate relationships and patterns within the text. This enables it to perform various tasks, such as: 

  • Question answering
  • Text generation
  • Text summarization
  • Translation
  • Creative writing

Prominent examples of large language models include OpenAI’s GPT (Generative Pre-trained Transformer) series, with GPT-3 and GPT-4 being the latest iterations. 

Foundation models, like large language models, are a core component of AI research and applications. They provide a basis for building more specialized, fine-tuned models for specific tasks or domains.

Figure 2. Foundation models


Source: Madrona News

What is LLM fine tuning?

Fine-tuning a large language model involves adjusting and adapting a pre-trained model to perform specific tasks or to cater to a particular domain more effectively. The process usually entails training the model further on a smaller, targeted dataset that is relevant to the desired task or subject matter.

The original large language model is pre-trained on vast amounts of diverse text data, which helps it to learn general language understanding, grammar, and context. Fine-tuning leverages this general knowledge and refines the model to achieve better performance and understanding in a specific domain.

Figure 3. Capabilities of an LLM after fine-tuning


Source: AssemblyAI

For example, a large language model might be fine-tuned for tasks like sentiment analysis in product reviews, predicting stock prices based on financial news, or identifying symptoms of diseases in medical texts. This process customizes the model’s behavior, allowing it to generate more accurate and contextually relevant outputs for the task at hand.

What are the methods used in the fine tuning process of LLMs?

Few-shot learning method

Few-shot learning (FSL) can be considered a meta-learning problem in which the model learns how to learn to solve new problems. In this approach, the model is provided with a very limited number of examples (i.e., “few shots”) from the new task and uses this information to adapt and perform well on that task.

Figure 4. Few-shot learning scenario where the model learns to classify a set of images from the tasks it was trained on

Meta learning framework. The algorithm is trained on several training tasks and tested on test tasks.

Source: Borealis AI

This is particularly useful when there’s not enough data available for traditional supervised learning. In the context of LLMs, fine-tuning with a small dataset related to the new task is an example of few-shot learning.

Few-shot learning is a scenario where an LLM is fine-tuned using a small amount of task-specific data, enabling it to perform better on that task with limited examples.
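In practice, the “few shots” can be as lightweight as a handful of labeled examples placed directly in the prompt, with no weight updates at all. Below is a minimal sketch of that prompt-based variant, assuming the pre-1.0 openai Python package (which reads OPENAI_API_KEY from the environment) and one of the base models from Figure 5; the reviews and labels are made-up placeholders.

# few-shot sketch: a few labeled examples in the prompt, then a new input to classify
import openai

few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day, I love it.
Sentiment: Positive

Review: It stopped working after a week.
Sentiment: Negative

Review: Setup was painless and support was helpful.
Sentiment:"""

response = openai.Completion.create(
    model="davinci",        # illustrative base model, not a recommendation
    prompt=few_shot_prompt,
    max_tokens=3,
    temperature=0,
)
print(response["choices"][0]["text"].strip())  # expected: Positive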

Fine-tuning methods

Fine-tuning is a process that involves adapting a pre-trained model to a specific task or domain by training it further on a smaller, task-specific dataset. There are several fine-tuning methods that can be used to adjust a pre-trained model’s weights and parameters to improve its performance on the target task:

  • Transfer Learning: Transfer learning is a fine-tuning method that involves reusing a pre-trained model’s weights and architecture for a new task or domain. The pre-trained model is usually trained on a large, general dataset, and the transfer learning approach allows for efficient and effective adaptation to specific tasks or domains.
  • Sequential Fine-tuning: Sequential fine-tuning is a method where a pre-trained model is fine-tuned on multiple related tasks or domains sequentially. This allows the model to learn more nuanced and complex language patterns across different tasks, leading to better generalization and performance.
  • Task-specific Fine-tuning: Task-specific fine-tuning is a method where the pre-trained model is fine-tuned on a specific task or domain using a task-specific dataset. This method requires more data and time than transfer learning but can result in higher performance on the specific task.
  • Multi-task Learning: Multi-task learning is a method where the pre-trained model is fine-tuned on multiple tasks simultaneously. This approach enables the model to learn and leverage the shared representations across different tasks, leading to better generalization and performance.
  • Adapter Training: Adapter training is a method that involves training lightweight modules that are plugged into the pre-trained model, allowing for fine-tuning on a specific task without affecting the original model’s performance on other tasks (a minimal sketch follows this list).

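To make adapter training concrete, here is a minimal sketch of one popular adapter approach, LoRA, using Hugging Face’s peft library; the base model name and hyperparameters are illustrative assumptions rather than recommendations.

# adapter-training sketch with LoRA: only the small adapter matrices are trained,
# while the base model's original weights stay frozen
# assumes `pip install transformers peft`
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "gpt2"  # placeholder pre-trained model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

lora_config = LoraConfig(
    r=8,               # rank of the adapter matrices
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
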
Differences between few-shot learning & fine-tuning

  • The primary difference between fine-tuning methods and few-shot learning is the amount of task-specific data required for the model to adapt to a new task or domain. Fine-tuning methods require a moderate amount of task-specific data to optimize the model’s performance, while few-shot learning methods can adapt models to new tasks or domains with only a few labeled examples.
  • Another key difference is that fine-tuning methods generally involve pre-trained models, while few-shot learning methods can be applied to models with or without pre-training. Fine-tuning methods typically provide a better starting point for adapting models to new tasks or domains, while few-shot learning methods are useful when training data is scarce or expensive to obtain.

What are some fine-tuning examples?

OpenAI’s base models are suitable for fine-tuning:

  • Davinci
  • Curie
  • Babbage
  • Ada

Figure 5. GPT-3 base models and their features


Source: OpenAI

Fine-tuning pricing differs by model and by the number of tokens used.

Figure 6. Pricing of OpenAI’s base models for fine tuning

Source: OpenAI

For example, Bloomberg has developed BloombergGPT, a large-scale language model tailored for the financial industry. This model focuses on financial natural language processing tasks such as sentiment analysis, named entity recognition, and news classification. 

BloombergGPT was trained on a combination of financial and general-purpose datasets, and it achieved high scores on benchmark tests (Figure 7).

Figure 7. How BloombergGPT performs across two broad categories of NLP tasks: finance-specific and general-purpose


Source: Bloomberg

Why or when does your business need a fine-tuned LLM?

Businesses may need fine-tuned large language models for several reasons, depending on their specific requirements, industry, and objectives. Here are some common reasons:

1- Customization

Businesses often have unique needs and goals that may not be addressed by a generic language model. Fine-tuning enables them to tailor the model’s behavior to suit their specific objectives, such as generating personalized marketing content or understanding user-generated content on their platform.

2- Data sensitivity and compliance

Businesses handling sensitive data or operating under strict regulatory environments might need to fine-tune the model to ensure it respects privacy requirements, adheres to content guidelines, and generates appropriate responses that comply with industry regulations.

3- Domain-specific language

Many industries use jargon, technical terms, and specialized vocabulary that may not be well-represented in the general training data of a large language model. Fine-tuning the model on domain-specific data allows it to understand and generate accurate responses within the context of the business’s industry.

4- Enhanced performance 

Fine-tuning improves the model’s performance on the specific tasks or applications relevant to the business, which can lead to better decision-making, higher efficiency, and improved outcomes.

5- Improved user experience

A fine-tuned model can offer a better user experience by generating more accurate, relevant, and context-aware responses, leading to increased customer satisfaction.

What are the steps involved in the fine tuning of an LLM?

1- Preparing the dataset

This step involves preparing the task-specific dataset for fine-tuning. This may include data cleaning, text normalization (e.g., stemming, tokenization), and converting the data into a format that is compatible with the LLM’s input requirements (i.e. data labeling). It is essential to ensure that the data is representative of the task and domain, and that it covers a range of scenarios that the model is expected to encounter in production.

OpenAI states that each doubling of the dataset size leads to a linear increase in model quality, so it is generally better to feed the language model with more data.1
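For OpenAI’s base models listed above, preparing the dataset largely means writing prompt-completion pairs as JSON lines. A minimal sketch with made-up examples and a placeholder file name:

# write task-specific examples as JSONL, one {"prompt": ..., "completion": ...} object per line
import json

examples = [
    {"prompt": "Review: The battery lasts all day.\nSentiment:", "completion": " positive"},
    {"prompt": "Review: It stopped working after a week.\nSentiment:", "completion": " negative"},
]

with open("train_data.jsonl", "w") as f:  # placeholder file name
    for example in examples:
        f.write(json.dumps(example) + "\n")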

2- Choosing a foundation model and a fine tuning method

Selecting the appropriate base model and fine-tuning method depends on the specific task and data available. There are various LLM architectures to choose from, including GPT-3, BERT, and RoBERTa, each with its own strengths and weaknesses. The fine-tuning method can also vary based on the task and data, such as transfer learning, sequential fine-tuning, or task-specific fine-tuning.

While choosing the base model, you should consider:

  • whether the model fits your specific task
  • input and output size of the model (a quick token-count check is sketched after this list)
  • your dataset size
  • whether your technical infrastructure can supply the computing power required for fine tuning

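For the input and output size consideration above, a quick practical check is to count the tokens of a representative document with the candidate model’s tokenizer and compare against its maximum input length. A sketch assuming the Hugging Face transformers library and an illustrative checkpoint:

# rough check that typical documents fit the model's input window
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
sample_document = "Replace this with a representative document from your dataset."
n_tokens = len(tokenizer(sample_document)["input_ids"])
print(n_tokens, "tokens vs. model maximum of", tokenizer.model_max_length)
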
3- Loading the pre-trained model

Once the LLM and fine-tuning method have been selected, the pre-trained model needs to be loaded into memory. This step initializes the model’s weights based on the pre-trained values, which speeds up the fine-tuning process and ensures that the model has already learned general language understanding.
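A minimal sketch of this step with the Hugging Face transformers library, assuming a BERT-style checkpoint and a binary classification task (both illustrative):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # illustrative pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
# encoder weights come from pre-training; only the new classification head starts from scratch
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)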

4- Fine-tuning

This step involves training the pre-trained LLM on the task-specific dataset. The training process involves optimizing the model’s weights and parameters to minimize the loss function and improve its performance on the task. The fine-tuning process may involve several rounds of training on the training set, validation on the validation set, and hyperparameter tuning to optimize the model’s performance.
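Continuing the sketch above with the transformers Trainer API; the hyperparameters are illustrative, and train_dataset / eval_dataset stand in for the tokenized splits prepared in step 1.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./finetuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    evaluation_strategy="epoch",   # run validation after every epoch
)

trainer = Trainer(
    model=model,                  # the pre-trained model loaded in the previous step
    args=training_args,
    train_dataset=train_dataset,  # placeholder: tokenized task-specific training split
    eval_dataset=eval_dataset,    # placeholder: tokenized validation split
)
trainer.train()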

5- Evaluating

Once the fine-tuning process is complete, the model’s performance needs to be evaluated on the test set. This step helps to ensure that the model is generalizing well to new data and is performing well on the specific task. Common metrics used for evaluation include accuracy, precision, recall, and F1 score.
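A sketch of computing these metrics with scikit-learn, where y_true holds the held-out test labels and y_pred the fine-tuned model’s predictions (placeholder values shown):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0]  # placeholder test-set labels
y_pred = [1, 0, 0, 1, 0]  # placeholder model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="weighted"))
print("recall   :", recall_score(y_true, y_pred, average="weighted"))
print("f1 score :", f1_score(y_true, y_pred, average="weighted"))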

6- Deploying

Once the fine-tuned model is evaluated, it can be deployed to production environments. The deployment process may involve integrating the model into a larger system, setting up the necessary infrastructure, and monitoring the model’s performance in real-world scenarios.
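One common deployment pattern is to wrap the fine-tuned model in a small web service. A minimal sketch with FastAPI and the transformers pipeline helper; the model path and endpoint name are placeholders:

from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()
# load the fine-tuned model once at startup (placeholder path from the training step)
classifier = pipeline("text-classification", model="./finetuned-model")

@app.post("/predict")
def predict(text: str):
    # returns e.g. [{"label": "POSITIVE", "score": 0.98}]
    return classifier(text)

# run with: uvicorn app:app --host 0.0.0.0 --port 8000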

For a detailed technical guide to creating fine-tuned models, we recommend checking OpenAI’s fine-tuning guide.

If you have other questions, we would love to help; please book a call. We are currently offering this service for businesses based in the US or EU.









Thursday, August 24, 2023

Distributed Llama 2 on CPUs

 

A toy example of bulk inference on commodity hardware using Python, via llama.cpp and PySpark.


Why?

This exercise is about using Llama 2, an LLM (Large Language Model) from Meta AI, to summarize many documents at once. The scalable summarization of unstructured, semi-structured, and structured text can exist as a feature by itself, and also be part of data pipelines that feed into downstream machine learning models.

Specifically, we want to prove the simultaneous feasibility of:

  • Running Llama 2 on CPUs (i.e., removing GPU capacity constraints)
  • Smooth integration of an LLM with Apache Spark (a key part of Big Data ecosystems)
  • No usage of third-party endpoints (i.e., models must run locally due to air-gapped infrastructure or confidentiality requirements)

How?

A lot of the hard work has already been done for us!

The llama.cpp project enables running simplified LLMs on CPUs by reducing the resolution (“quantization”) of their numeric weights. These ready-to-use model files are easily available.

Next, the llama-cpp-python bindings provide simple access to using llama.cpp from within Python.

Finally, Spark’s applyInPandas() (docs) enables splitting giant data sources into Pandas-sized chunks and processing them independently. Note that this approach can be an anti-pattern if vectorized Spark functions can accomplish the same result, but in our case, we’re basically using Spark as a simple orchestrator to scale out our llama.cpp usage. There are likely more efficient ways to use llama.cpp in batch processing, but this one is attractive given its simplicity and the automatic benefits of Spark’s fault tolerance and scalability.

Plan

A picture of “War and Peace”, a giant novel by Leo Tolstoy
Creative Commons License (CC BY-SA 3.0)

As a fun test, we’ll be using Llama 2 to summarize Leo Tolstoy’s War and Peace, a 1200+ page novel with over 360 chapters. We’ll treat each chapter as a document. Note that Llama 2 already “knows” about the novel; asking it about a key character generates this output (using llama-2-7b-chat.ggmlv3.q8_0.bin):

Prompt:
Briefly describe the character Anna Pavlovna from 'War and Peace'

Response:
Anna Pavlovna is a major character in Leo Tolstoy's novel "War and Peace".
She is a wealthy, intelligent, and charming socialite who is known for her beauty and wit.
Despite her beauty and charm, she is also portrayed as being manipulative and cunning, often using her intelligence and wit to get what she wants.
Throughout the novel, Anna Pavlovna is shown to be a complex and multifaceted character, with both admirable and flawed qualities.
She is particularly close to the main character Pierre Bezukhov, and plays an important role in his personal and emotional development.

Steps:

  1. Install the 7B quantized chat model and llama-cpp-python.
  2. Download the novel, split by chapter, create a Spark DataFrame.
  3. Partition by chapter and generate summaries.

Installation

Configuring a Spark cluster is outside our scope; I’ll assume you have Spark running locally, through a managed service (like Synapse or Elastic Map Reduce), or a custom deployment like Kubernetes.

There are two artifacts that need to be installed on all worker nodes, whether those nodes are physical machines, VMs, or pods in a serverless pool:

  • Llama 2 model in GGML format (located in /models)
  • The llama-cpp-python module (installed via pip)

We’re using the 7B chat “Q8” version of Llama 2, found here. The download links might change, but a single-node, “bare metal” setup is similar to below:

# download Llama 2 model
mkdir -p /models && cd /models
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin
# install llama.cpp Python bindings
pip install llama-cpp-python

Ensure you can use the model via python3 and this example. To recap, every Spark context must be able to read the model from /models and access the llama-cpp-python module.
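A quick sanity check on a node might look like the following (same model path and calling convention as the Spark code later on):

# verify the model file and Python bindings work before involving Spark
from llama_cpp import Llama

llm = Llama(model_path="/models/llama-2-7b-chat.ggmlv3.q8_0.bin")
output = llm("Q: Who wrote 'War and Peace'? A:", max_tokens=16)
print(output["choices"][0]["text"])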

Processing the Novel Text

The Bash commands below download the novel and print word counts.

# download "War and Peace" from Project Gutenberg
mkdir -p ~/data
curl "https://gutenberg.org/cache/epub/2600/pg2600.txt" -o ~/data/war_and_peace.txt
# print lines, words, characters
echo "$(cat ~/data/war_and_peace.txt | wc -l) lines"
echo "$(cat ~/data/war_and_peace.txt | wc -w) words"
echo "$(cat ~/data/war_and_peace.txt | wc -c) characters"

Next, we read the text file in Python, removing the Project Gutenberg header and footer. We’ll split on the regex CHAPTER .+ to create a list of chapter strings and create a Spark DataFrame from them (this code assumes a SparkSession named spark).

import re
import pandas as pd
from os.path import expanduser

# read book, remove Project Gutenberg header/footer
text = open(expanduser('~/data/war_and_peace.txt'), 'r').read()
text = text.split('PROJECT GUTENBERG EBOOK WAR AND PEACE')[1]

# get list of chapter strings
chapter_list = [x for x in re.split('CHAPTER .+', text) if len(x) > 100]

# print stats
print('number of chapters = ' + str(len(chapter_list)))
print('max words per chapter = ' + str(max([len(c.split(' ')) for c in chapter_list])))

# create Spark dataframe, show it
df = spark.createDataFrame(pd.DataFrame({'text': chapter_list,
                                         'chapter': range(1, len(chapter_list) + 1)}))
df.show(10, 60)

The code should produce the following output:

number of chapters = 365
max words per chapter = 3636

+------------------------------------------------------------+-------+
| text|chapter|
+------------------------------------------------------------+-------+
|\n\n“Well, Prince, so Genoa and Lucca are now just family...| 1|
|\n\nAnna Pávlovna’s drawing room was gradually filling. T...| 2|
|\n\nAnna Pávlovna’s reception was in full swing. The spin...| 3|
|\n\nJust then another visitor entered the drawing room: P...| 4|
|\n\n“And what do you think of this latest comedy, the cor...| 5|
|\n\nHaving thanked Anna Pávlovna for her charming soiree,...| 6|
|\n\nThe rustle of a woman’s dress was heard in the next r...| 7|
|\n\nThe friends were silent. Neither cared to begin talki...| 8|
|\n\nIt was past one o’clock when Pierre left his friend. ...| 9|
|\n\nPrince Vasíli kept the promise he had given to Prince...| 10|
+------------------------------------------------------------+-------+

Great! Now we have a DataFrame with 365 rows, each containing the full chapter text and number. The final step is creating a new DataFrame with summaries of each chapter.

Spark Processing

Below is the Python code for generating a single chapter summary (see the call to limit(1) to return a single row). Explanation below the snippet:

# this is the function applied per-group by Spark
# the df passed is a *Pandas* dataframe!
import pandas as pd

def llama2_summarize(df):
    # load the model (reloaded on every call; see the note below the snippet)
    from llama_cpp import Llama
    llm = Llama(model_path="/models/llama-2-7b-chat.ggmlv3.q8_0.bin",
                n_ctx=8192,
                n_batch=512)

    # prompt template for this model version, see:
    # https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML#prompt-template-llama-2-chat
    template = """
[INST] <<SYS>>
You are a helpful, respectful and honest assistant.
Always answer as helpfully as possible, while being safe.
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.
<</SYS>>
{INSERT_PROMPT_HERE} [/INST]
"""

    # create the prompt from the single row in this group
    chapter_text = df.iloc[0]['text']
    chapter_num = df.iloc[0]['chapter']
    prompt = 'Summarize the following novel chapter in a single sentence (less than 100 words):' + chapter_text
    prompt = template.replace('INSERT_PROMPT_HERE', prompt)

    # run inference
    output = llm(prompt,
                 max_tokens=-1,
                 echo=False,
                 temperature=0.2,
                 top_p=0.1)

    return pd.DataFrame({'summary': [output['choices'][0]['text']],
                         'chapter': [int(chapter_num)]})

# create summaries via Spark
summaries = (df
             .limit(1)
             .groupby('chapter')
             .applyInPandas(llama2_summarize, schema='summary string, chapter int')
             .show(vertical=True, truncate=False)
             )

The llama2_summarize() function is the code that is applied per-group by Spark. Since we’re grouping by the chapter column, the function is called on each chapter row; the df argument is simply a Pandas DataFrame with a single row. Note that we’re reading the model for every call of llama2_summarize(); this is a shortcut we’re taking for simplicity, but not very efficient.
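One way to avoid that repeated load, sketched below, is to cache the Llama object per Python worker process with a lazy module-level initializer (this is not what was timed in this post):

# cache the model per Spark Python worker instead of reloading it for every chapter
_LLM = None

def get_llm():
    global _LLM
    if _LLM is None:
        from llama_cpp import Llama
        _LLM = Llama(model_path="/models/llama-2-7b-chat.ggmlv3.q8_0.bin",
                     n_ctx=8192, n_batch=512)
    return _LLM

# inside llama2_summarize(), the Llama(...) construction would then become:
#     llm = get_llm()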

Finally, using Spark we do the groupby() and call applyInPandas(), setting the schema to include the chapter summary and number.

The output (reformatted for readability) looks like this:

summary
The chapter is about a conversation between Prince Vasíli Kurágin and
Anna Pávlovna Schérer, a well-known socialite and favorite
of Empress Márya Fëdorovna.
They are discussing various political matters, including the possibility
of war with France and Austria's role in the conflict.
Prince Vasíli is hoping to secure a post for his son through
the Dowager Empress, while Anna Pávlovna is enthusiastic
about Russia's potential to save Europe from Napoleon's tyranny.
The conversation also touches on personal matters,
such as Prince Vasíli's dissatisfaction with his younger son
and Anna Pávlovna's suggestion that he marry off
his profligate son Anatole to a wealthy heiress.

chapter
1

(Note the use of Napoleon despite the fact it doesn’t occur in the chapter! Again, this is a fun exercise rather than a realistic example using truly unseen documents.)

The runtime for this single chapter test is about 2 minutes on a 64-core VM. There are many choices we glossed over that affect runtime, such as model size/quantization and model parameters. The key result is that by scaling out our Spark cluster appropriately, we can summarize all chapters in a handful of minutes. Processing hundreds of thousands (or even millions!) of documents daily is thus possible using large Spark clusters comprised of cheap virtual machines.

Summary

We haven’t even mentioned adjusting the standard LLM parameters like temperature and top_p which control the “creativity” and randomness of results, or prompt engineering, which is practically a discipline of its own. We also chose the Llama 2 7B model without justification; there might be smaller and more performant models or model families more suited to our particular use case.

Instead, we’ve shown how to easily distribute (quantized) LLM workloads using Spark with fairly minimal effort. Next steps might include:

  • More efficient load/caching of models
  • Parameter optimization for different use cases
  • Custom prompts
