A simple guide to fine-tuning Llama 2 on your own data
In this guide, I show you how it's actually quite easy to fine-tune Llama 2 on your own dataset using HuggingFace's libraries!
All you need is your data in this format:
{"text": "text-for-model-to-predict"}
and HuggingFace's dataloader will handle the rest.
This is the equivalent notebook.
Requirements
To start, use a machine with reasonably recent versions of Python and CUDA (I used Python 3.10 and CUDA 11.7).
On the GPU side, I recommend at least 24GB of VRAM: an A100, A10, A10G, etc. If you want to fine-tune the larger 13b and 70b models, I'd go straight for the A100 :)
Brev sorts all this out for you.
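If you're setting up your own machine instead, you can check your GPU and driver/CUDA version with:
nvidia-smi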
Setup
First step is always to install those pesky dependencies:
pip install -q huggingface_hub
pip install -q -U trl transformers accelerate peft
pip install -q -U datasets bitsandbytes einops wandb
pip install -q ipywidgets
pip install -q scipy
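Optionally, sanity-check that PyTorch can see your GPU before going any further:
import torch
print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # e.g. an A100 or A10G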
And import them:
from huggingface_hub import notebook_login
notebook_login()
from datasets import load_dataset
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer
Dataset
Preparing data
Store your dataset as a .jsonl file in the following format:
{"text": "text-for-model-to-predict"}
{"text": "text-for-model-to-predict-1"}
{"text": "text-for-model-to-predict-2"}
(The "text" feature is extracted later by the training code.)
You should structure each "text-for-model-to-predict" in such a way that the model has an idea of the task it is doing. I formatted my dataset of notes like this:
A note has the following:
Title: *some-title*
Labels: *some-label*
Content: *some-content*
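If it helps, here's a minimal sketch of how you might serialize your data into that .jsonl format (notes here is a hypothetical list of dicts; adapt it to however your data is actually stored):
import json

# notes is assumed to be a list of dicts like
# {"title": "...", "labels": "...", "content": "..."}
with open("notes.jsonl", "w") as f:
    for note in notes:
        text = (
            "A note has the following:\n"
            f"Title: {note['title']}\n"
            f"Labels: {note['labels']}\n"
            f"Content: {note['content']}"
        )
        f.write(json.dumps({"text": text}) + "\n")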
Loading data
Load your train and (optionally) evaluation datasets like this:
from datasets import load_dataset
train_dataset = load_dataset('json', data_files='notes.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='notes_validation.jsonl', split='train')
(If you don't want to use an evaluation dataset, just comment out the eval_dataset line above and its other appearances in the SFTTrainer setup below.)
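Before training, it's worth printing a sample to check that the data loaded as expected:
print(len(train_dataset))        # number of training examples
print(train_dataset[0]["text"])  # first example, in the format you wrote above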
Model
Loading model
Load the Llama 2 non-chat model quantized to 4 bits:
base_model_name = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    use_auth_token=True,
)
base_model.config.use_cache = False
# More info: https://github.com/huggingface/transformers/pull/24906
base_model.config.pretraining_tp = 1
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
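Out of curiosity, you can check roughly how much memory the quantized model takes up; get_memory_footprint is a standard transformers method that returns bytes:
# Rough memory usage of the 4-bit quantized model, in GB
print(f"{base_model.get_memory_footprint() / 1e9:.2f} GB")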
Fine tuning model
And set up the training arguments so that we log, save, and evaluate every 50 steps:
output_dir = "./Llama-2-7b-hf-fine-tune-baby"
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=50,
    max_steps=1000,
    logging_dir="./logs",         # Directory for storing logs
    save_strategy="steps",        # Save the model checkpoint every logging step
    save_steps=50,                # Save checkpoints every 50 steps
    evaluation_strategy="steps",  # Evaluate the model every logging step
    eval_steps=50,                # Evaluate every 50 steps
    do_eval=True,                 # Run evaluation on the eval dataset
)
Note that the effective batch size here is per_device_train_batch_size × gradient_accumulation_steps = 4 × 4 = 16. Next, we set the config for the LoRA adapter:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)
I experimented with higher alpha and r and got slightly worse results :(
Then leverage the beautiful SFTTrainer class from HuggingFace to fine-tune Llama 2:
max_seq_length = 512
trainer = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args,
)
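# (Optional) SFTTrainer wraps the base model in a PeftModel when you pass
# peft_config, so you can check how few parameters LoRA actually trains:
trainer.model.print_trainable_parameters()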
# pass in resume_from_checkpoint=True to resume from a checkpoint
trainer.train()
On a 40GB A100 training for 1000 steps took about 2 hours.
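When training finishes, you can also explicitly save the final adapter; since we're training with PEFT, save_model writes just the small LoRA adapter files rather than a full copy of the base model:
# Save the final LoRA adapter weights to output_dir
trainer.save_model(output_dir)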
Running inference on a trained model
By default, the PEFT library will only save the QLoRA adapters, not the whole model. So to run inference, we first load the base Llama 2 model from the HuggingFace Hub:
base_model_name = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    use_auth_token=True,
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
and load the QLoRA adapter from a checkpoint directory:
from peft import PeftModel

model = PeftModel.from_pretrained(base_model, "/root/llama2sfft-testing/Llama-2-7b-hf-qlora-full-dataset/checkpoint-900")
then run some inference:
eval_prompt = """A note has the following\nTitle: \nLabels: \nContent: i love"""
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))
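If the greedy output looks repetitive, you can pass standard sampling arguments to generate (these are regular transformers generation kwargs, not anything specific to this setup):
with torch.no_grad():
    output = model.generate(
        **model_input,
        max_new_tokens=100,
        do_sample=True,   # sample instead of greedy decoding
        temperature=0.7,  # lower = more focused output
        top_p=0.9,        # nucleus sampling
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))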
As always, if you have any questions, feel free to reach out!