Recently I watched the tutorial Let’s build GPT: from scratch, in code, spelled out by Andrej Karpathy, in which he walks through building a baby GPT model, also known as nanoGPT, from the ground up and explains what is going on under the hood and what lies at the core of ChatGPT.
Let’s build GPT: from scratch, in code, spelled out. — YouTube
If you are new to transformer-based language models and just want to get your foot in the door, then you are in luck, because it is super simple with nanoGPT. Andrej Karpathy provides the code that builds a babyGPT model on Shakespeare’s text in this repository. I am so excited to try it out and build my own babyGPT model 😀
The approach is to train a small transformer-based language model. Please note that the repository provides two options to train a baby GPT model: (1) a character-level transformer model, which models how characters follow each other and predicts which character comes next (see the data/shakespeare_char folder), and (2) a word-level transformer model that predicts which token comes next (see the data/shakespeare folder).
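To make the distinction concrete, here is a minimal sketch (not from the repository) that compares a character-level vocabulary with the GPT-2 BPE tokenizer used by the word-level pipeline; the example string is arbitrary:
import tiktoken

text = "Shall I compare thee to a summer's day?"

# Character-level: build a vocabulary of the unique characters and map each to an integer.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
char_ids = [stoi[ch] for ch in text]
print(len(chars), "chars in the vocabulary ->", len(char_ids), "tokens")

# Word/subword-level: GPT-2 BPE maps frequent chunks of text to single token ids.
enc = tiktoken.get_encoding("gpt2")
bpe_ids = enc.encode_ordinary(text)
print(enc.n_vocab, "entries in the GPT-2 vocabulary ->", len(bpe_ids), "tokens")
The character-level model works with a tiny vocabulary but long sequences, while the BPE-based model sees a much larger vocabulary and far shorter sequences.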
The repository contains the following main files:
- model.py, which defines the GPT model
- train.py, which allows us to train the model on any given text
- prepare.py, which processes the data and converts it into a binary format for training
- the config folder, with files where you can save all the (hyper)parameters relevant to training
- sample.py, which can be used to generate sample text with the trained model
And because I love Ed Sheeran, in this article we will apply nanoGPT to our own dataset to build a songwriter that crafts lyrics that sound like Ed Sheeran! I will be using the word-level transformer model, but feel free to use the character-level model as well. Exciting? Let’s get started!
You can create a new environment and install the following dependencies for the project by simply running:
conda create -n nanogpt
conda activate nanogpt
pip install torch
pip install numpy
pip install transformers
pip install datasets
pip install tiktoken
The dataset
For this article, we will be using the huggingartists/ed-sheeran dataset, which contains the lyrics to all the songs by Ed Sheeran. The dataset is available on Hugging Face, and you can load it directly with the datasets library:
from datasets import load_dataset
dataset = load_dataset("huggingartists/ed-sheeran")
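If you want to peek at what came back before processing it, a quick inspection like the following helps; the code below assumes the dataset exposes a train split whose rows each carry a text field with the song title and lyrics:
# Inspect the splits and one sample record (assumed "train" split with a "text" field).
print(dataset)
print(dataset["train"][0]["text"][:300])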
Cool, we are now ready to do some data processing to extract the lyrics from each song in the dataset. The following block of code does exactly that and saves the processed data into the data/ed-sheeran folder:
import pandas as pd

df = pd.DataFrame(data=dataset)
df['text'] = df.train.apply(lambda row: row.get("text"))
df = df[df.text != ""]

def get_title_lyrics(text):
    # Each record starts with the song title, followed by the word "Lyrics"
    # and then the lyrics themselves.
    lyrics_start = "Lyrics"
    lyrics_index = text.index(lyrics_start)
    title = text[:lyrics_index].strip()
    lyrics = text[lyrics_index + len(lyrics_start):].strip()
    return {'Title': title, 'Lyrics': lyrics}

df[['Title', 'Lyrics']] = df['text'].apply(get_title_lyrics).apply(pd.Series)
df.to_csv("data/ed-sheeran/ed_sheeran.csv")
We need to revise the prepare.py file that does the data processing, since our new input data is a CSV file. Because we are building a word-level transformer model, instead of mapping characters to integers we will encode each token using the GPT-2 tokenizer. We will take the raw lyrics and convert them into a sequence of integers, called token ids, where each token in the text is represented by a unique integer.
First, we select 90% of the text as training data and 10% as validation data. Next, we encode the text using the GPT-2 tokenizer provided by the tiktoken library. The encoded text is split into a training set (train_ids) and a validation set (val_ids). These training and validation sets contain the sequences of integers that correspond to the tokens in the original text:
import os
import tiktoken
import numpy as np
import pandas as pd

df = pd.read_csv("data/ed-sheeran/ed_sheeran.csv")
data = df["Lyrics"].str.cat(sep="\n")
n = len(data)
train_data = data[: int(n * 0.9)]
val_data = data[int(n * 0.9) :]

# encode with tiktoken gpt2 bpe
enc = tiktoken.get_encoding("gpt2")
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

# export to bin files
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(os.path.dirname(__file__), "train.bin"))
val_ids.tofile(os.path.join(os.path.dirname(__file__), "val.bin"))

# train has 433,585 tokens
# val has 48,662 tokens
Now, I will save the above code in a file called prepare-edsheeran.py and run the following command:
python data/prepare-edsheeran.py
What this does is save the train_ids and val_ids sequences as binary files, train.bin and val.bin, each holding the GPT-2 BPE token ids in one long sequence. And that is it! The data is ready, and we can kick off the training.
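For a bit of intuition about why the data is stored as one long sequence: at training time nanoGPT memory-maps these files and slices random windows of block_size tokens out of them. A simplified sketch of that batch sampling, paraphrased rather than copied from train.py, looks like this:
import numpy as np
import torch

block_size = 64   # context length, matching the config below
batch_size = 12

# Memory-map the token ids so the whole file never has to be loaded into RAM.
data = np.memmap("data/ed-sheeran/train.bin", dtype=np.uint16, mode="r")

# Pick random starting positions and cut out (input, target) windows,
# where the target is the input shifted one token to the right.
ix = torch.randint(len(data) - block_size, (batch_size,))
x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
print(x.shape, y.shape)  # both (batch_size, block_size)
The real get_batch in train.py also handles device placement, but the slicing idea is the same.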
In this section, we will quickly train a baby GPT model. In the config folder, I created a new file called config/train_edsheeran.py with the following settings:
out_dir = "out-lyrics"
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 20
log_interval = 10 # don't print too often
# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False
dataset = "ed-sheeran"
batch_size = 12
block_size = 64 # context size
# baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2
learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 2000
lr_decay_iters = 2000 # make equal to max_iters usually
min_lr = 1e-4 # learning_rate / 10 usually
beta2 = 0.99 # make a bit bigger because number of tokens per iter is small
warmup_iters = 100 # not super necessary potentially
# on macbook also add
# device = 'mps' to make use of parallel processing
# compile = False do not torch compile the model
# gradient_accumulation_steps=1 to speed up on macbooks. see https://github.com/karpathy/nanoGPT/issues/28
device = "mps"
compile = False
gradient_accumulation_steps = 1
I have a MacBook and want to run the model on its GPU, therefore we must set device = "mps", compile = False and gradient_accumulation_steps = 1 to speed up the training. Our context size is only 64 tokens, and the batch size is 12 examples per iteration. We will use a much smaller Transformer with 6 layers and 6 heads, and an embedding size of 384, meaning that the embedding vector for each token has 384 dimensions. We will set the number of iterations to 2000.
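To get a feel for how small this baby model really is, here is a rough back-of-the-envelope parameter count for these settings, assuming a GPT-2-style architecture with a 50,257-token vocabulary and weight tying between the token embedding and the output head (the exact number reported by nanoGPT may differ slightly):
n_layer, n_head, n_embd, block_size, vocab_size = 6, 6, 384, 64, 50257

# Embeddings: one vector per vocabulary token plus one per position.
emb = vocab_size * n_embd + block_size * n_embd

# Per transformer block: attention (QKV projection + output projection)
# plus a 4x-wide two-layer MLP; layer norms are negligible by comparison.
attn = 3 * n_embd * n_embd + n_embd * n_embd
mlp = 2 * 4 * n_embd * n_embd
per_block = attn + mlp

total = emb + n_layer * per_block
print(f"~{total / 1e6:.1f}M parameters")  # roughly 30M, most of it in the token embedding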
All of these hyperparameter settings can be changed; of course, if you have more data, time and computing resources, you can train a bigger transformer or run many more iterations. To train the model, run the following in your terminal:
python train.py config/train_edsheeran.py
and voila, training starts…! ***Waiting****
Training is done. Next, we will create a plot showing the loss on the validation set as a function of the number of iterations. Looking at the plot, we notice that the validation loss starts to increase after 500 iterations, suggesting the presence of overfitting.
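The plotting code is not part of nanoGPT; a minimal matplotlib sketch like the one below works if you record the validation losses that train.py prints every eval_interval (here 250) iterations. The values you pass in are whatever you copied from your own training log:
import matplotlib.pyplot as plt

def plot_val_loss(iters, val_losses):
    # Plot the validation loss recorded from train.py's periodic eval prints.
    plt.plot(iters, val_losses, marker="o")
    plt.xlabel("iteration")
    plt.ylabel("validation loss")
    plt.title("Validation loss vs. training iterations")
    plt.show()

# Example usage (placeholder numbers, not my actual results):
# plot_val_loss([250, 500, 750, 1000], [3.4, 3.2, 3.3, 3.5])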
To address this issue, we will limit training to those first 500 iterations and retrain the model. Once retraining finishes, the trained checkpoint ckpt.pt will be saved to the output directory out-lyrics, which allows us to use it to generate text.
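One simple way to do that retraining, keeping everything else the same, is to lower the iteration counts in config/train_edsheeran.py:
# in config/train_edsheeran.py: stop before the validation loss starts rising
max_iters = 500
lr_decay_iters = 500  # keep equal to max_iters, as in the original config
Alternatively, nanoGPT's configurator accepts command-line overrides, so something like python train.py config/train_edsheeran.py --max_iters=500 --lr_decay_iters=500 should work without editing the file.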
Now comes the fun part! Let’s see how well our model has learned to craft songs that sound like Ed Sheeran! We can sample from the best model by pointing the sampling script at its output directory:
python sample.py --out_dir=out-lyrics
This generates a few samples. Here is the result after running 500 iterations:
I think it does sound like Ed Sheeran, with cheesy love songs and romantic themes, doesn’t it? Although a lot of lines come straight out of songs like “Perfect” and “Love Yourself”, it still contains original lines like “before I was holding mine, I don’t wanna be my own”. Not too bad for a small model, and better results are quite likely obtainable by training for longer or by finetuning a pretrained GPT-2 model on this dataset 🙂
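If you want to try that last idea, nanoGPT can initialize from pretrained GPT-2 weights via the init_from option (config/finetune_shakespeare.py in the repository shows the pattern). A hypothetical config/finetune_edsheeran.py might look roughly like this, with the learning rate and iteration counts as assumptions to tune rather than tested values:
# hypothetical config/finetune_edsheeran.py: finetune GPT-2 on the lyrics dataset
out_dir = "out-lyrics-gpt2"
dataset = "ed-sheeran"

init_from = "gpt2"  # start from the pretrained 124M-parameter GPT-2
always_save_checkpoint = False

# finetuning usually wants a much smaller learning rate and fewer iterations
learning_rate = 3e-5
max_iters = 500
lr_decay_iters = 500

batch_size = 4
block_size = 256
gradient_accumulation_steps = 1
Training and sampling then work exactly as before, pointing sample.py at out-lyrics-gpt2.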
In this blog post, we employed nanoGPT to train a personalized language model tailored to Ed Sheeran’s song lyrics. The implementation code is incredibly user-friendly and can be easily adjusted to be used on other datasets. With nanoGPT, anyone can now create their own language models that cater to their unique needs. The best way to learn is through doing, so I encourage you to explore further and experiment with the data of your interest. Happy learning!