Text-Generation Pipeline on Intel® Gaudi® 2 AI Accelerator
With the Generative AI (GenAI) revolution in full swing, text-generation with open-source transformer models like Llama 2 has become the talk of the town. AI enthusiasts as well as developers are looking to leverage the generative abilities of such models for their own use cases and applications. This article shows how easy it is to generate text with the Llama 2 family of models (7b, 13b and 70b) using Optimum Habana and a custom pipeline class – you’ll be able to run the models with just a few lines of code!
This custom pipeline class has been designed to offer great flexibility and ease of use. It provides a high level of abstraction and performs end-to-end text-generation, which includes pre-processing and post-processing. There are multiple ways to use the pipeline: you can run the run_pipeline.py script from the Optimum Habana repository, add the pipeline class to your own Python scripts, or initialize LangChain classes with it.
Prerequisites
Since the Llama 2 models are part of a gated repository, you need to request access if you haven’t done so already. First, visit the Meta website and accept the terms and conditions. After you are granted access by Meta (it can take a day or two), request access on Hugging Face using the same email address you provided in the Meta form.
After you are granted access, log in to your Hugging Face account by running the following command (you will need an access token, which you can get from your user profile page):
huggingface-cli login
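If you prefer a non-interactive login (for example, inside a container or a CI job), the CLI also accepts the token directly; this assumes a recent huggingface_hub release, and <YOUR_HF_TOKEN> is a placeholder for your own token:
huggingface-cli login --token <YOUR_HF_TOKEN>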
You also need to install Optimum Habana (version 1.10.4 at the time of writing) and clone the repo to access the pipeline script. Here are the commands to do so:
pip install optimum-habana==1.10.4
git clone -b v1.10-release https://github.com/huggingface/optimum-habana.git
If you plan to run distributed inference, install DeepSpeed for your version of SynapseAI; the command below assumes SynapseAI 1.14.0.
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.14.0
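As an optional sanity check, you can confirm the install succeeded before moving on:
python -c "import deepspeed; print(deepspeed.__version__)"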
Now you are all set to perform text-generation with the pipeline!
Using the Pipeline
First, go to the following directory in your optimum-habana checkout where the pipeline scripts are located, and follow the instructions in the README to update your PYTHONPATH (a minimal export command is sketched after the commands below).
cd optimum-habana/examples/text-generation
pip install -r requirements.txt
cd text-generation-pipeline
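For reference, the README’s PYTHONPATH update amounts to making the parent text-generation folder importable, since the pipeline scripts import setup_parser from the run_generation module located there. A minimal sketch, assuming your current directory is text-generation-pipeline:
export PYTHONPATH=$(pwd)/..:$PYTHONPATH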
If you wish to generate a sequence of text from a prompt of your choice, here is a sample command.
python run_pipeline.py --model_name_or_path meta-llama/Llama-2-7b-hf --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --prompt "Here is my prompt"
You can also pass multiple prompts as input and change the temperature and top_p values for generation as follows.
python run_pipeline.py --model_name_or_path meta-llama/Llama-2-13b-hf --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --temperature 0.5 --top_p 0.95 --prompt "Hello world" "How are you?"
For generating text with large models such as Llama-2-70b, here is a sample command to launch the pipeline with DeepSpeed.
python ../../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py --model_name_or_path meta-llama/Llama-2-70b-hf --max_new_tokens 100 --bf16 --use_hpu_graphs --use_kv_cache --do_sample --temperature 0.5 --top_p 0.95 --prompt "Hello world" "How are you?" "Here is my prompt" "Once upon a time"
Usage in Python Scripts
You can use the pipeline class in your own scripts as shown in the example below. Run the following sample script from optimum-habana/examples/text-generation/text-generation-pipeline.
import argparse
import logging
from pipeline import GaudiTextGenerationPipeline
from run_generation import setup_parser
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
)
logger = logging.getLogger(__name__)
parser = argparse.ArgumentParser()
args = setup_parser(parser)
args.num_return_sequences = 1
args.model_name_or_path = "meta-llama/Llama-2-7b-hf"
args.max_new_tokens = 100
args.use_hpu_graphs = True
args.use_kv_cache = True
args.do_sample = True
pipe = GaudiTextGenerationPipeline(args, logger)
prompts = ["He is working on", "Once upon a time", "Far far away"]
for prompt in prompts:
    print(f"Prompt: {prompt}")
    output = pipe(prompt)
    print(f"Generated Text: {repr(output)}")
You will have to run the above script with python <name_of_script>.py --model_name_or_path a_model_name, since --model_name_or_path is a required argument. However, the model name can be changed programmatically afterwards, as shown in the snippet above.
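For example, if you saved the snippet above as sample_pipeline.py (a hypothetical file name), you could launch it with:
python sample_pipeline.py --model_name_or_path meta-llama/Llama-2-7b-hf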
This shows us that the pipeline class operates on a string input and performs data pre-processing as well as post-processing for us.
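To make the pre- and post-processing claim concrete, here is a rough, illustrative sketch of what such a pipeline wraps: tokenize the prompt, call generate, and decode the result. This is plain transformers code under assumed defaults, not the actual GaudiTextGenerationPipeline implementation (which additionally handles HPU placement, HPU graphs, and the generation config for you).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_text(prompt: str, max_new_tokens: int = 100) -> str:
    # Pre-processing: turn the raw string into input tensors
    inputs = tokenizer(prompt, return_tensors="pt")
    # Generation
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    # Post-processing: decode token ids back into a string
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generate_text("Once upon a time"))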
LangChain Compatibility
The text-generation pipeline can be fed as input to LangChain classes via the use_with_langchain constructor argument. You can install LangChain as follows.
pip install langchain==0.0.191
Here is a sample script that shows how the pipeline class can be used with LangChain.
import argparse
import logging
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from pipeline import GaudiTextGenerationPipeline
from run_generation import setup_parser
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
)
logger = logging.getLogger(__name__)
parser = argparse.ArgumentParser()
args = setup_parser(parser)
args.num_return_sequences = 1
args.model_name_or_path = "meta-llama/Llama-2-13b-chat-hf"
args.max_input_tokens = 2048
args.max_new_tokens = 1000
args.use_hpu_graphs = True
args.use_kv_cache = True
args.do_sample = True
args.temperature = 0.2
args.top_p = 0.95
pipe = GaudiTextGenerationPipeline(args, logger, use_with_langchain=True)
llm = HuggingFacePipeline(pipeline=pipe)
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer,\
just say that you don't know, don't try to make up an answer.
Context: Large Language Models (LLMs) are the latest models used in NLP.
Their superior performance over smaller models has made them incredibly
useful for developers building NLP enabled applications. These models
can be accessed via Hugging Face's `transformers` library, via OpenAI
using the `openai` library, and via Cohere using the `cohere` library.
Question: {question}
Answer: """
prompt = PromptTemplate(input_variables=["question"], template=template)
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "Which libraries and model providers offer LLMs?"
response = llm_chain(prompt.format(question=question))
print(f"Question 1: {question}")
print(f"Response 1: {response['text']}")
question = "What is the provided context about?"
response = llm_chain(prompt.format(question=question))
print(f"\nQuestion 2: {question}")
print(f"Response 2: {response['text']}")
The pipeline class has been validated for LangChain version 0.0.191 and may not work with other versions of the package.
Conclusion
We presented a custom text-generation pipeline on the Intel® Gaudi® 2 AI accelerator that accepts single or multiple prompts as input. The pipeline offers great flexibility in terms of model size as well as the parameters that affect text-generation quality. It is also very easy to use and to plug into your own scripts, and is compatible with LangChain.
Use of the pretrained model is subject to compliance with third party licenses, including the “Llama 2 Community License Agreement” (LLAMAV2). For guidance on the intended use of the LLAMA2 model, what will be considered misuse and out-of-scope uses, who are the intended users and additional terms please review and read the instructions in this link https://ai.meta.com/llama/license/. Users bear sole liability and responsibility to follow and comply with any third party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use or compliance with third party licenses.
To run gated models like Llama-2-70b-hf, you need the following:
- Have a Hugging Face account
- Agree to the terms of use of the model in its model card on the HF Hub
- Set a read token
- Log in to your account using the HF CLI: run huggingface-cli login before launching your script