The Pipeline is a simple but powerful inference API that is readily available for a variety of machine learning tasks with any model from the Hugging Face Hub.
Tailor the Pipeline to your task with task-specific parameters, such as adding timestamps to an automatic speech recognition (ASR) pipeline for transcribing meeting notes. Pipeline supports GPUs, Apple Silicon, and half-precision weights to accelerate inference and save memory.
Transformers has two pipeline classes, a generic Pipeline and many individual task-specific pipelines like TextGenerationPipeline or VisualQuestionAnsweringPipeline. Load an individual task pipeline by passing its task identifier to the task parameter of Pipeline. You can find the task identifier for each pipeline in its API documentation.
Each task is configured to use a default pretrained model and preprocessor, but this can be overridden with the model parameter if you want to use a different model.
For example, to use the TextGenerationPipeline with Gemma 2, set task="text-generation" and model="google/gemma-2-2b".
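As a minimal sketch of this, using only the checkpoint named above:

from transformers import pipeline

# task="text-generation" loads TextGenerationPipeline with the specified checkpoint
pipeline = pipeline(task="text-generation", model="google/gemma-2-2b")
pipeline("the secret to baking a good cake is")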
Pipeline is compatible with many machine learning tasks across different modalities. Pass an appropriate input to the pipeline and it will handle the rest.
Here are some examples of how to use Pipeline for different tasks and modalities.
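For example, a summarization pipeline condenses a long passage, such as the legislative text below, into a short summary. A minimal setup sketch (google/pegasus-billsum is just one example of a summarization checkpoint):

from transformers import pipeline

pipeline = pipeline(task="summarization", model="google/pegasus-billsum")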
pipeline("Section was formerly set out as section 44 of this title. As originally enacted, this section contained two further provisions that 'nothing in this act shall be construed as in any wise affecting the grant of lands made to the State of California by virtue of the act entitled 'An act authorizing a grant to the State of California of the Yosemite Valley, and of the land' embracing the Mariposa Big-Tree Grove, approved June thirtieth, eighteen hundred and sixty-four; or as affecting any bona-fide entry of land made within the limits above described under any law of the United States prior to the approval of this act.' The first quoted provision was omitted from the Code because the land, granted to the state of California pursuant to the Act cite, was receded to the United States. Resolution June 11, 1906, No. 27, accepted the recession.")
[{'summary_text': 'Instructs the Secretary of the Interior to convey to the State of California all right, title, and interest of the United States in and to specified lands which are located within the Yosemite and Mariposa National Forests, California.'}]
At a minimum, Pipeline only requires a task identifier, model, and the appropriate input. But there are many other parameters available to configure the pipeline, from task-specific options to performance optimizations.
This section introduces you to some of the more important parameters.
Pipeline is compatible with many hardware types, including GPUs, CPUs, Apple Silicon, and more. Configure the hardware type with the device parameter. By default, Pipeline runs on a CPU which is given by device=-1.
To run Pipeline on a GPU, set device to the associated CUDA device id. For example, device=0 runs on the first GPU.
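As a sketch, reusing the Gemma 2 checkpoint from earlier, the pipeline below is placed on the first GPU before being called:

from transformers import pipeline

# device=0 places the model on the first CUDA device
pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", device=0)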
pipeline("the secret to baking a really good cake is ")
You could also let Accelerate, a library for distributed training, automatically choose how to load and store the model weights on the appropriate device. This is especially useful if you have multiple devices. Accelerate loads and stores the model weights on the fastest device first, and then moves the weights to other devices (CPU, hard drive) as needed. Set device_map="auto" to let Accelerate choose the device.
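A minimal sketch, again assuming the Gemma 2 checkpoint:

from transformers import pipeline

# Accelerate decides where the weights live (GPU first, then CPU, then disk)
pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", device_map="auto")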
Pipeline can also process batches of inputs with the batch_size parameter. Batch inference may improve speed, especially on a GPU, but it isn’t guaranteed. Other variables such as hardware, data, and the model itself can affect whether batch inference improves speed. For this reason, batch inference is disabled by default.
In the example below, when there are 4 inputs and batch_size is set to 2, Pipeline passes a batch of 2 inputs to the model at a time.
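A minimal sketch of this, assuming the Gemma 2 checkpoint from earlier and four illustrative prompts:

from transformers import pipeline

pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", device=0, batch_size=2)
pipeline(["the secret to baking a really good cake is", "a baguette is", "paris is the", "hotdogs are"])

Batching also works when streaming over a dataset. In the loop below, the pipeline and dataset are assumed to be set up as in the large-dataset example later in this guide.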
for out in pipeline(KeyDataset(dataset, "text"), batch_size=8, truncation="only_first"):
    print(out)
Keep the following general rules of thumb in mind for determining whether batch inference can help improve performance.
The only way to know for sure is to measure performance on your model, data, and hardware.
Don’t batch inference if you’re constrained by latency (a live inference product for example).
Don’t batch inference if you’re using a CPU.
Don’t batch inference if you don’t know the sequence_length of your data. Measure performance, iteratively add to sequence_length, and include out-of-memory (OOM) checks to recover from failures.
Do batch inference if your sequence_length is regular, and keep pushing it until you reach an OOM error. The larger the GPU, the more helpful batch inference is.
Do make sure you can handle OOM errors if you decide to do batch inference (see the sketch below).
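A rough sketch of one way to handle OOM errors, assuming a CUDA GPU, a list of inputs named texts, and a pipeline created as in the examples above (halve the batch size and retry):

import torch

# texts is any list of inputs; pipeline is a text pipeline created as shown earlier
batch_size = 64
while batch_size >= 1:
    try:
        outputs = pipeline(texts, batch_size=batch_size)
        break
    except torch.cuda.OutOfMemoryError:
        # free cached memory and retry with a smaller batch
        torch.cuda.empty_cache()
        batch_size //= 2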
Pipeline accepts any parameters that are supported by each individual task pipeline. Make sure to check out each individual task pipeline to see what type of parameters are available. If you can’t find a parameter that is useful for your use case, please feel free to open a GitHub issue to request it!
The examples below demonstrate some of the task-specific parameters available.
Pass the return_timestamps="word" parameter to Pipeline to return when each word was spoken.
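A minimal sketch, assuming a Whisper checkpoint and a local audio file (both are placeholders):

from transformers import pipeline

pipeline = pipeline(task="automatic-speech-recognition", model="openai/whisper-large-v3")
pipeline("path/to/meeting_audio.flac", return_timestamps="word")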
Pass return_full_text=False to Pipeline to only return the generated text instead of the full text (prompt and generated text).
__call__ also supports additional keyword arguments from the generate method. To return more than one generated sequence, set num_return_sequences to a value greater than 1.
pipeline("the secret to baking a good cake is",num_return_sequences=4,return_full_text=False)
[{'generated_text': ' how easy it is for me to do it with my hands. You must not go nuts, or the cake is going to fall out.'},
{'generated_text': ' to prepare the cake before baking. The key is to find the right type of icing to use and that icing makes an amazing frosting cake.\n\nFor a good icing cake, we give you the basics'},
{'generated_text': " to remember to soak it in enough water and don't worry about it sticking to the wall. In the meantime, you could remove the top of the cake and let it dry out with a paper towel.\n"},
{'generated_text': ' the best time to turn off the oven and let it stand 30 minutes. After 30 minutes, stir and bake a cake in a pan until fully moist.\n\nRemove the cake from the heat for about 12'}]
There are some instances where you need to process data in chunks.
for some data types, a single input (for example, a really long audio file) may need to be chunked into multiple parts before it can be processed
for some tasks, like zero-shot classification or question answering, a single input may need multiple forward passes which can cause issues with the batch_size parameter
The ChunkPipeline class is designed to handle these use cases. Both pipeline classes are used in the same way, but since ChunkPipeline can automatically handle batching, you don’t need to worry about the number of forward passes your inputs trigger. Instead, you can optimize batch_size independently of the inputs.
The example below shows how it differs from Pipeline.
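In rough pseudocode (the parameter names here are illustrative), a ChunkPipeline preprocesses a single input into several chunks and accumulates one forward pass per chunk, while a Pipeline runs a single forward pass:

# ChunkPipeline: one input can yield several preprocessed chunks
all_model_outputs = []
for preprocessed in pipeline.preprocess(inputs):
    model_outputs = pipeline.forward(preprocessed, **forward_params)
    all_model_outputs.append(model_outputs)
outputs = pipeline.postprocess(all_model_outputs, **postprocess_params)

# Pipeline: a single forward pass per input
preprocessed = pipeline.preprocess(inputs)
model_outputs = pipeline.forward(preprocessed, **forward_params)
outputs = pipeline.postprocess(model_outputs, **postprocess_params)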
For inference with large datasets, you can iterate directly over the dataset itself. This avoids immediately allocating memory for the entire dataset, and you don’t need to worry about creating batches yourself. Try batch inference with the batch_size parameter to see if it improves performance.
from transformers.pipelines.pt_utils import KeyDataset
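from datasets import load_dataset
from transformers import pipeline

# a sketch: the dataset (imdb) and model here are illustrative placeholders;
# any dataset with a text column works
dataset = load_dataset("imdb", split="train")
pipeline = pipeline(task="text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", device=0)

# KeyDataset yields one "text" value at a time, so the full dataset is never loaded into memory at once
for out in pipeline(KeyDataset(dataset, "text"), batch_size=8, truncation="only_first"):
    print(out)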
Accelerate enables a couple of optimizations for running large models with Pipeline. Make sure Accelerate is installed first.
!pip install -U accelerate
The device_map="auto" setting is useful for automatically distributing the model across the fastest devices (GPUs) first before dispatching to other slower devices if available (CPU, hard drive).
Pipeline supports half-precision weights (torch.float16), which can be significantly faster and save memory. Performance loss is negligible for most models, especially for larger ones. If your hardware supports it, you can enable torch.bfloat16 instead for more range.
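A minimal sketch, assuming the Gemma 2 checkpoint from earlier and hardware with bfloat16 support:

import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", device_map="auto", torch_dtype=torch.bfloat16)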
Lastly, Pipeline also accepts quantized models to reduce memory usage even further. Make sure you have the bitsandbytes library installed first, and then add quantization_config to model_kwargs in the pipeline.
import torch
from transformers import pipeline, BitsAndBytesConfig
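# a sketch: the 4-bit config and Gemma 2 checkpoint are illustrative
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
pipeline = pipeline(
    task="text-generation",
    model="google/gemma-2-2b",
    device_map="auto",
    model_kwargs={"quantization_config": quantization_config},
)
pipeline("the secret to baking a good cake is")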