Voxtral
This model was released on 2025-07-15 and added to Hugging Face Transformers on 2025-07-18.
Voxtral is an upgrade of Ministral 3B and Mistral Small 3.1, extending their language capabilities with audio input support. It is designed to handle tasks such as speech transcription, translation, and audio understanding.
You can read more in Mistral’s release blog post.
The model is available in two checkpoints:
- mistralai/Voxtral-Mini-3B-2507
- mistralai/Voxtral-Small-24B-2507
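The examples on this page load the Mini checkpoint. To try the larger model instead, only the repo id needs to change; this is a sketch that assumes your hardware has enough memory for a 24B-parameter model:

```python
# Swap in the larger checkpoint; the examples below are otherwise unchanged.
# Assumes enough GPU memory to hold a 24B-parameter model.
repo_id = "mistralai/Voxtral-Small-24B-2507"
```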
Key Features
Voxtral builds on Ministral 3B by adding audio processing capabilities:
- Transcription mode: Includes a dedicated mode for speech transcription. By default, Voxtral detects the spoken language and transcribes it accordingly.
- Long-form context: With a 32k token context window, Voxtral can process up to 30 minutes of audio for transcription or 40 minutes for broader audio understanding.
- Integrated Q&A and summarization: Supports querying audio directly and producing structured summaries without relying on separate ASR and language models.
- Multilingual support: Automatically detects language and performs well across several widely spoken languages, including English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian.
- Function calling via voice: Can trigger functions or workflows directly from spoken input based on detected user intent (see the sketch after this list).
- Text capabilities: Maintains the strong text processing performance of its Ministral-3B foundation.
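To make the function-calling bullet concrete, here is a minimal sketch. Everything tool-related is hypothetical: the `get_weather` function, the JSON shape, and the prompt wording are made up for illustration, and the model is not guaranteed to follow the requested format, so a real system would validate its output against a schema.

```python
import json

import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from accelerate import Accelerator

device = Accelerator().device
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, dtype=torch.bfloat16, device_map=device)

# Hypothetical tool: not part of the model or the library, defined here for illustration only.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

# Ask the model to turn the spoken content into a JSON call for our hypothetical tool.
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
            },
            {
                "type": "text",
                "text": 'Extract the city whose weather is discussed in the audio and reply only with JSON like {"function": "get_weather", "city": "..."}.',
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=64)
reply = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]

# Fragile by design: the model may not emit valid JSON, so fail gracefully.
try:
    call = json.loads(reply)
    if isinstance(call, dict) and call.get("function") == "get_weather":
        print(get_weather(call["city"]))
    else:
        print("Unexpected call:", reply)
except json.JSONDecodeError:
    print("Model did not return a function call:", reply)
```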
Audio Instruct Mode
The model supports audio-text instructions, including multi-turn and multi-audio interactions, all processed in batches.
➡️ audio + text instruction
```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from accelerate import Accelerator

device = Accelerator().device
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/audio-samples/resolve/main/dude_where_is_my_car.wav",
            },
            {"type": "text", "text": "What can you tell me about this audio?"},
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt.
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```

➡️ multi-audio + text instruction
```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from accelerate import Accelerator

device = Accelerator().device
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "What sport and what nursery rhyme are referenced?"},
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```

➡️ multi-turn:
```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from accelerate import Accelerator

device = Accelerator().device
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
            },
            {"type": "text", "text": "Describe briefly what you can hear."},
        ],
    },
    {
        "role": "assistant",
        "content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/dude_where_is_my_car.wav",
            },
            {"type": "text", "text": "Ok, now compare this new audio with the previous one."},
        ],
    },
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```

➡️ text only:
```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from accelerate import Accelerator

device = Accelerator().device
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What if a cyber brain could possibly generate its own ghost, and create a soul all by itself?",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```

➡️ audio only:
```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from accelerate import Accelerator

device = Accelerator().device
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/dude_where_is_my_car.wav",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```

➡️ batched inference!
```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from accelerate import Accelerator

device = Accelerator().device
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, dtype=torch.bfloat16, device_map=device)

conversations = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
                },
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
                },
                {
                    "type": "text",
                    "text": "Who's speaking in the speech and what city's weather is being discussed?",
                },
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
                },
                {"type": "text", "text": "What can you tell me about this audio?"},
            ],
        }
    ],
]

inputs = processor.apply_chat_template(conversations)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
```

Transcription Mode
Use the model to transcribe audio. It delivers state-of-the-art performance in English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian, and it also supports automatic language detection.
```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from accelerate import Accelerator

device = Accelerator().device
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, dtype=torch.bfloat16, device_map=device)

# Set the language if it is already known, for better accuracy.
inputs = processor.apply_transcription_request(language="en", audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3", model_id=repo_id)
# But you can also let the model detect the language automatically:
# inputs = processor.apply_transcription_request(audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3", model_id=repo_id)

inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
```

This model was contributed by Eustache Le Bihan.
VoxtralConfig
[[autodoc]] VoxtralConfig
VoxtralEncoderConfig
[[autodoc]] VoxtralEncoderConfig
VoxtralProcessor
[[autodoc]] VoxtralProcessor
VoxtralEncoder
[[autodoc]] VoxtralEncoder
    - forward
VoxtralForConditionalGeneration
[[autodoc]] VoxtralForConditionalGeneration
    - forward