# Dia
This model was released on 2025-04-21 and added to Hugging Face Transformers on 2025-06-26.
## Overview

Dia is an open-source, 1.6B-parameter text-to-speech (TTS) model developed by Nari Labs. It generates highly realistic dialogue from a transcript, including non-verbal cues such as laughter and coughing. Emotion and tone can also be controlled through audio conditioning (voice cloning).
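As a quick illustration (the exact tag set comes from the Nari Labs model card and should be treated as an assumption here), transcripts mark speaker turns with `[S1]`/`[S2]` and write non-verbal sounds in parentheses:

```python
# Hypothetical example transcript: two speakers plus a non-verbal (laughs) cue.
text = ["[S1] Welcome back to the show. [S2] Glad to be here! (laughs)"]
```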
**Model Architecture:** Dia is an encoder-decoder transformer based on the original transformer architecture, augmented with more modern components such as rotary positional embeddings (RoPE). For its text portion (encoder), a byte-level tokenizer is used, while for the audio portion (decoder), a pretrained codec model, DAC, encodes speech into discrete codebook tokens and decodes them back into audio.
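For a sense of what the byte-level text tokenizer does, here is a minimal sketch. It assumes the processor exposes the tokenizer as `.tokenizer`; the printed IDs are illustrative only:

```python
from transformers import AutoProcessor

# Encode a short prompt with Dia's byte-level tokenizer: each UTF-8 byte of
# the transcript maps to a token ID, so the text vocabulary stays small.
processor = AutoProcessor.from_pretrained("nari-labs/Dia-1.6B-0626")
ids = processor.tokenizer("[S1] Hi!")["input_ids"]
print(len(ids), ids)  # roughly one ID per input byte (plus any special tokens)
```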
## Usage Tips

### Generation with Text

```python
from transformers import AutoProcessor, DiaForConditionalGeneration
from accelerate import Accelerator

torch_device = Accelerator().device
model_checkpoint = "nari-labs/Dia-1.6B-0626"
text = ["[S1] Dia is an open weights text to dialogue model."]processor = AutoProcessor.from_pretrained(model_checkpoint)inputs = processor(text=text, padding=True, return_tensors="pt").to(torch_device)

model = DiaForConditionalGeneration.from_pretrained(model_checkpoint).to(torch_device)
outputs = model.generate(**inputs, max_new_tokens=256)  # corresponds to around 2s of audio

# save audio to a file
outputs = processor.batch_decode(outputs)
processor.save_audio(outputs, "example.wav")
```
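To generate longer clips, increase `max_new_tokens`: each decoder step emits one codec frame, so the audio length grows roughly linearly with the token budget (256 new tokens is about two seconds here).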

### Generation with Text and Audio (Voice Cloning)

```python
from datasets import load_dataset, Audio
from transformers import AutoProcessor, DiaForConditionalGeneration
from accelerate import Accelerator

torch_device = Accelerator().device
model_checkpoint = "nari-labs/Dia-1.6B-0626"

ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=44100))
audio = ds[-1]["audio"]["array"]
# text is a transcript of the audio + additional text you want as new audio
text = ["[S1] I know. It's going to save me a lot of money, I hope. [S2] I sure hope so for you."]

processor = AutoProcessor.from_pretrained(model_checkpoint)
inputs = processor(text=text, audio=audio, padding=True, return_tensors="pt").to(torch_device)
prompt_len = processor.get_audio_prompt_len(inputs["decoder_attention_mask"])

model = DiaForConditionalGeneration.from_pretrained(model_checkpoint).to(torch_device)
outputs = model.generate(**inputs, max_new_tokens=256)  # corresponds to around 2s of audio

# retrieve actually generated audio and save to a file
outputs = processor.batch_decode(outputs, audio_prompt_len=prompt_len)
processor.save_audio(outputs, "example_with_audio.wav")
```
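Because the audio prompt is prepended to the decoder inputs, passing `audio_prompt_len` to `batch_decode` strips it from the output, so the saved file contains only the newly generated speech rather than the cloning prompt itself.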

### Training

```python
from datasets import load_dataset, Audio
from transformers import AutoProcessor, DiaForConditionalGeneration
from accelerate import Accelerator

torch_device = Accelerator().device
model_checkpoint = "nari-labs/Dia-1.6B-0626"

ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=44100))
audio = ds[-1]["audio"]["array"]
# text is a transcript of the audio
text = ["[S1] I know. It's going to save me a lot of money, I hope."]

processor = AutoProcessor.from_pretrained(model_checkpoint)
# generation=False prepares teacher-forced decoder inputs instead of a
# generation prompt; output_labels=True additionally returns labels so the
# forward pass computes a loss
inputs = processor(
    text=text,
    audio=audio,
    generation=False,
    output_labels=True,
    padding=True,
    return_tensors="pt",
).to(torch_device)

model = DiaForConditionalGeneration.from_pretrained(model_checkpoint).to(torch_device)
out = model(**inputs)
out.loss.backward()
```

This model was contributed by Jaeyong Sung, Arthur Zucker, and Anton Vlasjuk. The original code can be found [here](https://github.com/nari-labs/dia).
## DiaConfig

[[autodoc]] DiaConfig
## DiaDecoderConfig

[[autodoc]] DiaDecoderConfig
## DiaEncoderConfig

[[autodoc]] DiaEncoderConfig
## DiaTokenizer

[[autodoc]] DiaTokenizer
    - __call__
## DiaFeatureExtractor

[[autodoc]] DiaFeatureExtractor
    - __call__
## DiaProcessor

[[autodoc]] DiaProcessor
    - __call__
    - batch_decode
    - decode
## DiaModel

[[autodoc]] DiaModel
    - forward
## DiaForConditionalGeneration

[[autodoc]] DiaForConditionalGeneration
    - forward
    - generate