PE Audio (Perception Encoder Audio)

This model was released on {release_date} and added to Hugging Face Transformers on 2025-12-16.

PE Audio (Perception Encoder Audio) is a state-of-the-art multimodal model that embeds audio and text into a shared (joint) embedding space. The model enables cross-modal retrieval and understanding between audio and text.

Text input

  • Produces a single embedding representing the full text.

Audio input

  • PeAudioFrameLevelModel
    • Produces a sequence of embeddings, one every 40 ms of audio.
    • Suitable for audio event localization and fine-grained temporal analysis.
  • PeAudioModel
    • Produces a single embedding for the entire audio clip.
    • Suitable for global audio-text retrieval tasks.
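
The snippet below sketches the difference in output granularity between the two audio models. The checkpoint name is hypothetical, and the exact processor arguments, expected sampling rate, and output attribute names should be checked against the model card and the API reference below.

```python
import numpy as np
import torch

from transformers import AutoProcessor, PeAudioFrameLevelModel, PeAudioModel

checkpoint = "facebook/pe-audio"  # hypothetical checkpoint name

processor = AutoProcessor.from_pretrained(checkpoint)
clip_model = PeAudioModel.from_pretrained(checkpoint)
frame_model = PeAudioFrameLevelModel.from_pretrained(checkpoint)

# 3 seconds of dummy audio at 16 kHz; the expected sampling rate depends
# on the checkpoint's feature extractor.
audio = np.zeros(3 * 16_000, dtype=np.float32)

# Assumes the processor accepts an `audio` keyword, as in other
# audio models in Transformers.
inputs = processor(audio=audio, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    clip_outputs = frame_model(**inputs)   # one embedding per ~40 ms frame
    pooled_outputs = clip_model(**inputs)  # a single embedding for the whole clip
```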

The resulting embeddings can be used for:

  • Audio event localization
  • Cross-modal (audio–text) retrieval and matching
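
As an illustration of cross-modal retrieval, the sketch below embeds a few text prompts and one audio clip with the clip-level model and ranks the prompts by cosine similarity. The checkpoint name is hypothetical, and the `text_embeds` / `audio_embeds` output attributes are an assumption borrowed from other audio-text models in Transformers; consult the API reference below for the actual call signatures.

```python
import numpy as np
import torch

from transformers import AutoProcessor, PeAudioModel

checkpoint = "facebook/pe-audio"  # hypothetical checkpoint name

processor = AutoProcessor.from_pretrained(checkpoint)
model = PeAudioModel.from_pretrained(checkpoint)

texts = ["a dog barking", "rain falling on a roof", "a crowd applauding"]
audio = np.zeros(5 * 16_000, dtype=np.float32)  # replace with a real clip

# Assumes the processor accepts both modalities in one call.
inputs = processor(
    text=texts, audio=audio, sampling_rate=16_000, padding=True, return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**inputs)

# `text_embeds` / `audio_embeds` are assumed names, following the convention
# of other audio-text models in Transformers.
text_embeds = torch.nn.functional.normalize(outputs.text_embeds, dim=-1)
audio_embeds = torch.nn.functional.normalize(outputs.audio_embeds, dim=-1)

# Cosine similarities between the audio clip and each text prompt.
similarity = audio_embeds @ text_embeds.T
print("Best matching caption:", texts[similarity[0].argmax()])
```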

[[autodoc]] PeAudioFeatureExtractor - __call__

[[autodoc]] PeAudioProcessor - __call__

[[autodoc]] PeAudioConfig

[[autodoc]] PeAudioEncoderConfig

[[autodoc]] PeAudioEncoder - forward

[[autodoc]] PeAudioFrameLevelModel - forward

[[autodoc]] PeAudioModel - forward