GAMMA Lab researchers collaborated with NVIDIA to release Audio Flamingo Next (AF-Next), a next-generation open audio-language model designed for advanced reasoning over speech, sound, and music.
AF-Next introduces Temporal Audio Chain-of-Thought, a reasoning paradigm that grounds each intermediate reasoning step in timestamps within long audio. Tying steps to specific moments in the recording enables more faithful and interpretable reasoning over complex audio inputs, including speech, environmental sounds, music, and long-form recordings.
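To make the idea of timestamp-grounded reasoning concrete, here is a minimal Python sketch. It is not AF-Next's actual output format or API: the `GroundedStep` schema, the helper names, and the example chain are all hypothetical, chosen only to illustrate how a reasoning step tied to an audio span could be represented and checked against the length of the recording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedStep:
    """One intermediate reasoning step tied to an audio span (hypothetical schema)."""
    start_s: float  # span start, in seconds from the beginning of the recording
    end_s: float    # span end, in seconds
    claim: str      # what the step asserts about that span


def ungrounded_steps(steps, audio_duration_s):
    """Return the steps whose spans are empty, reversed, or outside the recording."""
    return [
        s for s in steps
        if not (0.0 <= s.start_s < s.end_s <= audio_duration_s)
    ]


def format_timestamp(seconds):
    """Render seconds as MM:SS for display next to a reasoning step."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


# Illustrative chain for a 30-minute (1800 s) recording; the content is made up.
chain = [
    GroundedStep(12.0, 19.5, "A door slams, then footsteps approach."),
    GroundedStep(754.0, 781.0, "A speaker mentions the delayed train."),
    GroundedStep(1795.0, 1810.0, "Music fades out."),  # runs past the end of the audio
]

bad = ungrounded_steps(chain, audio_duration_s=1800.0)
for step in chain:
    flag = "  <- outside audio" if step in bad else ""
    span = f"{format_timestamp(step.start_s)}-{format_timestamp(step.end_s)}"
    print(f"[{span}] {step.claim}{flag}")
```

Because every step carries an explicit span, a reader or an automated check can jump to the cited moment and confirm or reject the claim, which is the sense in which grounded reasoning is easier to inspect.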
The model family includes three specialized variants: AF-Next-Instruct for general audio question answering, AF-Next-Think for multi-step audio reasoning, and AF-Next-Captioner for detailed audio captioning. The models accept audio inputs of up to 30 minutes and were trained on more than 1 million hours of audio data.
Taken together, the AF-Next models advance open research in audio-language modeling and provide a strong foundation for multimodal systems that can understand, reason over, and interact with real-world audio.