Seedance 1.5 Pro Audio & Video Creation Model Is Officially Launched
December 17, 2025 | Zoey
In the demo, a performer executes fluid spear dances while singing in an operatic style to a drumbeat. The piece was designed to be filmed in a single take, mirroring the approach behind Seedance 1.5 Pro's single-shot generation experiment. While the work differs in many respects from a live opera, the performance as a whole shows clear promise as an operatic piece, with a distinct vocal style.
ByteDance has officially released its next-generation audio-video generation model, Seedance 1.5 Pro. The model natively generates synchronized audio and video from text input, removing the silent-visuals limitation of conventional video generation and giving users the option of creating many kinds of custom multimedia that combine sound and picture in a single production.
- The model offers greater stability in audio-visual correlation during video generation and better alignment of lip sync with vocal intonation. It is also natively multilingual, able to deliver audio-visual performances in a wide variety of languages and dialects while capturing the distinctive emotional rhythms of different speaking styles.
- With proactive camera planning and control, the model can understand and execute complex camera movements (e.g., long-shot tracking and standard zooms) and deliver professional-grade scene transitions and tonal control, adding dynamism and drama to a production.
- The model's deeper semantic analysis enables a more thorough understanding of storyline context and content logic, so it can develop audio and video segments in coordination with the overall structure and rhythm of expression, providing a foundation for high-quality professional content creation.
In multiple comprehensive evaluations, Seedance 1.5 Pro demonstrated significant advantages, with several core capabilities ranking among the best in the industry. Seedance 1.5 Pro is now officially available on Jiemeng AI and Doubao. We welcome you to try it and provide your valuable feedback.
Model Card: https://arxiv.org/pdf/2512.13507
Project Homepage: https://seed.bytedance.com/seedance1_5_pro
Experience Access: (1) Jiemeng web interface - Video Generation - Select Video 3.5 Pro;
(2) Doubao App chat box - Animate Photos - Upload photo - Select 1.5 Pro model - Enter prompt (currently in beta testing)
Beyond the integration of sound and visuals: From material generation to complete narrative expression
Seedance 1.5 Pro is no longer limited to generating fragmented content snippets; instead, it creates video and audio as a unified whole to address more diverse and complex creative needs. With its deep understanding of audiovisual synergy, dynamic scheduling, and cultural context, the model demonstrates more mature narrative capabilities and higher-quality audio-visual integration in applications such as film and television production, short film generation, advertising content creation, and traditional opera performance.
The following section will delve into specific application scenarios to explain in detail how Seedance 1.5 Pro supports and empowers professional-level content creation.
1. Coherent and nuanced narrative expression empowers film-grade creation
By analyzing complex and subtle forms of human emotion, Seedance 1.5 Pro can understand and present emotion far more artistically than before. It also coordinates audio and visuals with high precision, generating image, sound, and atmosphere together to create narratives that are consistent and layered.
When capturing emotion in close-up shots, Seedance 1.5 Pro is strikingly precise: it accurately portrays a character's emotional state from subtle changes in posture or facial expression, even when there is no spoken dialogue.
For example, when generating a video in the style of cyberpunk film, Seedance 1.5 Pro draws on contextual cues from the prompt to render the character's emotional state accurately and continuously throughout the scene.
In addition to emotion-conveying close-ups, Seedance 1.5 Pro can automatically sequence scenes with a basic story structure based on user prompts. For example, when creating an anime, it can continuously produce multiple scenes that fit together naturally, such as a series of close-up shots of the male and female leads confessing to each other, their voices charged with emotion. The result is a coherent, flowing narrative constructed naturally.
2. Professional-grade camera movement and dynamic performance, covering demanding scene requirements
Seedance 1.5 Pro brings systematic improvements to camera control and dynamic tension, enabling more stable handling of complex, high-intensity scenes and demanding motion. High-energy, impactful footage is now much easier to generate.
In a skiing scene, Seedance 1.5 Pro creates a strongly immersive experience by synchronizing sound with camera movement: the camera tracks close to the skier and cuts rapidly between positions, capturing the dynamic snow spray and the adrenaline rush of skiing at high speed and extreme angles.
Additionally, the model possesses an autonomous camera control function, generating multiple complex camera movements and adapting to scene conditions that demand highly precise camerawork. In a simulated red-carpet premiere, for example, it generates multiple fast panning shots, combined with Chinese voiceover, to convey a large, colorful, bustling environment, letting viewers experience the premiere as if it were happening live.
In the generated example of a promotional video for a robotic vacuum cleaner, the footage uses a slow, deliberate camera movement typical of commercial advertisements, continuously following the robot's movement trajectory. This keeps the visual focus consistently on the product itself, resulting in an overall more professional presentation.
3. Multilingual and Dialect Support, Enhancing Stylized Performance Effects in Comedy and Other Genres
Multilingual support in Seedance 1.5 Pro enables seamless generation of authentic pronunciation and speech patterns across languages including English, French, German, Spanish, Polish, and Russian, as well as Chinese (Mandarin, Cantonese, and regional dialects). This makes it well suited to short dramatic sketches, comedy, and other entertainment content.
For example, in a generated scene of a giant panda eating bamboo, the panda suddenly "complains" to the viewer in Sichuan dialect. The model does more than replicate the dialect's distinctive pronunciation and cadence: it synchronizes the panda's body language, emotion, and facial expressions with its speech, producing a realistic performance with greater emotional impact and entertainment value.
4. Precise Sound Generation, Enhancing Immersive Experiences in Games and Other Scenarios
Seedance 1.5 Pro has robust capabilities for comprehending and producing both environmental sound effects and musical atmospheres. It can also generate aligned, layered environmental sounds from video information alone, with high clarity and an accurate audiovisual relationship between picture and sound.
For instance, with pixel-style game footage, the model maintains a consistent rhythm of camera movement through the character's fast-paced running and jumping while generating matching 8-bit game sound effects, keeping audio and visuals coherent even during rapid movement. In a 3D video game example, the model uses the physical layout of the environment to create a fully realized, traversable world: the characters' footsteps and breathing stay in sync with their movements, and the persistent cawing of crows in the far background adds immersion and depth of realism.
This ability makes Seedance 1.5 Pro useful across creative genres including film, advertising, short-form video, and animation. In I2V (image-to-video) tasks, the model shows strong stylistic consistency, keeping character features stable across multiple camera switches and complicated action scenes, which preserves coherence from material generation through to the final video.
Seedance 1.5 Pro Review Summary
Instruction understanding and audio generation performance are particularly outstanding.
To more comprehensively and objectively measure the model's overall capabilities in the field of audio and video generation, the R&D team has built a new evaluation system, SeedVideoBench 1.5. This benchmark was developed with evaluation criteria jointly formulated by film and television industry directors and cutting-edge technology experts, covering multiple key capability dimensions. It focuses on evaluating the model's comprehensive performance in understanding complex visual instructions, motion stability and expressiveness, image aesthetic quality, as well as audio instruction adherence, audio-visual synchronization accuracy, and overall sound quality.
Video Generation Capability Evaluation Results
In terms of video generation, compared with mainstream models participating in the comparative evaluation, Seedance 1.5 Pro performed more accurately in understanding complex action descriptions, camera language, and narrative rhythm. The model can better parse prompts containing multiple constraints, maintaining consistency with user expectations in narrative logic and visual style.
The evaluation results show that videos generated by the model have rich dynamic performance, with continuous, natural character movements and vivid, detailed facial expressions. In scenes with complex camera movements, frame transitions are relatively smooth, consistency with the reference image in style and texture is high, and the overall visual effect approaches real-footage standards. However, image stability in high-intensity motion or rapidly switching scenes still has room for improvement and needs further optimization.
Audio Generation Capability Evaluation Results
In terms of audio generation, Seedance 1.5 Pro's overall performance is already at an industry-leading level. The model demonstrates high consistency and stability in multiple dimensions, including audio instruction understanding, audio-visual synchronization control, sound quality clarity, and expressiveness.
The model can accurately generate matching human voices and environmental sound effects according to instructions, with particularly strong performance in Chinese dialogue scenes: lines are delivered completely, pronunciation is clear and natural, and the model effectively handles various Chinese dialects. Compared with similar models, the generated voices fluctuate more naturally, with a markedly reduced mechanical feel; sound effects come closer to real auditory experience in realism, spatial layering, and reverberation; and audio-visual misalignment is significantly reduced. Although there is still room for improvement in multi-character alternating dialogues, choruses, and singing, Seedance 1.5 Pro already offers considerable practical capability, supporting short dramas, stage performances, and film-grade narrative content centered on Chinese and dialect dialogue.
Multimodal Joint Generation Framework
Reliable Synergy Between Audio / Video
Seedance 1.5 Pro is built on a joint audio-video generation architecture. By rebuilding the underlying architecture, data processing pipelines, training techniques, and inference processes, the team has substantially enhanced generalization and stability across audio-video tasks.
Unified Multimodal Joint Architecture
Based on the MMDiT architecture, the team proposed a unified audio-video joint generation framework. This framework achieves high-precision alignment of visual and auditory streams at both the temporal and semantic levels through a deep cross-modal information interaction mechanism. Through multi-task joint training on a large-scale mixed-modal dataset, the model demonstrates strong adaptability in various downstream audio-video generation tasks.
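To make the joint-sequence idea concrete, here is a minimal, illustrative sketch of single-head attention over concatenated video and audio tokens, the mechanism that lets each modality attend to the other in MMDiT-style blocks. All names, shapes, and weights below are invented for illustration and are not Seedance internals.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_av_attention(video_tokens, audio_tokens, Wq, Wk, Wv):
    """Single-head joint attention over concatenated video+audio tokens.

    Concatenating both modalities into one sequence lets every video
    token attend to every audio token (and vice versa), which is the
    core idea behind joint audio-video generation blocks.
    """
    x = np.concatenate([video_tokens, audio_tokens], axis=0)  # (Tv+Ta, d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    out = softmax(scores) @ v
    n_video = video_tokens.shape[0]
    return out[:n_video], out[n_video:]  # split back per modality

rng = np.random.default_rng(0)
d = 16
video = rng.normal(size=(8, d))   # 8 video patch tokens (toy)
audio = rng.normal(size=(4, d))   # 4 audio frame tokens (toy)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
v_out, a_out = joint_av_attention(video, audio, Wq, Wk, Wv)
print(v_out.shape, a_out.shape)  # (8, 16) (4, 16)
```

In a real model this block would be stacked many times with per-modality embeddings and timestep conditioning; the sketch only shows how the shared attention achieves cross-modal information exchange.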
Multi-stage Data Pipeline Design
Seedance 1.5 Pro's multi-stage data pipeline balances the volumes of audio and video data, improves motion expressiveness, and introduces curriculum-based training scheduling to improve training outcomes. The pipeline also enriches captions: it refines video descriptions and adds audio descriptions to the original videos, yielding diverse, accurate, and reliable data for generating high-quality audio and video content.
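One common way to implement curriculum-based scheduling is to shift sampling weights across difficulty buckets as training progresses, so early training sees simpler clips and later stages mix in harder ones. The sketch below is a hypothetical illustration; the stage boundaries, bucket names, and weights are invented, not Seedance's actual schedule.

```python
import random

# Hypothetical curriculum: early training favors simple clips; later
# stages mix in longer, high-motion, audio-rich samples.
STAGES = [
    # (progress threshold, sampling weights per difficulty bucket)
    (0.3, {"easy": 0.8, "medium": 0.2, "hard": 0.0}),
    (0.7, {"easy": 0.3, "medium": 0.5, "hard": 0.2}),
    (1.0, {"easy": 0.1, "medium": 0.4, "hard": 0.5}),
]

def sample_bucket(progress, rng=random):
    """Pick a difficulty bucket given training progress in [0, 1]."""
    for threshold, weights in STAGES:
        if progress <= threshold:
            buckets, probs = zip(*weights.items())
            return rng.choices(buckets, weights=probs, k=1)[0]
    return "hard"

random.seed(0)
early = [sample_bucket(0.1) for _ in range(1000)]  # early training
late = [sample_bucket(0.9) for _ in range(1000)]   # late training
print(early.count("hard"), late.count("hard"))  # hard samples appear later
```

A data loader would then draw each batch from the chosen bucket, letting the model stabilize on easy material before being exposed to demanding motion and audio.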
Refined Post-training and Optimization Strategies
In post-training, the team fine-tunes Seedance 1.5 Pro with supervised fine-tuning on a curated, high-quality audio-video dataset and introduces a novel reinforcement learning from human feedback (RLHF) algorithm tailored to audio-video generation scenarios. The synergy of multiple reward models has produced marked improvements on T2V and I2V tasks, including substantial gains in motion quality, aesthetics, and audio fidelity.
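A common way to combine several reward models into one RLHF training signal is a weighted sum of per-dimension scores. The sketch below illustrates that pattern with invented reward functions and weights; it is not Seedance's actual reward design.

```python
# Illustrative multi-reward aggregation for RLHF. Each stand-in reward
# model scores one axis of a generated audio-video sample; the weights
# and dimension names are made-up placeholders.

def motion_reward(sample):
    return sample["motion"]

def aesthetic_reward(sample):
    return sample["aesthetics"]

def audio_reward(sample):
    return sample["audio_fidelity"]

REWARD_MODELS = [
    (motion_reward, 0.4),
    (aesthetic_reward, 0.3),
    (audio_reward, 0.3),
]

def combined_reward(sample):
    """Weighted sum of per-dimension rewards, used as the RLHF signal."""
    return sum(w * rm(sample) for rm, w in REWARD_MODELS)

sample_a = {"motion": 0.9, "aesthetics": 0.6, "audio_fidelity": 0.7}
sample_b = {"motion": 0.5, "aesthetics": 0.8, "audio_fidelity": 0.6}
print(combined_reward(sample_a) > combined_reward(sample_b))  # True
```

The policy (here, the generation model) would then be optimized to increase this scalar, which is how improvements along several axes, such as motion quality and audio fidelity, can be pursued simultaneously.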
Inference Optimization for Improved Performance
To improve real-world generation performance, the team further optimized the inference pipeline. Multi-stage distillation reduces the number of function evaluations (NFE) required during generation, and infrastructure optimizations at the inference layer, including quantization and parallel computation, deliver roughly a 10x end-to-end inference speedup with negligible quality loss.
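The benefit of step distillation can be illustrated with a toy counter: iterative generation cost scales with the number of network forward passes (NFE), so a student distilled from a 50-step sampler down to 5 steps performs roughly 10x fewer evaluations. Everything below (class names, step counts, the placeholder update) is a hypothetical illustration, not Seedance's sampler.

```python
# Toy illustration of why step distillation speeds up diffusion-style
# inference: total cost scales with the number of function evaluations.

class DenoiserStub:
    """Stands in for the expensive audio-video denoising network."""
    def __init__(self):
        self.nfe = 0

    def __call__(self, x, t):
        self.nfe += 1      # one network forward pass = one NFE
        return x * 0.9     # placeholder "denoising" update

def generate(model, steps):
    x = 1.0                # stand-in for the initial noise sample
    for t in range(steps):
        x = model(x, t)
    return x

teacher, student = DenoiserStub(), DenoiserStub()
generate(teacher, steps=50)   # baseline iterative sampler
generate(student, steps=5)    # distilled few-step sampler
print(teacher.nfe, student.nfe, teacher.nfe // student.nfe)  # 50 5 10
```

Quantization and parallel execution then reduce the cost of each remaining evaluation, which is how the two optimizations compound into the reported end-to-end speedup.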
Summary and future outlook
Compared to its predecessor, Seedance 1.0, Seedance 1.5 Pro has achieved a significant leap forward in immersive audiovisual experience and professional-grade narrative expression capabilities. Leveraging a joint audio-visual generation architecture and refined post-training strategies, the model demonstrates greater maturity in multimodal instruction understanding and execution – exhibiting excellent application potential in both film-quality high-dynamic range shots and dialect performance scenarios requiring high lip-sync accuracy.
At the same time, the team is also aware that the model still has room for improvement in areas such as stability in complex physical movements, multi-character complex dialogues, and singing content generation. In the future, the Seed team will continue to explore longer-sequence narrative generation capabilities, lower-latency real-time experiences on end devices, and continuously enhance the model's understanding of the laws of the physical world and its multimodal perception capabilities.
The team hopes that the Seedance series of models will become more vivid, efficient, and truly understand the needs of creators in the future, helping content creation break through sensory boundaries and unleash more imaginative audiovisual expression potential.

