
Gemini ‘Omni’ Video Model Revealed: Google Maybe Building a Unified AI Multimodal Generation System

May 12, 2026 | Zoey

Google I/O 2026 is expected to be where we hear more about Google's new AI video model, but details have already leaked ahead of the event.

A while ago, some Gemini users spotted a feature in the app labeled "Powered by Omni." Google has not officially confirmed the model, but there has been considerable chatter on social platforms such as Reddit, with some users sharing their hands-on experiences. Based on the information available, Omni may represent more than an evolutionary step beyond Veo: it may unify components of Google's various AI models into a single multimodal generation framework.

If this is confirmed to be a new AI video model, it would be a significant leak for the current AI video space.

What is Gemini Omni?

The Gemini app describes Omni as an entirely new mode of video creation.

The system description reads:

"Try our new mode of creating videos. Remix videos, edit in the chat, use templates and other ways to create video content."

This suggests that Omni aims to do more than simply "create a video."

It appears to combine video generation, video editing, chat-based control, template-based video creation, audio synchronization, and multi-camera work, all through the Gemini chat interface.

In other words, Google likely wants people to create video content the same way they "chat."

Why is everyone starting to pay attention to Omni?

The excitement in the community comes not simply from Omni's feature list but from what its early demonstrations reveal about its capabilities. According to several user tests, Omni delivers experiences fundamentally different from previous Veo releases, especially in audio, an area that has long been overlooked in AI-generated video.

Most AI video models can produce highly realistic visuals, but audio has been the weakest part of nearly all of them. AI-generated audio often sounds completely out of sync with the imagery: background noise that does not match the location shown, sound that is not spatially positioned relative to the speaker, or an unnatural flatness with no ambient sound at all.

Early users of Omni, however, specifically highlight its ambient sound: subtle background noise in a restaurant, wind on a beach, voices that fade naturally with distance from the speaker, distinct vocal layers that keep each speaker identifiable, and echo that matches the environment. Users report that Omni's audio approaches the quality of live-action recordings and is a clear improvement over the existing Veo series.

If this feedback is officially validated, it would suggest that Google has made native audio generation a primary focus, and that could become an important direction in the next phase of competition between AI video products.

Are there any limitations to Omni?

Nonetheless, based on the feedback available so far, while Omni is excellent at generating high-quality video, it is also an extremely compute-intensive operation.

For example, one tester reported that generating just two videos consumed 86% of their daily AI Pro quota. This clearly shows that high-quality AI video generation remains an exceptionally costly process, particularly once additional aspects such as realistic audio, audio sync, and complex motion are factored in.

This is one of the principal reasons that Google currently continues to restrict the extent of its testing.

Thus, even if Omni officially launches at Google I/O, it will almost certainly not be fully available to everyone at that time. Access will likely be prioritized for Pro subscribers, with limits on the number of generation attempts, queuing, and caps on default output resolution to help manage server load.

This rollout pattern closely mirrors that of most modern AI video platforms. Essentially every high-quality AI video model has faced the same problem early in its development: the greater the model's capability, the heavier the demand it places on compute infrastructure.

Conclusion

The leaked demos and user feedback point to a clear direction: Google is integrating AI video functionality through Gemini Omni. Where competition previously focused on improving video quality and length, the market is now shifting toward highly realistic audio, multimodal synchronization, and more natural interactive experiences, all of which Omni's capabilities appear to demonstrate.

Omni doesn't appear to be just another video model; it could also serve as the starting point of a gradually merging relationship between Gemini, Veo, etc., in regards to generative functions. If Google achieves “chat style video creation”, it could mark the beginning of a new phase for the AI video marketplace. As Google I/O approaches, we can expect to see further details released regarding Omni.


viddo.ai


© 2026 viddo.ai. All rights reserved.