You write a text prompt and Google’s new Imagen Video tool generates a high-definition video from it. To do this, it uses a base video generation model and a sequence of spatial and temporal video super-resolution models. The tool can generate a variety of videos and text animations in different artistic styles and with an understanding of 3D objects.

In other words, a person describes the video they want. Examples given by Google: an astronaut riding a horse; coffee pouring into a cup; a giraffe in a microwave; a view of a castle with tall towers reaching into the clouds in a hilly forest at sunrise; a sheep to the right of a glass of wine. The Google tool creates these videos by interpreting natural language.

Just a few days ago Meta launched a similar tool, called Make-A-Video, and Google has been quick to follow with its own AI. Imagen Video can generate video clips from text: for example, you say “a teddy bear washing dishes” and in a few seconds it creates a video of a teddy bear doing the washing-up. The results are not perfect but, as Google says, it is a step towards a system with a “high degree of control” and knowledge of the world, including the ability to generate sequences in a range of artistic styles.

It is similar to the hugely popular DALL-E (along with its Mini variant and the more advanced DALL-E 2, which has also become very popular), but in this case for videos. Google already has Parti, a model for generating photorealistic images, and Imagen, which creates photographs from text (even surreal ones, as we have already been able to verify).

How this technology works

Imagen Video generates high-resolution videos with cascaded diffusion models. The first step is to take the input text and pass it through a T5 text encoder. Next, a base video diffusion model outputs a 16-frame video at 24×48 resolution and 3 frames per second.

Next, Google’s AI uses several temporal (TSR) and spatial (SSR) super-resolution models to upsample the clip and generate a final 128-frame video at 1280×768 resolution and 24 frames per second, which yields 5.3 seconds of HD video. Imagen Video uses the Video U-Net architecture to capture spatial fidelity and temporal dynamics, which allows it to model long-term temporal dynamics.
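
The frame counts and frame rates quoted above can be checked with a little arithmetic. The short Python sketch below is only an illustration and is not Google's code: the function and variable names are made up for this example, and it uses only the numbers given in the article. It shows that the base clip and the final clip last the same amount of time, so the temporal super-resolution models add frames in between rather than extending the video.

```python
from fractions import Fraction

# Figures quoted in the article (names here are illustrative, not Google's).
BASE_FRAMES, BASE_FPS = 16, 3       # output of the base video model
FINAL_FRAMES, FINAL_FPS = 128, 24   # output after all super-resolution stages


def duration_seconds(frames: int, fps: int) -> Fraction:
    """Clip length in seconds, kept as an exact fraction."""
    return Fraction(frames, fps)


base_duration = duration_seconds(BASE_FRAMES, BASE_FPS)
final_duration = duration_seconds(FINAL_FRAMES, FINAL_FPS)

# How many times more frames the final clip has than the base clip.
temporal_factor = Fraction(FINAL_FRAMES, BASE_FRAMES)

print(f"base clip:  {float(base_duration):.1f} s")
print(f"final clip: {float(final_duration):.1f} s")
print(f"frames multiplied by: {temporal_factor}")
```

Both durations come out at 16/3 of a second, which is the 5.3 seconds mentioned above, while the number of frames grows eightfold.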

For now, we can see the results, but we cannot try the tool ourselves.