One-sentence AI video generation goes viral: 480×480 resolution, Chinese input only
In less than a week, AI image generators have taken another big step forward: generating video directly from a single sentence. Type in "a woman running on the beach in the afternoon" and out pops a 4-second, 32-frame clip.
The latest text-to-video AI is CogVideo, a model from Tsinghua University and the Beijing Academy of Artificial Intelligence (BAAI).
As soon as the demo was posted online it went viral, and some people were already clamoring for the paper:
CogVideo descends from the text-to-image generation model CogView2. This series of AI models only supports Chinese input, so non-Chinese speakers who want to try it will need Google Translate:
"This is too fast, given that the text-to-image generation models Dall-E2 and Imagen are just coming out."
Some netizens even speculated that, at this rate, we'll soon see AI generating 3D video for VR headsets from a single sentence:
So, what exactly is this AI model called CogVideo?
Generate a low-frame-rate video, then interpolate frames
According to the team, CogVideo should be the largest open-source text-to-video generation model to date, and the first of its kind.
In terms of model design, CogVideo has 9 billion parameters, is built on the pre-trained text-to-image model CogView2, and is divided into two modules.
The first module, based on CogView2, generates several frames of images from the text; at this stage the frame rate of the assembled video is still very low. The second module then interpolates additional frames using a bidirectional-attention model, producing a complete video with a higher frame rate.
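The two-stage flow above can be sketched in a few lines. This is a toy stand-in, not the real model: `generate_keyframes` fakes stage 1 with random pixels (the real stage autoregressively samples image tokens with CogView2), and `interpolate_once` fakes stage 2 by averaging neighbouring frames rather than running bidirectional attention.

```python
import numpy as np

def generate_keyframes(prompt: str, n_frames: int = 5, size: int = 160) -> np.ndarray:
    """Stage 1 stand-in: pretend to map text to a few low-frame-rate keyframes."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((n_frames, size, size, 3))

def interpolate_once(frames: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: double the frame rate by inserting the average of each
    neighbouring pair (the real model interpolates with bidirectional attention)."""
    mids = (frames[:-1] + frames[1:]) / 2.0
    out = np.empty((frames.shape[0] + mids.shape[0],) + frames.shape[1:])
    out[0::2] = frames   # original keyframes keep the even slots
    out[1::2] = mids     # interpolated frames fill the odd slots
    return out

def text_to_video(prompt: str, interp_rounds: int = 3) -> np.ndarray:
    """Run stage 1 once, then apply stage 2 repeatedly to raise the frame rate."""
    frames = generate_keyframes(prompt)
    for _ in range(interp_rounds):
        frames = interpolate_once(frames)
    return frames
```

Note how three interpolation rounds grow 5 keyframes into 33 frames (5 → 9 → 17 → 33), which is roughly the ~32 frames per clip the article mentions.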
For training, CogVideo used a total of 5.4 million text-video pairs.
Instead of just "plugging" text and video directly into the AI, you need to split the video into several frames and add an extra frame tag to each image.
This avoids the AI seeing a sentence and giving you several identical video frames.
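A minimal sketch of that preprocessing, with made-up tag tokens (the paper's actual token vocabulary is not given here): each frame's tokens are prefixed with a frame-position tag before being appended to the text tokens.

```python
def build_training_sequence(text_tokens, frame_token_lists):
    """Concatenate text tokens with per-frame tokens, inserting a frame-position
    tag before each frame so the model can tell the frames apart."""
    seq = list(text_tokens)
    for idx, frame_tokens in enumerate(frame_token_lists):
        seq.append(f"<frame_{idx}>")   # hypothetical frame tag, for illustration
        seq.extend(frame_tokens)
    return seq
```

For example, two frames of tokens `[1, 2]` and `[3, 4]` after the text "a cat" become one sequence with a tag ahead of each frame.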
Each training video originally had a resolution of 160×160 and was upsampled (enlarged) by CogView2 to 480×480, so the generated videos are also 480×480.
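To make the 160×160 → 480×480 step concrete, here is a nearest-neighbour upsampler as a stand-in for CogView2's learned super-resolution (the real model does far better than pixel repetition; this only illustrates the 3× size change):

```python
import numpy as np

def upsample_frame(frame: np.ndarray, factor: int = 3) -> np.ndarray:
    """Nearest-neighbour upsampling: repeat each pixel `factor` times along
    height and width, so 160x160 becomes 480x480 when factor == 3."""
    return frame.repeat(factor, axis=0).repeat(factor, axis=1)
```

The channel axis is untouched, so an RGB frame stays RGB.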
As for frame interpolation, the bidirectional-attention module is designed to let the AI understand the semantics of the preceding and following frames.
The result is a fairly smooth video: the output 4-second clip contains around 32 frames.
Highest score in human evaluation
The paper evaluates the model with both automated metrics and human scoring.
The researchers first tested CogVideo on two human-action video datasets, UCF-101 and Kinetics-600.
Among the metrics, FVD (Fréchet Video Distance) evaluates the overall quality of the generated video, and lower is better; IS (Inception Score) evaluates generated image quality mainly in terms of sharpness and diversity, and higher is better.
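For intuition on FVD: it is the Fréchet distance between two Gaussians fitted to video features (real videos vs. generated videos, using an I3D feature extractor in practice). The simplified diagonal-covariance form below shows the formula's shape; real FVD uses full covariance matrices.

```python
import numpy as np

def frechet_distance(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariance:
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2)).
    Identical distributions give 0; lower means more alike."""
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2)))
```

Two identical feature distributions score 0, which is why lower FVD means the generated videos are statistically closer to real ones.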
Overall, on these automated metrics, the quality of the videos generated by CogVideo was middling.
From the perspective of human preference, however, CogVideo's videos scored far higher than those of other models, even achieving the top score among the current best generative models:
Specifically, volunteers were given a rating scale and asked to evaluate randomly presented videos from several models by how well they were generated, and each model's overall score was determined from these ratings:
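A toy version of that aggregation, assuming (the paper's exact rubric is more detailed) that a model's overall score is simply the mean of the ratings it receives:

```python
def overall_scores(ratings):
    """Map each model name to the mean of its volunteer ratings.
    `ratings` is {model_name: [rating, rating, ...]}."""
    return {model: sum(r) / len(r) for model, r in ratings.items()}
```

With ratings {"A": [4, 5], "B": [2, 3]} this yields 4.5 for A and 2.5 for B, so A wins the human evaluation.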
CogVideo's co-first authors Hong Wenyi and Ding Ming, second author Zheng Wendi, and third author Xinghan Liu are all from the Department of Computer Science at Tsinghua University.
Hong Wenyi, Ding Ming and Zheng Wendi were previously also authors of CogView.
The paper's advisor is Tang Jie, a professor in the Department of Computer Science at Tsinghua University and academic vice president of BAAI, whose research interests include AI, data mining, machine learning and knowledge graphs.
Some netizens felt there are still aspects of CogVideo worth probing: DALL-E 2 and Imagen both showcased unusual prompts to prove they generate from scratch, whereas CogVideo's results look more like they were "stitched together" from the dataset:
For example, a video of a lion drinking water directly "with its hands" doesn't quite match common sense (though it is hilarious):
(Kind of like a bird with two hands, right?)
But some netizens pointed out that the paper offers some new ideas for language models: video training may further unlock their potential, since video not only provides abundant data but also carries common sense and logic that are hard to convey in text.
The CogVideo code is still a work in progress; interested readers can keep an eye on the repo~
Project & paper links: