BEIJING, July 10 (Xinhua) -- Chinese researchers have built a new tri-modal pre-training model that enables mutual generation between speech and images.
The model, OPT (Omni-Perception pre-Trainer), can jointly learn multi-modal content across text, speech, images and video.
Current pre-training models often cover the image, video and text modalities while ignoring speech information from the environment. To address this limitation, the new model can perform cross-modal generation tasks such as generating images from text, text from images, and images from speech.
The new model will promote the development of artificial intelligence (AI) and significantly improve performance on basic tasks involving text, speech, images and video, according to the Institute of Automation of the Chinese Academy of Sciences, the model's developer.
It has great potential value in speech recognition and synthesis, as well as in commercial applications such as human-computer interaction and autonomous driving. Enditem
Original website: http://www.xinhuanet.com/english/2021-07/10/c_1310054059.htm