[Image source: Gettyimagesbank]
Big tech companies in the United States are expected to take a leap forward in the global artificial intelligence (AI) race by introducing a multimodal AI that can analyze and generate various forms of data such as speech, images, and videos, going beyond text.
South Korean companies, on the other hand, are more focused on developing text-based large language models (LLMs).
According to U.S. information technology-focused publication The Information on Monday, Google and OpenAI each aim to release a multimodal AI this year.
Unlike LLMs, which only generate text in response to a text prompt, multimodal AI can take in and produce text, images, speech, video, and more.
For example, if a user uploads a photo of a dish, it can identify the ingredients and generate cooking instructions, or when a document containing figures is uploaded, it can instantly render graphs and charts.
Google is among the big tech companies rushing to release multimodal AI. It has completed the development of its multimodal engine, Gemini, and is testing it with some companies.
Gemini has around one trillion parameters, which are roughly analogous to synapses in the human brain. This is about twice as many parameters as the latest version of OpenAI's GPT-4, which is estimated to have around 500 billion.
Industry observers expect that Gemini will automatically generate and analyze scripts when the URL of a YouTube video is entered. The service, however, is unlikely to be free, with a monthly subscription fee of about $30.
Google has been actively consolidating its AI organizations to compete with OpenAI and Microsoft. It merged DeepMind with its internal AI unit, Brain, to form Google DeepMind, and appointed Demis Hassabis, the key figure behind AlphaGo, as its CEO.
Google co-founder Sergey Brin is known to have actively supported the move.
OpenAI, in the meantime, has taken steps to compete with Google in the multimodal AI business.
When OpenAI unveiled GPT-4 in March, it demonstrated an initial version of multimodal AI. When a cooking image was uploaded, the model generated a recipe and analyzed the ingredients, but the feature never went beyond the demonstration.
The Information noted that OpenAI is expected to release a technology known as GPT-Vision soon, adding that the company is also running a project called Gobi, which is more powerful than GPT-Vision.
OpenAI previously introduced GPT-4, an LLM, and an image AI called DALL-E. Gobi, however, is being built as a multimodal AI from the ground up and is expected to differ from a mere combination of an LLM and an image model.
OpenAI is also actively recruiting talent. According to its website, it is currently hiring multimodal experts, offering a maximum annual salary of $370,000.
Industry insiders note that the multimodal AI competition between Google and OpenAI has entered its second round.
The Information said that Google has a strong business advantage in the field of multimodal because it owns search engines and YouTube.
According to market research firm ABI Research, widespread adoption of multimodal AI could bring about significant changes in the autonomous driving, robotics, and smart home sectors. For example, the AI could analyze images and video captured by robots and relay the results to consumers as easily understandable text.
According to Fortune Business Insights, the global AI industry is projected to reach $2.03 trillion in 2030, up from $428 billion in 2022.
There are concerns, however, about misuse stemming from multimodal AI's ability to learn from various data sources. For example, it could be exploited to defeat facial recognition systems once a profile photo is uploaded for analysis.
OpenAI has delayed the release of GPT-Vision for this reason, but competition is expected to accelerate with Google firmly in the lead.
In Korea, LG has introduced EXAONE, a multimodal AI that works bidirectionally between images and text, generating captions from images and images from text.
By Lee Sang-duk and Lee Eun-joo
[ⓒ Pulse by Maeil Business Newspaper & mk.co.kr, All rights reserved]