OpenAI Releases GPT-4o Upgrade, with Breakthrough AI Voice Assistant

On Monday, OpenAI debuted GPT-4o (o for “omni”), a major new AI model that can ostensibly converse using speech in real time, reading emotional cues and responding to visual input. It operates faster than OpenAI’s previous best model, GPT-4 Turbo, and will be free for ChatGPT users and available as a service through API, rolling out over the next few weeks, OpenAI says.

OpenAI revealed the new audio conversation and vision comprehension capabilities in a YouTube livestream titled “OpenAI Spring Update,” presented by OpenAI CTO Mira Murati and employees Mark Chen and Barret Zoph that included live demos of GPT-4o in action.

OpenAI claims that GPT-4o responds to audio inputs in about 320 milliseconds on average, which is similar to human response times in conversation, according to a 2009 study, and much shorter than the typical 2–3 second lag experienced with previous models. With GPT-4o, OpenAI says it trained a brand-new AI model end-to-end using text, vision, and audio in a way that all inputs and outputs “are processed by the same neural network.”

“Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations,” OpenAI says.

During the livestream, OpenAI demonstrated GPT-4o’s real-time audio conversation capabilities, showcasing its ability to engage in natural, responsive dialogue. The AI assistant seemed to easily pick up on emotions, adapted its tone and style to match the user’s requests, and even incorporated sound effects, laughing, and singing into its responses.

he presenters also highlighted GPT-4o’s enhanced visual comprehension. By uploading screenshots, documents containing text and images, or charts, users can apparently hold conversations about the visual content and receive data analysis from GPT-4o. In the live demo, the AI assistant demonstrated its ability to analyze selfies, detect emotions, and engage in lighthearted banter about the images.

Additionally, GPT-4o exhibited improved speed and quality in more than 50 languages, which OpenAI says covers 97 percent of the world’s population. The model also showcased its real-time translation capabilities, facilitating conversations between speakers of different languages with near-instantaneous translations.

 In the past, OpenAI’s multimodal ChatGPT interface used three processes: transcription (from speech to text), intelligence (processing the text as tokens), and text to speech, bringing increased latency with each step. With GPT-4o, all of those steps reportedly happen at once. It “reasons across voice, text, and vision,” according to Murati. They called this an “omnimodel” in a slide shown on-screen behind Murati during the livestream.

OpenAI announced that GPT-4o will be accessible to all ChatGPT users, with paid subscribers having access to five times the rate limits of free users. GPT-4o in API form will also reportedly feature twice the speed, 50 percent lower cost, and five-times higher rate limits compared to GPT-4 Turbo. (Right now, GPT-4o is only available as a text model in ChatGPT, and the audio/video features have not launched yet.)

Related posts

ET Hunters Using AI to Find Unusual Phenomena in Space

Phygital Labs Launches Art & Culture Metaverse ‘Galerio’

Why AI’s Large Language Models Can’t Spell ‘Strawberry’

This website uses cookies to improve your experience. To read more or opt here visit the privacy policy. Read More