Sunday, June 15, 2025

The Future Is Multimodal: Why Voice Alone Will Never Be the Answer

PCs and mobile phones generally need a screen to be practical, and screens demand effective input modalities: a way to point at what we see (typically a mouse or a finger) and a way to enter text or commands (typically a keyboard). Being effective with a computer comes down to communicating with it, and the answer lies in finding the most productive ways to exchange data and instructions with the machine.

Amazon Echo Brought Voice to the Forefront

In 2014, Amazon made waves with the Amazon Echo, a computing device that needed no monitor, keyboard, mouse, or touchpad to be useful in the home. Making voice the sole channel between device and user was a revolutionary concept. You could argue the Echo was built for specific tasks rather than to replace our laptops, yet it handled many things we commonly do on a computer: checking the forecast, answering questions, setting alarms, and keeping reminders. It could also play games, control the home, call friends, and, of course, play music. That last activity alone would have justified the device. Calling out "Alexa, play the Beatles" is the easiest, fastest way to hear their music, compared with the chore of navigating to the artist, picking an album or station, and pressing play.

The Echo's introduction set off a new wave of excitement, complete with accompanying features and gadgets. Programmers rushed to apply their coding skills to a user interface most of them had never designed for (even though spoken language has been around for ages). They tried to recreate popular mobile apps and web experiences as speech-based interactions without considering what makes voice user interfaces unique. The first versions were exactly what you'd expect: awkward, hard to use, and usually short-lived, since their utility was so limited. A comparable gold rush was happening in parallel: conversational bots on messaging platforms such as Facebook Messenger.

The problem was compounded by the lack of a common term for these voice experiences. Amazon calls them "skills", Google calls them "actions", and Samsung chose the unwieldy "capsules" for its Bixby assistant. Apple still has no third-party ecosystem for Siri; in customary Apple fashion, the company is carefully working out how this new input modality can serve users before making it a standard. It's ironic that Apple shipped the first mainstream voice assistant in 2011: Siri was the last major product introduction before the untimely death of Steve Jobs.

Voice’s Ongoing Struggle With Stickiness

Fast-forward six years, and voice-activated home assistants rank among the fastest-growing consumer electronics ever. What's striking is how slowly third-party skills and actions have been adopted, and how little staying power the community's contributions have shown, in contrast to how valuable community contributions proved for mobile apps. We still mainly use our home assistants to play music, set reminders, and ask for information. No one in my family has used anything on our Amazon Echo beyond the built-in features. Why is that?

Voice enthusiasts coined the hashtag #VoiceFirst to promote voice as the primary mode of interaction. They argue it is the most natural channel of all, the original medium humans evolved to communicate with one another. And it certainly is! By that logic, we should all be building our own voice apps and using them constantly. We definitely aren't.

By now it should be apparent that voice is not just another "channel", as some believe. It cannot, on its own, sustain a whole cohort of voice-only developers, plus the agencies that build voice experiences for other companies. It is, above all, a modality with distinctive characteristics, one that demands dedicated VUI design expertise to get right, and it does not have the potential to reach the scale and ubiquity of the web or mobile.

Some blame the problem of "discoverability". Without a screen to show us which apps are available, the way an app store or home page does, it is up to us to remember what the device can do. In theory, a voice assistant should be able to talk us through its capabilities conversationally, right when we need them. In practice, that is exactly where things fall apart.

The Problem With Voice Only? It’s Both Fast and Slow

Voice can be a frustrating channel. As input it is fast: a 2016 Stanford study found that people can get a message across roughly three times faster by speaking than by typing. Yet it is surprisingly slow as an output channel.

There is a reason we listen to podcasts at 1.25x or 1.5x speed. When we open a complex website like Amazon.com, we grasp its broad structure almost instantly; now imagine someone reading the page aloud to you over the phone. It would take ages, and many pages rely heavily on visuals anyway. Add to that the fact that dialogue is error-prone. Speech-to-text keeps getting better, but understanding that text, with all the ambiguity of natural language, is a much harder problem. When it comes to holding a real conversation with our devices, we are still at the beginning; today's implementations are mostly simple question-and-answer systems.

Would progress in dialogue really help? Imagine combining all the information on the web with human-level conversational ability and exposing it through a voice-only home assistant. Even then, it would fall short for many everyday needs, because so much of its usefulness depends on being able to show us what we need to see, not just talk about it. "A picture is worth a thousand words."

Building the future

There is no doubt that most products in this space will combine commercial APIs, open-source libraries, and AI engines with proprietary algorithms and training data sets. Putting it all together won't be simple, but nearly all the pieces now exist for someone to build the future that today appears only on the demo stage at technology events.

Major players such as Amazon are bringing this potential within reach by offering cloud-based APIs. Within the AWS suite alone, you can use the Chime SDK to build communication applications, Rekognition for image and video analysis, SageMaker for developing machine learning models, and Polly for natural-sounding speech in many languages.
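To make the Polly piece concrete, here is a minimal sketch of how a text-to-speech request might be assembled with the `boto3` SDK. Building the parameter dictionary is plain Python; the actual `synthesize_speech` call (shown commented out) would require AWS credentials and a network connection, so treat this as an illustration rather than a drop-in implementation.

```python
# Sketch of a text-to-speech request to Amazon Polly via boto3.
# Only the parameter assembly runs here; the API call is commented
# out because it needs AWS credentials and network access.

def build_polly_request(text, voice="Joanna", fmt="mp3"):
    """Assemble the keyword arguments for polly.synthesize_speech()."""
    return {
        "Text": text,
        "VoiceId": voice,        # one of Polly's built-in voices
        "OutputFormat": fmt,     # "mp3", "ogg_vorbis", or "pcm"
        "Engine": "neural",      # neural voices sound more natural
    }

params = build_polly_request("Welcome to the meeting.")

# With credentials configured, the call would look like:
#   import boto3
#   polly = boto3.client("polly")
#   audio = polly.synthesize_speech(**params)["AudioStream"].read()

print(params["VoiceId"])
```

The same pattern, a thin wrapper that prepares a request for a cloud AI service, applies equally to Rekognition or the Chime SDK.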

Four ways Video and Conversational AI can overlap

There is great potential for voice and video technologies to intersect. Here are four ideas to consider:

  1. Bots and Agents working smoothly together in Customer Service
  2. Multimodal in Video Chat
  3. Video Chat with a Bot
  4. Hyper Augmentation of Video

Let’s explore each in turn.

#1 – Bots and Agents working smoothly together in Customer Service

You are already accustomed to dealing with bots in customer service. When you call a support line or start a chat on a website, the first exchange is often with an automated program. That bot may not be very sophisticated; it is probably little more than a collection of frequently asked questions and canned answers. To reach a real person, you typically have to say "Agent" or press "0" to bypass it.

As Conversational AI matures, the handoff from bot to agent will become seamless; you might not even notice it happening. In an online chat, the bot can collect the preliminary details and pass that context to the agent, who then joins the text chat to finish the conversation.
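The handoff logic described above can be sketched in a few lines. This is a hypothetical minimal version (the field names and the "say Agent to escalate" rule are illustrative assumptions, not any vendor's API): the bot records answers until the required fields are complete, or the customer explicitly asks for a person, and then the accumulated context is handed to the agent.

```python
# Minimal sketch of a bot-to-agent handoff in a text chat.
# The bot collects required fields; once they are complete (or the
# customer asks for a person), the gathered context passes to a
# human agent so the customer never has to repeat themselves.

REQUIRED_FIELDS = ["name", "order_id", "issue"]

def handle_message(session, field, value):
    """Record one answer; return 'bot' or 'agent' for who speaks next."""
    if value.strip().lower() == "agent":
        return "agent"                      # explicit escalation
    session[field] = value
    missing = [f for f in REQUIRED_FIELDS if f not in session]
    return "bot" if missing else "agent"    # hand off when complete

session = {}
assert handle_message(session, "name", "Ada") == "bot"
assert handle_message(session, "order_id", "A-1042") == "bot"
assert handle_message(session, "issue", "late delivery") == "agent"
# The agent joins with full context:
print(session)  # {'name': 'Ada', 'order_id': 'A-1042', 'issue': 'late delivery'}
```

In a production system the "session" would live in the contact-center platform and the handoff would route the transcript to the agent's console, but the shape of the logic is the same.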

#2 – Multimodal in Video Chat

Imagine a video chat between two ordinary humans, perhaps a business meeting. What would "multimodal" mean in this context? Beyond the typical features like screen sharing, whiteboarding, and file sharing, voice technology could take collaboration further: automatic recording and transcription would reduce the need to type noisily through the meeting.

Let’s take it a bit further.

What about building a speech interface into the video call itself? Imagine being able to mention a colleague's name and ask the conferencing tool to bring them into the meeting, or to text them a reminder that it has started.

Or perhaps, mid-discussion, you could say a trigger word like "Alexa" to pull additional material into the meeting on the spot: "Alexa, can you look on Google Drive for the document titled 'Quarterly Projections' and post it here?"
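The wake-word idea can be illustrated with a toy text version. Real systems detect the trigger in the audio stream before transcription, but scanning the transcript shows the basic shape: find the wake word, treat everything after it as a command for the assistant. The wake word and parsing rules here are assumptions for illustration.

```python
# Hypothetical sketch of wake-word handling over meeting transcripts.
# Scan each transcribed utterance for the trigger word and treat
# the words after it as a command for the assistant.

WAKE_WORD = "alexa"

def extract_command(utterance):
    """Return the command following the wake word, or None."""
    words = [w.strip(",.?!") for w in utterance.lower().split()]
    if WAKE_WORD not in words:
        return None
    idx = words.index(WAKE_WORD)
    return " ".join(words[idx + 1:]) or None

print(extract_command("Alexa, find the Quarterly Projections doc"))
print(extract_command("Let's review the numbers"))  # no wake word -> None
```

A production implementation would also need to suppress false triggers (someone merely mentioning the assistant's name) and route the extracted command to an intent parser.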

#3 – Video Chat with a Bot

Would you ever video chat with a bot? It's an intriguing question, and Conversational AI has reached a point where it is plausible. Several initiatives are already underway. I previously mentioned Ericsson's digital human initiative, which aims to create lifelike video avatars. In a commercial or customer service setting, the idea is that even when you know you're talking to a machine, you're more likely to find the experience pleasant if the bot presents itself as human-like as possible. Such lifelike digital humans could be wired up to your Conversational AI system to give the exchange a more human touch.

AI bots are already being demonstrated for gaming and virtual worlds. Kuki is billed as an AI brain built specifically to entertain, with an API that makes it easy to embed in chat and virtual world applications. The website also states that Kuki can be customized with different personas to fit your company's branding.

Although visually impressive, Kuki is unmistakably an avatar; no user would be led to believe it is a real person. That is ethically the right call: we should always know whether we are talking to a human or a machine. Kuki's current abilities would not let it pass a Turing test, but it is fascinating nonetheless, as the interview between Kuki and YouTuber Kaden shows.

#4 – Hyper Augmentation of Video

The Abridge demonstration mentioned earlier showed how much value hyper-augmenting a voice conversation can add to a doctor's appointment. The AI takes detailed, well-organized notes, lifting that burden from medical staff, a real benefit given how many of them are struggling with burnout.

The benefits extend to the patient and their family as well: automatic reminders tied to the appointment, and the ability to jump back to specific portions of the recording and hear the actual dialogue, which builds confidence that the AI captured the conversation correctly.

Much the same applies to video conferences, since they too consist of people talking. You could feed the audio track to an Abridge-like service and get the same features.

The future is Video and Voice – together

It's clear that voice-based user interfaces and Conversational AI will shape what comes next, whether in virtual worlds, in our smart devices and vehicles, or in mobile apps and websites.

Video will be a major component going forward. The pandemic-driven surge in live video chat has changed how people view remote work, and although we may be thrilled to travel and meet face-to-face again, video conferencing and remote work are here to stay.

Combine these shifts in how we interact with the anticipated rise of the metaverse, and it is easy to foresee Conversational AI woven into our day-to-day experiences through voice and video technology.
