How to Design Awesome Voice Interfaces

Since Siri’s debut on the iPhone in 2011, its name has become synonymous with voice assistance. Its introduction has led to voice assistants becoming a standard feature on virtually every smartphone manufactured after 2018. Furthermore, voice assistant technology has branched into a hardware category of its own, with popular smart speakers like the Google Home, HomePod, and Amazon Echo (powered by Alexa) in the mix.

Those voice assistants are best known for their functionality and intelligent UX/UI design. The popularity of voice assistance keeps growing, with forecasts suggesting that controlling internet-connected devices and services by voice will remain popular in the future.

The digital design sphere has always focused on the visual, engaging users through visual perception with superb UX and graphical user interfaces. But while visual design principles like the Gestalt laws are well known in the design sphere, voice-activated objects have no graphical interface to apply them to. So what principles should a UI/UX design firm use to craft better voice interaction design? Let’s talk about this new and fascinating topic in more depth.

Talking and Communicating

Graphical interfaces rely on vision and semiotics, but vocal interfaces have a wider range of aspects to address, including linguistics, semantics, and pragmatics. Therefore, building a voice user interface (VUI) without understanding how humans converse with one another and the particular intricacies of their speech is like developing a graphical interface without any concept of visual perception or of the Gestalt laws.

The Maxims of Grice

Herbert Paul Grice, one of the leading theorists of meaning and communication, established four maxims by which we can understand conversations between individuals.

  • Maxim of Quality: Only say what you believe to be true.
  • Maxim of Quantity: Give as much information as required, without saying too little or too much.
  • Maxim of Relation: Say things relevant to the topic being discussed.
  • Maxim of Manner: Make sure that speech is clear and unambiguous.

All of the above maxims rest on the cooperative principle: your conversational contribution should be what is required at the moment it occurs, given the accepted purpose or direction of the verbal exchange that is taking place.

Conversational Structure

It’s easiest to think of these maxims the same way one would think of the Gestalt principles. Essentially, they form the cognitive foundation on which we base our ability to communicate and interpret the other party’s meaning and intent.

The maxims are not absolute either, as they can be deliberately flouted or ignored in some instances while observed in others. By doing so, the speaker produces communicative effects that differ from the conventional one. The visual equivalent of this behavior is the optical illusion, where our brain applies the Gestalt rules and misreads what we see. In speech, this covers things said ironically or metaphorically. Telling someone they are “a lion” clearly isn’t meant to be a literal statement. However, the phrase still carries an understood meaning because there are attributes people immediately associate with lions.

But the maxims do not rule our communication alone. Speech acquires content and meaning based on other elements, including:

  • The cultural, physical, social, or psychological context in which the conversation occurs.
  • Noise that can pollute the communicated message. For instance, physical noise like a crowd’s murmur can make communication challenging. Noise can also be psychological, as when a person’s state of mind or mood leads them to interpret the same message differently at different times. It can also be cultural, such as when an expression means something different to individuals from different cultural backgrounds.
  • Communication can also be non-verbal, as gestures, touch, and motion accompany speech, and para-verbal, carried by vocal tone and inflection.

In voice interface design, transparent and respectful communication is paramount, so we must abide by Grice’s maxims as closely as possible.


Designers are visually oriented, so they tend to compare vocal interfaces to visual ones. There are, however, some notable differences to consider in these comparisons.

Straight Flow

A graphical user interface is based on redundancy. In other words, the screens can be arranged like a tree, with branching structures of varying complexity that let the user navigate between screens by following visual indicators. Users can also observe processes and their current state at any time while moving around a set hierarchy, because multiple pieces of content are presented simultaneously.

But vocal interfaces have almost no visual aids, so navigation and processing rely on conversation. Therefore, VUI user flows must be linear. Users progress through the vocal process to each following state only through the trigger that led them there. Because the system responds to verbal commands, the user reaches the system and its applications directly rather than by navigating through intermediate screens.
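The linear flow described above can be sketched as a small state machine. This is only an illustration under stated assumptions: the timer scenario, the `STEPS` list, and the `LinearFlow` class are hypothetical names invented for this example, not part of any real assistant.

```python
from dataclasses import dataclass, field

# Hypothetical linear flow for setting a timer. The step names and prompts
# are illustrative assumptions, not taken from any real voice assistant.
STEPS = [
    ("duration", "How long should the timer run?"),
    ("label", "What should I call it?"),
    ("confirm", "Shall I start it?"),
]


@dataclass
class LinearFlow:
    """Walks through STEPS strictly in order, one spoken answer at a time."""

    position: int = 0
    answers: dict = field(default_factory=dict)

    def prompt(self):
        """Return the prompt for the current step, or None when finished."""
        if self.position >= len(STEPS):
            return None
        return STEPS[self.position][1]

    def answer(self, utterance):
        """Store the answer for the current step and return the next prompt."""
        if self.position >= len(STEPS):
            raise RuntimeError("The flow has already finished.")
        name, _ = STEPS[self.position]
        self.answers[name] = utterance.strip()
        self.position += 1
        return self.prompt()
```

Unlike a screen, the flow exposes only one prompt at a time, and the only way to reach a later step is to answer the one before it.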

Lack of Screens

It helps to think of both GUIs and VUIs as a driver steering a vehicle. In other words, the user is interacting with a complex mechanical or technological system through simple acts. But while a GUI presents itself visually, allowing a user to interact with on-screen elements such as text, forms, tabs, styles, buttons, and images by touch or mouse, a VUI has only one manipulable component: voice. The application listens to the speech, matches the vocal input to a particular type of output, and communicates it vocally back to the user.

Intent and Expression

Voice interface components are invisible and intangible. Instead of relying on labels, buttons, forms, and icons as in a visual setting, a user must compose sentences on the spot, drawing on short-term memory. Inputs therefore vary from request to request. Rather than pushing a button to represent the action of “play,” the user must request the action through a vocal command, and the commands will vary based on the user. A user can ask Siri to “sing me…” a song, to “start this song,” or even to “Play the song…”. All are different commands, but they share the same intent, and a VUI must be prepared to act on them accurately.
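As a rough sketch of how differently worded commands can resolve to one intent, the example below matches utterances against a few hand-written patterns. The `play_song` intent name, the patterns, and the `resolve_intent` function are assumptions made for illustration; a production VUI would more likely rely on a trained language-understanding model than on rules like these.

```python
import re

# Hypothetical phrase patterns mapping several wordings to a single intent.
# These rules are illustrative only and cover just the examples in the text.
INTENT_PATTERNS = {
    "play_song": [
        r"^sing me (?P<song>.+)$",
        r"^start (?:the song )?(?P<song>.+)$",
        r"^play (?:the song )?(?P<song>.+)$",
    ],
}


def resolve_intent(utterance):
    """Return (intent, song) for a recognized utterance, else (None, None)."""
    # Normalize case, surrounding whitespace, and trailing punctuation.
    text = utterance.lower().strip().rstrip(".!?")
    for intent, patterns in INTENT_PATTERNS.items():
        for pattern in patterns:
            match = re.match(pattern, text)
            if match:
                return intent, match.group("song")
    return None, None
```

For instance, `resolve_intent("Sing me Yesterday")` and `resolve_intent("Play the song Yesterday.")` both resolve to the same `play_song` intent, while an unrelated request returns `(None, None)` so the interface can ask the user to rephrase.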

Nielsen Heuristics

All ten of Nielsen’s usability heuristics remain valid for both GUIs and VUIs. Still, it is especially pertinent to consider them when there is no visual support, as in auditory interaction with technology.

Wrapping Up

As computing power, machine learning, and artificial intelligence become more capable of synthesizing speech and understanding conversation, VUIs will continue their ascent in popularity. The need for vocal interfaces will likely spread across industries as people increasingly look for human-friendly voices to help them interact with new technology.
