Conversational AI — A dimensional shift in the way we hold conversations

“It’s important that AI not be the other, it must be us” — Elon Musk

The human history of conversation goes back hundreds of thousands of years, and evolution has lent the art of conversation many flavors along the way. What is happening now, however, is not just another turn in this evolutionary journey; it is a leap. We are talking about intelligent, natural conversation between a human and a machine.

Conversational AI technology conducts human-like interactions over different modalities: text, voice, visuals, or even advanced three-dimensional avatars. We see it applied in areas like intelligent voice assistants, customer service, gaming, smart displays, and voice technology in health and education, among many others.


Why Conversational AI

Why is the world rushing to master conversational AI? Data.

Data has made our computational systems more powerful, and us humans more inquisitive and hungry for even more of it.

A human is limited by memory, experience, exposure, and computational power. Reaching a specific piece of information in the ocean of available knowledge is no mean feat. Take the customer-service use case, for example.

Chatbots and IVRs have been doing a decent job of answering queries in a standard, rule-based pattern. But how often do we keep pressing ‘9’ on the keypad just to skip the IVR and reach customer-care personnel in flesh and blood?

In a world where IoT is becoming ever more mainstream, it cannot be fully effective without the power of Conversational AI. An AI that understands the ‘unstructured’ intent camouflaged in a voice command lifts that extra burden off our shoulders — whether we want to make a call or draft a message while driving, adjust the mood lighting of our apartment, or turn on the AC before we reach home.

Where have we come from and the way we are headed

The dream of Conversational AI has been around since the 1950s, when a program was designed that could identify 16 words of human speech; with further research, such systems progressed to recognizing around 1,000 words.

Until recently, we went only as far as breaking human speech or text into segments of sentences and building a decision tree out of them, with nodes representing nouns, verbs, adjectives, and filler words. On top of this tree structure, a rule-based model defines each node’s purpose and how the nodes relate to one another, and inferences are drawn from these correlations.

The limitation here is obvious: it is rule-based. It needs an extensive enumeration of the possible structures, or templates, of human speech. If anything falls outside this enumerated list, the program is bound to fail. This is the real reason why engagement with the likes of Alexa and Google Assistant falls flat after the initial excitement of watching a machine talk back to you.

Where we are headed is indicated by the winners of the ACM Turing Award over the past decade. It is interesting to note how these awards reflect the importance of Conversational AI and the speed at which innovation in the field is evolving.

The 2014 award was for innovations in modern database systems.

The 2017 award celebrated ground-breaking results in optimizing and speeding up our systems’ computing power, which we can now carry in our pockets.

The 2018 award was for breakthroughs made in the field of deep neural networks.

Put these pieces together and the bigger picture is revealed. The world’s tech community is striving to bring together the three pillars that let Conversational AI understand human language better: immense data sets; superior computing power to structure and mobilize this data; and the ability to extract useful inferences from it, storing the resulting intelligence recursively in a feedback loop.

Only then will Conversational AI be able to sustain engagement with humans.

What goes on behind the scenes

Natural Language Processing vs. Natural Language Understanding vs. Machine Learning

It happens in a layered manner. The first layer (Natural Language Processing) converts voice to text (Automatic Speech Recognition) and processes the text syntactically into nouns, verbs, segments of speech, grammar, and so on.

The second layer (Natural Language Understanding) interprets the output of the first layer semantically and returns broader, more abstract information: for example, intent, accent, sentiment, or a summary. A dialogue-management system then forms a response based on this information and renders it in a human-understandable form (Natural Language Generation).

This interaction provides a baseline model, which is then trained repeatedly over the vast ocean of available data. This is how a machine learns (Machine Learning) in a self-sustained way.
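To make the layering concrete, here is a toy Python sketch of the three layers. Every function, intent name, and lexicon here is a hypothetical stub, standing in for the ASR engines and trained models real systems use:

```python
# Toy sketch of the layered pipeline described above. Every function and
# lexicon is a hypothetical stub; real systems use trained models per layer.

def nlp_layer(text):
    """Syntactic layer: normalize and tokenize the transcribed utterance."""
    return [tok.strip(".,!?").lower() for tok in text.split()]

def nlu_layer(tokens):
    """Semantic layer: crude intent detection and sentiment from keyword lexicons."""
    intent = "weather_query" if "weather" in tokens else "unknown"
    positive, negative = {"great", "good", "love"}, {"bad", "terrible", "hate"}
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return {"intent": intent, "sentiment": score}

def nlg_layer(meaning):
    """Dialogue management + generation: turn the abstract meaning into a reply."""
    if meaning["intent"] == "weather_query":
        return "Let me check the weather for you."
    return "Sorry, I didn't catch that."

reply = nlg_layer(nlu_layer(nlp_layer("What's the weather like today?")))
```

Each layer hands a more abstract representation to the next, which is exactly the NLP-to-NLU-to-NLG hand-off described above, only with lookup tables where the learned models would sit.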

In a way, personal computers are also conversational AI, with a low degree of conversation restricted to a digital command and a digital response that solve a particular problem. As these systems grew more sophisticated, the objective shifted from solving a problem while sitting behind a computer to Ambient Computing: a setup of interacting devices that act as extensions of one another and are mostly self-operating, so that a user engages with them seamlessly. A human does not have to sit behind such systems to extract information; they work and engage with each other both independently and collaboratively.

Key Aspects of a memorable user experience

Dialogue Management should be handled in such a way that the interacting human retains most of the control to steer the conversation. The conversational agent should be able to switch context and then come back to the original thread, and it should be able to hold a proactive conversation, not merely a responsive one.

This cannot be achieved by feeding the conversational agent a predefined set of questions and answers. Google’s LaMDA, for instance, has been trained on millions upon millions of actual dialogues with the intent of learning the patterns and flow of human dialogue.
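The context-switching behaviour described above can be sketched as a simple context stack; the class, method, and topic names below are hypothetical:

```python
# Toy context stack for dialogue management (hypothetical class and topics):
# the user can digress mid-task and the agent resumes where it left off.

class DialogueManager:
    def __init__(self):
        self.contexts = []  # pending conversation topics, most recent last

    def start(self, topic):
        self.contexts.append(topic)

    def digress(self, topic):
        # User switches context; keep the interrupted topic underneath.
        self.contexts.append(topic)

    def finish(self):
        # Current topic is done; resume the previous one, if any.
        self.contexts.pop()
        return self.contexts[-1] if self.contexts else None

dm = DialogueManager()
dm.start("book_a_flight")        # main task
dm.digress("check_the_weather")  # user changes the subject mid-booking
resumed = dm.finish()            # weather handled; back to the flight
```

Production dialogue managers track far richer state (slots, history, confidence), but the stack discipline is the core of "switch context and come back."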

Conversation Design is being given special focus, as it is a key factor in “delighting” the customer, or at least not driving them to frustration. The role of the conversation designer has never been more crucial. The best practices a designer follows while designing conversations are -

Building the brand persona — If the conversational AI has no well-defined persona that gels with the brand’s category, the ambiance, and the time pressures it operates under, users interacting with the system will imagine a persona based on their own biases, one that may not portray the brand as intended.

Writing down sample dialogues between a user and the persona, and reading them aloud with someone. This simple act helps unravel unthought-of use cases.

Rapid re-prompt and graceful repair — A conversation designer scripts conversations with rapid re-prompts and graceful repair in mind, so that if the conversation breaks down because of an unexpected user response, it can be brought back on track and user trust and engagement re-established. For example, a conversational agent asks for your choice of coffee:

Prompt: Let’s begin. Would you like milk coffee or black coffee?
Response: I am a vegan
Rapid re-prompt: Sorry, was that Milk coffee or black coffee?
Response: What do you think a vegan would have?
Graceful Repair: I am sorry. At present, I can only work with the options of Milk coffee and black coffee. Would you like to try again?

Ideally, a really sophisticated conversational agent would have responded with black coffee the moment the human said they are vegan.
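The re-prompt and repair flow above can be sketched as a small turn handler. The option set and wording mirror the coffee script; the function itself is a hypothetical stub:

```python
# Toy turn handler for the coffee dialogue above: one rapid re-prompt,
# then a graceful repair. A hypothetical stub, not a real framework API.

OPTIONS = ("milk coffee", "black coffee")

def handle(utterance, attempt):
    """Return (reply, done) for one user turn; `attempt` counts failed turns."""
    text = utterance.lower()
    for option in OPTIONS:
        if option in text:
            return f"One {option} coming up!", True
    if attempt == 0:
        # Rapid re-prompt: briefly restate the options.
        return "Sorry, was that milk coffee or black coffee?", False
    # Graceful repair: admit the limitation and offer a restart.
    return ("I am sorry. At present, I can only work with the options of "
            "milk coffee and black coffee. Would you like to try again?"), False
```

Dialogue frameworks such as Dialogflow and Rasa ship this escalation pattern out of the box as fallback intents and policies, so in practice a designer scripts the wording rather than the control flow.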


Personalization — With highly sophisticated machine-learning algorithms, personalization is taken to the next level: the intelligent conversational agent does not just know your name; it has learned your interests, ethnicity, and typical behavior in a given context. It has learned the customer’s journey as a student, a professional, a parent, or even a spiritual seeker. Empowered with this “wisdom,” the conversational agent engages with the human.

Toolkit to design your own Conversational AI

If you are a team of developers with a grip on multiple programming languages who want to upskill in the world of machine learning, there is a plethora of open-source libraries ready to give you a first taste of building your own conversational AI. Some of them are listed below.

Natural Language Toolkit (NLTK), written in Python.

Gensim, written in Python and used for topic modeling.

spaCy, written in Cython.

CoreNLP, written in Java, with extensions for other languages.

TextBlob, built on top of the Natural Language Toolkit.

The Hugging Face NLP library offers the latest state-of-the-art models, such as Transformers, for developers to use. These language models help build high-performing NLP capabilities such as named-entity recognition, sentiment analysis, and summarization on top of them.
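These toolkits do the heavy lifting in practice, but a primitive many of them build on, turning text into vectors and comparing them, can be illustrated with a toy bag-of-words cosine similarity in plain Python (a sketch, not a substitute for any of the libraries above):

```python
# Toy bag-of-words similarity: a crude stand-in for the vectorization
# these toolkits provide in far more sophisticated, learned forms.
from collections import Counter
import math

def bow(text):
    """Bag-of-words vector: lowercased token counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

sim = cosine(bow("the cat sat"), bow("the cat ran"))  # shares 2 of 3 tokens
```

Modern libraries replace these count vectors with dense learned embeddings, which is what lets them see that "sat" and "ran" are both things a cat does.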

If your team is not yet skilled in machine learning, and you are at a stage where you would like to test the waters to see whether AI/ML will bring real value to your projects, prebuilt models can be used to generate a quick prototype. Google’s Natural Language API provides one such prebuilt model; it can be used along with the (also prebuilt) Text-to-Speech and Speech-to-Text models.

Future of Conversational AI

Conversational AI has come a long way since the 1950s-60s. It has achieved many feats, and yet scientists, developers, and researchers are constantly endeavoring to better it. There is a lot of scope for improvement and augmentation. Understanding human language is an immensely complex task after all.

NLU does not yet have the maturity to understand the demographic quirks in the way humans interact with AI. For example:

An Indian may tell the AI “Please clear off all my pending payments using my credit card”

An American might say “Pay those bad boys off. Use the card that kicks”
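A toy rule-based matcher (with hypothetical keyword rules) shows why this is hard: it handles the literal phrasing easily, catches the slangy version only because “pay” happens to appear, and fails outright on a pure idiom like “the card that kicks”:

```python
# Toy keyword intent matcher with hypothetical rules. Substring matching
# copes with literal phrasing but has no model of slang or idiom.
RULES = {
    "pay_bills": ("pay", "payment", "bill"),
    "check_balance": ("balance",),
}

def match_intent(utterance):
    text = utterance.lower()
    for intent, keywords in RULES.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return None  # no rule fired: the rule-based system is stuck
```

Closing that gap is precisely what the large learned models discussed below are for.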

To master context at a deeper level, we can foresee a great deal of effort being pumped into models that capture billions, even trillions, of parameters.

The COVID era has given rise to many use cases for Conversational AI. Augmented and virtual reality hold great promise in scenarios like conducting online meetings, offering companionship or counseling, or even educating children.

A lot of technological advancement is required to master sentiment analysis so that a conversational AI is more conversational than transactional. Empathizing with a human is necessary for the AI’s avatar as a digital companion; think of the storylines of movies like A.I. and, more recently, Her.

A lot of work is still needed on speech-recognition models so they can filter out noise and recognize a user’s speech accurately, even from the far field.
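As a flavor of the problem, a classic pre-neural building block is energy-based voice activity detection: flag the frames whose energy rises above a threshold. The sketch below uses a synthetic signal and assumed frame size and threshold; real far-field systems rely on beamforming microphone arrays and learned models instead:

```python
# Toy energy-based voice activity detection on a synthetic signal.
# Frame size, threshold, and the signal are assumed illustration values.
import math

def energy_vad(samples, frame_len=160, threshold=0.01):
    """Flag each frame whose mean-square energy exceeds the threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        flags.append(energy > threshold)
    return flags

# Two frames of near-silence followed by two frames of a 440 Hz "speech" tone.
signal = [0.001] * 320 + \
         [0.5 * math.sin(2 * math.pi * 440 * t / 8000) for t in range(320)]
```

A fixed threshold like this collapses the moment background noise is as loud as far-field speech, which is exactly why the problem still demands serious research.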

Personalization vs. privacy — While users are delighted by super-personalized experiences, concerns about privacy are rife. Technology has tried to answer these concerns with methods such as differential privacy and federated learning.
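Differential privacy, for instance, releases aggregate statistics with carefully calibrated noise. Here is a minimal sketch, with assumed epsilon and sensitivity values, of releasing a usage count with Laplace noise:

```python
# Minimal differential-privacy sketch: release a count with Laplace noise
# calibrated to sensitivity / epsilon. Parameter values are assumed.
import math
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) by inverting the CDF of a uniform draw."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon=1.0, sensitivity=1.0, rng=random):
    """Noisy count: useful in aggregate, while masking any single user."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)
```

The noise averages out over many queries, so the aggregate stays informative while no individual interaction can be pinned down; federated learning attacks the same tension from the other side by never centralizing raw user data at all.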

With so many digital assistants sprouting up to cater to different use cases, how powerful it would be to consolidate them and establish a communication channel between them. Imagine different NLUs talking to each other and producing a response with a high degree of personalization and contextual understanding, without the user having to interact separately with each assistant.



Mukta Bulsara