Humor in multilingual digital assistants

Originally published on GALA's website

A sense of humor is a distinctly human trait (although perhaps not a uniquely human one, it seems) and quite a complex one. As Paul McDonald states in The Philosophy of Humour:

“The fact that even a simple joke uses simultaneously language skills, theory-of-mind, symbolism, abstract thinking, and social perception, makes humor arguably the most complex cognitive attribute humankind may have” 

It shouldn't come as a surprise, then, that humans expect humor to be present, and understood, even in interactions with virtual beings.

"Virtual being" is a broader, rather new term that refers to a character that doesn't exist in reality but can interact with humans through digital means. An example is Mica, Magic Leap's virtual being in augmented reality, which can communicate with a viewer through the company's augmented reality glasses.

This kind of interaction is still not mainstream, as the technology is not as accessible as, say, an Amazon Echo device. Therefore, in this article we will discuss humor in digital assistants and conversational agents.

Examples of digital assistants, or intelligent personal assistants, are Siri, Cortana, Alexa and Google Assistant.

A conversational agent is any dialogue system that uses natural language understanding (NLU) and natural language processing (NLP) to maintain human-like conversations.
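As a rough illustration of what such a dialogue system does, here is a minimal, purely hypothetical sketch (all intent names, keywords and responses are invented): an NLU step maps the user's utterance to an intent, and a dialogue-management step picks a response.

```python
# Minimal, hypothetical sketch of one conversational-agent turn:
# toy NLU maps an utterance to an intent; dialogue management picks a reply.

INTENTS = {
    "greeting": {"hello", "hi", "hey"},
    "weather": {"weather", "rain", "sunny", "forecast"},
}

RESPONSES = {
    "greeting": "Hello! How can I help?",
    "weather": "Let me check the forecast for you.",
    "fallback": "Sorry, I didn't catch that.",
}

def understand(utterance: str) -> str:
    """Toy NLU: match keywords to an intent, else fall back."""
    words = set(utterance.lower().split())
    for intent, keywords in INTENTS.items():
        if words & keywords:
            return intent
    return "fallback"

def respond(utterance: str) -> str:
    """One dialogue turn: understand, then answer."""
    return RESPONSES[understand(utterance)]
```

Real systems replace the keyword matching with statistical intent classifiers and slot filling, but the turn cycle is the same.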

To understand how humor works in these contexts, we first have to understand what humor is.

What is humor? 

According to the dictionary, humor is "the quality in something that makes it funny; the ability to laugh at things that are funny."

Similarly, the sense of humor is "the ability to see the funny side of life."

Does this explain what humor is? The truth is, humans have tried to explain and theorize humor since antiquity. 

Some of the most common theories on humor are: superiority theory, relief theory, and incongruity theory. 

The superiority theory dates back to Plato and Aristotle and postulates that people find humor in, and laugh at, earlier versions of themselves and the misfortunes of others, because this makes them feel superior. That would probably explain why we laugh at someone stumbling, or falling…

The relief theory, associated with Herbert Spencer and Sigmund Freud, posits that laughter is a homeostatic mechanism that allows people to relieve "nervous energy." This explains why jokes on taboo topics can make us laugh: the energy invested in suppressing "inappropriate" emotions is released as laughter.

The incongruity theory, which arose in the 18th century, states that people laugh when they find an incongruity between expectations and reality, something that "violates our mental patterns and expectations."

Humor can be conveyed by words, images or actions, but here we will of course focus exclusively on verbally expressed humor.

Humor in translation

Translating verbally expressed humor is an extremely hard task, the main constraint being to convey the same concept in different languages and for different cultures.

Delia Chiaro has treated the subject extensively in the Primer of Humor Research.

Prof. Chiaro explains that, when dealing with the translation of puns, which are notoriously untranslatable, the translator must come to some sort of compromise:

"As long as the TT serves the same function as the ST, it is of little importance if the TT has to depart somewhat in formal terms from the original. Some feature of the ST is lost in exchange for a gain in the TL" (ST: source text; TT: target text; TL: target language).

How much more complicated does it become when humor involves human-computer interaction? 

Humor in digital assistants 

Humor in human-computer interaction is the object of study of computational humor, a rather new branch of computational linguistics.

Reasons to use humor in virtual agents include engaging and entertaining users and mitigating performance limitations: if the virtual agent doesn't understand, a bit of humor might help the user feel less frustrated.
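That mitigation idea can be sketched as a simple fallback strategy: when the agent's confidence in the recognized intent is low, it answers with a light joke instead of a bare error message. Everything below (the threshold value, the joke texts, the function name) is an invented illustration, not any vendor's actual behavior.

```python
import random

# Hypothetical fallback strategy: below a confidence threshold,
# soften the failure with a light joke instead of a bare error.

FALLBACK_JOKES = [
    "I didn't get that, but in my defense, I was built on a Friday.",
    "My ears are only microphones. Could you say that again?",
]

CONFIDENCE_THRESHOLD = 0.6  # invented cutoff for the demo

def pick_response(intent: str, confidence: float, rng=random) -> str:
    """Route confident intents to a handler; otherwise tell a joke."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return f"handling:{intent}"
    return rng.choice(FALLBACK_JOKES)
```

The design choice here is that humor is reserved for failure states, where it has the most to gain and the least to interrupt.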

Many experiments have been conducted around this topic. Brent Rose, writer and host of the successful WIRED video series Out of Office with Brent Rose, performed a stand-up comedy routine using only jokes from Siri, Alexa, Cortana and Google Home. The results showed that Cortana and Siri were the funniest of the four assistants.

The video is definitely worth watching; I cannot imagine a better way to demonstrate how bad the majority of the jokes told by digital assistants are.

I think it also proved, in a way, that humor is not only about the written text: the jokes are written by humans, not by AI, so if the text were all that mattered they would be funny, yet most of them aren't.

Another study worth mentioning is the survey Humor in Human-Computer Interaction: A Short Survey. According to it, "users rated significantly better the system that gave humorous comments in task-oriented interactions and overall an improved perception of systems qualities."

Humor gives the digital assistant a human touch and creates a more likable experience overall: digital assistants with a sense of humor stand a better chance of being liked, as they are perceived as more human.

The survey also found that social conversations increased by up to 50% when a virtual agent used jokes in interactions with human users.

Humor is important in human-computer interaction just as it is in human-human interaction: studies on humor in human-computer interaction actually show beneficial effects similar to those found in human-human interaction.

The constraints 

Certainly one of the biggest problems is the unavailability of intelligent content frameworks in all languages and the lack of a multilingual dataset for humor.

There are some interesting projects, though, such as UR-FUNNY: A Multimodal Language Dataset for Understanding Humor, which aims at understanding humor in a multimodal manner, "through the usage of words (text), gestures (vision) and prosodic cues (acoustic)."

As far as voice is concerned, a big constraint is that automatic speech recognition (ASR) often struggles with accents, dialects, slang, unclear speech, etc. This makes it very hard to hold pleasant conversations with a digital assistant and often leads to frustration.

The Tools 

What are the tools of the trade? What we should be looking for are more open datasets for humor detection, available in more languages.

Two of the most promising projects I’ve found are: 

Leyzer: A Dataset for Multilingual Virtual Assistants, "designed to study multilingual and cross-lingual NLU models and localization strategies in VAs."

The paper was published in 2020, and although the results in terms of intent accuracy seem unsatisfactory, they may set the baseline for further work; hopefully, the dataset will be extended to more languages in the future.

The Leyzer dataset, the translation memories and the detailed experiment results are available here.

Another interesting project is Large Dataset and Language Model Fun-Tuning for Humor Recognition by Vladislav Blinov, Valeriia Bolotova-Baranova, and Pavel Braslavski.

The authors have created a publicly available dataset for humor recognition in Russian that consists of more than 300,000 short texts in total (only half of them being funny).

The authors found that most of the available humor-related datasets are in English only, are relatively small, and focus primarily on puns, neglecting other forms of humor.

They implemented a humor detection method based on universal language model fine-tuning (ULMFiT). This method is purely data-driven and has proven to generalize well, yielding overall positive results.
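Their ULMFiT model is far beyond a short snippet, but the underlying task, binary funny/not-funny text classification, can be illustrated with a much simpler stand-in: a word-count Naive Bayes classifier in plain Python. This is NOT the authors' method, and the tiny training set below is invented for the demo; it says nothing about the real dataset.

```python
import math
from collections import Counter

# Purely illustrative baseline for binary humor recognition:
# word-count Naive Bayes with Laplace smoothing (not the paper's ULMFiT model).

def train(examples):
    """examples: list of (text, label) pairs, label in {'funny', 'plain'}."""
    counts = {"funny": Counter(), "plain": Counter()}
    docs = Counter()
    for text, label in examples:
        docs[label] += 1
        counts[label].update(text.lower().split())
    return counts, docs

def classify(text, counts, docs):
    """Pick the label maximizing log P(label) + sum of log P(word|label)."""
    vocab = set(counts["funny"]) | set(counts["plain"])
    best_label, best_score = None, -math.inf
    for label in counts:
        score = math.log(docs[label] / sum(docs.values()))
        total = sum(counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / total)  # add-one smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A real humor recognizer would learn from hundreds of thousands of examples and capture context a bag of words cannot, which is exactly why the authors reach for a fine-tuned language model instead.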

The authors plan to reproduce the experiment with English data and hopefully there will be more similar projects in the future. 

The dataset is available here, whereas the paper can be consulted here.

Conclusions

We have explained what humor is in theory and shown some examples of applications in human-computer interaction, specifically in digital assistants. 

However, humor is hardly something that can be fully explained in theory, which is why it is so difficult to reproduce: two different people might tell the same joke, yet one might be funnier than the other.

In my experience, the funniest conversational agents, or digital beings, are those that have a strong personality. 

These characters don't just tell ready-made jokes, but are scripted to respond with natural, human humor.

One example is AnnA the bot. She’s bright and witty and will make you laugh. 

While the creation of humor in multilingual digital assistants ultimately relies on the availability of large multilingual datasets, we can encourage the creation of small, local, monolingual digital assistants with humor, which might in the future contribute to the creation of larger datasets and frameworks.

Bibliography

What is humor: 

What’s So Funny? The Science of Why We Laugh

Philosophy of Humor

Humor in human-computer interaction

Humor in Human-Computer Interaction: A Short Survey – Anton Nijholt, Andreea I. Niculescu, Alessandro Valitutti, Rafael E. Banchs 

“Let’s Be Serious and Have a Laugh: Can Humor Support Cooperation with a Virtual Agent?”  Philipp Kulms, Stefan Kopp, Nicole C. Krämer 

Voice First Sucks

Yup, those were the exact words that I recently heard uttered in a mild miff of refreshing biting pique by someone who seemed to have had it with Voice First.

I know it was biting pique because the person who spoke those words was, is, has been, and I know will continue to be, someone who is heavily vested in Voice First, loves Voice First, has had, and will continue to have — in spite of the long, cold winter of discontent that Voice First is enduring — high hopes for Voice First. And I say that it was refreshing because hearing someone deviate from the oppressively relentless cheerleading and instead engage earnestly to get to the truth is a rare occasion worth noting and celebrating.

But reality is reality, and the reality that this Voice First believer was facing was a harsh one: Voice First was not delivering on the promise everyone thought it held when it declared itself a revolution a few years ago, and they had had enough. It was time to roll up our sleeves, get real, sober up and get to the bottom of things.

'Beyond the weather, time, and the occasional timer and alarm,' they mused out loud, 'am I myself really using my Amazon Alexa and Google Assistant that much in my life? I mean, really and honestly, am I? No, actually, not really…. So, if I am not using them that much and yet I am such a believer in Voice First, what hope is there for the rest of the world?'

I, a veteran of sorts in the space — we veterans call the space "Speech" — felt for this young person. Yes, whether you want to call it "Voice First" or "Speech" or something else, the space is and has always been a tricky bitch. Nothing comes easy with Voice First. Nothing will fall into your lap. No one will embrace anything you build just because. Unlike most other fields, you have to work really, really — really — hard to get to value, let alone monetizable value. Sure, you need to work really hard to deliver value in anything that is worthwhile — that's a basic reality. But here is another basic reality: Voice First is a beast of an altogether different kind. You have to work really, really — really — hard to get to value.

This reality about the space and its technologies was as much of a basic fact in 1991, 2001, 2011, as it is in 2021 — Apple Siri, Amazon Alexa, Google Assistant, and Samsung Bixby and their respective tantalizing human language technologies notwithstanding.

Fact: unlike other interactive modalities (visual, tactile, textual, olfactory, haptic, multimodal), Conversational Voice First is spartan and unforgiving. It's sound made, sound received, sound consumed, and sound made again in a time-laced, unrelenting back and forth.

Fact: people usually don't use Conversational Voice First to enjoy the experience or to admire color schemes or to get a warm-and-fuzzy. There is nothing to look at, nothing to behold, nothing to gawk at, nothing persisting that you can engage and feel. Instead, Voice First is an anxiety-inducing running race — a race against time and patience. It's ephemeral, serial; it needs you to be cognitively engaged, focused, listening; it is not interested in letting you multitask, pause, take a break, wander off in your thoughts; it messes around with your breathing, taxes your memory, forces you to use your vocal cords, maybe even to shatter a beautiful silence. Voice First is greedy, possessive, narcissistic: it certainly won't let you say something to someone else when you are engaged with it. Voice First wants to control your breath; it demands that you enunciate, articulate, that you come alive and speak loud enough; and it insists that you listen attentively, that you be patient, that you repeat yourself when you are asked to repeat yourself.

Voice First needs you fully vested. Otherwise: Voice First will gladly and pitilessly swallow your time, and it will happily do it over and over again.

And so, if Voice First must be used, the human user shoots back, Voice First had better deliver something concrete, tangible; something worth the tedious while. ‘You want me to talk to the thing? Ok. Sure. No problem. I will talk to it. But only if you give me something back in return that makes my life easier. Otherwise, thank you very much, because I do have a life to live.’

Unlike other modalities, Voice First has no room for getting you to buy because the thing looks cool, or smells sweet, or is heavenly to the touch, or makes you look High Church by association.

Unlike other modalities, Voice First has no room for bamboozlement. Voice First demands the simple truth, the bottom line; it needs you to cut to the chase and it gives you no room to hide and no chance to throw sand in people's eyes. (Quite the interface for the general Zeitgeist of the times.)

So: does Voice First suck?

No. Voice First does not suck — and has never sucked.

Voice First is powerful because it is ephemeral, temporal, invisible; because it demands your attention, requires that you speak up, that you be constantly present and not wander off, that you engage in a focused way. You want an interface that lets you space out once in a while. Sure, no problem, and no one is judging. You have your pick of interfaces. Just stay away from Voice First. And if you do use Voice First and then you wander off during the interaction, don’t blame the interface — and don’t blame yourself either. Blame him or her who decided that it was OK to have you engage a Voice First interface in a use case where you were likely to wander off.

Here’s a concrete example of working with rather than against Voice First. Voice First is ephemeral, temporal, invisible, demands the user’s attention, requires them to speak up and to focus, insists that they be present and not wander off, that they remain engaged in a focused way? Ok. What use case demands all of that?

Can any interface other than Conversational Voice First deliver the above experience? The answer is no. No other interface is as temporal, invisible, demands your complete attention, requires you to speak up, to focus, and insists that you not wander off and that you remain engaged in a focused way the way that Conversational Voice First does. Sure, you can try to emulate the above via visual/textual/tactile, but you will end up with a poor man’s version of the elegant, back and forth, urgent, let’s keep-moving-and-learning, conversational interface above.

In the spirit of getting real, I call on the Voice First community to do two things:

First and foremost, I call on them to take the time to really understand the Voice First interface. I truly and sincerely fear that many people (though certainly not all) in the Voice First space still do not fully understand the interface. Oftentimes, in my conversations, it is clear to me that many really do believe that Voice First is a stripped down version of other, “richer” interfaces; that Voice First is lacking, poor, constrained and constraining, meager — a poor man’s version of something much more. I am referring for instance to those who seem to believe that multi-modality is “the next step” (someone actually used that exact phrase!) for Voice First, an upgrade, rather than a completely separate interface with its own pluses and minuses, and to which Voice First can be its superior, hands down, given the right use case. And who do I blame for this misguided notion? Well, to be frank, I will squarely lay the blame at the feet of the folks at Amazon and Google who clearly have come to look down on the voice-centric experience and to declare it as inherently lacking. As in: ‘Yeah, so, we now have visual devices, you see, so we will need you to like show something visual when the skill/action launches, and we need what you show to look good. This you will have to do or we can’t certify your skill/action.’ (All true stories here.)

And second, let’s begin compiling examples of actual use cases where the value of the Conversational Voice First interface is clear and compelling. Let’s do this for at least three reasons: (1) So that we get the point across that the power and the value of an interface are not inherent in the interface itself but are a function of the fit between the interface and the use case; (2) So that we can learn that Step One to success is pinpointing use cases where the “weaknesses” of Voice First are in fact the exact features that we need for the given use cases, and (3) So that we keep hope alive and lift the community’s morale by pointing to actual built experiences and not mere yearnings or seemingly unfulfilled and unfulfillable aspirations.

So here’s the document. Feel free to email me your use cases and I will add them.


Ahmed Bouzid, previously Head of Product at Amazon Alexa, is Founder and CEO of Witlingo, Inc., a McLean, VA-based B2B SaaS company that helps brands launch Voice First solutions and experiences on platforms such as Amazon Alexa, Google Assistant, Samsung Bixby, and beyond.

In addition to inviting you to post your comments, Ahmed also invites you to post something about yourself in The Voice First channel here: www.witlingo.com/voicefirst

How to run a local #VoiceLunch?

Why talk about voice locally?

While the global VoiceLunch is an excellent opportunity to discuss generic voice topics and meet fellow voice enthusiasts from all over the world, sometimes you really want to get down to how things are for your part of the world, in your specific language, or for your particular market.

That's why we set up local VoiceLunches: separate slots throughout the week where local voice communities meet.

Right now we are active in The Netherlands, UK, US, Japan, India, Brazil, France and Italy.

How to apply to host a local meeting?

Interested in hosting a local VoiceLunch? Please send an email to hello@voicelunch.com.

Identity

We encourage local VoiceLunch hosts to help shape the community's own voice. That's why we don't want to enforce strict rules.

However, we do wish to point out that the character of a local VoiceLunch should be truly local. Some questions that you can use as a guideline:

  • What language should we speak? If it’s not your local language, are you sure that everybody there is comfortable enough to speak English?
  • What topics should we address? What are truly local ‘hot topics’? Do we overlap with global VoiceLunches?

Social media accounts

We would love it if you announce your local VoiceLunch on your social media! Please use the brand guidelines and templates provided by HQ.

Tips and tricks for having a great online meeting

Online meetups are slightly different beasts than real-life events. Share these tips with your community to get your online meetup off to a great start!

Practical points

  • Mute your mic when you're not speaking.
  • When you want to make a contribution, raise your hand or wait for an opportunity to speak. 

Online meetups: extra awareness

When online, we miss a lot of non-verbal cues that we usually have available, so natural turn-taking can be a bit awkward. Be extra aware of whether another person has actually finished speaking before you start. If not sure, ask for confirmation. 

Active listening and asking questions

Active listening and asking questions are great ways to deepen a conversation:

  • Ask yourself: 'Do I truly understand what the other person is saying? If not, what kind of questions could I ask to help them elaborate?'
  • Ask yourself: ‘What can I bring to this discussion? Which perspectives can I open that contribute in making this discussion even more valuable?’
  • Ask yourself: 'Do I really need to take this turn, raise this topic, or make this point now? Is this the right time to do so? Or can we elaborate on the topic under discussion?'
  • Ask yourself: ‘Can I ask a question, rather than make a statement?’