In this episode of Digital Conversations with Billy Bateman, Wolf Paulus, principal software engineer at Intuit, joins us. He goes over the challenges of using voice for banking authentication, managing user expectations, and creating a frictionless process that delights your users.
Guest: Wolf Paulus. At Intuit, Wolf is an internationally experienced technologist and innovator, accelerating the discovery and adoption of emerging technologies and changing the world one life at a time. He’s also an Adjunct Professor at Grossmont College, where he teaches an intermediate class in the Computer Science department. He was appointed to the advisory committee at the University of California, Irvine, and frequently speaks at conferences and user groups on topics ranging from Physical Computing to Emotional Prosody.
Billy: All right everyone, welcome to the show. Today I have the pleasure of being joined by Wolf Paulus, principal engineer at Intuit. Wolf, how are you doing today?
Wolf: Doing great, thanks for having me.
Billy: Yeah, thank you for joining us. Really excited to talk about your perspective on conversational interfaces and some of the work you’ve been doing at Intuit. But before we get into that just tell us a little bit about yourself and how you got into this industry.
Wolf: So yeah, I was born and raised in Germany. Got all my education over there, and then 23, 24 years ago I came to the US. I worked for a couple of companies, and the last nine years of my journey here have been with Intuit. You probably know Intuit as the company that provides you with tax filing solutions that allow you to file your taxes from home, which is a nice thing to have during these times. We also have QuickBooks, helping our small businesses with accounting. What got me into voice and conversational user interfaces early on, when I joined Intuit, was Siri. Siri came out maybe six months or so after I joined Intuit, and at that point we imagined a solution that would open up.
We thought that Apple would open up an ecosystem and let other companies like us play in this space. So we started imagining solutions where people could just talk to our software offerings. We got together with people from our business units. I’m very fortunate to work for a group that we call Intuit Technology Futures, meaning I don’t have to work specifically on one of the products. I’m free to look into tech, into solutions that may help or even hurt our products in years to come. So I’m free to experiment a little bit more than you typically can as an engineer.
Wolf: So I started looking into that, and Apple never opened up, so the first solutions that we built were on Android. Android still had a very crummy but working speech recognizer, and the speech synthesis worked okay. It was still robotic, but we could build prototypes around that. That got me started in the space of conversation and voice user interfaces specifically. We did experiments, we failed a lot, and we learned a lot at the beginning.
So one of our earliest learnings was that voice user interfaces are not great at data collection, which was one of the things that we really wanted to do. We wanted to let users use the phone for data collection, for our small businesses to enter their receipt data. That is a cumbersome process, so we wanted to help them. But we found out really early on that users hated solutions like that.
Billy: What do you think that was about? I used QuickBooks this morning, the mobile app, to put in receipts for our business here. Why do you think they didn’t like the voice entry on it?
The Ambiguity of Voice
Wolf: So first of all, thanks for being a customer. I think there’s too much ambiguity. It’s not that the speech recognition isn’t working; it got so good in the last couple of years. The same goes for speech synthesis, which has become so natural. I would love to talk about speech synthesis; making synthesized voices that really work well, that are natural and sometimes even funny, is one of my hobbies on the side. So the technologies that we build our solutions on really got so good. But if you think about it, if you were to enter an address using your voice, or an email address for instance, there’s so much ambiguity in that. You end up spelling it out for the speech recognizer to work well with that stuff.
Your friends’ names, their street addresses, all this is cumbersome and error-prone to enter with your voice. So it’s not really that it doesn’t work; it just takes longer than typing it in. One of the biggest benefits that voice user interfaces give us is frictionless access to information. Data collection, as I just mentioned, is not a frictionless process when you use your voice, so I think that’s why it doesn’t work.
Billy: Yeah, that would make sense. We were talking to Pulse Labs the other day, and they do testing for voice skills. I think voice is really good for accessing information, but you make a good point: when you’re recording it, there is some ambiguity there. As humans we like to see things written down; we’re like, okay, I see it there.
Wolf: Yes, and you would always have to ask for confirmation then, right? Is that what you entered? Is that the amount? We can make this work, obviously, but is that a good use case for voice? That was our learning early on: we can do it if we have to, but it’s not really what users are looking for. It’s not a delighter.
And that was kind of what got me going in this space. Again, we failed a lot at the beginning, but we got a lot of learnings out of that. If I may give you another example: it’s not only the frictionless way we can access information with an Alexa skill, for instance. It’s also that we should let the skill, or whatever you build (maybe you build an app yourself and don’t build on these mega platforms), do most of the talking. It shouldn’t be 50-50; I think the skill should do more of the talking compared to what you do.
So we see it everywhere, in the examples that people find helpful and delightful on these skills. Just look at the thing that everyone talks about using: “What’s the weather today?” Not only does it know where you are, it gives you the current information and a little bit of a preview. If you measured how long it takes me to ask for the information and how long the response takes, it’s overwhelming: for an overwhelming share of the time, these well-working skills do the talking. So that’s another learning: let the skill do the work.
Voice vs. Security
Billy: Yeah, I got you. Where have you guys found success using voice at Intuit then?
Wolf: Not all that much, actually. We are still what some people call nibbling on the edges. One of the reasons is what I just said: the benefit is frictionless access to information. What works against us is privacy and security. We are obligated to ask for a voice PIN, for instance; Amazon will not allow a skill on the public storefront if we don’t protect financial information with a voice PIN. So voice is now basically in competition with your phone. Does it take more time, does it take more actions, to get to information on the phone versus through voice?
So, when accessing financial information through voice, we have to balance convenience against security. That’s true for the phone as well, I think. We have fingerprint or Face ID to unlock the phone, and you swipe around a little bit to find the app that you want to open. Now you open your app, and then again you have to authenticate inside the app, probably using the same mechanisms. Then you swipe around in the app for a little bit until you find the place that you were looking for.
All that is not so different from when I say, “Alexa, open my banking skill,” and it asks, “What’s your voice PIN?” Then you give it the PIN, and then you have to utter the intent that takes you to the place where you find the information. So it’s very similar. We lose a little bit of the frictionless access to information because we have to protect it with the voice PIN. Most of these devices now have what’s called speaker identification, meaning they can differentiate whether it’s you or your wife or your kids.
But that’s not authentication. It would be nice if we had voice authentication, where a user would really be authenticated. We would know this is the person who owns the account, and then we could forgo all the initial authentication requests with the voice PIN.
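As a rough illustration of the flow Wolf describes (open the skill, give the voice PIN, then utter the intent), here is a minimal Python sketch. It does not use any Alexa SDK. The class name `BankingSkillSession`, the `GetBalance` intent, the PIN, and the balance text are all hypothetical, and speaker identification is treated only as a greeting hint, never as authentication.

```python
import hmac
from dataclasses import dataclass
from typing import Optional

# Hypothetical demo data; a real skill would never hard-code a PIN or a balance.
DEMO_PIN = "4711"
DEMO_BALANCE = "Your checking balance is 1,234 dollars."


@dataclass
class BankingSkillSession:
    """Toy session: financial intents stay locked until the voice PIN is given."""

    recognized_speaker: Optional[str] = None  # speaker identification: a hint only
    unlocked: bool = False

    def open(self) -> str:
        # Knowing who is probably speaking personalizes the greeting,
        # but it does not unlock anything.
        name = self.recognized_speaker or "there"
        return f"Hi {name}. What's your voice PIN?"

    def say_pin(self, spoken_pin: str) -> str:
        # Constant-time comparison so response timing does not leak PIN digits.
        self.unlocked = hmac.compare_digest(spoken_pin.encode(), DEMO_PIN.encode())
        if self.unlocked:
            return "Thanks, what would you like to know?"
        return "Sorry, that PIN is not correct."

    def ask(self, intent: str) -> str:
        if not self.unlocked:
            return "Please give me your voice PIN first."
        if intent == "GetBalance":
            return DEMO_BALANCE
        return "I can only help with your balance right now."


if __name__ == "__main__":
    session = BankingSkillSession(recognized_speaker="Billy")
    print(session.open())
    print(session.ask("GetBalance"))  # still locked at this point
    print(session.say_pin("4711"))
    print(session.ask("GetBalance"))
```

Even with a recognized speaker, the sketch still takes three turns before the user hears any information, which is the friction Wolf is pointing at. With real voice authentication, the PIN step could in principle be dropped.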
The Future of Voice Authentication
Billy: That would be… how far off is that? I’m sure eventually we’ll get there. What are your thoughts on how realistic voice authentication could be?
Wolf: There are a couple of companies that have solutions; Nuance comes to mind. You have all heard the “my voice is my password” kind of thing, and there are more companies like that. These solutions are currently applied mostly in the financial space. We know of a couple of brokers who have systems trained for all their employees. There’s a little bit of training involved, the same way you train your phone to recognize your fingerprint, the same way your phone has to learn what your face looks like. The training process is a little bit involved at the beginning, but the benefit comes with usage.
So we could imagine the same, and maybe the whole COVID crisis right now helps us with that a little bit. If I want to go and buy groceries, I have to have a face mask on, and touching surfaces is currently discouraged. So Touch ID and Face ID may not work all that well currently. Maybe that pushes voice biometrics in the direction where it becomes a part of these mega platforms.
Billy: So, let me ask you this. We build a lot of chatbots, and, you know, a chatbot is largely just a decision tree. You can do them well and you can do them not so well. We see both, and sometimes we build something, realize it’s not good, and go back and redo it. But when you do see it done well, what does that look like to you, where it’s like, oh, I actually like this chatbot?
Wolf: I like it when there’s a lot of free form. Then I still have the illusion that I’m in control of the conversation, that this is “ask me anything,” which is of course not true. You don’t ask a chatbot just anything. You don’t have time to waste, so it’s domain specific. When we say “ask me anything,” we mean ask me anything in the domain, within the restrictions of the expectations that I have. I think it’s super critical to set expectations right. And if we go back to Siri, how did they do it so well?
We all knew how Siri’s voice sounded before we ever interacted with her for the first time. And we already knew what we could ask her because we saw commercials, we saw it introduced at the WWDC conference, and so on. That’s what Apple does so well. They set expectations, and it was not a surprise to hear her voice. That’s what we have to do with chatbots as well. We have to set expectations. You have to say, this is the area where you can ask me things, and I can respond to these questions. So I think good chatbots are those that don’t surprise you too much, or that surprise you maybe by giving you a little bit more than you had expected.
What you don’t see a lot with chatbots, I think, is people ending the conversation with “thank you,” which they do with well-built Alexa skills. I think that’s one of the measurements: do you get a thank-you at the end?
Billy: Yeah, that is a great point. I’m totally on board with you on managing expectations for what this bot can do and what it cannot. Get people familiar with it. If you don’t manage the expectations, your bot is going to fail, even if it’s as simple as a bot on a website that qualifies people for the sales team. If you don’t say, “Hey, I’m going to ask you three questions,” or two questions, or whatever it is, people are like, okay, when does this end?
They may bail when they were almost there and could have had the bot do what they wanted, but they didn’t know what to expect. And ending the conversation with a thank-you, that’s spot on.
Wolf: We see different types of bots, right? One is what I often call the bouncer. The bouncer is the guy who just sits there as a kind of firewall, shielding the human representative from too much work, basically. You can also call it the filter, or whatever you want to call it, but we have to be very clear from the beginning and tell people that: you’re talking to a chatbot, and this is not a route to get to a live representative.
If that’s what the bot is doing, that’s the expectation. Then I know this bot can only help me with things that are maybe more FAQ-like, not with a real technical problem or anything where I know the bot will not be able to help me. Or if I want to negotiate, which maybe is what people currently really do: they call the cable company to negotiate a better rate.
That’s probably not what your chatbot is helping you with. So we’re very open and upfront about what the role of the chatbot is. Is it a bouncer, or is it just something that’s collecting information to help the representative that you will talk to later, so they have all the information already available?
Billy: Yeah, that’s a great point, just managing those expectations. Whether it’s a bouncer or filter bot, or it’s “let me get some information before I give you somebody,” just be clear about what the bot actually does. The biggest mistake we’ve seen with our customers when we come in is that they’ve already got some bots, and the expectation is not clear what these bots will do. Nobody’s interacting. Nobody’s doing anything with them, and they ask us for help. The number one thing is, okay, let’s set the expectation: “Here’s what I can do and what I can’t do.”
The Future of Voice
Billy: Awesome, man. Well, we’re running out of time, but I do want to ask you: what are some areas where we’re not using conversational interfaces, whether it’s, you know, a type form or voice, that you think are coming in the future?
Wolf: Well, I still think that what we need to do first is get more people familiar with voice user interfaces in our domains. If you ask the general population, “Do you feel comfortable dealing with financial topics, for instance, on a smart speaker?” the majority would say they are not really comfortable with that. However, look at the current user base, which is really large. We cannot just call Amazon Echo users freaks, or even early adopters.
A really large part of the US population has at least one smart speaker in their home. If you ask those people, they are really comfortable talking about their personal finances on these speakers. So the people who have already adopted, they are there. But to really get production, we have to first get even more people on board with these technologies. And for that, like we said earlier, you have to be able to delight them with frictionless access to information.
So that’s what I still see as a milestone that we have to tackle. And data collection, like we said, will probably never be a good use case.
Billy: Yeah, I mean, we’ve all got phones in our pockets, so if you need to record something, you can take a picture of it. That’s so good now: with some tools you can take a picture of a receipt and it’ll pull all the information right out for you. Why do I need to regurgitate that? Well, awesome, Wolf. Thank you so much for the time. If people want to reach out to you and continue the conversation, what’s the best way for them to contact you?
Wolf: Well, just enter my name, Wolf Paulus, into Google. You’ll find me on, I don’t know, LinkedIn, and you can just add me. I also have, of course, a web page where I frequently post things around voice, and it’s quite technical most of the time, actually: wolfpaulus.com.
Billy: Okay, thank you, Wolf, and we’ll chat later.
Wolf: All right, thank you so much. Thanks for having me, that was fun.
Billy: Yeah, it was.