Almost 20 years ago, I read a paper by some HCI folks who wanted to test how good the Star Trek computer interface would be. They simulated it by having a series of tests where either a human used a computer directly or a human asked another human to use a computer. They timed the tasks and subtracted the time that it took for the human who was pretending to be the computer to interact with the computer. In the majority of cases, the human using a GUI or CLI was faster. Humans evolved to interact with the universe with our hands long before we developed speech, so this wasn’t surprising. The only tasks where the voice interface was best were the ones that required a lot of creativity and interpretation by the user and even then they often required a lot of iteration because humans are really bad at asking for exactly what we want in an unambiguous format.
I think augmented reality with haptic feedback, done well, would be far more of a revolution in how we interact with computers than any natural language model.
Maybe I’m understanding it wrong, but that’s not a surprising outcome, and it’s also not really testing Star Trek computers?
Human -> Human -> Computer
is a longer path than
Human -> Computer
So why would it be faster?
Perhaps this is the key point?
Don’t know how I missed that. Oops. Thanks! :)
Because they timed the second human -> computer step and subtracted that from the total time.
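With made-up numbers (purely illustrative, not figures from the paper), the bookkeeping looks something like this:

```python
# All numbers invented for illustration; the point is only the subtraction.
total_voice_condition = 95.0   # seconds: user speaks, the "computer" human performs the task
operator_machine_time = 40.0   # seconds the pretend-computer spent driving the real computer
voice_interface_time = total_voice_condition - operator_machine_time  # 55.0 s charged to the voice UI

direct_time = 45.0             # seconds for a user driving the GUI/CLI themselves

print(f"voice: {voice_interface_time:.0f}s, direct: {direct_time:.0f}s")
```

So the pretend-computer’s typing speed isn’t held against the voice interface; what’s left is the cost of specifying the task by voice.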
That’s an interesting-sounding paper. I tried to find it; is it this one? I was able to find a really crappy scan of the full text.
I’m afraid the paper you found doesn’t match @david_chisnall’s description.
David described an experiment involving humans listening to voice input and then operating a computer according to those instructions. In the paper you linked, “Voice recognition based human-computer interface design” by Wei Zhang et al., voice input went not to humans, but to “InCube software, which is a voice recognition software that […] handles lexicons of up to 75 commands”. Its voice recognition capabilities sound much weaker than those of humans:
For the voice input interfaces, the ASGC [automatic semester grade calculation] software is pre-loaded with the oral commands such as “start”, “enter”, “OK”, etc., as well as 10 digits and 26 letters. A command consists of several keystrokes, a voice template, and an attribute. The commands are then linked with the equivalent keystrokes. Before a command can be recognized, a corresponding voice template must first be created.
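As far as I can tell from that description, the interface boils down to a fixed lookup from voice templates to keystroke macros. A minimal sketch of how I read it (the names, the keystroke notation and the structure are my guesses, not anything from the paper):

```python
from dataclasses import dataclass

@dataclass
class Command:
    voice_template: str   # pre-recorded template the recognizer matches against
    keystrokes: str       # equivalent keystrokes sent to the application
    attribute: str        # e.g. whether the command fires immediately

# A lexicon of at most ~75 commands, per the description of InCube.
LEXICON = {
    "start": Command("start", "{F2}", "immediate"),
    "enter": Command("enter", "{ENTER}", "immediate"),
    "ok":    Command("ok", "{ENTER}", "immediate"),
}

def handle(recognized_word: str) -> str | None:
    """Map a recognized word to its keystroke macro, or None if it is out of vocabulary."""
    cmd = LEXICON.get(recognized_word.lower())
    return cmd.keystrokes if cmd else None

print(handle("Enter"))  # {ENTER}
print(handle("undo"))   # None: anything outside the small lexicon is simply not understood
```

So it’s less “talking to the computer” and more a spoken keyboard with 75 keys.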
I, too, would be curious to read the paper @david_chisnall described.
I’d love to know what paper that was.
I wonder how different it is when one doesn’t have the high-bandwidth output channel of vision. It would be interesting to repeat the experiment except with blind people.
Does the paper account for inexperience using a voice interface? Specificity could be learned.
The way Tony Stark uses JARVIS in the Marvel movies kinda makes more sense, now that you mention it. Even though he uses voice commands extensively, there’s almost always a visual, and often holographic, component that is, if not tactile, at least manipulable.
The voice commands are either for background tasks, or for additional, complex queries or transformations of the visual representation.
But it’s all focused on the visual and touchable representation: the voice commands are shortcuts, not the main interface. Like keyboard shortcuts in a GUI program.
Looks like a much cooler future than blank text boxes everywhere.
People took to GUIs when they didn’t take to CLIs because a GUI offers options to pick from. A CLI must be known first, or else you constantly hit dead ends. If you know your options you can often do more with a CLI than a GUI, because a CLI can use programs in combination while a GUI tends to push users down a one-app path per task, and OpenDoc never landed. But most of my coworkers use the command line as nothing more than a “shortcut” or wish they could do without it. They don’t know its utility because they never crossed that threshold.
Speaking to Siri is like a CLI in this regard. Your options aren’t shown to you, so you need to know what you can say. It takes trial and error, or study. If capabilities expand to answer practically anything, the way a search engine can almost always give some kind of answer, general speech interfaces could jump up in utility and especially in usage. That’s how ChatGPT feels to me: because dead ends are rare, people continue to find it compelling.
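A toy way to picture the difference (purely illustrative; this is not how Siri or ChatGPT actually work):

```python
KNOWN_COMMANDS = {
    "set timer": "Timer set.",
    "play music": "Playing music.",
}

def rigid_assistant(utterance: str) -> str:
    # CLI-like: you must already know the valid phrases, or you hit a dead end.
    return KNOWN_COMMANDS.get(utterance.lower(), "Sorry, I didn't understand that.")

def open_ended_assistant(utterance: str) -> str:
    # Search-engine-like: there is always *some* answer, even if it's imperfect.
    exact = KNOWN_COMMANDS.get(utterance.lower())
    return exact if exact else f"Here's what I found about '{utterance}': ..."

print(rigid_assistant("remind me to call mom"))       # dead end
print(open_ended_assistant("remind me to call mom"))  # always comes back with something
```

The second one isn’t necessarily more correct, but it never slams the door, and that alone changes how willing people are to keep talking to it.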
I think it’s a really interesting topic. Personally, I’m someone who strongly believes that mouse and keyboard will remain highly efficient for most tasks for many decades; I’ve been rolling my eyes at claims that they’re about to become obsolete for just as long.
But I wonder if we will see a shift towards more implicitness in some areas when it comes to language. Not that we will voice-control everything, but there seems to be a pattern there:
Programming languages become closer to natural languages and more implicit over time.
Search engines take language implications (synonyms, writing styles, etc.) into consideration.
Some of the “AI” stuff like text-to-image, which at times feels a bit like an extension of search, implicitly has to do a lot of things.
So I wonder whether natural language interfaces can, in certain areas, take word choice, tone of voice, etc. into consideration, maybe in the way gestures do. I would imagine that being used for very high-level computer use. By high-level I mean that data entry, programming and many other tasks are most convenient with keyboards, while some situations are more convenient with gestures, both on mobile devices and at times in video games, maybe even alongside a mouse.
Of course I also think timers, and other situations like playing music that we already see with Home Assistants, will stick. But those don’t even feel very much like natural language interfaces, in the sense that one still thinks of these devices as computer gadgets. For tone of voice and subtext, you need to go further. I imagine something like a very good secretary who knows a lot of implicit things. It could be a lot of other jobs as well; I am talking about jobs with strong ties to another individual, where subtext and tone of voice actually become relevant.
When I think about human language, it can be kind of inefficient for programming and for describing pictures, and things like prose can make for bad references at times.
However, everyone knows those situations where you don’t want to write, because tone of voice, word choices, maybe even slurs are important, and there natural language actually uses the part of the bandwidth that feels more like padding in other situations.
So when we come back to search and AI topics, I think natural language will help there, similar to how Android’s and Gmail’s word/sentence suggestions and Swype are already pretty good at guessing what you want to type.
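The kind of guessing I mean, in a very stripped-down form (a toy bigram model, nothing like what these products actually ship):

```python
from collections import Counter, defaultdict

# A tiny corpus; real systems learn from far more text plus your own typing history.
corpus = "see you tomorrow . see you later . see you soon . talk to you later".split()

# Count which word tends to follow which.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def suggest(word, k=3):
    """Return the k words that most often followed `word` in the corpus."""
    return [w for w, _ in following[word].most_common(k)]

print(suggest("you"))  # ['later', 'tomorrow', 'soon']
```

Even something this crude narrows the options a lot; the shipped versions just do it with far more context.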
I imagine one could gain a lot with natural language when applying this to anything that is like a search, and more broadly wherever how something is expressed is a relevant parameter, because it provides a narrowing that might not easily be expressed in other languages.
And I think maybe that’s a bit where Star Trek comes into play. If I remember correctly, a lot of it was Home Assistant-style stuff: “How long does it take for X?”, “Where is the next Y?”, calculations. Things where you don’t want to pick up a device, because you’d only need it for a second.
On the other hand, they use what essentially feels like terminals connected to a main system (the spaceship, etc.). While I think the UI is silly and physical keys are better than touch surfaces, I like to imagine these as being similar to what computers, status dashboards, etc. are good for, or maybe well-designed HUDs in video games: displaying status information. One of my biggest “science fiction” childhood dreams coming true is having something like a minimap, maybe even with navigation points, which we got. I remember when those navigation points were still imagined as augmented reality, but they largely weren’t displayed like that in video games.
I think natural language is currently mostly used as a gadget in smartphones, home assistants and home automation, with the most useful situation being that you need to set a timer, ask for information or play music while your hands are full. I think when/if emotions, tone of voice, word choices or other subtext deduced from natural language can be used to save a few dozen extra words, it might become interesting for more applications. I also think that natural language is not really used well in IT. That is, it’s often used in a way where one effectively tries to reduce it to “something that could have been typed on a keyboard”. So it’s like lossy compression, usually followed by some form of simplification, removing the “unnecessary words” for the task at hand. And of course this isn’t exactly direct or efficient.
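What I mean by lossy compression, as a toy example (not any specific production pipeline):

```python
# A toy version of the "reduce it to what could have been typed" step.
STOP_WORDS = {"could", "you", "please", "maybe", "i", "was", "wondering", "if", "a", "the", "for"}

def reduce_to_command(utterance: str) -> list[str]:
    """Strip the 'unnecessary' words, keeping only the bare task description."""
    return [w for w in utterance.lower().replace(",", "").split() if w not in STOP_WORDS]

print(reduce_to_command("Could you maybe set a timer for ten minutes, please"))
# ['set', 'timer', 'ten', 'minutes']
# The politeness, hesitation and tone are exactly the parts that get thrown away.
```

Which is fine for timers, but it’s also why the subtext I was talking about never reaches the machine.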
But I think it’s wrong to assume by default that information redundant for one purpose is redundant for everything. What you often do as a programmer, especially when consulting, is actually use that “redundant” (but actually implied) information; otherwise the person describing the problem could code it themselves. If that weren’t the case, I think we’d likely already have well-working systems that you simply tell what you want and that produce both a software and a hardware architecture, maybe not perfect, but close to it. We try to do that through various means: GUIs, high-level programming languages, etc. But in many situations you hire a person to translate, even for things that are done a million times and are always very similar. Now, I am not saying that’s the only reason for it; it certainly isn’t. However, I don’t think that what NLP does in most currently deployed production systems, transcribing something someone said, removing stop words, retaining only some of the order, etc., is all there is. It’s mostly just what can be handled right now. I have basically no background in NLP, but I’ve had a job where I did a bit with it, and there are lots of examples where a person would understand something one way and the machine the complete opposite, which was technically also (or actually) correct. But there was also a reason why it was said that way.
So, in short: I don’t think that natural language is just a weird, buggy version of formal language. It’s just that if your goal is to turn it into formal languages or computer instructions, you will only ever use a very reduced subset of the natural language, and that subset happens to look similar to the formal language, because we take the way we communicate and pare it down, subset by subset, until it turns into the execution of a machine/a process.
The kind of funny thing is that “text to image”, for example, seems to pick up a lot of implied stuff, but is based on a super reduced language, namely image tagging/description. So in the end you again cut away most of that implied language, simply because the training set already does. There is a reason there are whole documentaries, books and Wikipedia articles just to describe a single painting. There are also reasons why interpretations of things like poems or lyrics tend to be many times as long as the poems themselves.