Beating flesh-and-blood chess and Jeopardy champs was one thing. Training Project Debater to sway opinion is a brand new challenge.
“Hello, Noa. We meet again. And for the first time in front of a non-IBM audience. I’ve been told it helps to take a deep breath. But unfortunately, I cannot do that.”
The setting is a competitive debate being held at Watson West, IBM’s AI outpost in San Francisco’s tech-centric SOMA neighborhood. Noa is Noa Ovadia, a champion debater from Israel. The speaker greeting her is her opponent—whose inability to breathe deeply makes perfect sense given that it’s a piece of software, generating Siri-like female speech that emanates from a human-sized black column with a screen on its front.
This is indeed the first time that the software in question, Project Debater, has shown its stuff outside of secret trial runs. Since 2012, IBM Research has been teaching it to debate humans on a vast array of subjects—making it a successor to Deep Blue (which beat Garry Kasparov in a six-game chess match in 1997) and Watson (which won a Jeopardy tournament against Ken Jennings and Brad Rutter in 2011).
On Monday afternoon in San Francisco, Project Debater argued in favor of government subsidies for space exploration while Ovadia made the case against them. The software then took on another Israeli debater, Dan Zafrir, on the topic of increased investment in telemedicine. (Project Debater was for it; Zafrir was against.) Throughout, it was obvious that Project Debater is a work in progress. But the AI scored some points, appeared to understand the gist of its opponents’ arguments well enough to respond to them, and made the audience laugh—sometimes even on purpose. Based on voting from the few dozen spectators, it narrowly lost the space exploration debate but swayed more audience members to its position on telemedicine than Zafrir did.

That was good enough to make the exhibition a success in the eyes of Noam Slonim, a senior technical staff member at IBM’s research center in Haifa, Israel, and the person who originally proposed the Project Debater idea in 2011. The effort now includes dozens of researchers at multiple IBM labs and is led by Slonim’s Haifa colleague Ranit Aharonov. Merely seeing the software keep the audience engaged over a 20-minute debate “was a very positive feeling,” he tells me at a post-debate reception.
Witnessing a computer thrash the Kasparovs and Jenningses of competitive debating at their own craft would be an epoch-shifting moment, but “our goal is not to develop yet another system that is better than humans in doing something,” stresses Aharonov. Instead, IBM wants to create debating software that can spar with “a reasonably competent human, but not necessarily a world champion, and come across holding its own,” says IBM director of research Arvind Krishna. Still, even if the company is keeping its aspirations in check, its latest adventure in AI involves challenges unlike any it’s tackled before.
The next grand challenge
Soon after Watson bested Jennings and Rutter in a tournament taped at IBM’s Yorktown Heights, N.Y., research center in January 2011, the company began to think about how to top that memorable feat of artificial intelligence. “All of the thousands of researchers … got the same email asking what should be the next AI grand challenge that IBM Research should pursue,” remembers Slonim. The goal, he explains, was to come up with a project that was “scientifically interesting and challenging and would have some business value. Something big, something that would make a difference.”
Slonim brainstormed with colleagues and hatched the concept of IBM Research training AI to debate a human being. Israel—a country with a strong culture of competitive debating—was a logical place for this particular dream to originate. “They certainly have a very firmly held belief, both in society and their political system, about free debate in all forms,” says Krishna.
It took about a year before Slonim’s proposal, in more fully evolved form, won out over other grand-challenge possibilities. But in 2012, the company officially decided to pursue it. As more researchers were devoted to the project, they made headway; by 2016, they had a rough draft of the technology that worked well enough to make clear the idea wasn’t a pipe dream. “But not in a way you’d be willing to watch,” says Krishna. “You’d get a topic, then it would come back a few hours later and spit out a one-minute speech.”
Even today, with Project Debater capable of performing before an audience, gauging its success isn’t that straightforward a matter. Winning a game of chess or Jeopardy is a well-defined goal, and there can be no doubt when you’ve achieved it. Whether someone (or something) has won a debate, however, is entirely subjective and up to every spectator to decide for himself or herself. (In case you hadn’t noticed, even reasonable people tend not to agree on who won high-profile debates such as those associated with political campaigns.)
No debater can convince an audience without sounding well informed about the subject at hand. IBM has given Project Debater a huge corpus of information, feeding it billions of sentences drawn from sources such as hundreds of millions of newspaper articles. The AI is able to identify facts and opinions and weave them together, in a sort of sweeping automated cribbing that its human opponents can’t match.

The resulting arguments, read aloud by IBM’s text-to-speech engine during a debate, are far from seamless. In San Francisco, the software sometimes made assertions—such as telemedicine being a boon to working mothers—and then abruptly shifted gears without explaining what it had meant. It also peppered its advocacy for subsidized space exploration with a surprisingly large number of references to the Middle East, a quirk that Slonim told me might simply relate to its corpus of knowledge skewing toward sources from that region. But Project Debater’s argumentative patchwork holds together better than you might expect, and the software stitches it together rapidly. Though not in absolute real time: After each of its opponents’ statements, the AI took a couple of minutes to mull over its response.
In many instances, the software rephrases the material it mines, but snippets of existing text do make their way into its speech unaltered. In its first San Francisco debate, for example, it helpfully explained that “space exploration is the ongoing discovery and exploration of celestial structures in outer space, by means of continuously growing and evolving space technology.” That definition seems to have come directly from the Center for Planetary Science; it’s an act of artful borrowing rather than evidence that the software is capable of formulating sentences on its own.
Just as successful human debaters don’t have to be Ken Jennings-like polymaths to argue whatever position they’re charged with making, IBM’s software is only partially about the facts and figures it can spew out. The bigger challenge was teaching it to argue any proposition by applying the same sort of debating techniques a human would. (The topics that IBM gives Project Debater for its debates—such as space exploration and telemedicine—are chosen at random from a curated list; the company hasn’t written the software with them in mind or trained it specifically to handle them.)
In many domains of AI, thinking like a human isn’t just unnecessary; it can be a hindrance. The current version of Google’s AlphaGo Zero software, for instance, plays Go so well because it taught itself rather than being distracted by the imperfect strategies that flesh-and-blood Go masters apply to the game. By contrast, the whole point of a competitive debate is to push audience members’ buttons in an effective manner. “You cannot win a debate while adopting a tactic that humans do not understand,” says Slonim.
Slonim and Aharonov emphasize that they haven’t tried to build software that could fool anyone into thinking it was a human. Project Debater’s female voice isn’t bad, but it’s unabashedly synthetic—unlike Google’s disturbingly naturalistic Duplex, which salts its speech with computer-generated ums and ahs. Project Debater even likes to allude to the fact that it’s a computer, as witness its reference to being unable to take a deep breath. The display on the front of the column that personifies the software onstage merely shows simple animations indicating whether Project Debater is talking or thinking, more akin to HAL’s red light than full-blown anthropomorphism.
Bottom line: If any AI definitively passes the Turing Test anytime soon, it won’t be this one. And that’s okay.
Which is not to say that IBM hasn’t worked hard to make its debating machine personable in its own way. “If…I…sound…robotic…like…this…you are not going to listen for 10 minutes,” Krishna says. “You’ll go to sleep. It’s got to learn a little bit of inflection. It’s got to learn a little bit of humor and how to interact in a way that humans keep paying attention.”
During the San Francisco face-offs, Project Debater didn’t score anywhere near as many style points as human opponents Ovadia and Zafrir, who modulated their voices as appropriate, punctuated key points by jabbing at the air, and thoughtfully shuffled the papers on their podiums. “It’s very clear in a debate that humans are more persuasive and more emotional and have better rhetoric, and the AI system is better at bringing relevant evidence,” says Aharonov. But as the software made its cases, you could see how IBM’s researchers had programmed it to mimic the best practices of human debaters.
For example, Project Debater often reinforced arguments by telling the audience what it was about to say or what it had just said. It rattled off numbered lists, referenced pertinent stats, and quoted experts. In both debates, it attempted to defuse its adversaries’ talking points even before they’d made them (“You might hear my opponent today talk about different priorities and subsidies”). Certain canned phrases, such as “Let me start with a few words of background,” popped up in both debates. And it wedged jokes into its patter, seemingly picking them from an assortment composed by its creators.
The way the software filters a giant collection of preexisting text through a framework of standard debating techniques can lead to some strange moments. Space exploration, it declared at one point, “is more important than good roads or improved schools or better healthcare,” provoking chuckles from the audience. It also asserted that “peaceful nuclear-powered space exploration is designed to reassure people about future nuclear weapons in orbit,” a sound bite it borrowed verbatim from an article and apparently mistook for an innocuous sentiment rather than a dystopian prospect. And at one point, it blurted “Scott Pelley voiceover,” an apparently accidental non-sequitur about the CBS newsman.
Public failure is not new to IBM Research’s grand challenges. Deep Blue’s victory over Garry Kasparov in a six-game match in 1997 is the stuff of history; less well remembered is the fact that Kasparov twice trounced the system’s predecessor, Deep Thought, in 1989. In the case of Watson, Stephen Baker’s book Final Jeopardy reports that when IBM was developing its Jeopardy-playing software, the company deemed 5% of its incorrect responses to be not only wrong but “embarrassing.” Even after IBM tweaked the AI specifically to avoid wacky blunders, it came up with the question “What is Toronto?” in a Final Jeopardy round with the category “U.S. Cities.”
Given its attempt to simulate extemporaneous speaking on nearly any subject, Project Debater’s goofs are more likely to be weird and jarring than those of its predecessors. But each erratic moment is a data point which Slonim, Aharonov, and their collaborators can use to improve future versions of the software. “Through some of the mistakes which are sometimes funny and actually make the whole debate fun, I think one learns a lot,” says Aharonov.
Glitches aside, Project Debater gets enough things right to impress a debating pro such as Ovadia, who has been matching wits with it in test matches for a few months. “At the beginning, I was floored, both by just the speech but also the argument construction,” she says. “The ability to listen and then give a meaningful rebuttal to what I’ve been saying.” Even Ovadia’s measured criticism (“she’s a little better at technical knowledge and definitely weaker at principled argumentation”) is respectful, down to the use of the feminine pronoun.
The business of debate
At the post-debate reception, Krishna told me that Project Debater’s creators continue to make progress so rapidly that the software should be noticeably more facile and polished even six months from now. No matter how much progress they make, though, they won’t run out of ways to push this wildly ambitious idea further: “This is the beginning of something that we can explore for many, many years,” says Slonim.
From the start, that exploration was designed to lead to important new businesses for IBM, and at some point the company will begin to monetize what it’s learned. Back in 2011, Watson’s Jeopardy win proved so transfixing a demonstration of the power of IBM’s AI that the company has since applied the Watson brand to commercial AI cloud services it’s rolled out in healthcare, financial services, education, and other areas. These initiatives, while not without their tribulations, are key to IBM’s present and future. And at some point, they’re likely to be joined by Watson services leveraging what the company has learned from Project Debater.
Krishna, who has two titles at IBM—director of research and senior VP, hybrid cloud—says that his responsibilities for both pure research and business priorities require a willingness to be “schizophrenic.” But he emphasizes that the people responsible for an effort such as Project Debater shouldn’t be too encumbered by the question of how their technology might be monetized. “You’ve got to have a glimmer of what this is going to be, but you can’t burden the team with that,” he says. “Most people, under that weight, will not be able to make progress.”
Still, he sees plenty of ways this research undertaking could lead to profits. All sorts of organizations might find value in software that can synthesize information and summarize the pros and cons of an issue, he says. The technology might even serve as an antidote to the spread of misleading information online, analyzing text and detecting biases.
IBM is far from alone in looking at teaching computers to debate as an important part of the overarching quest to make them smarter. Among the audience members for the San Francisco debates was Chris Reed, who heads the University of Dundee’s Center for Argument Technology. Reed, who’d traveled from Scotland for the event, is not involved with Project Debater, but was exuberant nonetheless: “To see so many pieces of the puzzle coming together here is really impressive,” he said.
As Project Debater was wrapping up its argument in favor of increased use of telemedicine, it quoted a famous Arthur C. Clarke aphorism: “Any sufficiently advanced technology is indistinguishable from magic.” The reference came off as more charmingly nerdy than apt in the context at hand. But the current state of IBM’s research project proves a different point: Technology doesn’t need to be magical to be meaningful.
Or as the software itself put it, while trying to paint its human opponent Zafrir as a bit of a Luddite, “Don’t be afraid, and let us move on with the changing world.”