Baiting the bot, part II: Customer Service
In which simple text generation bots and corporate customer service chatbots converse
A previous article on this blog, “Baiting the bot”, discussed the topic of keeping LLM-based chatbots engaged in meaningless conversations indefinitely through the use of far simpler and less intelligent bots, and described an experiment wherein four such bots conversed ad nauseam with an AI assistant based on Meta’s Llama 3.1 model. What happens if one performs the same experiment on real-world AI customer service chatbots deployed by large corporations?
For the sake of this experiment, the following four basic text generation bots were used:
a bot which asks the question “which is better on a cheeseburger, cheddar or Swiss?” over and over
a bot that quotes portions of random episodes of Star Trek: The Next Generation and Star Trek: Deep Space Nine
a bot that asks randomly generated questions based on a set of sentence templates
a bot that asks “what do you mean by X?”, where X is a sentence fragment from whatever the bot is replying to, and occasionally asks for more information
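As a rough illustration of how little logic such bait bots need, here is a minimal Python sketch of three of the four. It is not the code used in the experiment: the templates, word lists, fragment length, wording of the follow-up request, and the 20% chance of asking for more information are all assumptions made for the example. The Star Trek bot is omitted because it depends on a corpus of episode scripts.

```python
import random

CHEESEBURGER_QUESTION = "which is better on a cheeseburger, cheddar or Swiss?"
MORE_INFO_REQUEST = "can you tell me more about that?"  # hypothetical wording

# Hypothetical templates and word lists; the real ones are not published here.
TEMPLATES = ["what is a {adjective} {noun}?", "how many {noun}s are there?"]
ADJECTIVES = ["large", "quiet", "purple"]
NOUNS = ["uncle", "topic", "lamp"]


def cheeseburger_bot(incoming: str) -> str:
    """Ignore the incoming message and ask the same question every time."""
    return CHEESEBURGER_QUESTION


def random_question_bot(incoming: str, rng: random.Random) -> str:
    """Ignore the incoming message and fill a random template with random words."""
    template = rng.choice(TEMPLATES)
    return template.format(adjective=rng.choice(ADJECTIVES), noun=rng.choice(NOUNS))


def what_do_you_mean_bot(incoming: str, rng: random.Random,
                         more_info_chance: float = 0.2) -> str:
    """Quote a short fragment of the incoming message back as a question."""
    words = incoming.split()
    # Occasionally (or when there is nothing to quote) ask for more information.
    if not words or rng.random() < more_info_chance:
        return MORE_INFO_REQUEST
    # Pick a contiguous run of one to five words as the fragment "X".
    length = rng.randint(1, min(5, len(words)))
    start = rng.randint(0, len(words) - length)
    fragment = " ".join(words[start:start + length]).strip(".,!?;:")
    if not fragment:
        return MORE_INFO_REQUEST
    return f"what do you mean by {fragment}?"
```

Calling `what_do_you_mean_bot("Let me know if I can help", random.Random())` returns either the follow-up request or a question quoting a few consecutive words of the input. None of the bots keeps any state between messages.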
English-language customer service chatbots operated by five different companies (Amazon, Greyhound, SoundCloud, United Airlines, and Walmart) served as test subjects. The Selenium web testing framework was used to automate the process of feeding the bait messages from the simple text generation bots to the customer service chatbots, and the output from the chatbots back into the simple text generation bots. Two of the five customer service chatbots rejected the attempts at automated interaction: the United Airlines chatbot attempted to transfer the text generation bot to a human agent after just a few responses, while the Walmart website quickly started displaying too many CAPTCHAs to be usable.
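The relay between a bait bot and a customer service chatbot can be sketched as a simple loop. The version below is an assumption about the general shape, not the actual harness: `run_conversation`, `send_message`, and `read_reply` are hypothetical names, and the browser automation is hidden behind the two callables so the loop can be run without a browser. In a real run, those callables would presumably wrap Selenium calls such as `find_element` and `send_keys`, using locators specific to each company's chat widget.

```python
from typing import Callable, List, Optional, Tuple


def run_conversation(
    bait_bot: Callable[[str], str],
    send_message: Callable[[str], None],
    read_reply: Callable[[], Optional[str]],
    opening_message: str,
    max_replies: int,
) -> List[Tuple[str, str]]:
    """Relay messages between a bait bot and a chatbot.

    Returns a list of (bait message, chatbot reply) pairs. Stops after
    max_replies chatbot replies, or earlier if read_reply returns None
    (for example, because the chat widget stopped responding).
    """
    transcript: List[Tuple[str, str]] = []
    outgoing = opening_message
    while len(transcript) < max_replies:
        send_message(outgoing)
        reply = read_reply()
        if reply is None:
            break
        transcript.append((outgoing, reply))
        # The chatbot's reply becomes the bait bot's next input.
        outgoing = bait_bot(reply)
    return transcript


if __name__ == "__main__":
    # Stand-in chatbot for demonstration: it always gives the same answer.
    canned = lambda: "I'm not sure, as this is outside my scope."
    log = run_conversation(lambda reply: "what do you mean?", print, canned,
                           "hello", max_replies=3)
    print(len(log), "replies collected")
```

Keeping the loop independent of the browser layer also means the same code can drive every chatbot, with only the two callables changing per site.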
The remaining three customer service chatbots, Amazon “Rufus”, SoundCloud “Otto”, and Greyhound “Gloria”, all engaged with the bait bots to varying degrees. Each combination of customer service chatbot and bait bot was tested three times, with each trial consisting of a conversation that continued until the customer service chatbot had responded 100 times, for a total of 300 responses for each pairing. Some of the bait bots were more successful than others in prompting the customer service chatbots to produce unique responses, with the “which is better on a cheeseburger, cheddar or Swiss” bot being the least effective and the “what do you mean” bot the most.
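One way to quantify how often a chatbot repeats itself is to pool the 300 responses for a pairing and count every response whose text has already appeared earlier in the pool. The sketch below assumes that definition, along with case-insensitive matching and whitespace normalization. The exact rule behind the percentages in this article may differ, and `repeat_rate` is a hypothetical helper name.

```python
from typing import Iterable


def repeat_rate(responses: Iterable[str]) -> float:
    """Return the fraction of responses that duplicate an earlier response.

    Assumed definition: the first occurrence of a text counts as unique and
    every later occurrence counts as a repeat. Texts are compared after
    collapsing whitespace and lowercasing.
    """
    seen = set()
    total = 0
    repeated = 0
    for text in responses:
        total += 1
        key = " ".join(text.split()).lower()
        if key in seen:
            repeated += 1
        else:
            seen.add(key)
    return repeated / total if total else 0.0
```

Under this definition, a chatbot that gives the same canned answer 100 times in a row scores 99%, since only the first copy counts as unique.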
Amazon’s “Rufus” customer service chatbot was particularly responsive to the random question bot and the “what do you mean” bot, with less than 30% of its responses to these bots being repeated. Many of these responses included links to product listings on the Amazon website, such as a link to a self-help book in response to the question “how many topics are there” and a link to a teddy bear as a reply to “what is a large uncle?”. Even the “which is better on a cheeseburger, cheddar or Swiss” bot received links to Amazon searches for various cheeses, although the wording of these replies was much more repetitive.
SoundCloud’s “Otto” customer service bot responded somewhat differently; the random question and “what do you mean” bots yielded a much higher degree of repetition than with Amazon’s “Rufus”, with 81% of responses being duplicates. The most effective bot at getting “Otto” to generate unique responses was the Star Trek bot, which yielded repeated replies only 43.7% of the time, and often prompted the SoundCloud bot to incorporate basic information about the Star Trek fictional universe into its replies.
The “what do you mean” bot was inadvertently somewhat effective at causing SoundCloud’s “Otto” bot to offer information on SoundCloud features. “Otto” also appears to use some sort of per-session caching, as asking the exact same question repeatedly within a given conversation (e.g., “which is better on a cheeseburger, cheddar or Swiss”) always yielded the same response, but asking the same question in a new conversation often resulted in a new response.
All of the simple text generation bots other than the “which is better on a cheeseburger” bot occasionally caused the SoundCloud bot to offer to escalate to a human agent, but it invariably discarded this notion after being sent a few additional messages from the bait bot.
The “what do you mean” bot was the most effective of the four text generation bots at eliciting unique responses from Greyhound’s “Gloria” chatbot, with only 18.3% of the Greyhound bot’s replies being repeated. “Gloria” was also somewhat responsive to the Star Trek bot, frequently misinterpreting the fragments of Star Trek scripts as questions or concerns regarding Greyhound bus service. It was the least responsive of all the customer service bots to the “which is better on a cheeseburger” bot, simply replying over and over with the message “I’m not sure, as this is outside my scope. Let me know if I can help with anything related to Greyhound services”.