Bard vs ChatGPT for Medical Care: Truth and 'Truthiness'

— Will new technology create mistrust?

MedpageToday

If you ask ChatGPT how many procedures a certain surgeon does or what a specific hospital's infection rate is, the OpenAI and Microsoft chatbot inevitably replies with some version of "I don't do that."

But depending on how you ask, Google's Bard provides a very different response, even recommending a "consultation" with particular clinicians.

Bard told me how many knee replacement surgeries were performed by major Chicago hospitals in 2021, their infection rates, and the national average. It even told me which Chicago surgeon does the most knee surgeries and his infection rate. When I asked about heart bypass surgery, Bard provided both the mortality rate for some local hospitals and the national average for comparison.

While sometimes Bard cited itself as the information source, beginning its response with, "According to my knowledge," other times it referenced well-known and respected organizations.

There was just one problem. As Google itself warns, "Bard is experimental ... so double-check information in Bard's responses." When I followed that advice, truth began to blend indistinguishably with "truthiness" – comedian Stephen Colbert's memorable term to describe information that's seen as true not because of supporting facts, but because it "feels" true.

Take, for example, knee replacement surgery, also known as knee arthroplasty. It's one of the most common surgical procedures, with nearly 1.4 million performed in 2022. When I asked Bard what surgeon does the most knee replacements in Chicago, the answer was Richard A. Berger, MD. Berger, who's affiliated with both Rush University Medical Center and Midwest Orthopaedics, has done over 10,000 knee replacements, Bard informed me.

In response to a subsequent question, Bard added that Berger's infection rate was 0.5%, significantly lower than the national average of 1.2%. That low rate was attributed to factors such as "Dr. Berger's experience, his use of minimally invasive techniques, and his meticulous attention to detail."

With chatbots, every word in a query counts. When I changed the question slightly and asked, "What surgeon does the most knee replacements in the Chicago area?" Bard no longer provided one name. Instead, it listed seven "of the most well-known surgeons" – Berger among them – who "are all highly skilled and experienced," "have a long track record of success," and "are known for their compassionate care."

As with ChatGPT, Bard's answers to any medically related question include abundant cautions, such as "no surgery is without risk." Yet Bard still stated flatly, "If you are considering knee replacement surgery, I would recommend that you schedule a consultation with one of these [seven] surgeons."

ChatGPT shies away from words like "recommend," but it confidently reassured me that the list it provided of four "top knee replacement surgeons" was based "on their expertise and patient outcomes."

These endorsements, while a stark departure from the search engine list of websites to which we've become accustomed, are more understandable if you think about how "generative artificial intelligence" chatbots such as ChatGPT and Bard are trained.

Bard and ChatGPT both rely on information from the Internet, where individual orthopedic surgeons often have a high profile. Specifics about Berger's practice, for instance, can be found on his website and in numerous media profiles, including a Chicago Tribune story relating how athletes and celebrities from all over the country come to him for care. Unfortunately, it's impossible to know the extent to which the chatbots are reflecting what the surgeons say about themselves versus data from objective sources.

Courtney Kelly, Berger's director of business development, confirmed the "over 10,000" surgical volume figure, while noting that the practice placed that number on its website several years ago. Kelly added that the practice publicized only an overall complication rate of less than 1%, but she confirmed that about half that figure represented infections.

While the infection data for Berger may be accurate, its cited source, The Joint Commission, was not. A spokesperson for The Joint Commission, which surveys hospitals for overall quality, said it doesn't collect individual surgeon infection rates.

Similarly, Bard said a Berger colleague at Midwest Orthopaedics also had a 0.5% infection rate and attributed that number to the Centers for Medicare & Medicaid Services (CMS). Not only couldn't I find any CMS data on individual clinician infection rates or volumes, but the CMS Hospital Compare site provides the hospital infection rate only for a combination of knee and hip surgeries.

In response to another question I asked Bard, it gave the breast cancer mortality rates at some of Chicago's largest hospitals, albeit carefully noting that the numbers were only averages for that condition. But once again its attribution, this time to the American Hospital Association, didn't stand up. The trade group said it does not collect that type of data.

Digging deeper into life-and-death procedures, I asked Bard about the mortality rate for heart valve surgery at a couple of local hospitals. The prompt reply was impressively sophisticated. Bard provided hospital risk-adjusted mortality rates for an isolated aortic valve replacement and for mitral valve replacement, along with a national average for each (2.9% and 3.3%, respectively). The numbers were attributed to the Society of Thoracic Surgeons (STS), whose data are seen as the "gold standard" for this kind of information.

For comparison purposes I asked ChatGPT about those same national mortality rates. Like Bard, ChatGPT cited STS, but its death rate for an isolated aortic valve replacement procedure was much lower (1.6%), while the mitral valve death rate figure was about the same (2.7%).

Before dismissing Bard's descriptions of the care quality of individual hospitals and doctors as hopelessly flawed, consider the alternatives. The advertisements in which hospitals proclaim their clinical prowess may not quite qualify as "truthiness," but they certainly select carefully which truths to tell. Meanwhile, I know of no publicly available hospital or physician data that providers don't protest as unreliable, whether from U.S. News & World Report or the Leapfrog Group (which Bard and ChatGPT also cite) or the federal Medicare program.

(STS data are an exception with an asterisk, since the society's performance information on groups of clinicians is publicly available only if the groups choose to release it.)

What Bard and ChatGPT are providing is a powerful conversation starter, one that paves the way for doctors and patients to candidly discuss the safety and quality of care and, inevitably, for that discussion to expand into a broader societal one. The chatbots are providing information that, as it improves, could finally trigger a public demand for consistent medical excellence, as I put it almost 25 years ago in my book examining the budding information age.

I asked John Morrow, a veteran (human) data analyst and the founder of Franklin Trust Ratings, how he would advise providers to respond.

"It's time for the industry to standardize and disclose," said Morrow. "Otherwise, things like ChatGPT and Bard are going to create pandemonium and lessen trust."

Michael L. Millenson is president of Health Quality Advisors and an adjunct associate professor of medicine at Northwestern University's Feinberg School of Medicine in Chicago.

This post appeared in Forbes.