
ChatGPT Radiologist? Researchers Test AI Model in Exam, and It Did Quite Well


Researchers at Toronto General Hospital in Canada recently conducted a fascinating experiment involving ChatGPT, the popular conversational chatbot. To assess its capabilities, they administered an exam to the AI model modeled on the radiology board examinations used in both Canada and the United States. The results were impressive: ChatGPT scored 81 percent, comfortably above the 70 percent passing threshold.

Ever since its launch, ChatGPT has captivated users with its remarkable ability to comprehend information and provide accurate responses to queries. This prompted researchers to push the boundaries further and test its performance on the U.S. Medical Licensing Exam (USMLE) and even the MBA exam at the prestigious Wharton Business School. However, its performance in those tests turned out to be less stellar.


Considering the widespread adoption of ChatGPT in various fields, the team of researchers at University Medical Imaging Toronto recognized the need to explore the AI model’s potential in radiology. This prompted them to conduct the exam, aiming to evaluate ChatGPT’s abilities in the context of radiological practice.

ChatGPT answers radiology questions

The researchers set up a 150-question test for ChatGPT, much like the radiology board exams administered to candidates in Canada and the U.S. Since the AI bot cannot process images as input, the researchers used text-only questions, which were grouped into lower-order and higher-order categories.

Questions in the lower-order group test the chatbot on knowledge recall and basic understanding of the subject, while those in the higher-order group require it to apply, analyze, and synthesize information.

Could AI carry out examinations soon? Researchers are wary

Since there are two versions of GPT currently available, the researchers tested both of them on the same question set to see if one was better than the other.

ChatGPT powered by the older version, GPT-3.5, scored only 69 percent on the question set. It did well on the lower-order questions (84 percent, 51 of 61 correct) but struggled with the higher-order ones, managing only 60 percent (53 of 89 correct).

After GPT-4 was released in March 2023, the researchers tested the improved version of ChatGPT again; it scored 81 percent, answering 121 of the 150 questions correctly. Consistent with OpenAI’s claims about GPT-4’s superior reasoning capabilities, the newly launched large language model also scored 81 percent on the higher-order questions, the press release said.
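The reported percentages follow directly from the raw counts quoted above. As a quick illustration (not part of the study itself), the minimal Python sketch below recomputes those scores; the variable names and structure are our own, and only the counts given in this article are used.

```python
# Illustrative only: recompute the exam scores reported above from the raw counts.
# The counts come from the article; the code structure and names are assumptions.

results = {
    "GPT-3.5": {"lower-order": (51, 61), "higher-order": (53, 89), "overall": (104, 150)},
    "GPT-4":   {"overall": (121, 150)},
}

for model, categories in results.items():
    for category, (correct, total) in categories.items():
        pct = 100 * correct / total
        print(f"{model} {category}: {correct}/{total} = {pct:.0f}%")

# Expected output:
# GPT-3.5 lower-order: 51/61 = 84%
# GPT-3.5 higher-order: 53/89 = 60%
# GPT-3.5 overall: 104/150 = 69%
# GPT-4 overall: 121/150 = 81%
```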

What stumped the researchers, though, was GPT-4’s performance on the lower-order questions, where it got 12 questions wrong that GPT-3.5 had answered correctly. “We were initially surprised by ChatGPT’s accurate and confident answers to some challenging radiology questions, but then equally surprised by some very illogical and inaccurate assertions,” said Rajesh Bhayana, a radiologist and technology lead at Toronto General Hospital.

While the tendency to confidently deliver incorrect information, dubbed hallucination, is less frequent in GPT-4, it has not been eliminated. In medical practice, this can be dangerous, particularly for novice users who may not recognize when a confident reply is inaccurate, the researchers added.