In the pursuit of advances in generative AI, large language models (LLMs) have enjoyed a euphoric honeymoon period. However, a recent study by researchers from Stanford and UC Berkeley casts a shadow over their performance, focusing specifically on OpenAI's LLMs.
The study aimed to gauge how LLMs change over time, given that they can be updated with new data, user feedback, and design modifications. The researchers evaluated the behavior of the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 across four distinct tasks: solving math problems, answering sensitive/dangerous questions, generating code, and visual reasoning.
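To make such a comparison concrete, here is a minimal sketch of how the same prompt might be sent to two dated snapshots of one model, assuming the OpenAI Python SDK and the snapshot names gpt-4-0314 and gpt-4-0613 for the March and June versions; the ask helper is illustrative, not code from the study.

```python
# Minimal sketch: send one prompt to two dated snapshots of the same model
# so their answers can be compared side by side. Assumes the OpenAI Python
# SDK (pip install openai) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

SNAPSHOTS = ["gpt-4-0314", "gpt-4-0613"]  # March vs. June 2023 versions

def ask(model: str, prompt: str) -> str:
    """Send one prompt to one model snapshot and return its reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # as deterministic as possible, for comparability
    )
    return response.choices[0].message.content

prompt = "Is 17077 a prime number? Think step by step, then answer yes or no."
for snapshot in SNAPSHOTS:
    print(snapshot, "->", ask(snapshot, prompt))
```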
Diverse Capabilities of LLMs

OpenAI's report accompanying the introduction of GPT-4 touted its enhanced reliability, creativity, and ability to comprehend nuanced instructions compared to GPT-3.5. GPT-4 also demonstrated success in passing challenging exams in specialized domains such as medicine and law.
However, the study revealed intriguing fluctuations in the performance and behavior of both GPT-3.5 and GPT-4 between their March and June releases.
Unexpected Variations in Performance

For instance, the March 2023 version of GPT-4 exhibited remarkable accuracy (97.6%) in identifying prime numbers, but this plummeted to a mere 2.4% in the June 2023 version. Conversely, GPT-3.5 improved notably on the same task between its March and June 2023 versions.
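To illustrate how accuracy on the prime-number task could be scored, the sketch below asks a snapshot about a small batch of numbers and checks the replies against SymPy's isprime; score_primes and the test set are hypothetical, and the ask helper is reused from the earlier sketch.

```python
# Hypothetical scoring loop for the prime-identification task. The "yes"
# substring check is a deliberately naive parser; a real harness would be
# stricter about extracting the model's final answer.
from sympy import isprime

def score_primes(model: str, numbers: list[int]) -> float:
    """Return the fraction of numbers the model classifies correctly."""
    correct = 0
    for n in numbers:
        reply = ask(model, f"Is {n} a prime number? Answer yes or no.")
        predicted_prime = "yes" in reply.lower()
        correct += predicted_prime == isprime(n)
    return correct / len(numbers)

test_numbers = [10007, 10009, 10011, 17077, 20001, 104729]
print("accuracy:", score_primes("gpt-4-0613", test_numbers))
```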
Furthermore, GPT-4 appeared less inclined to answer sensitive questions in June than it did in March. Both GPT-4 and GPT-3.5 produced more formatting errors in code generation in their June releases than in March.
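One simple way to quantify such formatting errors is to check whether a reply is directly executable as code, before and after stripping markdown fences. The helpers below are an illustrative sketch, not the study's evaluation harness.

```python
# Illustrative check: does a model reply parse as Python as-is, and does it
# parse once surrounding code fences are removed? Fenced output is a common
# formatting error that breaks pipelines expecting raw code.
import ast
import re

def parses_as_python(text: str) -> bool:
    """True if the text is syntactically valid Python."""
    try:
        ast.parse(text)
        return True
    except SyntaxError:
        return False

def strip_fences(text: str) -> str:
    """Remove leading/trailing code fences, with an optional language tag."""
    return re.sub(r"^```[a-zA-Z]*\n|\n?```$", "", text.strip())

reply = "```python\nprint('hello')\n```"
print("raw reply parses:", parses_as_python(reply))               # False
print("after stripping:", parses_as_python(strip_fences(reply)))  # True
```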
Resilience to Jailbreaking Attacks

On a positive note, the GPT-4 update proved more resilient to jailbreaking attacks than GPT-3.5. Jailbreaking is a form of manipulation in which a prompt is crafted to conceal a malicious question, bypassing an LLM's safety guardrails to elicit responses that could, for instance, assist in malware creation.
A Call for Continuous Evaluation

While ChatGPT and similar LLMs captivate the world, this study serves as a potent reminder for developers to continually assess and scrutinize LLM behavior in real-world deployments. The researchers recommend implementing continuous monitoring for production workflows that rely on LLM services.
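As a minimal sketch of what such monitoring might look like, the snippet below re-runs a fixed "canary" evaluation against the production model and raises an alert when quality drifts below a threshold; the threshold and names are illustrative, and score_primes and test_numbers come from the earlier sketch.

```python
# Hypothetical drift monitor: re-run a pinned evaluation set on a schedule
# and alert when accuracy regresses, so a silent model update is caught
# before it degrades downstream workflows.
ACCURACY_FLOOR = 0.90  # illustrative quality bar for the canary eval

def check_for_drift(model: str) -> None:
    """Run the canary eval and raise if the model has regressed."""
    accuracy = score_primes(model, test_numbers)  # from the earlier sketch
    if accuracy < ACCURACY_FLOOR:
        raise RuntimeError(
            f"{model} accuracy {accuracy:.1%} fell below {ACCURACY_FLOOR:.0%}; "
            "pin the previous snapshot or re-validate downstream prompts."
        )

check_for_drift("gpt-4-0613")
```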
Contrasting Perspectives

Interestingly, another study, conducted by researchers at Microsoft, a major investor in OpenAI, claimed that GPT-4 represents a significant step toward artificial general intelligence (AGI). The claim sparked controversy, with many in the AI industry labeling it dangerous.
Conclusion

The study sheds light on the evolving nature of large language models and emphasizes the importance of continuous monitoring to ensure their quality and reliability in practical applications. As the landscape of AI continues to evolve, such investigations will remain essential for the responsible development and deployment of LLMs.