Groundbreaking advances in frontier language models are arriving rapidly, boosting the accuracy and reliability of generalist models and making them highly effective in specialized domains. As part of our ongoing exploration of foundation model capabilities, we developed Medprompt last year, a novel approach for maximizing model performance on specialized domains and tasks without fine-tuning. Through multiphase prompting, Medprompt optimizes inference by identifying the most effective chain-of-thought (CoT) examples at run time and drawing on multiple calls to refine its output. When deployed with GPT-4, Medprompt achieved an impressive 90.2% accuracy on the MedQA benchmark (USMLE-style), outperforming all other methods.
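To make the mechanics concrete, here is a minimal sketch of the dynamic few-shot selection step at the heart of Medprompt: retrieve the training examples most similar to the incoming question by embedding similarity and assemble them, with their chain-of-thought rationales, into the prompt. The `embed` and `call_model` helpers are hypothetical placeholders for an embedding model and a chat-completion call; this is an illustration of the technique under those assumptions, not the production implementation.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: replace with a call to an embedding model."""
    raise NotImplementedError

def call_model(prompt: str) -> str:
    """Placeholder: replace with a chat-completion call to a GPT-4-class model."""
    raise NotImplementedError

def select_cot_examples(question: str, example_bank: list[dict], k: int = 5) -> list[dict]:
    """Pick the k training examples most similar to the question.

    example_bank holds dicts with precomputed "embedding", "question",
    and model-generated "cot_answer" fields.
    """
    q_vec = embed(question)
    scores = [
        float(np.dot(q_vec, ex["embedding"])
              / (np.linalg.norm(q_vec) * np.linalg.norm(ex["embedding"])))
        for ex in example_bank
    ]
    top_idx = np.argsort(scores)[-k:][::-1]
    return [example_bank[i] for i in top_idx]

def build_dynamic_fewshot_prompt(question: str, examples: list[dict]) -> str:
    """Assemble the retrieved CoT examples plus the new question into one prompt."""
    parts = ["Answer the medical question. Reason step by step, then give a final letter choice.\n"]
    for ex in examples:
        parts.append(f"Question: {ex['question']}\nAnswer: {ex['cot_answer']}\n")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

def answer_question(question: str, example_bank: list[dict]) -> str:
    examples = select_cot_examples(question, example_bank)
    prompt = build_dynamic_fewshot_prompt(question, examples)
    return call_model(prompt)
```

In the full method, several such calls are made per question and their answers are combined, which the ensembling sketch later in this post illustrates.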
Less than a year later, our tests show that the OpenAI o1-preview model outperformed Medprompt, reaching 96% on the same benchmark (Figure 1) without sophisticated prompt guidance and control. This advance is driven by the model's integration of run-time reasoning strategies at its core, enabling state-of-the-art results on medical licensing exams in the United States and Japan, on the medical subsets of the Massive Multitask Language Understanding (MMLU) benchmark, and on nursing exams (NCLEX), as shown in Figure 2.
These results are notable, prompting us to publish our recent study, findings, and analyses, From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond. But the numbers are only part of the story. In this blog, we discuss prompting strategies for getting the most out of o1-preview models, other factors to consider, and directions forward for run-time strategies.
The introduction of the OpenAI o1 model series marks a significant shift from prior GPT models. Unlike GPT, o1 models are trained using reinforcement learning (RL) techniques that enable them to “think” before generating outputs. While Medprompt relies on a cascade of operations with GPT-4 at run time guided by a multistage prompt, the o1 series incorporates this run-time reasoning directly into its RL-based design. This built-in functionality enables the o1 models to significantly outperform even the best results obtained with GPT-4 and Medprompt. The performance gains come with a notable tradeoff: o1-preview's per-token cost was approximately six times that of GPT-4o at the time of our evaluation. While the results for GPT-4o with Medprompt fall short of o1-preview performance, the combination offers a more cost-effective alternative. The cost-benefit tradeoffs are highlighted in the following figure, with the x-axis presented on a logarithmic scale.
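As a back-of-the-envelope illustration of that tradeoff, the snippet below compares the cost of a single o1-preview call against a Medprompt-style ensemble of GPT-4o calls. The token counts, the number of ensemble calls, and the unit prices are illustrative assumptions built around the rough six-to-one price ratio above, not measured values from our experiments.

```python
# Illustrative cost comparison; all prices and token counts are assumptions.
GPT4O_PRICE_PER_1K = 1.0          # arbitrary unit price per 1K tokens
O1_PREVIEW_PRICE_PER_1K = 6.0     # roughly 6x GPT-4o per token at evaluation time

def query_cost(price_per_1k: float, tokens_per_call: int, num_calls: int) -> float:
    """Total cost of answering one question with num_calls model calls."""
    return price_per_1k * (tokens_per_call / 1000) * num_calls

# Medprompt-style ensemble: several GPT-4o calls per question.
medprompt_cost = query_cost(GPT4O_PRICE_PER_1K, tokens_per_call=2000, num_calls=5)

# o1-preview: one call, but a longer output due to hidden reasoning tokens.
o1_cost = query_cost(O1_PREVIEW_PRICE_PER_1K, tokens_per_call=4000, num_calls=1)

print(f"Medprompt + GPT-4o: {medprompt_cost:.1f} units per question")
print(f"o1-preview:         {o1_cost:.1f} units per question")
```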
The o1-preview model exhibits distinct run-time behaviors compared with the GPT series. While some of our more dynamic prompting strategies performed better than expected with o1-preview, our most tried-and-true strategy was anything but consistent throughout our evaluation. Figure 4 captures performance results for tailored prompting, ensembling, and few-shot prompting with o1-preview; we summarize our findings in the sections that follow, and a strategy sketch appears below.
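One of the strategies in Figure 4, ensembling, can be sketched in a few lines: issue several calls for the same question with the multiple-choice options shuffled each time, then take a majority vote over the re-mapped answers. The `call_model` helper is again a hypothetical placeholder for a chat-completion call; this illustrates the general technique rather than our exact experimental harness.

```python
import random
from collections import Counter

def call_model(prompt: str) -> str:
    """Placeholder: replace with a call to the model under evaluation."""
    raise NotImplementedError

def ensemble_answer(question: str, options: dict[str, str], num_runs: int = 5) -> str:
    """Answer a multiple-choice question via choice-shuffled majority vote.

    options maps letters (e.g. "A") to answer text. Each run shuffles the
    options, asks the model for a letter, and maps it back to the original key.
    """
    votes = []
    original_items = list(options.items())
    letters = "ABCDE"[: len(original_items)]
    for _ in range(num_runs):
        shuffled = original_items[:]
        random.shuffle(shuffled)
        # Remember which original key sits behind each shuffled letter.
        letter_to_original = {letters[i]: key for i, (key, _) in enumerate(shuffled)}
        prompt = question + "\n" + "\n".join(
            f"{letters[i]}. {text}" for i, (_, text) in enumerate(shuffled)
        ) + "\nReply with a single letter."
        reply = call_model(prompt).strip().upper()[:1]
        if reply in letter_to_original:
            votes.append(letter_to_original[reply])
    return Counter(votes).most_common(1)[0][0]
```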
We expanded our research to include a new multilingual benchmark based on the Japanese national medical licensing exam. The JMLE (Japanese Medical Licensing Examination) is written in Japanese and was administered in February 2024, after the o1-preview model’s knowledge cutoff. Even without translating the questions into English, the o1-preview model achieved a remarkable 98.2% accuracy (Figure 5), well above the exam’s minimum passing score of approximately 80%.
For fun, we conducted tests to determine whether increasing the number of reasoning tokens could improve performance. We found that by adjusting the prompt we could consistently increase the number of reasoning tokens used by o1-preview, and that this increase correlated directly with improved performance, as shown in Figure 6.
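A minimal sketch of this kind of probe appears below: the same questions are asked with prompt suffixes that encourage progressively more deliberate reasoning, and the reasoning-token count reported in the API's usage metadata is recorded alongside accuracy. The `ask_o1` helper and the exact wording of the suffixes are hypothetical stand-ins, not the prompts used in our study.

```python
def ask_o1(prompt: str) -> tuple[str, int]:
    """Placeholder: send the prompt to o1-preview and return
    (answer_text, reasoning_token_count) from the API's usage metadata."""
    raise NotImplementedError

# Prompt suffixes intended to elicit progressively longer reasoning (illustrative).
SUFFIXES = [
    "",  # baseline: the question alone
    "Think carefully before answering.",
    "Reason through every option, double-check your work, and only then answer.",
]

def run_reasoning_token_probe(questions: list[dict]) -> list[dict]:
    """For each suffix, record average reasoning tokens and accuracy.

    questions is a list of {"text": ..., "answer": ...} items.
    """
    results = []
    for suffix in SUFFIXES:
        token_counts, correct = [], 0
        for q in questions:
            prompt = q["text"] if not suffix else f"{q['text']}\n\n{suffix}"
            answer, reasoning_tokens = ask_o1(prompt)
            token_counts.append(reasoning_tokens)
            correct += int(answer.strip() == q["answer"])
        results.append({
            "suffix": suffix or "(baseline)",
            "avg_reasoning_tokens": sum(token_counts) / len(token_counts),
            "accuracy": correct / len(questions),
        })
    return results
```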
Bottom line: There’s a little something for everyone when it comes to run-time strategies. We’re excited by the performance gains from GPT models to o1-preview models. While these improvements are significant, so is the cost. For those needing proven accuracy on a budget, Medprompt with calls to GPT-4 remains a viable option for medicine and beyond. We summarize the relative performance of prompting strategies in Figure 7 to help determine the best option, or check out the paper for a detailed breakdown of every dataset, experimental configuration, and prompt template.
We highlight several considerations in the paper that are worth checking out, including three opportunities for run-time strategies that are top of mind for us going forward.