AI is reshaping industries—from education to healthcare—thanks to advancements in large language models (LLMs). These models rely on prompts, carefully crafted inputs that guide them to produce relevant and meaningful outputs. While the impact of prompts is profound, creating prompts that can help with complex tasks is a time-intensive and expertise-heavy process, often involving months of trial and error.
This challenge grows as new tasks arise and models evolve rapidly, making manual methods for prompt engineering increasingly unsustainable. The question then becomes: How can we make prompt optimization faster, more accessible, and more adaptable across diverse tasks?
To address this challenge, we developed PromptWizard (PW), a research framework that automates and streamlines the process of prompt optimization. We are open sourcing the PromptWizard codebase to foster collaboration and innovation within the research and development community.
PromptWizard combines iterative feedback from LLMs with efficient exploration and refinement techniques to create highly effective prompts within minutes.
PromptWizard optimizes both the instruction and the in-context learning examples. Central to PW is its self-evolving and self-adaptive mechanism, where the LLM iteratively generates, critiques, and refines prompts and examples in tandem. This process ensures continuous improvement through feedback and synthesis, achieving a holistic optimization tailored to the specific task at hand. By evolving both instructions and examples simultaneously, PW ensures significant gains in task performance.
PromptWizard rests on three key insights: feedback-driven refinement of instructions, joint critique and synthesis of instructions and in-context examples, and self-generated reasoning chains that enrich the few-shot examples.
PromptWizard begins with a user input: a problem description, an initial prompt instruction, and a few training examples that serve as a foundation for the task at hand.
Its output is a refined, optimized set of prompt instructions paired with carefully curated in-context few-shot examples. These outputs are enriched with detailed reasoning chains, task intent, and an expert profile that bridges human-like reasoning with the AI’s responses.
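To make these inputs and outputs concrete, here is a minimal sketch in Python. The class and field names are illustrative only and do not correspond to the open-source library's actual API.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class PromptWizardInput:
    """What the user supplies. Names are illustrative, not the library's API."""
    problem_description: str        # plain-language description of the task
    initial_instruction: str        # starting prompt instruction to refine
    train_examples: List[Dict]      # a handful of {"question": ..., "answer": ...} pairs

@dataclass
class OptimizedPrompt:
    """What the optimizer returns."""
    instruction: str                # refined task instruction
    expert_profile: str             # persona that frames how the model should respond
    task_intent: str                # short statement of what the task requires
    few_shot_examples: List[Dict]   # curated examples enriched with reasoning chains

example_input = PromptWizardInput(
    problem_description="Solve grade-school math word problems.",
    initial_instruction="Answer the following math question.",
    train_examples=[{"question": "What is 2 + 2 * 3?", "answer": "8"}],
)
```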
Stage 1: Refinement of prompt instruction
The first stage focuses on refining the task instructions of a prompt. PromptWizard generates multiple candidate instructions, evaluates them using feedback from the LLM, and iteratively synthesizes improved versions. This process balances exploration—trying diverse ideas—and exploitation—refining the most promising ones.
For example, if an initial instruction yields suboptimal results, PW incorporates feedback to identify its shortcomings and generates an improved version. Over three to five iterations, this cycle steadily converges on a well-optimized instruction.
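As a rough illustration of this generate, score, critique, and synthesize loop, the sketch below assumes a generic `llm` callable that maps a prompt string to a response string; the function names, prompts, and scoring are simplified stand-ins rather than the repository's implementation.

```python
def score(llm, instruction, examples):
    """Fraction of training examples answered correctly under a candidate instruction."""
    correct = sum(
        llm(f"{instruction}\nQ: {ex['question']}\nA:").strip() == ex["answer"]
        for ex in examples
    )
    return correct / max(len(examples), 1)

def refine_instruction(llm, instruction, train_examples, rounds=5, num_candidates=4):
    """Simplified Stage 1 loop: generate candidate instructions, keep the best-scoring
    one (exploitation), then critique and rewrite it (feedback-driven refinement)."""
    best = instruction
    for _ in range(rounds):
        # Exploration: ask the LLM for several rephrased variants of the current best.
        candidates = [llm(f"Rewrite this task instruction differently:\n{best}")
                      for _ in range(num_candidates)] + [best]
        # Exploitation: keep the candidate that scores highest on the training examples.
        best = max(candidates, key=lambda c: score(llm, c, train_examples))
        # Feedback: have the LLM critique the winner and synthesize an improved version.
        critique = llm(f"Critique the weaknesses of this instruction for the task:\n{best}")
        best = llm(f"Improve the instruction using this critique.\n"
                   f"Instruction: {best}\nCritique: {critique}")
    return best
```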
Stage 2: Joint optimization of instructions and examples
The refined prompt obtained from Stage 1 is combined with carefully selected examples, and both are optimized together. Through the critique-and-synthesis mechanism, PromptWizard ensures alignment between the prompt and examples, simultaneously synthesizing new examples to enhance task performance.
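The following sketch extends the Stage 1 example with a hypothetical `optimize_examples` function to illustrate the critique-and-synthesis loop over examples; `parse_examples` is a stand-in for whatever structured parsing the real pipeline performs, and none of this is the repository's actual code.

```python
def parse_examples(text):
    """Stand-in parser: a real pipeline would parse the LLM output into structured
    {"question", "reasoning", "answer"} records; here we simply wrap the raw blocks."""
    return [{"raw": block.strip()} for block in text.split("\n\n") if block.strip()]

def optimize_examples(llm, instruction, train_examples, rounds=3):
    """Simplified Stage 2 loop: critique the current few-shot examples against the
    refined instruction, then synthesize a better-aligned set with reasoning chains."""
    examples = list(train_examples)
    for _ in range(rounds):
        critique = llm(
            "Given this instruction and these examples, point out examples that are "
            "misaligned, too easy, or missing reasoning.\n"
            f"Instruction: {instruction}\nExamples: {examples}"
        )
        revised = llm(
            "Rewrite the example set to address the critique, adding a step-by-step "
            "reasoning chain to every example.\n"
            f"Critique: {critique}\nExamples: {examples}"
        )
        examples = parse_examples(revised)
    return instruction, examples
```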
This structured approach makes PromptWizard highly versatile, adapting to tasks as varied as solving math problems or generating creative content.
PromptWizard stands out for its feedback-driven refinement and systematic exploration, delivering exceptional results across a wide variety of tasks while maintaining computational efficiency.
PromptWizard was rigorously evaluated on over 45 tasks, spanning both general and domain-specific challenges. Benchmarked against state-of-the-art techniques—including Instinct, InstructZero, APE, PromptBreeder, EvoPrompt, DSPy, APO, and PromptAgent—PW consistently outperformed competitors in accuracy, efficiency, and adaptability. Please see detailed results in our paper.
Table 1. Cost of prompt optimization: API calls and total tokens per method.

| Methods | API calls | Total tokens |
|---|---|---|
| Instinct | 1,730 | 115k |
| PromptBreeder | 18,600 | 1,488k |
| EvoPrompt | 5,000 | 400k |
| PW | 69 | 24k |
We have also conducted numerous experiments to highlight PromptWizard’s efficacy with limited training data and smaller LLMs.
Real-world scenarios often lack abundant training data. PW excels in such conditions, requiring as few as five examples to produce effective prompts. Across five diverse datasets, PW demonstrated an average accuracy drop of only 5% when using five examples compared to 25 examples—highlighting its adaptability and efficiency (see Table 2).
Table 2. Accuracy (%) with five vs. 25 in-context examples.

| Datasets | 5 examples | 25 examples |
|---|---|---|
| MMLU | 80.4 | 89.5 |
| GSM8k | 94 | 95.4 |
| Ethos | 86.4 | 89.4 |
| PubMedQA | 68 | 78.2 |
| MedQA | 80.4 | 82.9 |
| Average | 81.9 | 87 |
PromptWizard also reduces computational costs by using smaller LLMs for prompt generation, reserving more powerful models for inference. For example, using Llama-70B for prompt generation resulted in negligible performance differences compared to GPT-4, while significantly lowering resource usage (see Table 3).
Table 3. Accuracy (%) by the model used for prompt generation.

| Dataset | Prompt generation: Llama-70B | Prompt generation: GPT-4 |
|---|---|---|
| GSM8k | 94.6 | 95.4 |
| Ethos | 89.2 | 89.4 |
| Average | 91.9 | 92.4 |
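As a rough illustration of this division of labor, a configuration along the following lines could assign the two roles to different models; the keys and values are hypothetical and not the library's actual configuration schema.

```python
# Hypothetical two-model setup: the smaller model handles the many generation and
# critique calls made during optimization, while the stronger model is reserved for
# answering end-user queries with the finished prompt. Keys are illustrative only.
pipeline_config = {
    "prompt_gen_model": {"name": "llama-70b", "temperature": 0.7},  # iterative refinement
    "inference_model": {"name": "gpt-4", "temperature": 0.0},       # final task answers
    "refinement_rounds": 5,   # Stage 1 iterations
    "num_few_shot": 5,        # PW works well with as few as five examples (Table 2)
}
```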
PromptWizard shows that effective prompts combine optimized instructions refined through iterative feedback, thoughtfully chosen in-context examples, and a modular design that incorporates expert knowledge and task-specific intent. This approach enables the framework to handle a broad range of tasks, from simple to highly complex, with exceptional efficiency and flexibility.
Whether you are a researcher addressing cutting-edge challenges or an organization looking to streamline workflows, PromptWizard provides a practical, scalable, and impactful solution for enhancing model performance.