Added delta for Prompt Migration Cookbook #1942

Open · wants to merge 1 commit into base: main
81 changes: 80 additions & 1 deletion examples/Prompt_migration_guide.ipynb
@@ -698,6 +698,85 @@
"Consistent testing and refinement ensure your prompts consistently achieve their intended results."
]
},
{
"cell_type": "markdown",
"id": "cac0dc7f",
"metadata": {},
"source": [
"### Current Example\n",
"\n",
"Let’s evaluate whether our current working prompt has improved as a result of prompt migration. The original prompt, drawn from this [paper](https://arxiv.org/pdf/2306.05685), is designed to serve as a judge between two assistants’ answers. Conveniently, the paper provides a set of human-annotated judgments and assesses the LLM judge based on its agreement with these human ground truths.\n",
"\n",
"Our goal here is to measure how closely the judgments generated by our migrated prompt align with human evaluations. For context, the benchmark we’re using is a subset of MT-Bench, which features multi-turn conversations. In this example, we’re evaluating 200 conversation rows, each comparing the performance of different model pairs.\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "6f50f9a0",
"metadata": {},
"source": [
"On our evaluation subset, a useful reference point is human-human agreement, since each conversation is rated by multiple annotators. For turn 1 (without ties), humans agree with each other in 81% of cases, and for turn 2, in 76% of cases."
]
},
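The human-human agreement numbers above come from comparing each pair of annotators on the same conversation and dropping ties. A minimal sketch of that computation (the verdict labels and list-of-lists format are assumptions for illustration, not the actual MT-Bench schema):

```python
from itertools import combinations

def pairwise_agreement(ratings_per_conversation, tie_label="tie"):
    """Fraction of annotator pairs giving the same non-tie verdict.

    ratings_per_conversation: one inner list of verdicts
    ("A", "B", or tie_label) per conversation.
    """
    agree, total = 0, 0
    for ratings in ratings_per_conversation:
        # Drop ties, matching the paper's "without ties" setting.
        votes = [r for r in ratings if r != tie_label]
        for a, b in combinations(votes, 2):
            total += 1
            agree += a == b
    return agree / total if total else 0.0

# Toy example: three annotators on two conversations.
print(pairwise_agreement([["A", "A", "B"], ["B", "B", "tie"]]))  # → 0.5
```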
{
"cell_type": "markdown",
"id": "20e4d610",
"metadata": {},
"source": [
"![Graph 1 for Model Agreement](../images/prompt_migrator_fig1.png)"
]
},
{
"cell_type": "markdown",
"id": "a3eccb6c",
"metadata": {},
"source": [
"Comparing this to our models before migration, GPT-4 (as used in the paper) achieves 74% agreement with human judgments on turn 1 and 71% on turn 2, which is respectable but still below the human-human upper bound."
]
},
{
"cell_type": "markdown",
"id": "91dc3d38",
"metadata": {},
"source": [
"![Graph 2 for Model Agreement](../images/prompt_migrator_fig2.png)"
]
},
{
"cell_type": "markdown",
"id": "9f7e206f",
"metadata": {},
"source": [
"\n",
"Switching to GPT-4.1 (using the same prompt) improves the agreement: 78% (65/83) on turn 1 and 72% (61/85) on turn 2."
]
},
{
"cell_type": "markdown",
"id": "800da674",
"metadata": {},
"source": [
"\n",
"Finally, after migrating and tuning our prompt specifically for GPT-4.1, the agreement climbs further, reaching 80% on turn 1 and 72% on turn 2, nearly matching the level of agreement between the human annotators themselves."
]
},
{
"cell_type": "markdown",
"id": "7af0337b",
"metadata": {},
"source": [
"![Graph 3 for Model Agreement](../images/prompt_migrator_fig3.png)"
]
},
{
"cell_type": "markdown",
"id": "43ae2ba5",
"metadata": {},
"source": [
"Taken together, these results show that prompt migration and model upgrades improve judge-human agreement on our sample task. Go ahead and try it on yours!"
]
},
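The agreement metric reported throughout, e.g. 78% (65/83) on turn 1, is simply the fraction of conversations where the LLM judge's verdict matches the human label. A hedged sketch (the verdict lists and labels are hypothetical; the notebook's actual data loading is not shown here):

```python
def judge_human_agreement(judge_verdicts, human_verdicts):
    """Count rows where the judge's verdict matches the human label.

    Returns (matches, total) so results can be reported as e.g. 65/83.
    """
    assert len(judge_verdicts) == len(human_verdicts)
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches, len(judge_verdicts)

# Toy example with four conversation rows.
matches, total = judge_human_agreement(["A", "B", "A", "A"],
                                       ["A", "B", "B", "A"])
print(f"{matches}/{total} = {matches / total:.0%}")  # → 3/4 = 75%
```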
{
"cell_type": "markdown",
"id": "c3ed1776",
@@ -883,7 +962,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
"version": "3.12.9"
}
},
"nbformat": 4,
Binary file added images/prompt_migrator_fig1.png
Binary file added images/prompt_migrator_fig2.png
Binary file added images/prompt_migrator_fig3.png