Added delta for Prompt Migration Cookbook #1942

Open · wants to merge 1 commit into base: main
81 changes: 80 additions & 1 deletion examples/Prompt_migration_guide.ipynb
@@ -698,6 +698,85 @@
"Consistent testing and refinement ensure your prompts consistently achieve their intended results."
]
},
{
"cell_type": "markdown",
"id": "cac0dc7f",
"metadata": {},
"source": [
"### Current Example\n",
"\n",
"Let’s evaluate whether our current working prompt has improved as a result of prompt migration. The original prompt, drawn from this [paper](https://arxiv.org/pdf/2306.05685), is designed to serve as a judge between two assistants’ answers. Conveniently, the paper provides a set of human-annotated judgments and assesses the LLM judge based on its agreement with these human ground truths.\n",
"\n",
"Our goal here is to measure how closely the judgments generated by our migrated prompt align with human evaluations. For context, the benchmark we’re using is a subset of MT-Bench, which features multi-turn conversations. In this example, we’re evaluating 200 conversation rows, each comparing the performance of different model pairs.\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "6f50f9a0",
"metadata": {},
"source": [
"On our evaluation subset, a useful reference point is human-human agreement, since each conversation is rated by multiple annotators. For turn 1 (without ties), humans agree with each other in 81% of cases, and for turn 2, in 76% of cases."
]
},
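The human-human agreement numbers above come from comparing each pair of annotators on the same conversation and dropping ties. A minimal sketch of that computation (the verdict labels and list-of-lists format are assumptions for illustration, not the actual MT-Bench schema):

```python
from itertools import combinations

def pairwise_agreement(ratings_per_conversation, tie_label="tie"):
    """Fraction of annotator pairs giving the same non-tie verdict.

    ratings_per_conversation: one inner list of verdicts
    ("A", "B", or tie_label) per conversation.
    """
    agree, total = 0, 0
    for ratings in ratings_per_conversation:
        # Drop ties, matching the paper's "without ties" setting.
        votes = [r for r in ratings if r != tie_label]
        for a, b in combinations(votes, 2):
            total += 1
            agree += a == b
    return agree / total if total else 0.0

# Toy example: three annotators on two conversations.
print(pairwise_agreement([["A", "A", "B"], ["B", "B", "tie"]]))  # → 0.5
```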
{
"cell_type": "markdown",
"id": "20e4d610",
"metadata": {},
"source": [
"![Graph 1 for Model Agreement](../images/prompt_migrator_fig1.png)"
]
},
{
"cell_type": "markdown",
"id": "a3eccb6c",
"metadata": {},
"source": [
"Comparing this to our models before migration, GPT-4 (as used in the paper) achieves 74% agreement with human judgments on turn 1 and 71% on turn 2, which is respectable but still below the human-human upper bound."
]
},
{
"cell_type": "markdown",
"id": "91dc3d38",
"metadata": {},
"source": [
"![Graph 2 for Model Agreement](../images/prompt_migrator_fig2.png)"
]
},
{
"cell_type": "markdown",
"id": "9f7e206f",
"metadata": {},
"source": [
"\n",
"Switching to GPT-4.1 (using the same prompt) improves the agreement: 78% (65/83) on turn 1 and 72% (61/85) on turn 2."
]
},
{
"cell_type": "markdown",
"id": "800da674",
"metadata": {},
"source": [
"\n",
"Finally, after migrating and tuning our prompt specifically for GPT-4.1, the agreement climbs further, reaching 80% on turn 1 and 72% on turn 2, nearly matching the level of agreement between the human annotators themselves."
]
},
{
"cell_type": "markdown",
"id": "7af0337b",
"metadata": {},
"source": [
"![Graph 3 for Model Agreement](../images/prompt_migrator_fig3.png)"
]
},
{
"cell_type": "markdown",
"id": "43ae2ba5",
"metadata": {},
"source": [
"Taken together, these results show that prompt migration and model upgrades improve judge-human agreement on our sample task. Go ahead and try it on yours!"
]
},
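The agreement metric reported throughout, e.g. 78% (65/83) on turn 1, is simply the fraction of conversations where the LLM judge's verdict matches the human label. A hedged sketch (the verdict lists and labels are hypothetical; the notebook's actual data loading is not shown here):

```python
def judge_human_agreement(judge_verdicts, human_verdicts):
    """Count rows where the judge's verdict matches the human label.

    Returns (matches, total) so results can be reported as e.g. 65/83.
    """
    assert len(judge_verdicts) == len(human_verdicts)
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches, len(judge_verdicts)

# Toy example with four conversation rows.
matches, total = judge_human_agreement(["A", "B", "A", "A"],
                                       ["A", "B", "B", "A"])
print(f"{matches}/{total} = {matches / total:.0%}")  # → 3/4 = 75%
```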
{
"cell_type": "markdown",
"id": "c3ed1776",
@@ -883,7 +962,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
"version": "3.12.9"
}
},
"nbformat": 4,
Binary file added images/prompt_migrator_fig1.png
Binary file added images/prompt_migrator_fig2.png
Binary file added images/prompt_migrator_fig3.png