Commit 61353e1

Commit message: f
Signed-off-by: Chris Abraham <[email protected]>
1 parent 8e1006e commit 61353e1

_posts/2025-02-25-accelerating-generative-ai-2.md

Lines changed: 5 additions & 3 deletions
@@ -714,8 +714,10 @@ MPS
 </td>
 </tr>
 <tr>
-<td>AO \
-+ batching \
+<td>AO
+<br/>
++ batching
+<br/>
 + compile (warm)
 </td>
 <td>113
@@ -1160,7 +1162,7 @@ Finally, we deployed our optimized inference onto [Modal](https://modal.com), a
 
 In particular, compilation and AOTI via torch.export requires extra work. In a naïve deployment that work might be added to every single inference execution, adding latency that dwarfs any improvements from a faster model. This is particularly challenging with elastic or autoscaling infrastructure, where replicas of our inference service need to be regularly and automatically created and destroyed.
 
-We share a deployment script in the torchao repository (<code>[cli_on_modal.py](https://github.com/pytorch/ao/tree/main/examples/sam2_amg_server)</code>) to demonstrate one pattern for an elastic deployment. We build the exported models ahead of time and then upload them to [distributed storage](https://modal.com/docs/guide/volumes). Relative to eager execution, this adds a bit of extra work when replicas spin up since they need to read this data over a network, but this is far less costly than compilation or export.
+We share a deployment script in the torchao repository ([cli_on_modal.py](https://github.com/pytorch/ao/tree/main/examples/sam2_amg_server)) to demonstrate one pattern for an elastic deployment. We build the exported models ahead of time and then upload them to [distributed storage](https://modal.com/docs/guide/volumes). Relative to eager execution, this adds a bit of extra work when replicas spin up since they need to read this data over a network, but this is far less costly than compilation or export.
 
 We benchmarked this deployment with a large batch inference workload: sending 1000 images for concurrent processing. The deployment scales up to ten replicas on ten GPUs at peak and scales down to zero GPUs when inactive.
 
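The context paragraph kept in the hunk above notes that compilation and AOT Inductor (AOTI) via torch.export add work that should not be repeated on every inference call. As a rough illustration of doing that work once, ahead of deployment, here is a minimal sketch assuming PyTorch 2.6+ and the torch._inductor packaging helpers; TinyModel, its shapes, and the tiny_model.pt2 path are placeholders, not the SAM2 pipeline from the post.

```python
# build_step.py: run once, ahead of deployment (a sketch, not the post's actual pipeline)
import torch
from torch._inductor import aoti_compile_and_package, aoti_load_package


class TinyModel(torch.nn.Module):
    """Placeholder standing in for the real exported model components."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1024, 1024)

    def forward(self, x):
        return torch.relu(self.linear(x))


model = TinyModel().eval()
example_inputs = (torch.randn(8, 1024),)

# Export once and compile with AOT Inductor into a portable .pt2 package.
exported = torch.export.export(model, example_inputs)
package_path = aoti_compile_and_package(exported, package_path="tiny_model.pt2")

# At serving time a replica only loads the prebuilt artifact; nothing is recompiled.
runner = aoti_load_package(package_path)
print(runner(*example_inputs).shape)
```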

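The rewritten sentence points readers at cli_on_modal.py and Modal's distributed storage. A rough sketch of that pattern follows: a Modal class reads the prebuilt .pt2 package from a Volume when a replica starts, so no compilation or export happens at serving time. The app name, Volume name, file path, and class are all illustrative, and this sketch stays on CPU to match the package built above (the post's deployment runs on GPUs); the real script lives in the torchao repository.

```python
# serve_sketch.py: an illustrative Modal deployment, not the actual cli_on_modal.py script
import modal

app = modal.App("aoti-inference-sketch")

# Hypothetical Volume that already holds the prebuilt tiny_model.pt2 package.
volume = modal.Volume.from_name("model-artifacts")

image = modal.Image.debian_slim().pip_install("torch")


@app.cls(image=image, volumes={"/models": volume})
class Inference:
    @modal.enter()
    def load(self):
        # Read the prebuilt AOTI package from the Volume when a replica spins up;
        # this is a network read, but far cheaper than compiling or exporting here.
        from torch._inductor import aoti_load_package

        self.runner = aoti_load_package("/models/tiny_model.pt2")

    @modal.method()
    def predict(self, batch):
        import torch

        x = torch.tensor(batch, dtype=torch.float32)
        return self.runner(x).tolist()
```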
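The benchmark described in the second hunk sends 1000 images for concurrent processing. A hypothetical client-side sketch of issuing that kind of concurrent workload, assuming a recent Modal client where modal.Cls.from_name is available and reusing the illustrative names from the previous sketch:

```python
# client_sketch.py: fan out many concurrent requests against the deployed class
from concurrent.futures import ThreadPoolExecutor

import modal

# Hypothetical app and class names matching the sketch above, not the post's deployment.
Inference = modal.Cls.from_name("aoti-inference-sketch", "Inference")
service = Inference()

# 1000 dummy inputs stand in for the images; issuing the calls concurrently lets
# Modal's autoscaler add replicas under load and scale back to zero afterwards.
inputs = [[[0.0] * 1024 for _ in range(8)] for _ in range(1000)]

with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(lambda batch: service.predict.remote(batch), inputs))

print(f"processed {len(results)} batches")
```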