
Update Flash-Decoding blogpost #1492


Merged · 2 commits merged on Oct 20, 2023
4 changes: 2 additions & 2 deletions _posts/2023-10-13-flash-decoding.md
@@ -76,7 +76,7 @@ We also micro-benchmark the scaled multi-head attention for various sequence lengths

| | | | |
| ------------------- | ------------- | ---------------------- | -------------- |
- | Setting \ Algorithm | PyTorch Eager | Flash-Attention v2.0.9 | Flash-Decoding |
+ | Setting \ Algorithm | PyTorch Eager (us) | Flash-Attention v2.0.9 (us) | Flash-Decoding (us) |
| B=256, seqlen=256 | 3058.6 | 390.5 | 63.4 |
| B=128, seqlen=512 | 3151.4 | 366.3 | 67.7 |
| B=64, seqlen=1024 | 3160.4 | 364.8 | 77.7 |
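
For context on how latencies like the ones in this table can be measured, below is a minimal sketch of a decoding-step micro-benchmark built on PyTorch's `scaled_dot_product_attention`. The head count, head dimension, dtype, and timing loop are illustrative assumptions, not the actual script behind the numbers above.

```python
# Illustrative micro-benchmark of single-token (decoding) attention latency.
# Requires a CUDA GPU and PyTorch >= 2.0. Shapes are assumptions for illustration.
import torch
import torch.nn.functional as F


def bench_decode_attention(batch, seqlen, n_heads=64, head_dim=128, iters=100):
    device, dtype = "cuda", torch.float16
    # One new query token per sequence, attending to the full KV cache.
    q = torch.randn(batch, n_heads, 1, head_dim, device=device, dtype=dtype)
    k = torch.randn(batch, n_heads, seqlen, head_dim, device=device, dtype=dtype)
    v = torch.randn(batch, n_heads, seqlen, head_dim, device=device, dtype=dtype)

    # Warm-up, then time with CUDA events for accurate GPU latency.
    for _ in range(10):
        F.scaled_dot_product_attention(q, k, v)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    end.record()
    torch.cuda.synchronize()
    # elapsed_time returns milliseconds; convert to microseconds per call.
    return end.elapsed_time(start) / iters * 1000


for b, s in [(256, 256), (128, 512), (64, 1024)]:
    print(f"B={b}, seqlen={s}: {bench_decode_attention(b, s):.1f} us")
```

The query side has length 1 during decoding, which is what makes this workload memory-bound on the KV cache and motivates splitting the keys/values loading across blocks as Flash-Decoding does.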
@@ -105,4 +105,4 @@ A full example of decoding with LLaMa v2 / CodeLLaMa is available in the FlashAttention repo

### Acknowledgements

- Thanks to Erich Elsen, Ashish Vaswani, and Michaël Benesty for suggesting this idea of splitting the KV cache loading. We want to thank Jeremy Reizenstein, Patrick Labatut and Andrew Tulloch for the valuable discussions. We also want to thank Geeta Chauhan and Gregory Chanan for helping with the writing and more broadly contributing to getting this published on the PyTorch blog.
+ Thanks to Erich Elsen, Ashish Vaswani, and Michaël Benesty for suggesting this idea of splitting the KV cache loading. We want to thank Jeremy Reizenstein, Patrick Labatut and Andrew Tulloch for the valuable discussions, and Quentin Carbonneaux for contributing the efficient decoding example to xFormers. We also want to thank Geeta Chauhan and Gregory Chanan for helping with the writing and more broadly contributing to getting this published on the PyTorch blog.