A while back, this PR mentioned that, since Llama 2 70B uses GQA, there is a specific k-quantization trick that allows quantizing it at higher quality with only a marginal increase in model size:
Mistral 7B, a very popular model released after this PR was opened, also uses Grouped-Query Attention.
Checking whether a 7B model is a Mistral model (or, more generally, whether it uses GQA at all) and applying the same treatment should, unless I am mistaken, provide similar gains; see the sketch below for what that check might look like.
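For illustration, here is a minimal sketch of the kind of check I mean, loosely modeled on llama.cpp's per-tensor quantization heuristics. The helper name `pick_attn_v_type`, its signature, and the `attn_v.weight` tensor-name match are my assumptions for this sketch, not the actual implementation; `ggml_type` and `GGML_TYPE_Q6_K` come from `ggml.h`.

```cpp
#include <cstdint>
#include <string>

#include "ggml.h" // for ggml_type / GGML_TYPE_Q6_K

// Hypothetical helper: pick a quantization type for the attention V
// projection. A model uses GQA whenever it has fewer KV heads than
// query heads, which covers Llama 2 70B and Mistral 7B alike without
// special-casing any particular model name.
static ggml_type pick_attn_v_type(uint32_t n_head, uint32_t n_head_kv,
                                  const std::string & tensor_name,
                                  ggml_type default_type) {
    const bool is_gqa = n_head > n_head_kv;

    // The exact tensor name depends on the file format version.
    if (is_gqa && tensor_name.find("attn_v.weight") != std::string::npos) {
        // Under GQA the V projection is n_head / n_head_kv times smaller
        // than with full multi-head attention, so storing it at higher
        // precision costs only a marginal increase in total model size.
        return GGML_TYPE_Q6_K;
    }
    return default_type;
}
```

The point is that keying on the head counts rather than the model name would make the trick apply automatically to any future GQA model, not just Llama 2 70B and Mistral 7B.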
I think quantization optimization is sorely overlooked in general; there is a lot of low-hanging fruit there for sure...