Sentence Transformers v5.0 - Sparse Encoder Models #2924
Conversation
…e-free Splade for clarity
…ing module and change lambda associated code
Format looks ok, approving to unlock. I'll read it in depth on Monday.
```
'##lal', 'severe', '##pha', 'ce', '##gia', 'patient', 'complaint', 'patients', 'complained', 'warning', 'suffered', 'had', 'disease', 'complain', 'diagnosis', 'syndrome', 'mild', 'pain', 'hospital', 'injury'
```
Super interesting. Is this tokenizer related, or not at all?
Yeah, the model thinks these tokens best describe the text. Very broadly, the model outputs a sparse embedding, and every non-zero value corresponds to one of the tokens above. The `##lal` is just how the tokenizer denotes subwords. If you rearrange them a bit, you can see that it spelled `cephalalgia` with tokens 1, 3, 4, and 5.
Beyond just taking tokens from the input text, it also takes related/synonym tokens (complaint, warning, suffered) that might also be used to describe other related documents.
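For reference, here's a minimal sketch of how such a token list can be produced, assuming the v5 `SparseEncoder` API; the checkpoint name is just an illustrative SPLADE model:

```python
from sentence_transformers import SparseEncoder

# Illustrative SPLADE checkpoint; any sparse encoder should behave similarly
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Encode a short medical complaint into a sparse, vocabulary-sized embedding
embedding = model.encode("The patient complained of severe cephalalgia.")

# Map the highest-weighted non-zero dimensions back to vocabulary tokens
for token, weight in model.decode(embedding, top_k=20):
    print(f"{token}: {weight:.2f}")
```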
Yes, I noticed that. My question was more about whether having `cephalalgia` as a single token would help the model expand better or not (i.e., does tokenization have an impact on performance, for this particular use).
Oh, oops. In that case: yes. A very big impact, actually. To the extent that more modern tokenizers sometimes actually perform worse when turned into sparse encoders, because their tokens are "reused" too often between words with wildly different meanings. On paper, a tokenizer with as few subword tokens as possible is likely going to be best.
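As a quick illustration (a hypothetical check, not taken from the blogpost), you can see the subword splitting directly with a standard WordPiece tokenizer; `bert-base-uncased` is just an example checkpoint:

```python
from transformers import AutoTokenizer

# An example WordPiece tokenizer of the kind used by BERT-style sparse encoders
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare medical term is split into several subword pieces, which is why the
# sparse embedding above activates on fragments like 'ce' and '##lal'
print(tokenizer.tokenize("cephalalgia"))
# e.g. ['ce', '##pha', '##lal', '##gia']
```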
Awesome, thanks a lot for the clarification! Makes sense, I can ~visualize how the model "activates" for the tokens in the sequence, and there's no concept of sorting or grouping.
Hello!
Pull Request overview
Details
This blogpost is our latest "blogpostification" of our "Training Overview" documentation, much like I've done for https://huggingface.co/blog/train-sentence-transformers and https://huggingface.co/blog/train-reranker for v3 and v4, respectively. They are meant to be solid for SEO, with each of the prior blogposts showing up when looking for training embedding models or rerankers.
Because of this approach, the text has already been reviewed a few times, so a very thorough review may not be necessary.
Preparing the Article
You're not quite done yet, though. Please make sure to follow this process (as documented here):
- There are no requirements here, but there is a template if it's helpful. I'm reusing the older ones, as I've done for the v3.0 and v4.0 blogposts.
- …`.md` file. You can also specify `guest` or `org` for the authors.
- Here is an example of a complete PR: #2382
Getting a Review
cc @pcuenca
As mentioned above, a thorough review may not be necessary as the content is very similar to e.g. https://huggingface.co/blog/train-reranker. A double-check of the _blog.yml and metadata above the blog itself is definitely appreciated, though!
cc @arthurbr11 my co-author.