Sentence Transformers v5.0 - Sparse Encoder Models #2924

Merged: 25 commits into huggingface:main on Jul 1, 2025

Conversation

tomaarsen (Member):

Hello!

Pull Request overview

  • Add the Sentence Transformers v5.0 blogpost introducing Sparse Encoder Models

Details

This blogpost is our latest "blogpostification" of our "Training Overview" documentation, much like I did for https://huggingface.co/blog/train-sentence-transformers and https://huggingface.co/blog/train-reranker for v3 and v4, respectively. These posts are meant to be solid for SEO: each of the prior blogposts shows up when searching for how to train embedding models or rerankers.

Because of this approach, the text has already been reviewed a few times, so a very thorough review may not be necessary.

Preparing the Article

You're not quite done yet, though. Please make sure to follow this process (as documented here):

  • Add an entry to _blog.yml (a rough sketch of what this can look like follows this list).
  • Add a thumbnail. There are no requirements here, but there is a template if it's helpful. I'm reusing the older ones, as I've done for the v3.0 and v4.0 blogposts.
  • Check that you use a short title and blog path. Short enough, I hope.
  • Upload any additional assets (such as images) to the Documentation Images repo. This is to reduce bloat in the GitHub base repo when cloning and pulling. Try to have small images to avoid a slow or expensive user experience.
  • Add metadata (such as authors) to your md file. You can also specify guest or org for the authors.
  • Ensure the publication date is correct.
  • Preview the content. A quick way is to paste the markdown content into https://huggingface.co/new-blog. Do not click publish; this is just a way to do an early check.
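For reference, here is a loose sketch of what such a _blog.yml entry tends to look like; the slug, title, thumbnail path, date, and tags below are placeholders modelled on existing entries, not the actual values used for this post.

```yaml
# Placeholder entry; follow the conventions of the existing entries in _blog.yml
- local: train-sparse-encoder            # short blog path / slug
  title: "Sentence Transformers v5.0 - Sparse Encoder Models"
  author: tomaarsen
  thumbnail: /blog/assets/train-sparse-encoder/thumbnail.png
  date: July 1, 2025
  tags:
    - nlp
    - guide
```

The markdown file itself carries a similar YAML header with the title, thumbnail, and an authors list, where each author entry can additionally be marked as a guest or assigned an org.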

Here is an example of a complete PR: #2382

Getting a Review

cc @pcuenca
As mentioned above, a thorough review may not be necessary as the content is very similar to e.g. https://huggingface.co/blog/train-reranker. A double-check of the _blog.yml and metadata above the blog itself is definitely appreciated, though!

cc @arthurbr11, my co-author.

  • Tom Aarsen

pcuenca (Member) left a comment:

Format looks ok, approving to unlock. I'll read it in depth on Monday.


```
'##lal', 'severe', '##pha', 'ce', '##gia', 'patient', 'complaint', 'patients', 'complained', 'warning', 'suffered', 'had', 'disease', 'complain', 'diagnosis', 'syndrome', 'mild', 'pain', 'hospital', 'injury'
```
Member, on the excerpt above:

Super interesting. Is this tokenizer related, or not at all?

Member Author:

Yeah, the model thinks these tokens best describe the text. Very broadly, the model outputs a sparse embedding, and every non-zero value corresponds to one of the tokens above. The ##lal is just how the tokenizer denotes subwords. If you rearrange them a bit, you can see that it spelled cephalalgia with tokens 1, 3, 4, and 5.

Beyond just taking tokens from the input text, it also takes related/synonym tokens (complaint, warning, suffered) that might also be used to describe other related documents.
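As a side note for anyone reading along: output like the excerpt above can be reproduced with the new v5.0 SparseEncoder API roughly as sketched below; the checkpoint, input sentence, and top_k are illustrative choices, not necessarily the exact setup behind the excerpt.

```python
from sentence_transformers import SparseEncoder

# Illustrative checkpoint: any SPLADE-style sparse encoder should behave similarly.
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# The embedding is a vector over the tokenizer vocabulary, almost entirely zeros.
embedding = model.encode("The patient complained of severe cephalalgia.")

# Map the non-zero dimensions back to (token, weight) pairs, strongest first.
for token, weight in model.decode(embedding, top_k=20):
    print(f"{token}\t{weight:.2f}")
```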

Member:

Yes, I noticed that. My question was more about whether having cephalalgia as a single token would help the model expand better or not (i.e., does tokenization have an impact on performance, for this particular use).

Member Author:

Oh, oops. In that case: yes. A very big impact, actually. To the extent that more modern tokenizers sometimes actually perform worse when turned into sparse encoders, because their tokens are "reused" too often between words with wildly different meanings. On paper, a tokenizer with as few subword tokens as possible is likely going to be best.
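To make the tokenizer point a bit more concrete, here is a quick check of how a BERT-style vocabulary (an illustrative choice, close to what many SPLADE-style sparse encoders build on) splits the rare term from the excerpt:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; the exact split depends on the vocabulary used.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare medical term becomes several subword pieces, which the sparse encoder
# then has to activate jointly to represent a single concept.
print(tokenizer.tokenize("cephalalgia"))
# Likely something like: ['ce', '##pha', '##lal', '##gia']
```

A vocabulary that kept cephalalgia as one token would only need a single dimension to represent it, which is the intuition behind preferring tokenizers with fewer subword splits.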

Member:

Awesome, thanks a lot for the clarification! Makes sense, I can ~visualize how the model "activates" for the tokens in the sequence, and there's no concept of sorting or grouping.

tomaarsen merged commit ea837a5 into huggingface:main on Jul 1, 2025
1 check passed