Sentence Transformers v5.0 - Sparse Encoder Models #2924

Merged: 25 commits into huggingface:main on Jul 1, 2025

Conversation

tomaarsen (Member):

Hello!

Pull Request overview

  • Add the Sentence Transformers v5.0 blogpost introducing Sparse Encoder Models

Details

This blogpost is our latest "blogpostification" of our "Training Overview" documentation, much like I did for https://huggingface.co/blog/train-sentence-transformers and https://huggingface.co/blog/train-reranker for v3 and v4, respectively. These posts are meant to be solid for SEO: each of the prior blogposts shows up when searching for how to train embedding models or rerankers.

Because of this approach, the text has already been reviewed a few times, so a very thorough review may not be necessary.

Preparing the Article

You're not quite done yet, though. Please make sure to follow this process (as documented here):

  • Add an entry to _blog.yml (a rough sketch of what this can look like follows this list).
  • Add a thumbnail. There are no requirements here, but there is a template if it's helpful. I'm reusing the older ones, as I've done for the v3.0 and v4.0 blogposts.
  • Check that you use a short title and blog path. Short enough, I hope.
  • Upload any additional assets (such as images) to the Documentation Images repo. This is to reduce bloat in the GitHub base repo when cloning and pulling. Try to have small images to avoid a slow or expensive user experience.
  • Add metadata (such as authors) to your md file. You can also specify guest or org for the authors.
  • Ensure the publication date is correct.
  • Preview the content. A quick way is to paste the markdown content into https://huggingface.co/new-blog. Do not click publish; this is just a way to do an early check.
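For reference, here is a loose sketch of what such a _blog.yml entry tends to look like; the slug, title, thumbnail path, date, and tags below are placeholders modelled on existing entries, not the actual values used for this post.

```yaml
# Placeholder entry; follow the conventions of the existing entries in _blog.yml
- local: train-sparse-encoder            # short blog path / slug
  title: "Sentence Transformers v5.0 - Sparse Encoder Models"
  author: tomaarsen
  thumbnail: /blog/assets/train-sparse-encoder/thumbnail.png
  date: July 1, 2025
  tags:
    - nlp
    - guide
```

The markdown file itself carries a similar YAML header with the title, thumbnail, and an authors list, where each author entry can additionally be marked as a guest or assigned an org.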

Here is an example of a complete PR: #2382

Getting a Review

cc @pcuenca
As mentioned above, a thorough review may not be necessary as the content is very similar to e.g. https://huggingface.co/blog/train-reranker. A double-check of the _blog.yml and metadata above the blog itself is definitely appreciated, though!

cc @arthurbr11, my co-author.

  • Tom Aarsen

pcuenca (Member) left a comment:

Format looks ok, approving to unlock. I'll read it in depth on Monday.


```
'##lal', 'severe', '##pha', 'ce', '##gia', 'patient', 'complaint', 'patients', 'complained', 'warning', 'suffered', 'had', 'disease', 'complain', 'diagnosis', 'syndrome', 'mild', 'pain', 'hospital', 'injury'
```
Member, on the excerpt above:

Super interesting. Is this tokenizer related, or not at all?

Member Author:

Yeah, the model thinks these tokens best describe the text. Very broadly, the model outputs a sparse embedding, and every non-zero value corresponds to one of the tokens above. The ##lal is just how the tokenizer denotes subwords. If you rearrange them a bit, you can see that it spelled cephalalgia with tokens 1, 3, 4, and 5.

Beyond just taking tokens from the input text, it also takes related/synonym tokens (complaint, warning, suffered) that might also be used to describe other related documents.
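As a side note for anyone reading along: output like the excerpt above can be reproduced with the new v5.0 SparseEncoder API roughly as sketched below; the checkpoint, input sentence, and top_k are illustrative choices, not necessarily the exact setup behind the excerpt.

```python
from sentence_transformers import SparseEncoder

# Illustrative checkpoint: any SPLADE-style sparse encoder should behave similarly.
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# The embedding is a vector over the tokenizer vocabulary, almost entirely zeros.
embedding = model.encode("The patient complained of severe cephalalgia.")

# Map the non-zero dimensions back to (token, weight) pairs, strongest first.
for token, weight in model.decode(embedding, top_k=20):
    print(f"{token}\t{weight:.2f}")
```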

Member:

Yes, I noticed that. My question was more about whether having cephalalgia as a single token would help the model expand better or not (i.e., does tokenization have an impact on performance, for this particular use).

Member Author:

Oh, oops. In that case: yes. A very big impact, actually. To the extent that more modern tokenizers sometimes actually perform worse when turned into sparse encoders, because their tokens are "reused" too often between words with wildly different meanings. On paper, a tokenizer with as few subword tokens as possible is likely going to be best.
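To make the tokenizer point a bit more concrete, here is a quick check of how a BERT-style vocabulary (an illustrative choice, close to what many SPLADE-style sparse encoders build on) splits the rare term from the excerpt:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; the exact split depends on the vocabulary used.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare medical term becomes several subword pieces, which the sparse encoder
# then has to activate jointly to represent a single concept.
print(tokenizer.tokenize("cephalalgia"))
# Likely something like: ['ce', '##pha', '##lal', '##gia']
```

A vocabulary that kept cephalalgia as one token would only need a single dimension to represent it, which is the intuition behind preferring tokenizers with fewer subword splits.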

Member:

Awesome, thanks a lot for the clarification! Makes sense, I can ~visualize how the model "activates" for the tokens in the sequence, and there's no concept of sorting or grouping.

tomaarsen merged commit ea837a5 into huggingface:main on Jul 1, 2025
1 check passed