
Fix spm whitespaces #2806

Merged
3 commits merged into ggml-org:master on Aug 26, 2023

Conversation

@klosax (Contributor) commented Aug 26, 2023

Fix so that prepending a whitespace to the prompt when using the spm tokenizer is done in user code instead of in the tokenizer.
This restores the pre-gguf HellaSwag accuracy for models using the spm tokenizer (#2321 (comment)).
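As a rough illustration of the convention involved (a sketch, not the actual llama.cpp code): SentencePiece spells spaces with the "▁" metasymbol, and with this change the caller, rather than the tokenizer, is responsible for the leading space in front of the prompt. The function names below are illustrative only.

```python
# Illustrative sketch only; the real change lives in llama.cpp / main.cpp.
# SentencePiece escapes spaces as the "▁" (U+2581) metasymbol internally,
# so a prompt that should start on a word boundary needs a leading space.
def prepare_spm_prompt(prompt: str) -> str:
    # After this PR, user code (e.g. main.cpp) prepends the space;
    # the tokenizer itself no longer does it.
    return " " + prompt

def escape_whitespace(piece: str) -> str:
    # How spm-style vocabularies spell a space inside token strings.
    return piece.replace(" ", "\u2581")
```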

@KerfuffleV2 (Collaborator) left a comment:

Sorry, my knowledge level for tokenizing stuff falls short of being able to review something like this. I really only have basic familiarity with the broad strokes.

@klosax klosax merged commit 2ba83c8 into ggml-org:master Aug 26, 2023
@ggerganov (Member) commented Aug 26, 2023

We are still doing something wrong somewhere. When I test the OG sentencepiece tokenizer with the following Python code:

import os
import sys
import argparse

from sentencepiece import SentencePieceProcessor

parser = argparse.ArgumentParser()
parser.add_argument("dir_tokenizer", help="directory containing 'tokenizer.model' file")
args = parser.parse_args()

dir_tokenizer = args.dir_tokenizer

tokenizer = SentencePieceProcessor(dir_tokenizer + '/tokenizer.model')

text = 'Hello, world!'
print(text)
print(tokenizer.encode(text, add_bos=True))
print(tokenizer.decode(tokenizer.encode(text, add_bos=True)))

I get the output:

Hello, world!
[1, 15043, 29892, 3186, 29991]
Hello, world!

Our tokenizer tokenizes correctly, however the de-tokenization incorrectly prefixes the text with a whitespace:

main : 'Hello, world!' tokenized to ' Hello, world!'

Also, I don't think that having to manually prefix the prompt with a whitespace in user code is correct.
The Python tokenizer does not have such a requirement.

@klosax (Contributor, Author) commented Aug 26, 2023

Also, I don't think that having to manually prefix the prompt with a whitespace in user code is correct. The Python tokenizer does not have such a requirement.

I just restored the pre-gguf behavior.

I cannot find any sources saying that the correct way is to prepend a whitespace to the prompt.
Maybe we could remove it? Or at least make it optional, like the add_bos parameter to the tokenizer.
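A minimal sketch of what such an optional flag could look like, assuming a hypothetical wrapper around the tokenizer (the names below are illustrative, not existing llama.cpp API):

```python
# Hypothetical wrapper illustrating the "make it optional" suggestion above.
# `tokenize_fn` stands in for the real tokenizer call; it is not actual API.
def tokenize_prompt(tokenize_fn, prompt, add_bos=True, add_leading_space=True):
    if add_leading_space:
        prompt = " " + prompt  # spm convention, applied only on request
    return tokenize_fn(prompt, add_bos=add_bos)
```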

@ggerganov (Member) commented:

I cannot find any sources saying that the correct way is to prepend a whitespace to the prompt.

I don't know what the intent was when it was first added. The comment says "to match the OG tokenizer", but I'm not sure in what way it matches it.

Regarding the extra leading space during de-tokenization, I can see that in llama2.c they explicitly remove the leading space if the previous token is BOS:

https://github.com/karpathy/llama2.c/blob/49daf18f2f85cab239e80b8a452a2f78f295032f/run.c#L423-L426

I guess I'll add this to our de-tokenization code as well. This does not affect any of the generation results, but fixes a display problem and makes it consistent with the OG behaviour.
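A rough Python sketch of the llama2.c behaviour referenced above, under the assumption that BOS has id 1; `token_to_piece` and `BOS_ID` are illustrative stand-ins, not real API names:

```python
# Sketch of the referenced run.c logic (L423-426): if the previous token is
# BOS, strip the single leading space that spm puts on the first piece.
def detokenize(token_ids, token_to_piece, BOS_ID=1):
    pieces = []
    prev = None
    for tok in token_ids:
        piece = token_to_piece(tok)
        if prev == BOS_ID and piece.startswith(" "):
            piece = piece[1:]  # drop the display-only leading space
        pieces.append(piece)
        prev = tok
    return "".join(pieces)
```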

mattgauf added a commit to mattgauf/llama.cpp that referenced this pull request Aug 26, 2023
* master: (773 commits)
  server : add `/detokenize` endpoint (ggml-org#2802)
  convert.py : advanced option (ggml-org#2753)
  llama : use Unicode Escape Sequence to replace encoded characters (ggml-org#2814)
  flake.nix : add rocm support and cleanup (ggml-org#2808)
  llama : move #includes out of _GNU_SOURCE conditional (ggml-org#2817)
  main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (ggml-org#1528)
  llama : use std::abs in llama_sample_tail_free (ggml-org#2800)
  k-quants : remove unnecessary tensor shape restrictions (ggml-org#2811)
  Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (ggml-org#2807)
  Fix HellaSwag (ggml-org#2805)
  flake : build llama.cpp on Intel with nix (ggml-org#2795)
  Handle null rope scaling value (ggml-org#2793)
  Fix spm whitespaces (ggml-org#2806)
  examples : skip unnecessary external lib in server README.md how-to (ggml-org#2804)
  llama : fix struct decl (ggml-org#2790)
  Faster perplexity computation (ggml-org#2786)
  llama : add llama_beam_search() (ggml-org#2267)
  convert.py : Get rope scale from HuggingFace models (ggml-org#2772)
  llama-bench : add model sizes (ggml-org#2771)
  convert.py : export rope freq_base when converting CodeLlama from an HF model (ggml-org#2773)
  ...
akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
* llama.cpp : fix spm whitespace escaping + clean up

* main.cpp : spm - add whitespace in front of prompt

* test-tokenizer-0.cpp : spm - add whitespace in front of prompt