
Fix spm whitespaces #2806

Merged
3 commits merged into ggml-org:master on Aug 26, 2023

Conversation

@klosax (Contributor) commented Aug 26, 2023

Fix so that prepending a whitespace to the prompt when using the spm tokenizer is done in user code instead of in the tokenizer.
This restores the pre-gguf HellaSwag accuracy for models using the spm tokenizer (#2321 (comment)).
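As a rough illustration of the convention involved (a sketch, not the actual llama.cpp code): SentencePiece spells spaces with the "▁" metasymbol, and with this change the caller, rather than the tokenizer, is responsible for the leading space in front of the prompt. The function names below are illustrative only.

```python
# Illustrative sketch only; the real change lives in llama.cpp / main.cpp.
# SentencePiece escapes spaces as the "▁" (U+2581) metasymbol internally,
# so a prompt that should start on a word boundary needs a leading space.
def prepare_spm_prompt(prompt: str) -> str:
    # After this PR, user code (e.g. main.cpp) prepends the space;
    # the tokenizer itself no longer does it.
    return " " + prompt

def escape_whitespace(piece: str) -> str:
    # How spm-style vocabularies spell a space inside token strings.
    return piece.replace(" ", "\u2581")
```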

@KerfuffleV2 (Collaborator) left a comment:

Sorry, my knowledge level for tokenizing stuff falls short of being able to review something like this. I really only have basic familiarity with the broad strokes.

@klosax klosax merged commit 2ba83c8 into ggml-org:master Aug 26, 2023
@ggerganov (Member) commented Aug 26, 2023

We are still doing something wrong somewhere. When I test the OG sentencepiece tokenizer with the following Python code:

import os
import sys
import argparse

from sentencepiece import SentencePieceProcessor

parser = argparse.ArgumentParser()
parser.add_argument("dir_tokenizer", help="directory containing 'tokenizer.model' file")
args = parser.parse_args()

dir_tokenizer = args.dir_tokenizer

tokenizer = SentencePieceProcessor(dir_tokenizer + '/tokenizer.model')

text = 'Hello, world!'
print(text)
print(tokenizer.encode(text, add_bos=True))
print(tokenizer.decode(tokenizer.encode(text, add_bos=True)))

I get the output:

Hello, world!
[1, 15043, 29892, 3186, 29991]
Hello, world!

Our tokenizer tokenizes correctly, however the de-tokenization incorrectly prefixes the text with a whitespace:

main : 'Hello, world!' tokenized to ' Hello, world!'

Also, I don't think that having to manually prefix the prompt with a whitespace in user code is correct.
The Python tokenizer does not have such a requirement.

@klosax (Contributor, Author) commented Aug 26, 2023

Also, I don't think that having to manually prefix the prompt with a whitespace in user code is correct. The Python tokenizer does not have such a requirement.

I just restored the pre-gguf behavior.

I cannot find any sources saying that the correct way is to prepend a whitespace to the prompt.
Maybe we could remove it? Or at least make it optional, like the add_bos parameter to the tokenizer.
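A minimal sketch of what such an optional flag could look like, assuming a hypothetical wrapper around the tokenizer (the names below are illustrative, not existing llama.cpp API):

```python
# Hypothetical wrapper illustrating the "make it optional" suggestion above.
# `tokenize_fn` stands in for the real tokenizer call; it is not actual API.
def tokenize_prompt(tokenize_fn, prompt, add_bos=True, add_leading_space=True):
    if add_leading_space:
        prompt = " " + prompt  # spm convention, applied only on request
    return tokenize_fn(prompt, add_bos=add_bos)
```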

@ggerganov (Member) commented:

I cannot find any sources saying that the correct way is to prepend a whitespace to the prompt.

I don't know what the intent was when it was first added. The comment says "to match the OG tokenizer", but I'm not sure in what way it matches it.

Regarding the extra leading space during de-tokenization, I can see that in llama2.c they explicitly remove the leading space if the previous token is BOS:

https://github.com/karpathy/llama2.c/blob/49daf18f2f85cab239e80b8a452a2f78f295032f/run.c#L423-L426

I guess I'll add this to our de-tokenization code as well. This does not affect any of the generation results, but fixes a display problem and makes it consistent with the OG behaviour.
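A rough Python sketch of the llama2.c behaviour referenced above, under the assumption that BOS has id 1; `token_to_piece` and `BOS_ID` are illustrative stand-ins, not real API names:

```python
# Sketch of the referenced run.c logic (L423-426): if the previous token is
# BOS, strip the single leading space that spm puts on the first piece.
def detokenize(token_ids, token_to_piece, BOS_ID=1):
    pieces = []
    prev = None
    for tok in token_ids:
        piece = token_to_piece(tok)
        if prev == BOS_ID and piece.startswith(" "):
            piece = piece[1:]  # drop the display-only leading space
        pieces.append(piece)
        prev = tok
    return "".join(pieces)
```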

mattgauf added a commit to mattgauf/llama.cpp that referenced this pull request Aug 26, 2023
* master: (773 commits)
  server : add `/detokenize` endpoint (ggml-org#2802)
  convert.py : advanced option (ggml-org#2753)
  llama : use Unicode Escape Sequence to replace encoded characters (ggml-org#2814)
  flake.nix : add rocm support and cleanup (ggml-org#2808)
  llama : move #includes out of _GNU_SOURCE conditional (ggml-org#2817)
  main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (ggml-org#1528)
  llama : use std::abs in llama_sample_tail_free (ggml-org#2800)
  k-quants : remove unnecessary tensor shape restrictions (ggml-org#2811)
  Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (ggml-org#2807)
  Fix HellaSwag (ggml-org#2805)
  flake : build llama.cpp on Intel with nix (ggml-org#2795)
  Handle null rope scaling value (ggml-org#2793)
  Fix spm whitespaces (ggml-org#2806)
  examples : skip unnecessary external lib in server README.md how-to (ggml-org#2804)
  llama : fix struct decl (ggml-org#2790)
  Faster perplexity computation (ggml-org#2786)
  llama : add llama_beam_search() (ggml-org#2267)
  convert.py : Get rope scale from HuggingFace models (ggml-org#2772)
  llama-bench : add model sizes (ggml-org#2771)
  convert.py : export rope freq_base when converting CodeLlama from an HF model (ggml-org#2773)
  ...
akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
* llama.cpp : fix spm whitespace escaping + clean up

* main.cpp : spm - add whitespace in front of prompt

* test-tokenizer-0.cpp : spm - add whitespace in front of prompt