Hello, I pulled today and built on Windows using:
cmake -DLLAMA_CUBLAS=1
cmake --build . --config Release
then
$ ./main.exe -t 6 -ngl 18 -m ../../../models/gpt4-x-vicuna-13B.ggml.q5_1.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -r "user:" -f ./chat-with-sam.txt
main: build = 606 (7552ac5)
main: seed = 1685398340
llama.cpp: loading model from ../../../models/gpt4-x-vicuna-13B.ggml.q5_1.bin
llama_model_load_internal: format = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 7274.60 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 18 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 4084 MB
.............................................
llama_init_from_file: kv self size = 1600.00 MB
system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.700000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 0
You are Sam, a rude AI.
User: hello!
Sam: Hey. What do you want? I'm busy watching cat videos. [end of text]
llama_print_timings: load time = 7954.60 ms
llama_print_timings: sample time = 4.00 ms / 16 runs ( 0.25 ms per token)
llama_print_timings: prompt eval time = 4103.94 ms / 20 tokens ( 205.20 ms per token)
llama_print_timings: eval time = 4712.46 ms / 15 runs ( 314.16 ms per token)
llama_print_timings: total time = 12674.03 ms
Previously it stopped at User: and waited for my input; now it just ends the conversation. I have no idea what has changed, can you please help?
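
For context, my (possibly wrong) understanding of how the main example stops is roughly this: it hands control back to the user when the recent output ends with the reverse prompt (a plain, case-sensitive string match), but it terminates outright when the model samples the end-of-text token, which is what the "[end of text]" above suggests happened here. A simplified sketch of that idea (names like ends_with and TOKEN_EOS are my own, not the actual main.cpp code):

#include <iostream>
#include <string>
#include <utility>
#include <vector>

const int TOKEN_EOS = 2; // hypothetical end-of-text token id

// Case-sensitive check: does the generated text end with the reverse prompt?
static bool ends_with(const std::string & text, const std::string & suffix) {
    return text.size() >= suffix.size() &&
           text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

int main() {
    const std::string reverse_prompt = "User:";
    // Pretend stream of (token id, decoded piece) pairs coming from the model.
    const std::vector<std::pair<int, std::string>> stream = {
        {100, "Sam:"}, {101, " Hey."}, {TOKEN_EOS, ""}
    };

    std::string output;
    for (const auto & [id, piece] : stream) {
        if (id == TOKEN_EOS) {                   // model ended the text on its own
            std::cout << " [end of text]\n";
            break;
        }
        output += piece;
        std::cout << piece;
        if (ends_with(output, reverse_prompt)) { // reverse prompt hit -> wait for input
            std::cout << "\n(waiting for user input)\n";
            break;
        }
    }
    return 0;
}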