Description
Hi,
We need some advice from the community to fix this issue.
We are running the server with:
./server -t 32 --threads-http 32 --no-mmap -ngl 999 --batch-size 32 -m /opt/models/mixtral_ollama/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf -c 131072 --parallel 512 --host 0.0.0.0 --port 8091
We have configured Huggingface chat-ui for user interaction.
When we run a stress test with 20-30 users writing at the same time, we see GPU memory accumulating, and once everyone stops it is not released; it stays there. At some point we hit OOM because the memory is never freed.
My question is: how can we tune this so that memory usage decreases when no one is writing in the chat, avoiding out-of-memory errors at the CUDA level?
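For example, would reducing the context size and the number of parallel slots be the right direction? The following is only a sketch we are considering, assuming 32 concurrent slots with roughly 1024 tokens of context each is enough for our chat sessions (as far as we understand, the server splits -c evenly across the --parallel slots):

./server -t 32 --threads-http 32 --no-mmap -ngl 999 --batch-size 32 \
  -m /opt/models/mixtral_ollama/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf \
  -c 32768 --parallel 32 \
  --host 0.0.0.0 --port 8091
# 32768 / 32 = 1024 tokens per slot; the f16 KV cache would then be around
# 4 GiB instead of the 16 GiB we see with -c 131072 in the log below.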
Mar 11 11:45:24 srvmlwrkt01t systemd[1]: Started llama.cpp Service.
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: ggml_init_cublas: found 1 CUDA devices:
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: Device 0: GRID A100D-80C, compute capability 8.0, VMM: no
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: {"build":0,"commit":"unknown","function":"main","level":"INFO","line":2796,"msg":"build info","tid":"139841044271104","timestamp":1710150324}
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: {"function":"main","level":"INFO","line":2803,"msg":"system info","n_threads":32,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"139841044271104","timestamp":1710150324,"total_threads":8}
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from /opt/models/mixtral_ollama/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf (version GGUF V3 (latest))
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 0: general.architecture str = llama
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 1: general.name str = mistralai_mixtral-8x7b-instruct-v0.1
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 2: llama.context_length u32 = 32768
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 4: llama.block_count u32 = 32
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 9: llama.expert_count u32 = 8
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 10: llama.expert_used_count u32 = 2
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 11: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 12: llama.rope.freq_base f32 = 1000000.000000
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 13: general.file_type u32 = 17
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 2
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 24: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 25: general.quantization_version u32 = 2
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type f32: 65 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type f16: 32 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type q8_0: 64 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type q5_K: 833 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type q6_K: 1 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_vocab: special tokens definition check successful ( 259/32000 ).
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: format = GGUF V3 (latest)
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: arch = llama
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: vocab type = SPM
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_vocab = 32000
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_merges = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_ctx_train = 32768
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd = 4096
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_head = 32
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_head_kv = 8
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_layer = 32
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_rot = 128
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd_head_k = 128
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd_head_v = 128
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_gqa = 4
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd_k_gqa = 1024
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd_v_gqa = 1024
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: f_norm_eps = 0.0e+00
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_ff = 14336
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_expert = 8
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_expert_used = 2
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: pooling type = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: rope type = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: rope scaling = linear
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: freq_base_train = 1000000.0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: freq_scale_train = 1
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_yarn_orig_ctx = 32768
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: rope_finetuned = unknown
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: model type = 7B
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: model ftype = Q5_K - Medium
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: model params = 46.70 B
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: model size = 30.02 GiB (5.52 BPW)
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: general.name = mistralai_mixtral-8x7b-instruct-v0.1
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: BOS token = 1 '<s>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: EOS token = 2 '</s>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: UNK token = 0 '<unk>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: PAD token = 0 '<unk>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: LF token = 13 '<0x0A>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_tensors: ggml ctx size = 0.76 MiB
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: offloading 32 repeating layers to GPU
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: offloading non-repeating layers to GPU
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: offloaded 33/33 layers to GPU
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: CUDA_Host buffer size = 85.94 MiB
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: CUDA0 buffer size = 30649.55 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: n_ctx = 131072
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: freq_base = 1000000.0
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: freq_scale = 1
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_kv_cache_init: CUDA0 KV buffer size = 16384.00 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: KV self size = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: CUDA_Host input buffer size = 17.50 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: CUDA0 compute buffer size = 531.25 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: CUDA_Host compute buffer size = 0.50 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: graph splits (measure): 2
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: {"function":"initialize","level":"INFO","line":426,"msg":"initializing slots","n_s
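For reference, the 16384 MiB KV cache reported above matches a rough calculation from the values printed in the log (n_layer = 32, n_embd_k_gqa = n_embd_v_gqa = 1024, n_ctx = 131072, f16 = 2 bytes per element):

# K + V caches: 2 * n_layer * n_embd_kv * n_ctx * 2 bytes, converted to MiB
echo $(( 2 * 32 * 1024 * 131072 * 2 / 1024 / 1024 ))   # -> 16384

So at startup we are already at roughly 30 GiB of weights plus this fixed 16 GiB KV cache on the 80 GiB card, and whatever accumulates during the stress test comes on top of that.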
Any advice will be appreciated!