
CUDA/OpenCL error, out of memory on reload. #1456

@edp1096

Description

Hello folks,

When I try the save-load-state example with CUDA, an error occurs on the second model load.
It seems something needs to be added to the llama_free function so that it also releases GPU memory.
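
As a rough sketch of the idea (this is not the actual llama.cpp implementation; the struct layouts and the gpu_tensors bookkeeping below are made up for illustration), llama_free would also have to release the device buffers allocated when layers were offloaded, not just the host-side memory:

#include <vector>
#include <cuda_runtime.h>

struct ggml_tensor { void * data; };        // stand-in for the real ggml struct

struct llama_context {                      // stand-in for the real context
    std::vector<ggml_tensor *> gpu_tensors; // assumed bookkeeping of offloaded tensors
};

void llama_free(llama_context * ctx) {
    for (ggml_tensor * t : ctx->gpu_tensors) {
        cudaFree(t->data); // release the VRAM held by the offloaded layer
        t->data = nullptr;
    }
    delete ctx;            // existing host-side cleanup
}

Without something like this, every reload leaks the full set of offloaded layers on the device.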

I added the n_gpu_layers assignment to the main function as below.

int main(int argc, char ** argv) {
    ...
    auto lparams = llama_context_default_params();

    lparams.n_ctx     = params.n_ctx;
    lparams.n_parts   = params.n_parts;
    lparams.n_gpu_layers = params.n_gpu_layers; // Add gpu layers count
    lparams.seed      = params.seed;
    ...
}

And I tried to run it as below.

D:\dev\pcbangstudio\workspace\my-llama\bin>save-load-state.exe -m ggml-vic7b-q4_0.bin -ngl 32
main: build = 548 (60f8c36)
llama.cpp: loading model from ggml-vic7b-q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  72.75 KB
llama_model_load_internal: mem required  = 5809.34 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 3860 MB
llama_init_from_file: kv self size  =  256.00 MB

The quick brown fox jumps over the lazy dog.

<!-- InstanceEnd -->Visible transl

llama.cpp: loading model from ggml-vic7b-q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  72.75 KB
llama_model_load_internal: mem required  = 5809.34 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
CUDA error 2 at D:\dev\pcbangstudio\workspace\my-llama\llama.cpp\ggml-cuda.cu:462: out of memory

D:\dev\pcbangstudio\workspace\my-llama\bin>
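
For what it's worth, the failure pattern can be reproduced outside llama.cpp with a few lines of CUDA: if the buffer from the first "load" is never freed, a second allocation of the same size fails with the same CUDA error 2 (out of memory). This is only an illustration of the suspected leak, not llama.cpp code; the 3860 MB figure is taken from the VRAM usage in the log above.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t size = 3860ull * 1024 * 1024; // VRAM used by the 32 offloaded layers above

    // First "load": allocate the model's device buffers.
    void * first = nullptr;
    cudaError_t err = cudaMalloc(&first, size);
    printf("first load : %s\n", cudaGetErrorString(err));

    // Missing cleanup: without this line the buffer leaks until process exit.
    // cudaFree(first);

    // Second "load": on a card with less than roughly twice that amount free,
    // this fails with cudaErrorMemoryAllocation, i.e. "CUDA error 2: out of memory".
    void * second = nullptr;
    err = cudaMalloc(&second, size);
    printf("second load: %s\n", cudaGetErrorString(err));

    cudaFree(second);
    cudaFree(first);
    return 0;
}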


Labels: bug (Something isn't working), hardware (Hardware related), high priority (Very important issue)
