
Eval bug: microsoft/bitnet-b1.58-2B-4T-gguf #12997

Open
@celsowm

Description

Name and Version

PS C:\Users\celso> llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 5052 (1be76e4)
built with MSVC 19.41.34120.0 for x64

Operating systems

Windows

GGML backends

CUDA

Hardware

RTX 3060 12GB

Models

https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf
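
The failing file is ggml-model-i2_s.gguf from that repository. A plausible way to fetch it, assuming the standard huggingface-cli tool (the report does not say how the file was obtained):

huggingface-cli download microsoft/bitnet-b1.58-2B-4T-gguf ggml-model-i2_s.gguf --local-dir .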

Problem description & steps to reproduce

Serving the official BitNet b1.58 2B GGUF (ggml-model-i2_s.gguf) with llama-server fails during model load. gguf_init_from_file_impl rejects tensor 'blk.0.ffn_down.weight' because its stored type id 36 resolves to TYPE_IQ4_NL_4_4, a type that has been removed ("use IQ4_NL with runtime repacking"); the loader then reports that the row of 6912 elements is "not a multiple of block size (0)" and aborts with "failed to read tensor info". The full log is reproduced under "Relevant log output" below.
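
For reference, a plausible invocation reconstructed from the log (the exact command line is not included in the report; -m and --port are standard llama-server flags):

llama-server -m .\ggml-model-i2_s.gguf --port 8081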

First Bad Commit

No response

Relevant log output

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
build: 5052 (1be76e46) with MSVC 19.41.34120.0 for x64
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8081, http threads: 11
main: loading model
srv    load_model: loading model '.\ggml-model-i2_s.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3060) - 11247 MiB free
gguf_init_from_file_impl: tensor 'blk.0.ffn_down.weight' of type 36 (TYPE_IQ4_NL_4_4 REMOVED, use IQ4_NL with runtime repacking) has 6912 elements per row, not a multiple of block size (0)
gguf_init_from_file_impl: failed to read tensor info
llama_model_load: error loading model: llama_model_loader: failed to load model from .\ggml-model-i2_s.gguf
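
The "block size (0)" in the message is the key detail. Below is a minimal, self-contained C++ sketch of that validation step (an illustration of the mechanism, not the actual ggml source). It assumes the type-traits lookup reports a block size of 0 for the removed TYPE_IQ4_NL_4_4 (type id 36), so a tensor carrying that id can never satisfy the "elements per row is a multiple of the block size" check.

// Sketch only: a simplified stand-in for ggml's type-traits lookup and
// the gguf tensor-info validation that produces the error in the log.
#include <cstdint>
#include <cstdio>

struct TypeTraits {
    const char * name;
    int64_t      blck_size; // 0 marks a removed type in this sketch
};

// Hypothetical two-entry lookup for illustration; the id 36 string and
// the 32-element IQ4_NL block size match the log / public ggml docs.
static TypeTraits traits_for(int type_id) {
    if (type_id == 36) {
        return { "TYPE_IQ4_NL_4_4 REMOVED, use IQ4_NL with runtime repacking", 0 };
    }
    return { "IQ4_NL", 32 };
}

int main() {
    const int     type_id = 36;   // type id stored for blk.0.ffn_down.weight
    const int64_t ne0     = 6912; // elements per row, from the log
    const TypeTraits t = traits_for(type_id);

    // Loader-style check: a zero block size (removed type), or a row
    // length that is not a multiple of the block size, is fatal.
    if (t.blck_size == 0 || ne0 % t.blck_size != 0) {
        std::fprintf(stderr,
            "tensor of type %d (%s) has %lld elements per row, "
            "not a multiple of block size (%lld)\n",
            type_id, t.name, (long long) ne0, (long long) t.blck_size);
        return 1;
    }
    return 0;
}

Under that assumption the failure is deterministic: a block size of 0 can never divide 6912, so every tensor stored with type id 36 is rejected before any weight data is read.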
