Can't use the product because of an error

Hi,

I can’t use RTILA at all. I believe I’m on the latest version: I can’t find a way to check it inside the app, but I downloaded 8.2.2 from the GitHub releases. I’m pasting the error I get below. I’m on Windows 10.

Error from local LLM: llama-server exited prematurely with status: exit code: 1.
Stderr: load_backend: loaded RPC backend from C:\Users\Rodolfo\AppData\Local\RTILA\runtime\x86_64\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce GTX 780 Ti (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from C:\Users\Rodolfo\AppData\Local\RTILA\runtime\x86_64\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\Rodolfo\AppData\Local\RTILA\runtime\x86_64\ggml-cpu-haswell.dll
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 8300 (f2ab047f2) with Clang 19.1.5 for Windows x86_64
system info: n_threads = 4, n_threads_batch = 4, total_threads = 8

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

Running without SSL
init: using 7 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model ‘C:\Users\Rodolfo.rtila-x/models\qwen3.5-9b-Q4_K_M.gguf’
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 6026 MiB of device memory vs. 2369 MiB of free device memory
llama_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 4680 MiB
llama_params_fit_impl: context size set by user to 16384 → no change
llama_params_fit: failed to fit params to free device memory: n_gpu_layers already set by user to 999, abort
llama_params_fit: fitting params to free memory took 1.02 seconds
llama_model_load_from_file_impl: using device Vulkan0 (NVIDIA GeForce GTX 780 Ti) (0000:01:00.0) - 2570 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 427 tensors from C:\Users\Rodolfo.rtila-x/models\qwen3.5-9b-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen35
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Temp_Gguf
llama_model_loader: - kv 3: general.quantized_by str = Unsloth
llama_model_loader: - kv 4: general.size_label str = 9.0B
llama_model_loader: - kv 5: general.repo_url str = unsloth (Unsloth AI)
llama_model_loader: - kv 6: general.tags arr[str,2] = [“unsloth”, “llama.cpp”]
llama_model_loader: - kv 7: qwen35.block_count u32 = 32
llama_model_loader: - kv 8: qwen35.context_length u32 = 262144
llama_model_loader: - kv 9: qwen35.embedding_length u32 = 4096
llama_model_loader: - kv 10: qwen35.feed_forward_length u32 = 12288
llama_model_loader: - kv 11: qwen35.attention.head_count u32 = 16
llama_model_loader: - kv 12: qwen35.attention.head_count_kv u32 = 4
llama_model_loader: - kv 13: qwen35.rope.dimension_sections arr[i32,4] = [11, 11, 10, 0]
llama_model_loader: - kv 14: qwen35.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 15: qwen35.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 16: qwen35.attention.key_length u32 = 256
llama_model_loader: - kv 17: qwen35.attention.value_length u32 = 256
llama_model_loader: - kv 18: qwen35.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 19: qwen35.ssm.state_size u32 = 128
llama_model_loader: - kv 20: qwen35.ssm.group_count u32 = 16
llama_model_loader: - kv 21: qwen35.ssm.time_step_rank u32 = 32
llama_model_loader: - kv 22: qwen35.ssm.inner_size u32 = 4096
llama_model_loader: - kv 23: qwen35.full_attention_interval u32 = 4
llama_model_loader: - kv 24: qwen35.rope.dimension_count u32 = 64
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen35
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,248320] = [“!”, “"”, “#”, “$”, “%”, “&”, “'”, …
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,247587] = [“Ġ Ġ”, “ĠĠ ĠĠ”, “i n”, “Ġ t”,…
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 248046
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 248055
llama_model_loader: - kv 32: tokenizer.chat_template str = {%- set image_count = namespace(value…
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: general.file_type u32 = 15
llama_model_loader: - type f32: 177 tensors
llama_model_loader: - type q4_K: 217 tensors
llama_model_loader: - type q6_K: 33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 5.23 GiB (5.02 BPW)
load: 0 unused tokens
load: printing all EOG tokens:
load: - 248044 (‘<|endoftext|>’)
load: - 248046 (‘<|im_end|>’)
load: - 248063 (‘<|fim_pad|>’)
load: - 248064 (‘<|repo_name|>’)
load: - 248065 (‘<|file_sep|>’)
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch = qwen35
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 4096
print_info: n_embd_inp = 4096
print_info: n_layer = 32
print_info: n_head = 16
print_info: n_head_kv = 4
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 40
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: mrope sections = [11, 11, 10, 0]
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 4096
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 32
print_info: ssm_n_group = 16
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 9B
print_info: model params = 8.95 B
print_info: general.name = Temp_Gguf
print_info: vocab type = BPE
print_info: n_vocab = 248320
print_info: n_merges = 247587
print_info: BOS token = 11 ‘,’
print_info: EOS token = 248046 ‘<|im_end|>’
print_info: EOT token = 248046 ‘<|im_end|>’
print_info: PAD token = 248055 ‘<|vision_pad|>’
print_info: LF token = 198 ‘Ċ’
print_info: FIM PRE token = 248060 ‘<|fim_prefix|>’
print_info: FIM SUF token = 248062 ‘<|fim_suffix|>’
print_info: FIM MID token = 248061 ‘<|fim_middle|>’
print_info: FIM PAD token = 248063 ‘<|fim_pad|>’
print_info: FIM REP token = 248064 ‘<|repo_name|>’
print_info: FIM SEP token = 248065 ‘<|file_sep|>’
print_info: EOG token = 248044 ‘<|endoftext|>’
print_info: EOG token = 248046 ‘<|im_end|>’
print_info: EOG token = 248063 ‘<|fim_pad|>’
print_info: EOG token = 248064 ‘<|repo_name|>’
print_info: EOG token = 248065 ‘<|file_sep|>’
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while… (mmap = true, direct_io = false)
ggml_vulkan: Device memory allocation of size 1050944000 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
alloc_tensor_range: failed to allocate Vulkan0 buffer of size 1050944000
llama_model_load: error loading model: unable to allocate Vulkan0 buffer
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model ‘C:\Users\Rodolfo.rtila-x/models\qwen3.5-9b-Q4_K_M.gguf’
srv load_model: failed to load model, ‘C:\Users\Rodolfo.rtila-x/models\qwen3.5-9b-Q4_K_M.gguf’
srv operator(): operator(): cleaning up before exit…
main: exiting due to model loading error

Hi,

The logs show you’re hitting a “VRAM bottleneck.” Your GTX 780 Ti is a classic card, but it only has 3 GB of memory (the log reports just 2369 MiB of it free), while the Assistant Lite 1.5 model you’re trying to run is projected to need about 6 GB (6026 MiB in the log). When RTILA tries to load the whole model onto your 3 GB card, the Vulkan memory allocation fails and llama-server exits.
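If you want to check the numbers yourself, here is a rough sketch in Python that pulls the figures out of the two `llama_params_fit_impl` lines in your log and works out the shortfall. The function name `vram_shortfall_mib` is made up for this example, and the formula (projected usage minus free memory, plus the headroom target) is my reading of the log messages, not something taken from the llama.cpp source.

```python
import re

# Two lines copied verbatim from the log above.
LOG = """\
llama_params_fit_impl: projected to use 6026 MiB of device memory vs. 2369 MiB of free device memory
llama_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 4680 MiB
"""


def vram_shortfall_mib(log: str) -> int:
    """Hypothetical helper: MiB by which projected usage exceeds the usable budget."""
    fit = re.search(
        r"projected to use (\d+) MiB of device memory vs\. (\d+) MiB of free", log
    )
    target = re.search(r"free memory target of (\d+) MiB", log)
    if fit is None or target is None:
        raise ValueError("expected llama_params_fit_impl lines not found")
    projected, free = int(fit.group(1)), int(fit.group(2))
    headroom = int(target.group(1))
    # Assumed budget: free device memory minus the headroom that should stay free.
    return projected - (free - headroom)


print(vram_shortfall_mib(LOG))  # 4681
```

This gives 4681 MiB, within 1 MiB of the 4680 MiB the log reports (the small gap is presumably rounding in the logged values). Either way, the model would need to shed well over 4 GiB to fit, which is why a smaller model is the practical fix on this card.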

The best way to get up and running right now is to switch over to the RTILA Assistant Mini. It’s specifically designed to fit into 2-3GB of VRAM and should run much smoother on your current setup.

If your hardware still struggles with local models, you can also try using OpenRouter. They have plenty of free models available that run in the cloud, so you won’t have to worry about your local specs at all.

Give the Mini version or OpenRouter a shot and let me know if that gets you back on track!

Thank you so much for your response and explanation. Yes, this is definitely an older machine, but I appreciate the options we have as workarounds. I’ll definitely be using the Assistant Mini for now.
