
Hi everyone,
I’ve got a local dev box with:

  • OS: Linux 5.15.0-130-generic
  • CPU: AMD Ryzen 5 5600G (12 threads)
  • RAM: 48 GiB total
  • Disk: 1 TB NVMe + 1 old HDD
  • GPU: AMD Radeon (no NVIDIA/CUDA)

I have Ollama installed with two local LLMs: deepseek-r1:1.5b and llama2:7b (3.8 GB).

I’m already running llama2:7b (Q4_0, ~3.8 GiB model) at ~50% CPU load per prompt. It works well, but it’s not very smart, and I want something smarter. I’m building a VS Code extension that embeds a local LLM; the extension already has manual context capabilities, and I’m working on enhanced context, MCP support, a basic agentic mode, etc. I need a model that:

  • Fits comfortably in RAM
  • Maximizes inference speed on 12 cores (no GPU/CUDA)
  • Yields strong conversational accuracy
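
Roughly, the extension ends up making a call like this against the local Ollama HTTP API on its default port (11434). This is a simplified sketch; the helper name and default model here are placeholders, not the extension’s actual code:

```typescript
// Minimal sketch of calling a locally running Ollama server from the extension.
// Assumes the default endpoint http://localhost:11434 and a model already pulled locally.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

async function askLocalModel(prompt: string, model = "llama2:7b"): Promise<string> {
  const messages: ChatMessage[] = [{ role: "user", content: prompt }];
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // stream: false returns a single JSON object instead of a token stream
    body: JSON.stringify({ model, messages, stream: false }),
  });
  if (!res.ok) {
    throw new Error(`Ollama request failed: ${res.status} ${res.statusText}`);
  }
  const data = await res.json();
  return data.message?.content ?? "";
}

// Example usage inside a command or chat handler:
// const answer = await askLocalModel("Explain what this function does...");
```

Since everything goes through localhost, swapping models is just a matter of changing the `model` field, which is why I only want to spend my bandwidth on one good download.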

Given my specs and limited bandwidth (one download only), which Ollama model (and quantization) would you recommend?

Please let me know any additional info needed.

TL;DR:

From my research, I found the following (some of it is AI-suggested based on my specs):

  • Qwen2.5-Coder 32B Instruct with Q8_0 quantization appears to be the best fit (I haven’t confirmed this; it’s just what my research turned up)
  • Gemma 3 27B and Mistral Small 3.1 24B are alternatives, but Qwen2.5-Coder reportedly excels at coding (again, unconfirmed; based on my research)

Memory and Model Size Constraints

The memory requirement for LLMs is driven primarily by the model’s parameter count and quantization level. For a 7B model like llama2:7b, the current ~3.8 GB footprint suggests 4-bit quantization (approximately 3.5 GB for 7B parameters at 4 bits, plus overhead). General guidelines from the Ollama GitHub repo indicate 8 GB of RAM for 7B models, 16 GB for 13B, and 32 GB for 33B models, suggesting you can handle up to ~33B parameters with your 37 GiB (39.7 GB) of available RAM. Larger models like 70B typically require 64 GB.
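
To make the arithmetic behind those numbers explicit, here is a rough estimator (the ~20% overhead factor is an assumption for illustration, not a measured value):

```typescript
// Back-of-the-envelope model size: bytes ≈ parameter_count × bits_per_weight / 8,
// plus runtime overhead (KV cache, buffers). The 1.2 overhead factor is an assumption.
function estimateModelGB(paramsBillions: number, bitsPerWeight: number, overhead = 1.2): number {
  const weightGB = (paramsBillions * 1e9 * bitsPerWeight) / 8 / 1e9;
  return weightGB * overhead;
}

console.log(estimateModelGB(7, 4).toFixed(1));  // ~4.2 GB: roughly the ~3.8 GB llama2:7b Q4_0 file plus overhead
console.log(estimateModelGB(32, 8).toFixed(1)); // ~38.4 GB: why a 32B model at Q8_0 sits right at the ~39.7 GB ceiling
```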

Model Options and Quantization

  • LLaMA 3.1 8B: Q8_0 at 8.54GB
  • Gemma 3 27B: Q8_0 at 28.71GB, Q4_K_M at 16.55GB
  • Mistral Small 3.1 24B: Q8_0 at 25.05GB, Q4_K_M at 14.33GB
  • Qwen2.5-Coder 32B: Q8_0 at 34.82GB, Q6_K at 26.89GB, Q4_K_M at 19.85GB

Given your RAM, models up to 34.82 GB (Qwen2.5-Coder 32B Q8_0) are technically feasible, though that largest option leaves only a few GB of headroom for context (KV cache) and the OS (this part is AI-generated).
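
As a sanity check on “feasible”, here is a small sketch that filters the sizes above against available RAM while reserving some headroom for the OS and KV cache (the 4 GB default headroom is an assumption, not a benchmark):

```typescript
// Filter the quantization options above by what fits in available RAM,
// reserving headroom for the OS and context (KV cache).
interface QuantOption {
  model: string;
  quant: string;
  sizeGB: number;
}

const options: QuantOption[] = [
  { model: "LLaMA 3.1 8B", quant: "Q8_0", sizeGB: 8.54 },
  { model: "Gemma 3 27B", quant: "Q8_0", sizeGB: 28.71 },
  { model: "Gemma 3 27B", quant: "Q4_K_M", sizeGB: 16.55 },
  { model: "Mistral Small 3.1 24B", quant: "Q8_0", sizeGB: 25.05 },
  { model: "Mistral Small 3.1 24B", quant: "Q4_K_M", sizeGB: 14.33 },
  { model: "Qwen2.5-Coder 32B", quant: "Q8_0", sizeGB: 34.82 },
  { model: "Qwen2.5-Coder 32B", quant: "Q6_K", sizeGB: 26.89 },
  { model: "Qwen2.5-Coder 32B", quant: "Q4_K_M", sizeGB: 19.85 },
];

function fitsInRam(availableGB: number, headroomGB = 4): QuantOption[] {
  return options.filter((o) => o.sizeGB + headroomGB <= availableGB);
}

console.log(fitsInRam(39.7)); // with 4 GB headroom everything listed still fits; raise the headroom and the 32B Q8_0 drops out first
```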

| Model | Parameters | Q8_0 Size (GB) | Coding Focus | General Capabilities | Notes |
|---|---|---|---|---|---|
| LLaMA 3.1 8B | 8B | 8.54 | Moderate | Strong | General purpose, smaller, good for a baseline. |
| Gemma 3 27B | 27B | 28.71 | Good | Excellent, multimodal | Supports text and images, strong reasoning, fits RAM. |
| Mistral Small 3.1 24B | 24B | 25.05 | Very good | Excellent, fast | Low latency, competitive with larger models, fits RAM. |
| Qwen2.5-Coder 32B | 32B | 34.82 | Excellent | Strong | SOTA for coding, matches GPT-4o, ideal for a VS Code extension, fits RAM. |

I have also checked:
