marvin78
The GTX 560 is Fermi and most certainly has CUDA support. Maybe/probably not the wanted feature set/version, though.
I just wish this whole LLM thing would run on OpenCL instead and be vendor-agnostic.
Edit: Seems I can play with this a bit even with my very limited hardware:
With an Intel i5-10400 (6 cores, 12 threads) + RX 580 (4GB VRAM), you have a decent setup for running small LLMs with CPU+OpenCL acceleration.
Best Model Choices (4GB VRAM Limit)
- Tiny LLMs (Best Fit)
- Phi-2 (2.7B) – Efficient and should run smoothly.
- Llama 2 7B (4-bit) – Will run with CPU help (a 4-bit 7B model is roughly 3.5–4 GB of weights alone, so it won't fit entirely in 4 GB of VRAM).
- Mistral 7B (4-bit quantized) – Might work but will be slow.
- Optimized Inference Tools for OpenCL
- llama.cpp (has OpenCL support)
- KoboldCPP (for chat-based AI, supports OpenCL)
- Text-generation-webui (use CPU+OpenCL hybrid mode)
How to Run LLMs on Your Hardware
- Install llama.cpp (best OpenCL-compatible runtime)
- Download: https://github.com/ggerganov/llama.cpp (LLM inference in C/C++)
- Compile with OpenCL (see the end-to-end sketch after this list):
```sh
make LLAMA_CLBLAST=1
```
- Run a quantized model (e.g., Phi-2):
```sh
./main -m phi2-2.7b.Q4_K.gguf -t 6 -ngl 1
```
- Use GGUF Models (Pre-Quantized for Small VRAM)
- Get 4-bit GGUF models from TheBloke on Hugging Face
- Load them with llama.cpp or KoboldCPP.
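To tie the steps above together, here is a rough end-to-end sketch. It assumes a Debian/Ubuntu-style system, the CLBlast-based OpenCL build of llama.cpp, and TheBloke's Phi-2 GGUF repo as the example download; the package names, the exact model file name, and the -ngl value are assumptions to adjust for your own setup.

```sh
# Build prerequisites (Debian/Ubuntu package names -- adjust for your distro)
sudo apt install build-essential git libclblast-dev ocl-icd-opencl-dev

# Get and build llama.cpp with the CLBlast (OpenCL) backend
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CLBLAST=1 -j"$(nproc)"

# Fetch a 4-bit GGUF model (example file from TheBloke/phi-2-GGUF; check the repo for exact names)
wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf

# -t 6  : one thread per physical core of the i5-10400
# -ngl N: number of layers offloaded to the RX 580; raise it until the 4 GB of VRAM is full
./main -m phi-2.Q4_K_M.gguf -t 6 -ngl 8 -p "Hello, world"
```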
Performance Notes
- The RX 580 will help, but the CPU will still do most of the work.
- Use 4-bit models to fit in VRAM.
- Enable swap space if you run out of RAM (rough sketch below).

Let me know if you want a detailed setup guide!
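On the swap point: if RAM runs out while a model is loaded, a swap file is the usual stopgap. A minimal sketch for a typical Linux install (the 8G size is just an example):

```sh
# Create and enable an 8 GB swap file (size is an example; pick what your disk allows)
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify it is active
swapon --show
```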
I bet it would be dog slow but still, might be fun.