Apparently ChatGPT can help you write code / programs?

Reporting in.

Been nerding out hard tonight.

I've been researching locally hosted LLMs and trying them on my Nvidia 4070 for years now, and today is the first time I've actually been impressed.

New open source models like Alibaba's Qwen3 pack a huge punch for their size and run great on limited hardware.
There are also Unsloth re-quantizations of these models that substantially improve performance and let you fit a bigger model onto a GPU.

This, combined with an 'agentic' VS Code plugin ( continue.dev ) connected to my local LLM, lets me chat with my source code and gradually compose it by telling the AI what to do.
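For anyone curious what the plugin is doing under the hood, the core loop is just "send the model your code plus an instruction, get code back". Here's a minimal sketch of that idea against ollama's local HTTP API (it listens on localhost:11434 by default; the model tag and file name are placeholders for whatever you've actually pulled and are working on):

# Minimal sketch of "chat with your source code" against a local ollama server.
# Assumes ollama is running on its default port and a model has already been pulled.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "qwen3:14b"  # placeholder -- use whatever model tag you pulled

def ask_about_code(instruction: str, source: str) -> str:
    payload = {
        "model": MODEL,
        "stream": False,  # one complete answer instead of a token stream
        "messages": [
            {"role": "system", "content": "You are a careful coding assistant."},
            {"role": "user", "content": instruction + "\n\nSource file:\n" + source},
        ],
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

if __name__ == "__main__":
    code = open("cache.py").read()  # placeholder file name
    print(ask_about_code("Write a unit test for the eviction logic in this file.", code))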

View attachment 369925

This test for memory cache engines is 95% written by AI, and it wrote the code a little faster than I would have done by hand.
It was completely accurate and the LLM required minimal steering on my part.

I can't imagine what kind of better model you could run on an Nvidia 5090.. as a model's parameter count increases, so do its intelligence and accuracy. With the right hardware, these open source models can start to compete with what the big companies are selling.

So basically yes, I'd say we're past the threshold where these tools are useful!
Are you using ollama for the local hosting?

Do you have equipment to measure power draw during longer requests?
 
Yep, but you need to know the specifics of what you want beforehand, otherwise you'll get slop.
 
Are you using ollama for the local hosting?
Do you have equipment to measure power draw during longer requests?

Yep.
I do.. on my Nvidia 4070 + i7-10700, I'm drawing almost 200 W while generating an answer.
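If anyone wants to reproduce that measurement, polling nvidia-smi while a long request runs gets you the GPU's share of the draw (a minimal sketch using standard nvidia-smi query flags; total system draw still needs a wall-plug meter):

# Log GPU power draw once per second while a long generation is running.
# This only covers the GPU -- CPU and system draw need a wall-plug meter.
import subprocess
import time

def gpu_power_watts() -> float:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        text=True,
    )
    return float(out.strip().splitlines()[0])  # first GPU only

if __name__ == "__main__":
    samples = []
    try:
        while True:
            samples.append(gpu_power_watts())
            print(f"{samples[-1]:.0f} W now, {sum(samples) / len(samples):.0f} W average")
            time.sleep(1)
    except KeyboardInterrupt:
        pass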
 
Here's how fast Qwen3 is when you run an optimized 14B tune ( this is the maximum size I can load into my $600 GPU ).
We tried this on a new mid-range Mac mini and saw about half the speed.

Screencast_20250508_100420.gif
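If you'd rather have a number than eyeball a gif, ollama's generate endpoint reports token counts and timings in its non-streaming response, so you can compute tokens per second directly (a minimal sketch; the model tag is a placeholder for whatever you pulled):

# Rough tokens/sec check using the timing fields ollama returns from a
# non-streaming /api/generate call (eval_count and eval_duration).
import json
import urllib.request

MODEL = "qwen3:14b"  # placeholder model tag

payload = {"model": MODEL, "prompt": "Explain what an LRU cache is.", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    data = json.loads(resp.read())

tokens = data["eval_count"]            # generated tokens
seconds = data["eval_duration"] / 1e9  # reported in nanoseconds
print(f"{tokens} tokens in {seconds:.1f} s = {tokens / seconds:.1f} tok/s")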
 
If you ever have the time, I'd be interested in a list of required software and maybe a "set of instructions", even loose ones, for setting up the required software on a local computer. (I guess I could ask one of the AIs...)

I know I don't have a good enough GPU to run anything at a reasonable speed, but if I could set up something like this to help me code the wolfy project, it would still be better than the pace I'm going at now (which might as well be in reverse).

It'd be going onto one of those ancient HP ProLiants (because I have four, and no budget to buy anything, so...).
 
If you can get a GPU that's about 5 years old and has at least 8 GB of VRAM ( a 2070? ), that's the entry point. You can run it on the CPU, but that's extremely slow compared to a GPU.
After that I recommend downloading ollama - it makes installing and using open source models very easy.
Once you have ollama and your model of choice running, you can hook it into an IDE like VS Code ( free ) using a plugin like continue.dev, which is what I used here:

GqZrMngXoAAMfJx.jpeg

Otherwise, without the IDE integration, you just get a chat interface and have to copy/paste code in and out to get help with it.
There is a Google Chrome extension called 'Page Assist' that's very good at that. The white animated gif above is using Page Assist, which is getting those answers out of Qwen3. :)
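Either way, once ollama is installed, a quick sanity check that the server is up and your model actually downloaded looks something like this (a minimal sketch; /api/tags is ollama's endpoint for listing locally pulled models):

# Sanity-check a local ollama install: is the server up, and which models are pulled?
import json
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=5) as resp:
        models = json.loads(resp.read())["models"]
except OSError:
    raise SystemExit("ollama doesn't seem to be running on localhost:11434")

if not models:
    print("Server is up but no models are pulled yet -- run, e.g., ollama pull qwen3")
else:
    for m in models:
        print(f"{m['name']}  ({m['size'] / 1e9:.1f} GB on disk)")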
 
 
Want to add an addendum to the above video..

Currently, systems like ollama and LM Studio that act as an abstraction layer over various AI models are not good at splitting a model across two GPUs..
They will split the model across the cards' VRAM, but the actual processing of the query happens on a single GPU at a time..
And as you increase the model size ( i.e. the parameter count in billions ), the compute required goes up too.
So with multiple consumer GPUs you can stuff a big model in.. but it will run unacceptably slowly.
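You can actually watch this happen: poll per-GPU memory and utilization while a query runs and you'll typically see the weights spread across both cards but only one of them busy at a time (a minimal sketch around nvidia-smi's standard query flags):

# Watch per-GPU memory and utilization while a query runs, to see that a split
# model occupies VRAM on both cards but mostly keeps one of them busy at a time.
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used,utilization.gpu",
         "--format=csv,noheader,nounits"]

while True:
    for line in subprocess.check_output(QUERY, text=True).strip().splitlines():
        idx, mem_mib, util = (v.strip() for v in line.split(","))
        print(f"GPU {idx}: {mem_mib} MiB used, {util}% busy", end="    ")
    print()
    time.sleep(1)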

You would be better off with a single 5090.. or some workstation card, if you want to get serious about this.

But AI models have kept getting more efficient and more capable per unit of hardware and per watt over the last few years, so the hardware requirements are generally going down.

Qwen3 is the leader in AI efficiency right now, which is why I recommend it.
Check this out.. the 32B-parameter Qwen3 ( which easily fits into 24 GB of VRAM, i.e. a single 4090 ) currently outperforms some OpenAI models that require GPUs with hundreds of GB of memory each.

It even puts my favorite, DeepSeek R1 ( which is a 200B+ model ), to shame!
1746803909080.png
Also.. you don't need a super fast CPU.. unless you over-stuff the model into the GPU, in which case ollama will run the model partially on the CPU, and performance will degrade further as the model spills into system RAM.. only then does CPU speed start to matter.

But GPUs run AI massively faster than CPUs do, even if you have an AMD Threadripper or Epyc to compensate.. CPUs also consume more power per query.. they're just not the right hardware.

So don't load models that are larger than your GPU's VRAM.. there is no way to have a good time!
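A rough rule of thumb for "does it fit": the weights take about (parameters x bits per weight / 8), plus a couple of GB of headroom for the KV cache and context. A back-of-the-envelope sketch (the overhead figure is an assumption; real usage varies with context length and quant):

# Back-of-the-envelope VRAM estimate: weights = params * bits / 8, plus headroom
# for the KV cache and context. The overhead is a rough assumption, not a measurement.
def vram_estimate_gb(params_billion: float, bits_per_weight: float, overhead_gb: float = 2.0) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # 1e9 params and 1e9 bytes/GB cancel out
    return weights_gb + overhead_gb

for name, params, bits in [("Qwen3 14B @ ~4.5-bit quant", 14, 4.5),
                           ("Qwen3 32B @ ~4.5-bit quant", 32, 4.5),
                           ("Qwen3 32B @ ~8.5-bit quant", 32, 8.5)]:
    print(f"{name}: ~{vram_estimate_gb(params, bits):.0f} GB")

That's why a 14B model at a ~4.5-bit quant (around 10 GB) is about the ceiling for a 12 GB card like the 4070, while a 32B quant fits comfortably in a 24 GB 4090.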

Also, don't bother with an AMD or Intel GPU yet.. they don't support CUDA, and nearly all AI models and tooling so far are built specifically for CUDA, which is an Nvidia standard. Even when you get an LLM working on another brand of GPU via another route, it still won't perform well.

Hopefully this changes in the future because it's very bad for one hardware manufacturer to basically have a monopoly on this.

But anything other than an Nvidia GPU or AI-specific hardware is basically a dead end right now.
 
NetworkChuck is really good at dumbing down concepts and projects, so much so that they end up too simplistic. I actually don't watch his content past the first 5 minutes, which give me enough knowledge to go search for more detailed information. But it's good enough as an overview, thanks for the clarifications.

I'll be much more interested in self-hosting LLMs when they're advanced enough to get decent performance for less than 100 W.
 
We're 100 watts away from your target as of the release of Qwen3.
It literally matches the accuracy and speed of LLMs that needed 10x the memory/CPU 4 months ago.
 
Sorry, I gotta revise that.

There is an LLM research lab called Unsloth that publishes a guide for fine-tuning Qwen3, among other open source models.
Supposedly the speed gain is 2-3x.. but doing the fine-tuning requires some knowledge of AI training ( over my head, and beyond my hardware, at the moment ).
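For anyone who wants a head start, the entry point in Unsloth's examples is roughly this shape (a heavily hedged sketch; the exact model name and parameters here are assumptions, so check their own Qwen3 guide before copying anything):

# Rough shape of an Unsloth fine-tuning setup, based on their published examples.
# The model name and hyperparameters are assumptions -- follow Unsloth's Qwen3
# guide for real values. Needs a CUDA GPU and the unsloth package installed.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-14B",  # assumed repo name -- check Unsloth's model list
    max_seq_length=2048,
    load_in_4bit=True,               # 4-bit base weights so it fits in consumer VRAM
)

# Attach small LoRA adapters so only a tiny fraction of the weights get trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# From here the training loop is the usual Hugging Face / TRL SFTTrainer setup over
# your own dataset, and the result can be exported (e.g. to GGUF) for use in ollama.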

In fact I recently had contact with the head guy at Unsloth... very cool guy!

I will be purchasing some hardware soon and will work on this fine-tuning process when I have time.
I'm curious if I can meet your wattage target now :)
 
Because I do little to no coding/programming, I see very little that an AI can do for me, personally. However, I'm very into self-hosting and hobby homelabbing for fun, so I would still love to set something up locally on low power just to play with. As I said before on this thread and others, I'm more than happy to be proved wrong.

For now, I have a decent ASUS motherboard left over from a project. Maybe I'll scour my local classifieds for a used/cheap Nvidia GPU to slot into it and see what happens.
 
I think you should look for a 2070 or 2080..
You could also pay $20 for a month of an LLM service or use one of the many free demos.

You can still use the previous top open source LLM for free if you sign up at
https://chat.deepseek.com/
 