Trying Ollama (Local AI) with a new/old GPU

I had dipped a toe into the local LLM waters only a little before this.

Inspired by @Geoff, I looked for a way to run something locally - and ended up finding “gguf” and “llama.cpp”.

It was interesting, and kinda neat, to see /something/ being generated … but it was pretty awkward and sometimes got stuck in loops or went waaaaay off track.

For example :

Previously I had a GTX 1070 GPU (2016) … and have just now upgraded to an RTX 3070 (2020).
They both have 8 GB of memory.
I wanted to see what the difference was … but it turns out I was never actually using the 1070 :sad_but_relieved_face:

In any case, I installed Ollama, got it to download some models (Qwen3:4B, Qwen3:8B, Qwen3:30B, Deepseek-r1:8B) and tried them. (I was asking the same question about Delphi .. but happened to record the Rust version.)
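(If anyone wants to poke at the same setup from code rather than the terminal, here's a minimal sketch that sends a question to a locally running Ollama server over its REST API on the default localhost port. The model tag and prompt are just examples; swap in whatever you've pulled.)

```python
# Minimal sketch: ask a locally running Ollama server a question.
# Assumes Ollama is installed and running, and the model has already
# been pulled (e.g. "ollama pull qwen3:4b" in a terminal first).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

payload = {
    "model": "qwen3:4b",  # any model tag you have pulled
    "prompt": "Write a small Rust function that reverses a string.",
    "stream": False,      # one JSON blob back instead of a token stream
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read().decode("utf-8"))

print(body["response"])  # the generated answer
```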

The results were pretty strong .. more than I’d expected :

Bigger models are even stronger, but you’ll need some very expensive hardware to run them: >$30k as a starting point.

And it changes daily, so what you see, think, or plan today could be totally disrupted tomorrow; it’s hard to keep up…

Alex

LM Studio works pretty well and is easy to use - https://lmstudio.ai/

I have a 4060 Ti with 8GB of VRAM - and VRAM is a major limiting factor for local AI. If the entire model cannot be loaded into VRAM, you can forget about any decent performance. Sadly the NVIDIA RTX PRO 6000 cards are well out of my budget at around $18K - they do have 96GB of GDDR7 VRAM though.
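(Side note: a quick way to see how much of a loaded model actually made it into VRAM versus spilling into system RAM is `ollama ps`, and the same info is available over the REST API. Here's a rough sketch; the endpoint and field names are my reading of the Ollama docs, so treat them as assumptions and double-check.)

```python
# Rough sketch: ask a running Ollama server how much of each loaded model
# sits in VRAM vs system RAM ("ollama ps" prints the same split).
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    running = json.loads(resp.read().decode("utf-8"))

for m in running.get("models", []):
    total = m.get("size", 0)         # total bytes occupied by the model
    in_vram = m.get("size_vram", 0)  # bytes offloaded to the GPU
    pct = 100 * in_vram / total if total else 0
    print(f"{m.get('name')}: {pct:.0f}% in VRAM "
          f"({in_vram / 2**30:.1f} GB of {total / 2**30:.1f} GB)")
```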

The Intel Arc B60s look promising, with 24GB of VRAM.

Another issue is that most of the new cards are PCIe 5, which for me means upgrading my PC.

One other option is a Mac with the M4 Pro/Max - but when you spec those up it gets $$$ pretty quickly. So for now I’ll stick with Claude Code and limit what it has access to.

2 Likes

I will second LM Studio.

As for model size, I find that you can generally run a model that is around the same size as total RAM (VRAM+system), so with 4GB VRAM + 32GB system RAM you can run a model ~36B params give or take (better to aim for a model that is a bit smaller). It won’t be fast without a beefy GPU & lots of VRAM but it’ll run.
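For a rough sense of where that rule of thumb comes from, here's a back-of-the-envelope sketch. The bytes-per-parameter figures for the quantisation formats are approximations, and the KV cache / context length can push the real number up quite a bit:

```python
# Back-of-the-envelope memory estimate for a quantised model, to sanity-check
# the "model size roughly equals total RAM" rule of thumb.
# The bytes-per-parameter figures are approximations for common GGUF quants.
BYTES_PER_PARAM = {
    "f16": 2.0,
    "q8_0": 1.0,    # ~8 bits per weight
    "q4_k_m": 0.6,  # ~4.8 bits per weight plus metadata
}

def estimate_gb(params_billions: float, quant: str, overhead_gb: float = 2.0) -> float:
    """Very rough footprint in GB: weights plus a couple of GB for the
    KV cache and runtime overhead (long contexts need noticeably more)."""
    return params_billions * BYTES_PER_PARAM[quant] + overhead_gb

# e.g. a 36B model: ~38 GB at q8_0, ~24 GB at q4_k_m, which is why a
# 4 GB VRAM + 32 GB RAM machine can just about hold one that size.
for quant in ("q8_0", "q4_k_m"):
    print(f"36B @ {quant}: ~{estimate_gb(36, quant):.0f} GB")
```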

There is also the possibility of spinning up a model in a private cloud. I’m not sure what the costs are there, but it might be useful for evaluating performance (smartness and speed) before purchasing hardware.

1 Like

I hope to get around the big hardware requirements by waiting for Benjamin Rosseaux to make his PasLLM available publicly. :slightly_smiling_face:

Salut, Mathias

I have a new laptop that is better suited to local AI than my desktop (a Lenovo Legion with a mobile RTX 5070 Ti).

I’ve been trying out several models as a possible occasional substitute for Claude … aiming at faster responses …

and here’s an example that I thought was pretty awesome : (real time - not sped up)

Using Ollama with a model that is only 2.5 GB … Qwen3:4B-Instruct-2407-q4_k_m

(sorry for the watermark, I haven’t dug up my licence key yet)

Ollama-code … entirely local AI (this model is 2.5 GB).

1 Like

Watching your (silent) video, the performance looked great, but I found myself wondering how loudly those laptop fans were screaming while you recorded it. :wink:

1 Like

It doesn’t make much noise for just that. I guess it might for some sustained effort.

The back of the laptop is kinda open like a farm shed …

My P51 gets a lot hotter for a lot less effort.

I am feeling the urge to splurge… must… resist… Actually, the cost of a new PC/graphics card would keep me subscribed to Claude for a couple of years.

2 Likes

The laptop was 1/3rd off for Black Friday …

and the GPU has 12 GB …

Each additional step up in GPU model was going to add something like $800.