Local LLM experimentation
Running open language models on my own hardware. Not because it beats the cloud, but because I'd rather own the thing I'm starting to rely on.
The box it all runs on, in the corner of the flat. It earns its keep training the trading experiments too.
What it is
A desktop at home that runs open-source LLMs, so I have a capable model on tap that never leaves the flat. It feeds the same things I'd otherwise hand to a cloud model: quick questions, drafting, summarising, and a fallback brain for my other projects when I'd rather not send the data out.
Why bother?
Three reasons, roughly in order:
Privacy. Some of what I'd ask a model is personal: my notes, my finances, my home. Running it locally means none of that goes to a third party, and nothing gets logged or trained on without my say.
Cost and independence. I strongly believe that cloud subscriptions are cheap right now just because everyone's fighting for users. I don't think that will last for too much longer - I expect the models to a) get pricier and b) more restricted over time, with the good stuff moving behind higher tiers or just not being released to the public at all. Owning capable hardware and models is my way of hedging against that eventually happening.
Sovereignty. Mostly it's about not being beholden to something I lean on every day. A local model keeps working if a provider changes its terms, rate-limits me, or decides my use case isn't allowed any more. It even works without internet access!
Tinkering. A big reason is also that I'm a nerd and I love messing around with this stuff - we are lucky enough to be living through a huge technological innovation so I almost feel obliged to get my hands dirty as much as I can.
Of course, to be clear, local models are not as good as the frontier cloud models. I still reach for those for the most complex work, but hopefully that won't be for long, and for a large share of what I actually do, a local model is good enough, and the trade feels worth it.
What I've tried
I keep swapping models in as new open releases land. Qwen 3.6 27B is the one I come back to most, a good balance of reasoning and speed that still fits on one card. Gemma 4 is the other regular, lighter and quick for everyday questions. I keep a few older Qwen and Gemma builds around too, to compare how each generation handles my own tasks, and the bar climbs every time.
It all runs through llama.cpp on the machine pictured above, with Qwen quantised to roughly 5-bit and Gemma to 4-bit, sized so a 27B still fits inside the 3090's 24 GB of memory with decent throughput (~40 tok/s). Most of my smaller projects can point at it instead of a cloud endpoint whenever the data's better off staying home.
The rig
Nothing exotic, just a well-fed gaming box that doubles as my home lab workhorse.
- GPU
- NVIDIA GeForce RTX 3090 · 24 GB VRAM
- CPU
- Intel Core i5-13500 · 14 cores / 20 threads
- Memory
- 64 GB
- Storage
- 1 TB NVMe + 1 TB HDD
- Board
- Gigabyte B760 Gaming X AX
- OS
- Ubuntu 24.04 LTS
- Serving
- llama.cpp · GGUF (Qwen ~5-bit, Gemma ~4-bit)