Still a work in progress. Some images and sections are on their way.
Aazim Haque.
Personal / Projects / Local LLM experimentation
LLM · Self-hosting Ongoing

Local LLM experimentation

Running open language models on my own hardware. Not because it beats the cloud, but because I'd rather own the thing I'm starting to rely on.

The desktop, lit up, that runs the local models

The box it all runs on, in the corner of the flat. It earns its keep training the trading experiments too.

What it is

A desktop at home that runs open-source LLMs, so I have a capable model on tap that never leaves the flat. It feeds the same things I'd otherwise hand to a cloud model: quick questions, drafting, summarising, and a fallback brain for my other projects when I'd rather not send the data out.

Why bother?

Three reasons, roughly in order:

Privacy. Some of what I'd ask a model is personal: my notes, my finances, my home. Running it locally means none of that goes to a third party, and nothing gets logged or trained on without my say.

Cost and independence. I strongly believe that cloud subscriptions are cheap right now just because everyone's fighting for users. I don't think that will last for too much longer - I expect the models to a) get pricier and b) more restricted over time, with the good stuff moving behind higher tiers or just not being released to the public at all. Owning capable hardware and models is my way of hedging against that eventually happening.

Sovereignty. Mostly it's about not being beholden to something I lean on every day. A local model keeps working if a provider changes its terms, rate-limits me, or decides my use case isn't allowed any more. It even works without internet access!

Tinkering. A big reason is also that I'm a nerd and I love messing around with this stuff - we are lucky enough to be living through a huge technological innovation so I almost feel obliged to get my hands dirty as much as I can.

Of course, to be clear, local models are not as good as the frontier cloud models. I still reach for those for the most complex work, but hopefully that won't be for long, and for a large share of what I actually do, a local model is good enough, and the trade feels worth it.

What I've tried

I keep swapping models in as new open releases land. Qwen 3.6 27B is the one I come back to most, a good balance of reasoning and speed that still fits on one card. Gemma 4 is the other regular, lighter and quick for everyday questions. I keep a few older Qwen and Gemma builds around too, to compare how each generation handles my own tasks, and the bar climbs every time.

It all runs through llama.cpp on the machine pictured above, with Qwen quantised to roughly 5-bit and Gemma to 4-bit, sized so a 27B still fits inside the 3090's 24 GB of memory with decent throughput (~40 tok/s). Most of my smaller projects can point at it instead of a cloud endpoint whenever the data's better off staying home.

The rig

Nothing exotic, just a well-fed gaming box that doubles as my home lab workhorse.

GPU
NVIDIA GeForce RTX 3090 · 24 GB VRAM
CPU
Intel Core i5-13500 · 14 cores / 20 threads
Memory
64 GB
Storage
1 TB NVMe + 1 TB HDD
Board
Gigabyte B760 Gaming X AX
OS
Ubuntu 24.04 LTS
Serving
llama.cpp · GGUF (Qwen ~5-bit, Gemma ~4-bit)