In the previous post, we explored the expert-in-the-loop model—where AI doesn’t try to replace engineers but works with them to streamline decisions, surface insights, and improve DevEx (and maybe even OrgEx, if we’re lucky).


But here’s the catch: most LLM workflows today rely on piping every prompt straight to the cloud. That’s fine for hobby projects, but not so great when you’re dealing with internal infrastructure, sensitive configs, or compliance constraints.

So what's the alternative, you might ask? Enter Ollama—a lightweight way to run powerful LLMs locally, whether that’s on your laptop or inside your Kubernetes cluster. In this post, we’ll look at why running models in-house matters, how to deploy Ollama using its operator, and how this setup lays the groundwork for the upcoming hands-on expert-in-the-loop implementations.

Why Run LLMs Locally? (a.k.a. No, You Don’t Need to Call the Cloud for Everything)

Before we jump into YAMLs and API endpoints, let’s take a step back and ask: why run LLMs locally at all?

✅ Data Privacy & Compliance

Not every organization is thrilled about uploading proprietary infrastructure configs, user data, or incident logs to a third-party API. Running locally ensures that sensitive info stays in your sandbox.

✅ Lower Latency, Higher Availability

No rate limits, no cold starts, no “ChatGPT is at capacity” banners. Local LLMs respond faster and keep running even if your WiFi hiccups or your cloud quota hits the ceiling.

✅ Cost Control

Cloud-based GenAI usage can burn through budgets fast. Running Ollama on a GPU box you already have? That’s amortized GenAI, baby.

✅ Developer Autonomy

Need to test something? No API keys. No waiting. Just fire up the model and go. It’s like npm install for LLMs.

Ollama: The Local LLM Swiss Army Knife

Ollama makes it dead simple to run models like llama3, mistral, codellama and others on your local machine—or even better, as part of your organization’s infrastructure.


You get:

  • Pre-built models with one command
  • A REST API you can call like any hosted service
  • Easy integration with tools like LangChain, Python, and curl

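If you just want a quick taste on your laptop first, it really is a couple of commands (this assumes you already have Ollama installed locally):

# Pull the model weights
ollama pull llama3

# Chat with it straight from the terminal; the same instance also serves a REST API on port 11434
ollama run llama3 "Summarize what a Kubernetes operator does."
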
But instead of a quick local run, let’s get ambitious: deploy Ollama inside Kubernetes and expose it as an internal service.

Deploying Ollama with the Ollama Operator

While there are many ways to skin the proverbial cat when it comes to deploying Ollama in Kubernetes, we’ll focus specifically on its operator.

Here’s a high-level setup to get you started:

1. Install the Ollama Operator

kubectl apply \
  --server-side=true \
  -f https://raw.githubusercontent.com/nekomeowww/ollama-operator/v0.10.1/dist/install.yaml

This deploys the operator and sets up the CustomResourceDefinitions (CRDs) needed to manage model instances.
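
To sanity-check the install before moving on, a couple of kubectl commands will do (the namespace below is the operator’s usual default; adjust it if your install differs):

# Confirm the Model CRD is registered
kubectl get crds | grep ollama

# Confirm the operator pod is up and running
kubectl get pods -n ollama-operator-system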

2. Deploy a Model

You can now define a custom resource to run, say, llama3:

apiVersion: ollama.ayaka.io/v1
kind: Model
metadata:
  name: llama3
spec:
  image: llama3

and apply it:

kubectl apply -f llama3-model.yaml

Behind the scenes, the operator spins up a pod that pulls the model, runs Ollama, and exposes it as a service.
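
If you want to watch that happen, plain kubectl is enough (pod and service names are generated by the operator, so don’t hard-code them anywhere just yet):

# The Model custom resource and its status
kubectl get models

# The pod pulling/serving llama3 and the Service exposing port 11434
kubectl get pods
kubectl get svc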

If you want more details on the Ollama Operator, I suggest reading its official documentation. The good thing is, it’s quite short and to the point...

Talking to Ollama: API Basics

Once it’s running, Ollama exposes a REST API you can call like any other in-cluster service. The hostname below assumes the operator created a Service named llama3-service; check kubectl get svc for the actual name in your cluster:

curl http://llama3-service:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain expert-in-the-loop in one sentence.",
  "stream": false
}'

You’ll get a JSON response with the model’s output—ready to be piped into a plugin, a bot, or a Terraform reviewer.
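
Since the non-streaming response is a single JSON object, pulling out just the generated text is a one-liner, which is handy when wiring this into scripts (assuming jq is available):

# Extract only the generated text from the response
curl -s http://llama3-service:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain expert-in-the-loop in one sentence.",
  "stream": false
}' | jq -r '.response'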


What’s Coming Next

Now that we have a local LLM service running, we’ve got the foundation for building real expert-in-the-loop tooling. In the next few posts, we’ll cover:

 🛠️ Using local LLMs to review terraform plan output and highlight risky changes

 💬 Creating a Backstage chatbot that answers internal team and service questions using real org data

 📊 Auto-generating postmortems from logs, metrics, and incident notes

 📚 Feeding and querying internal documentation to reduce Slack dependency

Until we meet again...
