The API in Front of the AI

Filed under: Cloud Engineering · AI Infrastructure · Local Lab

You’ve Got APIs. Now You’ve Got AI APIs. Now What?

Picture this: you grab an Ollama model, wire it into your app locally — done. Celebrate. But two months later? You’ve got five apps, a handful of models, and absolutely no visibility into what’s being called, how often, or what it’s costing you in compute. Sound familiar?

Welcome to the reason LLM gateways exist. This is Part 1 of a two-part Field Notes series. Here we cover what an LLM gateway is, why you’d want one, and how to get Bifrost running locally on your Mac against Ollama with qwen3.5 — fully offline, fully free, fully yours.

So… What Even Is an LLM Gateway?

Think of an LLM gateway as the air traffic control tower for your AI requests.

Every time your app wants to talk to a language model, that request flies through the gateway first. The gateway decides where to send it, who is allowed to send it, how much it costs, and whether it should be cached, retried, or blocked outright.

Non-techy version: it’s a smart switchboard that sits between your apps and every AI model on your machine (or the planet), speaking everyone’s language and keeping receipts.

Techy version: it’s an OpenAI-compatible reverse proxy that normalizes request and response formats across providers and enforces auth, rate limits, routing logic, cost tracking, and observability, all in one place.
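
To make "normalizes response formats" concrete, here is a minimal sketch of reshaping an Ollama /api/chat reply into an OpenAI-style chat completion. The function name ollama_to_openai is made up for illustration, and this is not how Bifrost implements it internally; it only shows the kind of translation a gateway performs.

```python
import time
import uuid


def ollama_to_openai(ollama_response: dict) -> dict:
    """Reshape an Ollama /api/chat reply into an OpenAI-style chat completion."""
    # Ollama reports token counts as prompt_eval_count / eval_count.
    prompt_tokens = ollama_response.get("prompt_eval_count", 0)
    completion_tokens = ollama_response.get("eval_count", 0)
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",  # illustrative id, generated locally
        "object": "chat.completion",
        "created": int(time.time()),
        "model": ollama_response["model"],
        "choices": [
            {
                "index": 0,
                "message": ollama_response["message"],
                "finish_reason": "stop" if ollama_response.get("done") else None,
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```

Because every provider's reply ends up in this one shape, the app only ever needs to read choices[0].message.content and usage, no matter which model answered.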

Why Do You Need One?

🔑 One virtual key to rule them all

Without a gateway, every app talks directly to every model. With a gateway, your apps get virtual keys from the gateway itself. Rotate or revoke in one place — every downstream app is covered.
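
As a rough sketch of the idea, a virtual key is just an entry the gateway owns and can delete. The VirtualKeyStore class below is hypothetical and kept in memory for illustration; it is not Bifrost's key management API.

```python
import secrets
from typing import Optional


class VirtualKeyStore:
    """Hypothetical in-memory registry mapping virtual keys to app names."""

    def __init__(self) -> None:
        self._keys = {}  # virtual key -> app name

    def issue(self, app_name: str) -> str:
        """Create a fresh random key for an app and remember who owns it."""
        key = f"vk-{secrets.token_hex(16)}"
        self._keys[key] = app_name
        return key

    def resolve(self, key: str) -> Optional[str]:
        """Return the owning app, or None if the key is unknown or revoked."""
        return self._keys.get(key)

    def revoke(self, key: str) -> bool:
        """Remove a key; returns True if it existed."""
        return self._keys.pop(key, None) is not None
```

Revoking a key here touches one dictionary entry. The app holding it is cut off on its next request, and no other app has to change anything.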

🔀 Provider flexibility without code changes

Want to swap from qwen3.5 to llama3.2 for a specific workload? That’s a config change, not a code deployment. Your app cannot tell the difference.
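
One way to picture this: the app asks for a logical name, and a routing table on the gateway side decides which model that means. The ROUTES table, the logical names, and both functions below are assumptions for illustration, not Bifrost's configuration format.

```python
# Hypothetical routing table: apps send a logical name, the gateway picks the target.
ROUTES = {
    "chat-default": "ollama/qwen3.5:latest",
    "summarizer": "ollama/qwen3.5:latest",
}


def resolve_model(logical_name: str, routes: dict) -> str:
    """Return the provider/model target configured for a logical name."""
    try:
        return routes[logical_name]
    except KeyError:
        raise ValueError(f"no route configured for {logical_name!r}") from None


def rewrite_request(request: dict, routes: dict) -> dict:
    """Copy the request, replacing the logical model name with the routed target."""
    return {**request, "model": resolve_model(request["model"], routes)}
```

Pointing "summarizer" at a different model means editing one entry in the table. The request the app sends stays byte-for-byte the same.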

💸 Spend and usage tracking

Even with local models you want to know which apps are hammering your GPU, how many tokens are flowing, and which requests are slowest. Gateways give you that visibility out of the box.
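
Under the hood this kind of tracking is mostly bookkeeping. Here is a small sketch, assuming the gateway can attribute each request to an app name; UsageLedger and its fields are invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Usage:
    requests: int = 0
    total_tokens: int = 0
    total_latency_s: float = 0.0

    @property
    def avg_latency_s(self) -> float:
        return self.total_latency_s / self.requests if self.requests else 0.0


class UsageLedger:
    """Hypothetical per-app usage counter."""

    def __init__(self) -> None:
        self._by_app = defaultdict(Usage)

    def record(self, app: str, total_tokens: int, latency_s: float) -> None:
        """Add one finished request to the app's running totals."""
        usage = self._by_app[app]
        usage.requests += 1
        usage.total_tokens += total_tokens
        usage.total_latency_s += latency_s

    def heaviest(self) -> list:
        """Apps sorted by token consumption, largest first."""
        return sorted(
            self._by_app.items(), key=lambda kv: kv[1].total_tokens, reverse=True
        )
```

Calling heaviest() after a day of traffic answers the "who is hammering my GPU" question directly.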

🔄 Automatic fallback and load balancing

What happens when your primary model is busy or unavailable? Define a fallback chain: try qwen3.5 → if that fails, try llama3.2. The gateway handles it automatically — no retry logic in your app.
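
The logic the gateway takes off your hands looks roughly like this. It is a sketch only: complete_with_fallback and the call_model callback are hypothetical names, and a real gateway would be pickier about which errors justify moving on to the next model.

```python
from typing import Callable, Sequence, Tuple


class AllModelsFailed(Exception):
    """Raised when every model in the chain has failed."""


def complete_with_fallback(
    prompt: str,
    chain: Sequence[str],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each model in order; return (model, reply) from the first success."""
    errors = []
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # sketch: treats every failure as retryable
            errors.append(f"{model}: {exc}")
    raise AllModelsFailed("; ".join(errors))
```

With chain set to ["qwen3.5", "llama3.2"], a failure on the first model silently produces an answer from the second, and the caller can still see which model replied.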

💬 Semantic caching

If ten requests ask essentially the same question within a short window, should you run the model ten times? Semantic caching returns the cached response instantly for similar queries, slashing redundant compute.
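
Here is a toy version of the idea. Real semantic caches compare embedding vectors from a model; this sketch swaps in plain word counts and cosine similarity so it runs with the standard library alone. The SemanticCache class and the 0.8 threshold are assumptions, and the expiry window is left out for brevity.

```python
import math
import re
from collections import Counter


def _vectorize(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a stored response when a new prompt is similar enough to an old one."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self._entries = []  # list of (vector, response) pairs

    def get(self, prompt: str):
        """Return the closest cached response, or None if nothing is close enough."""
        vec = _vectorize(prompt)
        best = max(self._entries, key=lambda e: _cosine(vec, e[0]), default=None)
        if best is not None and _cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, prompt: str, response: str) -> None:
        self._entries.append((_vectorize(prompt), response))
```

On a cache hit the model never runs, which is where the compute savings come from. The threshold is the knob to watch: set it too low and users start getting answers to questions they did not ask.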

🔭 Observability

Logs, latency per model, request counts, error rates — all the stuff you’ll eventually want a dashboard for. Gateways plug this in as middleware so you don’t have to instrument every app.
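
"Middleware" here means a wrapper that every request passes through. A minimal sketch, assuming a handler with the signature handler(model, prompt); the observed decorator and the METRICS dictionary are illustrative names, not part of any gateway's API.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-model counters; a real gateway would export these to a metrics backend.
METRICS = defaultdict(lambda: {"requests": 0, "errors": 0, "latency_s": 0.0})


def observed(handler):
    """Wrap a handler(model, prompt) so every call is counted and timed."""

    @wraps(handler)
    def wrapper(model, prompt):
        stats = METRICS[model]
        stats["requests"] += 1
        start = time.perf_counter()
        try:
            return handler(model, prompt)
        except Exception:
            stats["errors"] += 1
            raise  # the caller still sees the original error
        finally:
            # Runs on success and failure alike, so failed calls are timed too.
            stats["latency_s"] += time.perf_counter() - start

    return wrapper
```

The handler itself contains no logging code at all, which is the point: instrument once at the gateway and every app behind it is covered.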

Why Bifrost for a Local Lab?

There are several solid open-source LLM gateways out there. For a local lab environment, Bifrost hits the sweet spot:

Zero-config startup. One npx command and you have a running gateway with a web UI. No YAML required to get started.
Built in Go. Lightweight, fast, and doesn’t need a Python environment or virtual env to manage.
Web UI included. Add providers, configure keys, and watch requests flow through a dashboard at localhost:8080.
OpenAI-compatible. Any SDK or tool that works with OpenAI works with Bifrost. It's a one-line change in your app.
Ollama support. Bifrost treats Ollama as a first-class provider. Point it at localhost:11434 and you’re done.
Free and open source. Apache 2.0 license. Self-host forever at no cost.

Prerequisites

Before we start, make sure you have the following on your Mac:

Tool	Version / size	Install
Node.js	18+	`brew install node`
Ollama	Latest	`brew install ollama`
qwen3.5:latest	6.6 GB	`ollama pull qwen3.5:latest`
Docker (optional)	Latest	docs.docker.com

Verify everything is ready:

node --version       # should show v18 or higher
ollama serve         # start Ollama if not already running (blocks, so give it its own terminal)
ollama list          # should show qwen3.5:latest (needs the server running)

Step 1: Verify Ollama is Running

Before adding a gateway, confirm qwen3.5 responds on its own:

curl http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3.5:latest",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "stream": false
  }'

You should get a JSON response with the model’s reply. If not, run ollama serve in a terminal first.
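
If you would rather run this check from Python, the same call can be sketched with the standard library alone. The helper names below (build_chat_request, extract_reply, ask_ollama) are made up for this example; the URL and payload mirror the curl command above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_request(prompt: str, model: str = "qwen3.5:latest") -> urllib.request.Request:
    """Build the same POST the curl command sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(body: bytes) -> str:
    """Pull the assistant text out of a non-streaming /api/chat response."""
    return json.loads(body)["message"]["content"]


def ask_ollama(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local Ollama server and return its reply."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return extract_reply(resp.read())
```

Calling ask_ollama("Say hello in one sentence.") with Ollama running should return the model's reply as a plain string. If the server is down, urllib raises a URLError instead.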

Step 2: Start Bifrost

Option A — npx (fastest)

Open a new terminal window and run:

npx -y @maximhq/bifrost

Bifrost downloads a pre-compiled Go binary for your architecture (arm64 for M-series Macs) and starts immediately. You should see:

Bifrost HTTP gateway starting...
Web UI available at http://localhost:8080

Tip: The first npx run may take 15–30 seconds while the binary downloads. Subsequent starts are instant.

Option B — Docker (with data persistence)

If you prefer Docker and want your configuration to survive restarts:

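docker run -p 8080:8080 \
  -v "$(pwd)/data:/app/data" \
  maximhq/bifrost

The -v "$(pwd)/data:/app/data" flag persists your provider config and request logs to a local ./data folder. The quotes keep the mount working when your current path contains spaces. One catch with this option: inside the container, localhost refers to the container itself, not your Mac. When you register Ollama in Step 3, use http://host.docker.internal:11434 as the base URL instead of http://localhost:11434.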

Step 3: Add Ollama as a Provider

Open http://localhost:8080 in your browser. Bifrost starts in UI-config mode — everything is configured through the web interface.

In the Bifrost dashboard:

Go to Providers in the sidebar
Click Add Provider
Select Ollama from the provider list
Set the base URL to http://localhost:11434
Save — no API key needed for local Ollama

Alternatively, register it via the API:

curl -X POST http://localhost:8080/api/providers \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "ollama",
    "base_url": "http://localhost:11434",
    "keys": [{"name": "local", "value": "none", "models": ["qwen3.5:latest"]}]
  }'

Step 4: Send Your First Request Through the Gateway

With Bifrost running and Ollama registered, send a request through the gateway:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "ollama/qwen3.5:latest",
    "messages": [{"role": "user", "content": "What is an LLM gateway? One paragraph."}]
  }'

You should get a full OpenAI-format response from qwen3.5, routed through Bifrost. Check the Bifrost dashboard — you’ll see the request logged with latency, token count, and model used.

✅ You did it. Your entire AI stack is now running locally. No cloud. No API keys. No bill at the end of the month.

Step 5: Run the Test Suite

Here are the key tests to validate your setup end-to-end.

Smoke test

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "ollama/qwen3.5:latest",
    "messages": [{"role": "user", "content": "Reply with just the word PONG"}]
  }'

Streaming test

Tokens should arrive incrementally in your terminal. The -N flag turns off output buffering so curl prints each chunk as it arrives:

curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "ollama/qwen3.5:latest",
    "messages": [{"role": "user", "content": "Count from 1 to 5, one number per line."}],
    "stream": true
  }'
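
What scrolls past in the terminal is a series of server-sent event lines, each starting with data: and carrying a JSON chunk, ending with data: [DONE]. As a small sketch of how a client turns that back into text (iter_stream_text is a made-up helper name, and it assumes the standard OpenAI chunk shape with choices[].delta.content):

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Joining the yielded fragments gives the full reply. Printing each one as it arrives, without a trailing newline, reproduces the typing effect.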

System prompt test

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "ollama/qwen3.5:latest",
    "messages": [
      {"role": "system", "content": "You are a pirate. Respond only in pirate speak."},
      {"role": "user", "content": "Explain what a cloud gateway is."}
    ]
  }'

Tool calling test

qwen3.5 supports native tool calling. This verifies Bifrost passes tool schemas and the model returns structured output:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "ollama/qwen3.5:latest",
    "messages": [{"role": "user", "content": "What is 47 multiplied by 13? Use the calculator."}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "calculator",
        "description": "Performs arithmetic",
        "parameters": {
          "type": "object",
          "properties": {
            "operation": {"type": "string", "enum": ["add","subtract","multiply","divide"]},
            "a": {"type": "number"},
            "b": {"type": "number"}
          },
          "required": ["operation","a","b"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
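
If the test works, the response carries no answer text. Instead the assistant message includes a tool_calls list whose arguments field is a JSON string, and your code is expected to run the tool and send the result back. A hedged sketch of that client-side step, assuming the standard OpenAI tool-call shape; calculator and run_tool_calls are illustrative helpers that implement the schema from the request above.

```python
import json
import operator

OPERATIONS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def calculator(operation: str, a: float, b: float) -> float:
    """Local implementation of the tool described in the request schema."""
    return OPERATIONS[operation](a, b)


def run_tool_calls(message: dict) -> list:
    """Execute calculator tool calls and return OpenAI-style tool messages."""
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        if fn["name"] != "calculator":
            raise ValueError(f"unknown tool: {fn['name']}")
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        results.append(
            {
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(calculator(**args)),
            }
        )
    return results
```

Appending the returned tool messages to the conversation and calling the endpoint again lets the model phrase the final answer.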

Concurrent load test

Fires 10 parallel requests — check the Bifrost dashboard afterward for latency stats:

for i in {1..10}; do
  curl -s -X POST http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"ollama/qwen3.5:latest\",\"messages\":[{\"role\":\"user\",\"content\":\"What is $i + $i?\"}]}" &
done
wait && echo "All 10 requests completed"
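
The shell loop fires the requests but does not tell you how many succeeded. Here is a sketch of the same test in Python that counts failures. run_load is a hypothetical helper, and send is whatever function you supply to post one request to the gateway.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(send, count: int = 10) -> dict:
    """Call send(i) for i in 1..count in parallel and summarize the outcome."""

    def timed(i):
        start = time.perf_counter()
        try:
            send(i)
            ok = True
        except Exception:
            ok = False  # count the failure instead of aborting the run
        return ok, time.perf_counter() - start

    # One worker per request so all of them are in flight at once.
    with ThreadPoolExecutor(max_workers=count) as pool:
        outcomes = list(pool.map(timed, range(1, count + 1)))

    return {
        "sent": count,
        "succeeded": sum(1 for ok, _ in outcomes if ok),
        "slowest_s": max(elapsed for _, elapsed in outcomes),
    }
```

Pass in a send function that posts "What is {i} + {i}?" to http://localhost:8080/v1/chat/completions and raises on a non-200 status. A succeeded count equal to sent is the "no dropped requests" result the test is looking for.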

Python SDK drop-in test

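import openai

# Point the standard OpenAI client at Bifrost instead of api.openai.com.
client = openai.OpenAI(
    api_key="dummy",  # placeholder; the SDK requires some value
    base_url="http://localhost:8080/openai"
)

# Same call you would make against OpenAI; only the model name is gateway-specific.
response = client.chat.completions.create(
    model="ollama/qwen3.5:latest",
    messages=[{"role": "user", "content": "Explain an LLM gateway in one sentence."}]
)
print(response.choices[0].message.content)
print(f"Model: {response.model}")
print(f"Tokens: {response.usage.total_tokens}")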

Test Reference

Test	What it validates
Smoke test	Gateway ↔ Ollama routing works
Streaming	SSE passthrough works correctly
System prompt	System role forwarded correctly
Tool calling	Structured function output works
Concurrent load	No dropped requests under parallel load
Python SDK	Drop-in OpenAI replacement works end-to-end

What’s Next

You now have a fully functional local LLM gateway running on your Mac. qwen3.5 is serving requests through Bifrost with observability, routing, and zero cloud dependency.

In Part 2, we’ll go deeper: building a local MCP server that exposes your own machine’s tools — system info, shell commands, custom APIs — and wiring it through Bifrost so qwen3.5 can actually do things on your behalf. That’s where it gets really interesting.

Until then, some things worth exploring in your new setup:

Add a second local model (ollama pull llama3.2) and configure a fallback chain between them
Set a virtual key with a token budget and watch the gateway enforce it
Enable semantic caching in the Bifrost UI and run the same prompt twice — watch the second one come back instantly
Check http://localhost:8080 after your test suite — the dashboard logs every request with latency and token counts

Anthony Mineer is a Senior Manager of Cloud Engineering writing about cloud infrastructure, AI, and platform engineering at anthonymineer.me. Part 2: MCP Gateway Setup coming soon.
