BYOK + local LLM — no AI vendor lock-in

Tech USP: All AI calls go through one OpenAI-compatible adapter. Anthropic, OpenAI, Llama via Ollama or vLLM, and other on-prem setups are interchangeable via environment variables, with no code change in your app.

Buyer view: In local mode, your customer data, internal records, and sensitive prompts never leave your data center. It is your own AI on your own hardware. Switching providers later is a .env change, not a sprint.

What this means concretely

One configuration, three provider worlds:

# Cloud API (kumiko.rocks hosted default)
LLM_ENDPOINT=https://api.anthropic.com/v1
LLM_API_KEY=sk-ant-...
LLM_MODEL=claude-sonnet-4-6

# BYOK Cloud (your own OpenAI account)
LLM_ENDPOINT=https://api.openai.com/v1
LLM_API_KEY=sk-...
LLM_MODEL=gpt-4o

# Local model (Ollama on your own GPU server)
LLM_ENDPOINT=http://gpu-server.internal:11434/v1
LLM_API_KEY=ollama
LLM_MODEL=llama3.1:70b
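
A minimal sketch of how an app might read these three variables at startup. Only the variable names come from the examples above; the `LlmConfig` type and the `loadLlmConfig` function are hypothetical names for illustration, not part of the platform's API.

```typescript
// Hypothetical shape of the three settings shown above.
interface LlmConfig {
  endpoint: string;
  apiKey: string;
  model: string;
}

// Reads the three variables from an env-like object (pass process.env in Node).
// Throws early so a missing or malformed value fails at startup, not mid-request.
function loadLlmConfig(env: Record<string, string | undefined>): LlmConfig {
  const read = (name: string): string => {
    const value = env[name]?.trim();
    if (!value) throw new Error(`Missing required variable ${name}`);
    return value;
  };
  // Drop trailing slashes so paths can be appended consistently later.
  const endpoint = read("LLM_ENDPOINT").replace(/\/+$/, "");
  if (!/^https?:\/\//.test(endpoint)) {
    throw new Error("LLM_ENDPOINT must start with http:// or https://");
  }
  return { endpoint, apiKey: read("LLM_API_KEY"), model: read("LLM_MODEL") };
}
```

Failing fast on a missing variable keeps a provider switch from turning into a runtime surprise after deploy.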

App code stays identical:

const result = await ai.generate({
  prompt: "Suggest 8D steps D2-D4 from this complaint: ...",
  schema: suggestionSchema,
});

The adapter does the rest. No provider-specific code in your features.
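
To illustrate what "the rest" could involve, here is a sketch of how a thin adapter might turn a prompt into an OpenAI-compatible chat completions request. The `/chat/completions` path and the `Authorization: Bearer` header follow the OpenAI API convention, which Ollama and vLLM also expose. `buildChatRequest` and its types are hypothetical and say nothing about how the real adapter is implemented; schema handling is left out.

```typescript
// Assumed config shape: the three values from the env examples.
interface AdapterConfig {
  endpoint: string;
  apiKey: string;
  model: string;
}

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the same request for every provider; only the config values differ.
function buildChatRequest(config: AdapterConfig, prompt: string): HttpRequest {
  return {
    url: `${config.endpoint}/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${config.apiKey}`,
    },
    body: JSON.stringify({
      model: config.model,
      messages: [{ role: "user", content: prompt }],
    }),
  };
}
```

Because the request shape never changes, swapping providers only changes the three config values that feed into it.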

Three modes

Mode	When	Setup
Cloud API	kumiko.rocks hosted default, fastest start	Anthropic API with platform key — we handle it
BYOK Cloud	Power users with their own Anthropic/OpenAI account, own quotas/compliance	Set env vars, platform uses your account
Local model	Mid-market with GDPR pressure or air-gapped setup	Ollama or vLLM on your own GPU server — AI data never leaves your network

What you get for free

Provider switching without code change

Anthropic prices went up? Llama 3.1 70B is enough for your tasks? OpenAI ships a better model? Change the env vars, deploy, done. No refactoring, no provider-specific SDKs in your code, no migration.

Data and AI stay in-house

Run a local language model on your own GPU hardware (RTX 4090, A100, or a rented H100 from a sovereign provider). Customer prompts, sensitive records, and business logic are processed locally, with no data flow to US clouds, no Schrems II risk, and no “but our sub-processor”.

Air-gapped possible

Fully offline-capable, with no outbound network connections. Even updates arrive as signed container images and are installed manually. Audit-ready for ISO 27001 and higher security tiers.

Cost optimization per task type

Simple classification tasks → small local model (Llama 3.1 8B, ~$5/day GPU power). Complex generation tasks → larger cloud model (Claude Opus or GPT-4o, pay-per-token). Mix configurable per feature.
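
One way a per-feature mix could be expressed, as a sketch only: a static table from feature name to provider settings, resolved at call time. The feature names, the `FEATURE_ROUTES` table, and `resolveRoute` are hypothetical; this page does not specify how the mix is actually configured. The lookup is static configuration, not automatic failover.

```typescript
interface Route {
  endpoint: string;
  model: string;
}

// Hypothetical default: the small local model handles everything not listed below.
const DEFAULT_ROUTE: Route = {
  endpoint: "http://gpu-server.internal:11434/v1",
  model: "llama3.1:8b",
};

// Hypothetical overrides: only features that need a larger model go to the cloud.
const FEATURE_ROUTES = new Map<string, Route>([
  ["report-generation", { endpoint: "https://api.openai.com/v1", model: "gpt-4o" }],
]);

// Returns the configured route for a feature, or the local default.
function resolveRoute(feature: string): Route {
  return FEATURE_ROUTES.get(feature) ?? DEFAULT_ROUTE;
}
```

With this shape, moving a feature between local and cloud is a one-line change to the table.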

Which local model when?

Model	When useful	Hardware
Llama 3.1 8B	Classification, simple extraction, prompts with clear output schema	1× RTX 4090 (~24 GB VRAM)
Llama 3.1 70B	Generating reports, multi-step reasoning, longer contexts	2× A100 (80 GB), or less hardware with quantization
Mistral / Mixtral 8x7B	Code generation, multilingual, strong European-language quality	1× A100 or 2× RTX 4090 (Mixtral needs quantization on either)
Qwen 2.5 72B	Demanding reasoning + long contexts	Same as Llama 3.1 70B

Recommendation for a pilot setup: Llama 3.1 8B as the default, with the cloud API configured for the features that need complex generation. Scale by workload.
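
The hardware column in the table above follows from a rule of thumb: memory for the weights is roughly parameter count times bytes per parameter, before KV cache and runtime overhead. A sketch of that arithmetic (the function name is illustrative, and real requirements depend on context length, batch size, and the serving stack):

```typescript
// Approximate memory for model weights only, in GB (1 GB = 1e9 bytes).
// Excludes KV cache, activations, and framework overhead.
function weightMemoryGB(paramsBillions: number, bitsPerParam: number): number {
  return (paramsBillions * bitsPerParam) / 8;
}

const llama8bFp16 = weightMemoryGB(8, 16);   // 16 GB: fits one 24 GB RTX 4090
const llama70bFp16 = weightMemoryGB(70, 16); // 140 GB: needs 2× 80 GB A100
const llama70bInt4 = weightMemoryGB(70, 4);  // 35 GB: why quantization is the alternative
```

Leave headroom on top of these numbers; long contexts in particular grow the KV cache well beyond the weights-only estimate.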

What it isn’t

No in-house LLM training: we don’t host a training pipeline. You use off-the-shelf models, fine-tuned via external providers if needed
No caching / rate-limit layer — adapter is thin, no re-implementation of provider caching. Build that as middleware in front of the adapter if you need it
No multi-provider failover: each call goes to the one provider configured for it, with no automatic Cloud→Local fallback when that provider fails. Can come later, not a day-one requirement
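
If you do need caching, here is a sketch of what middleware in front of the adapter could look like: a wrapper that memoizes results by prompt. The `Generate` type stands in for the adapter's `ai.generate` call and is an assumption about its shape; `withCache` is a hypothetical helper. A production version would also need eviction, TTLs, and a cache key that covers the schema and model.

```typescript
// Assumed shape of the adapter call; the real signature may differ.
type Generate<T> = (request: { prompt: string }) => Promise<T>;

// Wraps a generate function with an in-memory cache keyed by prompt.
// Caches the promise itself so concurrent identical calls share one request.
function withCache<T>(generate: Generate<T>): Generate<T> {
  const cache = new Map<string, Promise<T>>();
  return (request) => {
    const hit = cache.get(request.prompt);
    if (hit) return hit;
    const pending = generate(request).catch((error) => {
      cache.delete(request.prompt); // do not keep failed calls in the cache
      throw error;
    });
    cache.set(request.prompt, pending);
    return pending;
  };
}
```

Because the wrapper has the same signature as the function it wraps, features can call it exactly as they would call the adapter directly.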

Where this lands in the pitch

EU mid-market: Top argument “data and AI stay with you” — own server, own model. Sensitive records never processed externally
Indie hackers: Sub-argument “switch providers anytime, no lock-in” — Anthropic expensive? OpenAI outage? Local Llama for most tasks, cloud for the complex ones
Enterprise: The compliance closer is air-gapped mode for ISO 27001, SOC 2, BSI IT-Grundschutz, and higher security tiers