Tech USP: All AI calls go through one OpenAI-compatible adapter. Anthropic, OpenAI, Llama via Ollama, vLLM, on-prem setups — interchangeable via environment variables, no code change in your app.
Buyer view: Run a local model and your customer data, internal records, and sensitive prompts never leave your data center: your own AI on your own hardware. Switching providers later is a .env change, not a sprint.
What this means concretely
One configuration, three provider worlds:
# Cloud API (kumiko.so hosted default)
LLM_ENDPOINT=https://api.anthropic.com/v1
LLM_API_KEY=sk-ant-...
LLM_MODEL=claude-sonnet-4-6
# BYOK Cloud (your own OpenAI account)
LLM_ENDPOINT=https://api.openai.com/v1
LLM_API_KEY=sk-...
LLM_MODEL=gpt-4o
# Local model (Ollama on your own GPU server)
LLM_ENDPOINT=http://gpu-server.internal:11434/v1
LLM_API_KEY=ollama
LLM_MODEL=llama3.1:70b
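A minimal sketch of how these three variables might be read and validated at startup. Only the `LLM_*` variable names come from the configuration above; the `LlmConfig` type and the `loadLlmConfig` function are illustrative assumptions, not the adapter's actual API.

```typescript
// Hypothetical config shape; only the LLM_* variable names come from the document.
interface LlmConfig {
  endpoint: string;
  apiKey: string;
  model: string;
}

// Reads the three variables from an env-like record and fails fast when one is missing.
function loadLlmConfig(env: Record<string, string | undefined>): LlmConfig {
  const read = (key: string): string => {
    const value = env[key]?.trim();
    if (!value) throw new Error(`Missing environment variable: ${key}`);
    return value;
  };
  const endpoint = read("LLM_ENDPOINT").replace(/\/+$/, ""); // drop trailing slashes
  new URL(endpoint); // throws if the endpoint is not a valid absolute URL
  return { endpoint, apiKey: read("LLM_API_KEY"), model: read("LLM_MODEL") };
}

// Example: the local Ollama setup from above.
const localConfig = loadLlmConfig({
  LLM_ENDPOINT: "http://gpu-server.internal:11434/v1",
  LLM_API_KEY: "ollama",
  LLM_MODEL: "llama3.1:70b",
});
```

In a Node.js app you would typically pass `process.env` instead of a literal object. Failing at startup keeps a misconfigured provider from surfacing only on the first AI call.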
App code stays identical:
const result = await ai.generate({
  prompt: "Suggest 8D steps D2-D4 from this complaint: ...",
  schema: suggestionSchema,
});
The adapter does the rest. No provider-specific code in your features.
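To make "the adapter does the rest" more concrete, here is a rough sketch of what a thin OpenAI-compatible adapter could do internally: turn a prompt into a request against the standard `/chat/completions` route of whichever endpoint is configured. `AdapterConfig`, `ChatRequest`, and `buildChatRequest` are hypothetical names. The real adapter's internals, including how `schema` is enforced, are not described in this document.

```typescript
// Hypothetical config shape mirroring LLM_ENDPOINT, LLM_API_KEY and LLM_MODEL.
interface AdapterConfig {
  endpoint: string;
  apiKey: string;
  model: string;
}

// Plain description of an HTTP request, so the sketch runs without network access.
interface ChatRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds an OpenAI-style chat completion request; only the config differs per provider.
function buildChatRequest(config: AdapterConfig, prompt: string): ChatRequest {
  return {
    url: `${config.endpoint.replace(/\/+$/, "")}/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${config.apiKey}`,
    },
    body: JSON.stringify({
      model: config.model,
      messages: [{ role: "user", content: prompt }],
    }),
  };
}

const complaintPrompt = "Suggest 8D steps D2-D4 from this complaint: ...";
const cloudRequest = buildChatRequest(
  { endpoint: "https://api.openai.com/v1", apiKey: "sk-...", model: "gpt-4o" },
  complaintPrompt,
);
const localRequest = buildChatRequest(
  { endpoint: "http://gpu-server.internal:11434/v1", apiKey: "ollama", model: "llama3.1:70b" },
  complaintPrompt,
);
// The two requests differ only in URL, key and model; the calling code is the same.
```

Because every provider is addressed through the same request shape, swapping providers reduces to swapping the three config values.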
Three modes
| Mode | When | Setup |
|---|---|---|
| Cloud API | kumiko.so hosted default, fastest start | Anthropic API with platform key — we handle it |
| BYOK Cloud | Power users with their own Anthropic/OpenAI account, own quotas/compliance | Set env vars, platform uses your account |
| Local model | Mid-market with GDPR pressure or air-gapped setup | Ollama or vLLM on your own GPU server — AI data never leaves your network |
What you get for free
Provider switching without code change
Anthropic prices went up? Llama 3.1 70B is enough for your tasks? OpenAI ships a better model? Change an env var, deploy, done. No refactoring, no provider-specific SDKs in your code, no migration.
Data and AI stay in-house
Run a local language model on your own GPU hardware (RTX 4090, A100, or an H100 rented from a sovereign provider). Customer prompts, sensitive records, and business logic are processed locally: no data flow to US clouds, no Schrems II risk, no “but our sub-processor” discussions.
Air-gapped possible
Fully offline-capable, with no outbound network connections. Even updates arrive as signed container images and are installed manually. Audit-ready for ISO 27001 and higher security tiers.
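In an air-gapped deployment it can help to verify at startup that the configured endpoint really points inside your network. The following is a sketch of such a guard under stated assumptions: `isAllowedEndpoint` and the `internal` DNS zone are hypothetical, and a config check like this complements network-level egress blocking rather than replacing it.

```typescript
// Returns true when the endpoint's host equals an allowed suffix or is a subdomain of it.
function isAllowedEndpoint(endpoint: string, allowedHostSuffixes: string[]): boolean {
  const host = new URL(endpoint).hostname.toLowerCase();
  return allowedHostSuffixes.some(
    (suffix) => host === suffix || host.endsWith(`.${suffix}`),
  );
}

const allowedZones = ["internal"]; // hypothetical internal DNS zone
const localAllowed = isAllowedEndpoint("http://gpu-server.internal:11434/v1", allowedZones);
const cloudAllowed = isAllowedEndpoint("https://api.openai.com/v1", allowedZones);
// localAllowed is true, cloudAllowed is false.
```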
Cost optimization per task type
Simple classification tasks → small local model (Llama 3.1 8B, ~$5/day GPU power). Complex generation tasks → larger cloud model (Claude Opus or GPT-4o, pay-per-token). Mix configurable per feature.
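One way such a per-feature mix could be expressed is a small routing table that maps a task type to a provider configuration. The task names, the `routes` table, and `configFor` are assumptions made for illustration; this document does not specify how the per-feature setting is actually declared.

```typescript
// Hypothetical task categories; a real app would define its own.
type TaskType = "classification" | "extraction" | "generation";

interface ProviderConfig {
  endpoint: string;
  model: string;
}

// Small local model for simple tasks, larger cloud model for complex generation.
const routes: Record<TaskType, ProviderConfig> = {
  classification: { endpoint: "http://gpu-server.internal:11434/v1", model: "llama3.1:8b" },
  extraction: { endpoint: "http://gpu-server.internal:11434/v1", model: "llama3.1:8b" },
  generation: { endpoint: "https://api.openai.com/v1", model: "gpt-4o" },
};

// Looks up which provider a feature should use based on its task type.
function configFor(task: TaskType): ProviderConfig {
  return routes[task];
}

const classificationConfig = configFor("classification");
const generationConfig = configFor("generation");
```

The routing here is an explicit, static choice per task type. It is not automatic failover between providers.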
Which local model when?
| Model | When useful | Hardware |
|---|---|---|
| Llama 3.1 8B | Classification, simple extraction, prompts with clear output schema | 1× RTX 4090 (~24 GB VRAM) |
| Llama 3.1 70B | Generating reports, multi-step reasoning, longer contexts | 2× A100 (80 GB each), or less with quantization |
| Mistral / Mixtral 8x7B | Code generation, multilingual, strong European-language quality | 1× A100 or 2× RTX 4090 |
| Qwen 2.5 72B | Demanding reasoning + long contexts | Same as Llama 3.1 70B |
Recommendation for pilot setup: Llama 3.1 8B as the default, with complex tasks routed to the cloud API per feature. Scale by workload.
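The hardware column in the table above follows from simple arithmetic: the weights alone need roughly the parameter count times the bytes per parameter. The sketch below shows that back-of-the-envelope estimate with nominal parameter counts. It deliberately ignores the KV cache, activations, and runtime overhead, so real requirements are higher. `estimateWeightGb` is a hypothetical helper, not part of the adapter.

```typescript
// Rough memory for model weights only: parameters × bits per weight / 8, in GB (10^9 bytes).
function estimateWeightGb(paramsBillions: number, bitsPerWeight: number): number {
  return (paramsBillions * bitsPerWeight) / 8;
}

const llama8bFp16 = estimateWeightGb(8, 16);   // 16 GB, fits a 24 GB RTX 4090
const llama70bFp16 = estimateWeightGb(70, 16); // 140 GB, needs 2× 80 GB A100
const llama70bInt4 = estimateWeightGb(70, 4);  // 35 GB with 4-bit quantization
```

This is why the 70B-class rows mention quantization: 4-bit weights cut the footprint to a quarter of FP16, usually at some cost in output quality.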
What it isn’t
- No in-house LLM training: we don’t host a training pipeline. We use off-the-shelf models, fine-tuned via external providers if needed
- No caching or rate-limit layer: the adapter is thin and doesn’t re-implement provider caching. Build that as middleware in front of the adapter if you need it
- No multi-provider failover: one mode at a time, no automatic Cloud→Local fallback. This can come later; it is not a day-one requirement
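If you do need caching, the middleware idea from the list above can be sketched as a wrapper around whatever function calls the adapter. Everything here is an assumption: the `GenerateFn` signature only loosely mirrors the `ai.generate` call shown earlier, and the in-memory `Map` has no eviction or expiry.

```typescript
// Hypothetical shape of a generate call; the real ai.generate signature may differ.
type GenerateFn = (request: { prompt: string }) => Promise<string>;

// Wraps a generate function so identical prompts reuse the first (possibly pending) result.
function withCache(generate: GenerateFn): GenerateFn {
  const cache = new Map<string, Promise<string>>();
  return (request) => {
    const hit = cache.get(request.prompt);
    if (hit) return hit;
    const pending = generate(request).catch((error) => {
      cache.delete(request.prompt); // do not keep failed calls in the cache
      throw error;
    });
    cache.set(request.prompt, pending);
    return pending;
  };
}

// Stand-in for the adapter call so the sketch runs without a provider.
let providerCalls = 0;
const fakeGenerate: GenerateFn = async ({ prompt }) => {
  providerCalls += 1;
  return `echo: ${prompt}`;
};

const cachedGenerate = withCache(fakeGenerate);
const firstCall = cachedGenerate({ prompt: "classify: late delivery" });
const secondCall = cachedGenerate({ prompt: "classify: late delivery" });
// Both calls share one promise, so the provider is hit only once.
```

Caching the promise rather than the resolved value also deduplicates concurrent identical requests. A production version would need a size limit, a TTL, and a cache key that includes the model and schema.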
Where this lands in the pitch
- EU mid-market: Top argument “data and AI stay with you” — own server, own model. Sensitive records never processed externally
- Indie hackers: Sub-argument “switch providers anytime, no lock-in” — Anthropic expensive? OpenAI outage? Local Llama for most tasks, cloud for the complex ones
- Enterprise: Compliance closer — air-gapped mode for ISO 27001, SOC2, BSI-Grundschutz, higher security tiers