Skip to main content
collectionsSource-backedReview first Safety Privacy

Self-Hosted AI Operator Stack

A source-backed collection for operators running AI services on infrastructure they control: local model runtime, CPU and GPU inference, model gateway, self-hosted MCP access, retrieval storage, model API packaging, container rebuilds, and image security checks.

by MkDev11·added 2026-06-04·
Claude Code
HarnessClaude Code
Bundle:10 items
Review first review before installing

Open the source and read safety notes before installing.

Safety notes

  • Self-hosted AI endpoints can execute expensive inference, tool calls, retrieval, uploads, and container rebuilds; require authentication, rate limits, resource limits, and audit logs before network exposure.
  • Model runtimes and OpenAI-compatible gateways can be swapped into agent stacks quickly, so verify route policy, model capability, tool-call handling, and fallback behavior before production use.
  • Container rebuild and image scan hooks can read Docker state, pull images, start builds, and fail workflows; pin versions, bound permissions, and keep rollback commands tested.
  • Local or self-hosted models do not provide a safety layer by themselves; prompts, outputs, embeddings, tool inputs, and generated code still need abuse, correctness, and data-handling review.

Privacy notes

  • Self-hosting reduces third-party model-provider exposure, but prompts, files, embeddings, retrieved documents, model outputs, logs, traces, and admin actions can still persist on operator infrastructure.
  • Gateways, MCP servers, retrieval databases, model API servers, Docker logs, scanner reports, and backup jobs can duplicate sensitive data across disks, containers, volumes, and observability systems.
  • Pulling models, images, packages, vulnerability data, or provider fallbacks can disclose model names, image names, IP addresses, repository names, and timing metadata to external services.

Prerequisites

  • A host or small cluster with enough CPU, RAM, disk, and optional GPU/VRAM for the selected local or open model workloads.
  • A private network, firewall, TLS/auth plan, and operator-owned secrets store before exposing model, MCP, retrieval, or app endpoints.
  • Model license, weight source, quantization, context-length, embedding, and safety-policy decisions for the workloads you intend to run.
  • Container runtime and registry policy for Docker Compose services, rebuild triggers, image scanning, log retention, backups, and rollback.
  • Clear boundaries for what stays self-hosted and what may still call external model providers, registries, package indexes, or telemetry endpoints.

Schema details

Install type
copy
Troubleshooting
No
Collection metadata
Items
10 entries
Estimated setup
95 minutes
Difficulty
advanced
Installation order
local-first-ai-dev-stackollamallama-cppvllmlitellmmcp-supergateway-hubchromabentomldocker-container-auto-rebuilddocker-image-security-scanner
Full copyable content
Start from a local-first architecture, choose the right model runtime, put a gateway in front of model and MCP access, add retrieval and model API packaging, then automate container rebuilds and image scans.

About this resource

What this collection sets up

This collection is for the operator side of a self-hosted AI stack. It starts with the local-first architecture, then separates responsibilities into model runtime selection, OpenAI-compatible routing, self-hosted MCP access, retrieval storage, model API packaging, container rebuilds, and image security checks.

The goal is not to make every workload fully offline. It is to make data paths, runtime choices, network exposure, credentials, logs, and rebuild behavior visible before agents or users depend on the stack.

Layers

1. Architecture and model runtime

  • local-first-ai-dev-stack frames what stays on owned infrastructure and what may still call an external orchestrator or provider.
  • ollama is the simplest local model runner for developer machines, delegation tasks, and offline fallback.
  • llama-cpp covers lightweight GGUF-based inference for CPU, edge, and memory-constrained hosts.
  • vllm covers higher-throughput GPU serving with OpenAI-compatible APIs, batching, structured outputs, and tool-calling support.

2. Gateway, tools, and retrieval

  • litellm puts a model gateway in front of local and external providers so operators can manage routes, virtual keys, spending, and fallbacks.
  • mcp-supergateway-hub exposes a fleet of stdio MCP servers over HTTP for private-network access from approved clients.
  • chroma stores documents, embeddings, metadata, and retrieval indexes for local or self-hosted RAG and memory workflows.
  • bentoml packages model inference code into model APIs and deployable service artifacts when the stack needs a production API surface.

3. Container operations

  • docker-container-auto-rebuild helps rebuild affected containers after source or configuration changes.
  • docker-image-security-scanner checks container images before they become part of the self-hosted AI environment.

Suggested order

Start by writing the local-first boundary and choosing the runtime for each workload: Ollama for simple local use, llama.cpp for compact GGUF serving, and vLLM for GPU-backed throughput. Add LiteLLM only after route and credential rules are clear. Bring up MCP Supergateway Hub and Chroma on a private network, then package production inference services with BentoML. Finish by adding container rebuild and image scanning hooks so stack changes are repeatable and reviewed.

Operator checklist

  • {"task": "Boundary is written", "description": "Operators know which prompts, tools, models, embeddings, logs, and fallbacks are allowed to leave owned infrastructure"}
  • {"task": "Runtime fits hardware", "description": "CPU, RAM, GPU, VRAM, disk, context length, and concurrency match the selected model runtimes"}
  • {"task": "Endpoints are private", "description": "Model, MCP, retrieval, and API services have authentication, TLS or private-network controls, and rate limits"}
  • {"task": "Credentials are scoped", "description": "Model gateway keys, registry tokens, provider fallbacks, and MCP secrets are rotated and least-privilege"}
  • {"task": "Containers are reviewable", "description": "Rebuilds, image scans, logs, and rollbacks are part of the normal operator workflow"}
  • {"task": "Retention is explicit", "description": "Prompts, outputs, embeddings, traces, scanner reports, and backups have an owner and deletion policy"}

Source and references

Duplicate check

Checked existing collections, guides, tools, skills, hooks, open PRs, closed PRs, and issue history for self-hosted-ai-operator-stack, self-hosted AI operator, local-first AI stack, Ollama, llama.cpp, vLLM, LiteLLM, MCP Supergateway Hub, Chroma, BentoML, Docker rebuild hooks, and image security scanning. local-first-ai-dev-stack is a guide for one local-first developer architecture. agent-operator-growth-master-pack is a broad product-operator bundle covering review, release, growth, and automation. This collection is narrower and operational: it bundles the runtime, gateway, MCP, retrieval, model API, rebuild, and scan entries needed to run self-hosted AI services.

Disclosure

Editorial collection. No paid placement or affiliate link is used.

#self-hosted#local-models#inference#mcp#model-gateway#vector-database#docker

Source citations

Signals

Loading live community signals…

More like this, weekly

A short, calm digest of reviewed Claude resources. Unsubscribe any time.