
Kreuzberg MCP Server

Document intelligence MCP server for extracting text, metadata, OCR output, structured data, embeddings, chunks, cache state, and supported-format information from PDFs, Office files, images, code, and many other formats.

by Kreuzberg · added 2026-06-06
Harness: Claude Code, Claude Desktop

Review first: open the source and read the safety notes before installing.

Safety notes

  • Kreuzberg MCP can read local files supplied to extraction tools and can process batches of files when paths are provided.
  • OCR, structured extraction, embeddings, and VLM features may invoke local or provider-hosted models depending on configuration.
  • Cache tools can warm, inspect, or clear model/cache state; review cache directories and mounted volumes in shared environments.
  • Docker deployments should mount only the directories the agent is allowed to read.
  • Treat extracted text, metadata, structured fields, embeddings, and chunks as sensitive outputs when source documents are private.

Privacy notes

  • Documents may contain PII, contracts, invoices, source code, screenshots, medical data, financial data, credentials, or proprietary design information.
  • Extracted metadata can reveal filenames, authors, timestamps, document structure, attachments, image details, and software fingerprints.
  • OCR and structured extraction can expose text that was not previously copyable from scanned PDFs or images.
  • Embedding and VLM/LLM configuration may send document-derived content to external model providers if configured that way.
  • Cache directories, logs, model downloads, and MCP transcripts may retain document-derived context.

Prerequisites

  • Python environment suitable for installing `kreuzberg[all]`, or Docker if using the container path.
  • MCP-capable client such as Claude Desktop, Cursor, or a custom stdio MCP client.
  • Local file paths or mounted volumes limited to documents you are authorized to process.
  • Optional OCR, embeddings, VLM OCR, and LLM provider dependencies configured only when those features are needed.
  • Understanding of document sensitivity before extracting, OCRing, embedding, or chunking private files.

Schema details

Install type: cli
Troubleshooting: No
Estimated setup: 10 minutes
Difficulty: intermediate

Full copyable content:

pip install "kreuzberg[all]"
kreuzberg mcp

About this resource

Content

Kreuzberg MCP Server exposes Kreuzberg's document intelligence engine through the Model Context Protocol. It lets Claude, Cursor, and custom MCP clients extract content from files, generate embeddings, chunk text, manage cache state, detect formats, and inspect supported extraction capabilities without writing custom extraction code.

The upstream MCP guide documents stdio mode, started with `kreuzberg mcp`, as the default local setup. Docker can run the same MCP mode when file access and cache volumes need to be controlled more explicitly.

Source Review

The upstream sources were reviewed on 2026-06-06. For current install commands, server modes, feature flags, and runtime dependencies, prefer the live MCP integration guide, CLI usage docs, Docker guide, repository README, container package, and PyPI metadata over this listing.

Features

  • Run an MCP server over stdio with `kreuzberg mcp`.
  • Extract file content and metadata from PDFs, Office documents, images, HTML, XML, email, archives, academic formats, text files, and code.
  • Detect MIME types and list supported formats.
  • Batch-extract several files.
  • Generate embeddings and chunk text when configured.
  • Extract structured data when the required LLM feature is available.
  • Inspect and clear cache state.
  • Use config files to apply extraction defaults.
  • Run the MCP server through Docker with controlled file and cache mounts.
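
To give a sense of what "chunk text" means in practice, here is a minimal sketch of fixed-size chunking with overlap. It is a generic technique written for illustration, not Kreuzberg's own chunker; the function name `chunk_text` and its default sizes are hypothetical.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters.

    Illustrative only: this is a plain character-window chunker, not the
    chunking strategy Kreuzberg applies.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + size] for start in range(0, last_start, step)]
```

Overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, which tends to help retrieval at the cost of some duplicated text.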

Installation

Install Kreuzberg with its optional feature bundle:

pip install "kreuzberg[all]"

Start the stdio MCP server:

kreuzberg mcp
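
For a custom stdio MCP client, the sketch below shows one way to check which tools the server advertises rather than assuming their names. It relies only on the generic MCP lifecycle (an `initialize` request, a `notifications/initialized` notification, then `tools/list`) sent as newline-delimited JSON-RPC. The protocol revision string, the client name, and the helper names are assumptions made for illustration, and nothing here is taken from Kreuzberg's own documentation.

```python
import json
import subprocess


def encode(message: dict) -> bytes:
    """Serialize one JSON-RPC message as a single newline-terminated line."""
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


def build_handshake(client_name: str = "example-client") -> list[dict]:
    """Return the initialize request, initialized notification, and tools/list request."""
    return [
        {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "initialize",
            "params": {
                # Assumed revision; use the one your server actually supports.
                "protocolVersion": "2024-11-05",
                "capabilities": {},
                "clientInfo": {"name": client_name, "version": "0.0.1"},
            },
        },
        {"jsonrpc": "2.0", "method": "notifications/initialized"},
        {"jsonrpc": "2.0", "id": 2, "method": "tools/list"},
    ]


def _read_reply(stream, request_id: int) -> dict:
    """Read lines until the response with the given id arrives."""
    for line in stream:
        try:
            reply = json.loads(line)
        except ValueError:
            continue  # skip anything that is not a JSON message
        if isinstance(reply, dict) and reply.get("id") == request_id:
            return reply
    raise RuntimeError(f"server closed stdout before answering request {request_id}")


def list_tools(command: list[str]) -> list[str]:
    """Spawn the server, run the handshake, and return the advertised tool names."""
    initialize, initialized, tools_list = build_handshake()
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    try:
        proc.stdin.write(encode(initialize))
        proc.stdin.flush()
        _read_reply(proc.stdout, 1)
        proc.stdin.write(encode(initialized) + encode(tools_list))
        proc.stdin.flush()
        reply = _read_reply(proc.stdout, 2)
        return [tool["name"] for tool in reply.get("result", {}).get("tools", [])]
    finally:
        proc.terminate()
        proc.wait()


# Example (requires kreuzberg on PATH): print(list_tools(["kreuzberg", "mcp"]))
```

Listing tools first is a cheap way to confirm the server starts and to see the exact tool names before sending any document to it.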

Configure an MCP client with the documented command:

{
  "mcpServers": {
    "kreuzberg": {
      "command": "kreuzberg",
      "args": ["mcp"]
    }
  }
}
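
If the client already has other servers configured, the entry above can be merged in rather than pasted over the file. The following is a minimal sketch that assumes the client stores its settings as a JSON file with a top-level `mcpServers` object; the config path is yours to supply, and the function name `add_kreuzberg_server` is hypothetical.

```python
import json
from pathlib import Path


def add_kreuzberg_server(config_path: Path) -> dict:
    """Add the kreuzberg entry to an MCP client config, keeping other servers."""
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    servers = config.setdefault("mcpServers", {})
    # Same command and args as the documented snippet.
    servers["kreuzberg"] = {"command": "kreuzberg", "args": ["mcp"]}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Back up the existing config before running something like this, and restart the client afterwards so it picks up the new server.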

For containerized setups, mount only the document directories and config files the agent should access.
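
As a rough illustration of narrow mounts, the sketch below builds a `docker run` argument list that binds each approved directory read-only. The image name is a hypothetical placeholder, the `/data/<n>` container paths are arbitrary, and passing `mcp` as the container command is an assumption; check the upstream Docker guide for the real image and invocation.

```python
from pathlib import Path

# Hypothetical placeholder; take the real image name from the upstream Docker guide.
IMAGE = "example/kreuzberg:latest"


def docker_mcp_command(doc_dirs: list[Path], image: str = IMAGE) -> list[str]:
    """Build a `docker run` argv that mounts each directory read-only."""
    # -i keeps stdin open, which a stdio MCP server needs; --rm removes the container on exit.
    argv = ["docker", "run", "--rm", "-i"]
    for index, directory in enumerate(doc_dirs):
        host = directory.resolve()
        # The trailing :ro makes the bind mount read-only inside the container.
        argv += ["-v", f"{host}:/data/{index}:ro"]
    # Assumption: the image accepts `mcp` as its command.
    return argv + [image, "mcp"]
```

Building the argument list in code makes it easy to review exactly which host paths are exposed before an MCP client is pointed at the container.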

Use Cases

  • Extract text and metadata from PDFs, Office files, images, and email for analysis in Claude.
  • OCR scanned documents or screenshots before summarization.
  • Chunk long documents for RAG and agent workflows.
  • Generate embeddings for document-derived text when configured.
  • Detect file types before choosing an extraction path.
  • Build a local document-processing MCP workflow without exposing documents to a hosted extraction service.
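
The "detect file types before choosing an extraction path" idea can be approximated on the client side as well. This sketch uses Python's standard `mimetypes` module, which guesses from the file extension only. It is a stand-in for illustration, not Kreuzberg's format detection, and the route labels are hypothetical.

```python
import mimetypes


def choose_path(filename: str) -> str:
    """Pick a coarse, hypothetical processing route from the guessed MIME type."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is None:
        return "unknown"  # no recognizable extension; inspect before processing
    if mime.startswith("image/"):
        return "ocr"
    if mime == "application/pdf":
        return "pdf"
    if mime.startswith("text/"):
        return "text"
    return "other"
```

Because extension-based guessing is easy to fool, treat the result as a hint for routing and leave authoritative detection to the server.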

Safety and Privacy

Kreuzberg can turn private files into model-visible text. That is useful, but it also means access boundaries matter. Mount or expose only approved directories, avoid broad workspace paths, and start with a small file before batch processing.
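
One way to enforce "approved directories only" before a path ever reaches an extraction tool is an allowlist check in the calling code. This is a minimal sketch assuming Python 3.9 or later; the function name `is_allowed` is hypothetical and is not part of Kreuzberg.

```python
from pathlib import Path


def is_allowed(candidate: Path, approved_roots: list[Path]) -> bool:
    """Return True only if candidate resolves to a location inside an approved root."""
    # resolve() collapses ".." segments and follows symlinks, so a path that
    # escapes the approved directory is judged by where it really points.
    resolved = candidate.resolve()
    return any(resolved.is_relative_to(root.resolve()) for root in approved_roots)
```

The comparison works on whole path components, so a sibling directory whose name merely starts with an approved root's name is rejected.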

Review OCR, embedding, VLM, and LLM provider configuration before enabling features that can send document-derived content outside the local machine. Keep cache directories, logs, and extracted outputs out of shared repos unless they are approved for publication.

Duplicate Check

No kreuzberg-dev/kreuzberg entry, Kreuzberg MCP entry, or matching Kreuzberg source URL was found in content/mcp.

#documents #extraction #ocr #embeddings #chunking
