Kreuzberg MCP Server

Document intelligence MCP server for extracting text, metadata, OCR output, structured data, embeddings, chunks, cache state, and supported-format information from PDFs, Office files, images, code, and many other formats.

by Kreuzberg · submitted by oktofeesh1 · added 2026-06-06

Platforms: Claude Code, Codex, Cursor, Claude Desktop

Harness: Claude Code, Codex, Cursor, Claude Desktop

Review first

Review safety and privacy notes before installing or copying commands.

Install & copy

pip install "kreuzberg[all]" && kreuzberg mcp

Trust & readiness

Trust: Review first
Source: source-backed
Safety notes: Present
Reviewed: No

Review first — review before installing

Open the source and read safety notes before installing.

Citation facts

Source-backed facts for citing this resource, derived directly from the registry — also available as plain text for AI assistants.

Canonical URL: https://heyclau.de/entry/mcp/kreuzberg-mcp-server
Source URLs: https://docs.kreuzberg.dev/guides/mcp-integration/, https://github.com/kreuzberg-dev/kreuzberg
Brand: Kreuzberg
Brand domain: kreuzberg.dev
Brand asset source: brandfetch
Safety notes: Kreuzberg MCP can read local files supplied to extraction tools and can process batches of files when paths are provided. OCR, structured extraction, embeddings, and VLM features may invoke local or provider-hosted models depending on configuration. Cache tools can warm, inspect, or clear model/cache state; review cache directories and mounted volumes in shared environments. Docker deployments should mount only the directories the agent is allowed to read. Treat extracted text, metadata, structured fields, embeddings, and chunks as sensitive outputs when source documents are private.
Privacy notes: Documents may contain PII, contracts, invoices, source code, screenshots, medical data, financial data, credentials, or proprietary design information. Extracted metadata can reveal filenames, authors, timestamps, document structure, attachments, image details, and software fingerprints. OCR and structured extraction can expose text that was not previously copyable from scanned PDFs or images. Embedding and VLM/LLM configuration may send document-derived content to external model providers if configured that way. Cache directories, logs, model downloads, and MCP transcripts may retain document-derived context.
Author: Kreuzberg
Submitted by: oktofeesh1
Claim status: unclaimed
Last verified: 2026-06-06

Decision playbook

Review trust signals before you adopt

Signals are present but mixed. Use the checklist below to confirm the source and operational safety for your environment.

Source and provenance checks

Status: Needs review

Confirm ownership and provenance before trusting install instructions.

Source link available (Required; Done): Open the canonical repository and verify ownership.
Source provenance status (Required; Done): Marked as source-backed.
Metadata reviewed (Pending): No reviewed flag detected in metadata.

Safety and privacy checks

Status: Complete

Validate risk disclosures before installation or API wiring.

Safety notes present (Required; Done): Review the listed safety guidance before running commands.
Privacy notes present (Required; Done): Review data handling notes before connecting accounts or secrets.
Trust level risk gate (Required; Done): Trust level does not block evaluation.

Package and install checks

Status: Needs review

Check package metadata and artifact integrity signals.

Install payload available (Done): Install or copy payload is available for review.
Package verification flag (Pending): No package verification flag provided.
Checksum metadata (Pending): No checksum provided for downloaded artifact.
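
Because the listing carries no checksum, one way to close that gap yourself is to download the release file, compute its SHA-256 digest, and compare it with the digest shown on the package index page for that file. The sketch below assumes you have already downloaded an artifact (for example with `pip download`) and copied the expected digest by hand; the function names are illustrative and not part of Kreuzberg.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size blocks so large artifacts are not loaded into memory at once.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a digest copied from a trusted source."""
    return sha256_of(path) == expected_hex.strip().lower()
```

If the digests differ, treat the artifact as untrusted and do not install it. pip's hash-checking mode (`--require-hashes` with a pinned requirements file) enforces the same comparison at install time.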

Compare-driven decision checks

Status: Needs review

Use compare context to validate trade-offs before adoption.

Compare tray has multiple entries (Pending): Add at least one more entry to compare trust differences.
Baseline comparison available (Pending): No baseline peer selected yet.
Diverging trust signals identified (Pending): No major trust-signal divergence found.

Setup at a glance

CLI install, copy-ready: paste the snippet to get started.

Estimated setup: 10 minutes
Install command: Provided
Config snippet: Provided
Copy snippet: Provided
Prerequisites: 5 to clear
Platforms: 4 listed
Install type: CLI install

Adoption plan

Balanced adoption plan

Current risk score 24/100. Use staged verification before broader rollout.

Pre-adoption checks

Validate source and review signals before any execution.

Confirm source provenance (Required; Done): Source URL/provenance metadata is present.
Confirm metadata review state (Pending): No review metadata found; increase manual validation.
Verify install payload (Done): Install/config payload exists and can be inspected.

Security checks

Confirm safety, privacy, and package integrity signals.

Review safety notes (Required; Done): Safety notes are present.
Review privacy notes (Required; Done): Privacy notes are present.
Verify package integrity metadata (Pending): No package verification/checksum metadata.

Rollout

Adopt in controlled steps based on the selected plan.

Run in isolated sandbox first (Required; Pending): Use a constrained sandbox and observe behavior across multiple tasks.
Roll out gradually (Required; Pending): Roll out to a small cohort before wider usage.
Set monitoring and fallback (Pending): Define rollback path and monitor errors after adoption.

Evidence readiness

Evidence readiness matrix · balanced

Missing required evidence: Metadata review. Risk score 31.

Source provenance (Present; required in this preset): Source repository/provenance is listed.
Metadata review (Missing; required in this preset): Review metadata is missing.
Safety notes (Present; required in this preset): Safety notes are present.
Privacy notes (Present; optional in this preset): Privacy notes are present.
Package integrity (Missing; optional in this preset): Package integrity metadata is missing.
Install payload (Present; required in this preset): Install payload is available.

Required gaps: Metadata review

Decision timeline

Decision timeline · balanced

Blocking gaps: Check metadata review status. Risk 28.

Triage · Confirm source provenance (Required; Done): Source/provenance metadata is available.
Triage · Check metadata review status (Required; Pending): Review metadata is missing.
Verify · Review safety notes (Required; Done): Safety notes are available.
Verify · Review privacy notes (Done): Privacy notes are available.
Verify · Validate package integrity metadata (Pending): Package integrity metadata is missing.
Rollout · Verify install payload and commands (Required; Done): Install payload is available.

Blockers: Check metadata review status

Prerequisite readiness

5 prerequisites to line up before setup.

0/5 ready

Install & runtime: 2 · Configuration: 1 · Permissions & scopes: 1 · General: 1 · Estimated setup: 10 minutes

Safety & privacy surface

5 safety and 5 privacy notes across 6 risk areas. Review closely: credentials & tokens, network access, third-party handling.

Safety · Local files: Kreuzberg MCP can read local files supplied to extraction tools and can process batches of files when paths are provided.
Safety · Third-party handling: OCR, structured extraction, embeddings, and VLM features may invoke local or provider-hosted models depending on configuration.
Safety · Data retention: Cache tools can warm, inspect, or clear model/cache state; review cache directories and mounted volumes in shared environments.
Safety · General: Docker deployments should mount only the directories the agent is allowed to read.
Safety · General: Treat extracted text, metadata, structured fields, embeddings, and chunks as sensitive outputs when source documents are private.
Privacy · Credentials & tokens: Documents may contain PII, contracts, invoices, source code, screenshots, medical data, financial data, credentials, or proprietary design information.
Privacy · Local files: Extracted metadata can reveal filenames, authors, timestamps, document structure, attachments, image details, and software fingerprints.
Privacy · General: OCR and structured extraction can expose text that was not previously copyable from scanned PDFs or images.
Privacy · Third-party handling: Embedding and VLM/LLM configuration may send document-derived content to external model providers if configured that way.
Privacy · Network access: Cache directories, logs, model downloads, and MCP transcripts may retain document-derived context.

Safety notes

Kreuzberg MCP can read local files supplied to extraction tools and can process batches of files when paths are provided.
OCR, structured extraction, embeddings, and VLM features may invoke local or provider-hosted models depending on configuration.
Cache tools can warm, inspect, or clear model/cache state; review cache directories and mounted volumes in shared environments.
Docker deployments should mount only the directories the agent is allowed to read.
Treat extracted text, metadata, structured fields, embeddings, and chunks as sensitive outputs when source documents are private.

Privacy notes

Documents may contain PII, contracts, invoices, source code, screenshots, medical data, financial data, credentials, or proprietary design information.
Extracted metadata can reveal filenames, authors, timestamps, document structure, attachments, image details, and software fingerprints.
OCR and structured extraction can expose text that was not previously copyable from scanned PDFs or images.
Embedding and VLM/LLM configuration may send document-derived content to external model providers if configured that way.
Cache directories, logs, model downloads, and MCP transcripts may retain document-derived context.

Prerequisites

Python environment suitable for installing `kreuzberg[all]`, or Docker if using the container path.
MCP-capable client such as Claude Desktop, Cursor, or a custom stdio MCP client.
Local file paths or mounted volumes limited to documents you are authorized to process.
Optional OCR, embeddings, VLM OCR, and LLM provider dependencies configured only when those features are needed.
Understanding of document sensitivity before extracting, OCRing, embedding, or chunking private files.

Schema details

Install type: cli
Troubleshooting: No

Source repository stats

Scope: Source repo

Collection metadata

Estimated setup: 10 minutes
Difficulty: intermediate

Full copyable content

pip install "kreuzberg[all]"
kreuzberg mcp

About this resource

Content

Kreuzberg MCP Server exposes Kreuzberg's document intelligence engine through the Model Context Protocol. It lets Claude, Cursor, and custom MCP clients extract content from files, generate embeddings, chunk text, manage cache state, detect formats, and inspect supported extraction capabilities without writing custom extraction code.

The upstream MCP guide documents stdio mode, started with `kreuzberg mcp`, as the default local setup. Docker can also run the same MCP mode when file access and cache volumes need to be controlled more explicitly.
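
Since this listing does not enumerate tool names, a client can discover them at runtime instead of assuming them. The sketch below builds the JSON-RPC 2.0 messages an MCP client would write to the server's stdin to initialize a session and request the tool list. The protocol version string, client name, and function name are assumptions for illustration; in practice an MCP SDK or client application handles this exchange.

```python
import json

# Assumption: a protocol revision string; use the one your client and server agree on.
PROTOCOL_VERSION = "2025-06-18"


def discovery_messages(client_name: str = "example-client") -> list[str]:
    """Return newline-delimited JSON-RPC messages that open a session and list tools."""
    messages = [
        {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "initialize",
            "params": {
                "protocolVersion": PROTOCOL_VERSION,
                "capabilities": {},
                "clientInfo": {"name": client_name, "version": "0.0.0"},
            },
        },
        # Notifications carry no id and expect no response.
        {"jsonrpc": "2.0", "method": "notifications/initialized"},
        {"jsonrpc": "2.0", "id": 2, "method": "tools/list"},
    ]
    # The stdio transport frames each message as a single line of JSON.
    return [json.dumps(message) + "\n" for message in messages]
```

A real client waits for the `initialize` response before sending the remaining messages, then reads the `tools/list` result to learn the tool names and input schemas the server actually exposes.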

Source Review

These sources were reviewed on 2026-06-06. Prefer the live MCP integration guide, CLI usage docs, Docker guide, repository README, container package, and PyPI metadata for current install commands, server modes, feature flags, and runtime dependencies.

Features

Run an MCP server over stdio with `kreuzberg mcp`.
Extract file content and metadata from PDFs, Office documents, images, HTML, XML, email, archives, academic formats, text files, and code.
Detect MIME types and list supported formats.
Batch-extract several files.
Generate embeddings and chunk text when configured.
Extract structured data when the required LLM feature is available.
Inspect and clear cache state.
Use config files to apply extraction defaults.
Run the MCP server through Docker with controlled file and cache mounts.

Installation

Install Kreuzberg with its optional feature bundle:

pip install "kreuzberg[all]"

Start the stdio MCP server:

kreuzberg mcp

Configure an MCP client with the documented command:

{
  "mcpServers": {
    "kreuzberg": {
      "command": "kreuzberg",
      "args": ["mcp"]
    }
  }
}
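
If the client already has a config file with other servers, the entry above can be merged in rather than pasted over the whole file. This is a minimal sketch, assuming a JSON config with a top-level `mcpServers` object; the function name is hypothetical and the config path depends on your client.

```python
import json
from pathlib import Path


def add_kreuzberg_server(config_path: Path) -> dict:
    """Add the documented kreuzberg entry to an MCP client config, keeping other servers."""
    # Load the existing config if present; otherwise start from an empty one.
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}

    servers = config.setdefault("mcpServers", {})
    # Command and args are the ones shown in the snippet above.
    servers["kreuzberg"] = {"command": "kreuzberg", "args": ["mcp"]}

    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Back up the config file before running anything like this, and restart the client afterwards so it picks up the new server.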

For containerized setups, mount only the document directories and config files the agent should access.
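
To make the mount guidance concrete, the sketch below builds a Docker-based MCP server entry in which every approved directory is mounted read-only. The image name is a placeholder, the `/data/<name>` layout is an assumed convention, and passing `mcp` as the container command is an assumption that mirrors `kreuzberg mcp`; check the upstream Docker guide for the real image and arguments.

```python
from pathlib import Path

# Hypothetical placeholder: replace with the image named in the upstream Docker guide.
IMAGE = "<kreuzberg-image>"


def docker_mcp_entry(document_dirs: list[str], image: str = IMAGE) -> dict:
    """Build an MCP server entry that runs the container with read-only mounts."""
    # -i keeps stdin open for the stdio transport; --rm removes the container on exit.
    args = ["run", "--rm", "-i"]
    for directory in document_dirs:
        host = Path(directory).expanduser().resolve()
        # Mount each approved directory read-only under /data/<name> (assumed layout).
        args += ["-v", f"{host}:/data/{host.name}:ro"]
    # Assumption: the image accepts `mcp` as its command.
    args += [image, "mcp"]
    return {"command": "docker", "args": args}
```

The returned dictionary can be placed under `mcpServers` in the client config. Listing directories explicitly, instead of mounting a home or workspace root, keeps the agent's readable surface small.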

Use Cases

Extract text and metadata from PDFs, Office files, images, and email for analysis in Claude.
OCR scanned documents or screenshots before summarization.
Chunk long documents for RAG and agent workflows.
Generate embeddings for document-derived text when configured.
Detect file types before choosing an extraction path.
Build a local document-processing MCP workflow without exposing documents to a hosted extraction service.
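
For readers new to the chunking use case, the sketch below shows the general idea with fixed-size character windows that overlap. It is a generic illustration and not Kreuzberg's chunker; the function name and default sizes are assumptions.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once the current window reaches the end of the text.
        if start + size >= len(text):
            break
        start += step
    return chunks
```

The overlap keeps sentences that straddle a boundary visible in both neighbouring chunks, which tends to help retrieval at the cost of some duplicated text.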

Safety and Privacy

Kreuzberg can turn private files into model-visible text. That is useful, but it also means access boundaries matter. Mount or expose only approved directories, avoid broad workspace paths, and start with a small file before batch processing.

Review OCR, embedding, VLM, and LLM provider configuration before enabling features that can send document-derived content outside the local machine. Keep cache directories, logs, and extracted outputs out of shared repos unless they are approved for publication.
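
One way to enforce the "approved directories only" advice in a wrapper or pre-flight script is to resolve each requested path and check that it falls inside an allowlisted root. This is a minimal sketch with a hypothetical helper name; it is not a Kreuzberg feature.

```python
from pathlib import Path


def is_approved(path: str, approved_roots: list[str]) -> bool:
    """Return True only if path resolves to a location inside an approved root."""
    # resolve() follows symlinks and collapses "..", so path escapes are caught.
    target = Path(path).expanduser().resolve()
    for root in approved_roots:
        root_path = Path(root).expanduser().resolve()
        if target == root_path or root_path in target.parents:
            return True
    return False
```

Comparing resolved paths, instead of raw string prefixes, avoids treating a sibling such as `approved-other` as if it were inside `approved`.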

Duplicate Check

No kreuzberg-dev/kreuzberg entry, Kreuzberg MCP entry, or matching Kreuzberg source URL was found in content/mcp.

#documents #extraction #ocr #embeddings #chunking

Add this badge to your README

Show that Kreuzberg MCP Server is listed on HeyClaude. Paste this Markdown into your README — it renders the badge and links back to this page.

[![Listed on HeyClaude](https://heyclau.de/badge/mcp/kreuzberg-mcp-server.svg)](https://heyclau.de/entry/mcp/kreuzberg-mcp-server)

How it compares

Kreuzberg MCP Server side by side with 3 alternatives on trust, install, platform support, and disclosed safety notes — all from reviewed registry metadata.

Field	Kreuzberg MCP Server	Markdownify MCP Server	AFFiNE MCP Server	MarkItDown MCP Server
Description	Document intelligence MCP server for extracting text, metadata, OCR output, structured data, embeddings, chunks, cache state, and supported-format information from PDFs, Office files, images, code, and many other formats.	MCP server that converts PDFs, Office files, images, audio, webpages, YouTube transcripts, and existing Markdown files into Markdown.	MCP server for connecting Claude to AFFiNE Cloud or self-hosted AFFiNE workspaces, documents, databases, comments, collections, folders, tags, notifications, blobs, access tokens, semantic page composition, templates, edgeless canvas data, and workspace organization workflows over stdio or HTTP.	Microsoft-maintained MCP server for converting HTTP, HTTPS, file, and data URIs into Markdown through the MarkItDown document-conversion library.
Trust
Review status	Not reviewed	Not reviewed	Not reviewed	Not reviewed
Package trust	Package not verified	Package not verified	Package not verified	Package not verified
Source provenance	Source-backed	Source-backed	Source-backed	Source-backed
Submitter	oktofeesh1	oktofeesh1	oktofeesh1	oktofeesh1
Install risk	Review first	Review first	Review first	Review first
Notes	Safety ✓ Privacy ✓	Safety ✓ Privacy ✓	Safety ✓ Privacy ✓	Safety ✓ Privacy ✓
Brand	Kreuzberg	—	—	—
Category	mcp	mcp	mcp	mcp
Source	Source-backed	Source-backed	Source-backed	Source-backed
Author	Kreuzberg	zcaceres	DAWNCR0W	Microsoft
Added	2026-06-06	2026-06-05	2026-06-06	2026-06-06
Platforms	Claude Code Codex Cursor Claude Desktop	Claude Code Claude Desktop	Claude Code Claude Desktop	Claude Code Claude Desktop
Harness	Claude Code Codex Cursor Claude Desktop	Claude Code Claude Desktop	Claude Code Claude Desktop	Claude Code Claude Desktop
Source repo	—	—	—	—
Safety notes	✓Kreuzberg MCP can read local files supplied to extraction tools and can process batches of files when paths are provided. OCR, structured extraction, embeddings, and VLM features may invoke local or provider-hosted models depending on configuration. Cache tools can warm, inspect, or clear model/cache state; review cache directories and mounted volumes in shared environments. Docker deployments should mount only the directories the agent is allowed to read. Treat extracted text, metadata, structured fields, embeddings, and chunks as sensitive outputs when source documents are private.	✓Markdownify MCP Server can read local files and convert documents, spreadsheets, presentations, images, audio files, and Markdown files. File conversion tools should be restricted with MD_ALLOWED_PATHS or container mounts before using the server with sensitive directories. Audio transcription, image OCR, document conversion, and webpage extraction can expose hidden text, metadata, comments, and embedded content. Webpage, YouTube, and Bing tools retrieve external content that can contain prompt-injection text or unsafe instructions. Review converted Markdown before using it as source material for decisions, publications, or further tool calls.	✓The full tool surface can create, update, publish, revoke, move, replace, and delete AFFiNE workspaces, documents, folders, collections, comments, database rows, blobs, surface elements, tags, and access tokens. Start with AFFINE_TOOL_PROFILE=read_only or disable groups such as destructive, admin, blobs, users, access_tokens, docs.database, or write when the assistant only needs discovery or reading. HTTP mode exposes MCP endpoints and must be protected with bearer or OAuth authentication, HTTPS, allowed origins, and deployment-level access controls. AFFiNE Cloud requires API tokens for MCP usage; avoid trying to automate email/password login against Cloud deployments. Cookie and password-based auth can grant broad account access and should be avoided for unattended or shared deployments.	✓MarkItDown MCP exposes one `convert_to_markdown(uri)` tool that accepts `http:`, `https:`, `file:`, and `data:` URIs. The server runs with the privileges of the user that starts it and can read any file that user can access when given a matching file URI. Streamable HTTP and SSE modes are local-use alternatives to stdio; the README warns not to bind them to non-localhost interfaces unless the security implications are understood. The HTTP and SSE server modes have no authentication, so any local process or user that can reach the bound interface can use the conversion tool. Use a container, virtual machine, or dedicated low-privilege user when converting untrusted documents or when file access must be tightly bounded.
Privacy notes	✓Documents may contain PII, contracts, invoices, source code, screenshots, medical data, financial data, credentials, or proprietary design information. Extracted metadata can reveal filenames, authors, timestamps, document structure, attachments, image details, and software fingerprints. OCR and structured extraction can expose text that was not previously copyable from scanned PDFs or images. Embedding and VLM/LLM configuration may send document-derived content to external model providers if configured that way. Cache directories, logs, model downloads, and MCP transcripts may retain document-derived context.	✓File paths, file contents, extracted text, OCR output, audio transcripts, webpage text, search results, prompts, tool arguments, and conversion errors may be visible to the MCP client and model provider. Documents and media can contain personal data, customer material, credentials, internal plans, meeting recordings, legal drafts, and proprietary screenshots. Avoid pointing the server at broad home, workspace, downloads, or cloud-sync directories unless every file in scope is approved for model access.	✓Workspaces, document titles, document bodies, comments, tags, database schemas, database rows, edgeless canvas data, notifications, user profiles, access-token metadata, blobs, and exported markdown can be sent to the MCP client and model. API tokens, cookies, passwords, bearer tokens, OAuth config, and saved config files should be treated as secrets and kept out of prompts, logs, screenshots, issues, and committed MCP config. Generated or modified documents can persist in AFFiNE and may become visible to collaborators depending on workspace permissions and sharing state. Blob upload, cleanup, export, and publish/revoke workflows can expose files, generated content, or collaboration state beyond the current prompt.	✓File paths, file contents, extracted Markdown, conversion errors, remote URLs, data URIs, and document metadata may become visible to the MCP client and model provider. Converting Office files, PDFs, media, web pages, or archives can reveal hidden text, comments, tracked changes, metadata, notes, image text, links, and embedded objects. Docker mounts make every mounted file readable to the MCP server path inside the container; mount only approved directories and prefer read-only mounts when possible. Remote URI conversion can reveal research targets, customer URLs, internal endpoints, or private document locations to upstream servers and to the model transcript.
Prerequisites	Python environment suitable for installing `kreuzberg[all]`, or Docker if using the container path. MCP-capable client such as Claude Desktop, Cursor, or a custom stdio MCP client. Local file paths or mounted volumes limited to documents you are authorized to process. Optional OCR, embeddings, VLM OCR, and LLM provider dependencies configured only when those features are needed.	Node.js and npx available to the MCP client runtime. Python and markitdown dependencies available for file conversion features. Optional Docker setup if you want mounted read-only directories and explicit allowed paths. MD_ALLOWED_PATHS configured when the server should be restricted to specific file directories.	Node.js and npm for installing the affine-mcp-server package. AFFiNE Cloud API token or approved self-hosted AFFiNE credentials. AFFiNE base URL for the Cloud or self-hosted deployment. Clear workspace scope and least-privilege tool profile before enabling write, admin, destructive, blob, or access-token tools.	Python 3.10 or newer and the `markitdown-mcp` package installed, or a locally built Docker image from the package Dockerfile. Local files or remote resources that the MCP server's user is allowed to read and convert. Sandboxed user, virtual machine, or container setup when exposing document conversion to an agent. Explicit file and directory boundaries before mounting local data into a container.
Install	`pip install "kreuzberg[all]" && kreuzberg mcp`	`npx mcp-markdownify-server`	`npm i -g affine-mcp-server`	`claude mcp add markitdown markitdown-mcp`
Config	`{ "mcpServers": { "kreuzberg": { "command": "kreuzberg", "args": ["mcp"] } } }`	`{ "mcpServers": { "markdownify": { "command": "npx", "args": ["mcp-markdownify-server"] } } }`	`{ "mcpServers": { "affine": { "command": "affine-mcp", "env": { "AFFINE_BASE_URL": "<affine-base-url>", "AFFINE_API_TOKEN": "<affine-api-token>", "AFFINE_TOOL_PROFILE": "read_only" } } } }`	`{ "mcpServers": { "markitdown": { "command": "docker", "args": ["run", "--rm", "-i", "markitdown-mcp:latest"] } } }`
Citations	Source repository: github.com, 2026-07-21T07:17:01+00:00. Documentation: docs.kreuzberg.dev. Submitted by oktofeesh1, 2026-06-06.	Source repository: github.com, 2026-07-21T07:17:01+00:00. Documentation: github.com. Submitted by oktofeesh1, 2026-06-05.	Source repository: github.com, 2026-07-21T07:17:01+00:00. Documentation: github.com. Submitted by oktofeesh1, 2026-06-06.	Source repository: github.com, 2026-07-21T07:17:01+00:00. Documentation: github.com. Submitted by oktofeesh1, 2026-06-06.
Claim	Unclaimed	Unclaimed	Unclaimed	Unclaimed

Related resources

Markdownify MCP Server

AFFiNE MCP Server

MarkItDown MCP Server

Office Word MCP Server
