Extract text from PDF files using PyMuPDF. Parse tables, forms, and complex layouts. Supports OCR for scanned documents.
Source: ClawHub.
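Method 1: Command-line install (recommended)
Recommended (no need to install clawhub beforehand):
npx clawhub@latest --dir ~/.claude/skills install extract-pdf-text
Or use the clawhub CLI (must be installed beforehand):
clawhub --dir ~/.claude/skills install extract-pdf-text
⚠️ Requires Node.js 18+. If Node is not installed, use Method 2 below to download the ZIP directly.
Method 2: Manual download (no Node required)
Download the ZIP, unzip it, and place the folder at the path below. The skill takes effect after you restart the Agent.
Install path:
~/.claude/skills/extract-pdf-text/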
---
name: Extract PDF Text
slug: extract-pdf-text
version: 1.0.2
homepage: https://clawic.com/skills/extract-pdf-text
description: Extract text from PDF files using PyMuPDF. Parse tables, forms, and complex layouts. Supports OCR for scanned documents.
changelog: Remove internal build file that was accidentally included
metadata: {"clawdbot":{"emoji":"📄","requires":{"bins":["python3"],"pip":["pymupdf"]},"os":["linux","darwin","win32"],"install":[{"id":"pymupdf","kind":"pip","package":"PyMuPDF","label":"Install PyMuPDF"}]}}
---
Agent needs to extract text from PDFs. Use PyMuPDF (fitz) for fast local extraction. Works with text-based documents, scanned pages with OCR, forms, and complex layouts.
| Topic | File |
|-------|------|
| Code examples | examples.md |
| OCR setup | ocr.md |
| Troubleshooting | troubleshooting.md |
pip install PyMuPDF
Import as fitz (historical name):
import fitz # PyMuPDF
import fitz

doc = fitz.open("document.pdf")
text = ""
for page in doc:
    text += page.get_text()  # plain text of one page
doc.close()
| PDF Type | Method |
|----------|--------|
| Text-based | page.get_text(): fast, accurate |
| Scanned | OCR with pytesseract: slower |
| Mixed | Check each page, use OCR when needed |
def needs_ocr(page):
    text = page.get_text().strip()
    return len(text) < 50  # Likely scanned if very little text
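The per-page check for mixed PDFs can be sketched as a small router. This is a minimal sketch, not the skill's own implementation: `extract_page_text` and the `ocr` callback are hypothetical names, and the 50-character threshold is borrowed from `needs_ocr`. The page only needs a `get_text()` method, so the OCR backend (for example one built on pytesseract) stays pluggable.

```python
def extract_page_text(page, ocr=None, min_chars=50):
    """Return (text, method) for one page.

    `page` is any object with a get_text() method, such as a PyMuPDF page.
    `ocr` is a hypothetical callback that takes the page and returns a string.
    """
    text = page.get_text()
    if len(text.strip()) >= min_chars:
        return text, "text"  # enough embedded text, no OCR needed
    if ocr is None:
        return text, "needs_ocr"  # flag the page and leave OCR to the caller
    return ocr(page), "ocr"
```

Keeping OCR behind a callback means text-based pages never pay the OCR cost, and the caller decides which OCR engine to plug in.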
try:
    doc = fitz.open(path)
except fitz.FileDataError:
    print("Invalid or corrupted PDF")
else:
    # Encrypted PDFs open without raising; check needs_pass, then authenticate
    if doc.needs_pass and not doc.authenticate("secret"):
        print("Wrong password")
| Trap | What Happens | Fix |
|------|--------------|-----|
| OCR on text PDF | Slow + worse accuracy | Check get_text() first |
| Forget to close doc | Memory leak | Use with or doc.close() |
| Assume page order | Wrong reading flow | Use sort=True in get_text() |
| Ignore encoding | Garbled characters | PyMuPDF handles UTF-8 |
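Two of the fixes in the table, closing the document and requesting sorted output, can be combined in one helper. This is a minimal sketch, assuming a PyMuPDF version whose `get_text()` accepts `sort=True` as the table states; `read_pages_sorted` is a hypothetical name. PyMuPDF is imported inside the function so the helper can be defined even where the package is not installed.

```python
def read_pages_sorted(path):
    """Return the text of every page, closing the document on exit."""
    import fitz  # PyMuPDF, imported lazily

    # The with block closes the document even if get_text() raises.
    with fitz.open(path) as doc:
        # sort=True sorts the output by vertical, then horizontal position
        # instead of the order in which text is stored in the file.
        return [page.get_text(sort=True) for page in doc]
```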
This skill provides instructions for using PyMuPDF to extract PDF text.
This skill ONLY:
This skill NEVER:
All processing is local:
text = page.get_text()
blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] == 0:  # text block
        for line in b["lines"]:
            for span in line["spans"]:
                print(span["text"], span["size"])
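The nested loop over blocks, lines, and spans can be wrapped in a generator so spans are easier to filter. This is a sketch that assumes only the dictionary shape used in the snippet (`blocks`, `type`, `lines`, `spans`, `text`, `size`); `iter_spans` and `largest_spans` are hypothetical helper names, and treating the largest font size as a heading is a heuristic, not something PyMuPDF reports.

```python
def iter_spans(page_dict):
    """Yield (text, size) for every span in the text blocks."""
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # skip non-text blocks such as images
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                yield span["text"], span["size"]


def largest_spans(page_dict):
    """Return the texts set in the largest font size on the page."""
    spans = list(iter_spans(page_dict))
    if not spans:
        return []
    top = max(size for _, size in spans)
    return [text for text, size in spans if size == top]


# Hand-built sample in the same shape as page.get_text("dict")
sample = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [{"text": "Title", "size": 18.0}]}]},
        {"type": 0, "lines": [{"spans": [{"text": "Body text", "size": 11.0}]}]},
        {"type": 1},  # image block, ignored by iter_spans
    ]
}
print(largest_spans(sample))  # ['Title']
```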
import json
data = page.get_text("json")
parsed = json.loads(data)
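Because the "json" output is a string, it can be stored or passed along and parsed later. This is a small sketch, assuming the parsed JSON has the same `blocks` and `type` layout as the "dict" output; `count_block_types` is a hypothetical helper, and the sample string is hand-written rather than real PyMuPDF output.

```python
import json
from collections import Counter


def count_block_types(json_text):
    """Count blocks by type in a page.get_text("json") string."""
    parsed = json.loads(json_text)
    labels = {0: "text", 1: "image"}
    return Counter(labels.get(b["type"], "other") for b in parsed["blocks"])


sample_json = '{"blocks": [{"type": 0, "lines": []}, {"type": 1}, {"type": 0, "lines": []}]}'
print(count_block_types(sample_json))  # Counter({'text': 2, 'image': 1})
```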
import fitz

def extract_pdf(path):
    """Extract text from a PDF and flag pages that likely need OCR."""
    doc = fitz.open(path)
    results = []

    for i, page in enumerate(doc):
        text = page.get_text()
        method = "text"

        # If very little text, might be scanned
        if len(text.strip()) < 50:
            # OCR would go here (see ocr.md)
            method = "needs_ocr"

        results.append({
            "page": i + 1,
            "text": text,
            "method": method
        })

    doc.close()

    return {
        "pages": len(results),
        "content": results,
        "word_count": sum(len(r["text"].split()) for r in results)
    }

# Usage
result = extract_pdf("document.pdf")
print(f"Extracted {result['word_count']} words from {result['pages']} pages")
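The dictionary returned by `extract_pdf` can drive a second pass over the flagged pages. This is a minimal sketch, assuming the result shape built by that function (`content` entries with `page`, `text`, and `method`); `pages_needing_ocr` is a hypothetical helper and the sample result is made up for illustration.

```python
def pages_needing_ocr(result):
    """Return the 1-based page numbers that extract_pdf flagged for OCR."""
    return [r["page"] for r in result["content"] if r["method"] == "needs_ocr"]


# Made-up result in the shape returned by extract_pdf
sample_result = {
    "pages": 3,
    "content": [
        {"page": 1, "text": "A long text page ...", "method": "text"},
        {"page": 2, "text": "", "method": "needs_ocr"},
        {"page": 3, "text": "  ", "method": "needs_ocr"},
    ],
    "word_count": 5,
}
print(pages_needing_ocr(sample_result))  # [2, 3]
```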
clawhub star extract-pdf-text
clawhub sync
After installing Extract PDF Text, you can say the following to your AI to trigger it:
Help me get started with Extract PDF Text
Explains what Extract PDF Text does, walks through the setup, and runs a quick demo based on your current project
Use Extract PDF Text to extract text from PDF files using PyMuPDF
Invokes Extract PDF Text with the right parameters and returns the result directly in the conversation
What can I do with Extract PDF Text in my documents & notes workflow?
Lists the top use cases for Extract PDF Text, with example commands for each scenario
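Place the skill folder in ~/.claude/skills/extract-pdf-text/ (personal level, available to all projects) or .claude/skills/extract-pdf-text/ (project level). After restarting the AI client, invoke it explicitly with /extract-pdf-text, or let the AI discover and use it automatically from context.
Extract PDF Text supports Claude, Cursor, and OpenClaw, and integrates with these AI platforms to extend their capabilities.
Extract PDF Text is free to install and use. See the repository for license information.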
Extract PDF Text belongs to the "Documents & Notes" category; skills in this category help AI agents carry out specialized tasks in this domain.
Automate my documents & notes tasks using Extract PDF Text
Identifies repetitive steps in your workflow and sets up Extract PDF Text to handle them automatically