Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.
Source: ClawHub.
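Method 1: Install from the command line (recommended)
Recommended (no need to install clawhub first):
npx clawhub@latest --dir ~/.claude/skills install pdf-text-extractor
Or use the clawhub CLI (must be installed beforehand):
clawhub --dir ~/.claude/skills install pdf-text-extractor
⚠️ Requires Node.js 18+. If you do not have Node, use Method 2 below to download the ZIP directly.
Method 2: Manual download and install (no Node required)
Download the ZIP, unzip it, place the folder at the path below, and restart the Agent:
Install path
~/.claude/skills/pdf-text-extractor/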
---
name: pdf-text-extractor
description: Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.
metadata:
  {
    "openclaw":
      {
        "version": "1.0.0",
        "author": "Vernox",
        "license": "MIT",
        "tags": ["pdf", "ocr", "text", "extraction", "document", "digitization"],
        "category": "tools"
      }
  }
---
Vernox Utility Skill - Perfect for document digitization.
PDF-Text-Extractor is a zero-dependency tool for extracting text content from PDF files. Supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).
clawhub install pdf-text-extractor
// Extract text from a single PDF
const result = await extractText({
  pdfPath: './document.pdf',
  options: {
    outputFormat: 'text',
    ocr: true,
    language: 'eng'
  }
});
console.log(result.text);
console.log(`Pages: ${result.pages}`);
console.log(`Words: ${result.wordCount}`);
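// Extract text from several PDFs at once
const batch = await extractBatch({
  pdfFiles: [
    './document1.pdf',
    './document2.pdf',
    './document3.pdf'
  ],
  options: {
    outputFormat: 'json',
    ocr: true
  }
});
// extractBatch returns an object; successCount is the number of PDFs extracted
console.log(`Extracted ${batch.successCount} PDFs`);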
// Extract text from a scanned PDF with OCR
const result = await extractText({
  pdfPath: './scanned-document.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});
// OCR will be used (scanned document detected)
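The snippet above relies on the skill noticing that a document is scanned. How that detection works is not described here. One common heuristic is to try embedded text first and fall back to OCR when the pages yield almost no characters. The sketch below illustrates only that idea: `chooseMethod`, `minCharsPerPage`, and the sample page arrays are hypothetical and are not part of the skill's API.

```javascript
// Hypothetical helper: decide between embedded-text extraction and OCR.
// pageTexts holds the embedded text found on each page (empty for scans).
function chooseMethod(pageTexts, { ocr = true, minCharsPerPage = 20 } = {}) {
  if (pageTexts.length === 0) return ocr ? 'ocr' : 'text';
  const totalChars = pageTexts.reduce((sum, t) => sum + t.trim().length, 0);
  const avgChars = totalChars / pageTexts.length;
  // Very little embedded text suggests a scanned document.
  return ocr && avgChars < minCharsPerPage ? 'ocr' : 'text';
}

// Sample inputs (assumptions for illustration only).
const digitalPdf = [
  'INVOICE #12345 Date: 2026-02-04 Total due: 100.00',
  'Terms and conditions apply to this invoice.'
];
const scannedPdf = ['', '  '];

console.log(chooseMethod(digitalPdf));                 // text
console.log(chooseMethod(scannedPdf));                 // ocr
console.log(chooseMethod(scannedPdf, { ocr: false })); // text
```

The value returned here lines up with the `method` field ('text' or 'ocr') in the extractText result, though the real threshold and detection logic may differ.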
extractText
Extract text content from a single PDF file.
Parameters:
- pdfPath (string, required): Path to PDF file
- options (object, optional): Extraction options
  - outputFormat (string): 'text' | 'json' | 'markdown' | 'html'
  - ocr (boolean): Enable OCR for scanned docs
  - language (string): OCR language code ('eng', 'spa', 'fra', 'deu')
  - preserveFormatting (boolean): Keep headings/structure
  - minConfidence (number): Minimum OCR confidence score (0-100)
Returns:
- text (string): Extracted text content
- pages (number): Number of pages processed
- wordCount (number): Total word count
- charCount (number): Total character count
- language (string): Detected language
- metadata (object): PDF metadata (title, author, creation date)
- method (string): 'text' or 'ocr' (extraction method)

extractBatch
Extract text from multiple PDF files at once.
Parameters:
- pdfFiles (array, required): Array of PDF file paths
- options (object, optional): Same as extractText
Returns:
- results (array): Array of extraction results
- totalPages (number): Total pages across all PDFs
- successCount (number): Successfully extracted
- failureCount (number): Failed extractions
- errors (array): Error details for failures

countWords
Count words in extracted text.
Parameters:
- text (string, required): Text to count
- options (object, optional):
  - minWordLength (number): Minimum characters per word (default: 3)
  - excludeNumbers (boolean): Don't count numbers as words
  - countByPage (boolean): Return word count per page
Returns:
- wordCount (number): Total word count
- charCount (number): Total character count
- pageCounts (array): Word count per page
- averageWordsPerPage (number): Average words per page

detectLanguage
Detect the language of extracted text.
Parameters:
- text (string, required): Text to analyze
- minConfidence (number): Minimum confidence for detection
Returns:
- language (string): Detected language code
- languageName (string): Full language name
- confidence (number): Confidence score (0-100)

config.json:
{
  "ocr": {
    "enabled": true,
    "defaultLanguage": "eng",
    "quality": "medium",
    "languages": ["eng", "spa", "fra", "deu"]
  },
  "output": {
    "defaultFormat": "text",
    "preserveFormatting": true,
    "includeMetadata": true
  },
  "batch": {
    "maxConcurrent": 3,
    "timeoutSeconds": 30
  }
}
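The `batch` settings above suggest that batch extraction caps how many PDFs are processed at once and gives up on files that take too long. The following is a minimal sketch of that pattern, not the skill's implementation: `runBatch`, `withTimeout`, and `fakeExtract` are hypothetical names, and the stand-in extractor only simulates work with a timer. The returned object mirrors the fields documented for extractBatch.

```javascript
// Hypothetical sketch: run an extractor over many files with a concurrency
// cap and a per-file timeout, mirroring batch.maxConcurrent / timeoutSeconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out: ${label}`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

async function runBatch(pdfFiles, extract, { maxConcurrent = 3, timeoutSeconds = 30 } = {}) {
  const results = [];
  const errors = [];
  let next = 0;

  // Each worker pulls the next unprocessed file until none remain.
  async function worker() {
    while (next < pdfFiles.length) {
      const file = pdfFiles[next++];
      try {
        results.push(await withTimeout(extract(file), timeoutSeconds * 1000, file));
      } catch (err) {
        errors.push({ file, message: err.message });
      }
    }
  }

  const workerCount = Math.min(maxConcurrent, pdfFiles.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));

  return {
    results,
    totalPages: results.reduce((sum, r) => sum + r.pages, 0),
    successCount: results.length,
    failureCount: errors.length,
    errors
  };
}

// Stand-in extractor (assumption): resolves quickly, fails for one file.
async function fakeExtract(file) {
  await new Promise((resolve) => setTimeout(resolve, 10));
  if (file.includes('broken')) throw new Error('Unreadable PDF');
  return { file, text: 'sample', pages: 2 };
}

const batchPromise = runBatch(
  ['./a.pdf', './broken.pdf', './c.pdf'],
  fakeExtract,
  { maxConcurrent: 2 }
);
batchPromise.then((batch) => {
  const total = batch.successCount + batch.failureCount;
  console.log(`Processed ${batch.successCount}/${total}`); // Processed 2/3
});
```

A worker pool like this keeps at most `maxConcurrent` extractions in flight, and one slow or unreadable file is recorded in `errors` without aborting the rest of the batch.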
// Digitize an invoice (text-based PDF)
const invoice = await extractText({ pdfPath: './invoice.pdf' });
console.log(invoice.text);
// "INVOICE #12345 Date: 2026-02-04..."

// Extract a scanned contract with OCR
const contract = await extractText({
  pdfPath: './scanned-contract.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});
console.log(contract.text);
// "AGREEMENT This contract between..."

// Process several documents in one batch
const docs = await extractBatch({
  pdfFiles: [
    './doc1.pdf',
    './doc2.pdf',
    './doc3.pdf',
    './doc4.pdf'
  ]
});
console.log(`Processed ${docs.successCount}/${docs.results.length} documents`);
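The `countWords` options described in the API reference above (`minWordLength`, `excludeNumbers`) can be illustrated with a small stand-alone function. This is a sketch under stated assumptions, not the skill's code: `countWordsSketch` is a hypothetical name, words are split on whitespace, and surrounding punctuation is not stripped.

```javascript
// Hypothetical sketch of word counting with the documented options.
function countWordsSketch(text, { minWordLength = 3, excludeNumbers = false } = {}) {
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const words = tokens.filter((t) => {
    if (t.length < minWordLength) return false;
    // Treat tokens made only of digits, dots, and commas as numbers.
    if (excludeNumbers && /^[\d.,]+$/.test(t)) return false;
    return true;
  });
  return { wordCount: words.length, charCount: text.length };
}

const sample = 'INVOICE 12345 is due on 2026';
console.log(countWordsSketch(sample));                           // { wordCount: 4, charCount: 28 }
console.log(countWordsSketch(sample, { excludeNumbers: true })); // { wordCount: 2, charCount: 28 }
console.log(countWordsSketch(sample, { minWordLength: 1 }));     // { wordCount: 6, charCount: 28 }
```

With the default `minWordLength` of 3, short tokens such as "is" and "on" are skipped, so this count can be lower than the number of whitespace-separated tokens.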
...
After installing PDF Text Extractor, you can trigger it by saying things like these to the AI:
Help me get started with PDF Text Extractor
Explains what PDF Text Extractor does, walks through the setup, and runs a quick demo based on your current project
Use PDF Text Extractor to extract text from PDFs with OCR support
Invokes PDF Text Extractor with the right parameters and returns the result directly in the conversation
What can I do with PDF Text Extractor in my documents & notes workflow?
Lists the top use cases for PDF Text Extractor, with example commands for each scenario
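Place the skill folder in ~/.claude/skills/pdf-text-extractor/ (personal level, available to all projects) or in .claude/skills/pdf-text-extractor/ (project level). After restarting the AI client, invoke it explicitly with /pdf-text-extractor, or let the AI discover and use it automatically based on context.
PDF Text Extractor supports Claude, Cursor, and OpenClaw, and extends the capabilities of these AI platforms.
PDF Text Extractor is free to install and use. See the repository for license information.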
PDF Text Extractor belongs to the "Documents & Notes" category. Skills in this category help AI agents carry out specialized tasks in this area.
Automate my documents & notes tasks using PDF Text Extractor
Identifies repetitive steps in your workflow and sets up PDF Text Extractor to handle them automatically