
PDF Text Extractor


pdf-text-extractor

Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.

Data source: ClawHub. View on ClawSkills.

10.7k downloads

19 favorites

132 views


About PDF Text Extractor

---
name: pdf-text-extractor
description: Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.
metadata:
  {
    "openclaw": {
      "version": "1.0.0",
      "author": "Vernox",
      "license": "MIT",
      "tags": ["pdf", "ocr", "text", "extraction", "document", "digitization"],
      "category": "tools"
    }
  }
---

PDF-Text-Extractor - Extract Text from PDFs

Vernox Utility Skill - Perfect for document digitization.

Overview

PDF-Text-Extractor is a zero-dependency tool for extracting text content from PDF files. Supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).

Features

✅ Text Extraction

Extract text from PDFs without external tools
Support for both text-based and scanned PDFs
Preserve document structure and formatting
Fast extraction (milliseconds for text-based)

✅ OCR Support

Use Tesseract.js for scanned documents
Support multiple languages (English, Spanish, French, German)
Configurable OCR quality/speed
Fallback to text extraction when possible
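
The text-first behaviour in the list above could be sketched as follows. This is a minimal illustration, not the skill's actual code: `chooseMethod`, `textLayer`, and the `minChars` threshold are hypothetical names introduced here; only the `'text'` / `'ocr'` return values mirror the documented `method` field.

```javascript
// Hypothetical helper: decide between the embedded text layer and OCR.
// `textLayer` is whatever text was read directly from the PDF (it may be
// empty for scanned documents). `minChars` is an assumed threshold, not a
// documented setting of the skill.
function chooseMethod(textLayer, { ocr = true, minChars = 20 } = {}) {
  const usable = textLayer.trim().length >= minChars;
  if (usable) return 'text'; // embedded text exists: skip OCR entirely
  if (ocr) return 'ocr';     // no usable text layer: fall back to OCR
  return 'text';             // OCR disabled: return whatever text exists
}

console.log(chooseMethod('Invoice #12345, due within 30 days of receipt.')); // text
console.log(chooseMethod('   '));                                            // ocr
console.log(chooseMethod('', { ocr: false }));                               // text
```

Checking the text layer first keeps text-based PDFs on the fast path and reserves OCR for pages that need it.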

✅ Batch Processing

Process multiple PDFs at once
Batch extraction for document workflows
Progress tracking for large files
Error handling and retry logic
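
One way to combine concurrent batch processing with retries is a small worker pool. The sketch below is an assumption about how such logic might look, not the skill's implementation; `runWithLimit`, `worker`, and `retries` are hypothetical names, and the stand-in worker replaces a real extraction call.

```javascript
// Run `worker` over `items` with at most `maxConcurrent` calls in flight,
// retrying each failed item up to `retries` extra times.
// All names here are illustrative and not part of the skill's API.
async function runWithLimit(items, worker, { maxConcurrent = 3, retries = 1 } = {}) {
  const results = new Array(items.length);
  let next = 0;

  async function attempt(item) {
    let lastError;
    for (let i = 0; i <= retries; i++) {
      try {
        return { ok: true, value: await worker(item) };
      } catch (err) {
        lastError = err; // remember the failure and try again
      }
    }
    return { ok: false, error: String((lastError && lastError.message) || lastError) };
  }

  async function lane() {
    while (next < items.length) {
      const index = next++; // claim the next unprocessed item
      results[index] = await attempt(items[index]);
    }
  }

  const laneCount = Math.min(maxConcurrent, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results; // same order as `items`
}

// Stand-in worker: fails once for 'b.pdf', then succeeds on the retry.
const failedOnce = new Set();
runWithLimit(['a.pdf', 'b.pdf'], async (file) => {
  if (file === 'b.pdf' && !failedOnce.has(file)) {
    failedOnce.add(file);
    throw new Error('transient failure');
  }
  return file.toUpperCase();
}).then((out) => console.log(out));
```

A failed item is recorded rather than thrown, so one bad file does not abort the rest of the batch.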

✅ Output Options

Plain text output
JSON output with metadata
Markdown conversion
HTML output (preserving links)

✅ Utility Features

Page-by-page extraction
Character/word counting
Language detection
Metadata extraction (author, title, creation date)

Installation

clawhub install pdf-text-extractor

Quick Start

Extract Text from PDF

const result = await extractText({
  pdfPath: './document.pdf',
  options: {
    outputFormat: 'text',
    ocr: true,
    language: 'eng'
  }
});

console.log(result.text);
console.log(`Pages: ${result.pages}`);
console.log(`Words: ${result.wordCount}`);

Batch Extract Multiple PDFs

const results = await extractBatch({
  pdfFiles: [
    './document1.pdf',
    './document2.pdf',
    './document3.pdf'
  ],
  options: {
    outputFormat: 'json',
    ocr: true
  }
});

console.log(`Extracted ${results.successCount} of ${results.results.length} PDFs`);

Extract with OCR

const result = await extractText({
  pdfPath: './scanned-document.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});

// OCR will be used (scanned document detected)

Tool Functions

`extractText`

Extract text content from a single PDF file.

Parameters:

pdfPath (string, required): Path to PDF file
options (object, optional): Extraction options

- outputFormat (string): 'text' | 'json' | 'markdown' | 'html'
- ocr (boolean): Enable OCR for scanned docs
- language (string): OCR language code ('eng', 'spa', 'fra', 'deu')
- preserveFormatting (boolean): Keep headings/structure
- minConfidence (number): Minimum OCR confidence score (0-100)

Returns:

text (string): Extracted text content
pages (number): Number of pages processed
wordCount (number): Total word count
charCount (number): Total character count
language (string): Detected language
metadata (object): PDF metadata (title, author, creation date)
method (string): 'text' or 'ocr' (extraction method)

`extractBatch`

Extract text from multiple PDF files at once.

Parameters:

pdfFiles (array, required): Array of PDF file paths
options (object, optional): Same as extractText

Returns:

results (array): Array of extraction results
totalPages (number): Total pages across all PDFs
successCount (number): Successfully extracted
failureCount (number): Failed extractions
errors (array): Error details for failures
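
To make the return shape concrete, here is a small sketch that folds per-file outcomes into the fields listed above. The `outcomes` input format and the `summarizeBatch` name are assumptions made for this illustration; only the output field names come from the documentation.

```javascript
// Fold per-file outcomes into the documented return shape.
// `outcomes` is an assumed intermediate format: { file, result } on success,
// { file, error } on failure.
function summarizeBatch(outcomes) {
  const summary = { results: [], totalPages: 0, successCount: 0, failureCount: 0, errors: [] };
  for (const { file, result, error } of outcomes) {
    if (error) {
      summary.failureCount += 1;
      summary.errors.push({ file, message: error });
    } else {
      summary.successCount += 1;
      summary.totalPages += result.pages;
      summary.results.push(result);
    }
  }
  return summary;
}

const summary = summarizeBatch([
  { file: 'a.pdf', result: { text: 'hello', pages: 2 } },
  { file: 'b.pdf', error: 'Invalid PDF' },
  { file: 'c.pdf', result: { text: 'world', pages: 5 } },
]);
console.log(`${summary.successCount} ok, ${summary.failureCount} failed, ${summary.totalPages} pages`);
// prints "2 ok, 1 failed, 7 pages"
```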

`countWords`

Count words in extracted text.

Parameters:

text (string, required): Text to count
options (object, optional):

- minWordLength (number): Minimum characters per word (default: 3)
- excludeNumbers (boolean): Don't count numbers as words
- countByPage (boolean): Return word count per page

Returns:

wordCount (number): Total word count
charCount (number): Total character count
pageCounts (array): Word count per page
averageWordsPerPage (number): Average words per page
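
The options above interact in a way that is easiest to see in code. The following is a rough sketch under stated assumptions, not the skill's `countWords`: it is named `countWordsSketch` to avoid confusion, it splits on whitespace, and it assumes pages are separated by a form feed (`\f`), which the documentation does not specify.

```javascript
// Illustrative word counter honouring the documented options.
// Assumption: pages are separated by a form feed ('\f').
function countWordsSketch(text, { minWordLength = 3, excludeNumbers = false, countByPage = false } = {}) {
  const countIn = (chunk) =>
    chunk
      .split(/\s+/) // whitespace tokens; '\f' also counts as whitespace
      .filter((w) => w.length >= minWordLength)
      .filter((w) => !(excludeNumbers && /^\d+([.,]\d+)*$/.test(w)))
      .length;

  const result = { wordCount: countIn(text), charCount: text.length };
  if (countByPage) {
    result.pageCounts = text.split('\f').map(countIn);
    result.averageWordsPerPage = result.wordCount / result.pageCounts.length;
  }
  return result;
}

const stats = countWordsSketch('Total due 1200 dollars\fPay by bank transfer', {
  excludeNumbers: true,
  countByPage: true,
});
console.log(stats.wordCount, stats.pageCounts); // 6 [ 3, 3 ]
```

With the default `minWordLength` of 3, short words such as "by" are not counted, so totals can be lower than a plain whitespace split would give.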

`detectLanguage`

Detect the language of extracted text.

Parameters:

text (string, required): Text to analyze
minConfidence (number): Minimum confidence for detection

Returns:

language (string): Detected language code
languageName (string): Full language name
confidence (number): Confidence score (0-100)
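
How `minConfidence` relates to the returned fields can be illustrated with a small filter. This sketch does not detect anything itself; it assumes some detector has already produced scored candidates, and `acceptDetection` and the `candidates` format are hypothetical.

```javascript
// Names for the four language codes mentioned in this document.
const LANGUAGE_NAMES = { eng: 'English', spa: 'Spanish', fra: 'French', deu: 'German' };

// Pick the best-scoring candidate and apply the confidence threshold.
// `candidates` is an assumed format: [{ language: 'eng', confidence: 92 }, ...]
function acceptDetection(candidates, minConfidence = 50) {
  if (candidates.length === 0) return null;
  const best = candidates.reduce((a, b) => (b.confidence > a.confidence ? b : a));
  if (best.confidence < minConfidence) return null; // too uncertain to report
  return {
    language: best.language,
    languageName: LANGUAGE_NAMES[best.language] ?? best.language,
    confidence: best.confidence,
  };
}

console.log(acceptDetection([
  { language: 'spa', confidence: 40 },
  { language: 'eng', confidence: 92 },
], 80));
// { language: 'eng', languageName: 'English', confidence: 92 }
```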

Use Cases

Document Digitization

Convert paper documents to digital text
Process invoices and receipts
Digitize contracts and agreements
Archive physical documents

Content Analysis

Extract text for analysis tools
Prepare content for LLM processing
Clean up scanned documents
Parse PDF-based reports

Data Extraction

Extract data from PDF reports
Parse tables from PDFs
Pull structured data
Automate document workflows

Text Processing

Prepare content for translation
Clean up OCR output
Extract specific sections
Search within PDF content

Performance

Text-Based PDFs

Speed: ~100ms for 10-page PDF
Accuracy: exact (text is read from the embedded text layer, with no recognition step)
Memory: ~10MB for typical document

OCR Processing

Speed: ~1-3s per page (high quality)
Accuracy: 85-95% (depends on scan quality)
Memory: ~50-100MB peak during OCR
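
The figures above allow a rough time estimate for a given document. The sketch below assumes processing time scales linearly with page count, which the documentation does not state; `estimateSeconds` is a hypothetical helper.

```javascript
// Back-of-the-envelope estimate from the quoted figures:
// ~100 ms per 10 pages for text-based PDFs, 1-3 s per page for OCR.
// Assumes linear scaling with page count.
function estimateSeconds(pages, method) {
  if (method === 'ocr') {
    return { min: pages * 1, max: pages * 3 };
  }
  const seconds = pages / 100; // 100 ms per 10 pages = 10 ms per page
  return { min: seconds, max: seconds };
}

console.log(estimateSeconds(40, 'ocr'));  // { min: 40, max: 120 }
console.log(estimateSeconds(40, 'text')); // { min: 0.4, max: 0.4 }
```

On these numbers, a 40-page scan takes roughly 100 to 300 times longer than a 40-page text-based PDF, which is why skipping OCR whenever a text layer exists matters.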

Technical Details

PDF Parsing

Uses the bundled PDF.js library
Extracts text layer directly (no OCR needed)
Preserves document structure
Handles password-protected PDFs

OCR Engine

Tesseract.js under the hood
Tesseract supports 100+ languages (the sample configuration lists English, Spanish, French, and German)
Adjustable quality/speed tradeoff
Confidence scoring for accuracy

Dependencies

No external dependencies to install
Otherwise uses Node.js built-in modules only
PDF.js included in skill
Tesseract.js bundled

Error Handling

Invalid PDF

Clear error message
Suggest fix (check file format)
Skip to next file in batch
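
A cheap way to catch obviously invalid files before parsing is to check the file header: PDF files begin with the marker `%PDF-` followed by a version number. The helper below is a sketch of that pre-check, not the skill's validation logic; `looksLikePdf` is a hypothetical name, and the check is deliberately strict (it requires the marker at byte 0).

```javascript
// Strict header check: the first five bytes must be "%PDF-".
// `bytes` is a Node.js Buffer holding the start of the file.
function looksLikePdf(bytes) {
  return bytes.length >= 5 && bytes.subarray(0, 5).toString('latin1') === '%PDF-';
}

console.log(looksLikePdf(Buffer.from('%PDF-1.7\n...')));  // true
console.log(looksLikePdf(Buffer.from('PK\u0003\u0004'))); // false (ZIP-style header)
```

In a batch, a file failing this check could be reported with a "check file format" hint and skipped without invoking the parser.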

OCR Failure

Report confidence score
Suggest rescan at higher quality
Fallback to basic extraction

Memory Issues

Stream processing for large files
Progress reporting
Graceful degradation

Configuration

Edit `config.json`:

{
  "ocr": {
    "enabled": true,
    "defaultLanguage": "eng",
    "quality": "medium",
    "languages": ["eng", "spa", "fra", "deu"]
  },
  "output": {
    "defaultFormat": "text",
    "preserveFormatting": true,
    "includeMetadata": true
  },
  "batch": {
    "maxConcurrent": 3,
    "timeoutSeconds": 30
  }
}
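
A partial `config.json` could be merged with defaults and validated along these lines. This is a sketch that assumes the values shown above act as defaults, which the documentation does not confirm; `resolveConfig` is a hypothetical helper, and the allowed quality values are taken from the Troubleshooting section (low/medium/high).

```javascript
// Assumed defaults, copied from the sample config.json.
const DEFAULTS = {
  ocr: { enabled: true, defaultLanguage: 'eng', quality: 'medium', languages: ['eng', 'spa', 'fra', 'deu'] },
  output: { defaultFormat: 'text', preserveFormatting: true, includeMetadata: true },
  batch: { maxConcurrent: 3, timeoutSeconds: 30 },
};

// Merge a partial user config over the defaults, section by section,
// then reject combinations that cannot work.
function resolveConfig(user = {}) {
  const config = {
    ocr: { ...DEFAULTS.ocr, ...user.ocr },
    output: { ...DEFAULTS.output, ...user.output },
    batch: { ...DEFAULTS.batch, ...user.batch },
  };
  if (!config.ocr.languages.includes(config.ocr.defaultLanguage)) {
    throw new Error(`defaultLanguage "${config.ocr.defaultLanguage}" is not in ocr.languages`);
  }
  if (!['low', 'medium', 'high'].includes(config.ocr.quality)) {
    throw new Error(`unknown ocr.quality "${config.ocr.quality}"`);
  }
  return config;
}

const config = resolveConfig(
  JSON.parse('{"ocr": {"quality": "high"}, "batch": {"maxConcurrent": 5}}')
);
console.log(config.ocr.quality, config.ocr.defaultLanguage, config.batch.maxConcurrent);
// high eng 5
```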

Examples

Extract from Invoice

const invoice = await extractText({ pdfPath: './invoice.pdf' });
console.log(invoice.text);
// "INVOICE #12345 Date: 2026-02-04..."

Extract from Scanned Contract

const contract = await extractText({
  pdfPath: './scanned-contract.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});
console.log(contract.text);
// "AGREEMENT This contract between..."

Batch Process Documents

const docs = await extractBatch({
  pdfFiles: [
    './doc1.pdf',
    './doc2.pdf',
    './doc3.pdf',
    './doc4.pdf'
  ]
});
console.log(`Processed ${docs.successCount}/${docs.results.length} documents`);

Troubleshooting

OCR Not Working

Check if PDF is truly scanned (not text-based)
Try different quality settings (low/medium/high)
Ensure language matches document
Check image quality of scan

Extraction Returns Empty

PDF may be image-only
OCR failed with low confidence
Try different language setting

Slow Processing

Large PDF takes longer
Reduce quality for speed
Process in smaller batches

Tips

...

Prompt Examples

After installing PDF Text Extractor, you can say things like these to your AI to trigger it:

U

Help me get started with PDF Text Extractor

A

Explains what PDF Text Extractor does, walks through the setup, and runs a quick demo based on your current project

U

Use PDF Text Extractor to extract text from PDFs with OCR support

A

Invokes PDF Text Extractor with the right parameters and returns the result directly in the conversation

U

What can I do with PDF Text Extractor in my documents & notes workflow?

A

Lists the top use cases for PDF Text Extractor, with example commands for each scenario

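FAQ

How do I install PDF Text Extractor?

Place the skill folder in ~/.claude/skills/pdf-text-extractor/ (personal level, available to all projects) or in .claude/skills/pdf-text-extractor/ (project level). After restarting your AI client, invoke it explicitly with /pdf-text-extractor, or let the AI discover and use it automatically based on context.

Which AI platforms does PDF Text Extractor support?

PDF Text Extractor supports Claude, Cursor, and OpenClaw, and integrates with these platforms to extend their capabilities.

Is PDF Text Extractor free?

PDF Text Extractor is free to install and use. See the repository for license information.

What does PDF Text Extractor do?

Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.

Which category does PDF Text Extractor belong to?

PDF Text Extractor belongs to the "Documents & Notes" category. Skills in this category help AI agents perform specialized tasks in this domain.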

