MinerU PDF Extractor

mineru-pdf-extractor

Extract PDF content to Markdown using MinerU API. Supports formulas, tables, OCR. Provides both local file and online URL parsing methods.

Data source: ClawHub. View on ClawSkills.

About MinerU PDF Extractor

---
name: mineru-pdf-extractor
description: Extract PDF content to Markdown using MinerU API. Supports formulas, tables, OCR. Provides both local file and online URL parsing methods.
author: Community
version: 1.0.0
homepage: https://mineru.net/
source: https://github.com/opendatalab/MinerU
env:
  - name: MINERU_TOKEN
    description: "MinerU API token for authentication (primary)"
    required: true
  - name: MINERU_API_KEY
    description: "Alternative API token if MINERU_TOKEN is not set"
    required: false
  - name: MINERU_BASE_URL
    description: "API base URL (optional, defaults to https://mineru.net/api/v4)"
    required: false
    default: "https://mineru.net/api/v4"
tools:
  required:
    - name: curl
      description: "HTTP client for API requests and file downloads"
    - name: unzip
      description: "Archive extraction tool for result ZIP files"
  optional:
    - name: jq
      description: "JSON processor for enhanced parsing and security (recommended)"
---

MinerU PDF Extractor

Extract PDF documents to structured Markdown using the MinerU API. Supports formula recognition, table extraction, and OCR.

> Note: This is a community skill, not an official MinerU product. You need to obtain your own API key from MinerU.

---

📁 Skill Structure

mineru-pdf-extractor/
├── SKILL.md                          # English documentation
├── SKILL_zh.md                       # Chinese documentation
├── docs/                             # Documentation
│   ├── Local_File_Parsing_Guide.md   # Local PDF parsing detailed guide (English)
│   ├── Online_URL_Parsing_Guide.md   # Online PDF parsing detailed guide (English)
│   ├── MinerU_本地文档解析完整流程.md  # Local parsing complete guide (Chinese)
│   └── MinerU_在线文档解析完整流程.md  # Online parsing complete guide (Chinese)
└── scripts/                          # Executable scripts
    ├── local_file_step1_apply_upload_url.sh    # Local parsing Step 1
    ├── local_file_step2_upload_file.sh         # Local parsing Step 2
    ├── local_file_step3_poll_result.sh         # Local parsing Step 3
    ├── local_file_step4_download.sh            # Local parsing Step 4
    ├── online_file_step1_submit_task.sh        # Online parsing Step 1
    └── online_file_step2_poll_result.sh        # Online parsing Step 2

---

🔧 Requirements

Required Environment Variables

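The scripts read the MinerU API token from environment variables. Set one of the following; MINERU_TOKEN is the primary variable, and MINERU_API_KEY is used only if MINERU_TOKEN is not set: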

# Option 1: Set MINERU_TOKEN
export MINERU_TOKEN="your_api_token_here"

# Option 2: Set MINERU_API_KEY
export MINERU_API_KEY="your_api_token_here"
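
The fallback between the two variables can be sketched as follows. This is a minimal illustration, not the skill's actual code: the function name `resolve_mineru_token` is hypothetical, and the precedence (MINERU_TOKEN first, then MINERU_API_KEY) is taken from the skill metadata.

```shell
# Hypothetical helper: print the API token, preferring MINERU_TOKEN
# and falling back to MINERU_API_KEY. Returns 1 if neither is set.
resolve_mineru_token() {
    token="${MINERU_TOKEN:-${MINERU_API_KEY:-}}"
    if [ -z "$token" ]; then
        echo "Error: set MINERU_TOKEN or MINERU_API_KEY" >&2
        return 1
    fi
    printf '%s\n' "$token"
}
```

A caller could then write `token=$(resolve_mineru_token) || exit 1` before making any request.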

Required Command-Line Tools

curl - For HTTP requests (usually pre-installed)
unzip - For extracting results (usually pre-installed)

Optional Tools

jq - For enhanced JSON parsing and security (recommended but not required)

- If not installed, scripts will use fallback methods
- Install: apt-get install jq (Debian/Ubuntu) or brew install jq (macOS)
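
A pre-flight check for these tools might look like the sketch below. The function names `require_tools` and `has_jq` are hypothetical and are not part of the skill's scripts; the check relies only on the standard `command -v` builtin.

```shell
# Hypothetical pre-flight check: report every missing tool on stderr
# and return non-zero if at least one is absent.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "Missing required tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# jq is optional, so callers can branch on its presence instead of failing.
has_jq() {
    command -v jq >/dev/null 2>&1
}
```

For example, `require_tools curl unzip || exit 1` stops early with a clear message, while `has_jq || echo "jq not found, using fallback parsing"` only warns.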

Optional Configuration

# Set API base URL (default is pre-configured)
export MINERU_BASE_URL="https://mineru.net/api/v4"
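
Inside a script, the documented default can be applied with ordinary parameter expansion. The helper name `mineru_base_url` below is hypothetical, and stripping a trailing slash is an added convenience rather than documented behavior.

```shell
# Hypothetical helper: use the documented default when MINERU_BASE_URL
# is unset or empty, and drop one trailing slash so that paths can be
# appended without producing "//".
mineru_base_url() {
    base="${MINERU_BASE_URL:-https://mineru.net/api/v4}"
    printf '%s\n' "${base%/}"
}
```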

> 💡 Get Token: Visit https://mineru.net/apiManage/docs to register and obtain an API Key

---

📄 Feature 1: Parse Local PDF Documents

For locally stored PDF files. Requires 4 steps.

Quick Start

cd scripts/

# Step 1: Apply for upload URL
./local_file_step1_apply_upload_url.sh /path/to/your.pdf
# Output: BATCH_ID=xxx UPLOAD_URL=xxx (set both as shell variables for the next steps)

# Step 2: Upload file
./local_file_step2_upload_file.sh "$UPLOAD_URL" /path/to/your.pdf

# Step 3: Poll for results
./local_file_step3_poll_result.sh "$BATCH_ID"
# Output: FULL_ZIP_URL=xxx (set it as a shell variable for Step 4)

# Step 4: Download results
./local_file_step4_download.sh "$FULL_ZIP_URL" result.zip extracted/
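
The `$UPLOAD_URL`, `$BATCH_ID`, and `$FULL_ZIP_URL` placeholders in the Quick Start are not set automatically; they have to be filled from each script's output. One way to do that is sketched below. It assumes each value is printed on its own line as `KEY=VALUE`, as in the Output examples, and `get_value` is a hypothetical helper name.

```shell
# Hypothetical helper: read KEY=VALUE lines on stdin and print the value
# for the requested key. Everything after the first "=" is kept, so URLs
# with "=" in their query string are not truncated.
get_value() {
    sed -n "s/^$1=//p" | head -n 1
}

# Usage sketch (run from scripts/):
#   out=$(./local_file_step1_apply_upload_url.sh /path/to/your.pdf)
#   BATCH_ID=$(printf '%s\n' "$out" | get_value BATCH_ID)
#   UPLOAD_URL=$(printf '%s\n' "$out" | get_value UPLOAD_URL)
```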

Script Descriptions

local_file_step1_apply_upload_url.sh

Requests a presigned upload URL and a batch_id for the given PDF file.

Usage:

./local_file_step1_apply_upload_url.sh <pdf_file_path> [language] [layout_model]

Parameters:

pdf_file_path: Path to the local PDF file (required)
language: ch (Chinese), en (English), auto (auto-detect), default ch
layout_model: doclayout_yolo (fast), layoutlmv3 (accurate), default doclayout_yolo

Output:

BATCH_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
UPLOAD_URL=https://mineru.oss-cn-shanghai.aliyuncs.com/...
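
How the optional arguments and their defaults might be handled is sketched below. This is an illustration under assumptions: `resolve_options` is a hypothetical name, and rejecting values outside the documented lists is a choice made for this sketch; the actual script may accept other values.

```shell
# Hypothetical handling of the optional [language] [layout_model] arguments.
# Prints "language layout_model" with defaults applied; returns 1 when a
# value is not in the documented list.
resolve_options() {
    language="${1:-ch}"
    layout_model="${2:-doclayout_yolo}"
    case "$language" in
        ch|en|auto) ;;
        *) echo "Unsupported language: $language" >&2; return 1 ;;
    esac
    case "$layout_model" in
        doclayout_yolo|layoutlmv3) ;;
        *) echo "Unsupported layout_model: $layout_model" >&2; return 1 ;;
    esac
    printf '%s %s\n' "$language" "$layout_model"
}
```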

---

local_file_step2_upload_file.sh

Uploads the PDF file to the presigned URL returned by Step 1.

Usage:

./local_file_step2_upload_file.sh <upload_url> <pdf_file_path>
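
An upload of this kind can be sketched with curl alone. The sketch assumes the presigned URL accepts an HTTP PUT of the raw file bytes, which is common for presigned object-storage URLs but is not confirmed here; `upload_pdf` and the `CURL` override variable are hypothetical names.

```shell
# Sketch of an upload to a presigned URL (assumes the URL accepts HTTP PUT).
# CURL may be set to another command name, e.g. a stub for dry runs.
upload_pdf() {
    upload_url="$1"
    pdf_path="$2"
    if [ ! -f "$pdf_path" ]; then
        echo "File not found: $pdf_path" >&2
        return 1
    fi
    # -T uploads the file (PUT over HTTP); --fail makes curl exit non-zero
    # on HTTP error responses; --silent --show-error hides the progress
    # meter but still prints errors.
    "${CURL:-curl}" --fail --silent --show-error -T "$pdf_path" "$upload_url"
}
```

The URL is passed as a single quoted argument so that the `&` characters in its query string are not interpreted by the shell.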

---

local_file_step3_poll_result.sh

Poll extraction results until completion or failure.

Usage:

./local_file_step3_poll_result.sh <batch_id> [max_retries] [retry_interval_seconds]

Output:

FULL_ZIP_URL=https://cdn-mineru.openxlab.org.cn/pdf/.../xxx.zip
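
The retry behavior implied by `[max_retries]` and `[retry_interval_seconds]` can be sketched as a generic loop. This is not the script's actual implementation: `poll_until_done` and its exit-code convention (0 = done, 2 = permanent failure, anything else = retry) are assumptions made for the illustration.

```shell
# Generic polling sketch: run a check command up to max_retries times,
# sleeping retry_interval seconds between attempts.
# Check command contract (hypothetical): exit 0 when the task is done,
# exit 2 when it has failed for good, any other status to try again.
poll_until_done() {
    max_retries="$1"
    retry_interval="$2"
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_retries" ]; do
        status=0
        "$@" || status=$?
        case "$status" in
            0) return 0 ;;
            2) echo "Task failed" >&2; return 2 ;;
        esac
        attempt=$((attempt + 1))
        # Only sleep if another attempt will follow.
        if [ "$attempt" -le "$max_retries" ]; then
            sleep "$retry_interval"
        fi
    done
    echo "Timed out after $max_retries attempts" >&2
    return 1
}
```

Separating "failed" from "not ready yet" lets the loop stop immediately on a permanent error instead of waiting out every retry.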

---

local_file_step4_download.sh

Download result ZIP and extract.

Usage:

./local_file_step4_download.sh <zip_url> [output_zip_filename] [extract_directory_name]

Output Structure:

extracted/
├── full.md              # 📄 Markdown document (main result)
├── images/              # 🖼️ Extracted images
├── content_list.json    # Structured content
└── layout.json          # Layout analysis data
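
After extraction, a quick sanity check against the layout above can catch an empty or partial result. The helper below is a hypothetical sketch; it relies only on the file and directory names listed in the output structure.

```shell
# Hypothetical post-download check: confirm the main Markdown file exists
# and print a one-line summary. Returns 1 if full.md is missing.
summarize_extracted() {
    dir="${1:-extracted}"
    if [ ! -f "$dir/full.md" ]; then
        echo "full.md not found in $dir" >&2
        return 1
    fi
    md_lines=$(wc -l < "$dir/full.md" | tr -d ' ')
    image_count=0
    if [ -d "$dir/images" ]; then
        image_count=$(find "$dir/images" -type f | wc -l | tr -d ' ')
    fi
    echo "full.md: $md_lines lines, images: $image_count"
}
```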

Detailed Documentation

📚 Complete Guide: See docs/Local_File_Parsing_Guide.md

---

🌐 Feature 2: Parse Online PDF Documents (URL Method)

For PDF files already available online (e.g., arXiv, websites). Only 2 steps are needed because no upload is required.

Quick Start

cd scripts/

# Step 1: Submit parsing task (provide URL directly)
./online_file_step1_submit_task.sh "https://arxiv.org/pdf/2410.17247.pdf"
# Output: TASK_ID=xxx (set it as a shell variable for Step 2)

# Step 2: Poll results and auto-download/extract
./online_file_step2_poll_result.sh "$TASK_ID" extracted/
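
The two online steps can be chained so that TASK_ID is passed along automatically. The wrapper below is a sketch, assuming Step 1 prints a line of the form `TASK_ID=...`; `parse_online_pdf` and the `SCRIPTS_DIR` variable are hypothetical names.

```shell
# Sketch: run both online steps in sequence.
# SCRIPTS_DIR (hypothetical) points at the skill's scripts/ directory
# and defaults to the current directory.
parse_online_pdf() {
    pdf_url="$1"
    out_dir="${2:-extracted}"
    scripts="${SCRIPTS_DIR:-.}"
    # Keep only the TASK_ID line from Step 1 and strip the "TASK_ID=" prefix.
    task_id=$("$scripts/online_file_step1_submit_task.sh" "$pdf_url" \
        | sed -n 's/^TASK_ID=//p' | head -n 1)
    if [ -z "$task_id" ]; then
        echo "No TASK_ID returned for $pdf_url" >&2
        return 1
    fi
    "$scripts/online_file_step2_poll_result.sh" "$task_id" "$out_dir"
}
```

For example, `parse_online_pdf "https://arxiv.org/pdf/2410.17247.pdf" extracted/` would run from inside scripts/ without any manual copying of the task ID.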

Script Descriptions

online_file_step1_submit_task.sh

Submit parsing task for online PDF.

Usage:

./online_file_step1_submit_task.sh <pdf_url> [language] [layout_model]

Parameters:

pdf_url: Complete URL of the online PDF (required)
language: ch (Chinese), en (English), auto (auto-detect), default ch
layout_model: doclayout_yolo (fast), layoutlmv3 (accurate), default doclayout_yolo

Output:

TASK_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

---

online_file_step2_poll_result.sh

Poll extraction results, automatically download and extract when complete.

Usage:

./online_file_step2_poll_result.sh <task_id> [output_directory] [max_retries] [retry_interval_seconds]

Output Structure:

extracted/
├── full.md              # 📄 Markdown document (main result)
├── images/              # 🖼️ Extracted images
├── content_list.json    # Structured content
└── layout.json          # Layout analysis data

Detailed Documentation

📚 Complete Guide: See docs/Online_URL_Parsing_Guide.md

---

📊 Comparison of Two Parsing Methods

| Feature | Local PDF Parsing | Online PDF Parsing |
|---------|-------------------|--------------------|
| Steps | 4 steps | 2 steps |
| Upload Required | ✅ Yes | ❌ No |
| Average Time | 30-60 seconds | 10-20 seconds |
| Use Case | Local files | Files already online (arXiv, websites, etc.) |
| File Size Limit | 200MB | Limited by source server |
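
The choice between the two methods can be automated from the input itself. The dispatcher below is a hypothetical sketch: `choose_method` is not part of the skill, and treating 200MB as 200 * 1024 * 1024 bytes is an assumption about how the limit is measured.

```shell
# Hypothetical dispatcher: print "online" for http(s) URLs and "local" for
# existing files within the size limit; return 1 otherwise.
choose_method() {
    input="$1"
    # Assumption: the 200MB limit means 200 * 1024 * 1024 bytes.
    max_bytes=$((200 * 1024 * 1024))
    case "$input" in
        http://*|https://*) echo "online"; return 0 ;;
    esac
    if [ ! -f "$input" ]; then
        echo "Not a URL or an existing file: $input" >&2
        return 1
    fi
    size=$(wc -c < "$input" | tr -d ' ')
    if [ "$size" -gt "$max_bytes" ]; then
        echo "File exceeds 200MB: $input" >&2
        return 1
    fi
    echo "local"
}
```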

---

⚙️ Advanced Usage

Batch Process Local Files

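for pdf in /path/to/pdfs/*.pdf; do
    [ -e "$pdf" ] || continue   # skip if the glob matched nothing
    echo "Processing: $pdf"

    # Step 1: request an upload URL and batch ID.
    # "cut -f2-" keeps everything after the first "=", so URLs that
    # contain "=" in their query string are not truncated.
    result=$(./local_file_step1_apply_upload_url.sh "$pdf" 2>&1)
    batch_id=$(echo "$result" | grep BATCH_ID | cut -d= -f2-)
    upload_url=$(echo "$result" | grep UPLOAD_URL | cut -d= -f2-)

    # Step 2: upload the file
    ./local_file_step2_upload_file.sh "$upload_url" "$pdf"

    # Step 3: poll until the result ZIP is ready
    zip_url=$(./local_file_step3_poll_result.sh "$batch_id" | grep FULL_ZIP_URL | cut -d= -f2-)

    # Step 4: download and extract into a per-file directory
    filename=$(basename "$pdf" .pdf)
    ./local_file_step4_download.sh "$zip_url" "${filename}.zip" "${filename}_extracted"
done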

...

Prompt Examples

After installing MinerU PDF Extractor, you can trigger it by saying things like the following to the AI.

User: Help me get started with MinerU PDF Extractor

A: Explains what MinerU PDF Extractor does, walks through the setup, and runs a quick demo based on your current project

User: Use MinerU PDF Extractor to extract PDF content to Markdown using MinerU API

A: Invokes MinerU PDF Extractor with the right parameters and returns the result directly in the conversation

User: What can I do with MinerU PDF Extractor in my documents & notes workflow?

A: Lists the top use cases for MinerU PDF Extractor, with example commands for each scenario

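FAQ

How do I install MinerU PDF Extractor?

Place the skill folder in ~/.claude/skills/mineru-pdf-extractor/ (personal level, available to all projects) or in .claude/skills/mineru-pdf-extractor/ (project level). After restarting the AI client, invoke it explicitly with /mineru-pdf-extractor, or let the AI discover and use it automatically based on context.

Which AI platforms does MinerU PDF Extractor support?

MinerU PDF Extractor supports Claude, Cursor, and OpenClaw, and integrates with these AI platforms to extend their capabilities.

Is MinerU PDF Extractor free?

MinerU PDF Extractor is free to install and use. Check the repository for license information.

What does MinerU PDF Extractor do?

Extract PDF content to Markdown using MinerU API. Supports formulas, tables, OCR. Provides both local file and online URL parsing methods.

Which category does MinerU PDF Extractor belong to?

MinerU PDF Extractor belongs to the "Documents & Notes" category. Skills in this category help AI agents perform specialized tasks in this domain.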


