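A multimodal document deep-analysis tool based on Zhipu GLM-OCR, GLM-4.7, and GLM-4.6V.
Use when:
- You need to extract tables from documents (PDF/images) with high precision and convert them to Markdown
- You need to automatically crop illustrations and charts out of document pages and save them as standalone files
- You need deep semantic understanding of the extracted charts (GLM-4.6V visual analysis)
- You need logical analysis of the extracted table data (GLM-4.7 text analysis)
Core architecture:
1. Visual extraction: GLM-OCR
2. Semantic understanding: GLM-4.7 (plain text/tables) + GLM-4.6V (multimodal/images)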
Data source: ClawHub.
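Method 1: Command-line install (recommended)
Recommended (no need to install clawhub beforehand):
npx clawhub@latest --dir ~/.claude/skills install pdf-ocr-layout
Or use the clawhub CLI (must be installed beforehand):
clawhub --dir ~/.claude/skills install pdf-ocr-layout
⚠️ Requires Node.js 18+. If Node is not installed, use Method 2 below and download the ZIP directly.
Method 2: Manual download (no Node required)
Download the ZIP, extract it, place the folder at the path below, and restart the Agent:
~/.claude/skills/pdf-ocr-layout/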
---
name: pdf-ocr-layout
description: |
  Multimodal document deep analysis tool based on Zhipu GLM-OCR, GLM-4.7, and GLM-4.6V.
  Use when:
  - Need to extract tables from documents (PDF/images) with high precision and convert to Markdown format
  - Need to automatically crop and extract illustrations and charts from document pages as independent files
  - Need to perform deep semantic understanding on extracted charts (based on GLM-4.6V visual analysis)
  - Need to perform logical analysis on extracted table data (based on GLM-4.7 text analysis)
  Core Architecture:
  1. Visual Extraction: GLM-OCR
  2. Semantic Understanding: GLM-4.7 (text/tables) + GLM-4.6V (multimodal/images)
---
This tool builds a high-precision document parsing pipeline: using GLM-OCR for layout element extraction, calling GLM-4.7 for logical interpretation of table data, and calling GLM-4.6V for multimodal visual interpretation of images and charts.
This Skill consists of two core script stages, orchestrated through glm_ocr_pipeline.py:
1. Extraction (scripts/glm_ocr_extract.py)
2. Understanding (scripts/glm_understanding.py)
- Tables: Combine the full-text context and use GLM-4.7 to analyze the business meaning of the Markdown table data
- Charts: Combine the full-text context with the cropped images and use GLM-4.6V for multimodal visual analysis
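The two-stage flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of glm_ocr_pipeline.py: the names `run_pipeline`, `stub_extract`, `stub_table`, and `stub_image` are hypothetical, the model calls are replaced with stubs, and passing the full-text context to the analyzers is omitted for brevity.

```python
from typing import Callable, Dict, List

Element = Dict[str, object]


def run_pipeline(
    file_path: str,
    extract: Callable[[str], List[Element]],
    analyze_table: Callable[[str], str],
    analyze_image: Callable[[str], str],
) -> List[Element]:
    """Run stage 1 (extraction), then route each element to a stage 2 analyzer."""
    results: List[Element] = []
    for element in extract(file_path):
        enriched = dict(element)  # copy so the extractor's output is not mutated
        content = str(element["content_info"])
        if element["type"] == "table":
            # content is Markdown text, so a text-only model can read it
            enriched["deep_understanding"] = analyze_table(content)
        elif element["type"] == "image":
            # content is the path of a cropped image, so a vision model is needed
            enriched["deep_understanding"] = analyze_image(content)
        results.append(enriched)
    return results


# Stubs standing in for the real model calls (hypothetical, for illustration only).
def stub_extract(file_path: str) -> List[Element]:
    return [
        {"type": "table", "bbox": [100, 200, 500, 600],
         "content_info": "| Revenue | Q1 |"},
        {"type": "image", "bbox": [100, 700, 500, 900],
         "content_info": "/data/output/images/report_page_img_2.png"},
    ]


def stub_table(markdown: str) -> str:
    return f"table analysis of {len(markdown)} characters"


def stub_image(image_path: str) -> str:
    return f"image analysis of {image_path}"


results = run_pipeline("/data/report_page.jpg", stub_extract, stub_table, stub_image)
for item in results:
    print(item["type"], "->", item["deep_understanding"])
```

Passing the extractor and analyzers in as callables keeps the routing logic testable without an API key; a real run would swap the stubs for calls into the two scripts.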
# Run complete pipeline: extraction -> cropping -> understanding analysis, supports input in .pdf, .jpg, .png and other formats
python scripts/glm_ocr_pipeline.py \
--file_path "/data/report_page.jpg" \
--output_dir "/data/output"
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| file_path | string | ✅ | Absolute path to input file (supports .pdf, .png, .jpg) |
| output_dir | string | ✅ | Result output directory (used to save cropped images and JSON reports) |
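For readers extending the script, the two required parameters could be parsed along these lines. This is a sketch using the standard-library `argparse`; the function name `parse_args` is hypothetical and the actual argument handling in glm_ocr_pipeline.py may differ.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the two required pipeline parameters from the command line."""
    parser = argparse.ArgumentParser(
        description="Extract, crop, and analyze layout elements from a document."
    )
    parser.add_argument(
        "--file_path",
        required=True,
        help="Absolute path to the input file (.pdf, .png, .jpg)",
    )
    parser.add_argument(
        "--output_dir",
        required=True,
        help="Directory for cropped images and JSON reports",
    )
    # argv=None makes argparse fall back to sys.argv[1:]
    return parser.parse_args(argv)


args = parse_args(["--file_path", "/data/report_page.jpg", "--output_dir", "/data/output"])
print(args.file_path, args.output_dir)
```

Because both arguments are marked `required=True`, omitting either one makes argparse print a usage message and exit with a non-zero status.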
The tool returns a list containing layout elements and their deep understanding:
[
{
"type": "table",
"bbox": [100, 200, 500, 600],
"content_info": "| Revenue | Q1 |\n|---|---|\n| 100M | ... |",
"deep_understanding": "(Generated by GLM-4.7) This table shows Q1 2024 revenue data. Combined with the 'market expansion strategy' mentioned in paragraph 3 of the body text, it can be seen that..."
},
{
"type": "image",
"bbox": [100, 700, 500, 900],
"content_info": "/data/output/images/report_page_img_2.png",
"deep_understanding": "(Generated by GLM-4.6V) This is a system architecture diagram. Visually, it shows the flow of clients connecting to servers through a Load Balancer. Combined with the title 'Fig 3' and context, this diagram is mainly used to illustrate..."
}
]
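A caller might consume this list as sketched below. The helper names `split_elements` and `bbox_size` are hypothetical, and the code assumes `bbox` is ordered as [x0, y0, x1, y1], which the source does not state explicitly.

```python
import json
from typing import Dict, List, Tuple

# Shortened copy of the sample output above; the raw string keeps \n as a JSON escape.
SAMPLE = r'''
[
  {"type": "table", "bbox": [100, 200, 500, 600],
   "content_info": "| Revenue | Q1 |\n|---|---|",
   "deep_understanding": "(Generated by GLM-4.7) ..."},
  {"type": "image", "bbox": [100, 700, 500, 900],
   "content_info": "/data/output/images/report_page_img_2.png",
   "deep_understanding": "(Generated by GLM-4.6V) ..."}
]
'''


def split_elements(elements: List[Dict]) -> Tuple[List[str], List[str]]:
    """Separate Markdown tables from paths of cropped images."""
    tables = [e["content_info"] for e in elements if e["type"] == "table"]
    image_paths = [e["content_info"] for e in elements if e["type"] == "image"]
    return tables, image_paths


def bbox_size(bbox: List[int]) -> Tuple[int, int]:
    """Return (width, height), assuming bbox is [x0, y0, x1, y1]."""
    x0, y0, x1, y1 = bbox
    return x1 - x0, y1 - y0


elements = json.loads(SAMPLE)
tables, image_paths = split_elements(elements)
print(len(tables), "table(s),", len(image_paths), "image(s)")
for element in elements:
    print(element["type"], bbox_size(element["bbox"]))
```

For a table, `content_info` holds the Markdown itself; for an image, it holds the path of the cropped file, so the two kinds need different handling downstream.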
- ZHIPU_API_KEY must be configured
- Dependencies: zhipuai, pillow, beautifulsoup4
- All understanding is based on the complete layout logic of the document (Markdown Context), not isolated fragment analysis.
- Multi-page PDFs default to processing the first page. For batch processing, extend the loop logic at the script level.
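One possible shape for the batch extension and the API-key check is sketched below. It assumes the PDF has already been split into one image per page by some other means; `require_api_key`, `process_all_pages`, and the added `page` field are hypothetical and not part of the Skill.

```python
import os
from typing import Callable, Dict, Iterable, List, Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Fail early when ZHIPU_API_KEY is missing or blank."""
    key = env.get("ZHIPU_API_KEY", "").strip()
    if not key:
        raise RuntimeError("ZHIPU_API_KEY must be configured")
    return key


def process_all_pages(
    page_paths: Iterable[str],
    process_page: Callable[[str], List[Dict]],
) -> List[Dict]:
    """Run a single-page pipeline over every page and tag results with a page number."""
    results: List[Dict] = []
    for page_number, page_path in enumerate(page_paths, start=1):
        for element in process_page(page_path):
            # "page" is an added, hypothetical field so elements stay traceable
            results.append({**element, "page": page_number})
    return results


# Stub standing in for the real single-page pipeline (illustration only).
def stub_process_page(page_path: str) -> List[Dict]:
    return [{"type": "image", "content_info": page_path}]


pages = ["/data/pages/page_1.png", "/data/pages/page_2.png"]
batch = process_all_pages(pages, stub_process_page)
print([(item["page"], item["content_info"]) for item in batch])
```

Checking the key once before the loop avoids failing partway through a long document.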
After installing pdf-ocr-layout, you can trigger it by saying things like the following to the AI:
Help me get started with pdf-ocr-layout
Explains what pdf-ocr-layout does, walks through the setup, and runs a quick demo based on your current project
Use pdf-ocr-layout to run in-depth multimodal analysis on a document
Invokes pdf-ocr-layout with the right parameters and returns the result directly in the conversation
What can I do with pdf-ocr-layout in my documents & notes workflow?
Lists the top use cases for pdf-ocr-layout, with example commands for each scenario
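Place the skill folder in ~/.claude/skills/pdf-ocr-layout/ (personal level, available to all projects) or in .claude/skills/pdf-ocr-layout/ (project level). After restarting the AI client, invoke it explicitly with /pdf-ocr-layout, or let the AI discover and use it automatically based on context.
pdf-ocr-layout supports Claude, Cursor, and OpenClaw, and integrates with these AI platforms to extend their capabilities.
pdf-ocr-layout is free to install and use. See the repository for license information.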
pdf-ocr-layout belongs to the "Documents & Notes" category. Skills in this category help AI agents carry out specialized tasks in this domain.
Automate my documents & notes tasks using pdf-ocr-layout
Identifies repetitive steps in your workflow and sets up pdf-ocr-layout to handle them automatically