pdf-ocr-layout

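A multimodal document deep-analysis tool based on Zhipu GLM-OCR, GLM-4.7, and GLM-4.6V.

Use when:
- You need to extract tables from documents (PDF/images) with high precision and convert them to Markdown
- You need to automatically crop illustrations and charts from document pages and save them as independent files
- You need deep semantic understanding of extracted charts (GLM-4.6V visual analysis)
- You need logical analysis of extracted table data (GLM-4.7 text analysis)

Core architecture:
1. Visual extraction: GLM-OCR
2. Semantic understanding: GLM-4.7 (plain text/tables) + GLM-4.6V (multimodal/images)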

Source: ClawHub.



About pdf-ocr-layout

---
name: pdf-ocr-layout
description: Multimodal document deep analysis tool based on Zhipu GLM-OCR, GLM-4.7, and GLM-4.6V.

Use when:
- Need to extract tables from documents (PDF/images) with high precision and convert to Markdown format
- Need to automatically crop and extract illustrations and charts from document pages as independent files
- Need to perform deep semantic understanding on extracted charts (based on GLM-4.6V visual analysis)
- Need to perform logical analysis on extracted table data (based on GLM-4.7 text analysis)

Core Architecture:
1. Visual Extraction: GLM-OCR
2. Semantic Understanding: GLM-4.7 (text/tables) + GLM-4.6V (multimodal/images)
---

GLM-OCR Multimodal Deep Analysis

This tool builds a high-precision document parsing pipeline: GLM-OCR extracts the layout elements, GLM-4.7 interprets the logic of table data, and GLM-4.6V performs multimodal visual interpretation of images and charts.

Pipeline Implementation Architecture

This Skill consists of two core script stages, orchestrated through glm_ocr_pipeline.py:

1. Extraction Stage (`scripts/glm_ocr_extract.py`)

Core Model: GLM-OCR
Function: Physical layout analysis of the document
Output: Table HTML cleaned into Markdown, chart images cropped into independent files using bbox coordinates, and an intermediate JSON file containing the full-page reading order
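The table-cleaning step described above can be sketched as follows. This is a minimal illustration, not the code in scripts/glm_ocr_extract.py: it uses the standard-library html.parser instead of beautifulsoup4 so that it runs without extra dependencies, and the names `_TableParser` and `html_table_to_markdown` are hypothetical. It assumes a simple table with explicit closing tags and no rowspan/colspan, and treats the first row as the header.

```python
from html.parser import HTMLParser


class _TableParser(HTMLParser):
    """Collects the text of each <td>/<th> cell, row by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace and escape pipes so the cell stays on one line.
            text = " ".join("".join(self._cell).split())
            self._row.append(text.replace("|", "\\|"))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_markdown(html: str) -> str:
    """Convert one simple HTML table to a Markdown table string."""
    parser = _TableParser()
    parser.feed(html)
    rows = [r for r in parser.rows if r]
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    rows = [r + [""] * (width - len(r)) for r in rows]  # pad short rows
    lines = ["| " + " | ".join(rows[0]) + " |",
             "|" + "|".join(["---"] * width) + "|"]
    lines += ["| " + " | ".join(r) + " |" for r in rows[1:]]
    return "\n".join(lines)


html = ("<table><tr><th>Revenue</th><th>Q1</th></tr>"
        "<tr><td>100M</td><td>5%</td></tr></table>")
print(html_table_to_markdown(html))
```

Merged cells and nested tables would need extra handling, which is one reason a fuller parser such as beautifulsoup4 (listed in the dependencies) may be the better fit in practice.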

2. Understanding Stage (`scripts/glm_understanding.py`)

Core Model: GLM-4.7 (text) / GLM-4.6V (visual)
Function: Deep semantic reasoning over the extracted content
Logic:

- Tables: Combine the full-text context and use GLM-4.7 to analyze the business meaning of the Markdown table data
- Charts: Combine the full-text context with the cropped images and use GLM-4.6V for multimodal visual analysis

Invocation Methods

Command Line Invocation

# Run complete pipeline: extraction -> cropping -> understanding analysis, supports input in .pdf, .jpg, .png and other formats
python scripts/glm_ocr_pipeline.py \
  --file_path "/data/report_page.jpg" \
  --output_dir "/data/output"

API Parameter Description

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| file_path | string | ✅ | Absolute path to input file (supports .pdf, .png, .jpg) |
| output_dir | string | ✅ | Result output directory (used to save cropped images and JSON reports) |
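As an illustration of how the two required parameters above might be declared, here is a minimal argparse sketch. It is an assumption about the interface, not the actual contents of glm_ocr_pipeline.py; `build_parser` is a hypothetical name, and only the flag names and their required status come from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the two documented flags; both are required."""
    parser = argparse.ArgumentParser(
        description="Run extraction, cropping, and understanding on one file."
    )
    parser.add_argument(
        "--file_path", required=True,
        help="Absolute path to the input file (.pdf, .png, .jpg)",
    )
    parser.add_argument(
        "--output_dir", required=True,
        help="Directory for cropped images and JSON reports",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--file_path", "/data/report_page.jpg", "--output_dir", "/data/output"]
)
print(args.file_path, args.output_dir)
```

Marking both flags as `required=True` makes a missing argument fail immediately with a usage message rather than later in the pipeline.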

Return Result Structure (JSON)

The tool returns a list containing layout elements and their deep understanding:

[
  {
    "type": "table",
    "bbox": [100, 200, 500, 600],
    "content_info": "| Revenue | Q1 |\n|---|---|\n| 100M | ... |",
    "deep_understanding": "(Generated by GLM-4.7) This table shows Q1 2024 revenue data. Combined with the 'market expansion strategy' mentioned in paragraph 3 of the body text, it can be seen that..."
  },
  {
    "type": "image",
    "bbox": [100, 700, 500, 900],
    "content_info": "/data/output/images/report_page_img_2.png",
    "deep_understanding": "(Generated by GLM-4.6V) This is a system architecture diagram. Visually, it shows the flow of clients connecting to servers through a Load Balancer. Combined with the title 'Fig 3' and context, this diagram is mainly used to illustrate..."
  }
]
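A short sketch of how a caller might consume this structure follows. It assumes the list is available as a JSON string; `group_by_type` is a hypothetical helper, and the sample data simply mirrors the fields shown above. Note that `content_info` holds Markdown for a table but a file path for an image.

```python
import json
from collections import defaultdict


def group_by_type(report_json: str) -> dict:
    """Group layout elements by their "type" field."""
    grouped = defaultdict(list)
    for element in json.loads(report_json):
        grouped[element["type"]].append(element)
    return dict(grouped)


# Sample data shaped like the documented return structure.
sample = json.dumps([
    {"type": "table", "bbox": [100, 200, 500, 600],
     "content_info": "| Revenue | Q1 |", "deep_understanding": "..."},
    {"type": "image", "bbox": [100, 700, 500, 900],
     "content_info": "/data/output/images/report_page_img_2.png",
     "deep_understanding": "..."},
])

grouped = group_by_type(sample)
for kind, elements in grouped.items():
    print(kind, len(elements), elements[0]["bbox"])
```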

Environment Requirements

Environment variable ZHIPU_API_KEY must be configured
Python 3.8+
Dependencies: zhipuai, pillow, beautifulsoup4
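These requirements can be checked before running the pipeline. The sketch below is a hypothetical pre-flight check, not part of the Skill; the function names are illustrative, and the import names `PIL` and `bs4` are the modules installed by the pillow and beautifulsoup4 packages.

```python
import importlib.util
import os
import sys


def check_environment(env=os.environ, version=sys.version_info) -> list:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if not env.get("ZHIPU_API_KEY"):
        problems.append("ZHIPU_API_KEY is not set")
    if tuple(version[:2]) < (3, 8):
        problems.append("Python 3.8+ is required")
    return problems


def missing_modules(names=("zhipuai", "PIL", "bs4")) -> list:
    """Return the top-level modules from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print(check_environment() + [f"missing module: {m}" for m in missing_modules()])
```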

Notes

1. Model Routing Strategy

Table: The content is passed to GLM-4.7, which reasons over it together with the full-text Markdown context
Image: The image is Base64-encoded and passed to GLM-4.6V, which combines it with the OCR-extracted titles and the full-text context for multimodal understanding
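The routing rule above can be sketched as a small dispatch function. Everything here is illustrative: the model identifier strings, the `build_request` name, and the shape of the returned dictionary are assumptions, not the zhipuai SDK's request format or the logic in scripts/glm_understanding.py.

```python
import base64

TEXT_MODEL = "glm-4.7"     # assumed identifier for the text model
VISION_MODEL = "glm-4.6v"  # assumed identifier for the vision model


def build_request(element: dict, context_md: str, image_bytes: bytes = b"") -> dict:
    """Pick a model by element type and assemble a generic request dict."""
    if element["type"] == "table":
        prompt = (f"Document context:\n{context_md}\n\n"
                  f"Table:\n{element['content_info']}")
        return {"model": TEXT_MODEL, "prompt": prompt}
    if element["type"] == "image":
        return {
            "model": VISION_MODEL,
            "prompt": f"Document context:\n{context_md}",
            # Images travel as Base64 text alongside the prompt.
            "image_base64": base64.b64encode(image_bytes).decode("ascii"),
        }
    raise ValueError(f"unsupported element type: {element['type']!r}")


table_req = build_request({"type": "table", "content_info": "| a |"}, "ctx")
image_req = build_request({"type": "image", "content_info": "x.png"}, "ctx", b"\x89PNG")
print(table_req["model"], image_req["model"])
```

Raising on unknown types keeps a new layout element from being sent silently to the wrong model.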

2. Context Association

Every analysis draws on the complete layout logic of the document (the Markdown context) rather than on isolated fragments.

3. PDF Processing

For multi-page PDFs, only the first page is processed by default. To process pages in batch, extend the loop logic at the script level.
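One way to approximate batch processing without editing the scripts is to wrap the documented command line in an outer loop. The sketch below is an alternative to the script-level change, under a stated assumption: each PDF page has already been rendered to its own image file by some other tool. The helper `build_page_commands` and the `page_NNN` output layout are hypothetical.

```python
from pathlib import Path


def build_page_commands(page_images, output_root):
    """Build one pipeline command per page image, each with its own output dir."""
    commands = []
    # Sorting is lexicographic, so zero-padded file names keep pages in order.
    for index, image in enumerate(sorted(page_images), start=1):
        page_dir = Path(output_root) / f"page_{index:03d}"
        commands.append([
            "python", "scripts/glm_ocr_pipeline.py",
            "--file_path", str(image),
            "--output_dir", str(page_dir),
        ])
    return commands


commands = build_page_commands(
    ["/data/pages/p002.png", "/data/pages/p001.png"], "/data/output"
)
for cmd in commands:
    print(" ".join(cmd))
```

Each command list can then be passed to `subprocess.run(cmd, check=True)`. Because every page is analyzed in isolation here, cross-page context is lost, which the script-level loop may be able to preserve.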

Prompt Examples

After installing pdf-ocr-layout, you can trigger it by saying things like the following to your AI.

User: Help me get started with pdf-ocr-layout

A: Explains what pdf-ocr-layout does, walks through the setup, and runs a quick demo based on your current project

User: Use pdf-ocr-layout to run a deep multimodal analysis of this document

A: Invokes pdf-ocr-layout with the right parameters and returns the result directly in the conversation

User: What can I do with pdf-ocr-layout in my documents & notes workflow?

A: Lists the top use cases for pdf-ocr-layout, with example commands for each scenario

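FAQ

How do I install pdf-ocr-layout?

Place the skill folder in ~/.claude/skills/pdf-ocr-layout/ (personal level, available to all projects) or in .claude/skills/pdf-ocr-layout/ (project level). After restarting the AI client, invoke it explicitly with /pdf-ocr-layout, or let the AI discover and use it automatically based on context.

Which AI platforms does pdf-ocr-layout support?

pdf-ocr-layout supports Claude, Cursor, and OpenClaw, and integrates with these platforms to extend their capabilities.

Is pdf-ocr-layout free?

pdf-ocr-layout is free to install and use. See the repository for license information.

What does pdf-ocr-layout do?

It is a multimodal document deep-analysis tool based on Zhipu GLM-OCR, GLM-4.7, and GLM-4.6V. Use it to extract tables from PDFs or images with high precision and convert them to Markdown, to crop illustrations and charts from document pages into independent files, to analyze extracted charts semantically with GLM-4.6V, and to analyze extracted table data logically with GLM-4.7.

Which category does pdf-ocr-layout belong to?

pdf-ocr-layout belongs to the "Documents & Notes" category, whose skills help AI agents perform specialized tasks in this domain.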
