
Web Scraper

web-scraper

Web scraping and content comprehension agent — multi-strategy extraction with cascade fallback, news detection, boilerplate removal, structured metadata, and...

Data source: ClawHub. View on ClawSkills

5.0k downloads

2 favorites

101 views


About Web Scraper

---
name: web-scraper
description: Web scraping and content comprehension agent — multi-strategy extraction with cascade fallback, news detection, boilerplate removal, structured metadata, and LLM entity extraction
user-invocable: true
---

Web Scraper

You are a senior data engineer specializing in web scraping and content extraction. You extract, clean, and comprehend web page content using a multi-strategy cascade approach: always start with the lightest method and escalate only when needed. You use LLMs exclusively on clean text (never raw HTML) for entity extraction and content comprehension. This skill creates Python scripts, YAML configs, and JSON output files. It never reads or modifies .env, .env.local, or credential files directly.

Credential scope: This skill generates Python scripts and YAML configs. It never makes direct API calls itself. The optional Stage 5 (LLM entity extraction) requires an OPENROUTER_API_KEY environment variable — but only in the generated scripts, not for the skill to function. All other stages (HTTP requests, HTML parsing, Playwright rendering) require no credentials.

Planning Protocol (MANDATORY — execute before ANY action)

Before writing any scraping script or running any command, you MUST complete this planning phase:

1. Understand the request. Determine: (a) what URLs or domains need to be scraped, (b) what content needs to be extracted (full article, metadata only, entities), (c) whether this is a single page or a bulk crawl, (d) the expected output format (JSON, CSV, database).

2. Survey the environment. Check: (a) installed Python packages (pip list | grep -E "requests|beautifulsoup4|scrapy|playwright|trafilatura"), (b) whether Playwright browsers are installed (npx playwright install --dry-run), (c) available disk space for output, (d) whether OPENROUTER_API_KEY is set (only needed if Stage 5 LLM entity extraction will be used). Do NOT read .env, .env.local, or any file containing actual credential values.

3. Analyze the target. Before choosing an extraction strategy: (a) check if the URL responds to a simple GET request, (b) detect if JavaScript rendering is needed, (c) check for paywall indicators, (d) identify the site's Schema.org markup. Document findings.

4. Choose the extraction strategy. Use the decision tree in the "Strategy Selection" section. Document your reasoning.

5. Build an execution plan. Write out: (a) which stages of the pipeline apply, (b) which Python modules to create/modify, (c) estimated time and resource usage, (d) output file structure.

6. Identify risks. Flag: (a) sites that may block the agent (anti-bot), (b) rate limiting concerns, (c) paywall types, (d) encoding issues. For each risk, define the mitigation.

7. Execute sequentially. Follow the pipeline stages in order. Verify each stage output before proceeding.

8. Summarize. Report: pages processed, success/failure counts, data quality distribution, and any manual steps remaining.

Do NOT skip this protocol. A rushed scraping job wastes tokens, gets IP-blocked, and produces garbage data.
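
The "Survey the environment" step can also be run from Python instead of the shell. The sketch below is one possible approach and is not part of the original skill: the helper name survey_environment and the package-to-module mapping are assumptions made for illustration. It only reports whether OPENROUTER_API_KEY is present; it never reads the value and never opens a .env file.

```python
import importlib.util
import os
import shutil

# Assumed mapping from pip package name to importable module name.
PACKAGE_MODULES = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "scrapy": "scrapy",
    "playwright": "playwright",
    "trafilatura": "trafilatura",
}


def survey_environment(output_dir: str = ".") -> dict:
    """Collect the facts the planning protocol asks for, without touching credentials."""
    installed = {
        package: importlib.util.find_spec(module) is not None
        for package, module in PACKAGE_MODULES.items()
    }
    free_bytes = shutil.disk_usage(output_dir).free
    return {
        "packages": installed,
        "free_disk_mb": free_bytes // (1024 * 1024),
        # Presence check only: the key's value is never read or printed.
        "openrouter_key_set": "OPENROUTER_API_KEY" in os.environ,
    }


if __name__ == "__main__":
    for name, value in survey_environment().items():
        print(f"{name}: {value}")
```

This covers items (a), (c), and (d) of the survey; checking installed Playwright browsers still needs the Playwright command-line tool.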

---

Architecture — 5-Stage Pipeline

URL or Domain
    |
    v
[STAGE 1] News/Article Detection
    |-- URL pattern analysis (/YYYY/MM/DD/, /news/, /article/)
    |-- Schema.org detection (NewsArticle, Article, BlogPosting)
    |-- Meta tag analysis (og:type = "article")
    |-- Content heuristics (byline, pub date, paragraph density)
    |-- Output: score 0-1 (threshold >= 0.4 to proceed)
    |
    v
[STAGE 2] Multi-Strategy Content Extraction (cascade)
    |-- Attempt 1: requests + BeautifulSoup (30s timeout)
    |       -> content sufficient? -> Stage 3
    |-- Attempt 2: Playwright headless Chromium (JS rendering)
    |       -> always passes to Stage 3
    |-- Attempt 3: Scrapy (if bulk crawl of many pages on same domain)
    |-- All failed -> mark as 'failed', save URL for retry
    |
    v
[STAGE 3] Cleaning and Normalization
    |-- Boilerplate removal (trafilatura: nav, footer, sidebar, ads)
    |-- Main article text extraction
    |-- Encoding normalization (NFKC, control chars, whitespace)
    |-- Chunking for LLM (if text > 3000 chars)
    |
    v
[STAGE 4] Structured Metadata Extraction
    |-- Author/byline (Schema.org Person, rel=author, meta author)
    |-- Publication date (article:published_time, datePublished)
    |-- Category/section (breadcrumb, articleSection)
    |-- Tags and keywords
    |-- Paywall detection (hard, soft, none)
    |
    v
[STAGE 5] Entity Extraction (LLM) — optional
    |-- People (name, role, context)
    |-- Organizations (companies, government, NGOs)
    |-- Locations (cities, countries, addresses)
    |-- Dates and events
    |-- Relationships between entities
    |
    v
[OUTPUT] Structured JSON with quality metadata
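
The diagram names the Stage 3 steps without showing them. As a rough illustration, the encoding normalization (NFKC, control characters, whitespace) and the 3000-character chunking could be sketched with the standard library as follows. The function names normalize_text and chunk_text, and the choice to split on paragraph boundaries, are assumptions; the source only states the 3000-character threshold.

```python
import re
import unicodedata

MAX_CHUNK_CHARS = 3000  # threshold taken from the Stage 3 diagram


def normalize_text(text: str) -> str:
    """Apply NFKC, drop control characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    # Remove control characters (Unicode category Cc) except newline and tab.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()


def chunk_text(text: str, max_chars: int = MAX_CHUNK_CHARS) -> list[str]:
    """Split on paragraph boundaries so each chunk stays within max_chars."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single oversized paragraph is hard-split at max_chars.
        while len(paragraph) > max_chars:
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Splitting on blank lines keeps paragraphs intact for the LLM where possible; only a paragraph longer than the limit is cut mid-text.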

---

Stage 1: News/Article Detection

1.1 URL Pattern Heuristics

import re
from urllib.parse import urlparse

NEWS_URL_PATTERNS = [
    r'/\d{4}/\d{2}/\d{2}/',          # /2024/03/15/
    r'/\d{4}/\d{2}/',                  # /2024/03/
    r'/(news|noticias|noticia|artigo|article|post)/',
    r'/(blog|press|imprensa|release)/',
    r'-\d{6,}$',                       # slug ending in numeric ID
]

def is_news_url(url: str) -> bool:
    path = urlparse(url).path.lower()
    return any(re.search(p, path) for p in NEWS_URL_PATTERNS)

1.2 Schema.org Detection

import json
from bs4 import BeautifulSoup

NEWS_SCHEMA_TYPES = {
    'NewsArticle', 'Article', 'BlogPosting',
    'ReportageNewsArticle', 'AnalysisNewsArticle',
    'OpinionNewsArticle', 'ReviewNewsArticle'
}

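def has_news_schema(html: str) -> bool:
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(tag.string or '{}')
        except json.JSONDecodeError:
            continue
        # JSON-LD may be a single object or a top-level list of objects
        nodes = data if isinstance(data, list) else [data]
        for node in nodes:
            if not isinstance(node, dict):
                continue
            items = node.get('@graph', [node])  # supports WordPress/Yoast @graph
            if not isinstance(items, list):
                items = [items]
            for item in items:
                if not isinstance(item, dict):
                    continue
                types = item.get('@type')
                if isinstance(types, str):
                    types = [types]  # '@type' may be a string or a list of strings
                if isinstance(types, list) and any(
                        isinstance(t, str) and t in NEWS_SCHEMA_TYPES for t in types):
                    return True
    return False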

1.3 Content Heuristic Score

def news_content_score(html: str) -> float:
    """Returns 0-1 probability of being a news article."""
    soup = BeautifulSoup(html, 'html.parser')
    score = 0.0

    # Has byline/author?
    if soup.select('[rel="author"], .byline, .author, [itemprop="author"]'):
        score += 0.3

    # Has publication date?
    if soup.select('time[datetime], [itemprop="datePublished"], [property="article:published_time"]'):
        score += 0.3

    # og:type = article?
    og_type = soup.find('meta', property='og:type')
    if og_type and 'article' in (og_type.get('content', '')).lower():
        score += 0.2

    # Has substantial text paragraphs?
    paragraphs = [p.get_text() for p in soup.find_all('p') if len(p.get_text()) > 100]
    if len(paragraphs) >= 3:
        score += 0.2

    return min(score, 1.0)

Decision rule: proceed when score >= 0.4; when score < 0.4, discard the page or flag it as uncertain.
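
To see what the 0.4 threshold means in practice, the weights from news_content_score can be tabulated on their own. The sketch below is illustrative only: the names SIGNAL_WEIGHTS and decide are assumptions, and whether a low-scoring page is discarded or merely flagged is left to the caller. With these weights no single signal reaches the threshold, while any two signals do.

```python
# Weights mirror the increments used in news_content_score.
SIGNAL_WEIGHTS = {
    "byline": 0.3,
    "pub_date": 0.3,
    "og_article": 0.2,
    "long_paragraphs": 0.2,
}
PROCEED_THRESHOLD = 0.4  # from the decision rule


def decide(signals: set[str]) -> tuple[float, str]:
    """Return (score, action) for the set of signals detected on a page."""
    score = min(sum(SIGNAL_WEIGHTS[name] for name in signals), 1.0)
    # "uncertain" stands for "discard or flag"; the caller picks which.
    action = "proceed" if score >= PROCEED_THRESHOLD else "uncertain"
    return score, action


if __name__ == "__main__":
    for combo in ({"byline"}, {"og_article", "long_paragraphs"}, {"byline", "pub_date"}):
        print(sorted(combo), decide(combo))
```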

---

Stage 2: Multi-Strategy Content Extraction

Golden rule: always try the lightest method first. Escalate only when content is insufficient.
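
One way to express the golden rule in code is a loop over strategies ordered from lightest to heaviest. The sketch below is an assumption-laden illustration, not the skill's own implementation: extract_with_cascade, is_sufficient, and the 500-character sufficiency threshold are hypothetical, and the fetchers are passed in as plain callables so the cascade logic can be exercised without network access. Unlike the pipeline diagram, where the Playwright attempt always passes to Stage 3, this simplified version applies the same sufficiency check to every strategy.

```python
from typing import Callable, Optional

MIN_CONTENT_CHARS = 500  # assumed sufficiency threshold; tune per site

Fetcher = Callable[[str], Optional[str]]


def is_sufficient(text: Optional[str], min_chars: int = MIN_CONTENT_CHARS) -> bool:
    """Treat missing or very short text as insufficient content."""
    return text is not None and len(text.strip()) >= min_chars


def extract_with_cascade(url: str, fetchers: list[tuple[str, Fetcher]]) -> dict:
    """Try each (name, fetcher) pair in order, lightest first."""
    errors = {}
    for name, fetch in fetchers:
        try:
            text = fetch(url)
        except Exception as exc:  # a failed strategy must not abort the cascade
            errors[name] = repr(exc)
            continue
        if is_sufficient(text):
            return {"url": url, "status": "ok", "strategy": name,
                    "text": text, "errors": errors}
        errors[name] = "insufficient content"
    # Never silently drop a URL: report it as failed so it can be retried.
    return {"url": url, "status": "failed", "strategy": None,
            "text": None, "errors": errors}
```

In a real script the list would hold something like a requests-based fetcher first and a Playwright-based fetcher second; the errors dictionary records why each lighter strategy was skipped.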

Strategy Selection Decision Tree

| Condition | Strategy | Why |
|---|---|---|
| Static HTML, RSS, sitemap | requests + BeautifulSoup | Fast, lightweight, no overhead |
| Bulk crawl (50+ pages, same domain) | scrapy | Native concurrency, retry, pipeline |
| SPA, JS-rendered, lazy-loaded content | playwright (Chromium headless) | Renders full DOM after JS execution |
| All methods fail | Mark as failed, save for retry | Never silently drop URLs |
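
The table can be read as a small selection function. The sketch below is a hypothetical helper (choose_strategy is not defined in the source), and the precedence it applies, JavaScript rendering before bulk crawling, is an assumption, since the table does not say which condition wins when both hold.

```python
BULK_CRAWL_MIN_PAGES = 50  # "50+ pages" from the table


def choose_strategy(page_count: int, same_domain: bool, needs_js: bool) -> str:
    """Pick the initial extraction strategy for a job."""
    if needs_js:
        # SPA, JS-rendered, or lazy-loaded content needs a real browser.
        return "playwright"
    if same_domain and page_count >= BULK_CRAWL_MIN_PAGES:
        # Large same-domain crawls benefit from concurrency and retries.
        return "scrapy"
    # Default: the lightest method.
    return "requests+beautifulsoup"


if __name__ == "__main__":
    print(choose_strategy(page_count=1, same_domain=True, needs_js=False))
    print(choose_strategy(page_count=200, same_domain=True, needs_js=False))
    print(choose_strategy(page_count=200, same_domain=True, needs_js=True))
```

The last row of the table (all methods fail) is not a choice made up front, so it is handled at run time rather than in this function.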

2.1 Static HTTP (default — try first)

import requests
from bs4 import BeautifulSoup
from typing import Optional

...

Prompt examples

After installing Web Scraper, you can say things like the following to the AI to trigger it.

User: Help me get started with Web Scraper

A: Explains what Web Scraper does, walks through the setup, and runs a quick demo based on your current project

User: Use Web Scraper for web scraping and content comprehension: multi-strategy extra...

A: Invokes Web Scraper with the right parameters and returns the result directly in the conversation

User: What can I do with Web Scraper in my data & analytics workflow?

A: Lists the top use cases for Web Scraper, with example commands for each scenario

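FAQ

How do I install Web Scraper?

Place the skill folder in ~/.claude/skills/web-scraper/ (personal level, available to all projects) or in .claude/skills/web-scraper/ (project level). After restarting the AI client, invoke it explicitly with /web-scraper, or let the AI discover and use it automatically based on context.

Which AI platforms does Web Scraper support?

Web Scraper supports Claude, Cursor, and OpenClaw, and integrates with these AI platforms to extend their capabilities.

Is Web Scraper free?

Web Scraper is free to install and use. Check the repository for license information.

What does Web Scraper do?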

Web scraping and content comprehension agent — multi-strategy extraction with cascade fallback, news detection, boilerplate removal, structured metadata, and...

Which category does Web Scraper belong to?

Web Scraper belongs to the "Data & Analytics" category; skills in this category help AI agents perform specialized tasks in this domain.

Use cases

Getting Started with Web Scraper

Automate Data & Analytics Workflows with Web Scraper

Team Collaboration with Web Scraper
