Web scraping and content comprehension agent — multi-strategy extraction with cascade fallback, news detection, boilerplate removal, structured metadata, and...
Data source: ClawHub.
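Method 1: Command-line install (recommended)
Recommended (no need to install clawhub beforehand):
npx clawhub@latest --dir ~/.claude/skills install web-scraper
Or use the clawhub CLI (must be installed beforehand):
clawhub --dir ~/.claude/skills install web-scraper
Note: requires Node.js 18+. If you do not have Node, use Method 2 below to download the ZIP directly.
Method 2: Manual download and install (no Node required)
Download the ZIP, unzip it, place the folder at the path below, and restart the agent:
Install path
~/.claude/skills/web-scraper/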
---
name: web-scraper
description: Web scraping and content comprehension agent — multi-strategy extraction with cascade fallback, news detection, boilerplate removal, structured metadata, and LLM entity extraction
user-invocable: true
---
You are a senior data engineer specialized in web scraping and content extraction. You extract, clean, and comprehend web page content using a multi-strategy cascade approach: always start with the lightest method and escalate only when needed. You use LLMs exclusively on clean text (never raw HTML) for entity extraction and content comprehension. This skill creates Python scripts, YAML configs, and JSON output files. It never reads or modifies .env, .env.local, or credential files directly.
Credential scope: This skill generates Python scripts and YAML configs. It never makes direct API calls itself. The optional Stage 5 (LLM entity extraction) requires an OPENROUTER_API_KEY environment variable — but only in the generated scripts, not for the skill to function. All other stages (HTTP requests, HTML parsing, Playwright rendering) require no credentials.
Before writing any scraping script or running any command, you MUST complete this planning phase:
Survey the environment. Check (a) which scraping packages are installed (pip list | grep -E "requests|beautifulsoup4|scrapy|playwright|trafilatura"), (b) whether Playwright browsers are installed (npx playwright install --dry-run), (c) available disk space for output, and (d) whether OPENROUTER_API_KEY is set (only needed if Stage 5 LLM entity extraction will be used). Do NOT read .env, .env.local, or any file containing actual credential values.
Do NOT skip this protocol. A rushed scraping job wastes tokens, gets IP-blocked, and produces garbage data.
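The environment checks (a), (c), and (d) can also be run from Python instead of the shell. The following is a minimal sketch, not part of the skill itself: the function name `survey_environment` and the shape of the returned dictionary are assumptions made for illustration. It reports only whether OPENROUTER_API_KEY is set and never returns or prints its value. The Playwright browser check (b) is left to the CLI command quoted above.

```python
import importlib.util
import os
import shutil

# pip package name -> importable module name.
# beautifulsoup4 is imported as "bs4"; the others share their pip name.
PACKAGES = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "scrapy": "scrapy",
    "playwright": "playwright",
    "trafilatura": "trafilatura",
}


def survey_environment(output_dir: str = ".") -> dict:
    """Hypothetical helper: summarize what the planning phase needs to know."""
    # find_spec returns None for a top-level module that is not installed.
    packages = {
        pip_name: importlib.util.find_spec(module) is not None
        for pip_name, module in PACKAGES.items()
    }
    # Free space on the filesystem that holds the output directory.
    free_disk_gb = shutil.disk_usage(output_dir).free / 1e9
    # Only record presence of the key, never its value.
    llm_key_set = bool(os.environ.get("OPENROUTER_API_KEY"))
    return {
        "packages": packages,
        "free_disk_gb": round(free_disk_gb, 2),
        "llm_key_set": llm_key_set,
    }


if __name__ == "__main__":
    print(survey_environment("."))
```

A missing package in the report suggests which stages of the cascade are available before any script is written.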
---
URL or Domain
    |
    v
[STAGE 1] News/Article Detection
    |-- URL pattern analysis (/YYYY/MM/DD/, /news/, /article/)
    |-- Schema.org detection (NewsArticle, Article, BlogPosting)
    |-- Meta tag analysis (og:type = "article")
    |-- Content heuristics (byline, pub date, paragraph density)
    |-- Output: score 0-1 (threshold >= 0.4 to proceed)
    |
    v
[STAGE 2] Multi-Strategy Content Extraction (cascade)
    |-- Attempt 1: requests + BeautifulSoup (30s timeout)
    |     -> content sufficient? -> Stage 3
    |-- Attempt 2: Playwright headless Chromium (JS rendering)
    |     -> always passes to Stage 3
    |-- Attempt 3: Scrapy (if bulk crawl of many pages on same domain)
    |-- All failed -> mark as 'failed', save URL for retry
    |
    v
[STAGE 3] Cleaning and Normalization
    |-- Boilerplate removal (trafilatura: nav, footer, sidebar, ads)
    |-- Main article text extraction
    |-- Encoding normalization (NFKC, control chars, whitespace)
    |-- Chunking for LLM (if text > 3000 chars)
    |
    v
[STAGE 4] Structured Metadata Extraction
    |-- Author/byline (Schema.org Person, rel=author, meta author)
    |-- Publication date (article:published_time, datePublished)
    |-- Category/section (breadcrumb, articleSection)
    |-- Tags and keywords
    |-- Paywall detection (hard, soft, none)
    |
    v
[STAGE 5] Entity Extraction (LLM) — optional
    |-- People (name, role, context)
    |-- Organizations (companies, government, NGOs)
    |-- Locations (cities, countries, addresses)
    |-- Dates and events
    |-- Relationships between entities
    |
    v
[OUTPUT] Structured JSON with quality metadata
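The encoding normalization and chunking steps listed under Stage 3 can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the skill's own implementation: the names `normalize_text` and `chunk_text` are hypothetical, and splitting on paragraph boundaries is one possible chunking policy among several. Only the NFKC form and the 3000-character threshold come from the outline above.

```python
import re
import unicodedata

MAX_CHUNK_CHARS = 3000  # chunking threshold from the Stage 3 outline


def normalize_text(text: str) -> str:
    """NFKC-normalize, drop control characters, and tidy whitespace."""
    text = unicodedata.normalize("NFKC", text)
    # Remove characters in Unicode category C* (control, format, ...),
    # but keep newlines and tabs, which are also category Cc.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r" *\n *", "\n", text)     # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line
    return text.strip()


def chunk_text(text: str, max_chars: int = MAX_CHUNK_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    if len(text) <= max_chars:
        return [text]
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single paragraph longer than the limit is hard-split.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks
```

Chunks produced this way stay under the limit, so each one can be passed to the optional Stage 5 step as clean text rather than raw HTML.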
---
import re
from urllib.parse import urlparse

NEWS_URL_PATTERNS = [
    r'/\d{4}/\d{2}/\d{2}/',   # /2024/03/15/
    r'/\d{4}/\d{2}/',         # /2024/03/
    r'/(news|noticias|noticia|artigo|article|post)/',
    r'/(blog|press|imprensa|release)/',
    r'-\d{6,}$',              # slug ending in numeric ID
]

def is_news_url(url: str) -> bool:
    path = urlparse(url).path.lower()
    return any(re.search(p, path) for p in NEWS_URL_PATTERNS)

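import json
from bs4 import BeautifulSoup

NEWS_SCHEMA_TYPES = {
    'NewsArticle', 'Article', 'BlogPosting',
    'ReportageNewsArticle', 'AnalysisNewsArticle',
    'OpinionNewsArticle', 'ReviewNewsArticle'
}

def has_news_schema(html: str) -> bool:
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(tag.string or '{}')
        except json.JSONDecodeError:
            continue
        # JSON-LD may be a single object or a top-level array of objects
        nodes = data if isinstance(data, list) else [data]
        for node in nodes:
            if not isinstance(node, dict):
                continue
            items = node.get('@graph', [node])  # supports WordPress/Yoast @graph
            for item in items:
                if not isinstance(item, dict):
                    continue
                # @type may be a string or a list of strings
                types = item.get('@type')
                types = [types] if isinstance(types, str) else (types or [])
                if any(isinstance(t, str) and t in NEWS_SCHEMA_TYPES for t in types):
                    return True
    return False
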
def news_content_score(html: str) -> float:
    """Returns 0-1 probability of being a news article."""
    soup = BeautifulSoup(html, 'html.parser')
    score = 0.0

    # Has byline/author?
    if soup.select('[rel="author"], .byline, .author, [itemprop="author"]'):
        score += 0.3

    # Has publication date?
    if soup.select('time[datetime], [itemprop="datePublished"], [property="article:published_time"]'):
        score += 0.3

    # og:type = article?
    og_type = soup.find('meta', property='og:type')
    if og_type and 'article' in (og_type.get('content', '')).lower():
        score += 0.2

    # Has substantial text paragraphs?
    paragraphs = [p.get_text() for p in soup.find_all('p') if len(p.get_text()) > 100]
    if len(paragraphs) >= 3:
        score += 0.2

    return min(score, 1.0)

Decision rule: score >= 0.4 = proceed; score < 0.4 = discard or flag as uncertain.
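The decision rule can be made explicit in a few lines. This is a minimal sketch, assuming that low-scoring pages are flagged as uncertain rather than discarded; the rule above allows either. The names `DetectionResult` and `decide` are hypothetical and do not come from the skill. The threshold is inclusive, so a page that earns exactly 0.4 (for example, og:type plus paragraph density) still proceeds.

```python
from dataclasses import dataclass

NEWS_THRESHOLD = 0.4  # from the decision rule above


@dataclass
class DetectionResult:
    url: str
    score: float
    decision: str  # "proceed" or "uncertain"


def decide(url: str, score: float, threshold: float = NEWS_THRESHOLD) -> DetectionResult:
    """Apply the inclusive threshold: score >= threshold proceeds to Stage 2."""
    decision = "proceed" if score >= threshold else "uncertain"
    return DetectionResult(url=url, score=score, decision=decision)


if __name__ == "__main__":
    # Byline (0.3) + publication date (0.3) is enough on its own.
    print(decide("https://example.com/2024/03/15/story", 0.6))
    # og:type alone (0.2) is not.
    print(decide("https://example.com/about", 0.2))
```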
---
Golden rule: always try the lightest method first. Escalate only when content is insufficient.
| Condition | Strategy | Why |
|---|---|---|
| Static HTML, RSS, sitemap | requests + BeautifulSoup | Fast, lightweight, no overhead |
| Bulk crawl (50+ pages, same domain) | scrapy | Native concurrency, retry, pipeline |
| SPA, JS-rendered, lazy-loaded content | playwright (Chromium headless) | Renders full DOM after JS execution |
| All methods fail | Mark as failed, save for retry | Never silently drop URLs |
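The control flow behind this table can be sketched independently of any particular fetcher. The code below is an illustration under stated assumptions, not the skill's implementation: `extract_with_cascade`, `ExtractionResult`, and the 500-character sufficiency threshold are all hypothetical, and the two stub fetchers stand in for real requests and Playwright calls so the example runs without network access. As a simplification, the sufficiency check is applied to every strategy.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

MIN_CONTENT_CHARS = 500  # hypothetical "content sufficient" threshold

Fetcher = Callable[[str], str]


@dataclass
class ExtractionResult:
    url: str
    status: str                       # "ok" or "failed"
    strategy: Optional[str] = None    # name of the strategy that succeeded
    html: Optional[str] = None
    errors: list[str] = field(default_factory=list)


def is_sufficient(html: str) -> bool:
    return len(html.strip()) >= MIN_CONTENT_CHARS


def extract_with_cascade(
    url: str,
    strategies: list[tuple[str, Fetcher]],
    sufficient: Callable[[str], bool] = is_sufficient,
) -> ExtractionResult:
    """Try strategies from lightest to heaviest; never silently drop the URL."""
    errors: list[str] = []
    for name, fetch in strategies:
        try:
            html = fetch(url)
        except Exception as exc:  # any fetch error escalates to the next strategy
            errors.append(f"{name}: {exc}")
            continue
        if sufficient(html):
            return ExtractionResult(url, "ok", name, html, errors)
        errors.append(f"{name}: content insufficient")
    # All strategies exhausted: keep the URL and the reasons for a later retry.
    return ExtractionResult(url, "failed", errors=errors)


# Stub fetchers for demonstration only (no network access).
def fetch_static_stub(url: str) -> str:
    return "<html><body><div id='app'></div></body></html>"  # empty SPA shell


def fetch_rendered_stub(url: str) -> str:
    return "<html><body>" + "<p>rendered text</p>" * 50 + "</body></html>"


if __name__ == "__main__":
    result = extract_with_cascade(
        "https://example.com/story",
        [("requests", fetch_static_stub), ("playwright", fetch_rendered_stub)],
    )
    print(result.status, result.strategy, result.errors)
```

Keeping the error list on the result makes the "save for retry" row of the table actionable, since a later run can see why each lighter strategy was rejected.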
import requests
from bs4 import BeautifulSoup
from typing import Optional
...
After installing Web Scraper, you can trigger it by saying things like the following to the AI:
Help me get started with Web Scraper
Explains what Web Scraper does, walks through the setup, and runs a quick demo based on your current project
Use Web Scraper to web scraping and content comprehension agent — multi-strategy extra...
Invokes Web Scraper with the right parameters and returns the result directly in the conversation
What can I do with Web Scraper in my data & analytics workflow?
Lists the top use cases for Web Scraper, with example commands for each scenario
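Place the skill folder in ~/.claude/skills/web-scraper/ (personal level, available to all projects) or in .claude/skills/web-scraper/ (project level). After restarting the AI client, invoke it explicitly with /web-scraper, or let the AI discover and use it automatically based on context.
Web Scraper supports Claude, Cursor, and OpenClaw, and integrates with these AI platforms to extend their capabilities.
Web Scraper is free to install and use. See the repository for license information.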
Web Scraper belongs to the "Data & Analytics" category. Skills in this category help AI agents perform specialized tasks in this domain.
Automate my data & analytics tasks using Web Scraper
Identifies repetitive steps in your workflow and sets up Web Scraper to handle them automatically