Advanced web scraping with Scrapling — MCP-native guidance for extraction, crawling, and anti-bot handling. Use via mcporter (MCP) to call the `scrapling` MC...
Data source: ClawHub.
Method 1: Install from the command line (recommended)
Recommended (no need to install clawhub beforehand):
npx clawhub@latest --dir ~/.claude/skills install scrapling-web-scraping
Or use the clawhub CLI (must be installed beforehand):
clawhub --dir ~/.claude/skills install scrapling-web-scraping
⚠️ Requires Node.js 18+. No Node? Use Method 2 below to download the ZIP directly.
Method 2: Manual download (no Node required)
Download the ZIP, unzip it, place the folder at the path below, and restart the Agent:
Install path
~/.claude/skills/scrapling-web-scraping/
---
name: scrapling-mcp
description: Advanced web scraping with Scrapling — MCP-native guidance for extraction, crawling, and anti-bot handling. Use via mcporter (MCP) to call the scrapling MCP server for execution; this skill provides strategy, recipes, and best practices.
---
> Guidance Layer + MCP Integration
> Use this skill for strategy and patterns. For execution, call Scrapling's MCP server via mcporter.
# Quote the extras so shells such as zsh do not expand the brackets
pip install "scrapling[mcp]"
# Or for full features:
pip install "scrapling[mcp,playwright]"
python -m playwright install chromium
{
  "mcpServers": {
    "scrapling": {
      "command": "python",
      "args": ["-m", "scrapling.mcp"]
    }
  }
}
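If the client's MCP config file already lists other servers, the entry above has to be merged in rather than pasted over the file. A minimal sketch, assuming the config is a JSON object with a top-level `mcpServers` key as shown; the helper name `add_scrapling_server` and the `other` server are hypothetical and not part of Scrapling or mcporter:

```python
import json


def add_scrapling_server(config: dict) -> dict:
    """Return a copy of `config` with the scrapling entry added under mcpServers."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    # Same command and args as the config snippet above.
    servers["scrapling"] = {"command": "python", "args": ["-m", "scrapling.mcp"]}
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing config with one unrelated server.
existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
updated = add_scrapling_server(existing)
print(json.dumps(updated, indent=2))
```

The helper copies the dictionaries instead of mutating them, so the original config stays available if the write has to be rolled back.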
mcporter call scrapling fetch_page --url "https://example.com"
| Task | Tool | Example |
|------|------|---------|
| Fetch a page | mcporter | mcporter call scrapling fetch_page --url URL |
| Extract with CSS | mcporter | mcporter call scrapling css_select --selector ".title::text" |
| Which fetcher to use? | This skill | See "Fetcher Selection Guide" below |
| Anti-bot strategy? | This skill | See "Anti-Bot Escalation Ladder" |
| Complex crawl patterns? | This skill | See "Spider Recipes" |
┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│ Fetcher         │────▶│ DynamicFetcher   │────▶│ StealthyFetcher  │
│ (HTTP)          │     │ (Browser/JS)     │     │ (Anti-bot)       │
└─────────────────┘     └──────────────────┘     └──────────────────┘
  Fastest                 JS-rendered              Cloudflare,
  Static pages            SPAs, React/Vue          Turnstile, etc.
- Fetcher (10-100x faster)
- DynamicFetcher
- StealthyFetcher
- fetch_page: HTTP fetcher
- fetch_dynamic: browser-based with Playwright
- fetch_stealthy: anti-bot bypass mode
# MCP call: fetch_page with options
{
  "url": "https://example.com",
  "headers": {"User-Agent": "..."},
  "delay": 2.0
}
# Use sessions for cookie/state across requests
FetcherSession(impersonate="chrome") # TLS fingerprint spoofing
# MCP: fetch_stealthy
StealthyFetcher.fetch(
    url,
    headless=True,
    solve_cloudflare=True,  # Auto-solve Turnstile
    network_idle=True
)
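The escalation from the HTTP fetcher to the browser to stealth mode can be expressed as a loop that only moves up a level when the cheaper one is blocked. A minimal sketch, assuming each level is wrapped in a callable that returns `(status_code, body)`; the block-status list, the function names, and the stand-in fetchers are illustrative assumptions, not Scrapling APIs:

```python
from typing import Callable, Optional, Sequence, Tuple

# Status codes treated as "blocked" here; this set is an assumption, tune it per site.
BLOCK_STATUSES = {403, 429, 503}

# A fetch step takes a URL and returns (status_code, body_text).
FetchStep = Callable[[str], Tuple[int, str]]


def fetch_with_escalation(
    url: str, ladder: Sequence[Tuple[str, FetchStep]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each step in order; return (step_name, body) for the first unblocked one."""
    for name, step in ladder:
        status, body = step(url)
        if status not in BLOCK_STATUSES and body.strip():
            return name, body
    return None, None


# Stand-in steps so the sketch runs without a network or a browser.
def fake_http(url: str) -> Tuple[int, str]:
    return 403, "Access denied"


def fake_browser(url: str) -> Tuple[int, str]:
    return 200, "<html><h1>Products</h1></html>"


used, html = fetch_with_escalation(
    "https://example.com",
    [("fetch_page", fake_http), ("fetch_dynamic", fake_browser)],
)
print(used)
```

In real use each step would wrap the corresponding MCP tool or fetcher call; keeping the ladder as data makes it easy to skip straight to a higher level for sites known to be protected.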
See references/proxy-rotation.md
Scrapling can survive website redesigns using adaptive selectors:
# First run — save fingerprints
products = page.css('.product', auto_save=True)
# Later runs — auto-relocate if DOM changed
products = page.css('.product', adaptive=True)
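To make the idea of relocating an element after a redesign concrete, here is a toy illustration: store a fingerprint of the element, then pick the most similar candidate on the changed page. This is not Scrapling's actual algorithm; the dictionary element shape, the `fingerprint` and `relocate` helpers, and the 0.5 threshold are all assumptions made for the sketch.

```python
from difflib import SequenceMatcher


def fingerprint(element: dict) -> str:
    """Flatten tag, sorted attributes, and text into one comparable string."""
    attrs = " ".join(f"{k}={v}" for k, v in sorted(element.get("attrs", {}).items()))
    return f"{element['tag']} {attrs} {element.get('text', '')}"


def relocate(saved: dict, candidates: list, threshold: float = 0.5):
    """Return the candidate most similar to the saved element, or None."""
    best, best_score = None, threshold
    target = fingerprint(saved)
    for candidate in candidates:
        score = SequenceMatcher(None, target, fingerprint(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best


# Fingerprint saved on the first run (class was "product").
saved = {"tag": "div", "attrs": {"class": "product"}, "text": "Blue Widget $9.99"}
# After a redesign the class changed to "product-card".
redesigned_page = [
    {"tag": "nav", "attrs": {"class": "menu"}, "text": "Home About"},
    {"tag": "div", "attrs": {"class": "product-card"}, "text": "Blue Widget $9.99"},
]
match = relocate(saved, redesigned_page)
print(match["attrs"]["class"])
```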
MCP usage:
mcporter call scrapling css_select \
  --selector ".product" \
  --adaptive true \
  --auto-save true
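When these calls are generated from a script, building the argument list programmatically avoids shell-quoting mistakes. A minimal sketch, assuming the `--flag value` form and lowercase `true`/`false` booleans shown in the command above; `build_mcporter_call` is a hypothetical helper, not part of mcporter:

```python
import shlex


def build_mcporter_call(tool: str, **params) -> list:
    """Build argv for `mcporter call scrapling <tool> --flag value ...`."""
    argv = ["mcporter", "call", "scrapling", tool]
    for key, value in params.items():
        flag = "--" + key.replace("_", "-")  # auto_save -> --auto-save
        if isinstance(value, bool):
            value = "true" if value else "false"
        argv += [flag, str(value)]
    return argv


argv = build_mcporter_call("css_select", selector=".product", adaptive=True, auto_save=True)
print(shlex.join(argv))
```

The resulting list can be passed directly to `subprocess.run(argv)` without going through a shell.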
When to use Spiders vs direct fetching:
from scrapling.spiders import Spider, Response


class ProductSpider(Spider):
    name = "products"
    start_urls = ["https://example.com/products"]
    concurrent_requests = 10
    download_delay = 1.0

    async def parse(self, response: Response):
        for product in response.css('.product'):
            yield {
                "name": product.css('h2::text').get(),
                "price": product.css('.price::text').get(),
                "url": response.url
            }

        # Follow pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)


# Run with resume capability
result = ProductSpider(crawldir="./crawl_data").start()
result.items.to_jsonl("products.jsonl")
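Once the crawl has written `products.jsonl`, the items can be read back with the standard library alone. A minimal sketch, assuming one JSON object per line with the `name`, `price`, and `url` keys yielded above and prices formatted like `$9.99`; the sample rows are made up so the snippet runs on its own:

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Write a two-item sample file so the sketch does not need a real crawl.
sample = Path(tempfile.mkdtemp()) / "products.jsonl"
sample.write_text(
    '{"name": "Widget", "price": "$9.99", "url": "https://example.com/products"}\n'
    '{"name": "Gadget", "price": "$19.50", "url": "https://example.com/products"}\n',
    encoding="utf-8",
)

items = list(read_jsonl(sample))
# Strip the currency symbol before converting; assumes a leading "$".
prices = [float(item["price"].lstrip("$")) for item in items]
print(len(items), prices)
```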
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession


class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "/protected/" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast")
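The substring test in `parse` also matches `/protected/` when it appears in a query string. A slightly stricter variant, sketched here as a standalone function, checks only the URL path; `pick_session` is a hypothetical helper and the session ids are the ones registered above:

```python
from urllib.parse import urlparse


def pick_session(link: str) -> str:
    """Return "stealth" for URLs whose path contains /protected/, else "fast"."""
    path = urlparse(link).path
    return "stealth" if "/protected/" in path else "fast"


print(pick_session("https://example.com/protected/account"))
print(pick_session("https://example.com/blog?next=/protected/"))
```

Whether the stricter check is preferable depends on the site; the original substring test is simpler and errs on the side of using the stealth session.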
- crawldir parameter saves checkpoints
- async for item in spider.stream() for real-time processing
- to_json(), to_jsonl()
# Extract to markdown
scrapling extract get 'https://example.com' content.md
# Extract specific element
scrapling extract get 'https://example.com' content.txt \
  --css-selector '.article' \
  --impersonate 'chrome'
# Stealth mode
scrapling extract stealthy-fetch 'https://protected.com' content.md \
  --no-headless \
  --solve-cloudflare
scrapling shell
# Inside shell:
>>> page = Fetcher.get('https://example.com')
>>> page.css('h1::text').get()
>>> page.find_all('div', class_='item')
import re

# Find by attributes
page.find_all('div', {'class': 'product', 'data-id': True})
page.find_all('div', class_='product', id=re.compile(r'item-\d+'))
# Text search
page.find_by_text('Add to Cart', tag='button')
page.find_by_regex(r'\$\d+\.\d{2}')
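The price pattern above can be checked in isolation with the standard `re` module before using it against a page. The sample text is made up for illustration:

```python
import re

# A dollar sign, one or more digits, a dot, then exactly two digits.
PRICE = re.compile(r"\$\d+\.\d{2}")

text = "Widget $9.99, Gadget $120.00, Sale: $5"
matches = PRICE.findall(text)
print(matches)  # ['$9.99', '$120.00']
```

Note that `$5` is not matched because the pattern requires a decimal part; loosen it to `\$\d+(?:\.\d{2})?` if whole-dollar prices should count.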
# Navigation
first = page.css('.product')[0]
parent = first.parent
siblings = first.next_siblings
children = first.children
# Similarity
similar = first.find_similar() # Find visually/structurally similar elements
below = first.below_elements() # Elements below in DOM
# Get robust selector for any element
element = page.css('.product')[0]
selector = element.auto_css_selector() # Returns stable CSS path
xpath = element.auto_xpath()
from scrapling.fetchers import FetcherSession
from scrapling.spiders import ProxyRotator

# Cyclic rotation
rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://user:pass@proxy3:8080"
], strategy="cyclic")

# Use with any session
with FetcherSession(proxy=rotator.next()) as session:
    page = session.get('https://example.com')
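For intuition, cyclic rotation simply walks the proxy list in order and wraps around at the end. The standard-library sketch below illustrates that behavior only; it says nothing about how `ProxyRotator` is implemented, and the proxy URLs are placeholders:

```python
from itertools import cycle

proxies = ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"]
pool = cycle(proxies)

# Four picks from a three-proxy pool: the fourth wraps back to the first.
picked = [next(pool) for _ in range(4)]
print(picked)
```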
from scrapling.fetchers import DynamicSession

# Page numbers
for page_num in range(1, 11):
    url = f"https://example.com/products?page={page_num}"
    ...

# Next button (inside a spider's parse callback; use `if`, not `while`,
# because `response` does not change within a single callback)
if next_page := response.css('.next a::attr(href)').get():
    yield response.follow(next_page)

# Infinite scroll (DynamicFetcher)
with DynamicSession() as session:
    page = session.fetch(url)
    page.scroll_to_bottom()
    items = page.css('.item').getall()
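Two details of the pagination patterns above are easy to get wrong: encoding the page parameter and resolving a relative "next" link against the current URL. A minimal standard-library sketch; `page_urls` and `resolve_next` are hypothetical helper names:

```python
from urllib.parse import urlencode, urljoin


def page_urls(base: str, first: int, last: int) -> list:
    """Build ?page=N URLs for an inclusive page range."""
    return [f"{base}?{urlencode({'page': n})}" for n in range(first, last + 1)]


def resolve_next(current_url: str, href: str) -> str:
    """Turn a relative or query-only href into an absolute URL."""
    return urljoin(current_url, href)


urls = page_urls("https://example.com/products", 1, 3)
next_url = resolve_next("https://example.com/products?page=1", "?page=2")
print(urls[-1], next_url)
```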
from scrapling.fetchers import StealthySession

with StealthySession(headless=False) as session:
    # Login
    login_page = session.fetch('https://example.com/login')
    login_page.fill('input[name="username"]', 'user')
    login_page.fill('input[name="password"]', 'pass')
    login_page.click('button[type="submit"]')

    # Now session has cookies
    protected_page = session.fetch('https://example.com/dashboard')
    ...
After installing Scrapling MCP, you can trigger it with prompts like these:
Help me get started with Scrapling MCP
Explains what Scrapling MCP does, walks through the setup, and runs a quick demo based on your current project
Use Scrapling MCP to advanced web scraping with Scrapling — MCP-native guidance for extr...
Invokes Scrapling MCP with the right parameters and returns the result directly in the conversation
What can I do with Scrapling MCP in my data & analytics workflow?
Lists the top use cases for Scrapling MCP, with example commands for each scenario
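Place the skill folder in ~/.claude/skills/scrapling-web-scraping/ (personal level, available to all projects) or in .claude/skills/scrapling-web-scraping/ (project level). After restarting the AI client, invoke it explicitly with /scrapling-web-scraping, or let the AI discover and use it automatically from context.
Scrapling MCP supports Claude, Cursor, and OpenClaw, and integrates with these AI platforms to extend their capabilities.
Scrapling MCP is free to install and use. Check the repository for license information.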
Scrapling MCP belongs to the "Data & Analytics" category; skills in this category help AI agents carry out specialized tasks in this domain.
Automate my data & analytics tasks using Scrapling MCP
Identifies repetitive steps in your workflow and sets up Scrapling MCP to handle them automatically