Adaptive web scraping framework with anti-bot bypass and spider crawling.
Source: ClawHub.
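Method 1: Command-line install (recommended)
Recommended (no need to install clawhub beforehand):
npx clawhub@latest --dir ~/.claude/skills install scrapling
Or use the clawhub CLI (must be installed beforehand):
clawhub --dir ~/.claude/skills install scrapling
⚠️ Requires Node.js 18+. If you do not have Node, use Method 2 below to download the ZIP directly.
Method 2: Manual download and install (no Node required)
Download the ZIP, extract it, place the folder at the path below, then restart the Agent.
Install path:
~/.claude/skills/scrapling/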
---
name: scrapling
description: "Adaptive web scraping framework with anti-bot bypass and spider crawling."
version: "1.0.8"
metadata: {"openclaw":{"emoji":"🕷️","requires":{"bins":["python3"]}, "tags":["web-scraping", "crawling", "research", "automation"]}}
---
> "Effortless web scraping for the modern web."
---
---
# Core library (parser only)
pip install scrapling
# With fetchers (HTTP + browser automation) - RECOMMENDED
pip install "scrapling[fetchers]"
scrapling install
# With shell (CLI tools) - RECOMMENDED
pip install "scrapling[shell]"
# With AI (MCP server) - OPTIONAL
pip install "scrapling[ai]"
# Everything
pip install "scrapling[all]"
# Browser for stealth/dynamic mode
playwright install chromium
# For Cloudflare bypass (advanced)
pip install cloudscraper
---
Use Scrapling when:
Do NOT use for:
---
# Basic fetch (plain HTTP request)
from scrapling.fetchers import Fetcher
page = Fetcher.get('https://example.com')
# Extract content
title = page.css('h1::text').get()
paragraphs = page.css('p::text').getall()

# Stealthy fetch (headless browser with anti-bot evasion)
from scrapling.fetchers import StealthyFetcher
StealthyFetcher.adaptive = True
page = StealthyFetcher.fetch('https://example.com', headless=True, solve_cloudflare=True)

# Dynamic fetch (renders JavaScript, waits for the network to go idle)
from scrapling.fetchers import DynamicFetcher
page = DynamicFetcher.fetch('https://example.com', headless=True, network_idle=True)

# Adaptive selectors (survive markup changes)
from scrapling.fetchers import Fetcher
page = Fetcher.get('https://example.com')
# First scrape - saves selectors
items = page.css('.product', auto_save=True)
# Later - if site changes, use adaptive=True to relocate
items = page.css('.product', adaptive=True)

from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com"]
    concurrent_requests = 3

    async def parse(self, response: Response):
        for item in response.css('.item'):
            yield {"item": item.css('h2::text').get()}
        # Follow links
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

MySpider().start()
# Simple fetch to file
scrapling extract get https://example.com content.html
# Stealthy fetch (bypass anti-bot)
scrapling extract stealthy-fetch https://example.com content.html
# Interactive shell
scrapling shell https://example.com
---
from scrapling.fetchers import Fetcher
page = Fetcher.get('https://example.com/article')
# Try multiple selectors for title
title = (
    page.css('[itemprop="headline"]::text').get() or
    page.css('article h1::text').get() or
    page.css('h1::text').get()
)
# Get paragraphs
content = page.css('article p::text, .article-body p::text').getall()
print(f"Title: {title}")
print(f"Paragraphs: {len(content)}")

from scrapling.spiders import Spider, Response

class ResearchSpider(Spider):
    name = "research"
    start_urls = ["https://news.ycombinator.com"]
    concurrent_requests = 5

    async def parse(self, response: Response):
        for item in response.css('.titleline a::text').getall()[:10]:
            yield {"title": item, "source": "HN"}
        more = response.css('.morelink::attr(href)').get()
        if more:
            yield response.follow(more)

ResearchSpider().start()
Auto-crawl all pages on a domain by following internal links:
from scrapling.spiders import Spider, Response
from urllib.parse import urljoin, urlparse

class EasyCrawl(Spider):
    """Auto-crawl all pages on a domain."""
    name = "easy_crawl"
    start_urls = ["https://example.com"]
    concurrent_requests = 3

    def __init__(self):
        super().__init__()
        self.visited = set()

    async def parse(self, response: Response):
        # Extract content
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
            'h1': response.css('h1::text').get(),
        }
        # Follow internal links (limit to 50 pages)
        if len(self.visited) >= 50:
            return
        self.visited.add(response.url)
        domain = urlparse(response.url).netloc
        links = response.css('a::attr(href)').getall()[:20]
        for link in links:
            full_url = urljoin(response.url, link)
            # Skip external hosts and pages already visited
            if urlparse(full_url).netloc != domain:
                continue
            if full_url not in self.visited:
                yield response.follow(full_url)

# Usage
result = EasyCrawl()
result.start()
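
The link-following step above can also be factored into a small standalone helper and checked without any network access. The sketch below is illustrative only: `internal_links` is a hypothetical name that is not part of Scrapling, and it uses only the standard library. It resolves relative links, drops `#fragment` suffixes so anchors on the same page count once, and keeps only `http(s)` links on the same host.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def internal_links(page_url: str, hrefs: list[str], limit: int = 20) -> list[str]:
    """Resolve hrefs against page_url; keep unique same-host http(s) links.

    Hypothetical helper for illustration, not a Scrapling API.
    """
    base_host = urlparse(page_url).netloc.lower()
    seen: set[str] = set()
    result: list[str] = []
    for href in hrefs:
        # Resolve relative links, then strip the #fragment so anchors dedupe.
        absolute, _fragment = urldefrag(urljoin(page_url, href.strip()))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, tel:, and similar
        if parts.netloc.lower() != base_host:
            continue  # external site
        if absolute in seen:
            continue  # already collected
        seen.add(absolute)
        result.append(absolute)
        if len(result) >= limit:
            break
    return result


if __name__ == "__main__":
    sample = [
        "/about",
        "about#team",
        "https://other.example.org/x",
        "mailto:hi@example.com",
    ]
    print(internal_links("https://example.com/", sample))
```

Inside a spider's `parse` method, the list returned by such a helper could be passed one URL at a time to `response.follow`, which keeps the filtering logic separate from the crawling logic and easy to unit-test.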
Crawl pages from sitemap.xml (with fallback to link discovery):
from scrapling.fetchers import Fetcher
from scrapling.spiders import Spider, Response
from urllib.parse import urljoin, urlparse
import re

def get_sitemap_urls(url: str, max_urls: int = 100) -> list:
    """Extract URLs from sitemap.xml - also checks robots.txt."""
    parsed = urlparse(url)
    base_url = f"{parsed.scheme}://{parsed.netloc}"
    sitemap_urls = [
        f"{base_url}/sitemap.xml",
        f"{base_url}/sitemap-index.xml",
        f"{base_url}/sitemap_index.xml",
        f"{base_url}/sitemap-news.xml",
    ]
    all_urls = []
    # First check robots.txt for sitemap URL
    try:
        robots = Fetcher.get(f"{base_url}/robots.txt")
        if robots.status == 200:
            sitemap_in_robots = re.findall(r'Sitemap:\s*(\S+)', robots.text, re.IGNORECASE)
            for sm in sitemap_in_robots:
                sitemap_urls.insert(0, sm)
    except Exception:
        # robots.txt is optional; fall back to the default locations
        pass
    # Try each sitemap location
    for sitemap_url in sitemap_urls:
        try:
            page = Fetcher.get(sitemap_url, timeout=10)
            if page.status != 200:
                continue
            text = page.text
            # Check if it's XML
            if '<?xml' in text or '<urlset' in text or '<sitemapindex' in text:
                urls = re.findall(r'<loc>([^<]+)</loc>', text)
                all_urls.extend(urls[:max_urls])
                print(f"Found {len(urls)} URLs in {sitemap_url}")
        except Exception:
            # Skip sitemap locations that fail to load
            continue
    return list(set(all_urls))[:max_urls]
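
The two parsing steps inside `get_sitemap_urls` can be exercised offline on sample text. The sketch below is a standard-library illustration, not Scrapling code: `sitemaps_from_robots` and `locs_from_sitemap` are hypothetical names, and it swaps the `<loc>` regex for `xml.etree.ElementTree` so that a sitemap index (whose `<loc>` entries point at other sitemaps) can be told apart from a URL set (whose entries are pages).

```python
import re
import xml.etree.ElementTree as ET


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return URLs declared on 'Sitemap:' lines of a robots.txt body."""
    pattern = r"^[ \t]*Sitemap:[ \t]*(\S+)"
    return re.findall(pattern, robots_txt, re.IGNORECASE | re.MULTILINE)


def _local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}loc'."""
    return tag.rsplit("}", 1)[-1]


def locs_from_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return (kind, urls), where kind is the root tag name,
    typically 'urlset' or 'sitemapindex'."""
    root = ET.fromstring(xml_text.strip())
    urls = [
        el.text.strip()
        for el in root.iter()
        if _local_name(el.tag) == "loc" and el.text
    ]
    return _local_name(root.tag), urls


if __name__ == "__main__":
    robots = "User-agent: *\nSitemap: https://example.com/sitemap.xml\n"
    sitemap = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/</loc></url>"
        "<url><loc>https://example.com/about</loc></url>"
        "</urlset>"
    )
    print(sitemaps_from_robots(robots))
    print(locs_from_sitemap(sitemap))
```

When the returned kind is `sitemapindex`, the URLs would need to be fetched and parsed again rather than treated as pages; the regex-based function above does not make that distinction.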
...
After installing Scrapling, you can trigger it by saying things like the following to your AI:
Help me get started with Scrapling
Explains what Scrapling does, walks through the setup, and runs a quick demo based on your current project
Use Scrapling to adaptive web scraping framework with anti-bot bypass and spider cra...
Invokes Scrapling with the right parameters and returns the result directly in the conversation
What can I do with Scrapling in my data & analytics workflow?
Lists the top use cases for Scrapling, with example commands for each scenario
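Place the skill folder in ~/.claude/skills/scrapling/ (personal level, available to all projects) or in .claude/skills/scrapling/ (project level). After restarting the AI client, invoke it explicitly with /scrapling, or let the AI discover and use it automatically from context.
Scrapling supports Claude, Cursor, and OpenClaw, and integrates with these AI platforms to extend their capabilities.
Scrapling is free to install and use. See the repository for license information.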
Scrapling belongs to the "Data & Analytics" category; skills in this category help AI agents carry out specialized tasks in this domain.
Automate my data & analytics tasks using Scrapling
Identifies repetitive steps in your workflow and sets up Scrapling to handle them automatically