Scrapling

Adaptive web scraping framework with anti-bot bypass and spider crawling.

Source: ClawHub. View on ClawSkills.

About Scrapling

---
name: scrapling
description: "Adaptive web scraping framework with anti-bot bypass and spider crawling."
version: "1.0.8"
metadata: {"openclaw":{"emoji":"🕷️","requires":{"bins":["python3"]}, "tags":["web-scraping", "crawling", "research", "automation"]}}
---

Scrapling - Adaptive Web Scraping

> "Effortless web scraping for the modern web."

---

Credits

Core Library

Repository: https://github.com/D4Vinci/Scrapling
Author: D4Vinci (Karim Shoair)
License: BSD-3-Clause
Documentation: https://scrapling.readthedocs.io

API Reverse Engineering Methodology

GitHub: https://github.com/paoloanzn/free-solscan-api
X Post: https://x.com/paoloanzn/status/2026361234032046319
Author: @paoloanzn
Insight: "Web scraping is 80% reverse engineering"

---

Installation

# Core library (parser only)
pip install scrapling

# With fetchers (HTTP + browser automation) - RECOMMENDED
pip install "scrapling[fetchers]"
scrapling install

# With shell (CLI tools) - RECOMMENDED
pip install "scrapling[shell]"

# With AI (MCP server) - OPTIONAL
pip install "scrapling[ai]"

# Everything
pip install "scrapling[all]"

# Browser for stealth/dynamic mode
playwright install chromium

# For Cloudflare bypass (advanced)
pip install cloudscraper
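
To confirm which of the packages above actually landed in the current environment, one option is to probe for their import names. This is a minimal sketch using only the standard library. The helper name `missing_modules` is hypothetical, and the import names `scrapling`, `playwright`, and `cloudscraper` are assumed to match the pip package names listed above.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the packages installed above.
    wanted = ["scrapling", "playwright", "cloudscraper"]
    missing = missing_modules(wanted)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All modules found.")
```

The check only looks up the module on the import path without importing it, so it stays fast and does not launch a browser.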

---

Agent Instructions

When to Use Scrapling

Use Scrapling to:

Research topics from websites
Extract data from blogs, news sites, docs
Crawl multiple pages with Spider
Gather content for summaries
Extract brand data from any website
Reverse engineer APIs from websites

Do NOT use for:

X/Twitter (use x-tweet-fetcher skill)
Login-protected sites (unless credentials provided)
Paywalled content (respect robots.txt)
Sites that prohibit scraping in their TOS
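
The robots.txt rule mentioned above can be checked before any page is fetched. The sketch below uses Python's standard `urllib.robotparser` on the text of a robots.txt file. The helper name `is_allowed` and the sample rules are made up for illustration, and the check is independent of Scrapling itself.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example robots.txt body (made up for illustration).
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/blog/post"))     # True
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/data"))  # False
```

In practice the robots.txt text would come from a fetch of `/robots.txt` on the target host. A site's terms of service still need to be read separately, since robots.txt does not encode them.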

---

Quick Commands

1. Basic Fetch (Most Common)

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com')

# Extract content
title = page.css('h1::text').get()
paragraphs = page.css('p::text').getall()

2. Stealthy Fetch (Anti-Bot/Cloudflare)

from scrapling.fetchers import StealthyFetcher

StealthyFetcher.adaptive = True
page = StealthyFetcher.fetch('https://example.com', headless=True, solve_cloudflare=True)

3. Dynamic Fetch (Full Browser Automation)

from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch('https://example.com', headless=True, network_idle=True)

4. Adaptive Parsing (Survives Design Changes)

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com')

# First scrape - saves selectors
items = page.css('.product', auto_save=True)

# Later - if site changes, use adaptive=True to relocate
items = page.css('.product', adaptive=True)

5. Spider (Multiple Pages)

from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com"]
    concurrent_requests = 3
    
    async def parse(self, response: Response):
        for item in response.css('.item'):
            yield {"item": item.css('h2::text').get()}
        
        # Follow links
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

MySpider().start()

6. CLI Usage

# Simple fetch to file
scrapling extract get https://example.com content.html

# Stealthy fetch (bypass anti-bot)
scrapling extract stealthy-fetch https://example.com content.html

# Interactive shell
scrapling shell https://example.com

---

Common Patterns

Extract Article Content

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com/article')

# Try multiple selectors for title
title = (
    page.css('[itemprop="headline"]::text').get() or
    page.css('article h1::text').get() or
    page.css('h1::text').get()
)

# Get paragraphs
content = page.css('article p::text, .article-body p::text').getall()

print(f"Title: {title}")
print(f"Paragraphs: {len(content)}")
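
When the extracted paragraphs feed a summary, the list returned by `getall()` may contain empty or whitespace-heavy strings. The following sketch shows one way to tidy such a list. The helper name `join_paragraphs` and its `min_chars` parameter are hypothetical, and it operates on plain strings rather than on any Scrapling object.

```python
def join_paragraphs(paragraphs, min_chars=1):
    """Collapse whitespace in each paragraph, drop short ones, join with blank lines."""
    cleaned = []
    for text in paragraphs:
        normalized = " ".join(text.split())  # collapse runs of whitespace
        if len(normalized) >= min_chars:
            cleaned.append(normalized)
    return "\n\n".join(cleaned)


sample = ["  Hello\n world ", "", "\t", "Second."]
print(join_paragraphs(sample))
```

Raising `min_chars` is a simple way to skip short fragments such as bylines or captions.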

Research Multiple Pages

from scrapling.spiders import Spider, Response

class ResearchSpider(Spider):
    name = "research"
    start_urls = ["https://news.ycombinator.com"]
    concurrent_requests = 5
    
    async def parse(self, response: Response):
        for item in response.css('.titleline a::text').getall()[:10]:
            yield {"title": item, "source": "HN"}
        
        more = response.css('.morelink::attr(href)').get()
        if more:
            yield response.follow(more)

ResearchSpider().start()

Crawl Entire Site (Easy Mode)

Auto-crawl all pages on a domain by following internal links:

from scrapling.spiders import Spider, Response
from urllib.parse import urljoin, urlparse

class EasyCrawl(Spider):
    """Auto-crawl all pages on a domain."""
    
    name = "easy_crawl"
    start_urls = ["https://example.com"]
    concurrent_requests = 3
    
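    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.visited = set()
        # Only follow links that stay on the start domain
        self.allowed_host = urlparse(self.start_urls[0]).netloc
    
    async def parse(self, response: Response):
        self.visited.add(response.url)
        
        # Extract content
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
            'h1': response.css('h1::text').get(),
        }
        
        # Follow internal links (limit to 50 pages)
        links = response.css('a::attr(href)').getall()[:20]
        for link in links:
            if len(self.visited) >= 50:
                return
            full_url = urljoin(response.url, link)
            if urlparse(full_url).netloc != self.allowed_host:
                continue  # skip external links
            if full_url not in self.visited:
                # Mark as visited when scheduling so the cap counts queued pages
                self.visited.add(full_url)
                yield response.follow(full_url)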

# Usage
spider = EasyCrawl()
spider.start()
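
Following only internal links depends on resolving each `href` and comparing hosts, which can be tested without any network access. The sketch below does this with `urllib.parse` alone. The helper name `internal_links` is hypothetical, and comparing `netloc` values exactly is an assumption (it treats `www.example.com` and `example.com` as different hosts).

```python
from urllib.parse import urldefrag, urljoin, urlparse


def internal_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique same-host http(s) URLs."""
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        # Resolve relative links and drop "#fragment" parts
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel:, ...
        if parts.netloc != base_host:
            continue  # skip external sites
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


hrefs = ["intro", "/about#team", "https://other.org/x", "mailto:a@b.c", "intro#top"]
print(internal_links("https://example.com/docs/", hrefs))
# ['https://example.com/docs/intro', 'https://example.com/about']
```

Stripping fragments before de-duplication keeps `intro` and `intro#top` from being queued as two separate pages.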

Sitemap Crawl

Crawl pages from sitemap.xml (with fallback to link discovery):

from scrapling.fetchers import Fetcher
from scrapling.spiders import Spider, Response
from urllib.parse import urljoin, urlparse
import re

def get_sitemap_urls(url: str, max_urls: int = 100) -> list:
    """Extract URLs from sitemap.xml - also checks robots.txt."""
    
    parsed = urlparse(url)
    base_url = f"{parsed.scheme}://{parsed.netloc}"
    
    sitemap_urls = [
        f"{base_url}/sitemap.xml",
        f"{base_url}/sitemap-index.xml",
        f"{base_url}/sitemap_index.xml",
        f"{base_url}/sitemap-news.xml",
    ]
    
    all_urls = []
    
    # First check robots.txt for sitemap URL
    try:
        robots = Fetcher.get(f"{base_url}/robots.txt")
        if robots.status == 200:
            sitemap_in_robots = re.findall(r'Sitemap:\s*(\S+)', robots.text, re.IGNORECASE)
            for sm in sitemap_in_robots:
                sitemap_urls.insert(0, sm)
    except Exception:
        # robots.txt is optional; fall back to the default sitemap locations
        pass
    
    # Try each sitemap location
    for sitemap_url in sitemap_urls:
        try:
            page = Fetcher.get(sitemap_url, timeout=10)
            if page.status != 200:
                continue
            
            text = page.text
            
            # Check if it's XML
            if '<?xml' in text or '<urlset' in text or '<sitemapindex' in text:
                urls = re.findall(r'<loc>([^<]+)</loc>', text)
                all_urls.extend(urls[:max_urls])
                print(f"Found {len(urls)} URLs in {sitemap_url}")
        except Exception:
            # Skip sitemaps that fail to download or parse
            continue
    
    # De-duplicate while preserving discovery order
    return list(dict.fromkeys(all_urls))[:max_urls]

...
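
The regex in `get_sitemap_urls` collects every `<loc>` value, so it cannot tell page URLs apart from child sitemaps listed in a sitemap index. As an alternative sketch, the standard library's `xml.etree.ElementTree` can separate the two. The helper names `parse_sitemap` and `_local` and the sample documents are hypothetical, and the function expects the raw XML text of a sitemap as input.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text):
    """Return (page_urls, child_sitemaps) from a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    pages, children = [], []
    buckets = {"url": pages, "sitemap": children}
    for entry in root:
        bucket = buckets.get(_local(entry.tag))
        if bucket is None:
            continue  # ignore unknown elements
        for child in entry:
            if _local(child.tag) == "loc" and child.text:
                bucket.append(child.text.strip())
    return pages, children


# Sample documents (made up for illustration).
URLSET = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc> https://example.com/about </loc></url>"
    "</urlset>"
)
INDEX = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>"
    "</sitemapindex>"
)

print(parse_sitemap(URLSET))
print(parse_sitemap(INDEX))
```

Child sitemaps returned in the second list can be fetched and passed through `parse_sitemap` again to reach the actual page URLs.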

Prompt Examples

After installing Scrapling, you can trigger it by saying things like the following to the AI:

User: Help me get started with Scrapling

AI: Explains what Scrapling does, walks through the setup, and runs a quick demo based on your current project

User: Use Scrapling to adaptive web scraping framework with anti-bot bypass and spider cra...

AI: Invokes Scrapling with the right parameters and returns the result directly in the conversation

User: What can I do with Scrapling in my data & analytics workflow?

AI: Lists the top use cases for Scrapling, with example commands for each scenario

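FAQ

How do I install Scrapling?

Place the skill folder in ~/.claude/skills/scrapling/ (personal level, available to all projects) or in .claude/skills/scrapling/ (project level). After restarting the AI client, invoke it explicitly with /scrapling, or let the AI discover and use it automatically based on context.

Which AI platforms does Scrapling support?

Scrapling supports Claude, Cursor, and OpenClaw, and integrates with these AI platforms to extend their capabilities.

Is Scrapling free?

Scrapling is free to install and use. See the repository for license information.

What can Scrapling do?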