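A model-guided arXiv paper collection workflow that plans queries, fetches metadata, filters results for relevance, and merges and deduplicates them, with a per-run output language.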
Source: ClawHub.
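Method 1: Install from the command line (recommended)
Recommended (no need to install clawhub beforehand):
npx clawhub@latest --dir ~/.claude/skills install arxiv-search-collector
Or use the clawhub CLI (must be installed beforehand):
clawhub --dir ~/.claude/skills install arxiv-search-collector
Requires Node.js 18+. If Node is not installed, use Method 2 below to download the ZIP directly.
Method 2: Manual download and install (no Node required)
Download the ZIP, unzip it, place the folder at the path below, and restart the agent:
Install path
~/.claude/skills/arxiv-search-collector/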
---
name: arxiv-search-collector
description: "Model-driven arXiv retrieval workflow for building a paper set with a manual language parameter: initialize a run, fetch metadata for each model-designed query, let the model filter irrelevant items per query by keep indexes, then merge and dedupe into per-paper metadata directories. Use when query planning and relevance filtering should be done by the model, not rule-based heuristics."
---
Use this skill when you want model-led query planning and model-led relevance filtering.
Scripts are tools; the model performs the reasoning and makes the decisions. Initialize a collection run first:
python3 scripts/init_collection_run.py \
--output-root /path/to/data \
--topic "LLM applications in Lean 4 formalization" \
--keywords "Lean 4,LLM,formalization" \
--categories "cs.AI,cs.LO" \
--target-range 5-10 \
--lookback 30d \
--language English
This creates a run directory with task_meta.json, task_meta.md, query_results/, and query_selection/.
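As a quick sanity check after initialization, a sketch like the following could confirm that the run directory has the layout described above. The helper name `missing_run_entries` is hypothetical and not part of the skill's scripts; only the four entry names come from the description above.

```python
from pathlib import Path

# Entries that init_collection_run.py is described as creating.
EXPECTED_FILES = ("task_meta.json", "task_meta.md")
EXPECTED_DIRS = ("query_results", "query_selection")


def missing_run_entries(run_dir):
    """Return the expected entries absent from run_dir (hypothetical helper)."""
    root = Path(run_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

An empty list from `missing_run_entries("/path/to/run-dir")` would suggest the run was initialized as expected.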
--language must be set manually for each collection run.
When --language is non-English (for example Chinese), generated markdown files are written in that language:
- task_meta.md
- query_results/
-
- papers_index.md
Follow these rules before running per-query fetch:
Use 3 queries for small/medium targets (2-5, 5-10).
Use 4 queries for larger targets (10-50 or above).
Let target_max be the upper bound of the target range.
target_per_query = ceil(target_max / query_count).
max_results = target_per_query * 2 (or * 3 when recall is more important).
Example: target range 5-10, query count 3 -> target_per_query=4 -> each query fetches 8-12.
Use OR inside the same semantic group (synonyms), and AND across groups.
Use OR to increase recall.
- Example group A (model terms): LLM OR "large language model" OR AI.
- Example group B (Lean terms): "Lean 4" OR Lean OR "formal language".
Use AND to keep relevance.
- Example: (LLM-group) AND (Lean-group).
- (
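The sizing rules above (query count, target_per_query, max_results) can be sketched as a small function. This is an illustration rather than the skill's own code: the name `plan_fetch_size` is hypothetical, and the cut-off between "small/medium" and "larger" targets (an upper bound of 10) is an assumption read off the example ranges.

```python
import math


def plan_fetch_size(target_range, oversample=2):
    """Derive query count and per-query fetch size from a range like "5-10"."""
    target_max = int(target_range.split("-")[1])  # upper bound of the range
    # Assumption: ranges topping out at 10 or less count as small/medium.
    query_count = 3 if target_max <= 10 else 4
    target_per_query = math.ceil(target_max / query_count)
    return {
        "query_count": query_count,
        "target_per_query": target_per_query,
        "max_results": target_per_query * oversample,
    }


print(plan_fetch_size("5-10"))
print(plan_fetch_size("5-10", oversample=3))
```

For the range 5-10 this gives 3 queries, a target_per_query of 4, and a max_results of 8 (12 with oversample=3), which agrees with the worked example in the rules.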
Theme A: LLM applications in Lean 4 formalization
all:"LLM applications in Lean 4 formalization"
(all:"Lean 4" OR all:"Lean" OR all:"formal language") AND (all:"LLM" OR all:"large language model" OR all:"AI")
(all:"Lean" OR all:"formalization") AND (all:"LLM" OR all:"large language model") AND all:"theorem proving"
(all:"Lean" OR all:"proof assistant") AND (all:"AI" OR all:"LLM")
Theme B: agentic tool use for code generation
all:"agentic tool use code generation"
(all:"agentic" OR all:"autonomous agent") AND (all:"LLM" OR all:"large language model")
(all:"tool use" OR all:"function calling") AND (all:"coding assistant" OR all:"code generation")
Theme C: multimodal reasoning with retrieval
all:"multimodal reasoning retrieval"
(all:"multimodal" OR all:"vision language") AND (all:"retrieval" OR all:"RAG")
(all:"multimodal model" OR all:"vision language model") AND (all:"reasoning" OR all:"tool use")
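The OR-within-group, AND-across-groups pattern used in the theme examples can be generated mechanically. The sketch below assumes the `all:` field prefix shown in the examples; the helper names `or_group` and `build_query` are hypothetical.

```python
def or_group(terms, field="all"):
    """Join the synonyms of one semantic group with OR."""
    parts = [f'{field}:"{term}"' for term in terms]
    # A single-term group needs no parentheses.
    return parts[0] if len(parts) == 1 else "(" + " OR ".join(parts) + ")"


def build_query(groups, field="all"):
    """AND together several OR-groups to keep results on topic."""
    return " AND ".join(or_group(terms, field) for terms in groups)


lean_terms = ["Lean 4", "Lean", "formal language"]
model_terms = ["LLM", "large language model", "AI"]
print(build_query([lean_terms, model_terms]))
```

With these two groups the printed query matches the second Theme A example. Adding a third group narrows the result set further, while adding synonyms to an existing group widens it.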
The model defines queries manually, for example:
all:"Lean 4"
all:"LLM formalization"
all:"AI formal verification"
Recommended batch mode (safe defaults, serial execution):
python3 scripts/fetch_queries_batch.py \
--run-dir /path/to/run-dir \
--plan-json /path/to/query_plan.json
In batch mode, the script auto-applies:
--min-interval-sec 5
--retry-max 4
--retry-base-sec 5
--retry-max-sec 120
--retry-jitter-sec 1
rate-state file (/.runtime/arxiv_api_state.json) for throttling
max_results computed from target_range and query count (default oversample x2, cap 60)
task_meta.json
Minimal query_plan.json only needs label and query.
See references/query-plan-format.md.
You normally do not need to set fetch-control args manually.
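A minimal query_plan.json could be produced with a few lines of Python. Only the label and query fields are confirmed by the text above; the top-level layout (a plain JSON list) is an assumption, and references/query-plan-format.md remains the authority. The label ai-formal-verification and the helper name `make_query_plan` are hypothetical.

```python
import json


def make_query_plan(queries):
    """Build minimal plan entries, each with only a label and a query.

    Assumption: the plan is a JSON list of objects; the real layout is
    defined in references/query-plan-format.md and may differ.
    """
    return [{"label": label, "query": query} for label, query in queries]


plan = make_query_plan([
    ("lean4", 'all:"Lean 4"'),
    ("llm-formalization", 'all:"LLM formalization"'),
    ("ai-formal-verification", 'all:"AI formal verification"'),  # hypothetical label
])
print(json.dumps(plan, indent=2))
```

Writing the plan from code rather than by hand keeps the quoting of the nested double quotes in each query correct.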
If you need one-by-one manual fetch, run each query:
python3 scripts/fetch_query_metadata.py \
--run-dir /path/to/run-dir \
--label lean4 \
--query 'all:"Lean 4"' \
--max-results 30 \
--min-interval-sec 5 \
--retry-max 4 \
--language English
Output files:
query_results/ (indexed full metadata list)
query_results/ (human-readable preview)
Date range is applied directly in arXiv API search_query via submittedDate:[... TO ...].
No second local date-filter pass is performed.
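To illustrate how a lookback window can become a submittedDate clause, here is a small sketch. It assumes the arXiv API's YYYYMMDDHHMM timestamp format in GMT; the function name `with_date_range` and the way the clause is combined with the user query are assumptions, not the script's confirmed behavior.

```python
from datetime import datetime, timedelta, timezone


def with_date_range(query, lookback_days, now=None):
    """Append a submittedDate clause covering the last lookback_days days."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(days=lookback_days)
    fmt = "%Y%m%d%H%M"  # arXiv range filters use YYYYMMDDHHMM in GMT
    clause = f"submittedDate:[{start.strftime(fmt)} TO {end.strftime(fmt)}]"
    return f"({query}) AND {clause}"


fixed_now = datetime(2025, 1, 31, 12, 0, tzinfo=timezone.utc)
print(with_date_range('all:"Lean 4"', 30, now=fixed_now))
# → (all:"Lean 4") AND submittedDate:[202501011200 TO 202501311200]
```

Because the window is part of the search_query itself, the API only returns papers inside it, which is why no local date filter is needed afterwards.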
Rate-limit controls in fetch_query_metadata.py:
--min-interval-sec (default 5.0)
--retry-max (default 4)
--retry-base-sec (default 5.0)
--retry-max-sec (default 120.0)
--retry-jitter-sec (default 1.0)
--rate-state-path (optional override; default is /.runtime/arxiv_api_state.json )
--force to bypass cache and re-fetch
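The retry flags suggest a capped backoff with jitter. The sketch below shows one common way such parameters interact, using the documented defaults; the doubling-per-attempt formula and the function name `backoff_delays` are assumptions, since the script's exact schedule is not described here.

```python
import random


def backoff_delays(retry_max=4, base_sec=5.0, max_sec=120.0, jitter_sec=1.0, rng=random):
    """Return one wait time per retry: capped exponential growth plus jitter.

    Assumption: the delay doubles on each attempt; the real script may differ.
    """
    delays = []
    for attempt in range(retry_max):
        delay = min(max_sec, base_sec * (2 ** attempt))  # cap at --retry-max-sec
        delays.append(delay + rng.uniform(0.0, jitter_sec))  # spread out retries
    return delays


print(backoff_delays(jitter_sec=0.0))
# → [5.0, 10.0, 20.0, 40.0]
```

Under this assumption the default settings would wait a little over 75 seconds in total before giving up, on top of the 5-second minimum interval between requests.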
For each query list, the model reads indexed results and decides what to keep.
Use keep specs by index and/or arXiv ID when merging.
To explicitly drop one weak query in later iterations, set that label to an empty keep list in selection-json.
python3 scripts/merge_selected_papers.py \
--run-dir /path/to/run-dir \
--keep lean4:0,2,4 \
--keep llm-formalization:1,3 \
--language English
or with selection-json:
{
  "lean4-round1": [0, 2, 4],
  "lean4-round2": [],
  "formalization-round2": [1, 3, 5]
}
An empty list means this query label is intentionally dropped (keep 0).
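The --keep specs and the selection-json form appear to carry the same information, so one can be converted into the other. The following sketch handles integer indexes only (arXiv IDs are left out), and the function name `parse_keep_specs` is hypothetical.

```python
def parse_keep_specs(specs):
    """Turn ["lean4:0,2,4", ...] into {"lean4": [0, 2, 4], ...}.

    Only integer indexes are handled; arXiv IDs are out of scope here.
    """
    selection = {}
    for spec in specs:
        label, _, raw = spec.partition(":")
        # An empty index list means the label is intentionally dropped.
        selection[label] = [int(part) for part in raw.split(",") if part.strip()]
    return selection


print(parse_keep_specs(["lean4:0,2,4", "llm-formalization:1,3"]))
# → {'lean4': [0, 2, 4], 'llm-formalization': [1, 3]}
```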
This writes final outputs:
/metadata.json
/metadata.md
papers_index.json
papers_index.md
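The merge-and-dedupe step can be pictured roughly as follows. This is a sketch under stated assumptions: each query result is taken to be a dict with an "arxiv_id" key (a hypothetical schema), the IDs in the example are placeholders, and `merge_selected` is not the script's actual function.

```python
def merge_selected(results_by_label, selection):
    """Collect kept papers across queries, keeping the first copy of each ID.

    Assumption: each result is a dict with an "arxiv_id" key (hypothetical schema).
    """
    merged, seen = [], set()
    for label, keep_indexes in selection.items():
        for index in keep_indexes:  # an empty list drops the whole label
            paper = results_by_label[label][index]
            if paper["arxiv_id"] not in seen:
                seen.add(paper["arxiv_id"])
                merged.append(paper)
    return merged


# Placeholder IDs, not real papers.
results = {
    "lean4": [{"arxiv_id": "id-a"}, {"arxiv_id": "id-b"}],
    "llm-formalization": [{"arxiv_id": "id-b"}, {"arxiv_id": "id-c"}],
}
kept = merge_selected(results, {"lean4": [0, 1], "llm-formalization": [0, 1]})
print([paper["arxiv_id"] for paper in kept])
# → ['id-a', 'id-b', 'id-c']
```

A paper returned by two queries is kept once, which is why overlapping queries are safe to use when recall matters.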
If relevance is weak or final count is insufficient after Step 4, iterate:
Review papers_index.md and per-paper metadata quality.
Revise the queries (adjust OR terms, keep cross-group AND constraints).
python3 scripts/merge_selected_papers.py \
--run-dir /path/to/run-dir \
--incremental \
--selection-json /path/to/updated_selection.json \
--language English
Incremental behavior:
Previous selections are read from query_selection/selected_by_query.json.
Labels provided in selection-json override previous selections for those labels.
A label is dropped by setting it to [].
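The incremental behavior amounts to a label-by-label overlay of the new selection on the stored one. A minimal sketch, assuming both are plain label-to-index-list mappings as in the selection-json example; the function names `apply_incremental` and `active_labels` are hypothetical.

```python
def apply_incremental(previous, updates):
    """Overlay new selections on previous ones, label by label."""
    merged = dict(previous)
    merged.update(updates)  # labels present in updates replace earlier choices
    return merged


def active_labels(selection):
    """Labels that still contribute papers (an empty list means dropped)."""
    return [label for label, keep in selection.items() if keep]


previous = {"lean4-round1": [0, 2, 4], "lean4-round2": [1]}
updates = {"lean4-round2": [], "formalization-round2": [1, 3, 5]}
merged = apply_incremental(previous, updates)
print(merged)
print(active_labels(merged))
# → ['lean4-round1', 'formalization-round2']
```

Labels absent from the update keep their earlier selection, so each iteration only needs to list what changed.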
Stop retrying when:
...
After installing Arxiv Search Collector, you can trigger it by saying things like the following to the AI:
Help me get started with Arxiv Search Collector
Explains what Arxiv Search Collector does, walks through the setup, and runs a quick demo based on your current project
Use Arxiv Search Collector to model-guided arXiv paper collection workflow that plans queries, fe...
Invokes Arxiv Search Collector with the right parameters and returns the result directly in the conversation
What can I do with Arxiv Search Collector in my data & analytics workflow?
Lists the top use cases for Arxiv Search Collector, with example commands for each scenario
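Place the skill folder in ~/.claude/skills/arxiv-search-collector/ (personal level, available to all projects) or .claude/skills/arxiv-search-collector/ (project level). After restarting the AI client, invoke it explicitly with /arxiv-search-collector, or let the AI discover and use it automatically based on context.
Arxiv Search Collector supports Claude, Cursor, and OpenClaw, and integrates with these AI platforms to extend their capabilities.
Arxiv Search Collector is free to install and use. See the repository for license information.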
Arxiv Search Collector belongs to the "Data & Analytics" category; skills in this category help AI agents perform specialized tasks in this domain.
Automate my data & analytics tasks using Arxiv Search Collector
Identifies repetitive steps in your workflow and sets up Arxiv Search Collector to handle them automatically