
Arxiv Search Collector

arxiv-search-collector


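A model-guided arXiv paper collection workflow that plans queries, fetches metadata, filters for relevance, and merges and deduplicates the results, with a language parameter.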

Source: ClawHub. View on ClawSkills.



About Arxiv Search Collector

---
name: arxiv-search-collector
description: "Model-driven arXiv retrieval workflow for building a paper set with a manual language parameter: initialize a run, fetch metadata for each model-designed query, let the model filter irrelevant items per query by keep indexes, then merge and dedupe into per-paper metadata directories. Use when query planning and relevance filtering should be done by the model, not rule-based heuristics."
---

ArXiv Search Collector

Use this skill when you want model-led query planning and model-led relevance filtering.

Core Principle

Scripts are tools. The model performs the reasoning and decisions:

Expand the original topic into multiple focused queries.
Run one fetch command per query.
Read each query result list and decide keep indexes.
Merge kept items and dedupe with one script.

Step 1: Initialize Run

python3 scripts/init_collection_run.py \
  --output-root /path/to/data \
  --topic "LLM applications in Lean 4 formalization" \
  --keywords "Lean 4,LLM,formalization" \
  --categories "cs.AI,cs.LO" \
  --target-range 5-10 \
  --lookback 30d \
  --language English

This creates a run directory with task_meta.json, task_meta.md, query_results/, and query_selection/.

Language Parameter

--language must be set manually for each collection run.
Use the same language value across all collector scripts for consistency.
If --language is non-English (for example Chinese), generated markdown files are written in that language:

- task_meta.md
- query_results/.md
- /metadata.md
- papers_index.md

Query Writing Requirements

Follow these rules before running per-query fetch:

Determine query count from final target range.

Prefer 3 queries for small/medium targets (2-5, 5-10).
Prefer 4 queries for larger targets (10-50 or above).
Avoid writing too many low-quality queries.

Allocate target budget to each query, then oversample.

Let target_max be the upper bound in target range.
Compute target_per_query = ceil(target_max / query_count).
Fetch each query with max_results = target_per_query * 2 (or * 3 when recall is more important).
Example: target 5-10, query count 3 -> target_per_query=4 -> each query fetches 8-12.
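
The budget arithmetic above can be sketched in a few lines of Python. The function name `plan_fetch_budget` and its `oversample` parameter are illustrative and are not part of the skill's scripts; only the formula comes from the rules above.

```python
import math


def plan_fetch_budget(target_range: str, query_count: int, oversample: int = 2) -> tuple[int, int]:
    """Return (target_per_query, max_results) for a single query."""
    # "5-10" -> target_max = 10 (upper bound of the target range)
    target_max = int(target_range.split("-")[1])
    target_per_query = math.ceil(target_max / query_count)
    return target_per_query, target_per_query * oversample


print(plan_fetch_budget("5-10", 3))                # (4, 8)
print(plan_fetch_budget("5-10", 3, oversample=3))  # (4, 12)
```

With a 5-10 target and three queries this reproduces the 8-12 fetch size from the example.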

Keep one original-theme query, then add normalized/synonym expansions.

Query 1 keeps original topic wording.
Remaining queries use normalized terms and close synonyms.
Prefer concise noun phrases that match arXiv indexing behavior.

Use OR inside the same semantic group (synonyms), and AND across groups.

Same-group synonyms should be connected with OR to increase recall.

- Example group A (model terms): LLM OR "large language model" OR AI.
- Example group B (Lean terms): "Lean 4" OR Lean OR "formal language".

Different semantic groups should be connected with AND to keep relevance.

- Example: (LLM-group) AND (Lean-group).

Recommended pattern:

- (group A synonyms joined by OR) AND (group B synonyms joined by OR) [AND (further group)]
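
As a rough illustration of the OR-within-group, AND-across-groups rule, the helper below assembles a query string from lists of synonyms. `build_query` is a hypothetical helper, not one of the skill's scripts; the `all:` field prefix is taken from the query examples in this document.

```python
def build_query(groups: list[list[str]], field: str = "all") -> str:
    """Join synonyms with OR inside each group and join groups with AND."""
    clauses = []
    for terms in groups:
        joined = " OR ".join(f'{field}:"{term}"' for term in terms)
        # Parenthesize only multi-term groups so the ORs bind before the ANDs.
        clauses.append(f"({joined})" if len(terms) > 1 else joined)
    return " AND ".join(clauses)


query = build_query([
    ["Lean", "formalization"],
    ["LLM", "large language model"],
    ["theorem proving"],
])
print(query)
```

Keeping each synonym group as its own list makes it easy to broaden one group later without loosening the constraints between groups.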

Query Examples (arXiv API-ready)

Theme A: LLM applications in Lean 4 formalization

all:"LLM applications in Lean 4 formalization"
(all:"Lean 4" OR all:"Lean" OR all:"formal language") AND (all:"LLM" OR all:"large language model" OR all:"AI")
(all:"Lean" OR all:"formalization") AND (all:"LLM" OR all:"large language model") AND all:"theorem proving"
(all:"Lean" OR all:"proof assistant") AND (all:"AI" OR all:"LLM")

Theme B: agentic tool use for code generation

all:"agentic tool use code generation"
(all:"agentic" OR all:"autonomous agent") AND (all:"LLM" OR all:"large language model")
(all:"tool use" OR all:"function calling") AND (all:"coding assistant" OR all:"code generation")

Theme C: multimodal reasoning with retrieval

all:"multimodal reasoning retrieval"
(all:"multimodal" OR all:"vision language") AND (all:"retrieval" OR all:"RAG")
(all:"multimodal model" OR all:"vision language model") AND (all:"reasoning" OR all:"tool use")

Step 2: Fetch One Query at a Time

Model defines queries manually, for example:

all:"Lean 4"
all:"LLM formalization"
all:"AI formal verification"

Recommended batch mode (safe defaults, serial execution):

python3 scripts/fetch_queries_batch.py \
  --run-dir /path/to/run-dir \
  --plan-json /path/to/query_plan.json

In batch mode, the script auto-applies:

serial API calls
--min-interval-sec 5
--retry-max 4
--retry-base-sec 5
--retry-max-sec 120
--retry-jitter-sec 1
per-run rate-state file (/.runtime/arxiv_api_state.json) for throttling
auto max_results from target_range and query count (default oversample x2, cap 60)
default language/categories from task_meta.json

Minimal query_plan.json only needs label and query. See references/query-plan-format.md. You normally do not need to set fetch-control args manually.
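
A minimal plan could be generated as sketched below. The container shape (a JSON list of objects) is an assumption made for illustration; only the `label` and `query` keys are stated in this document, and references/query-plan-format.md remains the authoritative schema. `dump_plan` is a hypothetical helper.

```python
import json


def dump_plan(entries: list[dict]) -> str:
    """Validate plan entries and serialize them as JSON text."""
    labels = [entry["label"] for entry in entries]
    if len(labels) != len(set(labels)):
        raise ValueError("each query label must be unique")
    for entry in entries:
        if not entry.get("query"):
            raise ValueError(f"missing query for label {entry['label']!r}")
    return json.dumps(entries, indent=2, ensure_ascii=False)


# Assumed shape: a list of {"label", "query"} objects (see the caveat above).
plan = [
    {"label": "lean4", "query": 'all:"Lean 4"'},
    {"label": "llm-formalization", "query": 'all:"LLM formalization"'},
]
print(dump_plan(plan))
```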

If you need one-by-one manual fetch, run each query:

python3 scripts/fetch_query_metadata.py \
  --run-dir /path/to/run-dir \
  --label lean4 \
  --query 'all:"Lean 4"' \
  --max-results 30 \
  --min-interval-sec 5 \
  --retry-max 4 \
  --language English

Output files:

query_results/.json (indexed full metadata list)
query_results/.md (human-readable preview)

Date range is applied directly in arXiv API search_query via submittedDate:[... TO ...]. No second local date-filter pass is performed.

Rate-limit controls in fetch_query_metadata.py:

--min-interval-sec (default 5.0)
--retry-max (default 4)
--retry-base-sec (default 5.0)
--retry-max-sec (default 120.0)
--retry-jitter-sec (default 1.0)
--rate-state-path (optional override; default is /.runtime/arxiv_api_state.json)
--force to bypass cache and re-fetch
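
To show how these retry settings might interact, here is a sketch of capped exponential backoff with jitter using the default values listed above. The doubling schedule is an assumption; this document does not specify the exact formula used by fetch_query_metadata.py, and `retry_delay` is a hypothetical name.

```python
import random


def retry_delay(attempt: int, base_sec: float = 5.0, max_sec: float = 120.0,
                jitter_sec: float = 1.0, rng=random.random) -> float:
    """Delay before retry number `attempt` (0-based): capped doubling plus jitter."""
    delay = min(base_sec * (2 ** attempt), max_sec)
    # Jitter spreads out retries so repeated runs do not hit the API in lockstep.
    return delay + rng() * jitter_sec


# With --retry-max 4 there are at most four retries after the first attempt.
for attempt in range(4):
    print(attempt, round(retry_delay(attempt), 2))
```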

Step 3: Model Filters Relevance

For each query list, the model reads indexed results and decides what to keep.

Use keep specs by index and/or arXiv ID when merging. To explicitly drop one weak query in later iterations, set that label to an empty keep list in selection-json.

Step 4: Merge and Dedupe

python3 scripts/merge_selected_papers.py \
  --run-dir /path/to/run-dir \
  --keep lean4:0,2,4 \
  --keep llm-formalization:1,3 \
  --language English

or with selection-json:

{
  "lean4-round1": [0, 2, 4],
  "lean4-round2": [],
  "formalization-round2": [1, 3, 5]
}

An empty list means this query label is intentionally dropped (keep 0).

This writes final outputs:

/metadata.json
/metadata.md
papers_index.json
papers_index.md
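
The merge-and-dedupe step can be pictured with the sketch below. It is not the code of merge_selected_papers.py: the `arxiv_id` field name and the in-memory data shapes are assumptions, and only index-based keep lists are handled.

```python
def merge_selected(results_by_label: dict[str, list[dict]],
                   keep_by_label: dict[str, list[int]]) -> list[dict]:
    """Collect kept items per label and drop duplicates by arXiv ID."""
    merged: dict[str, dict] = {}
    for label, keep_indexes in keep_by_label.items():
        items = results_by_label[label]
        for index in keep_indexes:
            paper = items[index]
            # The first occurrence wins; later duplicates from other queries are ignored.
            merged.setdefault(paper["arxiv_id"], paper)
    return list(merged.values())


results = {
    "lean4": [{"arxiv_id": "A"}, {"arxiv_id": "B"}, {"arxiv_id": "C"}],
    "llm-formalization": [{"arxiv_id": "C"}, {"arxiv_id": "D"}],
}
papers = merge_selected(results, {"lean4": [0, 2], "llm-formalization": [0, 1]})
print([paper["arxiv_id"] for paper in papers])  # ['A', 'C', 'D']
```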

Step 5: Iterative Retry Loop (Incremental)

If relevance is weak or final count is insufficient after Step 4, iterate:

Review papers_index.md and per-paper metadata quality.
Adjust query plan (usually broaden with additional synonym OR terms, keep cross-group AND constraints).
Fetch additional query results with new labels.
Re-run merge in incremental mode:

python3 scripts/merge_selected_papers.py \
  --run-dir /path/to/run-dir \
  --incremental \
  --selection-json /path/to/updated_selection.json \
  --language English

Incremental behavior:

Previous label selections are loaded from query_selection/selected_by_query.json.
Labels provided in the new selection-json override previous selections for those labels.
New labels can be added.
Old labels can be dropped by setting [].
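
The override rules above amount to a dictionary update, sketched here under the assumption that selections are stored as a mapping from label to keep list, as in the selection-json example. The function names are illustrative.

```python
def apply_incremental(previous: dict[str, list[int]],
                      update: dict[str, list[int]]) -> dict[str, list[int]]:
    """Overlay new selections on previous ones, label by label."""
    merged = dict(previous)
    merged.update(update)  # provided labels override; unseen labels are added
    return merged


def active_labels(selection: dict[str, list[int]]) -> list[str]:
    """Labels with an empty keep list are treated as intentionally dropped."""
    return [label for label, keep in selection.items() if keep]


previous = {"lean4-round1": [0, 2, 4], "lean4-round2": [1]}
update = {"lean4-round2": [], "formalization-round2": [1, 3, 5]}
merged = apply_incremental(previous, update)
print(merged)
print(active_labels(merged))  # ['lean4-round1', 'formalization-round2']
```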

Stop retrying when:

relevance is acceptable, or
additional broadened queries mainly add low-relevance papers.

...

Prompt Examples

After installing Arxiv Search Collector, you can say things like these to the AI to trigger it:

User: Help me get started with Arxiv Search Collector

A: Explains what Arxiv Search Collector does, walks through the setup, and runs a quick demo based on your current project

User: Use Arxiv Search Collector to model-guided arXiv paper collection workflow that plans queries, fe...

A: Invokes Arxiv Search Collector with the right parameters and returns the result directly in the conversation

User: What can I do with Arxiv Search Collector in my data & analytics workflow?

A: Lists the top use cases for Arxiv Search Collector, with example commands for each scenario

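Frequently Asked Questions

How do I install Arxiv Search Collector?

Place the skill folder in ~/.claude/skills/arxiv-search-collector/ (personal level, available to all projects) or in .claude/skills/arxiv-search-collector/ (project level). After restarting the AI client, invoke it explicitly with /arxiv-search-collector, or let the AI discover and use it automatically from context.

Which AI platforms does Arxiv Search Collector support?

Arxiv Search Collector supports Claude, Cursor, and OpenClaw, and integrates with these platforms to extend their capabilities.

Is Arxiv Search Collector free?

Arxiv Search Collector is free to install and use. Check the repository for license information.

What does Arxiv Search Collector do?

It is a model-guided arXiv paper collection workflow that plans queries, fetches metadata, filters for relevance, and merges and deduplicates the results, with a language parameter.

Which category does Arxiv Search Collector belong to?

Arxiv Search Collector belongs to the "Data & Analytics" category, whose skills help AI agents carry out specialized tasks in this domain.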
