Vision-driven Android device automation using Midscene. Operates entirely from screenshots — no DOM or accessibility labels required. Can interact with all v...
Source: ClawHub.
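Method 1: Install from the command line (recommended)
Recommended (no need to install clawhub beforehand):
npx clawhub@latest --dir ~/.claude/skills install midscene-android-automation
Or use the clawhub CLI (must be installed beforehand):
clawhub --dir ~/.claude/skills install midscene-android-automation
Both commands require Node.js 18+. If Node.js is not installed, use Method 2 below to download the ZIP directly.
Method 2: Manual download and install (no Node.js required)
Download the ZIP, extract it, place the folder at the path below, and restart the Agent for it to take effect.
Install path:
~/.claude/skills/midscene-android-automation/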
---
name: android-device-automation
description: >
  Vision-driven Android device automation using Midscene.
  Operates entirely from screenshots; no DOM or accessibility labels required.
  Can interact with all visible elements on screen regardless of technology stack.
  Control Android devices with natural language commands via ADB.
  Perform taps, swipes, text input, app launches, screenshots, and more.

  Trigger keywords: android, phone, mobile app, tap, swipe, install app, open app on phone, android device, mobile automation, adb, launch app, mobile screen

  Powered by Midscene.js (https://midscenejs.com)
allowed-tools:
  - Bash
---
> CRITICAL RULES: VIOLATIONS WILL BREAK THE WORKFLOW
>
> 1. Never run midscene commands in the background. Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
> 2. Run only one midscene command at a time. Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
> 3. Allow enough time for each command to complete. Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex act commands may need even longer.
> 4. Always report task results before finishing. After completing the automation task, you MUST proactively summarize the results to the user, including key data found, actions completed, screenshots taken, and any relevant findings. Never silently end after the last automation step; the user expects a complete response in a single interaction.
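Rules 1 and 3 above can be illustrated with a small foreground wrapper. This is only a sketch: `run_step` and `STEP_TIMEOUT` are hypothetical names that are not part of Midscene, the 180-second default is an assumed value chosen to sit above the "about 1 minute" figure, and the GNU coreutils `timeout` utility is assumed to be installed.

```shell
# Sketch: run exactly one command in the foreground and wait for it to finish.
# run_step and STEP_TIMEOUT are hypothetical names, not part of the Midscene CLI.
# Assumes the GNU coreutils `timeout` utility is available on PATH.
STEP_TIMEOUT="${STEP_TIMEOUT:-180}"   # seconds; an assumed, generous limit

run_step() {
  # No trailing "&": the call blocks until the command exits or the limit is hit.
  timeout "$STEP_TIMEOUT" "$@"
  status=$?
  if [ "$status" -eq 124 ]; then
    # GNU timeout exits with 124 when it had to stop the command.
    echo "step timed out after ${STEP_TIMEOUT}s: $*" >&2
  fi
  return "$status"
}
```

A call such as `run_step npx @midscene/android@1 take_screenshot` would then block until the screenshot command returns, so its output can be read before the next step is chosen.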
Automate Android devices using npx @midscene/android@1. Each CLI command maps directly to an MCP tool — you (the AI agent) act as the brain, deciding which actions to take based on screenshots.
Midscene requires models with strong visual grounding capabilities. The following environment variables must be configured — either as system environment variables or in a .env file in the current working directory (Midscene loads .env automatically):
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
Example: Gemini (Gemini-3-Flash)
MIDSCENE_MODEL_API_KEY="your-google-api-key"
MIDSCENE_MODEL_NAME="gemini-3-flash"
MIDSCENE_MODEL_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
MIDSCENE_MODEL_FAMILY="gemini"
Example: Qwen 3.5
MIDSCENE_MODEL_API_KEY="your-aliyun-api-key"
MIDSCENE_MODEL_NAME="qwen3.5-plus"
MIDSCENE_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
MIDSCENE_MODEL_FAMILY="qwen3.5"
MIDSCENE_MODEL_REASONING_ENABLED="false"
# If using OpenRouter, set:
# MIDSCENE_MODEL_API_KEY="your-openrouter-api-key"
# MIDSCENE_MODEL_NAME="qwen/qwen3.5-plus"
# MIDSCENE_MODEL_BASE_URL="https://openrouter.ai/api/v1"
Example: Doubao Seed 2.0 Lite
MIDSCENE_MODEL_API_KEY="your-doubao-api-key"
MIDSCENE_MODEL_NAME="doubao-seed-2-0-lite"
MIDSCENE_MODEL_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
MIDSCENE_MODEL_FAMILY="doubao-seed"
Commonly used models: Doubao Seed 2.0 Lite, Qwen 3.5, Zhipu GLM-4.6V, Gemini-3-Pro, Gemini-3-Flash.
If the model is not configured, ask the user to set it up. See Model Configuration for supported providers.
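One way to decide whether the model is configured is to check the four variables listed above before running any command. The sketch below is an illustration only: `missing_midscene_vars` is a hypothetical helper, and it inspects only the current shell environment, so values that exist solely in a .env file will be reported as missing even though Midscene would load them.

```shell
# Sketch: print the name of each required Midscene model variable that is
# unset or empty in the current shell environment.
# missing_midscene_vars is a hypothetical helper, not part of the Midscene CLI.
# It does NOT read .env files.
missing_midscene_vars() (
  # The body runs in a subshell so the loop variables do not leak out.
  for name in MIDSCENE_MODEL_API_KEY MIDSCENE_MODEL_NAME \
              MIDSCENE_MODEL_BASE_URL MIDSCENE_MODEL_FAMILY; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
    fi
  done
)
```

If the function prints nothing, all four variables are present in the environment; otherwise the printed names are the ones to ask the user about.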
# connect to a device
npx @midscene/android@1 connect
# or connect to a specific device by its ID
npx @midscene/android@1 connect --deviceId emulator-5554
# capture the current screen
npx @midscene/android@1 take_screenshot
After taking a screenshot, read the saved image file to understand the current screen state before deciding the next action.
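When more than one device is attached, the value for --deviceId has to be chosen first. The sketch below assumes that this value is the serial shown by `adb devices` (as `emulator-5554` suggests); `list_ready_devices` is a hypothetical helper that filters that command's output and is not part of Midscene or ADB.

```shell
# Sketch: print the serials of devices that `adb devices` reports as ready.
# list_ready_devices is a hypothetical helper; it reads `adb devices` output on stdin.
list_ready_devices() {
  # Each device row is "<serial><TAB><state>"; keep rows whose state is "device".
  # The header line and states such as "unauthorized" or "offline" are dropped.
  awk '$2 == "device" { print $1 }'
}
```

Running `adb devices | list_ready_devices` prints one serial per line, and any of them can be passed to the connect command.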
Use act to interact with the device and get the result. It autonomously handles all UI interactions internally — tapping, typing, scrolling, swiping, waiting, and navigating — so you should give it complex, high-level tasks as a whole rather than breaking them into small steps. Describe what you want to do and the desired effect in natural language:
# specific instructions
npx @midscene/android@1 act --prompt "type hello world in the search field and press Enter"
npx @midscene/android@1 act --prompt "long press the message bubble and tap Delete in the popup menu"
# or target-driven instructions
npx @midscene/android@1 act --prompt "open Settings and navigate to Wi-Fi settings, tell me the connected network name"
# disconnect from the device when the task is done
npx @midscene/android@1 disconnect
Since CLI commands are stateless between invocations, follow this pattern:
- Use act to perform the desired action or target-driven instructions.
- Launch the target app with ADB (adb shell am start -n with the app's component name) before invoking any midscene commands. Then take a screenshot to confirm the app is actually in the foreground. Only after visual confirmation should you proceed with UI automation using this skill. ADB commands are significantly faster than using midscene to navigate to and open apps.
- Be specific about UI elements: say "the Wi-Fi toggle switch on the right side" instead of "the toggle".
- Describe locations when possible (for example, "the search icon at the top right", "the third item in the list").
- Batch related operations into a single act command: When performing consecutive operations within the same app, combine them into one act prompt instead of splitting them into separate commands. For example, "open Settings, tap Wi-Fi, and toggle it on" should be a single act call, not three. This reduces round-trips, avoids unnecessary screenshot-analyze cycles, and is significantly faster.
Example (popup menu interaction):
npx @midscene/android@1 act --prompt "long press the message bubble and tap Delete in the popup menu"
npx @midscene/android@1 take_screenshot
Example — Form interaction:
npx @midscene/android@1 act --prompt "fill in the username field with 'testuser' and the password field with 'pass123', then tap the Login button"
npx @midscene/android@1 take_screenshot
...
After installing Midscene Automations Skills for Android, you can trigger it by saying things like these to the AI:
Help me get started with Midscene Automations Skills for Android
Explains what Midscene Automations Skills for Android does, walks through the setup, and runs a quick demo based on your current project
Use Midscene Automations Skills for Android for vision-driven Android device automation with Midscene
Invokes Midscene Automations Skills for Android with the right parameters and returns the result directly in the conversation
What can I do with Midscene Automations Skills for Android in my developer & devops workflow?
Lists the top use cases for Midscene Automations Skills for Android, with example commands for each scenario
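Place the skill folder in ~/.claude/skills/midscene-android-automation/ (personal level, available to all projects) or in .claude/skills/midscene-android-automation/ (project level). After restarting the AI client, invoke it explicitly with /midscene-android-automation, or let the AI discover and use it automatically based on context.
Midscene Automations Skills for Android supports Claude, Cursor, and OpenClaw, and integrates with these AI platforms to extend their capabilities.
Midscene Automations Skills for Android is free to install and use. Check the repository for license information.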
Automate my developer & devops tasks using Midscene Automations Skills for Android
Identifies repetitive steps in your workflow and sets up Midscene Automations Skills for Android to handle them automatically
Midscene Automations Skills for Android belongs to the "Developer & DevOps" category, whose skills help AI agents perform specialized tasks in this domain.