
Midscene Automations Skills for Android

midscene-android-automation

Vision-driven Android device automation using Midscene. Operates entirely from screenshots — no DOM or accessibility labels required. Can interact with all v...

Source: ClawHub.


About Midscene Automations Skills for Android

---
name: android-device-automation
description: >
  Vision-driven Android device automation using Midscene. Operates entirely from screenshots; no DOM or accessibility labels required. Can interact with all visible elements on screen regardless of technology stack. Control Android devices with natural language commands via ADB. Perform taps, swipes, text input, app launches, screenshots, and more.

  Trigger keywords: android, phone, mobile app, tap, swipe, install app, open app on phone, android device, mobile automation, adb, launch app, mobile screen

  Powered by Midscene.js (https://midscenejs.com)
allowed-tools:
  - Bash
---

Android Device Automation

> CRITICAL RULES: VIOLATIONS WILL BREAK THE WORKFLOW
>
> 1. Never run midscene commands in the background. Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
> 2. Run only one midscene command at a time. Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
> 3. Allow enough time for each command to complete. Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex act commands may need even longer.
> 4. Always report task results before finishing. After completing the automation task, you MUST proactively summarize the results to the user, including key data found, actions completed, screenshots taken, and any relevant findings. Never silently end after the last automation step; the user expects a complete response in a single interaction.

Automate Android devices using npx @midscene/android@1. Each CLI command maps directly to an MCP tool — you (the AI agent) act as the brain, deciding which actions to take based on screenshots.

Prerequisites

Midscene requires models with strong visual grounding capabilities. The following environment variables must be configured — either as system environment variables or in a .env file in the current working directory (Midscene loads .env automatically):

MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"

Example: Gemini (Gemini-3-Flash)

MIDSCENE_MODEL_API_KEY="your-google-api-key"
MIDSCENE_MODEL_NAME="gemini-3-flash"
MIDSCENE_MODEL_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
MIDSCENE_MODEL_FAMILY="gemini"

Example: Qwen 3.5

MIDSCENE_MODEL_API_KEY="your-aliyun-api-key"
MIDSCENE_MODEL_NAME="qwen3.5-plus"
MIDSCENE_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
MIDSCENE_MODEL_FAMILY="qwen3.5"
MIDSCENE_MODEL_REASONING_ENABLED="false"
# If using OpenRouter, set:
# MIDSCENE_MODEL_API_KEY="your-openrouter-api-key"
# MIDSCENE_MODEL_NAME="qwen/qwen3.5-plus"
# MIDSCENE_MODEL_BASE_URL="https://openrouter.ai/api/v1"

Example: Doubao Seed 2.0 Lite

MIDSCENE_MODEL_API_KEY="your-doubao-api-key"
MIDSCENE_MODEL_NAME="doubao-seed-2-0-lite"
MIDSCENE_MODEL_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
MIDSCENE_MODEL_FAMILY="doubao-seed"

Commonly used models: Doubao Seed 2.0 Lite, Qwen 3.5, Zhipu GLM-4.6V, Gemini-3-Pro, Gemini-3-Flash.

If the model is not configured, ask the user to set it up. See Model Configuration for supported providers.
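
Before connecting, it can help to confirm that the four variables listed above are actually set. The following is a minimal sketch, not part of Midscene: check_midscene_env is a hypothetical helper name, and it inspects only the current shell environment, so values defined solely in a .env file will be reported as missing unless they have been loaded into the shell first.

```shell
# Hypothetical helper (not a Midscene command): report which of the
# required MIDSCENE_MODEL_* variables are unset or empty in this shell.
check_midscene_env() {
  missing=0
  for name in MIDSCENE_MODEL_API_KEY MIDSCENE_MODEL_NAME \
              MIDSCENE_MODEL_BASE_URL MIDSCENE_MODEL_FAMILY; do
    # Indirect lookup: read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  # Exit status 0 when everything is set, 1 otherwise.
  return "$missing"
}

# Usage:
#   check_midscene_env || echo "ask the user to configure the model first"
```

A non-zero exit status is the cue to stop and ask the user to set up the model rather than proceeding to connect.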

Commands

Connect to Device

npx @midscene/android@1 connect
npx @midscene/android@1 connect --deviceId emulator-5554
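
When more than one device is attached, the value for --deviceId has to come from somewhere. The sketch below assumes the ID is the serial reported by adb devices, which matches the emulator-5554 form used above but is not confirmed by this document. list_ready_devices is a hypothetical helper that filters that output down to devices in the ready state.

```shell
# Hypothetical helper: print the serial of every attached device whose
# state is "device" (ready). It reads `adb devices` output on stdin,
# skipping the "List of devices attached" header line.
list_ready_devices() {
  awk 'NR > 1 && $2 == "device" { print $1 }'
}

# Usage:
#   adb devices | list_ready_devices
```

Devices shown as unauthorized or offline are filtered out, since they cannot be driven until that state is resolved on the device itself.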

Take Screenshot

npx @midscene/android@1 take_screenshot

After taking a screenshot, read the saved image file to understand the current screen state before deciding the next action.

Perform Action

Use act to interact with the device and get the result. It autonomously handles all UI interactions internally — tapping, typing, scrolling, swiping, waiting, and navigating — so you should give it complex, high-level tasks as a whole rather than breaking them into small steps. Describe what you want to do and the desired effect in natural language:

# specific instructions
npx @midscene/android@1 act --prompt "type hello world in the search field and press Enter"
npx @midscene/android@1 act --prompt "long press the message bubble and tap Delete in the popup menu"

# or target-driven instructions
npx @midscene/android@1 act --prompt "open Settings and navigate to Wi-Fi settings, tell me the connected network name"

Disconnect

npx @midscene/android@1 disconnect

Workflow Pattern

Since CLI commands are stateless between invocations, follow this pattern:

1. Connect to establish a session.
2. Launch the target app, then take a screenshot to confirm that it is running and visible on screen.
3. Use act to carry out the task, phrased either as specific instructions or as a target-driven goal.
4. Disconnect when done.
5. Report results: summarize what was accomplished, present key findings and data extracted during the task, and list any generated files (screenshots, logs, etc.) with their paths.
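
The pattern above can be sketched as a dry-run helper that only prints the commands in order, leaving each one to be run on its own and its output read before the next, as the critical rules require. The name midscene_plan and its two arguments are hypothetical. The component passed to adb shell am start -n must be supplied by the caller, and the final reporting step is left to the agent because it has no command.

```shell
# Hypothetical dry-run helper: print (do not execute) the workflow commands.
#   $1: app component for `adb shell am start -n` (supplied by the caller)
#   $2: natural-language prompt for `act` (must not contain double quotes)
midscene_plan() {
  printf '%s\n' \
    "npx @midscene/android@1 connect" \
    "adb shell am start -n $1" \
    "npx @midscene/android@1 take_screenshot" \
    "npx @midscene/android@1 act --prompt \"$2\"" \
    "npx @midscene/android@1 disconnect"
}

# Usage (the component name here is a placeholder, not a real app):
#   midscene_plan com.example.app/.MainActivity "open Settings and navigate to Wi-Fi settings"
```

Printing rather than executing is deliberate: a script that ran all five commands back to back would chain them and skip the screenshot check between steps.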

Best Practices

Bring the target app to the foreground before using this skill: For best efficiency, launch the app using ADB (e.g., adb shell am start -n ) before invoking any midscene commands. Then take a screenshot to confirm the app is actually in the foreground. Only after visual confirmation should you proceed with UI automation using this skill. ADB commands are significantly faster than using midscene to navigate to and open apps.
Be specific about UI elements: Instead of vague descriptions, provide clear, specific details. Say "the Wi-Fi toggle switch on the right side" instead of "the toggle".
Describe locations when possible: Help target elements by describing their position (e.g., "the search icon at the top right", "the third item in the list").
Never run in background: Every midscene command must run synchronously — background execution breaks the screenshot-analyze-act loop.
Batch related operations into a single act command: When performing consecutive operations within the same app, combine them into one act prompt instead of splitting them into separate commands. For example, "open Settings, tap Wi-Fi, and toggle it on" should be a single act call, not three. This reduces round-trips, avoids unnecessary screenshot-analyze cycles, and is significantly faster.
Always report results after completion: After finishing the automation task, you MUST proactively present the results to the user without waiting for them to ask. This includes: (1) the answer to the user's original question or the outcome of the requested task, (2) key data extracted or observed during execution, (3) screenshots and other generated files with their paths, (4) a brief summary of steps taken. Do NOT silently finish after the last automation command — the user expects complete results in a single interaction.

Example — Popup menu interaction:

npx @midscene/android@1 act --prompt "long press the message bubble and tap Delete in the popup menu"
npx @midscene/android@1 take_screenshot

Example — Form interaction:

npx @midscene/android@1 act --prompt "fill in the username field with 'testuser' and the password field with 'pass123', then tap the Login button"
npx @midscene/android@1 take_screenshot

Troubleshooting

...

Prompt Examples

After installing Midscene Automations Skills for Android, you can say things like the following to the AI to trigger it:

User: Help me get started with Midscene Automations Skills for Android

A: Explains what Midscene Automations Skills for Android does, walks through the setup, and runs a quick demo based on your current project

User: Use Midscene Automations Skills for Android to run vision-driven automation on my Android device

A: Invokes Midscene Automations Skills for Android with the right parameters and returns the result directly in the conversation

User: What can I do with Midscene Automations Skills for Android in my developer & devops workflow?

A: Lists the top use cases for Midscene Automations Skills for Android, with example commands for each scenario

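FAQ

How do I install Midscene Automations Skills for Android?

Place the skill folder in ~/.claude/skills/midscene-android-automation/ (personal level, available to all projects) or in .claude/skills/midscene-android-automation/ (project level). After restarting the AI client, invoke it explicitly with /midscene-android-automation, or let the AI discover and use it automatically based on context.

Which AI platforms does Midscene Automations Skills for Android support?

Midscene Automations Skills for Android supports Claude, Cursor, and OpenClaw, and integrates with these platforms to extend their capabilities.

Is Midscene Automations Skills for Android free?

Midscene Automations Skills for Android is free to install and use. Check the repository for license information.

What does Midscene Automations Skills for Android do?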

Vision-driven Android device automation using Midscene. Operates entirely from screenshots — no DOM or accessibility labels required. Can interact with all v...
