Design unified agent architectures for desktop, web, and mobile environments, targeting state-of-the-art (SOTA) performance through orchestrator-subagent coordination, visual grounding, and failure recovery.
Build with frameworks like LangGraph or custom (e.g., PyAutoGUI + vision models).
pip install langgraph opencv-python pillow transformers
# For platforms: selenium (web), appium (mobile), pyautogui (desktop)
from typing import TypedDict

from langgraph.graph import StateGraph, START, END
from transformers import pipeline
import pyautogui

# Desktop example
# Visual grounder: captions the current screenshot
vision = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

class AgentState(TypedDict):
    goal: str
    current_screen: str  # screenshot path
    actions: list
    observations: list

def reason(state: AgentState):
    # Analyze the screen by captioning the latest screenshot
    caption = vision(state["current_screen"])[0]["generated_text"]
    # Plan the next action from the caption
    action = f"Click button based on: {caption}"
    return {"actions": state["actions"] + [action]}

def act(state: AgentState):
    # Execute the planned action (platform-specific);
    # in practice x, y come from the visual-grounding step
    x, y = 100, 200  # placeholder coordinates
    pyautogui.click(x, y)
    return {"observations": state["observations"] + ["Action executed"]}

graph = StateGraph(AgentState)
graph.add_node("reason", reason)
graph.add_node("act", act)
# Wire the edges; a full ReAct loop would route "act" back to "reason"
# via a conditional edge until the goal is met
graph.add_edge(START, "reason")
graph.add_edge("reason", "act")
graph.add_edge("act", END)
app = graph.compile()
Orchestrator: decompose "Book flight" → sub-tasks: search flight sites, fill booking forms.
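The decomposition step can be sketched as below. A real orchestrator would use an LLM planner; this rule-based stand-in with a hypothetical playbook table just shows the goal → sub-task interface the subagents consume.

```python
# Rule-based stand-in for an LLM planner (the playbook entries are
# hypothetical examples, not a real API).
def decompose(goal: str) -> list[str]:
    playbook = {
        "book flight": ["search flight sites", "compare fares", "fill booking form"],
        "send email": ["open mail client", "compose message", "click send"],
    }
    for task, subtasks in playbook.items():
        if task in goal.lower():
            return subtasks
    return [goal]  # no playbook entry: treat the goal as a single task

print(decompose("Book flight to Tokyo"))
# → ['search flight sites', 'compare fares', 'fill booking form']
```

Each sub-task would then be dispatched to a platform-specific subagent (web, mobile, or desktop) running its own reason/act loop.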
Full example: the agent navigates web or mobile UIs to complete a task, validating each step against screenshots.
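Screenshot validation plus failure recovery can be sketched as a retry wrapper: if the screen does not visibly change after an action, retry, and escalate to the orchestrator when the retry budget runs out. `execute` and `take_screenshot` are hypothetical platform hooks (they would wrap PyAutoGUI, Selenium, or Appium); the pixel-diff check uses Pillow, which is in the install list above.

```python
# Screenshot-based validation with retry-style failure recovery.
from PIL import Image, ImageChops

def screens_differ(before_path: str, after_path: str) -> bool:
    # getbbox() returns None when the pixel-wise difference is all zero,
    # i.e. the two screenshots are identical
    before = Image.open(before_path).convert("RGB")
    after = Image.open(after_path).convert("RGB")
    return ImageChops.difference(before, after).getbbox() is not None

def act_with_recovery(execute, take_screenshot, max_retries: int = 3) -> bool:
    # execute: callable performing the GUI action (hypothetical hook)
    # take_screenshot: callable returning a screenshot file path (hypothetical hook)
    for _attempt in range(max_retries):
        before = take_screenshot()
        execute()
        after = take_screenshot()
        if screens_differ(before, after):  # action had a visible effect
            return True
    return False  # exhausted retries: escalate to the orchestrator for replanning
```

A pixel diff is a deliberately crude validator; production agents typically combine it with OCR or a vision model to check that the *intended* state change occurred, not just any change.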