Occam's razor Archive Pages Categories Tags

Building a Computer-Use Agent in Python - Part 2: Multi-Agent Architecture

18 January 2026

In Part 1, we built a single GUI agent that could control a computer through screenshots and mouse/keyboard actions. That works for simple tasks, but complex workflows benefit from specialization. A pixel-clicking agent is inefficient for web automation (fragile coordinates vs. DOM selectors), and high-level task planning differs from low-level action generation.

This post covers the multi-agent architecture in kyros: a BossAgent that orchestrates, specialized worker agents for different domains, and how they coordinate through a hierarchical delegation pattern.

Why Multi-Agent?

Single agents hit practical limits:

Prompt complexity: One prompt can’t effectively cover task planning, GUI control, browser automation, and shell commands. Each domain has different tools, context, and reasoning patterns.
Context efficiency: A browser agent doesn’t need window lists, and a GUI agent doesn’t need DOM trees. Splitting agents means each gets relevant context.
Model selection: Planning benefits from stronger reasoning models. Action generation works fine with smaller, faster models. Multi-agent lets you mix models.

The Boss-Worker Pattern

The architecture is hierarchical:

User
  │
  ▼
BossAgent (orchestrator)
  │
  ├── GUIAgent (desktop control)
  ├── BrowserActionAgent (web automation)
  ├── ShellAgent (command execution)
  └── ResearchAgent (web search)

The BossAgent receives tasks from the user, creates plans, and delegates subtasks to specialized agents. It doesn’t execute actions directly - it coordinates.

BossAgent: The Orchestrator

The BossAgent’s job is to understand what the user wants, break it into subtasks, and route them to the right worker:

class BossAgent(BaseAgent):
    def __init__(self, ...):
        super().__init__(agent_name="boss", ...)
        self.subagents: Dict[str, Any] = {}

    def get_system_prompt(self) -> str:
        return """# Identity

You are a computer-use agent. You coordinate tasks by delegating to specialized agents:

## Available Agents

1. **BrowserBossAgent**: Browser automation and web interactions
   - Use for: navigating websites, filling forms, clicking web elements

2. **GUIAgent**: Mouse and keyboard interactions with GUI
   - Use for: clicking, typing, hotkeys, scrolling

3. **ShellAgent**: Executes shell commands
   - Use for: running terminal commands, file operations

4. **ResearchAgent**: Researches information using Tavily
   - Use for: searching the web, gathering information

## Your Responsibilities

1. Analyze the user's request and the current screenshot
2. For multi-step tasks: create a plan and get user approval first
3. Delegate manageable subtasks to sub-agents
4. Handle agent responses and coordinate multi-step workflows
5. Report results back to the user

## Response Format

{
  "thought": "Your reasoning about what to do next",
  "action": {
    "type": "delegate",
    "agent": "BrowserBossAgent",
    "message": "Open https://example.com and fill the login form"
  }
}
"""

The BossAgent outputs structured JSON with a thought process and an action. Actions can be:

delegate: Send subtask to a worker agent
message: Communicate with the user (often to present a plan)
exit: Task complete, report results

Agent Creation and Reuse

Workers are created on-demand and reused across delegations:

def get_or_create_agent(self, agent_type: str) -> Any:
    """Get existing agent or create a new one"""
    if agent_type in self.subagents:
        return self.subagents[agent_type]

    if agent_type == "BrowserActionAgent":
        from agents.browser_action_agent import BrowserActionAgent
        agent = BrowserActionAgent(
            websocket_callback=self.websocket_callback,
            config_dict=self.config_dict
        )
    elif agent_type == "GUIAgent":
        from agents.gui_agent import GUIAgent
        agent = GUIAgent(
            websocket_callback=self.websocket_callback,
            config_dict=self.config_dict
        )
    # ... other agent types

    self.subagents[agent_type] = agent
    return agent

This is important for the BrowserActionAgent - it keeps the browser instance alive across multiple delegations, so we don’t keep opening and closing browsers.

Browser Automation with Playwright

The GUIAgent clicks pixels, which is fragile for web automation. The BrowserActionAgent uses Playwright for proper DOM interaction.

Setup

from playwright.async_api import async_playwright, Browser, BrowserContext, Page

class BrowserActionAgent(BaseAgent):
    def __init__(self, ...):
        super().__init__(agent_name="browser", ...)
        self.playwright = None
        self.browser: Optional[Browser] = None
        self.context: Optional[BrowserContext] = None
        self.page: Optional[Page] = None

    async def _launch(self, name: str = "chromium") -> Dict[str, Any]:
        """Launch a new browser"""
        if self.playwright is None:
            self.playwright = await async_playwright().start()

        self.browser = await self.playwright.chromium.launch(
            headless=False,
            args=[
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage'
            ]
        )
        self.context = await self.browser.new_context(no_viewport=True)
        self.page = await self.context.new_page()

        return {"success": True, "message": "Browser launched"}

XPath-Based Actions

The agent uses XPath selectors rather than coordinates. This is more reliable because elements can be identified by their semantic structure:

async def _click(self, xpath: str, button: str = "left", click_count: int = 1) -> Dict[str, Any]:
    """Click on an element by XPath"""
    if not self.page:
        return {"success": False, "error": "No active browser page"}

    element = await self.page.query_selector(f"xpath={xpath}")
    if not element:
        return {"success": False, "error": f"No element found: {xpath}"}

    await element.click(button=button, click_count=click_count)
    return {"success": True, "message": f"Clicked: {xpath}"}

async def _fill(self, xpath: str, text: str) -> Dict[str, Any]:
    """Fill an element with text"""
    element = await self.page.query_selector(f"xpath={xpath}")
    if not element:
        return {"success": False, "error": f"No element found: {xpath}"}

    await element.fill(text)
    return {"success": True, "message": f"Filled: {xpath}"}

The LLM generates XPath selectors based on the page structure. For example:

//input[@name='email'] - input with name attribute
//button[contains(text(), 'Submit')] - button containing text
//div[@class='search-box']//input - input inside a div with class

System Prompt

def get_system_prompt(self) -> str:
    return """# Identity

You are a Browser Action Agent that automates browser interactions using Playwright.

# Available Tools

- launch(name): Launch browser ("chromium" or "firefox")
- navigate(url): Navigate to URL
- click(xpath): Click element by XPath
- fill(xpath, text): Fill element with text
- input_text(xpath, text): Type text character by character
- press_key(xpath, key): Press a key ("Enter", "Tab", "Escape")
- hover(xpath): Hover over element
- get_text(xpath): Get text content
- wait_for_element(xpath, timeout): Wait for element to appear
- scroll_into_view(xpath): Scroll element into view
- wait(seconds): Wait
- exit(summary, exitCode): Exit when finished

# Response Format

{
  "thought": "Your reasoning about what to do next",
  "action": {
    "tool": "click",
    "args": {"xpath": "//button[@id='submit']"}
  }
}
"""

Action Dispatch

The agent has a dispatch method that routes tool calls to implementations:

async def execute_action(self, action: Dict[str, Any]) -> Dict[str, Any]:
    """Execute a browser action"""
    tool = action.get("tool")
    args = action.get("args", {})

    if tool == "launch":
        return await self._launch(**args)
    elif tool == "navigate":
        return await self._navigate(**args)
    elif tool == "click":
        return await self._click(**args)
    elif tool == "fill":
        return await self._fill(**args)
    elif tool == "press_key":
        return await self._press_key(**args)
    elif tool == "wait_for_element":
        return await self._wait_for_element(**args)
    elif tool == "exit":
        return self._exit(**args)
    else:
        return {"success": False, "error": f"Unknown tool: {tool}"}

Context Management

Agents need context to make good decisions, but context has costs (tokens, latency). The system manages this through context passing and compaction.

Context Passing Between Agents

When the BossAgent delegates to a worker, it passes relevant context in the message:

# BossAgent builds context for worker
context_parts = []
if self.compacted_context:
    context_parts.append(f"Previous Context:\n{self.compacted_context}")
context_parts.append(f"User Request: {message.get('content', '')}")

# Include agent responses from previous steps
for agent_resp in message.get('agent_responses', []):
    agent_type = agent_resp.get("agent_type")
    response = agent_resp.get("response", {})
    context_parts.append(f"{agent_type} response: {response}")

Context Compaction

Both BossAgent and worker agents compact their context when it gets too large:

# Check if compaction is needed
context_text = str(self.history)
word_count = count_words(context_text)

if self.step_count >= trigger_steps or word_count >= trigger_words:
    self.compacted_context = compact_context(
        self.history,
        task,
        self.config_dict,
        self.websocket_callback
    )
    self.history = []  # Clear after compaction
    self.step_count = 0

The compact_context function uses a fast LLM to summarize the action history, preserving key information while reducing token count.

Exit Summaries

When a worker agent finishes, it returns an exit summary that captures what was accomplished:

def _exit(self, summary: str = None, exitCode: int = 0) -> Dict[str, Any]:
    """Exit the agent"""
    return {
        "success": True,
        "exit": True,
        "summary": summary or "Agent completed",
        "exitCode": exitCode
    }

This summary gets passed back to the BossAgent, which uses it as context for the next step. For example:

BrowserActionAgent response: "Filled login form with email user@example.com
and password, clicked Submit button. Login successful - now on dashboard page."

This gives the BossAgent enough context to decide the next action without re-analyzing the full history.

Full Orchestration

Here’s how a multi-step task flows through the system:

User: "Log into example.com and download my invoice"

1. BossAgent receives task, takes screenshot
2. BossAgent creates plan, asks user for approval:
   "I'll help with that. Plan:
    1. Open browser and navigate to example.com
    2. Fill login form with your credentials
    3. Navigate to invoices section
    4. Download the latest invoice
    Does this look good?"

3. User approves

4. BossAgent delegates to BrowserActionAgent:
   "Open browser and navigate to example.com/login"

5. BrowserActionAgent executes:
   - launch(name="chromium")
   - navigate(url="https://example.com/login")
   - exit(summary="Browser launched, navigated to login page")

6. BossAgent receives result, delegates next step:
   "Fill login form with email user@example.com, password ****"

7. BrowserActionAgent executes:
   - fill(xpath="//input[@name='email']", text="user@example.com")
   - fill(xpath="//input[@name='password']", text="****")
   - click(xpath="//button[@type='submit']")
   - wait_for_element(xpath="//div[@class='dashboard']")
   - exit(summary="Logged in successfully, on dashboard")

8. ... continues until task complete

9. BossAgent reports to user:
   "Downloaded invoice_2024_01.pdf to ~/Downloads"

Real-Time Updates

The system uses WebSocket callbacks to send real-time updates to the frontend:

def send_llm_update(self, event_type: str, data: Dict[str, Any]):
    """Send update via WebSocket"""
    if self.websocket_callback:
        self.websocket_callback({
            "type": event_type,
            "agent_id": self.agent_id,
            "agent_type": self.__class__.__name__,
            "data": data
        })

Events include:

screenshot: New screenshot captured
llm_call_start: LLM request started
llm_content_chunk: Streaming response token
action_execute: About to execute action
action_result: Action completed

Tradeoffs and Extensions

What Works

Specialization pays off: Browser agent with Playwright is much more reliable than pixel-clicking for web tasks.
Hierarchical delegation: BossAgent can focus on planning while workers handle execution details.
Context compaction: Keeps long sessions from hitting token limits.

What’s Tricky

Coordination overhead: More agents means more LLM calls. A simple task might be faster with a single agent.
Error recovery: When a worker fails, the BossAgent needs to decide whether to retry, try a different approach, or give up. This is hard to get right.
State sharing: The browser persists across delegations, but the BossAgent needs to track what state it’s in (logged in? which page?).

Extensions

Memory: Persistent memory across sessions (what credentials work, common workflows).
Self-improvement: Agent analyzes its failures and updates its prompts or tool usage.
Human-in-the-loop: Ask for help when stuck rather than failing silently.
Tool creation: Agent writes new tools when it needs capabilities it doesn’t have.

The multi-agent pattern is a good foundation for these extensions because each capability can be added to the appropriate agent without affecting others.