In Part 1, we built a single GUI agent that could control a computer through screenshots and mouse/keyboard actions. That works for simple tasks, but complex workflows benefit from specialization. A pixel-clicking agent is inefficient for web automation (fragile coordinates vs. DOM selectors), and high-level task planning differs from low-level action generation.
This post covers the multi-agent architecture in kyros: a BossAgent that orchestrates, specialized worker agents for different domains, and how they coordinate through a hierarchical delegation pattern.
Single agents hit practical limits:
Prompt complexity: One prompt can’t effectively cover task planning, GUI control, browser automation, and shell commands. Each domain has different tools, context, and reasoning patterns.
Context efficiency: A browser agent doesn’t need window lists, and a GUI agent doesn’t need DOM trees. Splitting agents means each gets relevant context.
Model selection: Planning benefits from stronger reasoning models. Action generation works fine with smaller, faster models. Multi-agent lets you mix models.
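To make the model-selection point concrete, here is a minimal sketch of a per-agent model table. The model identifiers, the `ModelChoice` type, and the `model_for` helper are placeholders invented for illustration; only the agent names "boss" and "browser" come from the code later in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelChoice:
    model: str       # placeholder identifier, not a real model name
    max_tokens: int  # hypothetical per-agent output budget


# Hypothetical mapping: the planner gets the stronger model,
# the action-generating workers get the smaller, faster one.
AGENT_MODELS = {
    "boss": ModelChoice(model="strong-reasoning-model", max_tokens=4096),
    "gui": ModelChoice(model="small-fast-model", max_tokens=1024),
    "browser": ModelChoice(model="small-fast-model", max_tokens=1024),
}
DEFAULT_MODEL = ModelChoice(model="small-fast-model", max_tokens=1024)


def model_for(agent_name: str) -> ModelChoice:
    """Look up the model for an agent, falling back to the fast default."""
    return AGENT_MODELS.get(agent_name, DEFAULT_MODEL)
```

Keeping the choice in one table means swapping the planner's model is a one-line change that never touches worker code.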
The architecture is hierarchical:
User
│
▼
BossAgent (orchestrator)
│
├── GUIAgent (desktop control)
├── BrowserActionAgent (web automation)
├── ShellAgent (command execution)
└── ResearchAgent (web search)
The BossAgent receives tasks from the user, creates plans, and delegates subtasks to specialized agents. It doesn’t execute actions directly - it coordinates.
The BossAgent’s job is to understand what the user wants, break it into subtasks, and route them to the right worker:
class BossAgent(BaseAgent):
    def __init__(self, ...):
        super().__init__(agent_name="boss", ...)
        self.subagents: Dict[str, Any] = {}

    def get_system_prompt(self) -> str:
        return """# Identity
You are a computer-use agent. You coordinate tasks by delegating to specialized agents:

## Available Agents
1. **BrowserBossAgent**: Browser automation and web interactions
   - Use for: navigating websites, filling forms, clicking web elements
2. **GUIAgent**: Mouse and keyboard interactions with GUI
   - Use for: clicking, typing, hotkeys, scrolling
3. **ShellAgent**: Executes shell commands
   - Use for: running terminal commands, file operations
4. **ResearchAgent**: Researches information using Tavily
   - Use for: searching the web, gathering information

## Your Responsibilities
1. Analyze the user's request and the current screenshot
2. For multi-step tasks: create a plan and get user approval first
3. Delegate manageable subtasks to sub-agents
4. Handle agent responses and coordinate multi-step workflows
5. Report results back to the user

## Response Format
{
  "thought": "Your reasoning about what to do next",
  "action": {
    "type": "delegate",
    "agent": "BrowserBossAgent",
    "message": "Open https://example.com and fill the login form"
  }
}
"""
The BossAgent outputs structured JSON with a thought process and an action. Actions can be:
delegate: Send subtask to a worker agent
message: Communicate with the user (often to present a plan)
exit: Task complete, report results
Workers are created on-demand and reused across delegations:
def get_or_create_agent(self, agent_type: str) -> Any:
    """Get existing agent or create a new one"""
    if agent_type in self.subagents:
        return self.subagents[agent_type]

    if agent_type == "BrowserActionAgent":
        from agents.browser_action_agent import BrowserActionAgent
        agent = BrowserActionAgent(
            websocket_callback=self.websocket_callback,
            config_dict=self.config_dict
        )
    elif agent_type == "GUIAgent":
        from agents.gui_agent import GUIAgent
        agent = GUIAgent(
            websocket_callback=self.websocket_callback,
            config_dict=self.config_dict
        )
    # ... other agent types

    self.subagents[agent_type] = agent
    return agent
This is important for the BrowserActionAgent - it keeps the browser instance alive across multiple delegations, so we don’t keep opening and closing browsers.
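Putting the response format and the worker cache together, one delegation step might look roughly like the sketch below. `parse_boss_response`, `AgentCache`, and `FakeWorker` are illustrative names rather than kyros code. The cache takes zero-argument factories instead of the lazy imports shown above, and it raises on unknown agent types, which is one possible way to handle a name the model made up.

```python
import json
from typing import Any, Callable, Dict

VALID_ACTION_TYPES = {"delegate", "message", "exit"}


def parse_boss_response(raw: str) -> Dict[str, Any]:
    """Parse the boss's JSON reply and reject malformed actions early."""
    data = json.loads(raw)
    action = data.get("action") or {}
    action_type = action.get("type")
    if action_type not in VALID_ACTION_TYPES:
        raise ValueError(f"Unsupported action type: {action_type!r}")
    if action_type == "delegate" and not action.get("agent"):
        raise ValueError("A delegate action needs an 'agent' field")
    return data


class AgentCache:
    """Creates each worker once and hands back the same instance afterwards."""

    def __init__(self, factories: Dict[str, Callable[[], Any]]):
        self._factories = factories
        self._agents: Dict[str, Any] = {}

    def get_or_create(self, agent_type: str) -> Any:
        if agent_type not in self._agents:
            if agent_type not in self._factories:
                raise ValueError(f"Unknown agent type: {agent_type}")
            self._agents[agent_type] = self._factories[agent_type]()
        return self._agents[agent_type]


class FakeWorker:
    """Stand-in for a real worker agent."""


cache = AgentCache({"GUIAgent": FakeWorker})
reply = parse_boss_response(json.dumps({
    "thought": "The dialog needs a click",
    "action": {"type": "delegate", "agent": "GUIAgent", "message": "Click OK"},
}))
worker = cache.get_or_create(reply["action"]["agent"])
same_worker = cache.get_or_create("GUIAgent")  # reused, not recreated
```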
The GUIAgent clicks pixels, which is fragile for web automation. The BrowserActionAgent uses Playwright for proper DOM interaction.
from typing import Any, Dict, Optional

from playwright.async_api import async_playwright, Browser, BrowserContext, Page

class BrowserActionAgent(BaseAgent):
    def __init__(self, ...):
        super().__init__(agent_name="browser", ...)
        self.playwright = None
        self.browser: Optional[Browser] = None
        self.context: Optional[BrowserContext] = None
        self.page: Optional[Page] = None

    async def _launch(self, name: str = "chromium") -> Dict[str, Any]:
        """Launch a new browser"""
        # Start Playwright once and reuse it for later launches
        if self.playwright is None:
            self.playwright = await async_playwright().start()
        self.browser = await self.playwright.chromium.launch(
            headless=False,
            args=[
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage'
            ]
        )
        self.context = await self.browser.new_context(no_viewport=True)
        self.page = await self.context.new_page()
        return {"success": True, "message": "Browser launched"}
The agent uses XPath selectors rather than coordinates. This is more reliable because elements can be identified by their semantic structure:
async def _click(self, xpath: str, button: str = "left", click_count: int = 1) -> Dict[str, Any]:
    """Click on an element by XPath"""
    if not self.page:
        return {"success": False, "error": "No active browser page"}
    element = await self.page.query_selector(f"xpath={xpath}")
    if not element:
        return {"success": False, "error": f"No element found: {xpath}"}
    await element.click(button=button, click_count=click_count)
    return {"success": True, "message": f"Clicked: {xpath}"}

async def _fill(self, xpath: str, text: str) -> Dict[str, Any]:
    """Fill an element with text"""
    # Same guard as _click: fail cleanly if no browser has been launched yet
    if not self.page:
        return {"success": False, "error": "No active browser page"}
    element = await self.page.query_selector(f"xpath={xpath}")
    if not element:
        return {"success": False, "error": f"No element found: {xpath}"}
    await element.fill(text)
    return {"success": True, "message": f"Filled: {xpath}"}
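To see why structural selectors hold up better than coordinates, here is a small sketch that resolves selectors against a toy page using Python's standard `xml.etree.ElementTree`. This is only a stand-in so the example runs without a browser: ElementTree supports a limited XPath subset (no `contains()`, for instance), whereas Playwright evaluates full XPath in the page. The `PAGE` markup and the `find_one` helper are made up for the demo.

```python
import xml.etree.ElementTree as ET

# Toy, well-formed page used only for this demonstration
PAGE = """
<html>
  <body>
    <div class="search-box"><input name="q" /></div>
    <form>
      <input name="email" />
      <button type="submit">Submit</button>
    </form>
  </body>
</html>
"""

root = ET.fromstring(PAGE)


def find_one(xpath: str):
    # ElementTree paths are relative to an element, hence the leading "."
    return root.find("." + xpath)


email = find_one("//input[@name='email']")
submit = find_one("//button[@type='submit']")
search = find_one("//div[@class='search-box']//input")

# Simulate a layout change: a banner is inserted at the top of the page.
# Pixel coordinates of the form would shift, but the selector is unaffected.
root.find("body").insert(0, ET.Element("div", {"class": "banner"}))
email_after = find_one("//input[@name='email']")
```

The selector keeps pointing at the same element after the banner is added, which is the property the pixel-based GUIAgent cannot offer.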
The LLM generates XPath selectors based on the page structure. For example:
//input[@name='email'] - input with name attribute
//button[contains(text(), 'Submit')] - button containing text
//div[@class='search-box']//input - input inside a div with class
def get_system_prompt(self) -> str:
    return """# Identity
You are a Browser Action Agent that automates browser interactions using Playwright.

# Available Tools
- launch(name): Launch browser ("chromium" or "firefox")
- navigate(url): Navigate to URL
- click(xpath): Click element by XPath
- fill(xpath, text): Fill element with text
- input_text(xpath, text): Type text character by character
- press_key(xpath, key): Press a key ("Enter", "Tab", "Escape")
- hover(xpath): Hover over element
- get_text(xpath): Get text content
- wait_for_element(xpath, timeout): Wait for element to appear
- scroll_into_view(xpath): Scroll element into view
- wait(seconds): Wait
- exit(summary, exitCode): Exit when finished

# Response Format
{
  "thought": "Your reasoning about what to do next",
  "action": {
    "tool": "click",
    "args": {"xpath": "//button[@id='submit']"}
  }
}
"""
The agent has a dispatch method that routes tool calls to implementations:
async def execute_action(self, action: Dict[str, Any]) -> Dict[str, Any]:
    """Execute a browser action"""
    tool = action.get("tool")
    args = action.get("args", {})

    if tool == "launch":
        return await self._launch(**args)
    elif tool == "navigate":
        return await self._navigate(**args)
    elif tool == "click":
        return await self._click(**args)
    elif tool == "fill":
        return await self._fill(**args)
    elif tool == "press_key":
        return await self._press_key(**args)
    elif tool == "wait_for_element":
        return await self._wait_for_element(**args)
    elif tool == "exit":
        return self._exit(**args)
    else:
        return {"success": False, "error": f"Unknown tool: {tool}"}
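As the tool list grows, the if/elif chain can be swapped for a lookup. The sketch below is one possible alternative, not how kyros does it: `ToolDispatcher`, its `ALLOWED_TOOLS` set, and the two stub handlers are hypothetical, and every handler is assumed to be async (unlike `_exit` above). It also turns bad arguments from the model into an error result instead of an exception.

```python
import asyncio
from typing import Any, Dict


class ToolDispatcher:
    """Routes {"tool": ..., "args": {...}} to a method named _<tool>."""

    # Explicit allowlist so the model cannot reach arbitrary private methods
    ALLOWED_TOOLS = {"navigate", "click"}

    async def _navigate(self, url: str) -> Dict[str, Any]:
        return {"success": True, "message": f"Navigated: {url}"}

    async def _click(self, xpath: str) -> Dict[str, Any]:
        return {"success": True, "message": f"Clicked: {xpath}"}

    async def execute_action(self, action: Dict[str, Any]) -> Dict[str, Any]:
        tool = action.get("tool")
        args = action.get("args", {})
        if tool not in self.ALLOWED_TOOLS:
            return {"success": False, "error": f"Unknown tool: {tool}"}
        handler = getattr(self, f"_{tool}")
        try:
            return await handler(**args)
        except TypeError as exc:
            # Typically missing or unexpected arguments from the model;
            # note this also catches TypeErrors raised inside the handler.
            return {"success": False, "error": f"Bad arguments for {tool}: {exc}"}


dispatcher = ToolDispatcher()
ok = asyncio.run(dispatcher.execute_action({"tool": "click", "args": {"xpath": "//a"}}))
bad_args = asyncio.run(dispatcher.execute_action({"tool": "click", "args": {}}))
unknown = asyncio.run(dispatcher.execute_action({"tool": "rm_rf", "args": {}}))
```

Returning the argument error as a normal result lets the LLM see what went wrong and retry on its next step.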
Agents need context to make good decisions, but context has costs (tokens, latency). The system manages this through context passing and compaction.
When the BossAgent delegates to a worker, it passes relevant context in the message:
# BossAgent builds context for worker
context_parts = []
if self.compacted_context:
    context_parts.append(f"Previous Context:\n{self.compacted_context}")
context_parts.append(f"User Request: {message.get('content', '')}")

# Include agent responses from previous steps
for agent_resp in message.get('agent_responses', []):
    agent_type = agent_resp.get("agent_type")
    response = agent_resp.get("response", {})
    context_parts.append(f"{agent_type} response: {response}")
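The same logic can be pulled into a standalone function, which makes it easy to test. This is a sketch under two assumptions the excerpt doesn't confirm: that the parts are joined with blank lines, and that a worker response may carry a `summary` field worth preferring over the raw dict. `build_worker_context` is an illustrative name.

```python
from typing import Any, Dict, List, Optional


def build_worker_context(message: Dict[str, Any],
                         compacted_context: Optional[str] = None) -> str:
    """Assemble the text context that accompanies a delegated subtask."""
    parts: List[str] = []
    if compacted_context:
        parts.append(f"Previous Context:\n{compacted_context}")
    parts.append(f"User Request: {message.get('content', '')}")

    for agent_resp in message.get("agent_responses", []):
        agent_type = agent_resp.get("agent_type", "unknown")
        response = agent_resp.get("response", {})
        # Assumption: prefer the short exit summary when one is present
        if isinstance(response, dict) and "summary" in response:
            response = response["summary"]
        parts.append(f"{agent_type} response: {response}")

    # Assumption: blank-line separation between parts
    return "\n\n".join(parts)


context = build_worker_context(
    {
        "content": "Download my invoice",
        "agent_responses": [{
            "agent_type": "BrowserActionAgent",
            "response": {"summary": "Logged in successfully, on dashboard",
                         "exitCode": 0},
        }],
    },
    compacted_context="Browser already open",
)
```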
Both BossAgent and worker agents compact their context when it gets too large:
# Check if compaction is needed
context_text = str(self.history)
word_count = count_words(context_text)

if self.step_count >= trigger_steps or word_count >= trigger_words:
    self.compacted_context = compact_context(
        self.history,
        task,
        self.config_dict,
        self.websocket_callback
    )
    self.history = []  # Clear after compaction
    self.step_count = 0
The compact_context function uses a fast LLM to summarize the action history, preserving key information while reducing token count.
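The helpers aren't shown in this post, so here is a rough sketch of what the trigger check and the compaction step could look like. Everything in it is an assumption: the whitespace-based `count_words`, the threshold defaults, the shape of the history entries, and the `summarize` callable, which stands in for the fast-LLM call and is replaced by a simple truncation so the example runs offline.

```python
from typing import Any, Callable, Dict, List


def count_words(text: str) -> int:
    """Whitespace-based word count; a cheap proxy for token count."""
    return len(text.split())


def needs_compaction(history: List[Dict[str, Any]], step_count: int,
                     trigger_steps: int = 10, trigger_words: int = 2000) -> bool:
    """Compact when either the step budget or the word budget is reached."""
    return step_count >= trigger_steps or count_words(str(history)) >= trigger_words


def compact(history: List[Dict[str, Any]], task: str,
            summarize: Callable[[str], str]) -> str:
    """Flatten the history into text and hand it to a summarizer."""
    lines = [f"Task: {task}"]
    for i, step in enumerate(history, start=1):
        lines.append(f"{i}. {step.get('action')} -> {step.get('result')}")
    return summarize("\n".join(lines))


def truncate(text: str, limit: int = 200) -> str:
    """Offline stand-in for the LLM summarizer."""
    return text if len(text) <= limit else text[:limit] + "..."


history = [
    {"action": "navigate", "result": "ok"},
    {"action": "click", "result": "ok"},
]
summary = compact(history, "log in", truncate)
```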
When a worker agent finishes, it returns an exit summary that captures what was accomplished:
def _exit(self, summary: Optional[str] = None, exitCode: int = 0) -> Dict[str, Any]:
    """Exit the agent"""
    return {
        "success": True,
        "exit": True,
        "summary": summary or "Agent completed",
        "exitCode": exitCode
    }
This summary gets passed back to the BossAgent, which uses it as context for the next step. For example:
BrowserActionAgent response: "Filled login form with email user@example.com
and password, clicked Submit button. Login successful - now on dashboard page."
This gives the BossAgent enough context to decide the next action without re-analyzing the full history.
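The hand-off can be sketched as a worker loop that runs actions until one returns `"exit": True` and then surfaces only the summary. `ScriptedWorker`, its `next_action` method, `run_worker`, and the `max_steps` cap are all invented for this example; a real worker would ask the LLM for each action rather than replaying a list.

```python
import asyncio
from typing import Any, Dict, List


class ScriptedWorker:
    """Stand-in worker that replays fixed actions instead of calling an LLM."""

    def __init__(self, actions: List[Dict[str, Any]]):
        self._actions = list(actions)

    def next_action(self) -> Dict[str, Any]:
        return self._actions.pop(0)

    async def execute_action(self, action: Dict[str, Any]) -> Dict[str, Any]:
        args = action.get("args", {})
        if action["tool"] == "exit":
            # Same result shape as the _exit method shown above
            return {
                "success": True,
                "exit": True,
                "summary": args.get("summary") or "Agent completed",
                "exitCode": args.get("exitCode", 0),
            }
        return {"success": True, "message": f"Ran {action['tool']}"}


async def run_worker(worker: ScriptedWorker, max_steps: int = 20) -> Dict[str, Any]:
    """Run until the worker exits; return only what the boss needs."""
    for _ in range(max_steps):
        result = await worker.execute_action(worker.next_action())
        if result.get("exit"):
            return {"summary": result["summary"], "exitCode": result["exitCode"]}
    # Hypothetical safety net so a confused worker cannot loop forever
    return {"summary": f"Stopped after {max_steps} steps without exit", "exitCode": 1}


worker = ScriptedWorker([
    {"tool": "navigate", "args": {"url": "https://example.com/login"}},
    {"tool": "exit", "args": {"summary": "Navigated to login page"}},
])
outcome = asyncio.run(run_worker(worker))
```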
Here’s how a multi-step task flows through the system:
User: "Log into example.com and download my invoice"

1. BossAgent receives task, takes screenshot
2. BossAgent creates plan, asks user for approval:
   "I'll help with that. Plan:
    1. Open browser and navigate to example.com
    2. Fill login form with your credentials
    3. Navigate to invoices section
    4. Download the latest invoice
    Does this look good?"
3. User approves
4. BossAgent delegates to BrowserActionAgent:
   "Open browser and navigate to example.com/login"
5. BrowserActionAgent executes:
   - launch(name="chromium")
   - navigate(url="https://example.com/login")
   - exit(summary="Browser launched, navigated to login page")
6. BossAgent receives result, delegates next step:
   "Fill login form with email user@example.com, password ****"
7. BrowserActionAgent executes:
   - fill(xpath="//input[@name='email']", text="user@example.com")
   - fill(xpath="//input[@name='password']", text="****")
   - click(xpath="//button[@type='submit']")
   - wait_for_element(xpath="//div[@class='dashboard']")
   - exit(summary="Logged in successfully, on dashboard")
8. ... continues until task complete
9. BossAgent reports to user:
   "Downloaded invoice_2024_01.pdf to ~/Downloads"
The system uses WebSocket callbacks to send real-time updates to the frontend:
def send_llm_update(self, event_type: str, data: Dict[str, Any]):
    """Send update via WebSocket"""
    if self.websocket_callback:
        self.websocket_callback({
            "type": event_type,
            "agent_id": self.agent_id,
            "agent_type": self.__class__.__name__,
            "data": data
        })
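Because the callback is just a callable that receives a dict, it is easy to swap in a recorder for tests or local debugging. The sketch below assumes nothing about the real WebSocket layer: `EventLog` and `DemoAgent` are made-up names, and `DemoAgent` only copies the `send_llm_update` method shown above.

```python
from typing import Any, Callable, Dict, List, Optional


class EventLog:
    """Callable that records events instead of sending them over a socket."""

    def __init__(self) -> None:
        self.events: List[Dict[str, Any]] = []

    def __call__(self, event: Dict[str, Any]) -> None:
        self.events.append(event)

    def of_type(self, event_type: str) -> List[Dict[str, Any]]:
        return [e for e in self.events if e["type"] == event_type]


class DemoAgent:
    """Minimal agent exposing the same update method as in the post."""

    def __init__(self, agent_id: str,
                 websocket_callback: Optional[Callable[[Dict[str, Any]], None]] = None):
        self.agent_id = agent_id
        self.websocket_callback = websocket_callback

    def send_llm_update(self, event_type: str, data: Dict[str, Any]) -> None:
        if self.websocket_callback:
            self.websocket_callback({
                "type": event_type,
                "agent_id": self.agent_id,
                "agent_type": self.__class__.__name__,
                "data": data,
            })


log = EventLog()
agent = DemoAgent(agent_id="demo-1", websocket_callback=log)
agent.send_llm_update("action_execute", {"tool": "click"})
agent.send_llm_update("action_result", {"success": True})
```

An agent created without a callback simply skips the update, which matches the guard in the original method.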
Events include:
screenshot: New screenshot captured
llm_call_start: LLM request started
llm_content_chunk: Streaming response token
action_execute: About to execute action
action_result: Action completed
The multi-agent pattern is a good foundation for these extensions because each capability can be added to the appropriate agent without affecting others.