Computer-use agents are AI systems that can control a computer the way a human would - through mouse clicks, keyboard input, and visual observation. Unlike API-based automation, these agents work with any application that has a GUI, making them surprisingly general-purpose.
I built one called kyros as a side project. This post covers the core architecture: how the agent observes the screen, decides what to do, executes actions, and verifies results. Part 2 will cover multi-agent coordination.
Every computer-use agent follows the same fundamental pattern:
```
while task_not_complete:
    1. Observe - take a screenshot, get the window list
    2. Think   - send the state to the LLM, get the next action
    3. Act     - execute the action (click, type, etc.)
    4. Verify  - compare before/after screenshots
```
The key insight is that this loop is reactive - the agent doesn’t plan out every step in advance. It observes the current state, takes one action, observes the result, and decides the next action. This makes it robust to unexpected UI states.
Here’s the simplified loop from the GUI agent:
```python
async def process_message(self, message: Dict[str, Any]) -> Dict[str, Any]:
    task = message.get("content", "")
    max_iterations = message.get("max_iterations", 50)
    self.running = True
    history = []
    iteration = 0

    while self.running and iteration < max_iterations:
        iteration += 1

        # 1. Observe
        screenshot = self.get_screenshot_base64()
        active_windows = self.get_active_windows()

        # 2. Think - build context and call the LLM
        messages = self.build_context(task, screenshot, active_windows, history)
        action_code = self.generate_action(messages, system=self.get_system_prompt())

        # 3. Act
        screenshot_before = screenshot
        exec_result = self.execute_action(action_code)
        if not self.running:  # The exit action was called
            break

        # 4. Verify
        time.sleep(0.5)  # Wait for the UI to update
        screenshot_after = self.get_screenshot_base64()
        verification = await self.verify_action(screenshot_before, screenshot_after, action_code)

        history.append({
            "action": action_code,
            "result": exec_result,
            "verification": verification,
        })
```
The agent needs low-level control over mouse, keyboard, and screen capture. On Linux, I use Xlib for mouse control (it’s more reliable than pyautogui for X11), pyautogui for keyboard, and ImageMagick’s import command for screenshots.
Xlib gives direct access to the X11 server. The trick is using relative coordinates (0-1 range) so the agent doesn’t need to know screen resolution:
```python
import os
import time

import Xlib.display
import Xlib.X
import Xlib.ext.xtest

def click(x: float, y: float, button: int = 1, clicks: int = 1) -> dict:
    """Click at relative coordinates (0-1 range)"""
    display = Xlib.display.Display(os.environ.get('DISPLAY', ':0'))
    screen = display.screen()
    width = screen.width_in_pixels
    height = screen.height_in_pixels

    # Convert relative to absolute coordinates
    abs_x = int(x * width)
    abs_y = int(y * height)

    # Move the mouse
    root = screen.root
    root.warp_pointer(abs_x, abs_y)
    display.sync()

    # Perform clicks using the XTEST extension
    for _ in range(clicks):
        Xlib.ext.xtest.fake_input(display, Xlib.X.ButtonPress, button)
        display.sync()
        Xlib.ext.xtest.fake_input(display, Xlib.X.ButtonRelease, button)
        display.sync()
        if clicks > 1:
            time.sleep(0.05)  # Small delay between multiple clicks

    display.close()
    return {"stdout": "", "stderr": "", "exitCode": 0}
```
The XTEST extension is key - it injects synthetic input events that applications can’t distinguish from real input.
For keyboard input, pyautogui works well enough:
```python
import pyautogui

def type(text: str) -> dict:
    """Type text character by character"""
    pyautogui.write(text, interval=0.01)
    return {"stdout": "", "stderr": "", "exitCode": 0}

def hotkey(keys: str) -> dict:
    """Execute a hotkey combination. Example: 'super+r' or 'ctrl+alt+t'"""
    key_parts = keys.split('+')
    key_map = {
        'super': 'winleft',
        'ctrl': 'ctrl',
        'alt': 'alt',
        'shift': 'shift'
    }
    mapped_keys = [key_map.get(k.lower(), k.lower()) for k in key_parts]
    pyautogui.hotkey(*mapped_keys)
    return {"stdout": "", "stderr": "", "exitCode": 0}
```
For screenshots, I use ImageMagick's import command, which captures the X11 root window:
```python
def get_screenshot_base64(self) -> str:
    """Capture a screenshot and return it as a base64-encoded JPEG"""
    temp_fd, temp_path = tempfile.mkstemp(suffix='.png')
    os.close(temp_fd)
    try:
        env = os.environ.copy()
        env['DISPLAY'] = env.get('DISPLAY', ':0')
        subprocess.run(
            ["import", "-window", "root", temp_path],
            capture_output=True,
            timeout=2,
            env=env
        )

        # Convert to JPEG for a smaller payload (JPEG has no alpha channel)
        screenshot = Image.open(temp_path).convert("RGB")
        buffer = BytesIO()
        screenshot.save(buffer, format="JPEG", quality=75)
        buffer.seek(0)
        img_base64 = base64.b64encode(buffer.read()).decode('utf-8')
        return f"data:image/jpeg;base64,{img_base64}"
    finally:
        if os.path.exists(temp_path):
            os.unlink(temp_path)
```
I also overlay the cursor position onto the screenshot so the LLM knows where the mouse currently is:
```python
# Get the cursor position
display = Xlib.display.Display(':0')
root = display.screen().root
pointer = root.query_pointer()
cursor_x, cursor_y = pointer.root_x, pointer.root_y

# Overlay the cursor image (its alpha channel doubles as the paste mask)
cursor_img = Image.open('./cursor.png')
screenshot.paste(cursor_img, (cursor_x, cursor_y), cursor_img)
```
The agent needs to know what windows are open and be able to focus them. wmctrl is perfect for this:
```python
def get_active_windows(self) -> str:
    """Get the list of active windows"""
    result = subprocess.run(
        ["wmctrl", "-l"],
        capture_output=True,
        text=True,
        timeout=2
    )
    return result.stdout  # Format: window_id desktop host title

def focus_window(window_id: str) -> dict:
    """Focus a window by its ID"""
    subprocess.run(["wmctrl", "-i", "-a", window_id], timeout=2)

    # Move the mouse to the center of the window
    geom_result = subprocess.run(["wmctrl", "-l", "-G"], capture_output=True, text=True)
    # Parse geometry and move mouse...
```
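For the curious, here's a sketch of what that geometry parsing could look like. `wmctrl -l -G` prints one window per line as window_id, desktop, x, y, width, height, client host, title. The `parse_wmctrl_geometry` and `window_center` helpers below are illustrative, not the actual kyros code:

```python
def parse_wmctrl_geometry(output: str, window_id: str):
    """Find a window's (x, y, width, height) in `wmctrl -l -G` output.

    Returns None if the window id isn't listed.
    """
    for line in output.splitlines():
        parts = line.split(None, 7)  # title may contain spaces; split at most 7 times
        if len(parts) >= 6 and parts[0] == window_id:
            x, y, w, h = map(int, parts[2:6])
            return x, y, w, h
    return None

def window_center(geometry, screen_w: int, screen_h: int):
    """Convert a window's center to the relative (0-1) coordinates click() expects."""
    x, y, w, h = geometry
    return (x + w / 2) / screen_w, (y + h / 2) / screen_h
```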
All agents inherit from a base class that handles LLM communication:
```python
class BaseAgent(ABC):
    def __init__(
        self,
        agent_id: str = None,
        api_key: str = None,
        model: str = None,
        websocket_callback: Optional[Callable] = None,
        agent_name: str = None,
        config_dict: Dict[str, Any] = None
    ):
        self.agent_id = agent_id or str(uuid.uuid4())
        self.agent_name = agent_name

        # Load the config for this agent type
        agent_config = config.get_agent_config(agent_name, config_dict)
        self.api_key = api_key or agent_config.get("api_key")
        self.model = model or agent_config.get("model")
        self.api_provider = agent_config.get("api_provider", "openai")
        self.base_url = agent_config.get("base_url")

        # Initialize the client based on the provider
        if self.api_provider == "anthropic":
            self.client = Anthropic(api_key=self.api_key)
        else:
            self.client = OpenAI(api_key=self.api_key, base_url=self.base_url)

    @abstractmethod
    def get_system_prompt(self) -> str:
        pass

    @abstractmethod
    async def process_message(self, message: Dict[str, Any]) -> Dict[str, Any]:
        pass
```
The call_llm method handles both OpenAI and Anthropic APIs, with streaming support:
```python
def call_llm(
    self,
    messages: List[Dict[str, Any]],
    system: str = None,
    temperature: float = None,
    max_tokens: int = None,
    stream: bool = True
) -> str:
    response_text = ""
    if self.api_provider == "anthropic":
        anthropic_messages = self._convert_to_anthropic_format(messages)
        with self.client.messages.stream(
            model=self.model,
            messages=anthropic_messages,
            system=system or "",
            temperature=temperature,
            max_tokens=max_tokens
        ) as response:
            for text in response.text_stream:
                response_text += text
                self.send_llm_update("llm_content_chunk", {"content": text})
    else:
        # OpenAI streaming...
        pass
    return response_text
```
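The elided OpenAI branch follows the same shape, using the standard `chat.completions` streaming API. Since OpenAI has no separate `system` argument, the system prompt is prepended as a message. Here's a sketch factored as a standalone function - `stream_openai` and the `on_chunk` callback are illustrative stand-ins for the method body and `send_llm_update`:

```python
from typing import Any, Callable, Dict, List, Optional

def stream_openai(
    client: Any,
    model: str,
    messages: List[Dict[str, Any]],
    system: Optional[str] = None,
    on_chunk: Optional[Callable[[str], None]] = None,
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
) -> str:
    """Stream a chat completion and accumulate the full response text."""
    # OpenAI takes the system prompt as a leading message, not a kwarg
    openai_messages = ([{"role": "system", "content": system}] if system else []) + messages
    stream = client.chat.completions.create(
        model=model,
        messages=openai_messages,
        temperature=temperature,
        max_tokens=max_tokens,
        stream=True,
    )
    response_text = ""
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks (e.g. role headers) carry no text
            response_text += delta
            if on_chunk:
                on_chunk(delta)
    return response_text
```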
The GUI agent is the workhorse - it takes a task, observes the screen, and generates Python code to execute actions.
The system prompt defines what tools are available and how to use them:
````python
def get_system_prompt(self) -> str:
    return """
# Identity
You are a GUI Agent. Your job is to analyze the given screenshot and execute
the given TASK by performing step-by-step actions.

# Tools
- tools.focus_window(window_id): Focus a window by its ID
- tools.move(x, y): Move mouse to relative coordinates (0-1 range)
- tools.click(x, y, button=1, clicks=1): Click at relative coordinates
- tools.scroll(amount): Scroll (positive=down, negative=up)
- tools.type(text): Type the given text
- tools.hotkey(keys): Press a hotkey combination (e.g., 'super+r', 'ctrl+c')
- tools.wait(n): Wait for n seconds
- tools.exit(summary, exitCode): Exit when finished

# Rules
- Respond with executable Python code that calls ONE of these tools
- Only generate 1 action at a time
- Don't repeat the same action again and again
- Look at the "Currently active windows" list to determine which window to focus

# Example
```python
# Focus the Firefox browser window
tools.focus_window("0x02400003")
```
"""
````
### Action Execution
The agent generates Python code, which we execute in a controlled namespace:
```python
def execute_action(self, action_code: str) -> dict:
    """Execute the generated Python code"""
    # Extract code from markdown blocks, if present
    if "```python" in action_code:
        code = action_code.split("```python")[1].split("```")[0].strip()
    else:
        code = action_code.strip()

    result = {"stdout": "", "stderr": "", "exitCode": 0}
    try:
        namespace = {"tools": tools, "__action_result__": None}

        # Rewrite bare tool calls so their return values are captured
        lines = code.split('\n')
        modified_lines = []
        for line in lines:
            stripped = line.strip()
            if stripped.startswith('tools.') and not stripped.startswith('tools.exit('):
                indent = len(line) - len(line.lstrip())
                modified_lines.append(' ' * indent + f'__action_result__ = {stripped}')
            else:
                modified_lines.append(line)
        exec('\n'.join(modified_lines), namespace)

        if namespace.get("__action_result__"):
            tool_result = namespace["__action_result__"]
            result["stdout"] = tool_result.get("stdout", "")
            result["stderr"] = tool_result.get("stderr", "")
            result["exitCode"] = tool_result.get("exitCode", 0)
    except tools.ExitException as e:
        self.running = False
        result["stdout"] = e.message
        result["exitCode"] = e.exit_code
    return result
```
After each action, the agent compares before/after screenshots to verify success:
```python
async def verify_action(self, screenshot_before: str, screenshot_after: str, action: str) -> str:
    """Verify whether the action succeeded by comparing screenshots"""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"Action performed: {action}\n\nDid the action succeed?\n\nScreenshot before:"},
                {"type": "image_url", "image_url": {"url": screenshot_before}},
                {"type": "text", "text": "Screenshot after:"},
                {"type": "image_url", "image_url": {"url": screenshot_after}}
            ]
        }
    ]
    verification = self.call_llm(
        messages=messages,
        system="Compare two screenshots and determine if the action succeeded. Be concise.",
        max_tokens=500
    )
    return verification
```
This verification feeds back into the agent’s context for the next iteration, helping it understand what worked and what didn’t.
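Concretely, `build_context` can render that history as text alongside the fresh screenshot. The sketch below is illustrative - the exact prompt wording in kyros differs:

```python
def build_context(task, screenshot, active_windows, history):
    """Assemble the user message for the next iteration (illustrative sketch).

    Previous actions and their verification results are rendered as text so
    the model can see what worked; the current screenshot goes in as an image.
    """
    history_lines = [
        f"Step {i + 1}: {h['action']!r} -> {h['verification']}"
        for i, h in enumerate(history)
    ]
    text = (
        f"TASK: {task}\n\n"
        f"Currently active windows:\n{active_windows}\n\n"
        "Previous actions:\n" + ("\n".join(history_lines) or "(none)")
    )
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": screenshot}},
        ],
    }]
```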
As the agent takes more actions, the context grows. To prevent hitting token limits, we periodically compact the history:
```python
# Check if compaction is needed
if self.step_count >= trigger_steps or word_count >= trigger_words:
    self.compacted_context = compact_context(
        history,
        task,
        self.config_dict,
        self.websocket_callback
    )
    history = []  # Clear history after compaction
    self.step_count = 0
```
The compaction uses a smaller/faster LLM to summarize the action history into a condensed form that preserves the essential information.
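A minimal sketch of the idea, with `llm_summarize` standing in for the smaller-model call (the real `compact_context` in kyros also takes the config and websocket callback):

```python
def compact_context(history, task, llm_summarize):
    """Condense the action history into a short summary (illustrative sketch)."""
    # Render each step as one line the summarizer can scan
    transcript = "\n".join(
        f"- {h['action']} => {h['verification']}" for h in history
    )
    prompt = (
        f"Task: {task}\n"
        f"Actions so far:\n{transcript}\n\n"
        "Summarize progress: what worked, what failed, and what remains."
    )
    return llm_summarize(prompt)
```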
Here’s what a simple task execution looks like:
```python
from agents.gui_agent import GUIAgent

agent = GUIAgent(config_dict=config)
result = await agent.process_message({
    "content": "Open Firefox and navigate to google.com",
    "max_iterations": 20
})
```
The agent might produce actions like:
```python
# Step 1: Press the Super key to open the app launcher
tools.hotkey("super")

# Step 2: Type "firefox" to search for the app
tools.type("firefox")

# Step 3: Press Enter to launch Firefox
tools.hotkey("enter")

# Step 4: Wait for Firefox to open
tools.wait(2)

# Step 5: Click on the URL bar
tools.click(0.5, 0.05)

# Step 6: Type the URL
tools.type("google.com")

# Step 7: Press Enter to navigate
tools.hotkey("enter")

# Step 8: Task complete
tools.exit(summary="Opened Firefox and navigated to google.com", exitCode=0)
```
This single-agent architecture works, but it has limitations. The GUI agent is good at pixel-level interactions but not at high-level reasoning about complex tasks. It also can’t efficiently handle browser automation (clicking by coordinates is fragile compared to DOM selectors).
In Part 2, we’ll look at the multi-agent architecture: a BossAgent that breaks down tasks, a BrowserActionAgent that uses Playwright for web automation, and how they coordinate to handle complex workflows.