Computer-use agents are AI systems that can control a computer the way a human would - through mouse clicks, keyboard input, and visual observation. Unlike API-based automation, these agents work with any application that has a GUI, making them surprisingly general-purpose.
I built one called kyros as a side project. This post covers the core architecture: how the agent observes the screen, decides what to do, executes actions, and verifies results. Part 2 will cover multi-agent coordination.
Every computer-use agent follows the same fundamental pattern:
```
while task_not_complete:
    1. Observe - take a screenshot, get the window list
    2. Think   - send the state to the LLM, get the next action
    3. Act     - execute the action (click, type, etc.)
    4. Verify  - compare before/after screenshots
```
The key insight is that this loop is reactive - the agent doesn’t plan out every step in advance. It observes the current state, takes one action, observes the result, and decides the next action. This makes it robust to unexpected UI states.
Here’s the simplified loop from the GUI agent:
```python
async def process_message(self, message: Dict[str, Any]) -> Dict[str, Any]:
    task = message.get("content", "")
    max_iterations = message.get("max_iterations", 50)
    self.running = True
    history = []
    iteration = 0

    while self.running and iteration < max_iterations:
        iteration += 1

        # 1. Observe
        screenshot = self.get_screenshot_base64()
        active_windows = self.get_active_windows()

        # 2. Think - build context and call the LLM
        messages = self.build_context(task, screenshot, active_windows, history)
        action_code = self.generate_action(messages, system=self.get_system_prompt())

        # 3. Act
        screenshot_before = screenshot
        exec_result = self.execute_action(action_code)
        if not self.running:  # The exit action was called
            break

        # 4. Verify
        time.sleep(0.5)  # Wait for the UI to update
        screenshot_after = self.get_screenshot_base64()
        verification = await self.verify_action(screenshot_before, screenshot_after, action_code)

        history.append({
            "action": action_code,
            "result": exec_result,
            "verification": verification,
        })
```
The agent needs low-level control over mouse, keyboard, and screen capture. On Linux, I use Xlib for mouse control (it’s more reliable than pyautogui for X11), pyautogui for keyboard, and ImageMagick’s import command for screenshots.
Xlib gives direct access to the X11 server. The trick is using relative coordinates (0-1 range) so the agent doesn’t need to know screen resolution:
```python
import os
import time

import Xlib.display
import Xlib.X
import Xlib.ext.xtest

def click(x: float, y: float, button: int = 1, clicks: int = 1) -> dict:
    """Click at relative coordinates (0-1 range)"""
    display = Xlib.display.Display(os.environ.get('DISPLAY', ':0'))
    screen = display.screen()
    width = screen.width_in_pixels
    height = screen.height_in_pixels

    # Convert relative to absolute coordinates
    abs_x = int(x * width)
    abs_y = int(y * height)

    # Move the mouse
    root = screen.root
    root.warp_pointer(abs_x, abs_y)
    display.sync()

    # Perform clicks using the XTEST extension
    for _ in range(clicks):
        Xlib.ext.xtest.fake_input(display, Xlib.X.ButtonPress, button)
        display.sync()
        Xlib.ext.xtest.fake_input(display, Xlib.X.ButtonRelease, button)
        display.sync()
        if clicks > 1:
            time.sleep(0.05)  # Small delay between multiple clicks

    display.close()
    return {"stdout": "", "stderr": "", "exitCode": 0}
```
The XTEST extension is key - it injects synthetic input events that applications can’t distinguish from real input.
For keyboard input, pyautogui works well enough:
```python
import pyautogui

def type(text: str) -> dict:
    """Type text character by character"""
    pyautogui.write(text, interval=0.01)
    return {"stdout": "", "stderr": "", "exitCode": 0}

def hotkey(keys: str) -> dict:
    """Execute a hotkey combination. Example: 'super+r' or 'ctrl+alt+t'"""
    key_parts = keys.split('+')
    key_map = {
        'super': 'winleft',
        'ctrl': 'ctrl',
        'alt': 'alt',
        'shift': 'shift'
    }
    mapped_keys = [key_map.get(k.lower(), k.lower()) for k in key_parts]
    pyautogui.hotkey(*mapped_keys)
    return {"stdout": "", "stderr": "", "exitCode": 0}
```
For screenshots, I use ImageMagick's import command, which captures the X11 root window:
```python
def get_screenshot_base64(self) -> str:
    """Capture a screenshot and return it as a base64-encoded JPEG"""
    temp_fd, temp_path = tempfile.mkstemp(suffix='.png')
    os.close(temp_fd)
    try:
        env = os.environ.copy()
        env['DISPLAY'] = env.get('DISPLAY', ':0')
        subprocess.run(
            ["import", "-window", "root", temp_path],
            capture_output=True,
            timeout=2,
            env=env
        )

        # Convert to JPEG for a smaller payload (JPEG has no alpha channel)
        screenshot = Image.open(temp_path).convert("RGB")
        buffer = BytesIO()
        screenshot.save(buffer, format="JPEG", quality=75)
        buffer.seek(0)
        img_base64 = base64.b64encode(buffer.read()).decode('utf-8')
        return f"data:image/jpeg;base64,{img_base64}"
    finally:
        if os.path.exists(temp_path):
            os.unlink(temp_path)
```
I also overlay the cursor position onto the screenshot so the LLM knows where the mouse currently is:
```python
# Get the cursor position
display = Xlib.display.Display(':0')
root = display.screen().root
pointer = root.query_pointer()
cursor_x, cursor_y = pointer.root_x, pointer.root_y

# Overlay the cursor image (its alpha channel doubles as the paste mask)
cursor_img = Image.open('./cursor.png')
screenshot.paste(cursor_img, (cursor_x, cursor_y), cursor_img)
```
The agent needs to know what windows are open and be able to focus them. wmctrl is perfect for this:
```python
def get_active_windows(self) -> str:
    """Get the list of active windows"""
    result = subprocess.run(
        ["wmctrl", "-l"],
        capture_output=True,
        text=True,
        timeout=2
    )
    return result.stdout  # Format: window_id desktop host title

def focus_window(window_id: str) -> dict:
    """Focus a window by its ID"""
    subprocess.run(["wmctrl", "-i", "-a", window_id], timeout=2)

    # Move the mouse to the center of the window
    geom_result = subprocess.run(["wmctrl", "-l", "-G"], capture_output=True, text=True)
    # Parse geometry and move mouse...
```
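For the curious, here's a sketch of what that geometry parsing could look like. `wmctrl -l -G` prints one window per line as window_id, desktop, x, y, width, height, client host, title. The `parse_wmctrl_geometry` and `window_center` helpers below are illustrative, not the actual kyros code:

```python
def parse_wmctrl_geometry(output: str, window_id: str):
    """Find a window's (x, y, width, height) in `wmctrl -l -G` output.

    Returns None if the window id isn't listed.
    """
    for line in output.splitlines():
        parts = line.split(None, 7)  # title may contain spaces; split at most 7 times
        if len(parts) >= 6 and parts[0] == window_id:
            x, y, w, h = map(int, parts[2:6])
            return x, y, w, h
    return None

def window_center(geometry, screen_w: int, screen_h: int):
    """Convert a window's center to the relative (0-1) coordinates click() expects."""
    x, y, w, h = geometry
    return (x + w / 2) / screen_w, (y + h / 2) / screen_h
```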
All agents inherit from a base class that handles LLM communication:
```python
class BaseAgent(ABC):
    def __init__(
        self,
        agent_id: str = None,
        api_key: str = None,
        model: str = None,
        websocket_callback: Optional[Callable] = None,
        agent_name: str = None,
        config_dict: Dict[str, Any] = None
    ):
        self.agent_id = agent_id or str(uuid.uuid4())
        self.agent_name = agent_name

        # Load the config for this agent type
        agent_config = config.get_agent_config(agent_name, config_dict)
        self.api_key = api_key or agent_config.get("api_key")
        self.model = model or agent_config.get("model")
        self.api_provider = agent_config.get("api_provider", "openai")
        self.base_url = agent_config.get("base_url")

        # Initialize the client based on the provider
        if self.api_provider == "anthropic":
            self.client = Anthropic(api_key=self.api_key)
        else:
            self.client = OpenAI(api_key=self.api_key, base_url=self.base_url)

    @abstractmethod
    def get_system_prompt(self) -> str:
        pass

    @abstractmethod
    async def process_message(self, message: Dict[str, Any]) -> Dict[str, Any]:
        pass
```
The call_llm method handles both OpenAI and Anthropic APIs, with streaming support:
```python
def call_llm(
    self,
    messages: List[Dict[str, Any]],
    system: str = None,
    temperature: float = None,
    max_tokens: int = None,
    stream: bool = True
) -> str:
    response_text = ""
    if self.api_provider == "anthropic":
        anthropic_messages = self._convert_to_anthropic_format(messages)
        with self.client.messages.stream(
            model=self.model,
            messages=anthropic_messages,
            system=system or "",
            temperature=temperature,
            max_tokens=max_tokens
        ) as response:
            for text in response.text_stream:
                response_text += text
                self.send_llm_update("llm_content_chunk", {"content": text})
    else:
        # OpenAI streaming...
        pass
    return response_text
```
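The elided OpenAI branch follows the same shape, using the standard `chat.completions` streaming API. Since OpenAI has no separate `system` argument, the system prompt is prepended as a message. Here's a sketch factored as a standalone function - `stream_openai` and the `on_chunk` callback are illustrative stand-ins for the method body and `send_llm_update`:

```python
from typing import Any, Callable, Dict, List, Optional

def stream_openai(
    client: Any,
    model: str,
    messages: List[Dict[str, Any]],
    system: Optional[str] = None,
    on_chunk: Optional[Callable[[str], None]] = None,
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
) -> str:
    """Stream a chat completion and accumulate the full response text."""
    # OpenAI takes the system prompt as a leading message, not a kwarg
    openai_messages = ([{"role": "system", "content": system}] if system else []) + messages
    stream = client.chat.completions.create(
        model=model,
        messages=openai_messages,
        temperature=temperature,
        max_tokens=max_tokens,
        stream=True,
    )
    response_text = ""
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks (e.g. role headers) carry no text
            response_text += delta
            if on_chunk:
                on_chunk(delta)
    return response_text
```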
The GUI agent is the workhorse - it takes a task, observes the screen, and generates Python code to execute actions.
The system prompt defines what tools are available and how to use them:
````python
def get_system_prompt(self) -> str:
    return """
# Identity
You are a GUI Agent. Your job is to analyze the given screenshot and execute
the given TASK by performing step-by-step actions.

# Tools
- tools.focus_window(window_id): Focus a window by its ID
- tools.move(x, y): Move mouse to relative coordinates (0-1 range)
- tools.click(x, y, button=1, clicks=1): Click at relative coordinates
- tools.scroll(amount): Scroll (positive=down, negative=up)
- tools.type(text): Type the given text
- tools.hotkey(keys): Press a hotkey combination (e.g., 'super+r', 'ctrl+c')
- tools.wait(n): Wait for n seconds
- tools.exit(summary, exitCode): Exit when finished

# Rules
- Respond with executable Python code that calls ONE of these tools
- Only generate 1 action at a time
- Don't repeat the same action again and again
- Look at the "Currently active windows" list to determine which window to focus

# Example
```python
# Focus the Firefox browser window
tools.focus_window("0x02400003")
```
"""
````
### Action Execution
The agent generates Python code, which we execute in a controlled namespace:
```python
def execute_action(self, action_code: str) -> dict:
    """Execute the generated Python code"""
    # Extract code from markdown blocks, if present
    if "```python" in action_code:
        code = action_code.split("```python")[1].split("```")[0].strip()
    else:
        code = action_code.strip()

    result = {"stdout": "", "stderr": "", "exitCode": 0}
    try:
        namespace = {"tools": tools, "__action_result__": None}

        # Rewrite bare tool calls so their return values are captured
        lines = code.split('\n')
        modified_lines = []
        for line in lines:
            stripped = line.strip()
            if stripped.startswith('tools.') and not stripped.startswith('tools.exit('):
                indent = len(line) - len(line.lstrip())
                modified_lines.append(' ' * indent + f'__action_result__ = {stripped}')
            else:
                modified_lines.append(line)
        exec('\n'.join(modified_lines), namespace)

        if namespace.get("__action_result__"):
            tool_result = namespace["__action_result__"]
            result["stdout"] = tool_result.get("stdout", "")
            result["stderr"] = tool_result.get("stderr", "")
            result["exitCode"] = tool_result.get("exitCode", 0)
    except tools.ExitException as e:
        self.running = False
        result["stdout"] = e.message
        result["exitCode"] = e.exit_code
    return result
```
After each action, the agent compares before/after screenshots to verify success:
```python
async def verify_action(self, screenshot_before: str, screenshot_after: str, action: str) -> str:
    """Verify whether the action succeeded by comparing screenshots"""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"Action performed: {action}\n\nDid the action succeed?\n\nScreenshot before:"},
                {"type": "image_url", "image_url": {"url": screenshot_before}},
                {"type": "text", "text": "Screenshot after:"},
                {"type": "image_url", "image_url": {"url": screenshot_after}}
            ]
        }
    ]
    verification = self.call_llm(
        messages=messages,
        system="Compare two screenshots and determine if the action succeeded. Be concise.",
        max_tokens=500
    )
    return verification
```
This verification feeds back into the agent’s context for the next iteration, helping it understand what worked and what didn’t.
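Concretely, `build_context` can render that history as text alongside the fresh screenshot. The sketch below is illustrative - the exact prompt wording in kyros differs:

```python
def build_context(task, screenshot, active_windows, history):
    """Assemble the user message for the next iteration (illustrative sketch).

    Previous actions and their verification results are rendered as text so
    the model can see what worked; the current screenshot goes in as an image.
    """
    history_lines = [
        f"Step {i + 1}: {h['action']!r} -> {h['verification']}"
        for i, h in enumerate(history)
    ]
    text = (
        f"TASK: {task}\n\n"
        f"Currently active windows:\n{active_windows}\n\n"
        "Previous actions:\n" + ("\n".join(history_lines) or "(none)")
    )
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": screenshot}},
        ],
    }]
```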
As the agent takes more actions, the context grows. To prevent hitting token limits, we periodically compact the history:
```python
# Check if compaction is needed
if self.step_count >= trigger_steps or word_count >= trigger_words:
    self.compacted_context = compact_context(
        history,
        task,
        self.config_dict,
        self.websocket_callback
    )
    history = []  # Clear history after compaction
    self.step_count = 0
```
The compaction uses a smaller/faster LLM to summarize the action history into a condensed form that preserves the essential information.
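A minimal sketch of the idea, with `llm_summarize` standing in for the smaller-model call (the real `compact_context` in kyros also takes the config and websocket callback):

```python
def compact_context(history, task, llm_summarize):
    """Condense the action history into a short summary (illustrative sketch)."""
    # Render each step as one line the summarizer can scan
    transcript = "\n".join(
        f"- {h['action']} => {h['verification']}" for h in history
    )
    prompt = (
        f"Task: {task}\n"
        f"Actions so far:\n{transcript}\n\n"
        "Summarize progress: what worked, what failed, and what remains."
    )
    return llm_summarize(prompt)
```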
Here’s what a simple task execution looks like:
```python
from agents.gui_agent import GUIAgent

agent = GUIAgent(config_dict=config)
result = await agent.process_message({
    "content": "Open Firefox and navigate to google.com",
    "max_iterations": 20
})
```
The agent might produce actions like:
```python
# Step 1: Press the Super key to open the app launcher
tools.hotkey("super")

# Step 2: Type "firefox" to search for the app
tools.type("firefox")

# Step 3: Press Enter to launch Firefox
tools.hotkey("enter")

# Step 4: Wait for Firefox to open
tools.wait(2)

# Step 5: Click on the URL bar
tools.click(0.5, 0.05)

# Step 6: Type the URL
tools.type("google.com")

# Step 7: Press Enter to navigate
tools.hotkey("enter")

# Step 8: Task complete
tools.exit(summary="Opened Firefox and navigated to google.com", exitCode=0)
```
This single-agent architecture works, but it has limitations. The GUI agent is good at pixel-level interactions but not at high-level reasoning about complex tasks. It also can’t efficiently handle browser automation (clicking by coordinates is fragile compared to DOM selectors).
In Part 2, we’ll look at the multi-agent architecture: a BossAgent that breaks down tasks, a BrowserActionAgent that uses Playwright for web automation, and how they coordinate to handle complex workflows.