Looking for a typed client? The Desktop SDKs wrap this HTTP API for Python, JavaScript, and Go — no manual request wiring needed.
Overview
With BYOA, you own the agent loop. Nen exposes the desktop’s tools as a standard HTTP API that any agent framework can call — OpenAI, Anthropic, LangChain, CrewAI, or plain HTTP.
You get two endpoints on each desktop:
GET /desktops/{desktop_id}/tools — discover available tools as JSON Schema
POST /desktops/{desktop_id}/execute — execute a tool and get a screenshot back
Your agent runs anywhere — your laptop, your cloud, your customer’s infra. Nen is just the execution backend.
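In practice, a full round trip is just two HTTP calls. A minimal sketch with httpx (placeholder desktop ID and key; the Quick Start below walks through the same pattern inside a real agent loop):

import httpx

BASE_URL = "https://desktop.api.getnen.ai/desktops/dsk_abc123def456"  # your desktop ID
headers = {"Authorization": "Bearer your_nen_api_key"}

# Discover the desktop's tools as JSON Schema definitions
tools = httpx.get(f"{BASE_URL}/tools", headers=headers).json()

# Execute one of them (a screenshot) and read the result, including a base64 PNG
result = httpx.post(f"{BASE_URL}/execute", json={
    "action": {"tool": "computer", "action": "screenshot", "params": {}}
}, headers=headers).json()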
BYOA vs Nen Agent
| | BYOA | Nen Agent |
| --- | --- | --- |
| Agent loop | Yours | Nen’s |
| Language | Any (HTTP API) | Python |
| Agent environment | Managed by you | Managed by Nen |
| Best for | You want computer use tightly coupled with your agentic loop | You want to define high-level tasks with built-in observability |
Use BYOA when you want full control over the agent loop and LLM choice. Use Nen Agent when you want Nen to handle orchestration for you.
Authentication
Get your API key from the Nen Dashboard
Pass it as a Bearer token in the Authorization header on every request
The same API key works for both the desktop API and the managed-workflows API — only the header name differs between them (see the API reference).
Quick Start
A complete agent loop using Claude’s tool calling API:
uv init && uv add httpx anthropic
import httpx
import anthropic

DESKTOP_ID = "dsk_abc123def456"
BASE_URL = f"https://desktop.api.getnen.ai/desktops/{DESKTOP_ID}"
NEN_API_KEY = "your_nen_api_key"
ANTHROPIC_API_KEY = "your_anthropic_api_key"

headers = {"Authorization": f"Bearer {NEN_API_KEY}"}

# 1. Discover tools and convert to Anthropic format
tools = httpx.get(f"{BASE_URL}/tools", headers=headers).json()
anthropic_tools = [
    {"name": t["name"], "description": t["description"], "input_schema": t["parameters"]}
    for t in tools
]

# 2. Take initial screenshot
initial = httpx.post(f"{BASE_URL}/execute", json={
    "action": {"tool": "computer", "action": "screenshot", "params": {}}
}, headers=headers).json()

# 3. Run the agent loop
llm = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
messages = [{"role": "user", "content": [
    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": initial["base64_image"]}},
    {"type": "text", "text": "Open Firefox and navigate to google.com"},
]}]

for step in range(10):
    response = llm.messages.create(
        model="claude-sonnet-4-6", max_tokens=1024,
        tools=anthropic_tools, messages=messages,
    )

    # Check if the model wants to use a tool
    tool_calls = [b for b in response.content if b.type == "tool_use"]
    if not tool_calls:
        print("Agent finished:", [b.text for b in response.content if b.type == "text"])
        break

    # Execute each tool call and feed results back
    messages.append({"role": "assistant", "content": response.content})
    tool_results = []
    for tc in tool_calls:
        # Anthropic gives us {name, input}; the desktop API expects
        # {action: {tool, action, params}}.
        params = {k: v for k, v in tc.input.items() if k != "action"}
        action = {"tool": tc.name, "action": tc.input["action"], "params": params}
        result = httpx.post(f"{BASE_URL}/execute", json={"action": action},
                            headers=headers, timeout=30).json()

        content = []
        if result.get("output"):
            content.append({"type": "text", "text": result["output"]})
        if result.get("base64_image"):
            content.append({"type": "image", "source": {
                "type": "base64", "media_type": "image/png", "data": result["base64_image"]
            }})
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": tc.id,
            "content": content or [{"type": "text", "text": "Done"}],
        })

    messages.append({"role": "user", "content": tool_results})
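In a real agent you will likely want a guard around /execute instead of calling .json() and hoping for the best. A minimal sketch, assuming that any status other than "ok" (or an HTTP error) means the action failed; execute_or_raise is an illustrative helper name, not part of the API:

def execute_or_raise(action: dict, timeout: float = 30.0) -> dict:
    """POST one action to the desktop, raising if it did not succeed."""
    resp = httpx.post(f"{BASE_URL}/execute", json={"action": action},
                      headers=headers, timeout=timeout)
    resp.raise_for_status()  # HTTP-level failures: bad key, unknown desktop, etc.
    result = resp.json()
    if result.get("status") != "ok":  # assumption: non-"ok" statuses indicate failure
        raise RuntimeError(f"{action['action']} failed: {result}")
    return result

The range(10) cap in the loop above is only a safety limit on the number of agent steps; raise it for longer tasks.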
GET /desktops/{desktop_id}/tools returns tool definitions in JSON Schema format, compatible with all major LLM providers:
[
  {
    "name": "computer",
    "description": "Control the computer's mouse, keyboard, and screen.",
    "parameters": {
      "type": "object",
      "properties": {
        "action": {
          "type": "string",
          "enum": ["screenshot", "key", "type", "mouse_move", "left_click", "right_click", "double_click", "scroll", "wait", "hold_key", "cursor_position"]
        },
        "text": {"type": "string", "description": "Text to type or key combo (e.g. 'Return', 'ctrl+a')"},
        "coordinate": {"type": "array", "items": {"type": "integer"}, "description": "[x, y] screen coordinates"},
        "scroll_direction": {"type": "string", "enum": ["up", "down", "left", "right"]},
        "scroll_amount": {"type": "integer"},
        "duration": {"type": "number", "description": "Duration in seconds for wait or hold_key"}
      },
      "required": ["action"]
    }
  }
]
Pass these directly to your LLM’s tool/function calling API.
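The Quick Start maps these into Anthropic's input_schema shape; for OpenAI-style function calling the mapping is equally mechanical. A rough sketch using the Chat Completions tools format (adjust to your SDK version):

openai_tools = [
    {
        "type": "function",
        "function": {
            "name": t["name"],
            "description": t["description"],
            "parameters": t["parameters"],  # already JSON Schema, passed through as-is
        },
    }
    for t in tools  # the list returned by GET /desktops/{desktop_id}/tools
]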
POST /desktops/{desktop_id}/execute runs a tool and returns the result plus a post-action screenshot:
Request:
{
  "action": {
    "tool": "computer",
    "action": "left_click",
    "params": {"coordinate": [500, 300]}
  }
}
Response:
{
  "status": "ok",
  "output": "clicked at (500, 300)",
  "base64_image": "iVBORw0KGgo..."
}
status is always present. output, base64_image, and coordinate are included only when the underlying action produces them (e.g. screenshot always returns base64_image; a plain left_click may return just {"status": "ok"}). See Tools Overview for the full reference.
The screenshot after each action is how your agent observes the desktop. Feed it back to your VLM to decide the next step.
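While iterating on prompts it helps to dump those screenshots to disk so you can replay what the agent saw at each step. A small sketch (the file layout is arbitrary):

import base64
from pathlib import Path

def save_screenshot(result: dict, step: int) -> None:
    """Write the post-action screenshot, if any, to steps/step_<n>.png."""
    if not result.get("base64_image"):
        return
    out = Path("steps") / f"step_{step:03d}.png"
    out.parent.mkdir(exist_ok=True)
    out.write_bytes(base64.b64decode(result["base64_image"]))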
Next Steps
Workflow SDK: Skip the HTTP and use the Python SDK instead
Remote Execution: How SDK execution and debugging works