"Explore the office for 2 minutes, then start security patrol - if you see anyone with a hoodie, follow them." That's it. This one prompt is the entire setup. The world's greatest security system on a robot cheaper than your MacBook.
We've been running this at the Dimensional SF office every night. At 2am the agent auto-triggers the security patrol, so no one needs to be in the office to activate it.
In this article:
- The setup and what the robot does
- Physical AI ecosystem
- How we connect agents to hardware and how it works
- What you can build with DimOS
- Builder Fellowship opportunity
Once the prompt is given, the agent reads that sentence, figures out what it needs to do, and connects itself to the robot's perception, navigation, and control systems to carry it out. The robot explores the space, builds a map, switches to patrol mode, runs person detection on every camera frame, and follows anyone matching the description. Fully open source.
What physical AI actually looks like
There's a lot of talk about agents right now. Most of it is about digital tasks: browsing, coding, managing workflows, sending emails. That's useful, but it stops at the screen. The physical world has been mostly untouched.
At Dimensional, we've been working on the layer that connects agents to physical systems. It is an agentic operating system that sits between your AI agent and whatever hardware you're working with (robots, drones, cameras, lidar, actuators) and gives the agent direct access to perceive, navigate, and act in physical space.
The security robot is one example of what this looks like in practice. The agent doesn't just process text; it has access to spatial memory, real-time perception, locomotion, and the ability to make decisions based on what it sees. When you give it a natural language instruction, it maps that instruction onto physical capabilities and executes.
This is what we mean by physical AI. Not a robot following a pre-programmed routine. An agent that reasons about a physical environment and acts in it, the same way agents already reason about digital environments.
How it goes from a prompt to patrol
When the agent receives the prompt "Explore the office for 2 minutes, then start security patrol - if you see anyone with a hoodie, follow them," it doesn't look up a pre-built security program.
The agent isn't following a script; it's reasoning. When you give it that security instruction, the LangGraph-powered agent decomposes it into three components:
- An exploration phase: 2 minutes of autonomous mapping
- A mode switch to patrol, with a detection condition (a person matching a description)
- A response: follow the person
It then maps each component to available skills on the robot and starts calling them in sequence. The skills themselves are Python methods on the SecurityModule, each decorated with @skill. That decorator exposes the method's signature and documentation as a tool schema the LLM can read and invoke. The agent decides which skills to call and when, based on its reasoning about the instruction and the situation.
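To make the decorator idea concrete, here is a minimal sketch of how a skill decorator could turn a method's signature and docstring into a tool schema. This is an illustration only, not the DimOS implementation: the `skill` function below, the `tool_schema` attribute, the `list_tools` helper, and the parameter names on `start_patrol` and `trigger_alert` are all assumptions made for the example.

```python
import inspect
from typing import Callable, get_type_hints

# Mapping from Python annotations to JSON-schema type names (illustrative subset).
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def skill(func: Callable) -> Callable:
    """Hypothetical stand-in for a skill decorator: attach a tool schema to func."""
    hints = get_type_hints(func)
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        if name == "self":
            continue
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name, str), "string")}
        # Parameters without a default must be supplied by the caller.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    func.tool_schema = {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }
    return func


class SecurityModule:
    """Toy module; the method signatures here are assumed, not taken from DimOS."""

    @skill
    def start_patrol(self, duration_s: float = 120.0) -> str:
        """Start patrolling the mapped area."""
        return f"patrolling for {duration_s:.0f}s"

    @skill
    def trigger_alert(self, message: str) -> str:
        """Send an alert describing what the robot saw."""
        return f"alert sent: {message}"


def list_tools(module: object) -> list:
    """Collect the schemas of every decorated method, sorted by method name."""
    return [
        method.tool_schema
        for _, method in inspect.getmembers(module, inspect.ismethod)
        if hasattr(method, "tool_schema")
    ]
```

Calling `list_tools(SecurityModule())` yields one schema per decorated method, which is the kind of structure an LLM tool-calling layer typically consumes.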
This is why changing the sentence changes the behavior entirely. "Patrol the warehouse, if you see a package on the floor, stop and take a photo": same module, same navigation stack, same detection pipeline, completely different outcome. The intelligence isn't hard-coded in the module. It lives in the agent's interpretation of your words.
The 2am auto-trigger is the same idea. The agent understands temporal constraints natively: tell it when to start and it schedules accordingly. No cron job, no separate scheduler. It's part of the prompt.
The stack running on the robot
On the hardware side, the SecurityModule runs a three-state machine: IDLE → PATROLLING → FOLLOWING.
The robot waits in IDLE. When start_patrol is triggered, it executes a route using Dimensional's spatial navigation stack. While PATROLLING, YOLO person detection runs continuously on every camera frame, producing bounding boxes, confidence scores, and class labels. When a person is detected, the re-identification system evaluates whether they match the description by comparing the crop against reference features using cosine similarity. This is full-body appearance matching, not facial recognition. It works from behind, at odd angles, and in low light.
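The matching step can be sketched in a few lines. This is a simplified illustration, assuming the appearance features are plain lists of floats; the `matches_reference` helper and the 0.7 threshold are made up for the example and are not values from DimOS.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_reference(
    crop_features: Sequence[float],
    reference_features: Sequence[float],
    threshold: float = 0.7,  # assumed value, chosen only for illustration
) -> bool:
    """Treat the detected person as a match when similarity reaches the threshold."""
    return cosine_similarity(crop_features, reference_features) >= threshold
```

Because cosine similarity depends only on the direction of the vectors, not their magnitude, two crops of the same person at different scales can still score close to 1.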
On a positive detection, the state transitions to FOLLOWING. Visual servoing takes over: the robot computes the pixel offset between the person's bounding box center and the camera frame center, applies proportional gain to convert that into angular and linear velocity commands, and sends them to the base controller at 10 Hz. The robot rotates to keep the person centered while adjusting distance. Simultaneously, the agent calls trigger_alert, which sends an alert via webhook, SMS, or simulated emergency call.
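A proportional controller of this kind might look like the sketch below. The gains, velocity limits, and sign convention are assumptions for illustration. In particular, using bounding-box height as a stand-in for distance is our simplification for the example; the text above does not say how the linear command is derived.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BBox:
    """Bounding box in pixel coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def clamp(value: float, limit: float) -> float:
    """Restrict value to the range [-limit, limit]."""
    return max(-limit, min(limit, value))


def follow_command(
    bbox: BBox,
    frame_width: int,
    frame_height: int,
    k_angular: float = 1.5,           # assumed proportional gain for turning
    k_linear: float = 0.8,            # assumed proportional gain for forward motion
    target_height_frac: float = 0.6,  # assumed box height at the desired distance
    max_angular: float = 1.0,         # rad/s limit (assumed)
    max_linear: float = 0.5,          # m/s limit (assumed)
) -> Tuple[float, float]:
    """Return (linear, angular) velocity that re-centers the person in the frame."""
    center_x = (bbox.x_min + bbox.x_max) / 2.0
    # Horizontal offset normalized to [-1, 1]; positive means the person is right of center.
    offset = (center_x - frame_width / 2.0) / (frame_width / 2.0)
    # Assumed convention: negative angular velocity turns the robot to the right.
    angular = clamp(-k_angular * offset, max_angular)
    # A smaller box than the target means the person is farther away, so move forward.
    height_frac = (bbox.y_max - bbox.y_min) / frame_height
    linear = clamp(k_linear * (target_height_frac - height_frac), max_linear)
    return linear, angular
```

At 10 Hz, a function like this would be called once every 100 ms with the latest detection, and its output forwarded to the base controller.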
The full pipeline runs as a single loop inside one module. One sentence gets decomposed into patrol waypoints, detection thresholds, and alert triggers, wired together automatically through Dimensional's stream-based architecture.
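As a rough sketch of that loop, the three states and their transitions can be modeled like this. The class and method names are hypothetical, and the per-frame input is reduced to a single boolean standing in for "detection plus re-identification succeeded". What happens when the target is lost is not described above, so this sketch simply stays in FOLLOWING.

```python
from enum import Enum, auto
from typing import List


class State(Enum):
    IDLE = auto()
    PATROLLING = auto()
    FOLLOWING = auto()


class PatrolStateMachine:
    """Toy model of the IDLE -> PATROLLING -> FOLLOWING progression."""

    def __init__(self) -> None:
        self.state = State.IDLE
        self.alerts: List[str] = []

    def start_patrol(self) -> None:
        """Leave IDLE and begin patrolling; ignored in any other state."""
        if self.state is State.IDLE:
            self.state = State.PATROLLING

    def step(self, match_found: bool) -> State:
        """Process one camera frame and return the resulting state."""
        if self.state is State.PATROLLING and match_found:
            self.state = State.FOLLOWING
            # Stand-in for the alert call; a real system would notify someone here.
            self.alerts.append("person matching description detected")
        return self.state
```

Keeping the transitions in one place like this is what lets a single loop drive patrol, detection, and alerting together.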
What you can build with DimOS
The security use case is one configuration. The same architecture supports anything where an agent needs to perceive a physical space and act on what it finds.
- Inspection: "Walk the factory floor every morning and flag any equipment that looks damaged."
- Hospitality: "Greet anyone who walks through the front door and guide them to the conference room."
- Monitoring: "Watch the warehouse overnight and send me a summary of anything that moved."
Each of these is a different prompt on the same stack. The perception pipeline, navigation, and control systems are shared. What changes is the agent's interpretation of what you asked for.
The full stack is at github.com/dimensionalOS/dimos. It runs on a Unitree Go2, in MuJoCo simulation, or on recorded data for testing without hardware. Get started with one command:
curl -fsSL https://raw.githubusercontent.com/dimensionalOS/dimos/main/scripts/install.sh | bash
Join our Discord, share what you build and hang out with fellow builders!
We're also running a hackathon in Shanghai next week. Check out the details here.













