Visual Intelligence & Computer Use
Proteus is an AI research lab building the next generation of computer use agents, combining fast visual understanding, efficient context compression, and frontier reasoning to let machines operate any device, any interface, autonomously.
Our Thesis
01 — Fast perception
Fine-tuned compact models for instant visual grounding and action prediction, paired with large VLMs that handle high-level planning and reasoning when it matters.
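In rough pseudocode (the models, threshold, and stub functions below are illustrative stand-ins, not our production stack), the cascade looks like this:

```python
from dataclasses import dataclass
import random

@dataclass
class Action:
    kind: str                 # "click", "type", "scroll", ...
    target: tuple[int, int]   # screen coordinates
    confidence: float         # grounding model's self-reported confidence

def compact_ground(screenshot: bytes, goal: str) -> Action:
    """Stand-in for a fine-tuned compact grounding model (fast path)."""
    return Action("click", (412, 88), confidence=random.random())

def frontier_plan(screenshot: bytes, goal: str) -> Action:
    """Stand-in for a frontier VLM planner (slow, accurate path)."""
    return Action("click", (412, 88), confidence=1.0)

def next_action(screenshot: bytes, goal: str, threshold: float = 0.8) -> Action:
    action = compact_ground(screenshot, goal)
    if action.confidence < threshold:   # escalate only when the fast model is unsure
        action = frontier_plan(screenshot, goal)
    return action
```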
02 — Efficient encoding
We use fast OCR, segmentation, and video models to build optimized visual representations that eliminate the redundancy of feeding raw screenshots into frontier models: faster inference, longer horizons, lower cost.
03 — Continuous operation
Agents that run autonomously across diverse, long-horizon workflows, navigating real interfaces on desktop, mobile, and the web without human hand-holding.
Demo
Our agents operate across desktop, web, and mobile platforms, including iPhone and Android. Here's a preview of autonomous operation with real-time visual understanding.
Android & iPhone agent · Live demo coming soon
Rapid computer use on Android and iPhone. The agent sees the screen, reasons about what to do, and takes action, navigating apps, filling forms, and completing multi-step tasks with no predefined scripts.
Powered by compact vision models for fast grounding and frontier LLMs for planning, with custom optimizations for efficient visual context that keep latency low and accuracy high.
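A minimal sketch of that loop, assuming hypothetical `device` and `agent` wrappers rather than our actual interfaces:

```python
# Hypothetical perceive -> reason -> act loop; `device` and `agent`
# stand in for a mobile driver and a model wrapper, not a real API.
def run_task(device, agent, goal: str, max_steps: int = 50) -> bool:
    for _ in range(max_steps):
        screenshot = device.capture()            # perceive: grab the screen
        action = agent.decide(screenshot, goal)  # reason: pick the next action
        if action.kind == "done":
            return True                          # task reported complete
        device.execute(action)                   # act: tap, type, scroll, ...
    return False                                 # step budget exhausted
```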
Native computer use on Windows desktops. Operates Win32 and UWP apps, manages files, runs multi-step workflows across the OS.
Full desktop automation on Ubuntu Linux. Navigates GNOME, terminal, and GUI apps with the same visual understanding pipeline.
Seamless computer use on macOS. Controls native apps, Finder, and system interfaces through real-time screen understanding.
Platform
We ship production systems used by pretraining teams and social platforms to fight misinformation and protect creator rights.
01
Efficient codec-based video encoder models for continuous screen understanding. Streams desktop and mobile frames through learned compression, maintaining temporal context without encoding every raw screenshot.
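The learned codec itself isn't shown here; as a simplified stand-in for the same idea, a tile-level diff re-encodes only the regions of the screen that changed between frames:

```python
import numpy as np

TILE = 64  # tile edge in pixels; illustrative

def changed_tiles(prev: np.ndarray, curr: np.ndarray, thresh: float = 2.0):
    """Yield (row, col, tile) for tiles whose content changed.

    A learned codec would emit latent tokens instead of raw tiles; the
    point is the same: static regions of the screen cost nothing.
    """
    h, w = curr.shape[:2]
    for y in range(0, h - h % TILE, TILE):
        for x in range(0, w - w % TILE, TILE):
            a = prev[y:y+TILE, x:x+TILE].astype(np.int16)
            b = curr[y:y+TILE, x:x+TILE].astype(np.int16)
            if np.abs(a - b).mean() > thresh:
                yield y // TILE, x // TILE, curr[y:y+TILE, x:x+TILE]
```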
02
3× faster than state-of-the-art OCR and segmentation models, purpose-built for UI understanding. Extracts text, bounding boxes, and semantic layout from screens.
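The element schema below is an illustrative guess, not our actual output format; it shows how a screen flattens into text that costs far fewer tokens than raw pixels:

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    text: str                        # OCR'd label, if any
    bbox: tuple[int, int, int, int]  # x, y, width, height in pixels
    role: str                        # "button", "input", "link", ... from segmentation

def serialize(elements: list[UIElement]) -> str:
    """Flatten a screen into a compact text representation for an LLM.

    A full-resolution screenshot costs thousands of image tokens; a few
    dozen lines like these cost a few hundred text tokens.
    """
    return "\n".join(
        f'[{i}] {e.role} "{e.text}" @ {e.bbox}' for i, e in enumerate(elements)
    )
```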
03
Planning harness built on foundation models and fine-tuned VLMs with extensive computer use context baked into LoRA adapters. Decomposes complex goals into executable action sequences across any interface.
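As a sketch only, with an invented prompt and JSON action schema, a harness like this asks the model for a structured plan and validates it before execution:

```python
import json

# Illustrative prompt; the real harness's format is not shown here.
PLAN_PROMPT = """Goal: {goal}
Screen:
{screen}
Respond with a JSON list of actions, e.g.
[{{"kind": "click", "element": 3}}, {{"kind": "type", "element": 5, "text": "..."}}]"""

ALLOWED = {"click", "type", "scroll", "done"}

def plan(llm, goal: str, screen: str) -> list[dict]:
    """`llm` is any callable prompt -> text; the schema check keeps the
    executor safe from malformed model output."""
    actions = json.loads(llm(PLAN_PROMPT.format(goal=goal, screen=screen)))
    assert all(a.get("kind") in ALLOWED for a in actions), "unknown action kind"
    return actions
```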
04
Screen recording software we built at screencap.sh. Captures action-labeled desktop and mobile recordings for training data collection at scale.
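A hypothetical record layout for such captures (not screencap.sh's actual schema) pairs each frame with the input event that followed it:

```python
import json, time

def record_step(out, frame_path: str, event: dict) -> None:
    """Append one action-labeled step to a JSONL trajectory file.

    `event` might look like {"kind": "click", "x": 412, "y": 88};
    the schema is illustrative, not screencap.sh's actual format.
    """
    out.write(json.dumps({
        "t": time.time(),     # wall-clock timestamp of the frame
        "frame": frame_path,  # path to the captured screenshot
        "action": event,      # the user input observed after the frame
    }) + "\n")
```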
05
Offline RL training on our trajectory recordings paired with online RL training using OS-level forking for GRPO rollouts. Fork the OS state, explore multiple action trajectories in parallel, and learn from the outcomes in simulation.
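The group-relative advantage at the heart of GRPO is standard; the forking around it is sketched below with a stubbed `rollout` and a hypothetical `snapshot.fork()`:

```python
import random, statistics
from dataclasses import dataclass

@dataclass
class Rollout:
    actions: list
    reward: float

def rollout(os_fork, policy) -> Rollout:
    """Stub for executing `policy` inside a forked OS snapshot."""
    return Rollout(actions=[], reward=random.random())

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO's group-relative advantage: normalize each rollout's reward
    by the mean and std of its own group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def collect_group(snapshot, policy, group_size: int = 8):
    """Fork the same OS state `group_size` times and score the policy's
    trajectories against each other. `snapshot.fork()` is a hypothetical
    handle for OS-level state forking."""
    rollouts = [rollout(snapshot.fork(), policy) for _ in range(group_size)]
    return list(zip(rollouts, grpo_advantages([r.reward for r in rollouts])))
```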
06
Optimized perceptual hashing with DinoHash, a state-of-the-art hash we invented that outperforms Meta's and Apple's best algorithms. Combined with aggressive VLM caching and inference optimizations, it minimizes time to first token and redundant computation across frames.
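DinoHash itself is covered in the papers below; the caching pattern that sits on top of any perceptual hash can be sketched like this, with `describe` standing in for the expensive VLM call:

```python
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

class FrameCache:
    """Skip VLM calls when the current frame hashes near a seen frame.

    `phash` is any perceptual hash returning an int (e.g. a DinoHash-style
    fingerprint); `describe` is a placeholder for the expensive VLM call.
    The linear scan keeps the sketch simple.
    """
    def __init__(self, phash, describe, max_dist: int = 4):
        self.phash, self.describe, self.max_dist = phash, describe, max_dist
        self.entries: dict[int, str] = {}   # hash -> cached VLM output

    def get(self, frame) -> str:
        h = self.phash(frame)
        for seen, out in self.entries.items():
            if hamming(h, seen) <= self.max_dist:  # near-duplicate frame
                return out                         # reuse cached answer
        out = self.describe(frame)                 # cache miss: call the VLM
        self.entries[h] = out
        return out
```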
Team
Select papers our team members published prior to Proteus.
ICML 2025 · CODEML Workshop
A unified framework combining DinoHash perceptual hashing, multi-party fully homomorphic encryption, and AI detection models for privacy-preserving content provenance at internet scale.
+12% bit accuracy over SOTA · Read paper →
ICML 2025 · CODEML Workshop
The first open-source, adversarially trained neural perceptual hash. Leverages DINOv2 features to produce compact image fingerprints that resist compression, cropping, and adversarial manipulation.
Open source · PyTorch / ONNX / npm · Read paper →
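DinoHash is adversarially trained; stripped of that training, the features-to-bits idea reduces to signed projections of a DINOv2 embedding, roughly:

```python
import numpy as np

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((96, 768))  # 96 hyperplanes over a 768-d feature

def hash_bits(feature: np.ndarray) -> np.ndarray:
    """Sign of projected features -> 96-bit fingerprint.

    `feature` stands in for a DINOv2 CLS embedding; the real DinoHash
    learns robustness adversarially rather than using random planes.
    """
    return (PROJ @ feature > 0).astype(np.uint8)
```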
We also contributed to work on decoding visual imagery via fNIRS, adversarial examples, and reinforcement learning.
We partner with teams building computer use agents, visual AI infrastructure, and autonomous systems that need to see and act in the real world.