Visual Intelligence & Computer Use
Proteus is an AI research lab building the next generation of computer use agents, combining fast visual understanding, efficient context compression, and frontier reasoning to let machines operate any device, any interface, autonomously.
Our Thesis
01 — Fast perception
Fine-tuned compact models for instant visual grounding and action prediction, paired with large VLMs that handle high-level planning and reasoning when it matters.
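In rough pseudocode (the models, threshold, and stub functions below are illustrative stand-ins, not our production stack), the cascade looks like this:

```python
from dataclasses import dataclass
import random

@dataclass
class Action:
    kind: str                 # "click", "type", "scroll", ...
    target: tuple[int, int]   # screen coordinates
    confidence: float         # grounding model's self-reported confidence

def compact_ground(screenshot: bytes, goal: str) -> Action:
    """Stand-in for a fine-tuned compact grounding model (fast path)."""
    return Action("click", (412, 88), confidence=random.random())

def frontier_plan(screenshot: bytes, goal: str) -> Action:
    """Stand-in for a frontier VLM planner (slow, accurate path)."""
    return Action("click", (412, 88), confidence=1.0)

def next_action(screenshot: bytes, goal: str, threshold: float = 0.8) -> Action:
    action = compact_ground(screenshot, goal)
    if action.confidence < threshold:   # escalate only when the fast model is unsure
        action = frontier_plan(screenshot, goal)
    return action
```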
02 — Efficient encoding
We use fast OCR, segmentation, and video models to build optimized visual representations that eliminate the redundancy of feeding raw screenshots into frontier models: faster inference, longer horizons, lower cost.
03 — Continuous operation
Agents that run autonomously across diverse, long-horizon workflows, navigating real interfaces on desktop, mobile, and the web without human hand-holding.
Demo
Our agents operate across desktop, web, and mobile platforms, including iPhone and Android. Here's a preview of autonomous operation with real-time visual understanding.
Android & iPhone agent · Live demo coming soon
Rapid computer use on Android and iPhone. The agent sees the screen, reasons about what to do, and takes action, navigating apps, filling forms, and completing multi-step tasks with no predefined scripts.
Powered by compact vision models for fast grounding and frontier LLMs for planning, with custom optimizations for efficient visual context that keep latency low and accuracy high.
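A minimal sketch of that loop, assuming hypothetical `device` and `agent` wrappers rather than our actual interfaces:

```python
# Hypothetical perceive -> reason -> act loop; `device` and `agent`
# stand in for a mobile driver and a model wrapper, not a real API.
def run_task(device, agent, goal: str, max_steps: int = 50) -> bool:
    for _ in range(max_steps):
        screenshot = device.capture()            # perceive: grab the screen
        action = agent.decide(screenshot, goal)  # reason: pick the next action
        if action.kind == "done":
            return True                          # task reported complete
        device.execute(action)                   # act: tap, type, scroll, ...
    return False                                 # step budget exhausted
```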
Native computer use on Windows desktops. Operates Win32 and UWP apps, manages files, runs multi-step workflows across the OS.
Full desktop automation on Ubuntu Linux. Navigates GNOME, terminal, and GUI apps with the same visual understanding pipeline.
Seamless computer use on macOS. Controls native apps, Finder, and system interfaces through real-time screen understanding.
Platform
We ship production systems used by pretraining teams and social platforms to fight misinformation and protect creator rights.
01
Efficient codec-based video encoder models for continuous screen understanding. Streams desktop and mobile frames through learned compression, maintaining temporal context without encoding every raw screenshot.
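The learned codec itself isn't shown here; as a simplified stand-in for the same idea, a tile-level diff re-encodes only the regions of the screen that changed between frames:

```python
import numpy as np

TILE = 64  # tile edge in pixels; illustrative

def changed_tiles(prev: np.ndarray, curr: np.ndarray, thresh: float = 2.0):
    """Yield (row, col, tile) for tiles whose content changed.

    A learned codec would emit latent tokens instead of raw tiles; the
    point is the same: static regions of the screen cost nothing.
    """
    h, w = curr.shape[:2]
    for y in range(0, h - h % TILE, TILE):
        for x in range(0, w - w % TILE, TILE):
            a = prev[y:y+TILE, x:x+TILE].astype(np.int16)
            b = curr[y:y+TILE, x:x+TILE].astype(np.int16)
            if np.abs(a - b).mean() > thresh:
                yield y // TILE, x // TILE, curr[y:y+TILE, x:x+TILE]
```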
02
3× faster than state-of-the-art OCR and segmentation models, purpose-built for UI understanding. Extracts text, bounding boxes, and semantic layout from screens.
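The element schema below is an illustrative guess, not our actual output format; it shows how a screen flattens into text that costs far fewer tokens than raw pixels:

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    text: str                        # OCR'd label, if any
    bbox: tuple[int, int, int, int]  # x, y, width, height in pixels
    role: str                        # "button", "input", "link", ... from segmentation

def serialize(elements: list[UIElement]) -> str:
    """Flatten a screen into a compact text representation for an LLM.

    A full-resolution screenshot costs thousands of image tokens; a few
    dozen lines like these cost a few hundred text tokens.
    """
    return "\n".join(
        f'[{i}] {e.role} "{e.text}" @ {e.bbox}' for i, e in enumerate(elements)
    )
```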
03
Planning harness built on foundation models and fine-tuned VLMs with extensive computer use context baked into LoRA adapters. Decomposes complex goals into executable action sequences across any interface.
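As a sketch only, with an invented prompt and JSON action schema, a harness like this asks the model for a structured plan and validates it before execution:

```python
import json

# Illustrative prompt; the real harness's format is not shown here.
PLAN_PROMPT = """Goal: {goal}
Screen:
{screen}
Respond with a JSON list of actions, e.g.
[{{"kind": "click", "element": 3}}, {{"kind": "type", "element": 5, "text": "..."}}]"""

ALLOWED = {"click", "type", "scroll", "done"}

def plan(llm, goal: str, screen: str) -> list[dict]:
    """`llm` is any callable prompt -> text; the schema check keeps the
    executor safe from malformed model output."""
    actions = json.loads(llm(PLAN_PROMPT.format(goal=goal, screen=screen)))
    assert all(a.get("kind") in ALLOWED for a in actions), "unknown action kind"
    return actions
```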
04
Screen recording software we built at screencap.sh. Captures action-labeled desktop and mobile recordings for training data collection at scale.
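A hypothetical record layout for such captures (not screencap.sh's actual schema) pairs each frame with the input event that followed it:

```python
import json, time

def record_step(out, frame_path: str, event: dict) -> None:
    """Append one action-labeled step to a JSONL trajectory file.

    `event` might look like {"kind": "click", "x": 412, "y": 88};
    the schema is illustrative, not screencap.sh's actual format.
    """
    out.write(json.dumps({
        "t": time.time(),     # wall-clock timestamp of the frame
        "frame": frame_path,  # path to the captured screenshot
        "action": event,      # the user input observed after the frame
    }) + "\n")
```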
05
Offline RL training on our trajectory recordings paired with online RL training using OS-level forking for GRPO rollouts. Fork the OS state, explore multiple action trajectories in parallel, and learn from the outcomes in simulation.
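The group-relative advantage at the heart of GRPO is standard; the forking around it is sketched below with a stubbed `rollout` and a hypothetical `snapshot.fork()`:

```python
import random, statistics
from dataclasses import dataclass

@dataclass
class Rollout:
    actions: list
    reward: float

def rollout(os_fork, policy) -> Rollout:
    """Stub for executing `policy` inside a forked OS snapshot."""
    return Rollout(actions=[], reward=random.random())

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO's group-relative advantage: normalize each rollout's reward
    by the mean and std of its own group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def collect_group(snapshot, policy, group_size: int = 8):
    """Fork the same OS state `group_size` times and score the policy's
    trajectories against each other. `snapshot.fork()` is a hypothetical
    handle for OS-level state forking."""
    rollouts = [rollout(snapshot.fork(), policy) for _ in range(group_size)]
    return list(zip(rollouts, grpo_advantages([r.reward for r in rollouts])))
```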
06
Optimized perceptual hashing with DinoHash, a state-of-the-art hash we invented that outperforms Meta's and Apple's best algorithms. Combined with aggressive VLM caching and inference optimizations, it minimizes time to first token and redundant computation across frames.
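DinoHash itself is covered in the papers below; the caching pattern that sits on top of any perceptual hash can be sketched like this, with `describe` standing in for the expensive VLM call:

```python
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

class FrameCache:
    """Skip VLM calls when the current frame hashes near a seen frame.

    `phash` is any perceptual hash returning an int (e.g. a DinoHash-style
    fingerprint); `describe` is a placeholder for the expensive VLM call.
    The linear scan keeps the sketch simple.
    """
    def __init__(self, phash, describe, max_dist: int = 4):
        self.phash, self.describe, self.max_dist = phash, describe, max_dist
        self.entries: dict[int, str] = {}   # hash -> cached VLM output

    def get(self, frame) -> str:
        h = self.phash(frame)
        for seen, out in self.entries.items():
            if hamming(h, seen) <= self.max_dist:  # near-duplicate frame
                return out                         # reuse cached answer
        out = self.describe(frame)                 # cache miss: call the VLM
        self.entries[h] = out
        return out
```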
Team
Select papers our team members published prior to Proteus.
ICML 2025 · CODEML Workshop
A unified framework combining DinoHash perceptual hashing, multi-party fully homomorphic encryption, and AI detection models for privacy-preserving content provenance at internet scale.
+12% bit accuracy over SOTA · Read paper →
ICML 2025 · CODEML Workshop
The first open-source, adversarially trained neural perceptual hash. Leverages DINOv2 features to produce compact image fingerprints that resist compression, cropping, and adversarial manipulation.
Open source · PyTorch / ONNX / npm · Read paper →
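DinoHash is adversarially trained; stripped of that training, the features-to-bits idea reduces to signed projections of a DINOv2 embedding, roughly:

```python
import numpy as np

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((96, 768))  # 96 hyperplanes over a 768-d feature

def hash_bits(feature: np.ndarray) -> np.ndarray:
    """Sign of projected features -> 96-bit fingerprint.

    `feature` stands in for a DINOv2 CLS embedding; the real DinoHash
    learns robustness adversarially rather than using random planes.
    """
    return (PROJ @ feature > 0).astype(np.uint8)
```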
We also contributed to work on decoding visual imagery via fNIRS, adversarial examples, and reinforcement learning.
We partner with teams building computer use agents, visual AI infrastructure, and autonomous systems that need to see and act in the real world.