The Capability of LLMs (and AI Agents) to Do Long Tasks
Most AI benchmarks measure what models know. METR, a Berkeley-based AI safety research organization, asked a different question: what can models do, and for how long?
METR’s answer is the task-completion time horizon: the length of tasks, measured by how long a skilled human takes to complete them, that a frontier AI agent can finish with 50% reliability. A model with a 50% time horizon of two hours succeeds at least half the time on tasks that would take a skilled human up to two hours (https://metr.org/).
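To make the definition concrete, here is a toy sketch of how a 50% time horizon can be read off a logistic success model, where success probability declines with the log of task length. The parameters `a` and `b` are hypothetical illustration values, not METR's fitted numbers, and this is a simplification of their methodology:

```python
import math

def p_success(minutes: float, a: float, b: float) -> float:
    # Logistic success probability as a function of log2 task length.
    # a (intercept) and b (slope) are hypothetical fitted parameters.
    return 1.0 / (1.0 + math.exp(-(a - b * math.log2(minutes))))

def horizon_50(a: float, b: float) -> float:
    # The 50% horizon is the task length where the logit crosses zero:
    # a - b * log2(t) = 0  =>  t = 2^(a/b)
    return 2.0 ** (a / b)

# Illustrative parameters only: these imply a 2^7 = 128-minute horizon.
a, b = 7.0, 1.0
t50 = horizon_50(a, b)
```

Shorter tasks sit above the 50% line (higher success probability), longer ones below it; the horizon is simply the crossover point.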

What METR found is striking. Over the past six years, this time horizon has doubled roughly every seven months. If the trend holds through the end of this decade, AI agents will be able to autonomously execute month-long projects. A model operating at that level is not a faster search engine; it is an autonomous agent capable of navigating ambiguity and sequencing multi-step decisions without human input.
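The doubling trend is a plain exponential, so the projection is simple arithmetic. A minimal sketch, assuming a hypothetical two-hour horizon today and the roughly seven-month doubling time METR observed:

```python
def projected_horizon(current_hours: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    # Exponential extrapolation: the horizon doubles every
    # `doubling_months` months.
    return current_hours * 2.0 ** (months_ahead / doubling_months)

# Hypothetical starting point: a 2-hour horizon today.
# Five years out (60 months): 2 * 2^(60/7), i.e. hundreds of hours
# of human-expert work per task.
future = projected_horizon(2.0, 60.0)
```

The point of the sketch is the shape, not the exact numbers: under any fixed doubling time, horizons measured in hours become horizons measured in weeks or months within a few years.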
A nuance worth holding onto: METR's benchmark tasks are clean and well-specified, and performance degrades on messier, real-world work. The time horizon is therefore best read as a ceiling on real-world capability, not a floor.
Still, the directional signal is clear. For technology services firms, the capability curve creates both pressure and opportunity — pressure to evolve delivery models, and opportunity to embed agents into workflows in ways that expand margins and unlock new service categories.
At Alten Capital, we invest in technology services businesses. Reach out to explore potential partnerships.