There’s a graph making the rounds in Silicon Valley that has everyone from venture capitalists at Sequoia to nervous engineers at Anthropic whispering. It’s got a steep upward curve that looks suspiciously like a rocket taking off, and it’s meant to show AI getting smarter, faster. On the surface, it’s the kind of chart that either makes you want to invest a billion pounds or run for the hills. An Anthropic employee reportedly quipped, “mom come pick me up i’m scared.” But here’s the thing: almost everyone is reading it wrong.
This isn’t just a simple misunderstanding; it’s a perfect case study in how the AI hype machine can spin a single data point into a whirlwind of both utopian dreams and existential dread. So, what is this graph, and why does the METR controversy reveal so much about the messy business of AI capability measurement?
What Are We Even Measuring Here?
Before we get to the graph itself, let’s take a step back. How do we know if an AI is actually getting “better”? This is the fundamental challenge of AI capability measurement. It’s not like timing a sprinter. We’re trying to quantify something as fuzzy as intelligence, and our rulers for doing so are often crude and context-dependent.
Metrics matter. They dictate where billions in research funding go, which models from OpenAI, Google, and Anthropic get deployed, and how regulators decide to step in. Get the measurement wrong, and you could be pouring resources into a dead end or, worse, prematurely unleashing technology you don’t fully understand.
The Chart That Launched a Thousand Tweets
This brings us to the organisation at the heart of this storm: METR (formerly ARC Evals, a spin-out from the Alignment Research Center). Their now-famous ‘time horizon plot’, as detailed in a recent MIT Technology Review article, tracks the performance of AI models on a specific set of tasks. The chart’s y-axis isn’t ‘AI intelligence’ but ‘human time’: specifically, it plots the length of time it would take a human to complete a task that an AI can now successfully handle about 50% of the time.
The data points are stark. METR researcher Sydney Von Arx notes, “Every seven-ish months, the time horizon doubled.” We’ve gone from models that could handle tasks taking humans a few seconds in 2020, to minutes in 2023, to the latest models like Claude Opus 4.5 tackling tasks that might take a human hours. The trend line shoots upwards, suggesting an unstoppable exponential growth in ability.
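To see why that doubling claim produces such a dramatic curve, here’s a minimal sketch of the arithmetic in Python. The baseline date and starting horizon are illustrative assumptions chosen to match the seconds-to-hours progression above, not METR’s published data.

```python
from datetime import date

# Doubling arithmetic behind an exponential time-horizon trend.
# Baseline values are illustrative assumptions, not METR's data.
DOUBLING_MONTHS = 7            # "every seven-ish months, the time horizon doubled"
BASELINE = date(2020, 1, 1)    # assumed start of the trend
BASELINE_SECONDS = 10.0        # assumed few-seconds task horizon in 2020

def projected_horizon_seconds(on: date) -> float:
    """Project the 50%-success time horizon at a given date."""
    months_elapsed = (on.year - BASELINE.year) * 12 + (on.month - BASELINE.month)
    return BASELINE_SECONDS * 2 ** (months_elapsed / DOUBLING_MONTHS)

for year in (2020, 2023, 2026):
    minutes = projected_horizon_seconds(date(year, 1, 1)) / 60
    print(f"{year}: ~{minutes:,.1f} human-minutes")
```

Plot that on a linear axis and you get the rocket-launch shape; plot it on a log axis and it’s just a straight line, which is partly why the same data can feel either mundane or terrifying.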
But here’s the grand misinterpretation that has fuelled so much debate. Many people look at the graph and think it’s measuring how long an AI can operate autonomously or the complexity of tasks it can manage. It’s not. It measures the human effort required for a task the AI completes only about half the time. That’s a huge difference.
Exponential Hype Meets Messy Reality
The concept of exponential growth in technology is catnip for the tech industry. It conjures images of Moore’s Law and promises of world-changing breakthroughs just around the corner. When a chart visually represents this, it becomes an incredibly powerful, and dangerous, piece of communication.
As METR’s Thomas Kwa admits with a sense of resignation, “I think the hype machine will basically, whatever we do, just strip out all the caveats.” And the caveats are enormous.
The Limits of the Ruler
First, the tasks METR primarily measures are coding challenges. While impressive, a model’s ability to write a Python script doesn’t necessarily predict its ability to, say, discover a new drug, offer sound medical advice, or negotiate a peace treaty. It’s like judging a fish by its ability to climb a tree: the fish was never meant to climb, and the test tells you nothing about its swimming prowess.
Second, the metric itself is wobbly. As academics like Inioluwa Deborah Raji and Daniel Kang point out, using human completion time as a proxy for difficulty is flawed. Raji rightly observes, “I don’t think it’s necessarily a given fact that because something takes longer, it’s going to be a harder task.” Think about it: checking whether an enormous number is prime would take a human ages but is trivial for a computer. Conversely, telling a convincing joke is quick work for a human but remains fiendishly difficult for an AI.
The error bars on METR’s own data are also massive. The researchers themselves state that a model like Opus 4.5 might be succeeding on tasks that take humans two hours, or on tasks that take as long as 20 hours. That’s a 10x range of uncertainty, hardly the precise measurement the hype cycle implies.
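To put that spread in perspective, here’s a quick back-of-the-envelope calculation. The two-hour and 20-hour figures come from the passage above; the log-space framing is our own illustration, not METR’s method.

```python
import math

# A 2h-to-20h uncertainty interval: a factor-of-10 spread that is
# symmetric in log space around the geometric mean.
low_hours, high_hours = 2.0, 20.0

midpoint = math.sqrt(low_hours * high_hours)  # geometric mean, ~6.3 hours
spread = high_hours / low_hours               # 10x multiplicative uncertainty

print(f"log-space midpoint: {midpoint:.1f}h, spread: {spread:.0f}x")
```

In other words, the headline ‘hours-long tasks’ claim could mean anything from a long meeting to half a working week.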
The Future of AI Progress Tracking
So, if the METR graph is a flawed and misunderstood tool, what’s the alternative? The truth is, there is no single, perfect metric for model evaluation. The industry is grappling with a difficult transition from narrow benchmarks to assessing more general, real-world capabilities.
Effective AI progress tracking will likely require a dashboard of metrics, not a single silver bullet; a rough sketch of the idea follows this list.
– Specialised Benchmarks: We’ll still need tests for specific skills like coding, mathematics, and language translation. These are useful for iterative improvements.
– Real-World Evaluations: We need more tests that simulate complex, multi-step tasks that require reasoning, planning, and adapting to new information. Think less ‘solve this coding problem’ and more ‘plan a marketing campaign for a new product with a budget of £10,000’.
– Adversarial Testing: We need systems that actively try to trick and break AI models, probing for weaknesses, biases, and a failure to understand context. This is where sceptics like Gary Marcus provide immense value; they act as the immune system for the AI field, attacking weak claims.
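What might such a dashboard look like in code? Here’s a hypothetical sketch in Python; the class names, categories, and benchmark labels are assumptions for illustration, not an established evaluation schema.

```python
from dataclasses import dataclass, field

@dataclass
class EvalResult:
    benchmark: str        # e.g. a coding suite, a planning simulation, a red-team probe
    category: str         # "specialised", "real-world", or "adversarial"
    success_rate: float   # fraction of tasks the model completed

@dataclass
class CapabilityDashboard:
    model: str
    results: list[EvalResult] = field(default_factory=list)

    def summary(self) -> dict[str, float]:
        """Average success rate per category; deliberately no single headline number."""
        by_category: dict[str, list[float]] = {}
        for result in self.results:
            by_category.setdefault(result.category, []).append(result.success_rate)
        return {cat: sum(rates) / len(rates) for cat, rates in by_category.items()}

dashboard = CapabilityDashboard(model="example-model")
dashboard.results.append(EvalResult("coding-suite", "specialised", 0.72))
dashboard.results.append(EvalResult("campaign-planning", "real-world", 0.31))
dashboard.results.append(EvalResult("red-team-probes", "adversarial", 0.12))
print(dashboard.summary())
```

The design choice is the point: summary() refuses to collapse everything into one number, which is exactly the failure mode the METR chart has suffered from.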
The future of AI capability measurement won’t be a single, elegant line on a graph. It will be a messy, qualitative, and multi-faceted process. It’ll look less like physics and more like psychology.
The METR graph isn’t useless—far from it. As Sydney Von Arx argues, “I bet that this trend is gonna hold.” Within its narrow context of coding tasks, it does indicate a powerful and accelerating trend. The danger lies not in the data itself, but in its interpretation and the narrative built around it. A single chart has been co-opted to serve every agenda, from those promising an AI-powered utopia to those warning of an impending apocalypse.
The real story here is more subtle and, frankly, more interesting. It’s a story about the profound difficulty of knowing ourselves, or at least, knowing our creations. As we build these increasingly capable systems, our biggest challenge might not be the technology itself, but our own human tendency to oversimplify, to crave certainty, and to let a good story get in the way of messy facts.
What do you think? Is a flawed metric better than no metric at all, or does the hype it creates do more harm than good?


