Garrek Stemo

AI Benchmarks Are Confusing Journalists

Wednesday, September 03 2025

In July, The Economist repeated a claim that prognosticators at the Forecasting Research Institute (FRI) had underestimated the progress of AI by nearly a decade. OpenAI’s o3 model, they wrote, was already performing at the level of a top team of human virologists. On its face this would suggest that super-intelligent AI had arrived. If large language models (LLMs) can solve problems that require a team of scientists, then surely they can do my job and get on with driving a new scientific and economic revolution.

The trouble is that any serious look at FRI’s benchmarks puts the claims into question. Testimony from people actually using LLMs for creative work makes me question the authors' seriousness. LLMs do not perform like a bunch of PhDs in a data center, and they will not reach that level with more training and tricks. A new architecture is needed for AI to do original research.

Those in the AI forecasting business are not testing what they think they are testing and often misinterpret their own results. The media exacerbate the problem. They must do better to understand what a benchmark actually measures and evaluate that against forecasters’ claims before reporting them. More importantly, they should interview experts trying to incorporate LLMs into their work. There is no benchmark that can substitute for integrating AI into a creative workflow. Doing so makes the shortcomings of language models like o3 obvious.

Dwarkesh Patel, a podcaster who has interviewed executives and academics working on AI, quibbles with predictions that AI will be able to replace white-collar work in a few years. His main critique is that current LLM systems’ inability to learn will hold them back. He makes the point by analogy to teaching a student the saxophone. Using an LLM is like having a student play a song, writing detailed instructions about mistakes and improvements to make, and then inviting a new student to try again based on those notes. Students improve with practice; LLMs are stuck until a new version is built. AI systems will always be mildly infuriating in this way until they can learn and remember.

Chris Rackauckas, a scientific machine learning researcher at MIT, described a similar experience on the Julia language forums. He found that Claude Code was useful for repetitive code maintenance and simple problems “that a first year undergrad can do”. But it was unable to do extensive or novel algorithmic work. “This Claude thing is pretty dumb,” he writes, “but I had a ton of issues open that require a brainless solution”. Useful, but a team of machine learning experts it is not.

The release of GPT-5 in ChatGPT coincided with my wanting to implement a quantum master equation for nonlinear spectroscopy based on a paper published this year. ChatGPT (and Claude Code) could translate well-known parts of the theory, like the Lindblad equation, into Julia. But it struggled with new material and with putting the pieces together so that everything worked. As with most papers, the authors left out details, which the LLMs were happy to guess at. I still needed to work through the paper, understand the theory, figure out how to express it in code, and do a lot of the writing myself.
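For reference, the well-known part really is compact. Here is a minimal sketch of the Lindblad right-hand side, in Python rather than the Julia I was working in, with illustrative operator names of my own choosing:

```python
import numpy as np

def lindblad_rhs(rho, H, collapse_ops, hbar=1.0):
    """d(rho)/dt for the Lindblad master equation:
    -i/hbar [H, rho] + sum_k (L_k rho L_k^dag - 1/2 {L_k^dag L_k, rho})."""
    drho = -1j / hbar * (H @ rho - rho @ H)  # unitary (commutator) part
    for L in collapse_ops:
        Ld = L.conj().T
        drho += L @ rho @ Ld - 0.5 * (Ld @ L @ rho + rho @ Ld @ L)  # dissipator
    return drho

# Toy two-level system: sigma_z Hamiltonian, spontaneous decay via sigma_minus
H = np.array([[1, 0], [0, -1]], dtype=complex)
L = np.array([[0, 1], [0, 0]], dtype=complex)
rho = np.array([[0.3, 0.1 + 0.2j], [0.1 - 0.2j, 0.7]], dtype=complex)
drho = lindblad_rhs(rho, H, [L])
print(abs(np.trace(drho)))  # ~0: the Lindblad equation preserves the trace
```

The equation itself is textbook material, which is exactly why the LLMs handle it. The hard part is the new theory built on top of it, and the details the paper leaves out.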

What GPT-5 improves on is retrieving fundamental information and filling in my knowledge gaps with correct references (which wasn't the case a short while ago). It can create toy problems to test how parts of the code should work within the larger framework. What it still cannot do is implement a just-published piece of theory in code, let alone extend it into novel territory, the natural next step for research.

LLMs are interpolation machines, not extrapolation machines. They can reproduce what is in their training data and generate new output within those limits. Yann LeCun has said in interviews that LLMs are “systems with a gigantic memory and retrieval ability”. They do not function like a PhD researcher, whose work requires asking the right questions and solving novel problems.

My gripe is not about the capabilities of LLMs, which are already quite powerful. It is with benchmarks that claim a capability that isn't there, and with forecasts that extrapolate those results into an intelligence explosion that will start solving real problems. I agree with LeCun and Patel that progress will require a new architecture that can learn, hold persistent memory, plan, and understand the physical world.

Reputable media outfits conflating AI generally with LLM technology specifically is a major problem that I hope is resolved soon. There will not be a continuous line from the current architecture to the next one; progress is likely to be discontinuous, and bridging that gap will take more than scaling up LLM technology with bolted-on tricks.


AI labs' all-or-nothing race leaves no time to fuss about safety -- The Economist

What if AI made the world's economic growth explode? -- The Economist

Why I don't think AGI is right around the corner -- Dwarkesh Patel

Everyone is judging AI by these tests, but experts say they're close to meaningless -- Jon Keegan, The Markup

This Claude thing is pretty dumb, but I had a ton of issues open that require a brainless solution. -- Chris Rackauckas, JuliaLang Discourse

Why AI won't take my job -- Rana Foroohar, Financial Times


The Vision for AI Is Disappointing So Far

Friday, August 01 2025

The automobile transformed the American urban and rural landscape from 1910 to 1930. Gone was the manure from city streets and the sight, stench, and sounds of horses. Cities expanded as people moved out from urban centers. Distant recreational activities were suddenly possible, increasing the value of leisure time. Rural isolation ended; single-room schoolhouses gave way to consolidated schools. Farmers were freed from the tyranny of their local merchant and could drive to town. Today’s tech CEOs believe that AI will grow powerful enough to surpass many or all previous human inventions, but the short-term promises are considerably underwhelming.

A recent interview on The Verge’s Decoder podcast illustrates the point. The AI startup Captions uses generative AI to produce short video clips. Its CEO, Gaurav Misra, says the biggest use case for AI video is marketing. Small businesses and individuals can do without hiring videographers or learning to shoot and edit video. Misra promises to democratize advertising. Perhaps a boon for some, though hardly revolutionary.

Visions from elsewhere are similarly narrow. Agentic AI promises to reduce the burden of completing online tasks like booking flights and compiling slides. Google’s AI Applications page mentions things like “data analysis” and “increased efficiency”. On Internet forums you will find people waxing about AI being a game changer for increasing blog post output.

Some, like Anthropic CEO Dario Amodei, predict AI could lift GDP growth in developed countries to a sustained 5-10% per year. That would mean the economy doubling roughly every ten years, a rate of economic change humans have never seen before. But are we going to get there by writing more mediocre blog posts? Are we going to see massive GDP growth because AI manages our online branding by outputting a deluge of content that maximizes engagement and purchasing? Will it come from small efficiency gains in existing business operations, like summarizing meetings?
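The doubling claim is just rule-of-70 arithmetic. A quick sanity check (the growth rates are Amodei's; the code is mine):

```python
import math

def doubling_time(growth_rate):
    """Years for an economy to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + growth_rate)

for r in (0.05, 0.07, 0.10):
    print(f"{r:.0%} growth -> doubles in {doubling_time(r):.1f} years")
# 5% -> ~14.2 years, 7% -> ~10.2 years, 10% -> ~7.3 years
```

So "roughly every ten years" corresponds to the middle of Amodei's range; even the low end doubles living standards within a generation.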

The answer is almost certainly “no”. Real change will come from real innovation: completely new products that change how we live, like the automobile. We have not yet imagined how AI will plug in to society to effect such change. The biggest bottleneck may not be technical, but a lack of human ingenuity for finding applications.

Generative AI may yet change the world. But for now, it mostly wants to sell you things efficiently.


Robert Gordon, 2016. The Rise and Fall of American Growth. Princeton University Press, Chapter 5.

We are not ready for better deepfakes

Artificial intelligentsia: an interview with the boss of Anthropic


Sam and Jony Partnership

Wednesday, May 28 2025

A few days ago, Sam Altman and Jony Ive announced a partnership between OpenAI and Ive’s startup, in a statement resembling a wedding announcement. I was skeptical of the partnership, and was surprised to see many tech commentators and podcasters react largely positively. I see Jony Ive’s greatest work as being behind him, and bringing him back to consumer electronics feels like a misstep. That’s why I was very pleased to read Jason Snell’s thoughtful piece voicing similar doubts.

I agree with Snell: there is no evidence that Ive can innovate today the way that he did when he was working with Steve Jobs. Jony is an incredible designer, but he also had an incredible partner in Steve — someone who tempered aesthetics with usability. When Jony was given full creative control over Apple’s design processes, across hardware and software, his instincts went unchecked. He lacked the practical sensibility that Steve brought to the relationship, and the products made some weird tradeoffs, favoring form over function.

Sam and Jony now claim to be working on a new central device, something beyond the phone and personal computer. But if no one is grounding the design process in the functional needs of regular people, I do not have high expectations for this new relationship.


I started a new hobby

Sunday, April 13 2025

I started photography.

This is the hobby I told myself never to get into specifically because I knew that I would get really into it. It combines all elements of a good hobby: gadgets, science, and creativity. It gets you outside. The problem is the cost. It seemed too easy to spend a lot.

Sometime last year I discovered that the iPhone has a feature where the lock screen rotates through photos from your library based on categories. I picked nature. After several months, I noticed that there were certain photos I liked a lot more than others. They were all taken with real cameras, not a smartphone: an old DSLR, or the Sony point-and-shoot I had ten years ago. I liked those photos more than the ones taken on whatever latest iPhone I was carrying around.

So now I have a camera: the Nikon Z 5 with the 24-200 mm zoom lens. I specifically did not want the best camera, and the lens seems pretty good. It'll be better than anything I've ever used, that's for sure.

Anyway, take a look. There's a new Photography item in the left menu that takes you to a gallery.

Oh and I decided to make a Sports section too. We'll see how long that lasts.


Liberalism in 2025

Monday, February 17 2025

Late last decade The Economist issued a manifesto and essay defending liberalism in the 21st century. Rereading it seven years later, it is not clear that the world has yet heard the call. President Biden’s term did not shore up America’s democracy against further backsliding, leaving it vulnerable to a second Trump presidency. Europe still hasn’t found its backbone.

What does this mean for science? Higher tariffs and mass government layoffs spell disaster for science funding. I expect delays in the delivery of crucial technologies and medicines. Well-funded incumbents will be able to navigate the chaos, but emerging and capital-intensive technologies will struggle to get enough funding.

The essay is worth a re-read at the outset of an uncertain era in the mid-2020s.

A manifesto for renewing liberalism

The Economist at 175


Sustained Economic Growth Comes From Relentless Technological Progress

Tuesday, July 16 2024

Daniel Susskind writes in the FT urging the new UK government to invest in science and technology, and in the people who drive their development, in order to deliver the economic growth that Keir Starmer’s government promises. He writes the following about the mechanisms of growth:

[T]he little we do know suggests that it does not actually come from the world of tangible things, but rather from the world of intangible ideas.... Or, more simply, sustained economic growth comes from relentless technological progress.

Too often governments focus on the trappings of economic growth — tangible things like housing and roads — but fail to invest adequately in the furnace that drives it: the discovery and deployment of useful ideas.

He rightly points out that leaders often have little control over the actual drivers of economic growth, certainly not within the short terms they occupy office. But if the furnace of economic productivity is fed, and the pipeline from that furnace to the deployment of technologies is properly maintained, then tangibles like faster trains and better roads are the result. Productivity rises, as does quality of life. Just not on the timescale of an election cycle.

I partly quibble with Susskind’s characterization that there is no lever for economic growth. The levers are investment in basic research, careful nurturing of promising technologies, and efficient deployment of the most promising of those. The process is neither simple nor fast-acting. Technological progress compounds, but only so long as governments make regular investments and smart choices.

I also want to nitpick the notion that new ideas will come less from humans and more from technology. On this Susskind writes:

The current century will be different. New ideas will come less frequently from us and more from the technologies around us. We can already catch a glimpse of what lies ahead: from large companies like Google DeepMind using AlphaFold to solve the protein-folding problem to each of us at our desks using generative AI — from GPT to Dall-E.

Humans are still the agents of technologies like AlphaFold and generative AI. They choose the questions to ask and direct the process. Even if these products do begin formulating meaningful scientific questions on their own, humans will still choose which questions to pursue based on what they perceive as having the most value; the technologies are not autonomously identifying and solving problems. And when they start doing so, it will be for the benefit of, and in consultation with, humans.


The Full Stack

Friday, May 24 2024

The modern economy runs on scientific, engineering, and manufacturing progress. J. Bradford DeLong, in his grand narrative Slouching Towards Utopia, describes modern history as being driven by economics. The modern era was brought into existence by the development of the modern corporation, globalization, and the industrial laboratory. All of these components are necessary. Michael Strevens paints a similar picture in portions of his book, The Knowledge Machine.

What many miss is the importance of nurturing all three of these ingredients. Politicians and administrators want the niceties that the confluence of these institutions brings, and they want them now. They focus on the institutions closest to the output end of the pipeline, the part where the emerging product is becoming visible. The exciting part. Money then gets funneled to the last mile of the stack while the input end is neglected. This is a mistake. I will argue that governments must consider the full stack and not lose sight of the slower-moving, but perhaps even more crucial, elements of the pipeline.

To mix metaphors, the technological stack can be thought of as a house. Basic science is the foundation; here lies the forefront of human knowledge. How the foundation is built determines how the rest of the building can be constructed. This means choosing the right questions to pursue, balancing scattershot, high-risk high-reward research with clear pathways to practical technologies. The goal is to surface interesting questions that open avenues containing yet more interesting questions. From there, applied science and engineering science take some of that basic knowledge and explore how it might be made into useful technologies. After applied science comes the engineering phase, where ideas are worked until they can be made useful given the wants and needs of society.

Manufacturing is the next layer of the stack. By this I do not merely mean making the thing. Manufacturing is itself a complicated, globe-spanning pipeline that includes procurement of raw materials and processing of those materials. Fashioned components are then shipped to hubs for final assembly. Simply engineering a new product is not enough if producing it is expensive. An entirely different kind of engineering is required to scale an idea so that economies of scale can work their magic and make the product affordable enough to succeed in the market.

My focus is on the under-funded, slow-moving bottom of the stack: basic science. Practitioners of basic science are often poor marketers of their craft. They resent the market economy and do not recognize their place within it, seeing themselves as outside it. When describing the importance of their profession they talk in romantic terms, calling themselves "passionate" and pointing to the innate curiosity of humankind as the driving force. This is all well and good (and is indeed the reason that I am a scientist), but there is another sense of "greater good": the economic one. Basic science builds the furnace of the economic engine that brings down the price of goods, raises standards of living, and delivers medicines. Science is the bedrock of the economic movement responsible for the enormous reduction in poverty over the long twentieth century. The sooner scientists recognize our role within the economic engine instead of resenting it, the better we can make the case for significant increases in basic research spending.

Now that I have obtained my Ph.D. and have achieved some stability as a working scientist, this topic will become a focus. What can be done to reverse the trend of declining spending on basic research? How can we increase the appetite for risk and take more chances on promising technologies? These and other questions will be explored in these virtual pages.


More Podcast Paywalls

Tuesday, October 31 2023

The Economist has moved all of its podcasts behind a paywall except for its daily news show, The Intelligence. This is disappointing because I would somewhat regularly share episodes with friends who don't subscribe to the newspaper. It also means that the analogy of podcasts being radio that you download no longer holds. Another idea recedes into the past. I get the move — the advertising market is drying up and many independent podcasts are moving toward membership models. Spotify has pushed the industry towards subscription and now Apple Podcasts has gotten on board.

Still, if The Economist is going to move its content behind a paywall, at least they have done it the right way. You can still use any podcast player to listen to shows: they provide a subscriber RSS feed in addition to hooking into the subscriber features of the big podcast apps. This is definitely the way to go, and I'm glad they are continuing to use RSS instead of inventing their own format.

Previously I posted about The New York Times launching its own audio app. I still think this is doomed to fail. They have since launched more shows that are available only on the app. I don't know their numbers, but I suspect they won't see a lot of growth in the long run. Thinking big picture, the internet is now going through a phase of decentralization. This is most apparent in the social media space with the rise of new microblogging platforms like Mastodon and Threads — and more importantly ActivityPub which allows them all to interconnect — and the slowly disintegrating Twitter/X. Podcasts have always used use-it-anywhere RSS feeds and I don't see that changing any time soon.


Apple Silicon Macs have a DAC that supports high-impedance headphones

Friday, June 23 2023

I bought the Blue Mo-Fi headphones shortly after they came out in 2014. They are great headphones, but the fake leather on the ear pads has almost completely flaked off, and now that Logitech has bought Blue and is killing the Blue mic brand, there is little hope of getting replacement parts or repairs in the future (I have tried). Besides, they didn't have stellar reviews when they came out, and now I'm getting into high-fidelity audio.

Wading through the online world of audiophile hardware was making me consider buying a DAC and amp in addition to new headphones, but then I found Apple's Support pages for lossless audio and it appears that Apple Silicon Macs not only support lossless audio output, but also have a built-in DAC and amp that can drive high-impedance headphones. That solves that problem. I'll just buy some entry-level audiophile headphones and go. The built-in hardware is probably not as sophisticated as dedicated hardware, but I doubt I'll ever be that into the highest-end audio equipment. There are other things to be obsessed about.

I can't find any information on the built-in DAC or amplifier in System Information, but the Audio MIDI Setup app (comes with macOS) allows you to select the input and output sample rate and other settings.

Links to Apple's support pages:

About lossless audio in Apple Music

"The entire Apple Music catalog is encoded in ALAC in resolutions ranging from 16-bit/44.1 kHz (CD Quality) up to 24-bit/192 kHz."

Supported on iPhone, iPad, Mac, HomePod, Apple TV 4K (not greater than 48 kHz), and Android.

This page says only the 14-inch and 16-inch MacBook Pros support native playback up to 96 kHz, but I think this is outdated because the other support pages all say otherwise.

Use high-impedance headphones with your Mac

Macs introduced in 2021 or later (probably meaning M1 chips or later).

Impedance detection and adaptive voltage output, and built-in digital-to-analog converter (DAC) that supports sample rates up to 96 kHz.

Play high sample rate audio on your Mac

Hardware digital-to-analog converter (DAC) that supports sample rates up to 96 kHz in Macs introduced in 2021 or later (i.e. M1 chips or later).

Set up audio devices on Audio MIDI Setup on Mac

"To set the sample rate for the headphone jack, use the Audio Midi Setup app, which is located in the Utilities folder of your Applications folder. Make sure to connect your device to the headphone jack. In the sidebar of Audio MIDI Setup, select External Headphones, then choose a sample rate from the Format pop-up menu. For best results, match the sample rate for the headphone jack with the sample rate of your source material."



The New York Times Makes a Podcast-like App

Tuesday, May 23 2023

The New York Times just released New York Times Audio, an app for “audio journalism”. It curates all of the New York Times podcasts (including a new daily podcast called “Headlines”) as well as podcasts from third parties, like Foreign Policy and This American Life. It will also include audio versions of written articles.

I think it will be difficult to penetrate the well-established spoken-word market. Podcasts are dominated by Apple Podcasts, and Spotify has had a hard time turning podcasting into a core part of its business. I can see NYT Audio being a niche product that appeals to a small subset of NYT subscribers, but not much more. I’m guessing the goal is to charge third parties a fee for access to NYT subscribers. I don’t see this app generating much more revenue from existing subscribers, both because I don’t see huge numbers using the app and because podcasts are traditionally free and use open web standards. Again, see Spotify’s and others’ attempts to make proprietary podcasting formats.

I’ll try the app, but I don’t see it becoming a habit. Overcast is already on my Home Screen and adding another podcast app is a tall order. If I find something I like, I will most likely just add it to a playlist in Overcast.