The Long Road to AGI Begins with Control: The problem pt 1
Forget magic — it’s time for deterministic, testable engineering
Won’t be fooled again: Priors on near-term AGI
We got lucky with LLMs (large language models), and there is every reason to suspect that we won’t be so lucky with AGI (Artificial General Intelligence). AGI feels so close, just like it felt so close in the 1980s with breakthroughs in compute and expert systems, just like the 1970s with breakthroughs in compute and logic programming, just like the 1950s with breakthroughs in compute and code breaking…
Each epoch started with real progress, followed by hype and a bust, but concurrently a lot of important and useful work got done applying that progress to real-world problems outside of CEOs’ imaginations.
So here we find ourselves under a hype bubble (not a bust yet), and I’d like to get down to business building actual systems with these amazing LLMs, with the intention, as always, of getting closer to AGI.
What’s missing?
My first big AI system was built in 1995 for a government competition with a ragtag group of graduate students out of the University of Pennsylvania. But for two document-format errors, it would have won the MUC-6 (Message Understanding Conference) coreference bake-off against giants of the time like SRI (Stanford Research Institute): https://aclanthology.org/M95-1015/.
I learned a hard lesson from that loss: unreliable systems that fail ungracefully have consequences. We still tied for second, scored on only the 28 of 30 documents that survived, but I became utterly paranoid about system robustness and have remained that way across the hundreds of systems I have built over my career.
Lack of determinism is the system killer, not hallucinations—but nobody cares
Non-determinism, at the scale we see with LLMs, is a completely new phenomenon, and it really screws up system building. Let’s start with how bad the problem can be.
Definition: An LLM is non-deterministic on an input if some later run n does not produce exactly the same output as run 0, even with a fixed seed (if available), temperature = 0.0, and top_p/k = 1.0.
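To make that definition concrete, here is a minimal sketch of a determinism probe. The `call_llm` helper is a hypothetical placeholder for whatever client you use (OpenAI, vLLM, Ollama, etc.), configured with the settings above; the point is only the exact string comparison against run 0.

```python
# Minimal determinism probe: run the same prompt n times and compare each
# later run's output, character for character, against run 0.
# `call_llm` is a hypothetical placeholder -- wire in your own client with
# temperature=0.0, top_p=1.0 (top_k=1 where supported) and a fixed seed.

def call_llm(prompt: str) -> str:
    """Placeholder: return the model's raw text completion for `prompt`."""
    raise NotImplementedError("plug in your LLM client here")

def is_deterministic(prompt: str, n_runs: int = 6) -> bool:
    """True only if every later run reproduces run 0's output exactly."""
    baseline = call_llm(prompt)           # run 0
    for _ in range(1, n_runs + 1):        # runs 1..n
        if call_llm(prompt) != baseline:  # any exact-match failure ends it
            return False
    return True
```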
“Brutal” best describes the above graph of determinism for anyone hoping to build a reliable system. The second run (numbered ‘1’ on the x-axis) shows that GPT-4o deterministically produced the same exact output on only 12 of 100 multiple-choice questions from the college mathematics section of MMLU (https://huggingface.co/datasets/cais/mmlu), and that gradually dropped to 4 of 100 by the 6th run. DeepSeek R1 had 0 responses that were exactly the same on run 1 as on run 0. The best performance on run 1 is 52 exact repeats, from Llama3-70b. What a mess, but I can’t decide what is worse:
The determinism drop from run 0 to run 1.
The decrease in determinism over later runs, which suggests an inherent stochastic instability. (It isn’t; I’ll cover why in a later post.)
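For reference, here is a rough sketch of how a repeat-count curve like the one above could be produced. It reuses the hypothetical `call_llm` placeholder from the previous snippet and pulls college mathematics questions from cais/mmlu with the Hugging Face `datasets` library; the prompt formatting and question count are illustrative, not the exact setup behind the graph.

```python
# Rough sketch of the exact-repeat curve: for each later run r, count how many
# questions produced output identical to their run-0 output.
# Assumes the hypothetical `call_llm` placeholder defined above.
from datasets import load_dataset  # pip install datasets

def exact_repeat_curve(n_questions: int = 100, n_runs: int = 6) -> list[int]:
    mmlu = load_dataset("cais/mmlu", "college_mathematics", split="test")
    prompts = []
    for row in mmlu.select(range(n_questions)):
        options = "\n".join(f"{letter}. {choice}"
                            for letter, choice in zip("ABCD", row["choices"]))
        prompts.append(f"{row['question']}\n{options}\nAnswer with A, B, C or D.")

    baselines = [call_llm(p) for p in prompts]      # run 0 outputs
    curve = []
    for _ in range(1, n_runs + 1):                  # runs 1..n
        repeats = sum(call_llm(p) == base           # exact string match only
                      for p, base in zip(prompts, baselines))
        curve.append(repeats)                       # one repeat count per run
    return curve
```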
Why so sad?
The non-determinism pulls the rug out from under the following aspects of system building:
Unit testing: the LLM can just change its mind for no reason, and that will fail a normal exact-match unit test (see the sketch after this list).
Breaks the 80:20 rule: there are many 80:20 rules, but this one says that 20% of the inputs cover 80% of the cases. If my LLM is not stable, then I can’t be sure that my coverage of that 20% will reliably carry over to the 80% of the system I am trying to build.
Stack complexity goes way up because I have to account for wonky LLM outputs downstream. I am used to AI (artificial intelligence) systems sucking; I am not used to my classifier inventing new ways of communicating the answer “Yes” or “No” to downstream processes.
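To make those last two points concrete, here is a hedged sketch, again using the hypothetical `call_llm` placeholder: an exact-match unit test of the kind non-determinism breaks, and the sort of defensive normalizer that ends up bolted on downstream because the classifier keeps finding new ways to say yes or no.

```python
# Two illustrations of the pain, using the hypothetical `call_llm` placeholder.

# 1) A perfectly ordinary exact-match unit test. With a non-deterministic LLM
#    this can pass today and fail tomorrow on the identical input.
def test_refund_classifier():
    out = call_llm("Is this email a refund request? Answer Yes or No.\n"
                   "Email: 'Please send my money back.'")
    assert out == "Yes"   # flaky: the model may answer "Yes.", "yes", "Sure, Yes", ...

# 2) The downstream tax: a defensive normalizer so the rest of the stack sees
#    a clean boolean instead of whatever phrasing the model invented this run.
def normalize_yes_no(raw: str) -> bool | None:
    text = raw.strip().lower()
    if text.startswith("yes"):
        return True
    if text.startswith("no"):
        return False
    return None   # unparseable; the caller now needs a retry/fallback path
```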
That is enough for now. Next up, I’ll lay out why I think nobody cares about non-determinism and then start getting into mitigation strategies.
Next Post: The Long Road to AGI Begins with Control: Why nobody cares pt 2