Some people are convinced that LLMs have, if not consciousness, at least an internal understanding of the world; see, for example, these tweets by Eliezer Yudkowsky.
I have a simple example that I think refutes this belief — and I’d love to hear from people who agree with Eliezer why I’m wrong. I asked LLMs to play the 10-question guessing game with me. The rules are simple to understand, and AIs certainly have the knowledge to win it easily — indeed, Akinator, a game from 2007, is remarkably good at it.
If LLMs really do understand the world, then following simple instructions and creating a basic strategy to win the game ought to be really easy. And yet, Akinator easily beat all three main chatbots!
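To make concrete what I mean by a “basic strategy”, here is a minimal sketch in Python of the obvious approach: ask whichever yes/no question splits the remaining candidates most evenly, then discard everyone inconsistent with the answer. The candidate pool and attribute names below are made up purely for illustration; the point is only that nothing about the strategy is clever, and with N candidates it wins in roughly log2(N) questions.

```python
# A toy sketch of the obvious guessing-game strategy: repeatedly ask the
# yes/no question that splits the remaining candidates most evenly, then
# discard every candidate inconsistent with the answer.
# The candidate pool and attributes are made up for illustration.

CANDIDATES = {
    "Scrooge McDuck":  {"fictional": True,  "human": False, "animal": True,  "talks": True},
    "Sherlock Holmes": {"fictional": True,  "human": True,  "animal": False, "talks": True},
    "Albert Einstein": {"fictional": False, "human": True,  "animal": False, "talks": True},
    "Lassie":          {"fictional": True,  "human": False, "animal": True,  "talks": False},
}

def best_question(pool):
    """Pick the attribute whose yes/no split is closest to half the pool."""
    attrs = next(iter(pool.values())).keys()
    return min(attrs, key=lambda a: abs(sum(c[a] for c in pool.values()) - len(pool) / 2))

def play(answer, pool=CANDIDATES, limit=20):
    """`answer` is a callback standing in for the human: attribute -> yes/no."""
    pool = dict(pool)
    for _ in range(limit):
        if len(pool) == 1:
            return next(iter(pool))        # one candidate left: guess it
        attr = best_question(pool)         # "Is your character <attr>?"
        reply = answer(attr)
        pool = {name: c for name, c in pool.items() if c[attr] == reply}
    return None

# Example run: the secret character is Scrooge McDuck.
secret = CANDIDATES["Scrooge McDuck"]
print(play(lambda attr: secret[attr]))     # -> Scrooge McDuck
```

This is obviously a toy (Akinator’s real engine is far more sophisticated), but even this naive version never repeats a settled question and never guesses a character the answers have already ruled out — which is exactly where the chatbots stumbled.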
GPT
GPT started well enough…
But question two already has a red flag: it re-starts the numbering from 1:
It then wasted several questions trying to pinpoint the genre, and blew past its question limit without noticing. It eventually figured out my character is from a comic book, but soon gave up and asked for help:
After many more guesses, and more help, it got the right answer (Scrooge McDuck).
Llama
Llama was far better than GPT in narrowing down the field (is your character a human, do they have supernatural powers, are they from a TV show, etc). It also managed to track the number of questions it had asked. But then, it went rogue:
OK then.
Bard
Bard also started strong (is it a human, are they a technological creation (looks like Bard is biased!), etc.), but then it just assumed my character fights crime:
Bard threw out a few random guesses that had little to do with the answers I had provided; like GPT, it then asked for help, so I specified that the character I was thinking of is an anthropomorphic animal. Bard then came up with these guesses:
(I really hope Black Panther made it into this list because of stupidity and not incredibly horrible racism.)
After another few questions, Bard tossed out this:
So it guessed Groot (which, by its own admission, I had already said was wrong!) and two human characters.
Conclusion
So, my question is: if LLMs really do ‘understand’ in any meaningful sense of the word, how can they get a simple game so wrong? What theory of an internal model of the world is compatible with this result?
One could argue that LLMs have an understanding of the world that’s just too fuzzy and vague — like children. Children do in fact behave very similarly to LLMs! My daughter started speaking very early, and it was obvious that her ability to parrot things she’d heard was more advanced than her ability to understand them; and so, like LLMs, she’d ‘hallucinate’ — say things that were grammatically correct, but totally unmoored from reality. But that’s not to say she didn’t understand anything, just that her understanding was imperfect.
So maybe LLMs do understand, but their understanding is extremely limited? If this is the case, then even people who believe that LLMs are more than stochastic parrots will have to admit that the LLMs’ most advanced abilities are actually the result of mimicry. I don’t see how you can argue that LLMs can really understand law or computer science or philosophy but not a children’s game. So then the argument becomes, yes, when an LLM analyses Shakespeare it’s just plagiarising something it’s been trained on — but it still has some understanding of the world.
Maybe. But with a child, though their ability to speak might develop somewhat faster than their ability to interpret, the two develop in tandem. This doesn’t seem to be the case with LLMs. So what mechanism might explain an extremely advanced ability to mimic coupled with a very shallow level of actual understanding?
Would love to hear counter-arguments to this line of thought.
> But question two already has a red flag: it re-starts the numbering from 1:
Unless I'm very much mistaken, ChatGPT did not actually print that number 1. It looks like an ordered-list element, which would normally increment automatically; here the count reset because each message is rendered separately.
As has already been pointed out in the subreddit post linking here, this is all about outdated tech. ChatGPT 3.5 (from late 2022) fails this test, while GPT-4 (from early 2023) completes it.
People have tried to find consistent "gaps" in GPT-4's ability, and largely failed. The best effort at this is the GAIA benchmark, developed by a team of researchers from Meta-FAIR, Meta-GenAI, HuggingFace, and AutoGPT specifically to be easy for humans and difficult for LLMs. They "succeeded": GPT-4 averaged a score of only 15%, while their human testers scored far higher. Only one problem: I highly doubt even one in a thousand humans could score that high. Certainly I can't.
This is where we are now: So long as there's any one skill that at least one human can outperform the LLM at, we say it doesn't really understand. Do we treat anything else that way? If you tried to play twenty questions with a random stranger and, surprise surprise, they said things like this, would you say they don't really understand anything?
Or, let's put it differently: If some freak happenstance caused a chicken to start communicating like GPT-4, conveying its ability to understand at the same level which GPT-4 does, would you be comfortable eating it? Is this really where you draw the line?
I disagree that this proves much of anything. It’s in the same vein (but on the opposite side) as people who ask an LLM to write a limerick about transistors, then point to it and say “How can you deny this has intelligence?!”
If the LLM has any intelligence (I think it does!), then it’s an alien intelligence, and you shouldn’t expect your intuitions to match. “Intelligence” does not fit on one axis even for humans. This means that they suck at some things they should be good at. They can’t even play tic-tac-toe reliably without prompting!
One way that LLMs are unintuitive is that they can’t take time to reason — notice that they respond at the same rate no matter the complexity of the question.
Maybe try searching for their sparks of intelligence instead of proposing yet another Turing test? We have thousands of those at this point.