> But question two already has a red flag: it re-starts the numbering from 1:
Unless I'm very much mistaken, ChatGPT did not actually print that number 1. It looks like it's meant to be an ordered-list element, which normally increments automatically; here the count was reset because each of its messages is rendered separately.
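To illustrate (a minimal sketch, not ChatGPT's actual output: it assumes the Python-Markdown package and made-up question text), rendering each chat message as its own Markdown document gives each one its own ordered list, so the visible numbering starts over at 1:

```python
# Minimal sketch of the rendering quirk: each chat message is rendered as its
# own standalone Markdown document, so each becomes a separate <ol>.
# Requires the Python-Markdown package (pip install markdown); the question
# text below is invented for illustration.
import markdown

messages = [
    "1. Is it a living thing?",  # first reply
    "1. Is it an animal?",       # second reply, but a brand-new message
]

for msg in messages:
    # Each call yields an independent <ol>, and a browser numbers each <ol>
    # from 1 unless it carries a start attribute; hence the apparent reset.
    print(markdown.markdown(msg))
```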
As has already been pointed out in the Reddit post linking here, this is all about outdated tech. ChatGPT 3.5 (released in late 2022) fails this test, while GPT-4 (released in early 2023) passes it.
People have tried to find consistent "gaps" in GPT-4's ability, and largely failed. The best effort at this is the GAIA benchmark, developed by a team of researchers from Meta-FAIR, Meta-GenAI, HuggingFace, and AutoGPT specifically to be easy for humans and difficult for LLMs. They "succeeded": GPT-4 averaged a score of only 15%, while the humans they tested scored far higher. Only one problem: I highly doubt even one in a thousand humans could score that high. Certainly I can't.
This is where we are now: so long as there's any single skill at which at least one human can outperform the LLM, we say it doesn't really understand. Do we treat anything else that way? If you tried to play twenty questions with a random stranger and, surprise surprise, they said things like this, would you say they don't really understand anything?
Or, to put it differently: if some freak happenstance caused a chicken to start communicating like GPT-4, demonstrating the same level of understanding that GPT-4 does, would you be comfortable eating it? Is this really where you draw the line?
I don't follow your reasoning... Yes, if a person couldn't follow simple instructions, I'd say their ability to understand is impaired. No, I wouldn't eat a talking chicken. What a bizarre question to ask!
I apologize; I think that comment wandered away from sense-making near the end, veering into other (only vaguely related) arguments about moral patienthood that you hadn't even raised. Sorry.
My general point was about what seem like double standards for reasoning ability: there are some people who can't play twenty questions but can understand many other things, and we don't usually assume that most of what they do is blind mimicry. Among animals, we are often extremely quick to assume real understanding, probably even more so than we should. Whereas with LLMs we default to, as you say, "I don’t see how you can argue that LLMs can really understand law or computer science or philosophy but not a children’s game", which we certainly would not say about a human lawyer or programmer or philosopher with such a deficiency.
Basically, I think human 'understanding' is quite general and consistent. Yes, some people may have lower cognitive skills, but then that will generally be the case across domains.
I think it's a good argument you're making. For me, what makes LLMs different is the gap between what they can do and what they can't. It's so large that it suggests the things they can do come from something other than understanding.
I disagree that this proves much of anything. It’s the same move from the opposite direction as people who ask an LLM to write a limerick about transistors, then point to it and say, “How can you deny this has intelligence?!”
If the LLM has any intelligence (I think it does!), then it’s an alien intelligence, and you shouldn’t expect your intuitions to match it. “Intelligence” doesn’t fit on one axis even for humans. So they suck at some things they should be good at. They can’t even play tic-tac-toe reliably without prompting!
One way LLMs are unintuitive is that they can’t take time to reason: notice that they respond at the same rate no matter how complex the question is.
Maybe try searching for their sparks of intelligence instead of proposing yet another Turing test? We have thousands of those at this point.
Don't you think that's a cop-out, though? 'It's unintuitive and alien, so don't try to fathom it'? I think anyone proposing LLMs are intelligent ought to be able to account for failures like the one I'm describing.
Btw, I don't consider what I suggested to be a Turing test.
I didn’t mean don’t try to fathom it. In fact, my last line said to try other stuff! All I’m saying is that it could be like a parrot in some ways and like a human in others. Finding one way in which it fails doesn’t prove a lack of real intelligence in other domains.
Why should they be able to account for its failures? Seems like a weird isolated demand for rigor. Can you account for how an LLM can write a poem that’s never existed before? Or get the right answer to newly invented word problems?
Anyway, if you’re heading into that territory, I think you need a stricter definition of intelligence, and that definition needs to account for human variability too.