I thought that this piece on chatbots from New York magazine was worth reading and thought-provoking.
I was amused/alarmed by one Christopher Manning who seems to think we are stochastic parrots, and “the meaning of a word is simply a description of the contexts in which it appears.” Really? Apparently, the idea that meaning has anything to do with how words hook up to the world and to our interactions with the world and each other is antiquated, the “sort of standard 20th-century philosophy-of-language position.” Well, that’s us told.
I was pointed to this piece by a post on Mastodon. I still occasionally look at the old bird site because that’s where posts about music and other cultural stuff still are mostly to be found. But these days, quite apart from not wanting to have too much to do with Musk Enterprises Inc., the more genial atmosphere of Mastodon, on my instance anyway, suits me fine.
9 thoughts on “Stochastic parrots”
I wonder what you think about Julian Baggini’s (quite long) comments on ChatGPT here: https://www.julianbaggini.com/on-chatgpt-predictably/. He identifies some weaknesses of current AI, but is open to the idea that it could eventually match or surpass human capabilities.
Ultimately, arguments, however they are generated, can be subject to analysis. Whether the arguments are good or bad, does it matter how they are generated?
I wrote a fairly long comment, but before posting it I thought I should look at the original paper by Emily Bender and Alexander Koller, Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data.
After reading that, and laughing at GPT-2’s responses to their bear-sticks prompt (in an appendix), I thought, “Hmm. Would ChatGPT do any better?”
Because the GPT-2 responses do rather amusingly seem to confirm their thesis that “solving a task like this requires the ability to map accurately between words and real-world entities (as well as reasoning and creative thinking)”, something such AI systems lack. Even when a response starts off by making some sort of sense, it wanders into irrelevance. This is the one I like best, and it stays coherent longer than most. The text in bold is the prompt.
(All of the responses end in mid-sentence.)
The authors comment:
Here is the first response I received from ChatGPT:
Keep in mind that the developers of these algorithms work hard to fix published mistakes. So it is not a good test to ask the same question that was answered poorly before. Ask different questions of the same type and you will see similar and/or different problems. There are lots of good critiques of LLMs by people in the field. Here is a sample:
That is a fair point, and I have tried other prompts, including ones outwith that sort of animal-attack problem. Although it is still possible to catch ChatGPT out, even in some simple cases, it is so much better than GPT-2 in so many areas that I don’t think Bender and Koller could write the same paper now.
I think a direct comparison on the same prompt is still valid, though, and if the reason GPT-2’s replies were so poor (unable to stay on topic, sometimes barely even coherent, etc) was “because GPT-2 does not know the meaning of the prompt and the generated sentences, and thus cannot ground them in reality”, as the Climbing towards NLU paper argues, then it’s hard to see how the developers can have fixed them — unless ChatGPT has some understanding of the prompt and generated sentences and can partly ground them in reality.
If you start from the position that poor responses are because the AI system doesn’t know what the words mean, where do you go when the responses improve?
One option is to stick with the idea that performance on such tasks reflects the system’s degree of understanding and say that better performance reflects an increase in understanding (from none, or very little, to more). Many people will take that line eventually if improvements continue, even if they don’t for ChatGPT, because they’re committed to something further along: they (unlike Searle, and unlike me) believe that passing the Turing Test is sufficient to show genuine understanding. Of course, they might not put it that way, or they might prefer a different behavioural test, but that’s what it comes down to.
Bender and Koller are in that camp, except that they believe there will always be a ‘sufficiently sensitive test’ such systems would fail:
Meanwhile, they try to explain improvements in performance in other ways, such as the system exploiting “statistical artifacts that do not represent meaning”; and if a system manages to “capture certain aspects of meaning, such as semantic similarity”, they argue that “semantic similarity is only a weak reflection of actual meaning”. The strength of their commitment to the idea that meaning cannot be learned from form is shown by their comment on “models for unsupervised machine translation trained only with a language modeling objective on monolingual corpora for the two languages”:
If progress continues, they will be faced with declaring that more and more tasks do not require understanding meaning, or eventually insisting that a more sophisticated test would expose the lack of understanding (even if no such test is known), or saying that the most recent improvements, passing the then most sophisticated test, brought about full, human-level understanding while before there was none.
None of those look very appealing to me.
Re: Scott Aaronson on Large Language Models
The more fundamental problem I see here is the failure to grasp the nature of the task at hand, and this I attribute not to a program but to its developers.
Journalism, Research, and Scholarship are not matters of generating probable responses to prompts (stimuli). What matters is producing evidentiary and logical supports for statements. That is the task requirement the developers of LLM-Bots are failing to grasp.
There is nothing new about that failure. There is a long history of attempts to account for intelligence and indeed the workings of scientific inquiry based on the principles of associationism, behaviorism, connectionism, and theories of that order. But the relationship of empirical evidence, logical inference, and scientific information is more complex and intricate than is dreamt of in those reductive philosophies.
Those chatbots are definitely much more than “stochastic parrots”. See, e.g., this post: https://borretti.me/article/and-yet-it-understands
This isn’t just a philosophical position. There are papers showing that these models learn generalizable strategies that perform well on genuinely novel test examples, rather than just memorizing.
In any case, this will be undeniable in a few years’ time.
While they are indeed much more than “stochastic parrots,” whether or not language needs to connect with the material world to have meaning is a deep and interesting question. The gestalt of a word includes images and experiences, unique to each of us, yet with sufficient correlation that we can (on occasion) understand one another. Might “meaning” be more about those impressions than the purely abstract, however complex, relationships between words? Is there a connection to how mathematics allows science to make accurate predictions and, generally, the applicability of reason to the material world? Perhaps even the question of consciousness is related? Both meaning and mind seem to have some physical dimension, some embedding in the grit of reality. But perhaps that is just more old-school 20th century thinking…
Stochastic parrots wouldn’t just be memorising either.
And if “stochastic parrot” is defined as in the NY Magazine article — “haphazardly stitching together sequences of linguistic forms … according to probabilistic information about how they combine, but without any reference to meaning” — that is pretty much what AI chatbots such as ChatGPT do. (We might question “haphazardly” as making it sound too random. Apart from that, however, it seems fair.)
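To make that definition concrete, here is a toy “stochastic parrot” in miniature: a Markov-chain sampler that emits words purely according to statistics of how word sequences combine in its training text, with no reference to meaning. This is only an illustration of the definition quoted above (all names are my own), not a claim about how ChatGPT actually works.

```python
import random
from collections import defaultdict

def train(text, order=2):
    """Record which word follows each n-gram of words in the text."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        model[key].append(words[i + order])
    return model

def parrot(model, seed, length=20):
    """Emit words by sampling successors of the last n-gram:
    probabilistic information about form, nothing about meaning."""
    out = list(seed)
    for _ in range(length):
        successors = model.get(tuple(out[-len(seed):]))
        if not successors:
            break
        out.append(random.choice(successors))
    return " ".join(out)
```

A parrot like this can produce locally fluent continuations of its training text; the question in dispute is whether doing the same thing at vastly greater scale remains “without any reference to meaning”.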
I think the “And Yet It Understands” article isn’t quite right about Chinese Rooms. If we take the original description of the Chinese Room literally — John Searle in a room with a big book of instructions, etc, and with inputs and outputs only in Chinese — then of course there are many things it can’t do. More abstractly, though, why shouldn’t a Chinese Room be able to do whatever GPT can do? GPT is just software, running on fairly conventional machines.
The task discussed in that article goes like this: GPT can work with many natural languages; it is given some fine-tuning training in English, and that then affects what it says when given requests in other languages. The article then says that
But if a Chinese Room can do machine translation, it doesn’t have to work like that. Instead, it could translate the foreign-language requests into English, construct a response in the usual way, then translate back into the other language before outputting it. That needs concepts only if working in English already needs concepts; and it is at least not clear that it does.
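The pivot-through-English room described above can be sketched in a few lines. The three functions passed in are hypothetical stand-ins for whatever the room’s rule book specifies; the point is only that the pipeline needs no shared conceptual layer beyond whatever English processing already involves.

```python
def handle_request(request, source_lang,
                   translate_to_english,
                   respond_in_english,
                   translate_from_english):
    """Pivot pipeline: foreign-language request -> English ->
    English reply -> foreign-language reply."""
    english_request = translate_to_english(request, source_lang)
    english_reply = respond_in_english(english_request)
    return translate_from_english(english_reply, source_lang)
```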
I agree: if ChatGPT can “understand concepts”, so too can the Chinese Room, even if John Searle remains oblivious. The intent of the Chinese Room argument, to my mind, is to make this idea seem ridiculous. But the capabilities of modern AI are forcing us to consider it.
A “concept” is an abstraction, a gestalt of memory and reason; why should the inherent complexity (perhaps in the sense of Kolmogorov) of the structures embedded in the AI’s memory (or in the notes Searle has taken) not be an embodiment of a “concept”? The missing ephemeral element remains the notion of consciousness, of awareness. The Turing Test suggests (and some modern philosophers adhere to the idea) that we should apply to interactions with a machine whatever arguments we use against solipsism. I find this view seductive, but ultimately misguided.