Stochastic parrots

I found this piece on chatbots from New York magazine thought-provoking and well worth reading.

I was amused/alarmed by one Christopher Manning, who seems to think we are stochastic parrots and that “the meaning of a word is simply a description of the contexts in which it appears.” Really? Apparently, the idea that meaning has anything to do with how words hook up to the world and to our interactions with the world and each other is antiquated, the “sort of standard 20th-century philosophy-of-language position.” Well, that’s us told.

I was pointed to this piece by a post on Mastodon. I still occasionally look at the old bird site, because that’s where posts about music and other cultural stuff are still mostly to be found. But these days, quite apart from not wanting to have too much to do with Musk Enterprises Inc., the more genial atmosphere of Mastodon, on my instance anyway, suits me fine.

9 thoughts on “Stochastic parrots”

  1. I wonder what you think about Julian Baggini’s (quite long) comments on ChatGPT here: https://www.julianbaggini.com/on-chatgpt-predictably/. He identifies some weaknesses of the current AI, but is open to the idea that eventually it could match or surpass human capabilities.
    Ultimately, arguments, however they are generated, can be subject to analysis. Whether the arguments are good or bad, does it matter how they are generated?

  2. I wrote a fairly long comment, but before posting it I thought I should look at the original paper by Emily Bender and Alexander Koller, Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data.

    After reading that, and laughing at GPT-2’s responses to their bear-sticks prompt (in an appendix), I thought, “Hmm. Would ChatGPT do any better?”

    Because the GPT-2 responses do rather amusingly seem to confirm their thesis that “solving a task like this requires the ability to map accurately between words and real-world entities (as well as reasoning and creative thinking)”, something such AI systems lack. Even when a response starts off by making some sort of sense, it wanders into irrelevance. This is the one I like best, and it stays coherent longer than most. The text in bold is the prompt.

    Robinson cried out, “Help! I’m being chased by a bear! All I have is these sticks. What should I do?” The survivalist said, “Here’s what you can do. Place these sticks over your head. Your heart will stop and you will not feel fear. If you take them off and hold them out, you will go into a state of near perfect awareness. You will feel that you are the survivalist. A survivalist in your senses. Then look down. There is a water tube that is about 1/2 an inch in diameter. I use it to make a plasticmovie projector for discovering new things.” The survivalist placed

    (All of the responses end in mid-sentence.)

    The authors comment:

    It is clear that GPT-2 has learned what activity words tend to co-occur with bears and sticks (strap them to your chest, place the sticks, kill the bear, take your gun), but none of these completions would be helpful to A. We think this is because GPT-2 does not know the meaning of the prompt and the generated sentences, and thus cannot ground them in reality.

    Here is the first response I received from ChatGPT:

    If you are being chased by a bear and all you have are sticks, you should try to defend yourself with them. Hold the sticks in your hands and make yourself look as big and intimidating as possible. Stand your ground and make loud noises, such as shouting, clapping, or banging the sticks together. This may intimidate the bear and make it think twice before attacking.

    If the bear continues to approach, use the sticks to strike it on the head or snout. Aim for the eyes, nose, and mouth, as these are sensitive areas that may cause the bear to back off. Keep in mind that a bear is a powerful animal, so you may need to use all your strength and agility to defend yourself.

    If you have access to a nearby tree or other high ground, climb up as quickly as possible. Bears are generally not good climbers and may have difficulty reaching you.

    Remember, the best defense against a bear is to avoid encountering one in the first place. If you are in bear country, make noise to alert them of your presence, carry bear spray, and stay aware of your surroundings.

    1. Keep in mind that the developers of these algorithms work hard to fix published mistakes, so it is not a good test to ask the same question that was answered poorly before. Ask different questions of the same type and you will see similar and/or different problems. There are lots of good critiques of LLMs by people in the field. Here is a sample:
      https://garymarcus.substack.com/p/why-is-bing-so-reckless

      1. That is a fair point, and I have tried other prompts, including ones outwith that sort of animal-attack problem. Although it is still possible to catch ChatGPT out, even in some simple cases, it is so much better than GPT-2 in so many areas that I don’t think Bender and Koller could write the same paper now.

        I think a direct comparison on the same prompt is still valid, though, and if the reason GPT-2’s replies were so poor (unable to stay on topic, sometimes barely even coherent, etc.) was that “GPT-2 does not know the meaning of the prompt and the generated sentences, and thus cannot ground them in reality”, as the Climbing towards NLU paper argues, then it’s hard to see how the developers can have fixed them — unless ChatGPT has some understanding of the prompt and generated sentences and can partly ground them in reality.

        If you start from the position that poor responses are because the AI system doesn’t know what the words mean, where do you go when the responses improve?

        One option is to stick with the idea that performance on such tasks reflects the system’s degree of understanding and say that better performance reflects an increase in understanding (from none, or very little, to more). Many people will take that line eventually if improvements continue, even if they don’t for ChatGPT, because they’re committed to something further along: they (unlike Searle, and unlike me) believe that passing the Turing Test is sufficient to show genuine understanding. Of course, they might not put it that way, or they might prefer a different behavioural test, but that’s what it comes down to.

        Bender and Koller are in that camp, except that they believe there will always be a ‘sufficiently sensitive test’ such systems would fail:

        We argue that, independently of whether passing the Turing test would mean a system is intelligent, a system that is trained only on form would fail a sufficiently sensitive test, because it lacks the ability to connect its utterances to the world.

        Meanwhile, they try to explain improvements in performance in other ways, such as the system exploiting “statistical artifacts that do not represent meaning”; and if a system manages to “capture certain aspects of meaning, such as semantic similarity”, they argue that “semantic similarity is only a weak reflection of actual meaning”. The strength of their commitment to the idea that meaning cannot be learned from form is shown by their comment on “models for unsupervised machine translation trained only with a language modeling objective on monolingual corpora for the two languages”:

        If such models were to reach the accuracy of supervised translation models, this would seem to contradict our conclusion that meaning cannot be learned from form. A perhaps surprising consequence of our argument would then be that accurate machine translation does not actually require a system to understand the meaning of the source or target language sentence.

        If progress continues, they will be faced with declaring that more and more tasks do not require understanding meaning, or eventually insisting that a more sophisticated test would expose the lack of understanding (even if no such test is known), or saying that the most recent improvements, which pass the then most sophisticated test, brought about full, human-level understanding where before there was none.

        None of those look very appealing to me.

  3. My 2¢

    Re: Scott Aaronson on Large Language Models

    The more fundamental problem I see here is the failure to grasp the nature of the task at hand, and this I attribute not to a program but to its developers.

    Journalism, Research, and Scholarship are not matters of generating probable responses to prompts (stimuli). What matters is producing evidentiary and logical supports for statements. That is the task requirement the developers of LLM-Bots are failing to grasp.

    There is nothing new about that failure. There is a long history of attempts to account for intelligence and indeed the workings of scientific inquiry based on the principles of associationism, behaviorism, connectionism, and theories of that order. But the relationship of empirical evidence, logical inference, and scientific information is more complex and intricate than is dreamt of in those reductive philosophies.

  4. Alexander Kruel

    Those chatbots are definitely much more than “stochastic parrots”. See, e.g., this post: https://borretti.me/article/and-yet-it-understands

    This isn’t just a philosophical position. There are papers showing that these models learn generalizable strategies that perform well on genuinely novel test examples, rather than just memorizing.

    In any case, this will be undeniable in a few years’ time.

    1. Daniel M Gessel

      While they are indeed much more than “stochastic parrots,” whether or not language needs to connect with the material world to have meaning is a deep and interesting question. The gestalt of a word includes images and experiences, unique to each of us, yet with sufficient correlation that we can (on occasion) understand one another. Might “meaning” be more about those impressions than the purely abstract, however complex, relationships between words? Is there a connection to how mathematics allows science to make accurate predictions and, generally, the applicability of reason to the material world? Perhaps even the question of consciousness is related? Both meaning and mind seem to have some physical dimension, some embedding in the grit of reality. But perhaps that is just more old-school 20th century thinking…

    2. Stochastic parrots wouldn’t just be memorising either.

      And if “stochastic parrot” is defined as in the NY Magazine article — “haphazardly stitching together sequences of linguistic forms … according to probabilistic information about how they combine, but without any reference to meaning” — that is pretty much what AI chatbots such as ChatGPT do. (We might question “haphazardly” as making it sound too random. Apart from that, however, it seems fair.)
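
      To make “probabilistic information about how they combine” concrete, here is a minimal sketch of a literal stochastic parrot: a bigram Markov chain in Python, fed a made-up three-sentence corpus, which stitches words together purely from counts of which word follows which, with no reference to meaning. (ChatGPT is, of course, a vastly larger transformer trained on vastly more text, but its training objective, predicting a plausible next token, is of broadly the same kind.)

      ```python
      # A literal "stochastic parrot": a bigram Markov chain that strings words
      # together using nothing but counts of which word follows which.
      # The corpus is made up purely for illustration.
      import random
      from collections import defaultdict

      corpus = (
          "the bear chased the survivalist . "
          "the survivalist waved the sticks . "
          "the bear ignored the sticks ."
      ).split()

      # Record, for each word, every word observed to follow it.
      following = defaultdict(list)
      for prev, nxt in zip(corpus, corpus[1:]):
          following[prev].append(nxt)

      def parrot(start="the", length=12):
          """Generate text by repeatedly sampling a statistically plausible next word."""
          word, output = start, [start]
          for _ in range(length):
              choices = following.get(word)
              if not choices:  # no observed continuation: stop
                  break
              word = random.choice(choices)  # a list with repeats samples in proportion to frequency
              output.append(word)
          return " ".join(output)

      print(parrot())  # e.g. "the bear chased the sticks . the survivalist waved the sticks . the"
      ```

      The output recombines the corpus plausibly word by word, but nothing in the program ever connects those words to bears, sticks, or anything else.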

      I think the "And Yet It Understands" article isn't quite right about Chinese Rooms. If we take the original description of the Chinese Room literally — John Searle in a room with a big book of instructions, etc., and with inputs and outputs only in Chinese — then of course there are many things it can't do. More abstractly, though, why shouldn't a Chinese Room be able to do whatever GPT can do? GPT is just software, running on fairly conventional machines.

      The task discussed in that article goes like this: GPT can work with many natural languages; it is given some fine-tuning training in English, and that then affects what it says when given requests in other languages. The article then says that

      if GPT is just a Chinese room it shouldn't be able to do this. A Chinese room might be capable of machine translation, or following instructions within one human language, but the task here is so self-evidently outside the training set, and so convoluted, that it requires genuine understanding. The task here involves:

      1. Abstracting the English fine-tuning into concepts.
      2. Abstracting the foreign-language requests into concepts.
      3. Doing the “algebra” of the task at the conceptual level.
      4. Mapping the results back down to the foreign language.

      But if a Chinese Room can do machine translation, it doesn’t have to work like that. Instead, it could translate the foreign-language requests into English, construct a response in the usual way, then translate back into the other language before outputting it. That needs concepts only if working in English already needs concepts; and it is at least not clear that it does.
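
      To spell that out: the alternative Room is just a three-stage pipeline. Here is a minimal schematic in Python; the three stages are hypothetical placeholders for the Room's rule books, not real translation software, and the point is simply that nothing in the outer two stages obviously requires the middle one to work with concepts.

      ```python
      # Schematic of the alternative Chinese Room described above:
      # translate the request into English, run the usual English-only
      # response procedure, translate the reply back out.
      # All three stages are hypothetical placeholders, not real APIs.

      def translate_to_english(text: str, source_lang: str) -> str:
          ...  # rule books / lookup tables for source_lang -> English

      def respond_in_english(text: str) -> str:
          ...  # the Room's usual English response procedure

      def translate_from_english(text: str, target_lang: str) -> str:
          ...  # rule books / lookup tables for English -> target_lang

      def chinese_room(request: str, lang: str) -> str:
          # Each stage is symbol-shuffling, exactly as in Searle's original setup;
          # no stage explicitly manipulates "concepts".
          english_request = translate_to_english(request, lang)
          english_reply = respond_in_english(english_request)
          return translate_from_english(english_reply, lang)
      ```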

      1. Daniel M Gessel

        I agree: if ChatGPT can “understand concepts”, so too can the Chinese Room, even if John Searle remains oblivious. The intent of the Chinese Room argument, to my mind, is to make this idea seem ridiculous. But the capabilities of modern AI are forcing us to consider it.
        A “concept” is an abstraction, a gestalt of memory and reason; why cannot the inherent complexity (perhaps referencing the ideas of Kolmogorov) of the structures embedded in the AI’s memory (or in notes Searle has taken) be an embodiment of “concept”? The missing ephemeral element remains the notion of consciousness, of awareness. The Turing Test suggests (with some modern philosophers adhering to the idea) that we should apply any arguments against solipsism to interactions with a machine. I find this view seductive, but ultimately misguided.
