The rise of ChatGPT has led to a proliferation of people and organizations scrambling to use it in their products. OpenAI, a company widely known for its lack of transparency and its famous disengagement from the scientific literature, released the system to the public in November 2022. Following a series of massive generative models built on prompting-as-retrieval from a large database, ChatGPT extends that line of work with a public-facing API and an ordinary, everyday web page that lets you query it with human-interpretable prompts.

The potential relevance of ChatGPT for psycholinguistics, as an instance of a prompting model trained on very large datasets of web text and transcriptions, is its ability to generate templates. I know folks who have used it to write nginx templates for hosting websites, as well as YAML specs for building machine learning pipelines. Naturally, folks have started to ask, “Can I get ChatGPT to make stimuli for me?” Since we as cognitive scientists build stimuli that are meant to be relatively natural but still grammatically constrained, a carefully worded prompt might be an effective way to create such sentences. No more need to train pesky RAs who must first learn to recognize tricky syntactic structures! No need to query norm datasets of word counts or concreteness ratings!
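For concreteness, here is a minimal sketch of what such a request might look like through OpenAI’s public API. It assumes the openai Python package (version 1.x or later) and an API key in the environment; the model name and prompt wording are purely illustrative, not a recommendation.

```python
from openai import OpenAI

# Assumes the openai Python package (>= 1.0) and OPENAI_API_KEY set in the environment.
client = OpenAI()

# A hypothetical stimulus-generation prompt; the constraints are made up for illustration.
prompt = (
    "Generate 5 English sentences containing an object-relative clause, "
    "each 8-10 words long, using only high-frequency concrete nouns."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)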

There are a few things that are noteworthy about ChatGPT as a commercial product that, in my opinion, limit its effectiveness as a stimulus generation tool:

  1. Replicability
  2. Transparency
  3. Non-stationary model parameters and ruleset
  4. Computational cost
  5. Labor and prompting as a skill that must be learned
  6. Longevity
  7. Data privacy and corporate profit

In all cases, alternatives exist to a ChatGPT-centered stimulus generation paradigm so that researchers can still take advantage of the latest technologies and make their workflows more efficient (if efficiency is important to you, of course).

Let us consider these points in turn:

Replicability

Within the field of psychology, replication has come to play an increasingly large role in establishing whether effects on cognition, behavior, or other mental states are justified by the data we collect. In computational linguistics, natural language processing, and machine learning more broadly, these concerns have largely fallen by the wayside, though some folks have made the case for power analyses and greater use of inferential statistics to justify calling something the new “state-of-the-art.”

With ChatGPT, the replicability question is altogether different, in that it is not always possible to get the service to give you the same answer. Randomness is built into the decoding procedure of generative models to keep them from always producing the single highest-probability linguistic sequence. If this system works similarly to other models, an initial state is selected and beam search (like a spotlight or a flashlight in a dark, black-box woods) is conducted over possible “chains” of responses. Typically, a model then selects the continuation or response with the highest probability. With ChatGPT, however, the items considered within the beam are probably sampled from the output distribution at each time step (something like temperature or nucleus sampling). One challenge for researchers aiming to generate many stimuli that obey a few different constraints, for an experiment or its follow-ups, is that the prompt they worked hard to craft may not always lead to the same types of stimuli, even when nothing about the prompt has changed.
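To see concretely why sampling undermines replicability, here is a minimal sketch with GPT-2 via the Hugging Face transformers library: greedy decoding returns the same continuation on every run, whereas sampled decoding only does so if you control the random seed yourself, which ChatGPT’s interface does not let you do. The prompt and decoding settings here are arbitrary choices for illustration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The horse raced past the barn", return_tensors="pt")

# Greedy decoding: fully deterministic, identical continuation on every run.
greedy = model.generate(
    **inputs, max_new_tokens=10, do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)

# Sampled decoding: a different continuation each run unless the seed is fixed.
torch.manual_seed(0)
sampled = model.generate(
    **inputs, max_new_tokens=10, do_sample=True, top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(greedy[0]))
print(tokenizer.decode(sampled[0]))
```

With a local model you can record the seed and decoding settings alongside your stimuli; with a hosted chat interface, that knob simply is not exposed to you.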

Transparency

With this in mind, you have no direct control over the generations produced by ChatGPT. This becomes even clearer when we want to generate outputs under additional constraints. Unlike GPT-whatever’s predecessors, you cannot inspect the working components. That means no embeddings, no logits, no indices or ghosts of the representations being calculated. While we assume that ChatGPT behaves like its large language model predecessors and contemporaries (e.g., GPT-3, T5), we cannot know what computations it is doing at any one point in time, and we are never able to retrace its steps. This is especially fraught for folks who use measures like surprisal to understand the predictability of a word in context. With smaller (but nevertheless massive compared to old standards in computational psycholinguistics) models like GPT-2, it was possible to read off the model’s probability distribution at each time step, generate many possible continuations, and arrive at a proxy for, say, cloze probabilities. Those days are long gone in a prompt-centered world.
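For contrast, here is a rough sketch of the kind of surprisal computation that open models like GPT-2 make possible and ChatGPT does not: with access to the logits, the log probability of a target word given its context falls straight out of the model. The context sentence and target word below are made up for the example.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "The children went outside to"
target = " play"  # leading space: GPT-2's BPE folds the space into the token

ctx_ids = tokenizer(context, return_tensors="pt").input_ids
tgt_ids = tokenizer(target, return_tensors="pt").input_ids

with torch.no_grad():
    # Score the full sequence and read off log probabilities over the vocabulary.
    input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)

surprisal_bits = 0.0
for i, tok in enumerate(tgt_ids[0]):
    # Logits at position (ctx_len + i - 1) predict the token at (ctx_len + i).
    pos = ctx_ids.shape[1] + i - 1
    surprisal_bits += -log_probs[0, pos, tok].item() / math.log(2)

print(f"surprisal of '{target.strip()}' in context: {surprisal_bits:.2f} bits")
```

None of this is possible when the only thing you get back from the system is a finished string of text.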

OpenAI has also made it very clear that the human is doing a lot of work for their human-in-the-loop reinforcement learning procedure. The model is “trained” by the inputs that we give it, both in the short term (while we are interacting with it) and in the long term (in the systems OpenAI releases). There is also a lot hidden under the hood, similar to how prompt-driven image generation models (e.g., DALL-E) embedded secret rules to make their outputs less racist. While this is a good thing for the end user, this front-end filtering is not disclosed to us and is considered proprietary information. Any number of layers of constraints may be applied to the output that have nothing to do with the generative capacity of the model, and instead are meant to keep it from producing racist, sexist, ableist, and otherwise -ist language. Given OpenAI’s famously secretive nature, it is unlikely that we will ever learn the full scope of these systems for preventing misbehavior, and it is unclear how often these constraints change.

Non-stationary parameters

Even if the model could be inspected, ChatGPT is conceived of as a reinforcement learning model whose parameters are updated approximately weekly. When new versions are released, they are purportedly responses to additional “debugging” arising from the millions upon millions of prompts submitted every day, in addition to any new data and raw text that may have been added to its training set. The model does not really have a “test” dataset to be evaluated on, as it is constantly being tested by the users who interact with it. The updating of the parameters exacerbates concerns about replicability, because the model’s internal representations may drift over time. While this is obviously nice from a business point of view, since incoming data may be important to emphasize in the representations (geopolitical events, for instance), it is a mess for researchers who want to probe the internal knowledge state of these systems. ChatGPT simply is not a single monolithic thing, but rather an eternally changing slime mold trying to cross a maze to reach food by the shortest path possible.
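One hedge against this drift, in the spirit of the alternatives mentioned above, is to work with an openly released model and pin the exact version you used. The sketch below assumes the Hugging Face transformers library, where from_pretrained accepts a revision argument (a tag or commit hash on the model hub); the model name and revision shown are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pinning a specific revision keeps the weights from silently changing under you.
# "main" is a placeholder; for true reproducibility, record and reuse the exact
# commit hash of the model snapshot you used when generating your stimuli.
MODEL_NAME = "gpt2"
REVISION = "main"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, revision=REVISION)
```

Recording the model name, revision, decoding settings, and random seed alongside your stimuli makes the generation step rerunnable later, which is exactly what a weekly-updated hosted service cannot promise.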