What do "which is A.I.?" quizzes tell us?
What does it mean to "prefer" one text over another?
This newsletter is brought to you by Squarespace.
I think everyone should have a personal website. Not a social media profile — an actual website, a little corner of the internet that belongs to you, where no one can reply or quote-post or suggest you might also enjoy branded content. I built mine with Squarespace. (You can see it here.)
What sold me was the combination of ease and control. Squarespace has a huge library of professionally designed templates, which you can use as-is if you’re a normal person who wants to look professional online. Or, if you’re me, you can use their intuitive drag-and-drop design tools to customize everything until your site features hand-drawn KidPix art by a kindergartner and looks like a Geocities page that went to journalism school. Either way, the whole process is fast — no coding required, easy to update whenever you feel like it. And it’s not just a static page: I’ve got S.E.O. tools and analytics built in, and I can add a storefront, new pages, or even email campaigns whenever the mood strikes. It’s flexible in a way that a social media profile simply isn’t.
If you need a website, portfolio page, storefront, or nearly anything else, Squarespace is perfect. The only thing it can’t provide is KidPix images my son made.
Click here for a free trial, and when you’re ready to launch, use READMAX to save 10% off your first purchase of a website or domain.
Greetings from Read Max HQ! In this week’s edition, we discuss The New York Times quiz about human and A.I. writing, and what we get out of these side-by-side preference tests.
A reminder: Read Max is a subscription newsletter that depends on paying subscribers to survive. Every week, we lose a few paid subscriptions, which means we need to be outpacing our churn--which means that if you haven’t been subscribing, but have been enjoying the fruits of my labor (i.e. the bad jokes), perhaps it’s time for you to chip in? At $5/month or $50/year, it costs roughly as much as buying me a beer every few weeks, which I imagine you would be happy to do.
What do we get out of “A.I. or human” preference tests?
I spent the first half of this week at “Cultural A.I.: An Emerging Field,” a fascinating conference put on by N.Y.U.’s Digital Theory Lab and the Remarque Institute. I found it genuinely invigorating to hear papers from and talk with some extremely smart people who are thinking capaciously and rigorously about A.I. systems as social and cultural technologies.1 What is an L.L.M.? How does it intersect with literary culture and political economy? What can its operations tell us about language, writing, and intelligence?
And then I would go home and open up Twitter, where the real School of Athens stuff was happening:
Now, I don’t particularly begrudge the Times its little widgets, which I understand to be a key strategic component of its overall business model and continued health as The Last Employer Of Journalists. But I would politely disagree with my friend Kevin that this represents “a moment.” That non-expert humans, given blind side-by-side comparisons, tend to do a bad job identifying A.I.-generated text is, at this point, a pretty well-established finding. So, too, is its black-pilling corollary: that people consistently, if modestly, prefer the A.I. output.
Indeed, back in 2024 I wrote a little bit about a version of the human-or-A.I. game (this one about images rather than text) run by Scott Alexander, which had similar results to the Times quiz. My take at the time, calibrated to be maximally annoying, was that Alexander’s quiz didn’t really prove that people couldn’t identify, and, when asked, “prefer” A.I. art--but that nonetheless it was probably true, in a general sense, that people can’t identify, and likely prefer, A.I. art.
People prefer A.I. art because people prefer bad art
In this week’s edition, we discuss two recent experiments (one non-scientific, one scientific) comparing A.I.-generated and human-fashioned art.
I don’t think my general feeling about this kind of test, or its results (such as they are), has changed much in the intervening 18 months. On the one hand, the Times quiz is a deeply defective experiment, starting from a set of patently false premises, whose results are being wildly over-interpreted. On the other, would anyone really deny that most people, in a vacuum, would have trouble identifying well-prompted A.I.-generated writing, or that, furthermore, they likely think it (to use the Times verbiage) “reads better”?
And yet despite the flaws of these tests, the obviousness of the result, and the repetitive tediousness of the “conversation” that follows on X, Bluesky, and Substack, we continue to craft, publish, take, and argue about these tests. So what are we getting out of them?
It seems worth asking what, actually, is happening when we misidentify, and express preference for, A.I.-generated writing in blind A/B tests like these. An influential 2024 paper by Brian Porter & Edouard Machery--linked in the Times quiz--that asked subjects to identify A.I.-generated poems found that participants “performed below chance levels” (at 46.6 percent accuracy), which is to say they did worse than they would have by guessing at random.
What this suggests is that people were able, at least to some extent, to distinguish between human and A.I. poetry--they just thought that the A.I. poems were human, and vice versa. Porter and Machery attribute this to
shared yet flawed heuristics to differentiate AI from human poetry: the simplicity of AI-generated poems may be easier for non-experts to understand, leading them to prefer AI-generated poetry and misinterpret the complexity of human poems as incoherence generated by AI.
That is, in aggregate, participants could tell that the A.I. poems and human poems were stylistically different from each other; they simply misunderstood what those different styles actually marked.
This is a relatively common phenomenon, and the finding that people misidentify A.I. output as human (and vice-versa) at higher-than-chance levels seems to hold across other domains: A.I.-generated faces or dating profiles, for example. (It’s probably particularly acute with poetry, a field with, let’s say, a wide gap between what is “good” and “coherent” to regulars and what is “good” and “coherent” to novices2.) In many contexts most people can (more or less) correctly differentiate between A.I.-generated output and its “authentic” counterpart--but cannot correctly attribute the output.
What’s funny about this is: We actually really want to prefer human-authored writing! In open-label tests, where the excerpts are shown with attribution, people consistently express preference for whatever text is labeled human, even when the text is actually A.I.-generated. (So do A.I. evaluators, as I learned at the conference from Wouter Haverals, to an even greater degree.)
This is not a particularly satisfying set of findings insofar as it validates neither the A.I.-booster “it’s so over, A.I. writing is better than human writing” side nor the A.I.-skeptic “A.I. can never write like a human” side. What we can say is that people mostly can’t identify A.I.-generated text as A.I.-generated (crowd boos), but they can sometimes distinguish between it and human-authored text (crowd cheers); it’s just that they tend to think the A.I.-generated text is human (crowd boos), maybe because human-generated text is stranger, worse, or more difficult (crowd hesitantly cheers), which readers mistakenly believe is more typical of A.I.-generated text (crowd silent now) and thereby disprefer (crowd sort of murmuring confusedly), unless you tell them it’s actually human, in which case they change their minds and like it (crowd has mostly left at this point).
But all of it taken together suggests that, given our strong bias in favor of writing we believe to be human, A.I. vs. human “preference” tests (or “reads better” quizzes) are often second-order “identification” tests, in each case measuring not “preference” per se but the accuracy of the prevailing heuristics for identifying A.I. writing. Participants in these studies, it would seem, express preference for the A.I.-generated writing not because it’s “better” in some formal sense--cleaner, simpler, more beautiful, whatever--but because their “flawed heuristics” have led them to the conclusion that it’s human-authored, and ipso facto better.
If this is right, much of the discourse about quizzes like the Times’ is getting the order of operations wrong. It’s not that people see two paragraphs, prefer one based on its quality, and then attribute it to humans based on that preference. It’s that they see two paragraphs, attribute one to human authorship based on style, and then prefer the one they’ve attributed. What’s at stake when taking these tests isn’t quality or beauty or clarity, but style; not “which one is better,” but “which one sounds more like an L.L.M.?”
In experiments like the Porter and Machery study, participants often express preference for “cleaner” or “smoother” text. Because this accords with our intuitions about the kind of writing that most people should or would prefer in most contexts, it’s easy to take for granted the idea that people are expressing some relatively fixed, “natural” preference for the kind of professional plainness L.L.M.s tend to exemplify.
But if they’re picking the text that better displays “cleanliness” because they mistakenly associate it with human writing--and disfavoring stranger or more difficult text because they associate these qualities with A.I.--you can easily imagine a world where people begin to express preference in certain contexts for clunkier, thornier, and “worse” writing, because those are the stylistic markers of humanness.
As long as people want to prefer human-authored to L.L.M.-generated writing, we will place a premium on whatever style we associate with human authorship--even as that style changes. You can already see this process beginning from the other direction on social networks like Twitter, where em-dashes and not-x-but-y contrastive corrections--perfectly innocuous and useful writerly tools which not five years ago would likely have been highly correlated with “good prose”--are immediately treated with derision and suspicion. By that same token, certain kinds of “bad writing” should be seen as evidence of human authorship. How long before run-on sentences are preferred to em-dashes?
L.L.M.s, of course, can and will get better at mimicking the “strangeness,” clunkiness, and badness of human prose; I’m skeptical of claims that there is some built-in technical limitation that prevents A.I. text from ever being truly indistinguishable from human prose.3 What seems more likely to me is that as L.L.M.s move away from the easily identifiable generic LinkedIn style that currently dominates, our preferences will move as well, in an attempt to stay one step ahead.
One thing that interests me about these quizzes is the extent to which they resemble a stage in “reinforcement learning from human feedback,” or R.L.H.F., a chatbot-training process in which the L.L.M. provides two (or sometimes more) responses to every prompt, one of which a human evaluator selects as the “better” response. Those human preferences are then used to train another model that predicts a “score” for any given response; finally, the L.L.M. is tuned to maximize that predicted score.
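The preference-scoring step can be sketched in miniature. What follows is a toy Bradley-Terry model--a hedged illustration, with all of the data, the four-response setup, and the function names invented for the example. Real reward models are neural networks scoring full text, but the core logic is the same: nudge the scores of human-preferred responses up and dispreferred ones down until the model’s predicted probabilities match the observed choices.

```python
import math

# Toy illustration of the reward-modeling step in R.L.H.F.
# All data here is invented: four candidate responses (labeled 0-3) and a
# round-robin of pairwise human judgments, each recorded as (winner, loser).
preferences = [(0, 1), (0, 2), (0, 3), (2, 1), (2, 3), (3, 1)]

def fit_scores(prefs, n_responses, lr=0.1, steps=500):
    """Fit Bradley-Terry scores so that P(i beats j) = sigmoid(s_i - s_j)."""
    scores = [0.0] * n_responses
    for _ in range(steps):
        for winner, loser in prefs:
            # Probability the model currently assigns to the observed choice.
            p = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            # Log-likelihood gradient: push the winner up, the loser down.
            scores[winner] += lr * (1.0 - p)
            scores[loser] -= lr * (1.0 - p)
    return scores

scores = fit_scores(preferences, 4)
ranking = sorted(range(4), key=lambda i: -scores[i])
print(ranking)  # response 0 won every comparison, so it ranks first: [0, 2, 3, 1]
```

In actual R.L.H.F., a scorer fitted this way becomes the reward signal: the chatbot is then updated to produce responses the scorer rates highly, which is why a Times-style A/B quiz is, structurally, the exact game the model was trained on.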
Quizzes like the Times’ are games that L.L.M.s are designed to excel at. A.I. writing is literally optimized to be the writing most people prefer in A/B preference tests; the main thing an L.L.M. chatbot “wants” when replying is to generate the text that its users would choose as a good answer to the prompt over all other possibilities.
We’re not training L.L.M.s when we take quizzes on the Times website. But I suspect we’re training ourselves, taking the tests to measure and adjust our own heuristics for distinguishing A.I. text from human writing. I think A.I. boosters often want these blind tests to “prove” to stubborn skeptics that A.I. writing is “as good” as human writing. And I know that skeptics object that the tests don’t accurately measure anything of the sort. But my sense is that they’re not measurement instruments at all--they’re territory on which a kind of ongoing stylistic arms race is being conducted. We like these games not because they satisfactorily “prove” that A.I. or humans can produce “better” or “worse” text, but because they reveal to us--both in themselves, and in the discourse that follows--the stylistic tells that allow us to distinguish between the two.
For a taste, here’s Ben Recht’s excellent talk on “Benchmarking Culture,” which is relevant to the subject of this edition.
To quote a since-deleted tweet, “As a teacher of poetry what I can tell you for sure is people want poems to rhyme. They want poems to rhyme so bad. But we won’t give it to them”
A better objection, for my money, is that they’ll never get better at aping the awkwardness of authentic human writing because there’s no real profit in thorny human prose, even if it increases the fidelity: What paying customers want from a chatbot is puree-smooth paragraphs for their cover letters and book reports, not ever-finer-tuned Cormac McCarthy approximations.
Oddly, there's a direct ancestor to this "experiment" from a century ago. In the 1920s, the literary critic I.A. Richards handed Cambridge undergraduates poems with the author names stripped off and asked them to evaluate the work on its merits. Students consistently preferred the "mediocre" poems and dismissed the "difficult" ones as incoherent. He didn't put it this way at the time, but his later work suggests that removing the names didn't "reveal" which poems were better; it destroyed the conditions under which real judgment could take place.
The Times quiz repeats the setup but strips even more. Richards removed the author's name but gave the whole poem. The quiz removes that and the surrounding text, reducing everything to a paragraph (or less) floating free. At that scale, LLM prose is genuinely hard to distinguish and human prose is impossible to situate. The whole experimental design, if we can call it that, masks what LLMs truly struggle with: not style at the sentence level, but narrative control over a longer stretch.
To simplify the crowd-boos metaphor:
"People hate AI art and prefer human art. Under all conditions. Only by obscuring its origins or lying can people be made to choose AI over human."
They may not be able to distinguish what they hate from what they like, but that is true of polluted water versus clean water, PFAS-contaminated foodware versus clean, GMO crops versus organic...
There is a massive consumer dispreference for this shit! In any other industry that would be end of story! No one wants it, it's poison, just because you can hide poison in food and people eat the poison doesn't mean they want to eat poison!
Fucking lead tastes sweet! Fucking antifreeze tastes sweet! People might prefer them in a blind taste test to a bland cracker and water *IF AND ONLY IF YOU DENY THEM THE INFORMATION THEY NEED AND WANT ABOUT WHAT YOU ARE DOING TO THEM*