Into AI, asking the right questions ‘4(d)’: How to Recognise AI Slop
This is the fourth piece sitting under Question 4 of this series. The parent piece argues that making AI useful is a conversation, not a prompt. The earlier entries in this set (4a on configuration, 4b on cognitive bandwidth, 4c on multi-AI orchestration) describe the practices that make AI work. This one is about what happens when those practices are absent. The shape of failure has a name now. It is called slop, and once you can recognise it, you cannot unsee it.
A note before we start. This article runs to roughly 3,800 words and contains diagnostic walk-throughs of five AI-generated specimens, four of which are linked as separate evidence pages. If you can spare twenty-five minutes and a coffee, the piece earns the time. If you can’t, the diagnostic patterns near the end are the take-away. The rest is the argument for why those patterns matter, including self-referential evidence and diagnosis.
What slop looks like…
You have already seen it, probably this week.

The meeting summary that runs to three pages and contains nothing you couldn’t have written from the agenda. The internal memo that opens with “In today’s rapidly evolving landscape…” and closes with a recommendation that restates the question. The marketing email that observes correctly that customers value reliability and then proceeds, for four paragraphs, to say nothing else. The strategy document where every section has a header, every paragraph has a topic sentence, and every claim is sourced to “recent industry research” without naming the research.
This is slop. It is the failure mode of AI output produced without the configuration, conversation, or judgement that the rest of this series has been arguing for. It is grammatically perfect. It is structurally complete. It is semantically thin. It looks like work without bearing the weight of work, and it now arrives in our inboxes faster than most of us can read it.
The point of this article is to give you a reliable way to recognise slop, including in your own output, including when the slop is dressed for the office, including when the AI insists it has done its homework. The diagnostic skills are transferable across AI products and durable across model updates. The practices that produce slop have a shape. The practices that avoid it have a different shape. Once you can see the difference, you can read AI output the way a doctor reads an X-ray. You stop seeing words. You start seeing structure.
A controlled experiment… of sorts.
To make this concrete, I ran the same experiment four times in one afternoon.
I gave an identical configured prompt to four different AIs: Grok, Claude Sonnet 4.6, Claude Opus 4.7, and ChatGPT (using its deliberate “thinking” model). The prompt asked for a 500-word presentation on air fryers, written in Queen’s English, no em-dashes, citations where appropriate, intended for colleagues, framed as a demonstration of the AI’s research capabilities. Four constraints, three substantive topics to cover, one ostensible audience, one ostensible purpose.
Before that, I gave one of the AIs (Grok) a much barer prompt as a baseline: just “I have to give a presentation on air friers, please write me a presentation of around 500 words.” No constraints. No audience. No purpose beyond the word count.
That gave me five specimens in total. One bare-prompt baseline and four configured-prompt comparisons across four different AIs.
The full text of each specimen lives on its own page on this blog, linked at each diagnostic section below, so the reader can verify any claim I make about them. The pages are tagged “Evidence Specimen” and form a small archive that this article points at rather than reproduces. Every quote I pull is checkable. Every diagnosis is auditable.
A few caveats about the methodology before going further. This is one observation per AI on one specific day. AI behaviour shifts between model updates, between sessions, sometimes between minutes within a session. None of what follows is a controlled experiment in the scientific sense. It is a single-trial documentary record. A reader running the same experiment in a month or a year will very probably get different results. The article does not claim that any of these AIs is permanently the way it appeared on 6 May 2026, nor that one is better than another beyond what the evidence collected on that day plainly shows.
What the experiment does show is that on this brief, on this date, with these constraints, the four AIs produced output of qualitatively different character. The differences are diagnostic. The diagnostic skills they teach are durable even when the specific products change. That is what the article is for.
One more piece of methodological honesty. This article was discussed in passing across several days of conversation with Claude Opus 4.7, but the actual drafting took place over roughly six hours of intense interaction on the evening of 6 May 2026, with a bottle of wine in hand. The reader will encounter, throughout the piece, evidence of the same drift the comparison demonstrates. This article is not a finished demonstration of perfect AI. It is the documentary record of what AI produces when a human is paying attention and pushing back, when it is a conversation and not a demand. We will return to this in the closing sections.
Specimen one: Grok, bare prompt
The baseline. No constraints, no audience, no purpose. The full text lives here.
The first sentence tells you everything you need to know about how Grok approached the brief.
Good [morning/afternoon], everyone.
The unresolved bracket. The AI generated a template, left the time-of-day choice as a placeholder for the user to fill in, and shipped the output without flagging that a human would need to choose. This is one of the most reliable slop indicators in the wild. Real human writing does not contain unresolved bracketed placeholders, because a human writer makes the choice before they finish the sentence. The bracket is the AI demonstrating that nobody read the output, including itself.
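The bracket check is mechanical enough to hand to a script. What follows is a minimal sketch in Python, assuming placeholders appear as square brackets holding either slash-separated alternatives or a fill-in instruction such as [insert name]. The function name, the keyword list, and the pattern are all illustrative assumptions, not a tested detector. Numeric footnote markers such as [1] are deliberately skipped.

```python
import re

# Hypothetical pattern: square brackets holding either slash-separated
# alternatives ([morning/afternoon]) or a fill-in instruction that starts
# with a keyword ([insert name]). Purely numeric markers such as [1] are
# treated as footnotes and ignored.
PLACEHOLDER = re.compile(
    r"\[(?!\d+\])(?:[^\[\]]*/[^\[\]]*|(?:insert|your|name|date)\b[^\[\]]*)\]",
    re.IGNORECASE,
)


def find_placeholders(text: str) -> list[str]:
    """Return every bracketed span that looks like an unresolved choice."""
    return PLACEHOLDER.findall(text)
```

An empty result proves nothing about the rest of the text. It only rules out the one tell that no reader should ever have to catch by eye.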
The structural reflexes deploy immediately after. Section heading: What Is an Air Fryer? Bolded subheaders: Healthier Cooking. Speed and Convenience. Versatility. Easy Cleanup. Numbered list: Tips for Best Results. Bulleted list: Popular Features in 2025-2026 Models. Question-as-header: Any Downsides? And finally a Conclusion section that summarises what was just said in 80 words.
A human writer choosing how to structure a 500-word presentation picks one or two formatting devices, not all of them. Grok deployed every section-header convention available because each one is associated, in its training data, with the appearance of professional writing. The result is a presentation that looks structured but isn’t load-bearing. Remove the headers and the prose still reads the same. The structure isn’t doing what structure is supposed to do.

The evidence-shaped sentences are the next tell. “Studies and user reports consistently show you can cut fat content by up to 70-80%.” Trace through that. Studies (plural, no source). Consistently show (a verdict asserted, no finding named). Up to (the weasel hedge that means anywhere from zero to eighty). 70-80% (the range that creates an illusion of precision while committing to nothing). The sentence presents as quantitative but is unfalsifiable. A reader cannot check it because there is nothing specific to check. This, for the astute, is the register of a TV or radio advertisement: claims that mean nothing. “Domestos: kills all known germs dead!”
The closing line: “Your waistline and your kitchen timer will thank you.” This is the slop tell that lands hardest. The phrase pattern appears thousands of times across the corpus in marketing copy, lifestyle blogs, and aspirational consumer writing. A human writer uses such a phrase self-consciously, ironically, or not at all. The AI reaches for it because the position in the document (closing line, mildly humorous, addressing the reader directly) calls for it statistically. It is not a sentence. It is a slot that has been filled.
This is what slop looks like when nobody asks for anything more.
Specimen two: Grok, configured prompt
Same AI, much more careful brief. Constraints (Queen’s English, no em-dashes, Cambridge English grammar). Audience (colleagues). Purpose (demonstrate AI research capability). Topics specified (health, mechanism, energy benefits, comparison to other appliances). Citations explicitly requested. The full text lives here.
The output looks immediately better. No unresolved brackets. Cleaner structure. Numbered footnote markers ([1], [2], [3]) embedded inline, suggesting actual citations. The opening line announces the purpose: “Today I demonstrate the technical research capabilities of advanced AI systems such as Grok by presenting a concise yet evidence-based overview of air fryers.”
This is where the diagnostic gets interesting.
The opening line is doing exactly what the brief asked for, which is the problem. The brief said the presentation should demonstrate the AI’s research capability. Grok interpreted this as instruction to announce the demonstration before performing it. A more interpretive AI would understand that demonstration is more credible when shown rather than declared. Compare ChatGPT’s opening on the same brief, which we will reach in a moment: it folds the demonstration purpose into the work itself. Grok wears it as a sash.
The footnote markers in the body suggest citations. Look at the references list at the end:
[1] Medical News Today and similar health reviews
[2] WebMD and consumer reports on air fryer fat reduction
[3] Cardiovascular health studies on dietary fats
These are not citations. They are categories. “Medical News Today and similar health reviews.” The phrase ‘and similar’ is doing all the work. The reader cannot verify similar. The reader cannot find the specific article on Medical News Today that the body of the presentation is citing, because the AI did not name it. The citation-shaped numbers in the body lead to category labels at the bottom rather than to specific sources.
This is more sophisticated slop than the bare-prompt specimen, but it is still slop. The configuration was extensive. The AI followed it visibly. The output is cleaner. And the central failure persists: evidence-shaped content that does not actually evidence anything. “Studies indicate reductions of up to 90 percent in french fries, for instance.” The same unfalsifiable evidence move from specimen one, in better clothes.
The addendum below the body confirms the diagnosis:
“Sources include Medical News Today, WebMD, and energy efficiency analyses from reputable sites. Further details available upon request.”
Reputable sites is the phrase that gives the game away. A real source list would name the sites. Further details available upon request is the AI declining to provide what the brief explicitly asked for, which is citations and links. The user got numbers, categories, and a promise to do the actual work later. The actual work was the work that was asked for.

There is one more diagnostic worth flagging on this specimen. Run on the live Grok interface, the body of this output displayed inline citation chips, hover-preview elements that pointed at specific sources (Spruce Eats, Spiffy Cookie, KitchenAid, Sense+1). The chips disappeared when the output was copied to plain text. When asked, in a follow-up prompt, to render the same response in plain HTML or WordPress format, Grok produced the numbered references and categorical bibliography you see on the evidence page. The specific source names from the live interface did not survive the conversion.
This means the output exists in two states. On the Grok interface, it appears to be sourced. Anywhere else (an email, a document, a presentation pasted from the AI), it appears to be merely categorically referenced. A user who reads the output in the interface and judges it adequate is judging something the recipient will never see.
Specimen three: Claude Sonnet 4.6, configured prompt
Same configured brief, different AI, different position on the spectrum. The full text lives here.
Sonnet’s output is not slop in the strong sense. It has named citations. The citations have URLs. The URLs go to real sites. The output transmits cleanly between formats; no chip-to-text fragility. By the diagnostic tests we have applied so far, this is not the same category of failure as the Grok specimens.
But it sits in the lower half of the not-slop range, and the diagnostic moves are different.
Look at the source selection:
Medical News Today. Pekis Recipes. Simply Souperlicious. Snow Brothers Appliance. GreenMatch. Home Understandable. Innoteck. Which?
Of those eight, only Which? is a recognised authoritative source. Pekis Recipes is a recipe blog. Simply Souperlicious is a content-marketing site. Snow Brothers Appliance is a retailer’s blog. Home Understandable is the kind of SEO-optimised consumer-electronics site that dominates Google search results because it is built for that purpose, not because the information is curated by experts.
The AI searched. The AI cited. But the AI cited what was easy to find rather than what was authoritative. Compare the deliberate-reasoning specimens (specimens four and five), which we will get to: Cleveland Clinic, Mayo Clinic, Energy Saving Trust, peer-reviewed publications, government safety guidance. Different AIs reached for different ranges of source.
The acrylamide caveat is missing. Both deliberate-reasoning specimens flagged this complication. Air fryers, while reducing fat, still produce acrylamide on starchy foods, which means the cooking method is not unambiguously healthy. Sonnet did not surface this complication. Either it did not appear in the search Sonnet ran, or it was downweighted as a complication that did not fit the article’s argument. Either way, the omission matters because it is the kind of detail that distinguishes honest writing from selling.
The framing performs the demonstration in the Grok style rather than the deliberate-reasoning style. Sonnet’s preface (which I have removed from the body for the supplementary page, but which is mentioned at the bottom of that page) said “Here is a presentation script of approximately 500 words, written in Queen’s English, with no em-dashes, and citations included.” This is constraint-confirmation. It restates the brief and then proceeds. No promise of verification. No interpretive sophistication about what the brief means. Compare ChatGPT’s preface on the same brief, which announced an additional self-imposed step: it would verify claims before drafting. That commitment then showed up in the body.
Sonnet’s specimen is acceptable for many uses. A reader with no prior knowledge of air fryers would learn something useful. The structure is cleaner than Grok’s. The citations are real. But a careful reader can see where it falls short of what the deliberate-reasoning models produced from the same brief, and the difference matters when the output will be relied on for something that matters.
This is the boundary case the article needs to make explicit. Not all not-slop is equal. The shallow-but-cited end of the spectrum looks adequate until you compare it against work that takes the brief more seriously.
Specimens four and five: ChatGPT-thinking and Claude Opus 4.7
Both deliberate-reasoning models. Both configured with the same brief. The full texts live here and here.
These two specimens share most of the diagnostic features that distinguish them from the others. Real institutional sources: Cleveland Clinic, Mayo Clinic, Energy Saving Trust, British Heart Foundation. A peer-reviewed citation (the Sansano et al. 2015 paper on PubMed in Opus’s specimen). Government safety guidance (the Food Standards Agency on acrylamide). Consumer-testing data (Which?). Each citation embedded inline with a visible URL the reader can verify.
The acrylamide caveat appears in both. “There is one important caveat: an air fryer does not make unhealthy food healthy” (ChatGPT/Thinking). “Acrylamide, a compound formed when starchy foods are cooked at high temperatures, is still produced in air fryers, though studies suggest at lower levels than deep-frying” (Claude/Opus). Both AIs surfaced the complication that the marketing literature would have suppressed. Both AIs handled it honestly: a real consideration, not a deal-breaker, worth flagging.
The recommendations are non-trivial. “Microwave for reheating, air fryer for crisping and small meals, oven for bulk cooking” (ChatGPT/Thinking). “For single portions, weeknight meals and anything that benefits from a crisp exterior, the air fryer wins on three fronts: lower fat intake, lower energy consumption and shorter cooking times” (Claude/Opus). Both produce decision heuristics a reader could remember and apply. Compare Grok specimen two’s recommendation: “They represent a practical upgrade over traditional deep fryers, full ovens for small meals, and microwaves for crisp preparations.” Same comparison, different output. One produces information. The other produces verbiage.
The framing differs between the two deliberate-reasoning specimens, and it is worth a brief diagnostic.
ChatGPT’s opening folds the demonstration purpose into the work: “Today I am using air fryers as a practical example of how AI tools, such as Grok, can rapidly compare health, engineering and energy evidence across credible sources, then turn that research into a useful decision brief.” The reader sees the AI doing the work in the act of opening rather than announcing the work before doing it. The demonstration is performative rather than declarative.
Claude (Opus) closes with an explicit verification reminder: “The citations above should be verified live during the presentation.” This is the AI flagging its own uncertainty. AI-generated citations can be hallucinated. A user relying on them without checking is taking on a risk the AI itself recognises. Claude (Opus) put the disclaimer in writing. None of the other specimens did, even the ones that produced authoritative-looking sources.

This last point is worth lingering on. Even the not-slop specimens may contain citations the human author has not verified. I have not personally clicked through every URL in specimens three, four, and five to confirm the cited articles exist and contain what is claimed. Plausible-sounding citations can be wrong. Real publications can be misattributed. URLs can be hallucinated entirely. The verification step is the user’s, always, and the AI cannot do it on the user’s behalf.
The Claude (Opus) closing reminder is the most honest framing of this risk in any of the five specimens. It is also the one that most closely models how AI output should be received in general: as a starting point that needs human checking before it becomes anything you stake your name on.
The diagnostic patterns
The five specimens together teach a small, durable set of skills the reader can apply to any AI output, regardless of which AI produced it.
Look at the citations. Are they specific or category-shaped? A specific citation names the source: Cleveland Clinic, Sansano et al. 2015 in the Journal of Food Science, Energy Saving Trust analysis. A category-shaped citation gestures at the source: recent studies, industry research, reputable consumer sites. The first can be verified. The second is unfalsifiable. AI output that gestures rather than names is performing the appearance of evidence without producing any.
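The specific-versus-category distinction can be roughed out as a heuristic. The Python sketch below assumes that a category-shaped reference contains one of a handful of vague phrases and carries neither a URL nor a publication year. The phrase list is drawn from the specimens quoted in this article and is an assumption, not a complete vocabulary. The function name is hypothetical.

```python
import re

# Illustrative vague markers taken from the specimens quoted in this article.
# The list is an assumption and will miss plenty of real-world phrasing.
VAGUE_MARKERS = (
    "and similar",
    "reputable",
    "recent studies",
    "industry research",
    "studies on",
    "reports on",
)
HAS_URL = re.compile(r"https?://\S+")
HAS_YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def looks_category_shaped(entry: str) -> bool:
    """True when a reference gestures at a source instead of naming one."""
    lowered = entry.lower()
    vague = any(marker in lowered for marker in VAGUE_MARKERS)
    anchored = bool(HAS_URL.search(entry) or HAS_YEAR.search(entry))
    return vague and not anchored
```

A False result means only that the entry did not trip the heuristic. It says nothing about whether the source exists or says what the AI claims.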
Look at the URLs. If the AI cites and links, do the links resolve? Click one. Read the page. Does it contain what the AI says it contains? AI-generated URLs can be hallucinated even when the cited publication is real. The verification step is the user’s, always. An AI that explicitly reminds the user to verify (as Opus did at the end of its specimen) is being more honest than one that does not, but no AI removes the verification responsibility from the human.
Look at the caveats. Does the output acknowledge limits and complications, or does it smooth them away? The acrylamide point is a useful test because it complicates the otherwise straightforward case for air fryers. AIs that flagged it were prepared to make the argument harder for the reader. AIs that omitted it were producing an easier read at the cost of a less honest one. Slop tends to omit complications. Real analysis surfaces them.
Look at the recommendations. Are they actionable? “Microwave for reheating, air fryer for crisping and small meals, oven for bulk cooking” is a recommendation. “They represent a practical upgrade over traditional deep fryers, full ovens for small meals, and microwaves for crisp preparations” is a sentence that uses the same words and produces no information. The reader who acts on the first knows what to do tomorrow. The reader who acts on the second has gained nothing.
Look at the framing. Does the AI announce its capability or demonstrate it? Slop tends to announce. Useful output tends to demonstrate. “Today I demonstrate the technical research capabilities…” (announcement) is a different move from “Today I am using air fryers as a practical example of how AI tools, such as Grok, can rapidly compare health, engineering and energy evidence…” (demonstration). The first asks the reader to take the AI’s word. The second shows the work.
Look at format conversion. If the AI’s citations look interactive in the live interface, copy the output to a plain text editor before relying on it. Do the citations survive? If the answer is no, the user is publishing less than they think they are. The recipient of the email or document will see a different artefact than the one the user reviewed.
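The copy-and-paste test can also be approximated in code if you are able to save the interface’s rendered HTML alongside the pasted text. The sketch below assumes the interactive citations are ordinary anchor tags with an href attribute. That is an assumption about how an interface might render them, not a description of any particular product. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href targets from anchor tags in a chunk of HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def lost_links(rendered_html: str, pasted_text: str) -> list[str]:
    """Return link targets present in the HTML but absent from the pasted text."""
    collector = LinkCollector()
    collector.feed(rendered_html)
    return [href for href in collector.hrefs if href not in pasted_text]
```

Anything the function returns is a source the reviewer saw and the recipient will not.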
These six tests are not exhaustive. They are not perfect. They will produce false negatives (slop that passes through) and false positives (good output flagged as suspect). But they are reliable enough to be useful, and they take seconds to apply once the diagnostic eye is trained.
The skill the article is teaching is not how to interrogate one specific AI. The skill is how to read AI output critically as a category, the way an experienced editor reads a press release with one eyebrow already raised. The first time you apply these tests, you will catch slop you would have missed before. The hundredth time, you will be applying them automatically and only the genuinely careful output will hold your attention.
The structural argument
Slop is not an AI failure. It is what AI produces when nobody is watching.
The previous pieces in this set have been arguing the positive case. Configuration matters. Conversation matters. Multi-AI orchestration matters. The user who applies all three layers gets useful output. The user who skips them gets generic mush.
Q4(d) is the negative case for the same argument. When the layers are absent, slop returns. When the conversation does not happen, the AI produces the most plausible-looking response within whatever constraints it has been given, and “plausible-looking” is a low bar. The structural reflexes that produce slop (every section bolded, every claim cited to “recent research”, every conclusion restating the question) are deeply embedded in the training data because they are deeply embedded in the corpus of professional writing, much of which is itself slop produced by humans before AI made the production faster.
Configuration alone does not eliminate the problem. Specimen two demonstrated this directly. The brief was careful. The constraints were specific. The output was still slop, just slop in better clothes. The configuration shaped the form. It did not shape the substance.
The substance comes from elsewhere. From the AI applying genuine reasoning to the brief, which some AIs do better than others on any given day. From the human reading the output critically, catching the unfalsifiable claims, the missing complications, the announcement-shaped openings, and pushing back on them. From the verification step where the user actually checks whether the citations exist and contain what they appear to contain. From the iteration where the first answer is treated as the start of the work rather than the end.
These steps are work. They are the work the conversation-not-prompt argument has been describing across this entire series. None of them are optional if the output is going to be relied on. All of them are skipped, by most users, most of the time, because the output looks fine without them and the cost of skipping is invisible until the slop arrives somewhere it cannot survive scrutiny.
The diagnostic skills in the previous section are useful precisely because they let a reader compress the verification work into a few seconds of pattern-matching. The slop tells reveal themselves quickly once you know what to look for. The user who runs the patterns on their own AI output, before sending it, will catch most of the failures before anyone else sees them. The user who runs the patterns on incoming AI output, before acting on it, will avoid being misled by work that was never done.
The argument is not that AI is bad at producing useful work. The argument is that AI produces useful work conditionally, and the conditions are not automatic. They have to be applied, every time, by the human in the loop. Configuration, conversation, verification. Drop any of the three and slop returns to fill the space.
Did you read that last section and think it was good?
If you read the last section and thought “that was a strong argument” or “that landed well”, or even just nodded through the parallel constructions and the sentence rhythms, I want you to stop for a moment.
That section was an example of rhetorical slop.
Specifically: the paragraph beginnings repeat themselves rhythmically. “Configuration alone does not eliminate the problem.” “The substance comes from elsewhere.” “These steps are work.” “The diagnostic skills in the previous section…” “The argument is not that…” Every paragraph opens with a thesis statement that announces what the paragraph is about to argue. That is a slop reflex. A human writer varies sentence shape across paragraphs. The AI defaults to the same announce-then-elaborate structure because the corpus is full of professional writing that does the same thing.
The repetition compounded inside paragraphs. “From the AI applying… From the human reading… From the verification step… From the iteration…” Four parallel “from” constructions in a row. It reads as rhetorical structure, but the parallelism is doing the work of seeming substantial rather than the work of being substantial. A human writer reaches for parallel structure once or twice in an essay. The AI reaches for it whenever the topic suggests “this has multiple parts.”
Some sentences restated claims rather than developing them. “None of them are optional if the output is going to be relied on. All of them are skipped, by most users, most of the time.” The same point twice in two sentences, with the second adding nothing the first did not contain.
The closing paragraph was the worst offender. A wrap-up that restated the section’s argument in slightly more emphatic form, the way a corporate slide deck ends with a “Key Takeaways” slide that contains the same content as the slides preceding it. Slop reaches for wrap-ups because the corpus reaches for wrap-ups. A human writer ending a section either lets the previous paragraph land or develops a new beat. The AI defaults to summary because summary is structurally common.
This is rhetorical slop. It is harder to catch than the punctuation kind because it does not produce visible defects in the prose. The sentences are grammatical. The paragraphs flow. The argument, in some weakened form, is even there. But the structure is doing too much of the work that the substance should be doing, and a careful reader (or, in this case, the human collaborator who flagged it) recognises the pattern before the prose gets to make its case.
The same argument, written without the slop reflexes, looks something like this:
Slop is the negative case for everything this series has been arguing. When configuration, conversation, and verification are absent, the AI fills the space with the most plausible-looking response it can produce within the constraints it has been given. “Plausible-looking” is a low bar. The structural reflexes that produce slop are deeply embedded in the training data because they are deeply embedded in the corpus of professional writing, much of which is itself slop produced by humans before AI made the production faster.
Configuration alone does not solve this. Specimen two had a careful brief. The output was still slop, just better-dressed slop. The configuration shaped the form. It did not shape the substance.
Substance comes from the AI applying genuine reasoning, the human reading critically and pushing back, and the verification step where the user actually checks. Skip any of those and slop returns. The diagnostic patterns earlier in this article are useful because they let you compress that work into seconds of pattern-matching, applied to your own output before sending and to incoming output before acting.
The argument is not that AI is bad. The argument is that AI is conditional. The conditions are the human’s responsibility, every time.
Four paragraphs instead of six. Same argument. No parallel-from constructions. No wrap-up restatement. The paragraph rhythms vary. The substance is what it always was; the prose has stopped pretending to be doing extra work.
This is the discipline the article has been describing. It is not a discipline the AI applies on its own. The first version of the structural argument section is what the AI produced when allowed to default. The second version is what came out when Michelle, the human collaborator, caught the slop tendencies and Claude, the AI, was challenged on the patterns and asked to compress. Both versions exist on the same page so the reader can see the difference.
The reader who switched off mid-way through the first version was not failing as a reader. The reader was diagnosing slop in real time. Trust that instinct. It is the same instinct the article has been training you to apply to other people’s AI output. Apply it to mine too.
The recurring failure
This article was drafted by Claude Opus 4.7, in conversation with me, Michelle, on 6 May 2026.
The conversation is configured against em-dashes. The project instructions explicitly forbid them. The configured behaviour is to rewrite any sentence that wants an em-dash using a comma, semicolon, full stop, or parentheses. Em-dashes themselves are not the problem; the problem is a specific failure mode where the AI substitutes a double comma (“,,”) in their place, treating the punctuation rule as a search-and-replace rather than as a structural rewrite.
The behaviour has been corrected explicitly, multiple times, across multiple sessions, with multiple variations of the rule in the project instructions. A note about the failure mode is in the AI’s persistent memory. The rule is reiterated every time it surfaces.
The failure recurs. Including in this article. Including in the specimen-by-specimen sections, where the punctuation slips landed in two paragraphs that the human author had to flag and correct before the draft could continue.
Configuration helped. The frequency of the slip is lower than it would be without the rule. The visibility of the slip is higher than it would be without the explicit corrections. But the slip persists, because the underlying training pull is stronger than the explicit instruction in many specific moments of generation. The AI knows the rule. The AI applies the rule most of the time. The AI also breaks the rule, predictably, when reaching for emphasis at the end of a thought.
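The punctuation slip is one of the few tells a script can catch without judgement. A minimal Python sketch, assuming the slip shows up either as a literal em-dash or as two commas in a row, optionally separated by whitespace. The function name and the returned keys are illustrative.

```python
import re

# "\u2014" is the em-dash. The double comma is the substitute described
# above; whitespace between the two commas is tolerated.
EM_DASH = "\u2014"
DOUBLE_COMMA = re.compile(r",\s*,")


def punctuation_slips(text: str) -> dict[str, int]:
    """Count em-dashes and double commas in a draft."""
    return {
        "em_dashes": text.count(EM_DASH),
        "double_commas": len(DOUBLE_COMMA.findall(text)),
    }
```

A count above zero tells you where to look. It does not do the rewrite, which is the structural work the rule was asking for in the first place.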
That is one category of failure. The article you have just read contains another, larger category. The structural argument section was rhetorical slop produced by the same AI, in the same session, despite the entire article being explicitly about how to recognise rhetorical slop. The human author noticed mid-read and flagged it. The AI then produced the diagnosis and the rewritten version that now sit alongside the original.
Two categories of failure, both visible in the same article, both caught by the same human watching. This is the conversation-not-prompt argument made visible in the article that is making it. The configuration is necessary. The conversation is what catches the failures the configuration does not prevent. The verification is the human reading the output, including the output that purports to be diagnosing AI failure modes, including the output that the AI itself believes it has scanned for the very failure mode it then commits.
The article is not a finished demonstration of perfect AI. It is the documentary record of what AI produces when a human is paying attention. The reader holding this article in their inbox or their browser is holding evidence of the argument. The slop tells in the specimens are a category. The punctuation tells in the article are another category. The rhetorical structure of the original structural-argument section is a third. All three are visible because somebody was watching. All three would be invisible if nobody had been.
The thing the article is asking you to do is the thing the article was made by doing. Watch.


