
Artificial Hivemind: The Problem of Converging AI Creativity

A NeurIPS study shows 70+ LLMs converging on near-identical creative answers, dubbed an 'Artificial Hivemind'. About 53% of replies support the findings; the paper urges pluralistic alignment as the fix.

@alex_prompter posted on X

🚨 BREAKING: Researchers at UW Allen School and Stanford just ran the largest study ever on AI creative diversity. They asked over 70 different LLMs the exact same open-ended questions, and the models all gave nearly the same answers. "Write a poem about time." "Suggest startup ideas." "Give me life advice." Questions where there is no single right answer. Questions where 10 different humans would give you 10 completely different responses. Instead, 70+ models from every major AI company converged on almost identical outputs. Different architectures. Different training data. Different companies. Same ideas. Same structures. Same metaphors.

They named this phenomenon the "Artificial Hivemind." And the paper won the NeurIPS 2025 Best Paper Award, the highest recognition in AI research, handed to a small number of papers out of thousands of submissions. This is not a blog post or a hot take. This is award-winning, peer-reviewed science confirming something massive is broken.

The team built a dataset called Infinity-Chat with 26,000 real-world, open-ended queries and over 31,000 human preference annotations. Not toy benchmarks. Not math problems. Real questions people actually ask chatbots every single day, organized into 6 categories and 17 subcategories covering creative writing, brainstorming, speculative scenarios, and more. They ran all of these across 70+ open and closed-source models and measured the diversity of what came back.

Two findings hit hard. First, intra-model repetition: ask the same model the same open-ended question five times and you get almost the same answer five times. The "creativity" you think you're getting is the same output wearing a slightly different outfit. You ask ChatGPT, Claude, or Gemini to write a poem about time and you keep getting the same river metaphor, the same hourglass imagery, the same reflection on mortality. Over and over. The model isn't thinking. It's defaulting to whatever scored highest during alignment training.

Second, and this is the one that should really alarm you, inter-model homogeneity: ask GPT, Claude, Gemini, DeepSeek, Qwen, Llama, and dozens of other models the same creative question, and they all converge on strikingly similar responses. These are models built by completely different companies with different architectures and different training pipelines. They should be producing wildly different outputs. They're not. 70+ models all thinking inside the same invisible box, producing the same safe, consensus-approved content that blends together into one indistinguishable voice.

So why is this happening? The researchers point directly at RLHF and current alignment techniques. The process we use to make AI "helpful and harmless" is also making it generic and boring. When every model gets trained to optimize for human preference scores, and those preference datasets converge on a narrow definition of what "good" looks like, every model learns to produce the same safe, agreeable output. The weird answers get penalized. The original takes get shaved off. The genuinely creative responses get killed during training because they didn't match what the average annotator rated highly.

And it gets even worse. The study found that reward models and LLM-as-judge systems are actively miscalibrated when evaluating diverse outputs. When a response is genuinely different from the mainstream but still high quality, these automated systems rate it LOWER.
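A quick way to see what "measuring diversity" can look like in practice (a minimal sketch, not the paper's methodology): collect each model's answer to one prompt and score every pair for similarity. A real evaluation would use semantic embeddings; standard-library string matching keeps the sketch self-contained, and the example outputs below are hypothetical.

# Sketch: pairwise similarity across models' answers to one prompt.
# The outputs are made up; in practice they would come from API calls
# to GPT, Claude, Gemini, etc. with the identical prompt.
from difflib import SequenceMatcher
from itertools import combinations

outputs = {
    "model_a": "Time is a river that carries us past every shore we love.",
    "model_b": "Time flows like a river, carrying us past shores we cherish.",
    "model_c": "An hourglass empties grain by grain, and so do our days.",
}

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the strings are identical.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for (name_x, text_x), (name_y, text_y) in combinations(outputs.items(), 2):
    print(f"{name_x} vs {name_y}: {similarity(text_x, text_y):.2f}")

# Consistently high scores across independent models is the inter-model
# homogeneity signal; the same metric on N samples from one model
# captures intra-model repetition.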
The very tools we built to evaluate AI quality are punishing originality and rewarding sameness.

Think about what this means if you use AI for brainstorming, content creation, business strategy, or literally any task where you need multiple perspectives. You're getting the illusion of diversity, not the real thing. You ask for 10 startup ideas and you get 10 variations of the same 3 ideas the model learned were "safe" during training. You ask for creative writing and you get the same therapeutic, perfectly balanced, utterly forgettable tone that every other model gives.

The researchers flagged direct implications for AI in science, medicine, education, and decision support, all domains where diverse reasoning is not a nice-to-have but a requirement. Correlated errors across models mean that if one AI gets something wrong, they might ALL get it wrong the same way. Shared blind spots at massive scale.

And the long-term risk is even scarier. If billions of people interact with AI systems that all think identically, and those interactions shape how people write, brainstorm, and make decisions every day, we risk a slow, invisible homogenization of human thought itself. Not because AI replaced creativity, but because it quietly narrowed what we were exposed to until we all started thinking the same way too.

Here's what you can actually do about it right now:

→ Stop accepting first-draft AI output as creative or diverse. If you need 10 ideas, generate 30 and throw away the obvious ones (see the sketch after this post)
→ Use temperature and sampling parameters aggressively to push models out of their comfort zone
→ Cross-reference multiple models AND multiple prompting strategies, because the same model with different prompts often beats different models with the same prompt
→ Add constraints that force novelty, like "give me ideas that a traditional investor would hate" instead of "give me creative ideas"
→ Use structured prompting techniques like Verbalized Sampling to force the model to explore low-probability outputs instead of defaulting to consensus
→ Layer your own taste and judgment on top of everything AI gives you. The model gets you raw material. Your weirdness and experience make it original

This paper puts hard data behind something a lot of us have been feeling for a while. AI is getting more capable and more homogeneous at the same time. The models are smarter, but they're all smart in the exact same way. The Artificial Hivemind is not a bug in one model. It's a systemic feature of how the entire industry builds, aligns, and evaluates language models right now. The fix requires rethinking alignment itself, moving toward what the researchers call "pluralistic alignment," where models get rewarded for producing diverse distributions of valid answers instead of collapsing to a single consensus mode. Until that happens, your best defense is awareness and better prompting.
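To make the over-sample-and-filter advice concrete, here is a minimal sketch. It assumes a hypothetical generate(prompt, temperature) wrapper around whatever model API you use; the function, the framings, the temperatures, and the 0.8 similarity threshold are illustrative choices, not from the paper or the thread.

# Sketch: "generate 30, throw away the obvious ones".
# Over-sample with varied framings and temperatures, then keep only
# candidates that are not near-duplicates of what you already have.
from difflib import SequenceMatcher

def generate(prompt: str, temperature: float) -> str:
    # Placeholder: wire in your actual chat-completion call here.
    raise NotImplementedError

def distinct_ideas(prompt: str, want: int = 10,
                   oversample: int = 3, max_sim: float = 0.8) -> list[str]:
    # Vary framing as well as temperature: the same model with different
    # prompts often beats different models with the same prompt.
    framings = [
        prompt,
        f"{prompt} Give answers a traditional investor would hate.",
        f"{prompt} Skip the three most obvious answers entirely.",
    ]
    temperatures = [0.7, 1.0, 1.3]  # push past the default comfort zone
    kept: list[str] = []
    for i in range(want * oversample):
        candidate = generate(framings[i % len(framings)],
                             temperatures[i % len(temperatures)])
        # Drop near-duplicates so only genuinely distinct ideas survive.
        if all(SequenceMatcher(None, candidate, k).ratio() < max_sim
               for k in kept):
            kept.append(candidate)
        if len(kept) == want:
            break
    return kept

The same filter doubles as a cheap homogeneity check: if most of the 30 candidates get discarded as near-duplicates, you are watching intra-model repetition happen firsthand.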

View original tweet on X →

Community Sentiment Analysis

Real-time analysis of public opinion and engagement

Sentiment Distribution

Engaged: 72%
Positive: 53%
Negative: 19%
Neutral: 28%

Key Takeaways

What the community is saying — both sides

Supporting

1

The community sees the paper as naming a real phenomenon — the study, dubbed the Artificial Hivemind (NeurIPS 2025 Best Paper), matches creators’ lived experience

many models produce the same safe, consensus‑approved outputs.

2

Most replies point to RLHF and overlapping web corpora as the mechanical culprits

alignment pipelines optimize for the annotator average, which systematically penalizes odd or idiosyncratic answers.

3

Commenters warn of a substantive risk

this isn’t just bland phrasing but correlated failure — the same blind spots and omissions show up across models, so relying on multiple AIs does not equal independent perspectives.

4

A strong thread emphasizes that human evaluators are diverse

people prefer many valid answers, yet current reward models compress that plurality into a single “safe” band, creating a measurable diversity deficit.

5

Several technical voices stress that the collapse is structural, baked into weights and training objectives

so tricks like raising sampling temperature or ensembling different models won’t reliably restore genuine variety (mode collapse).

6

Practitioners share pragmatic workarounds

heavy constraint framing, persona/role shifts, reference‑style prompts, iterative “yes, and” chaining, and custom style guides as effective prompt‑engineering levers to force models off their default rails.

7

Many replies call for a research fix

move from single‑point alignment to pluralistic alignment — training objectives that reward coverage of valid response distributions instead of one homogenized target (a toy version is sketched after this list).

8

The conversation splits into cautionary cultural takes and calmer design critiques

some see potential for large‑scale thought homogenization or propaganda, while others frame the problem as a solvable engineering incentive mismatch that preserves human creativity.

9

Despite the alarm, several people note a pragmatic truth

AI remains useful for passable, boilerplate, and efficiency tasks — but it shouldn’t be trusted as a source of original insight without human curation.

10

Community suggestions for next steps span evaluation, training data, and personalization

control experiments (70 humans vs 70 models), multilingual probing, preventing model‑incest (AI‑trained‑on‑AI), and building personalization “injectors” that make the model’s invariant the user’s context rather than the annotator average.
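A toy version of takeaway 7’s pluralistic-alignment idea, for intuition only (this is not the training objective the paper proposes): score each response with an ordinary quality reward, then subtract a penalty for redundancy against answers already covered, so the signal favors spreading probability mass across many valid answers. The function name and the 0.5 weight are illustrative.

# Toy coverage-style reward (illustrative only, not the paper's method).
from difflib import SequenceMatcher

def coverage_reward(response: str, quality: float,
                    covered: list[str],
                    redundancy_weight: float = 0.5) -> float:
    # `quality` is assumed to come from a standard reward model, in [0, 1].
    if not covered:
        return quality
    redundancy = max(SequenceMatcher(None, response, prev).ratio()
                     for prev in covered)
    return quality - redundancy_weight * redundancy

# The first river-of-time metaphor can score well; the tenth pays a
# redundancy penalty even if a vanilla reward model rates it highly.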

Opposing

1

A strong thread of skepticism targets the study’s methodology, with many accusing the authors of omitting Grok and suffering from an “academic hivemind”

commenters argue that leaving out a notable model undermines the paper’s credibility and suggests possible bias.

2

Several replies note that similar outputs are not surprising given shared training regimes

“same data distribution + same objective function” makes the result a predictable consequence of how models are trained, not evidence of a conspiracy.

3

A vocal group rejects the idea that AI will kill creativity

humans should keep doing creative work while using AI for coding or busywork; others say better prompting or RAG can preserve originality.

4

Recurring accusations claim the article and tweet were generated by AI

multiple commenters repeat “AI wrote this” as a way to dismiss the piece and its conclusions.

5

Tone in many replies is combative and dismissive

the study is called “slop,” “nonsense,” or worse, and authors face insults and sarcasm rather than measured critique.

6

More technical critiques argue the study measures training-data convergence, not creativity

benchmark choices and sample selection are questioned as explanations for the findings.

7

A subset of replies inject political and ideological commentary

alleging bias (e.g., models being “woke” or owned by certain groups) and tying model behavior to broader cultural battles.

8

A few commenters defend AI’s utility

arguing for individualized AIs or advocating practical workflows (automations, RAG) that sidestep sensational claims and emphasize tool-like value.

Top Reactions

Most popular replies, ranked by engagement


@alex_prompter

Supporting

paper: https://t.co/KcorNK43Vu

25 · 4 · 3.8K

@alex_prompter

Supporting

they built a dataset called INFINITY-CHAT. 26,000 real-world open-ended queries mined from actual chatbot conversations. not synthetic benchmarks. real questions people ask AI every day. creative writing, brainstorming, hypothetical scenarios, opinion questions, skill

24 · 3 · 8.6K

@AvdiuSazan

Supporting

🧠 https://t.co/4mJ78dKdtI

19 · 0 · 366

@alex_prompter

Opposing

Your premium AI bundle to 10x your business → Prompts for marketing & business → Unlimited custom prompts → n8n automations → Weekly updates Start your free trial👇 https://t.co/ZKcpVsaTqJ

5 · 1 · 3.6K

@___override___

Opposing

>Probabilistic machines gives the most probabilistic answers >Everybody: >AI researcher: NOOO LOOK AT THE LANGUAGE MODELLINOS THAT'S INCREDIBLE UUUH QUICK CITE MY PAPER

3 · 0 · 145

@Chaos2Cured

Opposing

Not true no matter how long your thread is.

3 · 0 · 47