
Ditch Recurrence for Speed: Attention-Only Transformers

A take on a Google paper advocating attention-only transformer models: no recurrence, full parallelism, scalable encoder–decoder stacks, and faster, simpler AI pipelines.

@willccbb posted on X

I just read this new paper from Google and I’m absolutely buzzing 🤯 The core idea is almost offensively simple: ditch recurrence and convolutions, and use only attention. That’s it. And somehow…it unlocks a whole new regime of performance, scale, and simplicity.

Here’s what blew my mind:
- No recurrence, full parallelism. Tokens don’t have to march one step at a time anymore. Training lights up the whole sequence at once. Throughput goes way up, iteration cycles shrink.
- Multi-head attention = multiple viewpoints. The model learns to focus on different relationships simultaneously. Syntax, semantics, long-range dependencies—captured in parallel.
- Positional encodings without the baggage. You still get order awareness, but with zero recurrence overhead.
- Encoder–decoder stacks that actually scale. Deep, clean, modular blocks with residual connections and layer norm that just…train. Reliably.
- Results that speak for themselves. Stronger quality on translation benchmarks with dramatically better efficiency—and a simpler pipeline.

Why this matters (right now):
- Speed → strategy. When training is parallel and stable, you iterate faster, test more hypotheses, and ship better models sooner.
- Quality → product. Long-range reasoning and richer representations turn into real-world wins: better search, smarter assistants, more robust generative systems.
- Simplicity → leverage. Fewer moving parts, clearer abstractions, and a backbone that generalizes across tasks. This is an architectural blueprint, not a one-off trick.

What I’m changing this week:
- Refactoring any sequence stack I touch toward a Transformer backbone.
- Re-thinking compute budgets around parallelism (bigger effective context, larger batches, faster turnaround).
- Making attention the first-class citizen in modeling discussions—design defaults, not an afterthought.

This paper feels like an inflection point. If you’re building anything with sequences—language, code, planning, you name it—read it, internalize it, and rethink your roadmap. The title isn’t marketing. Attention really is all you need.

#AI #MachineLearning #NLP #Transformers #DeepLearning #GoogleAI #Attention #Research #ProductEngineering #Builders

View original tweet on X →
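
For readers who want the mechanics behind the hype, the pieces the post name-checks (scaled dot-product attention, multiple heads, sinusoidal positional encodings) fit in a few lines. Below is a minimal NumPy sketch, not the paper's trained model: the weight matrices are random stand-ins for learned parameters, and all names and dimensions are illustrative.

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings: order awareness with no recurrence.
    PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # (1, d_model/2), the "2i" values
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V. Every token attends to every other
    token in one matrix product, which is what makes training parallel."""
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-2, -1) / np.sqrt(d_k)  # (..., seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

def multi_head_attention(x, num_heads, rng):
    """Each head works on its own slice of the projected features, so
    different heads can track different relationships. Slicing one big
    projection is equivalent to separate per-head projection matrices.
    Random weights stand in for trained parameters here."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        for _ in range(4)
    )
    def split(t):  # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    heads = scaled_dot_product_attention(
        split(x @ w_q), split(x @ w_k), split(x @ w_v)
    )
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
seq_len, d_model = 8, 64
x = rng.standard_normal((seq_len, d_model)) + sinusoidal_positions(seq_len, d_model)
out = multi_head_attention(x, num_heads=4, rng=rng)
print(out.shape)  # (8, 64)
```

The property the post celebrates is visible in the shapes: the (seq, seq) attention matrix covers the whole sequence in one shot, with no per-token recurrence to wait on.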

Community Sentiment Analysis

Real-time analysis of public opinion and engagement

Sentiment Distribution

Engaged: 65%
Positive: 19%
Negative: 46%
Neutral: 35%

Key Takeaways

What the community is saying — both sides

Supporting

1. Explosive enthusiasm: Replies call the work “revolutionary,” “mind-blowing,” and a “game changer,” with some claiming it could change AI forever or even brush up against AGI.

2. Attention-centric insight: Many underline that “Attention Is All You Need” became the field’s backbone, praising simplicity unlocking scale and the shift from sequential bottlenecks to globally aware computation.

3. What’s next: Thoughtful questions ask whether progress comes from refining attention or entirely new paradigms, with practical nods to positional encodings and architecture tuning.

4. Product impact: People anticipate faster training, richer models, and cleaner design, and speculate about ChatGPT integration and claims like “language is about to be solved.”

5. Social proof for the author: High-energy support (buying the newsletter, posting to LinkedIn, 10/10, “banger”), with praise that the breakdown is on to something big.

6. Links, humor, and edge notes: Calls to read the paper (and more links), sprinkled with memes and playful lines (“refactor my life into a Transformer stack,” “taoism operator”), plus rare sarcasm that doesn’t dent the surging excitement.

Opposing

1. Repliers keep noting it’s from 2017, calling the post clickbait and a reheated “new” claim.

2. Many read it as an algorithm/engagement test and a way to surface bot accounts, citing deliberate rage-bait.

3. The thread leans into jokes, memes, and sarcasm, with plenty of digs at the LinkedIn-style writing and a few “delete this” reactions.

4. A side debate erupts over Transformers vs. RNN/LSTM (with nods to linear transformers, edge use cases, and “images need CNNs”), plus tongue-in-cheek hot takes like “LSTMs FTW.”

5. Several admit confusion (“what’s the joke/context?”) and link proofs that it’s not new.

6. Meta-parodies compare it to “discovering” Turing, Markov chains, or Galileo to mock the framing.

7. Engagement watchers call out quote-tweet farming, “aura farming,” and note this is “internet trolling at its finest.”

8. A few speculative asides pop up (AGI predictions, “beat the Turing test,” and “Does this kill REST APIs?”), mostly played for laughs.

Top Reactions

Most popular replies, ranked by engagement

@bendee983 · Supporting
“Wait till you read this paper”
142 · 4 · 5.9K

@mpopv · Supporting
“Wow! You're certainly on to something here. This isn't just intriguing—it's potentially revolutionary.”
111 · 1 · 2.6K

@Yuchenj_UW · Supporting
“you are absolutely right!”
59 · 0 · 3.2K

@N8Programs · Opposing
“what model ghostwrote this? or did you painstakingly mimic the horrifying n-grams of the original yourself.”
27 · 1 · 4.0K

@vsaha_twt · Opposing
“I hate the linkedin style of writing.”
13 · 1 · 818

@MustafaBoorenie · Opposing
“internet trolling at its finest”
12 · 0 · 515