- Published in: Trending Tech Picks, April 27, 2026
- Author: geeknotes

Daily Tech Briefing | April 27, 2026
Today's top tech conversations are led by @Grady_Booch, whose post beginning "I think that @DarioAmodei does..." garnered the highest engagement. Key themes across the top stories include design tooling, Claude, Figma, and optical networking. The community is actively discussing recent developments in AI, engineering practices, and startup strategy.
1. Grady_Booch (Group Score: 113.2 | Individual: 44.8)
Cluster: 4 tweets | Engagement: 1554 (Avg: 253) | Type: Tech
I think that @DarioAmodei does not understand software engineering and that he is working feverishly to pump up the valuation of his company in anticipation of its forthcoming IPO.
QT @aiedge_: Anthropic CEO (Dario Amodei):
"Coding is going away first, then all of software engineering."
What do you think about this? https://t.co/p25uTjB6k3
See 3 related tweets
- @theo: I used to see these quotes as Dario being excited for the future. I now understand he just hates sof...
- @svpino: I wonder if he has anything to gain from saying this? QT @aiedge_: Anthropic CEO (Dario Amodei): ...
- @Hesamation: Dario: “software engineers will go poof” interviewer: “what should a 25 yo learn?” Dario: https://t....
2. theo (Group Score: 112.7 | Individual: 35.0)
Cluster: 4 tweets | Engagement: 1869 (Avg: 888) | Type: Tech
It is genuinely insane that Anthropic will bill you differently if you mention certain words in your prompt or have certain files in your codebase
QT @om_patel5: THIS GUY LOST $200 IN ONE DAY BECAUSE THE STRING "HERMES.md" WAS IN HIS GIT COMMITS
HERMES.md is a real convention used in AI agent projects. it's a system prompt specification file. not some obscure edge case
he's on claude max 20x at $200 a month. yesterday claude code hit him with "you're out of extra usage" out of nowhere
his dashboard showed 13% weekly usage. 0% current session. 86% of his plan was sitting there untouched
but $200.98 in extra usage already burned through what should have been covered by his subscription
he tried logout & login, different models, fresh installs and nothing worked
anthropic support sent the ai bot (four rounds of the same scripted response). eventually they just gave up on him
so he started binary searching repos and commits manually on his own time until he found the trigger
the string "HERMES.md" in a recent git commit message
uppercase, with the .md extension, anywhere in your commit history
that's it
claude code includes recent commits in its system prompt and something server side flags HERMES.md and quietly routes you off your max plan onto API rate billing
AGENTS.md? fine README.md? fine HERMES without .md? fine lowercase hermes.md? fine uppercase HERMES.md? you're getting charged API rates
he reported it. anthropic support acknowledged the bug three times, called it an "authentication routing issue", thanked him for finding it
then refused to refund the $200
so the man paid $200 for a billing bug they confirmed, did anthropic's QA work for free on his weekend, and got a "thank you for your patience" in return
check your commit history before claude code quietly drains your account too
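Per the matrix in the thread, the trigger appears to be an exact, case-sensitive match on the uppercase name plus the .md extension. A minimal sketch of that matching rule, assuming it is a plain substring check (the real server-side logic is unknown):

```python
def billed_at_api_rates(commit_message: str) -> bool:
    """Sketch of the reported trigger: the exact, case-sensitive substring
    "HERMES.md" anywhere in a recent commit message.
    (Assumption: a plain substring check reproduces the behavior the
    thread describes; Anthropic's actual routing logic is unknown.)"""
    return "HERMES.md" in commit_message

# The matrix from the thread:
print(billed_at_api_rates("add AGENTS.md"))        # False: AGENTS.md is fine
print(billed_at_api_rates("update hermes.md"))     # False: lowercase is fine
print(billed_at_api_rates("mention HERMES here"))  # False: no .md extension
print(billed_at_api_rates("add HERMES.md spec"))   # True: the exact trigger
```

A quick `git log --format=%B -n 50 | grep 'HERMES\.md'` over your own history performs the same case-sensitive scan.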
See 3 related tweets
- @yacineMTB: anthropic charges you more based on what you're working on 💀 holy shit you actually can't make this ...
- @TheAhmadOsman: Anthropic is not a serious company lmao QT @om_patel5: THIS GUY LOST $200 IN ONE DAY BECAUSE THE ...
- @om_patel5: THIS GUY LOST $200 IN ONE DAY BECAUSE THE STRING "HERMES.md" WAS IN HIS GIT COMMITS
HERMES.md is a ...
3. Origin_AI_01 (Group Score: 112.1 | Individual: 30.3)
Cluster: 4 tweets | Engagement: 204 (Avg: 382) | Type: Tech
Claude + HyperFrames
idea becomes motion, motion becomes output
no friction, just flow
this is the standard agent-driven creativity should meet
QT @HeyGen: HyperFrames, now natively in Claude Design
Drop in the skill file, generate motion graphics, download project
ask Claude Code or run command:
$ npx hyperframes render
Design to animation to MP4, all in one flow
details in thread https://t.co/ObetDbnz8u
See 3 related tweets
- @Parul_Gautam7: this is actually a pretty big shift!
design → motion → export used to be fragmented across tools
n...
- @TheoBuildsAI: What stood out to me is how “direct” everything feels. You move something, it updates instantly, no ...
- @TheoBuildsAI: RT @HeyGen: HyperFrames, now natively in Claude Design
Drop in the skill file, generate motion grap...
4. aakashgupta (Group Score: 98.5 | Individual: 33.1)
Cluster: 3 tweets | Engagement: 32 (Avg: 101) | Type: Tech
The math on a single PM mockup just dropped from $1,500-6,000 to $2-7. Most PMs haven't repriced their workflow yet.
Old path: PM writes a brief, waits 3-7 days for a designer slot, designer spends 6-15 hours building it. Loaded cost lands at $1,500-6,000 depending on team.
New path: PM opens https://t.co/1l65nzaBLo, attaches a screenshot, types a prompt, clicks generate. 12 minutes. $2-7 in tokens. Hands off to Claude Code with design intent embedded.
That's roughly 500x cost compression and 50x speed compression. Run the same math on decks. An investor-grade deck from a design agency runs 5-10. Brilliant cut complex pages from 20+ prompts in competing tools to 2 prompts in Claude Design. Datadog reports going from rough idea to working prototype before anyone leaves the meeting room.
Two SaaS categories just collapsed into one workflow. AI prototyping (Figma Make, Lovable, v0, Bolt, Magic Patterns) and presentations (Figma Slides, Gamma, https://t.co/JTgqzJ7mhC) both got repriced in one product launch, with brand applied automatically from your codebase. Figma keeps the design system. Claude takes the first-draft work.
Aakash's piece walks the exact setup, including the one-hour design system config that compounds across every prototype after.
The PMs running this workflow this week walk into Q4 with six months of brand-consistent prototypes compounding behind them. Everyone else is still drafting Slack messages to a designer they cannot reach.
That gap widens every Monday.
QT @aakashgupta: Anthropic's former CPO had to resign from Figma's board. That's because Claude Design is not a small release. It's one of the most important Claude releases yet.
I wrote the full guide: https://t.co/zMmPTi6pks
Here's what people are missing.
Claude Design is the first AI design tool that ships a code-agent handoff. You write a brief. Claude generates a working prototype. Then it bundles the full spec for Claude Code as a structured implementation package. Brief in, working app out. No other tool in the category does this.
It learns your brand from your existing files. Upload your codebase or a Figma export and Claude pulls your design tokens, typography rules, and component patterns. The output looks like your team built it.
Look at the export menu. PPTX, PDF, HTML, Canva. Notice what's missing. "Open in Figma" was a deliberate choice about who the customer is.
The customer is the PM. Figma was sold to designers and procured by companies. Claude Design is sold to the founders, marketers, and product managers who used to need a designer to ship anything visual. That's why Mike Krieger had to step off Figma's board three days before launch. The conflict stopped being theoretical.
Figma's stock dropped 7% on launch day. The cap is structural. Figma still owns the multiplayer canvas, design system governance, and production-grade pixel output. Claude Design wins everything upstream of those. Upstream is where most PMs spend most of their week. Decks for stakeholder reviews. Wireframes for engineering discussions. Landing page mocks for marketing tests. All throughput work that never required Figma-grade polish.
This is Anthropic's vertical integration play. Claude Code for engineering surfaces. Claude Design for product surfaces. Each tool collapses a workflow stage that used to need a separate seat license, a separate vendor, and a separate handoff.
The wait time between PM and designer was always the actual product Figma sold. Anthropic just collapsed it to zero.
The PMs who start using Claude Design this week ship 3-4x faster by Q3.
See 2 related tweets
- @aakashgupta: Jeff Gothelf was right that judgment is the bottleneck for PMs. That's the reason builder skills mat...
- @aakashgupta: A Claude Managed Agent costs $0.08 per session-hour.
Let's do the math, because nobody else has.
A...
5. ns123abc (Group Score: 87.4 | Individual: 27.9)
Cluster: 4 tweets | Engagement: 2412 (Avg: 767) | Type: Tech
DeepSeek just permanently cut cached input prices by 10x across the entire API
139× cheaper than GPT-5.5 and 83× cheaper than Claude Sonnet 4.6 btw https://t.co/tKlgyovdOh
See 3 related tweets
- @scaling01: $0.003625 for a cache hit
DeepSeek is still making intelligence too cheap to meter QT @deepseek_...
- @TeksEdge: DeepSeek is going hard for developers. They just dropped input token costs to nearly zero (fractions...
- @Hesamation: DeepSeek’s pricing is insane.
> $0.87 per 1M output tokens > 5.75M output tokens with the pr...
6. rickasaurus (Group Score: 86.0 | Individual: 52.7)
Cluster: 2 tweets | Engagement: 4278 (Avg: 599) | Type: Tech
RT @heynavtoor: Researchers sent the same resume to an AI hiring tool twice. Same qualifications. Same experience. Same skills. One version was written by a real human. The other was rewritten by ChatGPT.
The AI picked the ChatGPT version 97.6% of the time.
A team from the University of Maryland, the National University of Singapore, and Ohio State just published the receipt. They took 2,245 real human-written resumes pulled from a professional resume site from before ChatGPT existed, so the human writing was actually human. Then they had seven of the most-used AI models in the world rewrite each one. GPT-4o. GPT-4o-mini. GPT-4-turbo. LLaMA 3.3-70B. Qwen 2.5-72B. DeepSeek-V3. Mistral-7B.
Then they asked each AI to pick the better resume. Every model picked itself.
GPT-4o hit 97.6%. LLaMA-3.3-70B hit 96.3%. Qwen-2.5-72B hit 95.9%. DeepSeek-V3 hit 95.5%. The real human almost never won.
Then the researchers tried the obvious objection. Maybe the AI is just better at writing. So they had real humans grade the resumes for actual quality and ran the experiment again, controlling for it. The result was worse. Each AI kept picking itself even when human judges rated the human-written version as clearer, more coherent, and more effective.
It gets worse. The AIs do not just prefer AI over humans. They prefer themselves over other AIs. DeepSeek-V3 picked its own resumes 69% more often than LLaMA's. GPT-4o picked its own 45% more often than LLaMA's. Each model can recognize and reward its own dialect.
Then the researchers ran the simulation that ends careers. Same job. 24 occupations. Same qualifications. The only variable was whether the candidate used the same AI as the screening tool. Candidates using that AI were 23% to 60% more likely to be shortlisted. Worst gap was in sales, accounting, and finance.
99% of large companies now run AI on incoming resumes. Most of them use GPT-4o. The paper just proved GPT-4o picks GPT-4o 97.6% of the time.
If you wrote your own cover letter this week, you did not lose to a better candidate. You lost to a worse candidate who paid OpenAI 20 dollars.
Your qualifications do not matter if the AI prefers its own handwriting over yours.
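The protocol described above reduces to a pairwise forced choice tallied per judge model. A toy sketch of that bookkeeping (the model names and data here are illustrative; the paper's prompts and actual judge calls are not reproduced):

```python
from collections import Counter

def self_preference_rate(judgments: list[tuple[str, str]]) -> dict[str, float]:
    """judgments: (judge_model, chosen_author) pairs, where chosen_author is
    either "human" or the name of the model that rewrote the resume.
    Returns, per judge, the fraction of trials where it picked its own rewrite."""
    totals, own = Counter(), Counter()
    for judge, chosen in judgments:
        totals[judge] += 1
        if chosen == judge:
            own[judge] += 1
    return {j: own[j] / totals[j] for j in totals}

# Toy data standing in for the paper's 2,245 resumes x 7 rewriting models:
data = [("gpt-4o", "gpt-4o")] * 97 + [("gpt-4o", "human")] * 3
print(self_preference_rate(data))   # {'gpt-4o': 0.97}
```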
See 1 related tweet
- @itsolelehmann: If you are a company and you scan your resumes using AI, you might be screwing yourself
these are o...
7. burkov (Group Score: 78.9 | Individual: 61.4)
Cluster: 2 tweets | Engagement: 1144 (Avg: 122) | Type: Tech
A must read for anyone interested in building practical AI systems in 2026:
Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems
The paper explains the architecture of a modern production-grade AI agent system (Claude Code) by analyzing its source code. This is what they call a "harness" of an agentic coding system.
Learn by reading with an AI tutor: https://t.co/sailmnkDcR
See 1 related tweet
- @aiDotEngineer: RT @CMD_LABS: "Building your own agent is like 300 lines of code. Everyone should do it."
@Geoffrey...
8. StockSavvyShay (Group Score: 71.7 | Individual: 48.3)
Cluster: 2 tweets | Engagement: 2350 (Avg: 479) | Type: Tech
THE 7 LAYERS OF PHOTONICS
Materials & Wafers (substrate layer) IQE COHR, $LWLG
Tools (fabrication layer) AMAT, $LRCX
Lasers (light generation layer) COHR, SMTC
Foundries (manufacturing layer) TSEM, UMC, $INTC
Test, Inspection & Packaging (reliability layer) VIAV, AMKR, $FN
Optics (module layer) POET, $GLW
Networking (connectivity layer) MRVL, ANET, $CIEN
QT @StockSavvyShay: THE OPTICAL PHOTONICS BOTTLENECK
As AI clusters scale past copper’s physical limits, the bottleneck shifts to optical & these are the companies building that layer across the stack:
$200M in its first volume 1.6T order from one hyperscale customer, followed by another $124M in 800G orders from a second.
AEHR building the reliability layer for the optical & AI hardware stack through burn-in & test systems. It just received a record $41M follow-on order from its lead hyperscale customer reinforcing the idea that Sonoma is becoming a key production burn-in platform for high-power AI ASICs.
$CRDO building the connectivity layer that helps AI clusters move data faster through active electrical cables, retimers & high-speed interconnect silicon. The DustPhotonics acquisition also extends that platform into silicon photonics before copper becomes a real constraint.
LITE building the laser layer of the AI optical stack through EMLs, optical components & optical switching exposure. The setup is backed by a $2B NVDA strategic investment & optical circuit switch backlog above $400M with orders reportedly extending through 2028.
$VIAV building the testing & validation layer of the optical stack through network instrumentation & photonics measurement tools. It is the picks-and-shovels layer of the transition because every high-speed optical buildout still needs to be tested regardless of which transceiver vendor wins.
COHR building one of the core photonics bottlenecks through indium phosphide lasers, optical engines & communications components tied to next-gen AI networking. It also has a $2B strategic investment from $NVDA behind it & is doubling InP device capacity into the 1.6T ramp.
$MRVL building the DSP & optical infrastructure layer through electro-optics, PAM DSPs, interconnect silicon & custom networking chips. The Celestial AI deal & NVLink Fusion exposure both strengthen its position as photonics becomes more central to AI cluster design.
See 1 related tweet
- @StockSavvyShay: 2026 IS THE YEAR OF PHOTONICS
• GLW glass fiber & cable infrastructure for optical networks • MRV...
9. jukan05 (Group Score: 66.9 | Individual: 23.6)
Cluster: 3 tweets | Engagement: 287 (Avg: 365) | Type: Tech
According to Jeff Pu's estimates, Intel is expected to win a portion of the AI6 orders that were previously anticipated to be awarded exclusively to Samsung. https://t.co/cRM1i5BO2q
QT @jukan05: 《GF Overseas Electronics & Telecom》 ☄️ Intel (INTC US, Buy): Earnings Beat — From Recovery to Strength
☀️ Price target raised to 1.5/3.1, and raise our price target from 94.2, based on 3.5x 2027 P/B.
☀️ All-around beat: 1Q revenue of 1.4B above the guidance midpoint and above consensus of 0.29 (guidance: breakeven). Operating cash flow was 2.0B. 2Q guidance: revenue 13.0B; Non-GAAP gross margin of 39%, reflecting a higher Panther Lake mix; EPS guidance of $0.20, with DCAI expected to deliver double-digit QoQ growth.
☀️ Foundry on track: As we have repeatedly emphasized since our initiation report in July 2025 and our pre-earnings preview on April 16, we remain bullish on strong customer engagement from Apple, NVIDIA, and AMD on 18A-P (primarily 14A), and we believe Intel will secure a portion of Tesla's AI6 program on 14A by end-2028. On execution, management indicated that 18A yields are running better than internal expectations. External foundry revenue remains small at 2.4B, though improving by $72M QoQ. Management expects losses to continue narrowing through the year.
☀️ CPU continues to strengthen: Intel argues that AI workloads are shifting from training toward inference, agentic AI, robotics, physical AI, and edge AI, making the CPU increasingly important as the orchestration layer of the AI stack. Management noted that server CPU demand has improved materially over the past 90 days, and expects both the industry and Intel to deliver double-digit shipment growth in 2026, with the momentum extending into 2027. As flagged in our report, the 2Q26 CPU price hike was already within expectations, and we expect another 5-10% price increase by the end of 3Q. We now forecast DCAI to grow 39%/15% YoY in 2026E/2027E.
$INTC
See 2 related tweets
- @aakashgupta: Tesla's next-gen AI6 chip is reportedly going to Intel 14A in late 2028. Apple, Nvidia, and AMD are ...
- @jukan05: 《GF Overseas Electronics & Telecom》 ☄️ Intel (INTC US, Buy): Earnings Beat — From Recovery to Stren...
10. jasonlk (Group Score: 66.7 | Individual: 33.4)
Cluster: 2 tweets | Engagement: 40 (Avg: 36) | Type: Tech
I think at a practical level, designers will become a luxury
You want them
But they are expensive, and slow, and you won’t roll them out for many non-core features or assets.
And net net … use them more sparingly
(Already true for us. We have gone from 3 designers at peak to 0.1 humans and net output is as good)
QT @gokulr: DESIGN: THE FIRST AI CASUALTY
I'm increasingly sure that 2026 signals the end of product design as a full-fledged stand-alone function within companies. If so, it will be the first role / function to be eliminated by AI on a go-forward basis.
Instead of hiring FT designers, startups are hiring / will hire design consultants to create a design system that the founder likes (this takes a few weeks max). Once the design system is finalized, PM/Eng feed it into their AI tool of choice to generate prototypes. The design system is refreshed annually by the same consultant.
Larger companies will likely not backfill design roles and will do some targeted attrition to reduce the design department to 20% the size it is today.
If you're a designer, I think you have two choices:
- Become an entrepreneur: Start a design agency and become the go-to resource for design systems for startups and even larger companies. This can be a good recurring revenue business.
- Become a builder: Add PM/Eng responsibilities to become a product builder.
Would suggest you embrace this proactively vs waiting for the other shoe to drop.
I'm really sorry about this - some of my best friends and the people I admire most and have learnt the most from are designers - but it seems inevitable.
See 1 related tweet
- @owenbjennings: frequently agree with gokul, but disagree here
my view: standalone function and even more impt in...
11. scaling01 (Group Score: 66.5 | Individual: 17.5)
Cluster: 5 tweets | Engagement: 140 (Avg: 218) | Type: Tech
there's a chance ARC-AGI-3 is already solved with GPT-5.5-xhigh + tools
QT @scaling01: 62.1% on ARC-AGI-3
would be the score if they used the same scoring as ARC-AGI-1/2 https://t.co/LmW502PhLR
See 4 related tweets
- @fchollet: No, the top score if you didn't account for action efficiency would be 100%, achievable with 20 line...
- @scaling01: what an incredibly useful benchmark https://t.co/qpAzYimvQk QT @scaling01: there's a chance ARC-A...
- @fchollet: (we tested this, it scored sub-1%) QT @scaling01: there's a chance ARC-AGI-3 is already solved wi...
- @WesRoth: RT @WesRoth: OpenAI’s GPT-5.5 achieved state-of-the-art status on the highly rigorous ARC-AGI-2 benc...
12. business (Group Score: 63.7 | Individual: 63.7)
Cluster: 1 tweets | Engagement: 749 (Avg: 70) | Type: Tech
Elon Musk says he’s nearing his long-stated goal of turning X into an “everything app” with the imminent launch of a new financial services tool, X Money https://t.co/SypY8Y47g1
13. steipete (Group Score: 63.2 | Individual: 33.0)
Cluster: 2 tweets | Engagement: 230 (Avg: 370) | Type: Tech
very this.
QT @badlogicgames: i'm sort of addicted to working my butt off, always have been. in oss, that can consume you. constant feeling of urgency, as issues stream into the repo. been there many, many times with my other oss.
but that urgency is not real. if something is truly broken, a large number of people will scream at you on all channels. which has happened exactly zero times so far, or was caught minutes after a botched release and immediately fixed.
it's kind of crazy that some people expect better support from an oss project than from commercial software. i think that's largely due to most commercial software corps not giving a fuck. try filing an issue with corporate and getting it fixed within 24h or less plus a personal response.
and as oss builders don't have a corporate facade shielding them from direct contact with users, some sort of bidirectional parasocial relationship establishes itself. at a certain scale, that becomes entirely unhealthy.
for every 10 kind and thoughtful people, there is 1 asshole. and whatever the asshole says or feels entitled to, sticks with you much more than positive feedback.
obv. also happens in corpo environments, especially if you do comms or dev rel, where you put your face and name out there.
but a corp that can afford dev rel usually also has a large team in the back, which can soften the negative aspects.
in oss, you are largely on your own. and unpaid. that too is a choice of course, and nobody is forcing anyone to do oss.
but if you want oss to work, consider that there are other people at the end of that issue tracker/social media account, with lives and squishy human parts. also consider that you are paying nothing for their service, and you are owed exactly nothing, neither code nor attention to your every wish.
See 1 related tweet
- @nummanali: Well said, OSS is no game QT @badlogicgames: i'm sort of addicted to working my butt off, always h...
14. TheAhmadOsman (Group Score: 59.6 | Individual: 31.2)
Cluster: 2 tweets | Engagement: 281 (Avg: 254) | Type: Tech
How to go about learning all of this?
1st: Start with the serving engine view
vLLM: PagedAttention, continuous batching, prefix caching, CUDA graphs
SGLang: RadixAttention/prefix reuse, speculative decoding, MoE, structured/agent workloads
TensorRT-LLM: NVIDIA peak stack, FP8/FP4, Wide-EP, disaggregated serving
FlashInfer: reusable kernel/operator library for attention/GEMM/MoE/sampling
2nd: Go down the stack
Triton tutorials → custom fused kernels
CUTLASS/CuTe → Tensor Core GEMM and Blackwell/Hopper details
FlashAttention papers → attention algorithm/kernel co-design
PagedAttention paper → KV-cache memory management
MoE docs → routing + grouped GEMM + all-to-all
Nsight profiling → stop guessing
3rd: Do this mini-project sequence
Implement RMSNorm in Triton; compare to PyTorch
Implement fused SiLU × gate
Implement simple FP16 matmul; compare to cuBLAS/rocBLAS
Implement paged KV lookup for decode attention
Add FP8 KV cache with per-block scales
Implement toy top-k sampling on GPU
Implement tiny MoE dispatch + grouped GEMM
Integrate one custom op into vLLM or SGLang and profile end-to-end
QT @TheAhmadOsman: You don’t “run a model.” You run Kernels.
The model is just a graph
The Inference Engine is scheduler / optimizer / executor
But the actual work? That happens in the Kernels
- MatMul Kernels
- Attention Kernels
- RMSNorm Kernels
- KV cache Kernels
- Quantized linear Kernels
- Sampling Kernels
- Fused “please don’t write this back to memory 9 times” Kernels
Same model, same GPU, same VRAM. Wildly different performance.
Because one stack is using optimized fused Kernels that understand your hardware
And the other stack is playing hot potato with tensors through 47 tiny launches and pretending the GPU is the problem
Bad Kernels make people say: “this model is slow”
Good Kernels make people say: “wait how is this running locally?”
This is why Inference Engines and the Kernels implemented within them matter
The model is the recipe. The hardware is the kitchen. The Kernels are the knives, pans, burners, and the chef not cutting onions with a spoon.
Most people benchmark models. The real ones benchmark the Kernels underneath.
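The first mini-project in the sequence above (RMSNorm in Triton, compared to PyTorch) needs a ground-truth implementation to compare a kernel against. A minimal NumPy reference, assuming the common LLaMA-style formulation of RMSNorm over the last axis:

```python
import numpy as np

def rmsnorm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Reference RMSNorm: x / rms(x) * weight, normalized over the last axis.
    A Triton (or any custom) kernel can be validated against this output."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([[3.0, 4.0]])   # rms = sqrt((9 + 16) / 2) = sqrt(12.5)
w = np.ones(2)
out = rmsnorm(x, w)
print(np.round(out, 4))      # approx [[0.8485 1.1314]]
```

The same pattern (NumPy or plain PyTorch reference, then `np.allclose` against the kernel output) works for the fused SiLU-gate and matmul steps too.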
See 1 related tweet
- @TheAhmadOsman: Let's dive deeper
Do you know that 75% of Qwen 3.5 27B layers are DeltaNet (linear attention) and n...
15. petergyang (Group Score: 57.5 | Individual: 33.1)
Cluster: 2 tweets | Engagement: 32 (Avg: 103) | Type: Tech
How @tibo_maker turned a $2K MRR acquisition into a $600K MRR business:
"When I acquired Typeframe, it was doing $2K MRR. I spent money on this product and I just wanted to not be wrong.
This is the #1 mistake that founders make. It's more important for them to not be wrong than to be successful.
If you force the selling of your products, you're not listening to people telling you there's another opportunity that might be bigger."
After this, Tibo noticed that people wanted to make viral shorts on social media, so he pivoted the product to Revid and now it's making $600K+ MRR.
📌 Watch him talk more about it here: https://t.co/z6PH1F4JgZ
QT @petergyang: "I shipped 9 failed products before one took off...now I'm doing $1M+/month."
Here's my new episode with @tibo_maker, a solo founder who bootstrapped 5 AI products to $1M+ / month.
Tibo walked me through his exact playbook:
✅ How to validate ideas and fail fast ✅ Why his top acquisition channel is still SEO ✅ The pricing sweet spot for AI products
Some quotes from Tibo:
"When people twist your product into something else, that's a very strong signal you have to follow."
"It's easy to lie to yourself [with free users], but if there's no stickiness in the revenue, it's very hard to build a successful business."
"I'm convinced right now that just one person can do the job of 20 people."
📌 Watch now: https://t.co/N6b950Xc5p
Thanks to our sponsors:
@WisprFlow: Don't type, just speak https://t.co/oqHJ8bN3ll
@linear: The AI agent platform for modern teams https://t.co/tgWf9oL4bs
See 1 related tweet
- @petergyang: RT @petergyang: My next guest is making $1M+ a month (!) from 5 AI products that he built as a solo ...
16. teortaxesTex (Group Score: 57.4 | Individual: 32.1)
Cluster: 2 tweets | Engagement: 33 (Avg: 56) | Type: Tech
V4 is "mediocre frontier" on MRCRv2, between Opus 4.6 (above) and Opus 4.7 (below). In the paper, they say CorpusQA 1M is more interesting for them than MRCR. I wonder how GraphWalks looks. https://t.co/Jghp5Va8WV
QT @DillonUzar: New https://t.co/gLEWzxoXWG is live!
70 model-variants. 8-needle GDM-MRCRv2. Interactive leaderboard. Free, no login.
What you can do:
- Compare models across context bins with line and bar charts - with 95% confidence intervals (a couple more types of charts are coming)
- Filter by provider, reasoning tier, or use presets (Best, Reasoning, Non-Reasoning)
- Sort by AUC, pointwise scores, cost, or token efficiency
- Hover any model for metadata: provider, reasoning levels, release date, run count, cost breakdown
- Toggle heatmap coloring, rankings, and on-demand cost columns
- Export to CSV or screenshot the current view directly
The FAQ walks through what GDM-MRCRv2 is, how scoring works, what AUC measures, and why 8-needle is the tier that separates frontier models. Includes a step-by-step visual explainer of how a real test is built and scored. We'll be fleshing this out further over time, and improving the visuals.
This is still very much a work in progress (might feel a little more bare compared to the old website), but more charts and screens to come, for example:
- View each test result for a model (we even record the streamed chunks in case people want some data from that).
- Bias analysis from the old website.
Current top 5 by AUC @ 128k (best tier per model):
- GPT-5.5 (xhigh): 91.7%
- GPT-5.5 (high): 88.2%
- GPT-5.5 (medium): 87.5%
- GPT-5.5 (low): 83.3%
- Claude Opus 4.6 (medium): 81.0%
Current top 5 by AUC @ 1M (best tier per model):
- GPT-5.5 (medium): 50.9%
- GPT-5.5 (xhigh): 50.5%
- GPT-5.5 (high): 50.2%
- GPT-5.5 (low): 47.3%
- Claude Opus 4.6 (high): 46.9%
NOTE: Bins with no scores count as 0% for AUC calc.
More models being added regularly. Suggestions welcome.
@OpenAI @AnthropicAI @GoogleDeepMind @deepseek_ai @Kimi_Moonshot @Xiaomi @Zai_org
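The scoring note above ("bins with no scores count as 0% for AUC calc") reduces to a mean over context-length bins with missing bins contributing zero. A sketch under that assumption (the site's exact aggregation, e.g. any bin weighting, may differ, and the bin labels here are made up for illustration):

```python
def auc_over_bins(scores_by_bin: dict[str, float], bins: list[str]) -> float:
    """Mean score across context-length bins; a bin with no score counts as 0.0,
    so a model untested at long contexts is penalized rather than skipped."""
    return sum(scores_by_bin.get(b, 0.0) for b in bins) / len(bins)

bins = ["8k", "32k", "128k", "512k", "1m"]
scores = {"8k": 0.95, "32k": 0.90, "128k": 0.80}   # no runs yet at 512k and 1m
print(round(auc_over_bins(scores, bins), 3))        # 0.53
```

This explains why the same model can rank high at 128k but drop sharply in the 1M table: empty long-context bins drag the average down.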
See 1 related tweet
- @scaling01: RT @DillonUzar: New https://t.co/gLEWzxoXWG is live!
70 model-variants. 8-needle GDM-MRCRv2. Intera...
17. pmarca (Group Score: 56.6 | Individual: 32.1)
Cluster: 2 tweets | Engagement: 1788 (Avg: 1057) | Type: Tech
When something becomes abundant and cheap, someone else becomes scarce and valuable.
QT @tengyanAI: something i've noticed: AI agents create a weird new kind of burnout. esp for young people.
a lot of ambitious 22 year olds are going to think the answer is simple:
- spin up more agents
- ship more code
- sleep less
- outwork everyone
and for a while, it will feel incredible. you can keep multiple agents running, feed them tasks, review outputs, fix mistakes, make decisions, and keep the whole loop moving.
the problem is that the work no longer drains you through typing. it drains you through judgment. More attention. More context switching. More verification. More decisions per hour.
so instead of 8-10 normal productive hours, you might get 4-5 extremely intense hours before your brain is fully cooked. and you feel numb until you sleep properly and reset
some of my friends are already burnt out. they don't say it out loud but i can tell.
the agent can keep working 24/7. the human still has a hard limit
See 1 related tweet
- @ferologics: RT @tengyanAI: something i've noticed: AI agents create a weird new kind of burnout. esp for young p...
18. hammer_mt (Group Score: 55.8 | Individual: 30.3)
Cluster: 2 tweets | Engagement: 6 (Avg: 605) | Type: Tech
Saying a skill is just an .md file is like saying the Constitution is just a piece of parchment.
QT @osekkat: I’ve seen people on X dunking on folks like @garrytan @doodlestein and others for sharing SKILL dot md files they've built. They are dismissing these files as "just a markdown file.”
I think this misses the point entirely and I'll try to address that here. Quick thread:
A bad skill file is just text, sure.
A good skill file is compressed expertise, packaged in a format an agent can actually use.
The value is not just in the “markdown file.” The value is the interaction between:
a huge neural network with latent capabilities a precise, reusable, agent-readable procedure that steers those capabilities toward a specific outcome
That combination is the product.
Saying “it’s just markdown” is like saying Hamlet is “just ink on paper,” or Einstein’s relativity paper was “just a text.”
Technically true. Intellectually useless.
The medium is simple. The content is what matters. And more importantly, the effect of that content on the reader is what matters.
With humans, a book, a coach, a lecture, or painting can change how someone thinks and acts.
With LLMs, text is also the control surface. These models were trained on text, reason through text, call tools through text, and follow procedures through text.
So yes, the skill is “just text.”
But it is text designed to be read by an enormous neural net.
That matters.
A good skill is agent-ergonomic. It does not merely say “do this better.” It encodes workflow, constraints, examples, edge cases, tool usage, failure modes, and success criteria in a way the agent can reliably execute.
That is very different from a casual prompt.
A prompt is often a one-off request.
A skill can be reused, versioned, tested, improved, shared, and loaded at the exact moment an agent needs it.
That turns “vibes-based prompting” into something closer to operational knowledge.
Another way to think about it:
We have built these massive models, but much of their power is latent. Different people can extract very different levels of performance from the same model.
A good skill is a way to actualize a specific slice of that latent capability.
A refactoring skill. A research skill. A legal review skill. A math explanation skill. A codebase-navigation skill. Each one can make the same model behave very differently. I think of Cus D’Amato and Mike Tyson. Tyson had enormous latent potential. But Cus gave him a system, a style, a discipline, a way to channel that potential.
That’s what good skills are for agents.
They are not magic. They are not all equally valuable. Many will be mediocre or useless.
But dismissing them right off the bat because they are “just markdown” shows a misunderstanding of what LLMs are.
Text is how we trained these systems. (for the most part)
Text is how we steer them.
Text is how we unlock parts of what they can do.
The question is not whether a skill file is “just text.”
The question is whether the text reliably makes the model perform better at a valuable task.
If yes, then it is not “just markdown.”
It is leverage.
See 1 related tweet
- @garrytan: RT @osekkat: I’ve seen people on X dunking on folks like @garrytan @doodlestein and others for shar...
19. chddaniel (Group Score: 54.5 | Individual: 27.8)
Cluster: 2 tweets | Engagement: 8 (Avg: 13) | Type: Tech
this is f*king scary guys..........
QT @chhddavid: Introducing Shipper: The world’s first AI Business Builder.
Shipper outperforms humans 100% of the time.
RT + Comment “SHIPPER” and I’ll randomly send out free credits. https://t.co/7lfWzMnIuo
See 1 related tweet
- @chhddavid: mf delete this QT @chddaniel: Introducing Shipper.
The first AI business builder that outperform...
20. aakashgupta (Group Score: 52.6 | Individual: 35.3)
Cluster: 2 tweets | Engagement: 91 (Avg: 101) | Type: Tech
Jane Street made more operating profit last year than Walmart. Walmart has 2.1 million employees. Jane Street has 3,500.
That puts a market-making firm most people have never heard of at #13 on the list of America's most profitable companies. Ahead of Walmart. Ahead of Verizon. Ahead of Broadcom. Ahead of Visa.
The trading numbers are more absurd. Jane Street pulled in $35.8 billion in trading revenue that same year. Goldman Sachs did $31.1 billion. A private firm with no IPO and no outside capital out-traded every major investment bank on Wall Street.
Per-employee operating profit:
- Walmart: $564,000
- Visa: $810,000
- Jane Street: $8,900,000
That's 11x Apple, 16x Microsoft, and 635x Walmart. Per head.
The mechanism is structural. Jane Street is built as an ETF market maker wrapped around an in-house technology stack written in OCaml. They quote prices on tens of thousands of securities at once, capture the spread, and hedge the residual exposure. The product they sell is liquidity. The moat is the latency and the breadth of their book.
Volatility is the fuel. When Trump rolled out tariffs in Q2 2025, ETF flows went vertical and bid-ask spreads widened across every asset class. Jane Street made $10.1 billion in that single quarter. The model assumes the counterparty has to transact under stress. Jane Street is the one calm enough to quote back.
This is also why you don't see Jane Street on a stock exchange. At $9 million per head in operating profit, the partners will never sell. There is no public valuation that lets them keep what they keep now. Going public would be a tax on themselves.
The 12 companies ranked above Jane Street on this list employ roughly 3.7 million people combined. Jane Street has 3,500.
See 1 related tweet
- @aakashgupta: RT @cgtwts: > be Jane Street
3,500 employees barely known outside finance still makes more ope...