Published on February 23, 2026
Tech Twitter Highlights - February 23, 2026
Author: geeknotes
Tech Daily Briefing - February 23, 2026
Today's top tech conversations are led by @gdgtify, whose post about 'I love Midjourney and it is sp...' earned the top group score. Key themes trending across the top stories include models, reasoning, training, and language. The community is actively discussing recent developments in AI, engineering practices, and startup strategies.
1. gdgtify (Group Score: 60.2 | Individual: 32.4)
Cluster: 2 tweets | Engagement: 30 (Avg: 93) | Type: Tech
I love Midjourney and it is special but I can generate almost every cute style from there with Nano Banana.
Prompt: 2x2 grid, do this for 4 famous historical events for humans Anchor: [Input]::3 Morphology: Thick rounded shapes, thumbprint indentations, imperfect smoothing, slight asymmetry::3 Material Physics: Modelling clay (Plasticine), matte texture, slight oiliness, dust fibers caught in clay::3 Illumination: Stop-motion stage lighting, warm gels, hard shadows indicating small scale::1.5 Render Stack: Dragonframe capture, macro lens, shallow depth of field (miniature effect)::1 Negative: [CGI smoothness, reflective metal, digital, vector, sharp edges]:: -1
See 1 related tweet
- @gdgtify: I used to make these styles with Midjourney but Nano Banana now makes it a lot easier.
Prompt: 2x...
2. steipete (Group Score: 46.6 | Individual: 46.6)
Cluster: 1 tweet | Engagement: 3500 (Avg: 762) | Type: Tech
Been wrangling for a long time with how to deal with the onslaught of PRs; none of the solutions out there seem made for our scale.
I spun up 50 codex in parallel, let them analyze the PR and generate a JSON report with various signals, comparing with vision, intent (much higher signal than any of the text), risk and various other signals.
Then I can ingest all reports into one session and run AI queries/de-dupe/auto-close/merge as needed on it.
Same for Issues. Prompt Requests (PRs) really are just issues with additional metadata.
Don't even need a vector db. Was thinking way too complex for a while.
There's like 8 PRs for auto-update in the last 2 days alone (still need to ingest 3k PRs, only have 1k so far).
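For a sense of what that ingest-and-triage step could look like, here is a minimal sketch, not steipete's actual pipeline: it assumes a hypothetical per-PR JSON report with `pr`, `intent`, and `risk` fields, groups reports by intent, and flags likely duplicates for auto-close.

```python
import json
from collections import defaultdict
from pathlib import Path

def load_reports(report_dir: str) -> list[dict]:
    """Load one JSON report per PR (the report schema here is hypothetical)."""
    return [json.loads(p.read_text()) for p in Path(report_dir).glob("*.json")]

def group_by_intent(reports: list[dict]) -> dict[str, list[dict]]:
    """Bucket PRs by a normalized 'intent' string so duplicates land together."""
    groups = defaultdict(list)
    for r in reports:
        groups[r.get("intent", "").strip().lower()].append(r)
    return groups

def triage(groups: dict[str, list[dict]]) -> list[dict]:
    """Keep the lowest-risk PR per intent; mark the rest as close candidates."""
    actions = []
    for intent, prs in groups.items():
        prs_sorted = sorted(prs, key=lambda r: r.get("risk", 1.0))
        keep, dupes = prs_sorted[0], prs_sorted[1:]
        actions.append({"intent": intent, "keep": keep["pr"],
                        "close": [d["pr"] for d in dupes]})
    return actions

if __name__ == "__main__":
    print(json.dumps(triage(group_by_intent(load_reports("reports"))), indent=2))
```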
3. TheAhmadOsman (Group Score: 45.4 | Individual: 45.4)
Cluster: 1 tweet | Engagement: 695 (Avg: 168) | Type: Tech
BREAKING
Elon Musk endorsed my Top 26 Essential Papers for Mastering LLMs and Transformers
Implement those and you’ve captured ~90% of the alpha behind modern LLMs.
Everything else is garnish.
This list bridges the Transformer foundations with the reasoning, MoE, and agentic shift
Recommended Reading Order
- Attention Is All You Need (Vaswani et al., 2017)
The original Transformer paper. Covers self-attention, multi-head attention, and the encoder-decoder structure (even though most modern LLMs are decoder-only).
- The Illustrated Transformer (Jay Alammar, 2018)
Great intuition builder for understanding attention and tensor flow before diving into implementations.
- BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2018)
Encoder-side fundamentals, masked language modeling, and representation learning that still shape modern architectures.
- Language Models are Few-Shot Learners (GPT-3) (Brown et al., 2020)
Established in-context learning as a real capability and shifted how prompting is understood.
- Scaling Laws for Neural Language Models (Kaplan et al., 2020)
First clean empirical scaling framework for parameters, data, and compute. Read alongside Chinchilla to understand why most models were undertrained.
- Training Compute-Optimal Large Language Models (Chinchilla) (Hoffmann et al., 2022)
Demonstrated that token count matters more than parameter count for a fixed compute budget.
- LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023)
The paper that triggered the open-weight era. Introduced architectural defaults like RMSNorm, SwiGLU, and RoPE as standard practice.
- RoFormer: Rotary Position Embedding (Su et al., 2021)
Positional encoding that became the modern default for long-context LLMs.
- FlashAttention (Dao et al., 2022)
Memory-efficient attention that enabled long context windows and high-throughput inference by optimizing GPU memory access.
- Retrieval-Augmented Generation (RAG) (Lewis et al., 2020)
Combines parametric models with external knowledge sources. Foundational for grounded and enterprise systems.
- Training Language Models to Follow Instructions with Human Feedback (InstructGPT) (Ouyang et al., 2022)
The modern post-training and alignment blueprint that instruction-tuned models follow.
- Direct Preference Optimization (DPO) (Rafailov et al., 2023)
A simpler and more stable alternative to PPO-based RLHF. Preference alignment via the loss function.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
Demonstrated that reasoning can be elicited through prompting alone and laid the groundwork for later reasoning-focused training.
- ReAct: Reasoning and Acting (Yao et al., 2022 / ICLR 2023)
The foundation of agentic systems. Combines reasoning traces with tool use and environment interaction.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Guo et al., 2025)
The R1 paper. Proved that large-scale reinforcement learning without supervised data can induce self-verification and structured reasoning behavior.
- Qwen3 Technical Report (Yang et al., 2025)
A lightweight overview of a modern architecture. Introduced unified MoE with Thinking Mode and Non-Thinking Mode to dynamically trade off cost and reasoning depth.
- Outrageously Large Neural Networks: Sparsely-Gated Mixture of Experts (Shazeer et al., 2017)
The modern MoE ignition point. Conditional computation at scale.
- Switch Transformers (Fedus et al., 2021)
Simplified MoE routing using single-expert activation. Key to stabilizing trillion-parameter training.
- Mixtral of Experts (Mistral AI, 2024)
Open-weight MoE that proved sparse models can match dense quality while running at small-model inference cost.
- Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints (Komatsuzaki et al., 2022 / ICLR 2023)
Practical technique for converting dense checkpoints into MoE models. Critical for compute reuse and iterative scaling.
- The Platonic Representation Hypothesis (Huh et al., 2024)
Evidence that scaled models converge toward shared internal representations across modalities.
- Textbooks Are All You Need (Gunasekar et al., 2023)
Demonstrated that high-quality synthetic data allows small models to outperform much larger ones.
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024)
The biggest leap in mechanistic interpretability. Decomposes neural networks into millions of interpretable features.
- PaLM: Scaling Language Modeling with Pathways (Chowdhery et al., 2022)
A masterclass in large-scale training orchestration across thousands of accelerators.
- GLaM: Generalist Language Model (Du et al., 2022)
Validated MoE scaling economics with massive total parameters but small active parameter counts.
- The Smol Training Playbook (Hugging Face, 2025)
Practical end-to-end handbook for efficiently training language models.
Bonus Material
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019)
- Toolformer (Schick et al., 2023)
- GShard (Lepikhin et al., 2020)
- Adaptive Mixtures of Local Experts (Jacobs et al., 1991)
- Hierarchical Mixtures of Experts (Jordan and Jacobs, 1994)
If you deeply understand these fundamentals (Transformer core, scaling laws, FlashAttention, instruction tuning, R1-style reasoning, and MoE upcycling), you already understand LLMs better than most.
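To make the "Transformer core" concrete, here is a minimal NumPy sketch of the scaled dot-product attention described in Attention Is All You Need; it is an illustrative toy, not a production implementation, and the shapes and causal mask are the only assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=True):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with an optional causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq, seq): each query scored against each key
    if causal:                                       # decoder-only models: no peeking at future tokens
        mask = np.triu(np.ones_like(scores), k=1).astype(bool)
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted mix of value vectors

# toy example: 4 tokens, one 8-dimensional head
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```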
Time to lock in, good luck!
4. MLStreetTalk (Group Score: 44.0 | Individual: 44.0)
Cluster: 1 tweet | Engagement: 2996 (Avg: 383) | Type: Tech
RT @victorianoi: In 20 years, vibe coders will look at the Linux kernel repo the way we look at the pyramids. In awe, unable to imagine how…
5. garrytan (Group Score: 43.4 | Individual: 25.9)
Cluster: 2 tweets | Engagement: 1245 (Avg: 336) | Type: Tech
Bernie Sanders Can’t Explain American Innovation
Asked point-blank why the US dominates tech while Europe stagnates, the Senator pivoted to healthcare and homelessness. The honest answer would destroy his worldview.
See 1 related tweet
- @ashugarg: RT @garrytan: Bernie Sanders Can’t Explain American Innovation
Asked point-blank why the US dominat...
6. rohanpaul_ai (Group Score: 43.0 | Individual: 43.0)
Cluster: 1 tweet | Engagement: 1106 (Avg: 106) | Type: Tech
RT @rohanpaul_ai: Demis Hassabis’s “Einstein test” for defining AGI:
Train a model on all human knowledge but cut it off at 1911, then se…
7. garrytan (Group Score: 37.1 | Individual: 37.1)
Cluster: 1 tweet | Engagement: 1341 (Avg: 336) | Type: Tech
Software engineering accounts for nearly 50% of all AI agent tool calls. Healthcare, legal, finance, and a dozen other verticals are barely touched, each under 5%. That's a hundred AI unicorns waiting to be built.
https://t.co/cdJnGqsjHM https://t.co/IvvdPviCCu
8. business (Group Score: 37.0 | Individual: 19.1)
Cluster: 2 tweets | Engagement: 140 (Avg: 105) | Type: Tech
Apple CEO Tim Cook is signaling that Visual Intelligence will be the defining feature of the company’s push into wearable AI devices, writes Mark Gurman.
Read this week's Power On newsletter: https://t.co/lDDRC4E54k
📷️: David Paul Morris/Bloomberg https://t.co/9LSy9mx7hm
See 1 related tweet
- @business: Apple’s next big thing is visual artificial intelligence, something CEO Tim Cook has already dropped...
9. HiTw93 (Group Score: 36.0 | Individual: 36.0)
Cluster: 1 tweet | Engagement: 529 (Avg: 183) | Type: Tech
Mole 1.27 is live. The Mac cleaning tool that can free up tens of GBs in one go. 36K stars. https://t.co/rVM1P2nZ1O
Here’s what’s new:
- mo clean: adds safe cleanup for Group Containers, Maven local repo, Chrome and Google Updater caches, Expo ecosystem files, and improves npm residual detection with custom cache path support.
- mo purge: expands coverage for React Native and Expo targets including DerivedData, Pods, NDK, and .expo, with safer size handling and better trap behavior.
- mo status: prioritizes internal disks, improves layout during terminal resize, and fixes duplicate rendering in error states.
- Compatibility and stability: fixes macOS find argument handling and strengthens safe deletion paths with more consistent protection checks.
This release expands deep cleanup coverage across modern dev environments while keeping safety first. If Mole helps, I’d love your ideas on where to dig deeper for safe cleanup and more hidden junk.
10. leeoxiang (Group Score: 35.2 | Individual: 19.0)
Cluster: 2 tweets | Engagement: 115 (Avg: 59) | Type: Tech
Now that Claude Code officially supports worktrees, it's great; my development is now entirely GitHub-issue driven.
1. Create an issue on GitHub; 2. Claude Code starts a worktree, reads the issue, enters plan mode to design a solution, and the design is automatically posted back to the issue; 3. Submit a PR and update the issue with a description of the final solution.
Then move on to the next issue.
See 1 related tweet
- @aigclink: Claude Code now has built-in native Git worktree support. Agents can run in parallel without interfering with each other, effectively "spawning multiple clones" to work at the same time
Each agent has its own worktree and can work independently
claude --worktr...
11. TheAhmadOsman (Group Score: 34.8 | Individual: 34.8)
Cluster: 1 tweet | Engagement: 193 (Avg: 168) | Type: Tech
local llms 101
running a model = inference (using model weights)
inference = predicting the next token based on your input plus all tokens generated so far
together, these make up the "sequence"
tokens ≠ words: they're the chunks of text a model sees, represented by integers (token IDs) in the model
"tokenizer" = the algorithm that splits text into tokens; common types: BPE (byte pair encoding), SentencePiece
token examples: "hello" = 1 token (or maybe 2 or 3); "internationalization" = 5–8 tokens
context window = max tokens the model can "see" at once (2K, 8K, 32K+)
longer context = more VRAM for KV cache, slower decode
during inference, the model predicts the next token by running lots of math on its "weights"
model weights = billions of learned parameters (the knowledge and patterns from training)
model parameters: usually billions of numbers (called weights) that the model learns during training
these weights encode all the model's "knowledge" (patterns, language, facts, reasoning)
think of them as the knobs and dials inside the model, specifically computed to recognize what could come next
when you run inference, the model uses these parameters to compute its predictions, one token at a time
every prediction is just: model weights + current sequence → probabilities for what comes next
pick a token, append it, repeat; each new token becomes part of the sequence for the next prediction
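A minimal sketch of that loop, assuming the Hugging Face transformers library and gpt2 as a small stand-in checkpoint; it deliberately re-runs the whole sequence each step (as the thread describes), while real runtimes reuse a KV cache.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is just a small stand-in; any causal LM works the same way
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

seq = tok("The capital of France is", return_tensors="pt").input_ids  # prompt -> token IDs

with torch.no_grad():
    for _ in range(10):                                # generate 10 tokens, one at a time
        logits = model(seq).logits[:, -1, :]           # scores for the next token only
        next_id = logits.argmax(dim=-1, keepdim=True)  # greedy: pick the most likely token
        seq = torch.cat([seq, next_id], dim=-1)        # append it and repeat

print(tok.decode(seq[0]))
```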
models are more than weight files:
- neural network architecture: transformer skeleton (layers, heads, RoPE, MQA/GQA, more below)
- weights: billions of learned numbers (parameters, not "tokens", but calculated from tokens)
- tokenizer: how text gets chunked into tokens (BPE/SentencePiece)
- config: metadata, shapes, special tokens, license, intended use, etc
- sometimes: a chat template is required for chat/instruct models, or else you get gibberish
you give a model a prompt (your text, converted into tokens)
models differ in parameter size: 7B means ~7 billion learned numbers; common sizes: 7B, 13B, 70B
bigger = stronger, but eats more VRAM/memory & compute
the model computes a probability for every possible next token (softmax over vocab)
picks one: either the highest (greedy) or samples from the probability distribution (temperature, top-p, etc)
then appends that token to the sequence, then repeats the whole process
this is generation: predict, sample, append, over and over, one token at a time, rinse and repeat
each new token depends on everything before it; the model re-reads the sequence every step
generation is always stepwise: token by token, not all at once
mathematically: the model is a learned function, f_θ(seq) → p(next_token)
all the "magic" is just repeating "what's likely next?" until you stop
all conversation "tokens" live in the KV cache, or the "session memory"
so what's actually inside the model? everything above (tokens, weights, config) is just setup for the real engine underneath
the core of almost every modern llm is a transformer architecture
this is the skeleton that moves all those numbers around; it's what turns token sequences and weights into predictions
designed for sequence data (like language), transformers can "look back" at previous tokens and decide which ones matter for the next prediction
transformers work in layers, passing your sequence through the same recipe over and over
each layer refines the representation, using attention to focus on the important parts of your input and context
every time you generate a new token, it goes through this stack of layers, every single step
inside each transformer layer:
- self-attention: figures out which previous tokens are important to the current prediction
- MLPs (multi-layer perceptrons): further process token representations, adding non-linearity and expressiveness
- layer norms and residuals: stabilize learning and prediction, making deep networks possible
- positional encodings (like RoPE): tell the model where each token sits in the sequence, so "cat" and "catastrophe" aren't confused by position
by stacking these layers (sometimes dozens or even hundreds) transformers build a complex understanding of your prompt, context, and conversation history
transformer recap:
- decoder-only: the model only predicts what comes next; each token looks back at all previous tokens
- self-attention picks what to focus on (MQA/GQA = efficient versions for less memory)
- feed-forward MLP after attention for every token (usually 2 layers, GELU activation)
- everything's wrapped in layer norms + linear layers (QKV projections, MLPs, outputs)
- residuals + norms = stable, trainable, no exploding/vanishing gradients
- RoPE (rotary embeddings): tells the model where each token sits in the sequence
- stack N layers of this → final logits → pick the next token
- scale up: more layers, more heads, wider MLPs = bigger brains
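As an illustrative sketch of that recipe, here is a toy pre-norm decoder layer in PyTorch; RoPE and GQA/MQA are omitted, and the 256-dim / 4-head sizes are arbitrary, so treat it as a teaching aid rather than any real model's layer.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One pre-norm decoder layer: causal self-attention + MLP, each with residual + LayerNorm."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                       # feed-forward MLP: 2 layers with GELU
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        seq_len = x.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)  # each token attends only to earlier ones
        x = x + attn_out                                    # residual around attention
        x = x + self.mlp(self.norm2(x))                     # residual around the MLP
        return x

# toy run: batch of 1, 16 tokens, 256-dim embeddings
x = torch.randn(1, 16, 256)
print(DecoderBlock()(x).shape)  # torch.Size([1, 16, 256])
```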
VRAM: memory, the bottleneck. VRAM must fit:
- weights (main model, whether quantized or not)
- KV cache (per token, per layer, per head)
weights: FP16: ~2 bytes/param → 7B = ~14GB; 8-bit: ~1 byte/param → 7B = ~7GB; 4-bit: ~0.5 byte/param → 7B = ~3.5GB; add 10–30% for runtime overheads
KV cache rule of thumb: 0.5MB per token (Llama-like 7B, 32 layers, 4K tokens = ~2GB); some runtimes support KV cache quantization (8/4-bit) = big savings
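A back-of-envelope calculator using exactly those rules of thumb; the 0.5 MB/token KV figure and the 10–30% overhead are the thread's own approximations, not exact numbers for any specific model.

```python
def estimate_vram_gb(params_b: float, bytes_per_param: float,
                     n_tokens: int, kv_mb_per_token: float = 0.5,
                     overhead: float = 0.2) -> float:
    """Rough VRAM estimate: weights + KV cache + runtime overhead.

    params_b: parameters in billions; bytes_per_param: 2 (FP16), 1 (8-bit), 0.5 (4-bit)
    kv_mb_per_token: the ~0.5 MB/token rule of thumb for a Llama-like 7B
    """
    weights_gb = params_b * bytes_per_param      # e.g. 7B * 2 bytes/param ≈ 14 GB
    kv_gb = n_tokens * kv_mb_per_token / 1024    # KV cache grows linearly with context length
    return (weights_gb + kv_gb) * (1 + overhead) # add 10–30% runtime overhead

print(round(estimate_vram_gb(7, 2.0, 4096), 1))  # FP16 7B with a 4K context
print(round(estimate_vram_gb(7, 0.5, 4096), 1))  # same model, 4-bit quantized
```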
throughput = memory bandwidth + GPU FLOPs + attention implementation (FlashAttention/SDPA help) + quantization + batch size
offload to CPU? expect MASSIVE slowdown
GPU or bust: CPUs run quantized models (slow), but any real context/model needs CUDA/ROCm/Metal
CPU spill = sadness (check device_map and memory fit)
quantization: reduce precision for memory wins (sometimes a tiny quality hit)
FP32/FP16/BF16 = full precision; INT8/INT4/NF4 = quantized
4-bit (NF4/GPTQ/AWQ) = sweet spot for most consumer GPUs (big memory win, small quality hit for most tasks)
math-heavy or finicky tasks degrade first (math, logic, coding)
KV cache quantization: even more memory saved for long contexts (check runtime support)
formats/runtimes:
- PyTorch + safetensors: flexible, standard, GPU/TPU/CPU
- GGUF (llama.cpp): CPU/GPU, portable, best for quant + edge devices
- ONNX, TensorRT-LLM, MLC: advanced flavors for special hardware/use
protip: avoid legacy .bin (pickle risk), use safetensors for safety
everything is a tradeoff:
smaller = fits anywhere, less power
more context = more latency + VRAM burn
quantization = speed/memory, but maybe less accurate
local = more control/knobs, more work
what happens when you "load a model"?
- download weights, tokenizer, config
- resolve license/trust (don't use trust_remote_code unless you really trust the author)
- load to VRAM/CPU (check memory fit)
- warmup: kernels/caches initialized, first pass is slowest
- inference: forward passes per token, updating the KV cache each step
decoding = how the next token is chosen:
- greedy: always top-1 (robotic)
- temperature: softens or sharpens probabilities (higher = more random)
- top-k: pick from the top k
- top-p: pick from the smallest set with ≥p probability
- typical sampling, repetition penalty, no-repeat n-gram: extra controls
- deterministic = set a seed and no sampling
tune for your use-case: chat, summarization, code
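A small sketch of temperature plus top-p (nucleus) sampling over a logits vector, using NumPy; the five-token vocab is a toy, but real runtimes apply the same idea at every step.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.8,
                      top_p: float = 0.9, rng=np.random.default_rng()) -> int:
    """Temperature + top-p (nucleus) sampling over a vocab-sized logits vector."""
    logits = logits / max(temperature, 1e-5)      # temperature < 1 sharpens, > 1 flattens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax over the vocab
    order = np.argsort(probs)[::-1]               # tokens from most to least likely
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1      # smallest set with >= top_p total mass
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()        # renormalize and sample from the nucleus
    return int(rng.choice(keep, p=kept))

# toy vocab of 5 tokens
print(sample_next_token(np.array([2.0, 1.5, 0.3, -1.0, -2.0])))
```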
serving options?
- vLLM for high throughput, parallel serving
- llama.cpp server (OpenAI-compatible API)
- ExLlama V2/V3 w/ Tabby API (OpenAI-compatible API)
- run as a local script (CLI)
- FastAPI/Flask for a local API endpoint
local ≠ offline; run it, serve it, or build apps on top
fine-tuning, ultra-brief:
- LoRA / QLoRA = adapter layers (efficient, minimal VRAM)
- still need a dataset and eval plan; adapters can be merged or kept separate
- most users get far with prompting + retrieval (RAG) or few-shot for niche tasks
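For a feel of what a LoRA setup looks like, a minimal sketch with the peft library; the gpt2 base checkpoint and the target module name are illustrative only, since module names vary per architecture.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# base checkpoint is illustrative; swap in whatever model you actually run
base = AutoModelForCausalLM.from_pretrained("gpt2")

lora_cfg = LoraConfig(
    r=16,                       # adapter rank: bigger = more capacity, more VRAM
    lora_alpha=32,              # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["c_attn"],  # which weight matrices get adapters (names differ per model)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the base weights
```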
common pitfalls:
- OOM? out of memory: model or context too big; quantize or shrink the context
- gibberish? used a base model with a chat prompt, or the wrong template; check temperature/top_p
- slow? offloading to CPU, wrong drivers, no FlashAttention; check CUDA/ROCm/Metal, memory fit
- unsafe? don't use random .bin files or trust_remote_code; prefer safetensors, verify the source
why run locally?
- control: all the knobs are yours to tweak: sampler, chat templates, decoding, system prompts, quantization, context
- cost: no per-token API billing, just upfront hardware
- privacy: prompts and outputs stay on your machine
- latency: no network roundtrips, instant token streaming
challenges:
- hardware limits (VRAM/memory = max model/context)
- ecosystem variance (different runtimes, quant schemes, templates)
- ops burden (setup, drivers, updates)
running local checklist:
- pick a model (prefer chat-tuned, sized for your VRAM)
- pick precision (4-bit saves RAM, FP16 for max quality)
- install a runtime (vLLM, llama.cpp, Transformers+PyTorch, etc)
- run it, get tokens/sec, check memory fit
- use the correct chat template (apply_chat_template)
- tune decoding (temp/top_p)
- benchmark on your task
- serve as a local API (or go wild and fine-tune it)
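A minimal example of the apply_chat_template step; the checkpoint name is illustrative, and any chat-tuned model that ships a chat template works the same way.

```python
from transformers import AutoTokenizer

# model name is illustrative; use whatever chat-tuned checkpoint you actually run
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain the KV cache in one sentence."},
]

# apply_chat_template inserts the model's own role markup and special tokens
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # feed this (or its token IDs) to the model instead of raw concatenated text
```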
glossary:
- token: smallest unit (subword/char)
- context window: max tokens visible to the model
- KV cache: session memory, per-layer attention state
- quantization: lower precision for memory/speed
- RoPE: rotary position embeddings (for order)
- GQA/MQA: efficient attention for memory bandwidth
- decoding: method for picking the next token
- RAG: retrieval-augmented generation, add real info
misc:
- common architectures: LLaMA, Falcon, Mistral, GPT-NeoX, etc
- base model: not fine-tuned for chat (LLaMA, Falcon, etc)
- chat-tuned: fine-tuned for dialogue (Alpaca, Vicuna, etc)
- instruct-tuned: fine-tuned for following instructions (LLaMA-2-Chat, Mistral-Instruct, etc)
chat/instruct models usually need a special prompt template to work well
chat template: system/user/assistant markup is required; wrong template = junk output
base models can do few-shot chat prompting, but not as well as chat-tuned ones
quantized: weights stored in lower precision (8-bit, 4-bit) for memory savings, at some quality loss
quantization is a tradeoff: memory/speed vs quality
4-bit (NF4/GPTQ/AWQ) is the sweet spot for most consumer GPUs (huge memory win, minor quality drop for most tasks)
math-heavy or finicky tasks degrade first (math, logic, code)
quantization types: FP16 (full), INT8 (quantized), INT4/NF4 (more quantized), etc
some runtimes support a quantized KV cache (8/4-bit), big savings for long contexts
formats/runtimes:
- PyTorch + safetensors: flexible, standard, works on GPU/TPU/CPU
- GGUF (llama.cpp): CPU/GPU, portable, best for quant + edge devices
- ONNX, TensorRT-LLM, MLC: advanced options for special hardware
avoid legacy .bin (pickle risk), use safetensors for safety
everything is a tradeoff:
smaller = fits anywhere, less power
more context = more latency + VRAM burn
quantization = faster/leaner, maybe less accurate
local = full control/knobs, but more work
final words: local LLMs = memory math + correct formatting
fit weights and KV cache in memory
use the right chat template and decoding strategy
know your knobs: quantization, context, decoding, batch, hardware
master these, and you can run (and reason about) almost any modern model locally
12. danshipper (Group Score: 34.6 | Individual: 34.6)
Cluster: 1 tweet | Engagement: 312 (Avg: 57) | Type: Tech
PSA if you're iterating on front-end designs, you should try Claude Code desktop. it's great https://t.co/yywVbDT5r0
13. thdxr (Group Score: 34.4 | Individual: 34.4)
Cluster: 1 tweet | Engagement: 704 (Avg: 582) | Type: Tech
a lot of people ask why we don't manage our own GPUs
people imagine that when your company gets bigger you automatically bring more things in house
but there has been a lot of capital thrown at companies building inference with the expectation that the world will need a lot of it (and it's not easy at all)
these companies cannot serve openai or anthropic models so they're looking for open source/private model workloads
and the risk of under-building is way worse than over-building so at some point it's likely there will be too much supply
we have a real shot at being these companies' biggest customer given how much volume we're already doing
and this is an amazing position to be in
14. OpenBMB (Group Score: 33.6 | Individual: 33.6)
Cluster: 1 tweet | Engagement: 112 (Avg: 49) | Type: Tech
Sparse attention cuts computation, but GPU memory limits still bottleneck batch size and throughput due to the massive KV cache. Current offloading also faces training-inference mismatch. 🤯 Today, we present NOSA—new research from THUNLP (OpenBMB member) and collaborators: A native, offloadable sparse attention framework that introduces locality constraints during training to enable efficient KV cache offloading. 🤗 Paper: https://t.co/Fu4pa7z7dS 📄 arXiv: https://t.co/DfVvasT75R 💻 Code: https://t.co/2FXec1kfcL 🤖 Models: https://t.co/c0jsCkdn4O
Why it matters: 1️⃣ Native KV Offloading: While standard sparse attention has some inherent locality, it's often insufficient for efficient CPU-GPU transfer. NOSA introduces explicit locality constraints (lower bounds on cache hits) during training. This minimizes PCIe communication bottlenecks while preserving the original attention computation. 💾
2️⃣ Hybrid Selection Mechanism: NOSA decomposes token selection into Query-Aware (for retrieval accuracy) and Query-Agnostic (for stable eviction) components. This "best of both worlds" design ensures high locality for offloading without sacrificing the model's ability to capture long-range dependencies.⚡
3️⃣ High Throughput & Lossless: Paired with our custom NOSI inference system, NOSA achieves up to 5.04x and 1.92x higher decoding throughput compared to Full Attention and InfLLM-v2 respectively. It maintains near-lossless performance on LongBench and RULER, surpassing ShadowKV and ArkVale. 🚀
NOSA eliminates the training-inference mismatch, offering a scalable path for serving long-context models and deep-thinking tasks that generate massive outputs. #AI #THUNLP #OpenBMB #LLM #LongContext #SparseAttention #Efficiency
15. alex_prompter (Group Score: 33.5 | Individual: 33.5)
Cluster: 1 tweet | Engagement: 564 (Avg: 127) | Type: Tech
RT @alex_prompter: This site is literally a prompt library with thousands of prompts for Claude, Gemini & Nano Banana. https://t.co/oXyUxKQ…
16. rohanpaul_ai (Group Score: 33.2 | Individual: 27.0)
Cluster: 2 tweets | Engagement: 122 (Avg: 106) | Type: Tech
Ben Affleck doesn’t quite like the progress of AI.
Says AI "is not progressing in exactly the same way they sort of presented... this is going to be just a tool, just like VFX or visual effects.... it is not gonna be able to write anything meaningful.." https://t.co/bzmj78yhjo
See 1 related tweet
- @rohanpaul_ai: RT @rohanpaul_ai: Ben Affleck doesn’t quite like the progress of AI.
Says AI "is not progressing in...
17. gdgtify (Group Score: 32.9 | Individual: 32.9)
Cluster: 1 tweet | Engagement: 25 (Avg: 93) | Type: Tech
I am working on prompts for AI titans. Kind of a fun experiment.
Prompt: Input Variable: [INSERT TECH CEO] (e.g., Elon Musk, Steve Jobs, Bill Gates, Jensen Huang)
System Instruction:
Generate a hyper-realistic product shot of a "Limited Edition Tech Founder" Vinyl Toy inside a premium acrylic display case.
Persona Analysis:
Analyze the Input: Identify the CEO's iconic outfit, facial shape, their "Vibe" (e.g., Musk = Chaos/Space; Jobs = Zen/Minimalist), and their primary product.
The Pose:
If Visionary: Meditating or Pointing to the sky. If Engineer: Holding a tool or chip. If Corporate: Arms crossed, power stance.
Container (The Collector's Case):
The Box: A pristine, museum-grade Clear Acrylic Cube with a black or white base. The Packaging: Behind the case, the cardboard box features minimalist vector graphics of their company logo (e.g., Circuit lines, Apples, Rockets).
The Figure (The Vinyl):
Style: "Art Toy" Aesthetic. Smooth, matte plastic skin. Simplified facial features (cartoonish but recognizable). The Throne: The figure sits or stands on a Miniature Server Rack, Rocket Engine, or Stack of Cash. This acts as the pedestal. Accessories: VR Headsets, Flamethrowers, Floppy Disks, or Leather Jackets depending on the lore.
Typography:
The Plaque: A small metal tag on the base reads: "THE [LAST NAME] - [Edition Name]" (e.g., "THE MUSK - MARS EDITION"). The Serial Number: "1 of 1000" printed on the corner.
Output: ONE image, 1:1 Aspect Ratio, Studio Product Photography, White Background, Soft Shadows.
18. aakashgupta (Group Score: 32.6 | Individual: 32.6)
Cluster: 1 tweet | Engagement: 234 (Avg: 472) | Type: Tech
I’d argue almost the opposite.
The most valuable PMs in 2026 are moving down the abstraction ladder. Building prototypes. Shipping working code. Testing with real users before writing a single PRD.
Google, Stripe, and Netflix added vibe coding rounds to PM interviews. They’re testing whether you can turn a product idea into a working prototype in 15 minutes.
Microsoft’s Work Trend Index found that 71% of leaders would rather hire a less experienced candidate with strong AI building skills than a senior PM without them. The premium is on execution speed.
“Define goals, constraints, and long-term strategy” describes every mediocre VP of Product who’s ever existed. That was always the easy part. The hard part was building, which is why PMs who couldn’t build were dependent on engineering capacity.
Now that AI collapses the build cycle, the winning move is to close the gap between “what to build” and “building it.” The PMs gaining the most leverage right now prototype on Monday, test on Tuesday, and ship on Wednesday. They skip the 30-page strategy doc entirely.
Reforge calls it “the rise of the builder PM.” 54% of engineering leaders expect to reduce junior engineer hiring because PMs and designers can now build directly. The walls between PM, design, and engineering are collapsing into one person.
“Goal Architect” sounds like a promotion. In practice, it’s a layoff memo. The PMs who survive the next two years will be the ones who can show a working prototype, not a strategy deck.
19. GenAI_is_real (Group Score: 32.6 | Individual: 32.6)
Cluster: 1 tweet | Engagement: 56 (Avg: 62) | Type: Tech
sam is playing with words here. a human brain runs on ~20 watts of power to achieve general intelligence. compare that to the megawatts we're dumping into h100 clusters just to get gpt 5.3 to write bloated code.
the real "bitter lesson" isn't just about compute scale, it's about efficiency. this is why we’re so obsessed with sglang omni and kernel-level optimizations lately. if we can't get the inference tax down, ai will never match the biological elegance of human reasoning.
scaling is easy when you have unlimited power; building lean systems is the real engineering.
20. adcock_brett (Group Score: 32.3 | Individual: 32.3)
Cluster: 1 tweet | Engagement: 2017 (Avg: 1110) | Type: Tech
Running 24/7 without any human babysitters has been really hard
We want robots operating at all times - even at 2am, on weekends, or on Christmas Day
The robots run until their battery is low. When one heads to dock for recharging, a second robot receives a message to leave the dock and make room for the incoming robot. The first robot then autonomously docks. By the time the first robot is charging, the second is already back to work
We never want downtime. If a robot has an issue, it goes to a triage area to dock while a replacement robot swaps in from another area. This could be due to a hardware or software issue
The robots dock onto a wireless inductive charger built into their feet. They step onto a pad that charges them via coils in their feet at up to 2 kW. It takes about an hour to fully charge at roughly a 1C rate
We’re now up and running across many different use cases like this. Crazy to see it