High-Capability LLMs: A Sourced Reference

@joshuashew.bsky.social

Researched and compiled by Claude (Opus 4.6), directed by Joshua Shew. April 2026.

Twelve things LLMs have done that are hard to dismiss, with sources. No press releases, no vendor claims, no self-reported benchmarks. If the only evidence is a company saying their product is good, it's not here.

The scope is LLMs and transformer-based systems. AlphaFold earned its creators a Nobel Prize, but it's a protein-structure predictor, not a language model. This is about the models you can talk to.


1. Terence Tao co-authored a math paper built on an LLM-driven system and says AI is "ready for primetime."

Tao is a Fields Medalist, which in mathematics is as high as it goes. He used DeepMind's AlphaEvolve to attack 67 open problems across analysis, combinatorics, geometry, and number theory, improving the state of the art on several. At IPAM in March 2026, he said AI "saves more time than it wastes" in math and theoretical physics. He uses it for literature search, coding, calculations, and testing approaches. He also cautioned that results "depend very heavily on the context and on the choices of the human operator."

Sources: Georgiev, Gómez-Serrano, Tao & Wagner, arXiv:2511.02864 (2025). Tao at IPAM.

2. A physicist handed GPT-5 Pro an equation. It found the same symmetries he'd spent months deriving.

Alex Lupsasca, a black hole physicist at Vanderbilt and recipient of the 2024 New Horizons in Physics Prize, had derived new hidden symmetries governing black hole tidal response. When he gave the same equation to GPT-5 Pro with minimal guidance, it returned the same symmetry generators in about 18 minutes. "I think this is just incredible, and it's clearly going to change everything that we do," he said. He later joined OpenAI's science team.

Sources: OpenAI Academy: Lupsasca. Science News (Feb 2026).

3. 758 BCG consultants got measurably better with GPT-4. Except when they didn't.

A pre-registered experiment at Boston Consulting Group, co-authored by Wharton's Ethan Mollick, gave 758 consultants realistic knowledge-work tasks with and without GPT-4 (April 2023 version, no special training). With AI, they completed 12.2% more tasks, 25.1% faster, with 40% higher quality. But on tasks that fell outside the model's competence, consultants with AI performed 23% worse than those without it. Mollick calls this the "jagged frontier": AI capability is unevenly distributed across tasks in ways that aren't obvious in advance. The study also found AI use reduced the diversity of ideas generated. Published in Organization Science (2026), pre-registered, peer-reviewed.

Sources: Dell'Acqua, McFowland, Mollick et al., "Navigating the Jagged Technological Frontier," Organization Science (2026).

4. Tyler Cowen replaced 20 books with a conversation.

Cowen, an economist at George Mason and one of the most-read economics bloggers in the world, describes a specific shift in his research process: "In the old days, I would have ordered and paid for 20 to 30 books. Now, maybe I've ordered two or three books, but I'll keep interrogating the best LLMs. I acquire the context much more quickly." He uses different models for different tasks: GPT for research depth, Claude for writing quality (he calls it "a wonderful mix of thoughtful, philosophical, dreamy, flexible, versatile"). He still verifies facts and checks for hallucinations. But the workflow has changed.

Sources: David Perell "How I Write" podcast. The Sociology Place summary.

5. LLMs score 95% on the US Medical Licensing Exam. Physicians score 76–82%.

A meta-analysis of 120 evaluations across 10 medical licensing exam systems in 9 languages found OpenAI's o1 at 95.4%, DeepSeek-R1 at 92.0%, and GPT-4o at 89.4%. Three years earlier, GPT-3 scored 36.7%. On adversarial clinical reasoning questions that test commonsense medical judgment, LLMs still perform poorly compared to physicians and exhibit overconfidence. Exams test knowledge recall. Clinical practice requires temporal reasoning, physical examination, and judgment under uncertainty.

Sources: El-Amir et al., Cureus (2025).

6. An AI tutor outperformed Harvard's best teaching method in a controlled trial.

A randomized controlled trial (N=194) found that students using a GPT-4-based tutor learned more than twice as much as students in active learning classrooms, the current gold standard in STEM education, while spending 20% less time. Effect size: 0.73–1.3 standard deviations. The tutor was designed with real pedagogical principles (cognitive load management, growth mindset, active recall), not raw prompting. One study, one course, one university. No replication yet.

Sources: Kestin et al., "AI tutoring outperforms in-class active learning," Scientific Reports 15 (2025).
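
For readers unfamiliar with effect sizes in standard deviations, here is a minimal sketch of one common definition, Cohen's d with a pooled standard deviation. Whether the paper used exactly this formula is an assumption, and the scores below are invented for illustration, not data from the study.

```python
import math
import statistics


def cohens_d(treatment, control):
    """Difference in group means, in units of the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    v1 = statistics.variance(treatment)  # sample variance (n - 1 denominator)
    v2 = statistics.variance(control)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd


# Hypothetical post-test scores, NOT taken from Kestin et al.
tutored = [4, 5, 6]
classroom = [2, 3, 4]
print(cohens_d(tutored, classroom))
```

An effect of 0.73 means the average tutored student scored about three quarters of a standard deviation above the average classroom student.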

7. LLM coding agents went from 33% to 80% on real GitHub issues in 18 months.

SWE-bench Verified tracks whether an AI system can investigate a real codebase, generate a correct patch, and pass the project's test suite. Top scores went from 33% at launch to ~80% of its 500 tasks 18 months later. That number needs a caveat: OpenAI's audit found that frontier models show signs of training data contamination on this benchmark. On the uncontaminated SWE-bench Pro (1,865 tasks, multiple languages), scores are 46–57%.

Sources: Epoch AI. vals.ai.

8. A protein language model generated a functional protein that doesn't exist in nature.

EvolutionaryScale's ESM3 generated esmGFP, a green fluorescent protein with 58% sequence identity to the nearest known fluorescent protein. That's roughly the evolutionary distance of 500 million years. The design took two rounds: an initial dim candidate was refined into a bright variant comparable to natural GFPs. It was physically synthesized. It fluoresces.

Sources: Hayes et al., "Simulating 500 million years of evolution with a language model," Science 387 (2025).

9. Lawyers are adopting LLMs fast. The hallucination problem is real.

79% of legal professionals report using AI in their practice. At top Vault firms, associates use LLMs for document analysis, contract review, and drafting. A preregistered Stanford study found that even specialized legal AI tools hallucinate 17–33% of the time (testing tools as of mid-2024); raw GPT-4 hit 43%. By mid-2025, courts were sanctioning lawyers for AI-generated fake citations at a rate of 2–3 incidents per day. The tools are genuinely useful; they are also genuinely unreliable.

Sources: Magesh et al., "Hallucination-Free?" 22 J. Empirical Legal Stud. 216 (2025). Understanding AI (Jan 2026).

10. A social science researcher switched to Claude Code for all his research work.

Thomas Manandhar-Richardson, Director of Research at Bryant Research and lead data consultant at a climate NGO, pivoted to Claude Code and Codex CLI for all research work in October 2025. This is practitioner testimony, not a controlled study. He reports faster data analysis, literature review, and statistical modeling. He also reports that Claude "will happily start tasks even when it has incomplete understanding" and that catching its errors requires deep domain knowledge. He documents his workflow publicly, updates it weekly, and names specific friction points.

Sources: Evolving Impact blog (updated March 2026).

11. An LLM discovered new mathematics. The result was published in Nature.

DeepMind's FunSearch paired an LLM with an evolutionary framework and a program evaluator to discover new constructions for the cap set problem in extremal combinatorics, exceeding known human results. The LLM generates candidate programs; the evaluator checks correctness automatically. Most candidates are wrong. The system works because the bad ones get filtered out. This was, by the authors' account, the first genuine LLM-based mathematical discovery.

Sources: Romera-Paredes et al., Nature 625 (2024).
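
The generate-and-filter loop in entry 11 can be sketched in a few lines. This is a toy under loud assumptions, not DeepMind's code: FunSearch evolves programs written by an LLM, while here a hypothetical `propose` function mutates point sets at random and stands in for the model. Only the evaluator, which checks the cap set property exactly, resembles the real thing.

```python
import itertools
import random


def is_cap_set(points):
    """True if the points are distinct and no three of them lie on a line in Z_3^n."""
    point_set = set(points)
    if len(point_set) != len(points):
        return False
    for a, b in itertools.combinations(points, 2):
        # In Z_3^n, the third point on the line through a and b is -(a + b) mod 3.
        third = tuple((-x - y) % 3 for x, y in zip(a, b))
        if third in point_set:
            return False
    return True


def propose(parent, all_points, rng):
    """Hypothetical stand-in for the LLM: extend the parent with one random point."""
    return list(parent) + [rng.choice(all_points)]


def search(n, rounds, seed=0):
    """Keep a candidate only if the evaluator accepts it and it beats the current best."""
    rng = random.Random(seed)
    all_points = list(itertools.product(range(3), repeat=n))
    best, rejected = [], 0
    for _ in range(rounds):
        candidate = propose(best, all_points, rng)
        if is_cap_set(candidate) and len(candidate) > len(best):
            best = candidate
        else:
            rejected += 1
    return best, rejected


best, rejected = search(n=2, rounds=200)
print(len(best), "points kept;", rejected, "candidates rejected")
```

Almost every proposal is thrown away, which is the point: the search is only as trustworthy as the evaluator, and for cap sets the evaluator is exact.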

12. LLMs won gold at the International Math Olympiad. With caveats.

At IMO 2025, Gemini Deep Think scored 35/42 working end-to-end in natural language, and an OpenAI model achieved gold independently. DeepMind's solutions were officially graded by IMO coordinators; OpenAI's were graded by a panel of former IMO medalists. The models used best-of-n sampling, generating multiple candidates and selecting the strongest. Publicly available versions of the same models scored far lower in parallel evaluations. All gold-medal models solved five moderate problems and failed the one very hard problem.

Sources: DeepMind: IMO 2025. Scientific American.
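
Best-of-n sampling, mentioned in entry 12, is simple to sketch. How the labs actually generated and ranked candidates is not described in the sources above, so the `generate` and `score` functions here are hypothetical placeholders, and the success formula assumes independent samples and a selector that always recognizes a correct answer.

```python
def best_of_n(generate, score, n):
    """Draw n candidates and return the one the scorer ranks highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)


def chance_of_at_least_one_success(p, n):
    """Probability that at least one of n independent samples succeeds."""
    return 1 - (1 - p) ** n


# Toy task: find x with x * x close to 1369. The fixed guesses stand in for model samples.
guesses = iter([3, 40, 36, 90])


def closeness(x):
    return -abs(x * x - 1369)


print(best_of_n(lambda: next(guesses), closeness, n=4))
print(chance_of_at_least_one_success(0.1, 10))
```

Under those assumptions, a model that is right 10% of the time on a single attempt gets at least one right answer in 10 attempts about 65% of the time, which is one reason a best-of-n score and a single-shot score for the same model can differ so much.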


What's not here

The original request asked for examples across coding, art, education, writing, gamedev, psych, and finance. We cover coding, education, medicine, law, economics, biology, and social science. The gaps are real but not accidental.

AI-generated art and music are impressive, but the best results come from diffusion models and audio architectures, not LLMs. This list is specifically about language models. Creative writing quality is subjective and resists the kind of sourcing standard used here. Game development, psychotherapy, and finance lack rigorous public evidence of LLM capability specifically; what exists is either anecdotal, proprietary, or both.

Adoption numbers (65% of developers use AI weekly, Khan Academy has 700,000 AI tutor users) tell you about popularity, not about whether the tools are good.

AlphaFold, GenCast, GNoME, and halicin are excluded because they aren't LLMs.

How this was made

All claims were verified against the linked sources during a single research session. Each entry went through adversarial review: what would a domain expert skeptic say about this claim? A companion research notes document with the full adversarial case for each entry is available on request.
