QFM069: Machine Intelligence Reading List - June 2025
Source: Photo by Jonathan Kemper on Unsplash
This month's Machine Intelligence Reading List features comprehensive guides and research. A Practical Guide to Building Agents provides OpenAI's official framework for agent development. Built Multi-Agent Research System shares Anthropic's engineering lessons.
Six Months in LLMs offers Simon Willison's mid-year retrospective, while The Gentle Singularity presents Sam Altman's optimistic vision.
As always, the Quantum Fax Machine Propellor Hat Key will guide your browsing. Enjoy!

Links
Apple's new research demonstrates that large language models, including advanced "reasoning models" like o1, fundamentally fail to generalize beyond their training distribution on classic reasoning tasks such as the Tower of Hanoi—validating long-standing critiques that neural networks cannot reliably extrapolate outside the data they've been exposed to. The paper also validates concerns that chain-of-thought reasoning traces don't accurately reflect how these models actually arrive at answers, showing that inference-time compute scaling cannot overcome the core limitation that LLMs break down when faced with out-of-distribution problems.
The author presents a structured approach to AI-assisted project development using Claude Code, centered on creating a clear PLAN.md file that breaks work into testable milestones with automated verification scripts, then leveraging Claude Code's ability to self-iterate autonomously for 5-8 minutes while running fast feedback loops. The key advantage over tools like Cursor is Claude Code's capacity to make changes, fix them independently, and continue working without interruption when given a clear plan and deterministic testing/linting infrastructure, with the developer reviewing diffs as pull requests and managing staged commits to track reasoning chains.
This repository is a curated list of AI coding tools organized by category, including code completion assistants (GitHub Copilot, Codeium, Tabnine), refactoring tools, code search capabilities, and LLM-based code generation systems. The repository was archived in February 2026 and is now read-only, but previously served as a comprehensive resource documenting the landscape of AI-powered development tools from both commercial providers and open-source projects.
PRFAQ (Press Release/FAQ) documents provide essential context that dramatically improves LLM output quality across brainstorming, writing, and coding tasks by articulating vision and strategy, rather than relying on raw prompts alone. The document format forces clarity on "why" behind work rather than just "what," enabling AI to generate contextually appropriate, nuanced results instead of generic suggestions. This approach mirrors how effective professionals operate by understanding the big picture before execution, transforming LLM usage from a lottery into a reliable tool.
LLM-assisted coding tools have improved dramatically in recent months and can be genuinely productive when treated as guided collaborators rather than autonomous code generators. Skepticism about hype cycles and long-term effects on codebases remains warranted, given how quickly the tools are changing and how hard it is to know what the same investment would have produced in other tooling. The shift away from viewing these tools as either magic solutions or useless "stochastic parrots" reflects a "stone soup" dynamic in which billions in investment and complementary technologies are driving real improvements, though it may take years for the tools to stabilize enough for their true impact to be assessed.
As AI language models become central to both humanistic research and AI development itself, humanities skills—particularly understanding of language, culture, and rhetoric—have become unexpectedly valuable rather than obsolete. The article argues that universities pretending AI won't transform teaching and research is untenable, and that humanistic knowledge is now essential both for using AI tools effectively (in paleography, translation, data mining) and for fixing AI systems when they fail due to cultural or linguistic misunderstandings. Non-technical humanists now have the capability to write their own code, fundamentally reshaping what humanities scholarship entails.
Anthropic's multi-agent Research system uses a lead agent (Claude Opus 4) that coordinates parallel subagents (Claude Sonnet 4) to explore complex research queries simultaneously, outperforming a single-agent Claude Opus 4 setup by 90.2% on Anthropic's internal research evaluation by distributing token budgets across independent search trajectories. Much of that gain comes from simply spending more tokens on the problem (in Anthropic's analysis, token usage alone explains 80% of performance variance), combined with parallelization that enables the breadth-first exploration a sequential pipeline handles poorly and lets the system adjust its path as an investigation unfolds.
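For readers who want a feel for the lead-agent and subagent shape described here, the following is a minimal sketch, not Anthropic's implementation. The names `plan_subtasks`, `run_subagent`, `lead_agent`, and `research` are hypothetical, and the subagent is a stub that returns a canned string where a real one would call a model and search tools.

```python
import asyncio


def plan_subtasks(query: str) -> list[str]:
    """Hypothetical planner: the lead agent splits a query into independent angles."""
    angles = ("background", "recent work", "open problems")
    return [f"{query} ({angle})" for angle in angles]


async def run_subagent(subtask: str) -> str:
    """Hypothetical subagent stub; a real one would call a model and search tools."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"findings for: {subtask}"


async def lead_agent(query: str) -> str:
    """Fan the subtasks out concurrently, then merge the results."""
    subtasks = plan_subtasks(query)
    # gather runs the subagents concurrently and returns results in subtask order
    results = await asyncio.gather(*(run_subagent(s) for s in subtasks))
    return "\n".join(results)


def research(query: str) -> str:
    """Synchronous entry point for the sketch."""
    return asyncio.run(lead_agent(query))
```

Calling `research("agent evals")` returns three lines of findings, one per subtask. The point of the sketch is only the fan-out and merge structure: each subagent works on its own slice of the problem with its own budget, and the lead agent sees the combined output.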
Altman argues that AI systems like GPT-4 and o3 represent a genuine technological takeoff toward superintelligence, with the hardest scientific insights already achieved; the trajectory suggests agents capable of novel research by 2026 and real-world robots by 2027, fundamentally transforming human productivity and scientific discovery rates. He contends that intelligence and energy have been humanity's primary constraints on progress, and their imminent abundance through advanced AI—combined with improved governance—could unlock transformative improvements in quality of life, medicine, and scientific understanding, even as fundamental human experiences remain unchanged.
As AI democratizes technical execution across writing, design, and coding, judgement—the ability to know what to create, make meaningful choices, and evaluate quality—has become the primary differentiator between professionals, paralleling Brian Eno's 1995 observation that computer sequencers shifted music production from a skill problem to a judgement problem. The most valuable workers in an AI-enabled future will be those who can ask the right questions, frame problems effectively, and provide strategic direction rather than execute technical tasks.
The author argues that AI skeptics in software development are wrong because they're evaluating LLMs based on outdated usage patterns (copy-pasting from ChatGPT), not how modern AI coding agents actually work—agents that autonomously navigate codebases, run tools, compile code, and iterate on results. LLMs significantly reduce boilerplate coding, eliminate the friction of starting new projects, and overcome the psychological inertia that prevents developers from tackling ambitious work, making them the second most important technological development in the author's career regardless of future progress.
LLMs enable a new "intention economy" where companies capture and commodify human motivations and desires through hyper-personalized manipulation, natural language analysis, and inference of both explicit and implicit intent signals—extending beyond the attention economy by targeting not just what users attend to, but what they want to want. Tech companies are racing to develop infrastructure that elicits, forecasts, and modulates human plans and purposes across mundane and consequential decisions, then sells this behavioral and psychological data to the highest bidder.
The LLM landscape now moves so quickly that even a six-month retrospective, rather than a full year, is hard to cover: more than 30 significant models were released in the period, including Meta's Llama 3.3 70B (which achieved GPT-4-class performance on consumer hardware) and DeepSeek's open-weight model, which was released without documentation and emerged as a top performer. Rather than relying on traditional benchmarks and leaderboards, the author uses a creative evaluation method: prompting models to generate SVG code for a pelican riding a bicycle, an intentionally difficult task that reveals both capability and the model's reasoning through comments in the generated code.
Fine-tuning advanced LLMs for knowledge injection is counterproductive because it overwrites existing knowledge rather than adding new information—neurons are finite resources where updating weights risks erasing the intricate patterns already encoded in the network. Instead of fine-tuning, modular techniques like retrieval-augmented generation, adapters, and prompt engineering should be used to inject new knowledge without compromising the model's carefully built foundational ecosystem.
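To make the retrieval-augmented alternative concrete, here is a minimal sketch under stated assumptions, not the article's method. It ranks documents by simple word overlap (real systems typically use embeddings) and stops at assembling the prompt, with no model call. The helper names `tokenize`, `retrieve`, and `build_prompt` are hypothetical.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, documents: list[str], k: int = 1) -> str:
    """Put retrieved text into the prompt; the model's weights are never touched."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

New knowledge enters through the `documents` list at query time, so updating what the system knows means editing that list rather than retraining anything.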
Claude Code now supports remote MCP servers, allowing developers to connect external tools and data sources like Sentry and Linear directly to their coding environment without managing local infrastructure. Remote MCP servers reduce maintenance overhead through vendor-managed updates and scaling, while native OAuth support eliminates the need to manually handle API keys or credentials. This integration enables Claude Code to access real-time project context and debugging information, streamlining workflows by keeping developers within a single interface.
Regards,
M@
[ED: If you'd like to sign up for this content as an email, click here to join the mailing list.]
Originally published on quantumfaxmachine.com and cross-posted on Medium.
hello@matthewsinclair.com | matthewsinclair.com | bsky.app/@matthewsinclair.com | masto.ai/@matthewsinclair | medium.com/@matthewsinclair | xitter/@matthewsinclair