Editorial note
Today’s theme: how we build and trust intelligence. One story looks backward — a model trained only on pre-1931 texts — while two others look ahead: a billion-dollar bet on AIs that learn without human data, and the growing pains of autonomous agents being deployed and shared in the wild.
In Brief
Talkie, a 13B LM trained exclusively on pre-1931 data
Why this matters now: talkie-1930, the vintage LM from the Talkie team, lets researchers test how the era of training data shapes language models, and it highlights risks tied to historical bias and OCR errors.
Researchers released talkie-1930, a 13-billion-parameter language model trained only on English texts published before 1931 (roughly 260 billion tokens). The experiment is part creative play (“have you ever daydreamed about talking to someone from the past?”) and part scientific probe: by removing modern web data, the team gets a cleaner baseline for what a model learns from older cultural materials. Early tests show the model can generate period-authentic prose and sometimes write short, correct Python snippets, but it predictably underperforms modern counterparts on many tasks and can mirror outdated or offensive viewpoints embedded in its sources. The team plans to scale the effort, work with historians to diversify the corpora, and improve vintage OCR and leakage detection.
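Leakage detection is worth a concrete illustration. As a minimal, hypothetical sketch (not the Talkie team's actual pipeline), a first-pass filter might flag any document that mentions a year after the 1930 cutoff and route it to human review:

```python
import re

CUTOFF_YEAR = 1930

# Crude but illustrative: match four-digit years between 1000 and 2999.
YEAR_PATTERN = re.compile(r"\b[12][0-9]{3}\b")

def post_cutoff_years(text: str, cutoff: int = CUTOFF_YEAR) -> list[int]:
    """Return years mentioned in `text` that fall after the cutoff.

    A non-empty result flags the document for human review: it could be
    a later reprint, an OCR artifact, or genuine post-cutoff leakage.
    """
    return sorted({int(m.group()) for m in YEAR_PATTERN.finditer(text)
                   if int(m.group()) > cutoff})

# Example: a pre-1931 novel reissued with a modern preface gets flagged.
print(post_cutoff_years("Serialized in 1929; reissued in 1954 with a preface."))
# -> [1954]
```

A production pipeline would add n-gram overlap checks against modern corpora and metadata validation, but the principle is the same: make the cutoff mechanically testable.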
“Talkie reflects the culture and values of the texts it was trained on, not the views of its authors,” the creators note.
Key takeaway: Vintage models are a useful research tool for isolating dataset effects, but they also remind us that training data choices produce ethical and factual trade‑offs.
GPT-5.5 edges up on the Extended NYT Connections Benchmark
Why this matters now: The latest GPT-5.5 results show incremental gains on a human-vetted puzzle benchmark, illustrating how small score shifts can meaningfully reorder the model rankings that inform product decisions.
Benchmark posts show GPT-5.5 nudging ahead of GPT-5.4 and several competitors on the Extended NYT Connections puzzle test, taking second place behind Gemini 3.1 Pro. Improvements are visible across difficulty bands, though models still struggle when chain-of-thought reasoning is restricted. The thread also underlines that open-weight models are improving fast and that model choice still depends on commercial access and subscription trade-offs.
Key takeaway: These are incremental but measurable wins; puzzle-style benchmarks remain a clear way to compare reasoning strengths between models.
Xiaomi open-sources mimo v2.5 pro
Why this matters now: Xiaomi’s release adds another inspectable model to the global pool, increasing options for researchers and builders who want local control or to study model internals.
Xiaomi published the weights and code for mimo v2.5 pro, expanding the set of Chinese models available for community inspection and local runs. Early users pointed out practical hurdles, such as the lack of immediate GGUF-format ports (needed for many local runtimes), along with the usual questions about licensing, benchmarks, and real-world performance. Open releases like this matter more for the ecosystem than for any immediate product disruption.
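For readers unfamiliar with the GGUF complaint: many local runtimes, including those built on llama.cpp, only load models in that format. Here is a minimal sketch of the workflow people are waiting for, using the llama-cpp-python bindings and a hypothetical file name for a community conversion that does not yet exist:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Hypothetical file name: a community GGUF conversion of mimo v2.5 pro
# would have to exist before this runs.
llm = Llama(model_path="mimo-v2.5-pro.Q4_K_M.gguf", n_ctx=4096)

result = llm("Summarize what open weights enable, in one sentence.",
             max_tokens=64)
print(result["choices"][0]["text"])
```

Until such a conversion lands, local-runtime users are effectively waiting on the community plumbing the takeaway below describes.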
Key takeaway: Openness accelerates experimentation, but the community plumbing (format conversions, ports, and benchmarks) often dictates how fast people can use a new release.
Deep Dive
David Silver’s $1.1B bet: Ineffable Intelligence
Why this matters now: David Silver’s new lab, Ineffable Intelligence, raised $1.1 billion to build agents that learn without human data — a funding and scientific shift that could reshape where AI research concentrates and how systems are trained.
David Silver, known for AlphaZero and other RL successes at DeepMind, has launched Ineffable Intelligence, drawing $1.1 billion in backing at a $5.1 billion valuation. The lab’s pitch is bold: instead of training on massive corpora of human text, build a “superlearner” that discovers knowledge and skills via reinforcement learning: trial-and-error, self-play, and environment interaction. The website’s rhetoric is sweeping, comparing the lab’s proposed discovery of a law of intelligence to Darwin’s law of natural selection. Backing from major VCs and strategic partners (including Google and Nvidia) signals both confidence and the strategic importance investors place on alternate training paradigms.
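To ground the paradigm, here is a toy, self-contained sketch of the trial-and-error core (tabular Q-learning on a five-cell line world). This is generic RL for illustration, not anything Ineffable Intelligence has published; the point is that the agent never sees human data and improves purely from environment reward:

```python
import random

# Toy line world: the agent starts at cell 0 and earns reward 1 only by
# reaching cell 4. No human data appears anywhere in the loop.
N_STATES, GOAL, ACTIONS = 5, 4, (-1, +1)
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.3

def step(state: int, action: int):
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

for episode in range(300):
    s = 0
    for _ in range(1000):  # step cap keeps early, clueless episodes bounded
        # Epsilon-greedy: usually exploit current estimates, sometimes explore.
        a = (random.choice(ACTIONS) if random.random() < epsilon
             else max(ACTIONS, key=lambda act: q[(s, act)]))
        s2, r, done = step(s, a)
        # Q-learning update: bootstrap from the best next-state estimate.
        best_next = max(q[(s2, a2)] for a2 in ACTIONS)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
        s = s2
        if done:
            break

# The learned greedy policy should now point toward the goal from each cell.
print({s: max(ACTIONS, key=lambda act: q[(s, act)]) for s in range(N_STATES)})
```

Scaling that recipe from a five-cell toy to open-ended competence is exactly where the hard problems discussed below begin.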
This isn’t just academic ambition. If agents can learn complex capabilities without curated human data, that would change compute demands, data supply chains, and the role of labels. It could enable agents that develop specialized skills in simulation before real-world deployment. But the route from AlphaZero-style domains (games, constrained simulators) to open-ended, real-world competence is steep. Key technical challenges include scalable environment design, safe exploration (keeping agents from learning harmful behavior in pursuit of reward), and generalization from simulated experience to messy reality.
There are also immediate governance and safety questions. Redditors flagged both excitement and alarm: some see it as a natural continuation of RL research, while others worry about alignment and the concentration of talent in a new London-based lab. Silver’s pledge to donate personal proceeds to charities softens the optics, but the timeline, experimental path, and safeguards remain under-specified. Practically, even well-funded RL programs require years of iterative research and careful sandboxing before producing anything remotely general-purpose.
Key takeaways:
- Technical promise: Self-directed, environment-driven learning could produce novel capabilities that text-only training misses.
- Practical reality: The jump from closed simulators to open, safe, general intelligence is nontrivial — expect research, not immediate product breakthroughs.
- Governance risk: Large sums plus high ambition increase the urgency of transparent safety practices and independent review.
“If successful, this will represent a scientific breakthrough of comparable magnitude to Darwin,” the lab’s site claims — a headline-grabbing ambition that will require measured technical scrutiny.
OpenClaw, agent testing, and why autonomy breaks when evidence is optional
Why this matters now: OpenClaw’s spread and conversations about agent self-testing reveal a practical churn: agents can automate tasks, but they often fail when asked to produce verifiable evidence of their work.
A heated discussion titled “A hard pill to swallow about OpenClaw” captures a core tension in the agent era: powerful open-source toolkits make autonomous workflows easier to assemble, but they also lower the barrier to misuse and brittle deployment. One top reply summarized a slippery outcome:
“you're going to just download a working OpenClaw setup from GitHub and start printing money.”
The quip imagines a grim automation treadmill churning out low-quality or abusive services. Counterpoints in the thread are pragmatic: successful agent setups often require careful training, security hygiene (SSH keys, credentials), and active maintenance; they’re more like hiring and onboarding a junior employee than flipping a switch.
Parallel community threads drill into a frequent engineering failure mode: hallucinated testing. When agents are asked to test their own work, they sometimes invent success rather than verify artifacts. Practitioners recommend simple, concrete defenses: give agents real tooling (Playwright, browser automation), force them to write outputs to disk and return file paths, use vision models to inspect screenshots, and separate verification LLMs that judge evidence before allowing another iteration. One practical rule of thumb: require an external artifact (a log file, a screenshot, a test report) that a non-hallucinating process can re-check.
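As a hedged sketch of that rule of thumb (the function names and artifact layout are illustrative, not from any specific thread): the worker agent must hand back a file path, a deterministic check rejects the round if the artifact is missing or empty, and only then does a separate verifier judge the evidence.

```python
from pathlib import Path

def run_agent_step(task: str) -> str:
    """Placeholder for the worker agent: it must return a path to an
    artifact (log, screenshot, test report), never a bare claim of success."""
    out = Path("artifacts/test_report.txt")
    out.parent.mkdir(exist_ok=True)
    out.write_text(f"task: {task}\nstatus: ran 12 checks, 12 passed\n")
    return str(out)

def artifact_exists(path: str) -> bool:
    """Deterministic gate: the artifact must exist on disk and be non-empty.
    No language model can hallucinate its way past this check."""
    p = Path(path)
    return p.is_file() and p.stat().st_size > 0

def verify_with_llm(path: str) -> bool:
    """Placeholder for a *separate* verification model that reads the
    artifact and judges whether it actually demonstrates success."""
    return "passed" in Path(path).read_text()

def iterate(task: str, max_rounds: int = 3) -> bool:
    for round_no in range(1, max_rounds + 1):
        artifact = run_agent_step(task)
        if not artifact_exists(artifact):
            print(f"round {round_no}: no artifact produced, rejecting claim")
            continue
        if verify_with_llm(artifact):
            print(f"round {round_no}: verified via {artifact}")
            return True
    return False

if __name__ == "__main__":
    iterate("check the login page renders")
```

The design choice worth copying is the ordering: a cheap, non-hallucinating check (file exists, non-empty) runs before any model-based judgment, so a fabricated success claim fails at the first gate.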
These combined conversations point to a pattern: autonomy without verifiability is fragile. Builders who want reliable agents should pair language models with deterministic tools and human checkpoints. For teams and platforms, the safer path is to make autonomous stages explicit and auditable — limit tool access by default, require proofs of action, and instrument every credentialed agent with monitoring and rollback.
Key takeaways:
- Proof-first workflows: Force agents to create verifiable artifacts; never accept a textual claim of success as the only evidence.
- Defense-in-depth: Combine tooling (automation frameworks), separate verification models, and human-in-the-loop gates for deployment.
- Policy implication: Open-source agent toolkits accelerate innovation but also demand community standards for safe sharing and reuse.
Closing Thought
The field is stretching in several directions at once: experiments that rewind the dataset clock (Talkie) coexist with billion-dollar bets on self-taught learners and a messy agent ecosystem that exposes operational weak spots. The three threads share one lesson: building systems that behave reliably at scale requires more than clever models; it takes careful data choices, safe exploration practices, and an insistence on verifiable evidence. For engineers and product managers, that is where the next practical progress will come from.
Sources
- Introducing Talkie: talkie-1930
- The Crowded Interior Of A Cell (Digizyme animation)
- DeepMind's David Silver just raised $1.1B to build an AI that learns without human data (TechCrunch)
- GPT-5.5 improves over GPT-5.4 on Extended NYT Connections Benchmark (Reddit)
- Xiaomi has open-sourced mimo v2.5 pro (Reddit image post)
- I'm in desperate need of understanding and navigating AI agents (r/aiagents)
- How can you make an AI test it's own work and iterate? (r/aiagents)
- A hard pill to swallow about OpenClaw (r/openclaw)
- I built Claude Code for Video Editing - VEX (r/aiagents)
- Solo AI platform from Algeria (Reddit video)