Open weights, open risks: AI wins and ID checks lose

Today’s picks: an open-weight model beats a frontier agent in security tests, while a proposed U.S. kids‑safety bill pressures platforms toward identity checks. Plus: wild side effects of AI in grading and hiring.

Editorial note

AI is reshaping both what systems can do and what societies expect of them. Today’s stories pair a practical surprise in model performance with a mounting policy fight over identity, and a few cautionary tales about putting probabilistic models into high‑stakes gates like hiring, exams, and medicine.

In Brief

HackerRank open sourced its ATS. My resume scored 90/100. Oh wait 74. No – 88

Why this matters now: HackerRank’s open-source applicant-tracking system (hiring-agent) grades resumes with an LLM, and its scores are inconsistent and non-deterministic, so a tool like this can quietly decide who gets interviews based on luck.

HackerRank published an open-source ATS that parses resumes and grades candidates by calling an LLM multiple times. The author ran the same PDF repeatedly and saw scores range from 66 to 99 across 100 runs. Categories like “technical skills” were stable, but subjective rubrics (projects, experience) swung wildly. The writeup highlights a simple but consequential point: when you replace human judgment with multiple probabilistic LLM calls, hiring outcomes become a lottery unless the scoring is made repeatable and auditable.

"I fail 65% of the time. Same exact resume, different luck."

Key takeaway: Use LLMs for checklist parsing, not unsupervised pass/fail gates, unless you add repeatable prompts, calibration, and legal review. (See the author’s account.)
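One way to make this kind of inconsistency visible before trusting a score as a gate is to grade the same resume several times and look at the spread. The sketch below is not taken from hiring-agent. The helper summarize_runs, the pass mark of 80, and the three sample scores (borrowed from the headline) are illustrative assumptions.

```python
import statistics


def summarize_runs(scores, pass_mark):
    """Summarize repeated scores produced for one unchanged resume.

    `scores` holds the results of grading the same input several times;
    `pass_mark` is a hypothetical interview cutoff.
    """
    if not scores:
        raise ValueError("need at least one score")
    fails = sum(1 for s in scores if s < pass_mark)
    return {
        "runs": len(scores),
        "min": min(scores),
        "max": max(scores),
        "spread": max(scores) - min(scores),        # range of outcomes
        "stdev": statistics.pstdev(scores),         # population std. deviation
        "fail_rate": fails / len(scores),           # share of runs below cutoff
    }


# Illustrative values only: the three scores from the headline,
# not the author's full 100-run data set.
summary = summarize_runs([90, 74, 88], pass_mark=80)
print(summary)
```

A large spread or a fail rate far from 0 or 1 for an unchanged input suggests the score reflects sampling noise more than the resume.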

I used Claude Code to get a second opinion on my MRI

Why this matters now: A developer used Anthropic’s Claude Code with Opus to analyze a shoulder MRI and received an AI read that contradicted his clinic, illustrating both the potential and the peril of code-capable LLMs in medicine.

The author fed DICOM exports to Claude Code, which installed analysis packages, generated a radiology-style report, and produced an arbitration summary that concluded no discrete tear — opposing the clinic’s finding. This is an eye‑opening example of how accessible, code‑capable LLM tooling can surface alternative readings and treatment flags, but it’s a single anecdote: clinicians and radiologists warned on Hacker News about inconsistency and the risk of over‑ or under‑treatment when AI disagrees with human experts.

"NO discrete partial- or full-thickness tear identified, including at the apical insertion."

Key takeaway: Patients can get cheaper second opinions from LLM tooling, but these outputs need cross-validation and provenance before changing care. (Original post: antoine.fi.)

Professor denounces mass AI fraud on an exam at Brown

Why this matters now: A take‑home midterm at Brown produced implausibly high scores that the professor attributes to ChatGPT, forcing a supervised in‑person retake and sharpening the debate over assessment design in the AI era.

Economist Roberto Serrano says the take‑home average was 96/100, with 40 perfect scores; on the supervised in‑person exam that followed, the average plunged to 48, leading Serrano to call the take‑home incident “the biggest known scandal” in Ivy League AI cheating. The episode crystallizes a practical administrative problem: if take‑home work can be trivially aided by generative systems, institutions must redesign assessment formats or fall back on supervised, oral, or handwritten alternatives, each with equity and logistical consequences.

"The empirical evidence of fraud is overwhelming."

Key takeaway: Colleges will need mixed approaches (redesign + supervised checks) rather than one-size-fits-all bans. (Reporting: El País.)

Age verification is just a precursor to automated attribution of speech

Why this matters now: A thoughtful thread argues that modern age‑verification rules pave the way to tying online accounts to real identities, with chilling effects for speech and dissent.

The post warns that “age checks collapse the second question” of investigations (who did it), effectively making platforms into identity-attribution systems. Commenters pointed to design alternatives — like cryptographic zero‑knowledge proofs that confirm age without handing over identity — but the political momentum behind “protecting kids” risks normalizing systems that make speech traceable and automatable.

"age verification laws are -by design- identity attribution systems"

Key takeaway: If platforms adopt intrusive age checks, widespread attribution — and automated enforcement — becomes technically and politically easier. (Thread: nonogra.ph.)

Deep Dive

GLM 5.2 beats Claude in our benchmarks

Why this matters now: Semgrep’s security team found that Zhipu AI’s open-weight GLM 5.2 outperformed Anthropic’s Claude Code on an IDOR detection benchmark while costing far less — a concrete sign that open weights matter again for security tooling.

Semgrep ran a controlled IDOR (Insecure Direct Object Reference) benchmark and reported that GLM 5.2 achieved a 39% F1 score, beating Claude Code on the same prompts at roughly one‑sixth the token cost of frontier agents. Importantly, GLM 5.2 was not given the multimodal harness that Semgrep’s specialized pipeline uses; it saw only a repo and a standard prompt. That harness, which enumerates endpoints and scaffolds discovery, still sits at the top. The unexpected win underscores two points. First, which model sits underneath matters less than data, fine‑tuning, and overall prompt and harness design. Second, open weights let teams run, inspect, and iterate locally in ways closed models do not.
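For readers who want the arithmetic behind an F1 figure: it is the harmonic mean of precision and recall. The sketch below uses made-up counts chosen only so the result lands at 0.39. The benchmark's underlying counts are not given here, many different count combinations yield the same F1, and f1_score is a hypothetical helper.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Return F1, the harmonic mean of precision and recall.

    precision = TP / (TP + FP): share of reported findings that are real.
    recall    = TP / (TP + FN): share of real issues that were reported.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, NOT from the benchmark: 39 real IDORs found,
# 61 false alarms, 61 real IDORs missed.
score = f1_score(true_positives=39, false_positives=61, false_negatives=61)
print(round(score, 2))
```

Because F1 penalizes both false alarms and misses, a 39% score can win a head-to-head comparison while still leaving most findings in need of human review.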

"you can download them, run them on your own hardware, fine-tune them, and inspect them" — a property Semgrep highlights about GLM 5.2.

Why this changes procurement and security planning: teams that must keep code and data in-house (air‑gapped networks, regulated industries) can now consider running competitive, cheap LLMs locally without giving up much capability. That reduces vendor lock‑in and gives defenders practical options for building repeatable, auditable pipelines. But caveats matter: open models can be bench‑maxxed, misused, or ill‑calibrated for particular threat models, and the Semgrep post stresses that harnessing still produces the biggest gains. Expect more head‑to‑head public benchmarks and a renewed focus on reproducible harnesses and cost-per-ticket metrics.

Key takeaway: Open weights like GLM 5.2 are no longer niche curiosities; they’re viable, inspectable alternatives for security teams — but success depends on harness design, calibration, and operational controls. (Read Semgrep’s report: semgrep.dev.)

The KIDS Act would require age checks to get online

Why this matters now: The proposed KIDS Act bundles multiple internet bills and uses a low “knows or should have known” standard that will practically force platforms toward age verification systems to avoid liability.

The Electronic Frontier Foundation’s analysis warns that although the bill states "age verification is not required," its negligence standard and broad moderation mandates will push companies to implement verification, likely via identity uploads, biometric scans, or error‑prone automated age estimation. That’s a legal design problem: when regulatory incentives favor certainty about age, platforms will default to systems that tie accounts to real‑world identity, eroding anonymity and privacy for many users.

"age verification is not required." — text the EFF notes is effectively undercut by the bill’s other provisions.

Practical harms are not hypothetical: automated age estimation mislabels people of color, disabled users, and trans and nonbinary people at higher rates; enforced age checks can chill whistleblowers and political dissidents; and companies may weaken encryption or restrict ephemeral messaging to meet enforcement expectations. Commenters suggested technical mitigations — for example, zero‑knowledge proofs that certify age without revealing identity — but such solutions require strong policy mandates and careful standards to avoid being co‑opted. For businesses, the immediate risk is compliance cost plus a privacy‑engineering headache; for civil liberties, the risk is a structural shift toward identity‑bound speech online.

Key takeaway: If Congress passes the KIDS Act as drafted, expect platforms to pivot toward identity‑linking age checks with broad privacy and speech consequences unless alternative cryptographic designs are mandated. (Analysis: EFF.)

Closing Thought

Two trends are colliding: models that you can run, tune, and audit locally are getting surprisingly capable and cheap, while policy pressure is steering online services toward more identity‑bound interactions. That mix is potent — better tools for defenders and users, but also easier surveillance rails if lawmakers and companies choose the wrong defaults. Today’s job for technologists and policy wonks is practical: build auditable harnesses, demand privacy‑preserving age checks where needed, and treat probabilistic AI outputs as helpers, not final authorities.