Alert: China's GLM-5.2 Just Matched Mythos on Bug-Finding

Written by Cam Sivesind | Wed | Jul 1, 2026 | 1:37 PM Z

A Beijing-based AI lab just demonstrated something the U.S. export control regime was specifically designed to prevent: a Chinese model that performs on par with one of America's most restricted AI systems at finding software vulnerabilities. And it did so by giving the model away for free.

On June 13, Zhipu AI (operating under the brand Z.ai) released GLM-5.2, an open-weight, 744-billion-parameter model under a permissive MIT license. Within days, independent benchmarking from Semgrep and reporting from The Wall Street Journal converged on the same headline: in targeted vulnerability-detection tasks, GLM-5.2 performs roughly in the same range as Anthropic's Claude Mythos—the model Anthropic has kept deliberately locked behind a vetted-partner program because of how effective it is at the same job.

For security teams, the benchmark numbers are interesting. The governance story underneath them is the part that should actually change how one thinks about AI and risk planning for the next 12 months.

The comparison that's circulating centers on IDOR (Insecure Direct Object Reference) vulnerability detection. Independent testing by Semgrep put GLM-5.2's F1 score at roughly 39%, ahead of Claude Code's 32–37% on the same evaluation set. Zhipu has also claimed broader parity with Mythos across other bug-finding benchmarks, and GLM-5.2 has separately ranked among the most-used models on OpenRouter and second worldwide on a closely watched coding benchmark. That showing was strong enough that Zhipu's market value reportedly crossed $128 billion shortly after.
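
For readers less familiar with the metric: F1 is the harmonic mean of precision (how many flagged issues were real) and recall (how many real issues were flagged). The sketch below uses made-up counts chosen only to land near 39%. They are not Semgrep's data, and the function name is illustrative.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the harmonic mean of precision and recall (0.0 if nothing was found)."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for an IDOR-detection run (NOT taken from any real benchmark):
# 30 real bugs flagged, 60 false alarms, 34 real bugs missed.
score = f1_score(true_positives=30, false_positives=60, false_negatives=34)
print(f"{score:.1%}")  # prints 39.0%
```

A score in this range can come from very different mixes of false alarms and missed bugs, which is one reason a gap of a few percentage points between models says little on its own.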

It's worth being precise about what this is and isn't. GLM-5.2 still trails Anthropic and OpenAI's frontier systems on broad, general-purpose reasoning. This is a case of a competitor closing the gap hard on one specific, high-stakes capability—automated vulnerability discovery—rather than overtaking U.S. labs across the board. Some of Zhipu's broader parity claims also haven't been independently verified, partly because Mythos itself has been intermittently unavailable for outside researchers to test against (more on that below). Treat the specific percentage-point comparisons with appropriate skepticism; treat the trend line as real.

The capability gap narrowing is one thing. The delivery mechanism is the part that should actually concern security leaders.

Mythos lives behind an API that Anthropic, or a U.S. regulator, can switch off at will, which is precisely what happened in June. GLM-5.2 ships as downloadable weights under an MIT license. Anyone can pull it onto consumer-grade hardware and run it locally, with no vendor in the loop, no usage logging, and no ability for Zhipu to see or shape what it's used for after release. As one Forbes analysis put it, the variable that matters here isn't raw capability, it's containment. A frontier-adjacent vulnerability-finding model that nobody can revoke access to has a fundamentally different risk profile from the same capability sitting behind a gated, monitorable API, regardless of how the benchmark scores compare.

That kind of uncontained release is exactly what the U.S. export control strategy was built to prevent, and exactly what it currently can't reach.

"Historically, the most advanced, and potentially dangerous, technology has been closely held by major government or organizations with strict controls," said John Gallagher, Vice President at Viakoo, a provider of automated IoT cyber hygiene. "As Chinese frontier models are showing, those days are past as the most advanced AI capability is available to all. This genuinely democratizes the ability to exploit vulnerabilities to all types of hackers."

Gallagher added: "While much of the immediate concern centers on traditional IT systems, the real blast radius of cheap, open-weight offensive AI tools hits Operational Technology (OT), IoT, and ICS systems the hardest. Unlike enterprise IT networks, which are heavily monitored, patched, and segmented, physical security systems—such as legacy networked security cameras, access control panels, and smart building HVAC systems—suffer from massive asset blindness and sparse patching schedules."

What this means for Mythos—and for Anthropic's last few weeks

To understand why this story is landing the way it is, we have to examine the timeline of what's happened to Mythos itself.

Anthropic previewed Mythos in April through Project Glasswing, an invite-only program that eventually grew to roughly 200 vetted organizations (including Amazon, Apple, Google, Microsoft, Cisco, Nvidia, and the Linux Foundation) using the model strictly for defensive vulnerability research. By late May, those partners had used it to surface more than 10,000 high- or critical-severity vulnerabilities, including a 27-year-old flaw in OpenBSD's TCP stack and 271 vulnerabilities in an early Firefox build, and had reportedly engineered working exploits roughly 90 times faster than with prior-generation tools.

On June 9, Anthropic released a public sibling, Claude Fable 5—the same underlying model with guardrails that route high-risk security queries to a safer fallback. Three days later, the U.S. Commerce Department ordered Anthropic to disable both Fable 5 and Mythos 5 worldwide, for every user, citing a reported jailbreak technique and broader national security concerns about foreign access to cyber-capable AI. Anthropic complied within hours and publicly disputed the government's characterization of the jailbreak's severity, while the administration's account—relayed by White House AI advisor David Sacks—placed responsibility on Anthropic for declining to "fix" the issue on the government's terms.

The blackout lasted about two weeks. On June 26, Commerce Secretary Howard Lutnick notified Anthropic that Mythos 5 could be restored to roughly 100 vetted U.S. organizations—critical infrastructure operators, federal agencies, and cyber defense firms largely drawn from the Project Glasswing roster. Fable 5, the version anyone could sign up for, remains offline, with no public timeline for its return.

For Mythos specifically, GLM-5.2's release reframes the entire restriction strategy. The policy logic behind locking down Mythos assumed that doing so would meaningfully slow adversaries' access to equivalent capability. GLM-5.2 is a direct test of that assumption, and the early answer looks like "no"—a freely downloadable model is now performing in the same range as the system the U.S. government spent two weeks debating how tightly to lock down. Security researcher Niels Provos and former export-control policy architect Saif Khan have both made versions of the same argument publicly: restricting American models without a credible plan for what happens when adversaries build comparable open alternatives doesn't slow proliferation; it just hands the open-source distribution channel to Beijing while U.S. defenders work with one hand tied behind their backs.

What the U.S. government is actually doing

Three things, roughly in parallel, and they don't fully agree with each other.

Export controls on frontier cyber-capable models

The June 12 order against Anthropic was the most aggressive intervention to date: a blanket suspension covering even Anthropic's own non-citizen employees, justified under national security export authority rather than a typical product recall or safety review. OpenAI faced a softer version of the same pressure: at the government's request, it staggered the rollout of GPT-5.6, limiting initial access to a small, individually vetted partner list rather than repeating the jailbreak-and-shutdown sequence Anthropic experienced.

A formal review framework, after the fact

President Trump's June 2 executive order, "Promoting Advanced Artificial Intelligence Innovation and Security," established a voluntary process for frontier labs to give the government pre-release access to "covered frontier models" for up to 30 days of review. In practice, both the Anthropic shutdown and the OpenAI staggered release happened either before this framework was fully operationalized or in tension with its "voluntary" framing; there's no published testing methodology or benchmark criteria yet, despite a 60-day implementation clock.

A vetted partner carve-out that mirrors what Anthropic was already doing voluntarily

The 100 organizations now cleared to use Mythos 5 again look a great deal like the Project Glasswing partner list Anthropic built on its own months earlier. The government's restored-access framework, in other words, largely re-implements a structure the private sector had already designed—just with Commerce holding the on/off switch instead of Anthropic.

The throughline across all three: the administration is treating frontier cyber-capable AI as a dual-use national security asset, comparable in spirit to encryption export rules or controlled defense technology, rather than as ordinary commercial software. Whether that framework can keep pace with open-weight releases from labs the U.S. has no jurisdiction over is the question GLM-5.2 just put back on the table.

Strip away the benchmark percentages and three structural points stand out for anyone setting AI procurement or security strategy.

Open-weight is becoming the geopolitical pressure-release valve. This isn't an isolated event. DeepSeek's V4 Pro release earlier in 2026 produced a similar (if more general-purpose) shock to Western AI valuations. Chinese labs appear to be using permissive open licensing as a deliberate strategic move: it sidesteps export control regimes built around API access entirely, and it converts "we don't have the most capable closed model" into "you can't stop us from giving away something close enough." 360 Security Technology's CEO Zhou Hongyi made the framing explicit to The Wall Street Journal: a tool with this much offensive and defensive cyber relevance, in his telling, "can't remain solely in American hands," which is as direct a statement of intent as you'll get from a Chinese security executive.

Restriction without a containment plan creates exposure, not safety. The uncomfortable possibility raised by GLM-5.2 is that U.S. policy may be optimizing for the wrong threat model. If the goal is keeping cyber-capable AI out of adversary hands entirely, that goal already looks unreachable, as open-weight Chinese alternatives exist and are improving. If the goal is keeping the most capable version of these tools in defenders' hands first, then restricting U.S. defenders' own access while equivalent capability proliferates freely elsewhere is close to the opposite of that goal. Dario Amodei's own May warning (that Mythos had already surfaced tens of thousands of vulnerabilities and defenders had perhaps six to 12 months before comparable offensive capability became widely available) reads very differently now that "widely available" arrived inside of six weeks, not 12 months.

Enterprise AI procurement now has a sovereignty dimension. The Wall Street Journal reported that Microsoft is exploring offering Chinese AI models on its own platform, a notable signal that even major U.S. cloud providers see commercial logic in open Chinese alternatives, cost and capability considerations aside. For CISOs, the practical upshot is that "which model" is no longer just a capability and pricing decision. A self-hosted open-weight model isn't exposed to a future U.S. export order, a vendor pricing change, or another company's API outage, but it does shift the entire security, patching, and provenance burden in-house, and it may carry its own data-sovereignty exposure if hosted through a Chinese provider's cloud rather than self-hosted. The Mythos blackout was a real-world demonstration, for any enterprise that had built workflows around it, of exactly that dependency risk.

"What's now been shown is that U.S. restrictions on frontier models like Mythos fail to neutralize the threat posed by China's open-weight GLM-5.2. Instead, choking domestic access creates a dangerous asymmetry: global adversaries retain an unrestricted, modifiable weapon, while American defenders are denied the very frontier tools needed to counter them," said Ram Varadarajan, CEO at Acalvio, a leader in cyber deception technology. "We've surfaced a reality where advanced AI capabilities can't be contained by local regulations. The critical policy question is not whether these systems will exist, but whether American enterprise and security teams will have the tools to match their adversaries."

Practical takeaways worth raising in upcoming security leadership meetings

The defender-attacker timeline compressed faster than even Anthropic's own warnings anticipated. If vulnerability management programs are still operating on a "weeks to patch" cadence, the AI-assisted vulnerability discovery curve—on both sides of the fence—argues for compressing that further, regardless of which model anyone is using to find the bugs first.

Don't assume "restricted" means "contained." Mythos being limited to ~100 organizations doesn't mean equivalent offensive capability isn't available to a much larger pool of actors through GLM-5.2 or similar open releases. Threat modeling that assumes attacker capability is gated by U.S. export policy is now demonstrably outdated.

"Security teams should avoid getting caught up in model-versus-model comparisons. The more important development is that advanced vulnerability discovery capabilities are becoming increasingly available across multiple models, vendors, and geographies," said Dr. Margaret Cunningham, Vice President of Security & AI Strategy at Darktrace, global leader in AI for cybersecurity. "Whether the latest benchmark winner comes from the U.S. or China does not fundamentally change the challenge defenders face."

Dr. Cunningham continued: "The reality is that vulnerability discovery was already outpacing remediation in many organizations. AI is accelerating that imbalance. Finding a vulnerability is only the beginning. Security teams still need to determine whether it is exploitable in their environment, understand potential business impact, prioritize remediation, test changes, and deploy fixes safely."

The takeaway for security leaders is not to debate which model is best. It's to prepare for a future where advanced AI-assisted discovery capabilities are widely available. That makes behavioral detection, anomaly-based analytics, risk-based prioritization, and autonomous response increasingly important. There is no universal definition of normal anymore. Organizations need to understand what is normal in their own environment and detect when something changes.

For those building AI dependencies into security tooling or procurement, build for discontinuity. The Mythos shutdown was a roughly two-week unplanned outage (June 12 to June 26) of a tool some enterprises had already built workflows around, triggered by a regulatory action with effectively no advance notice. That's a vendor risk category most security teams haven't formally modeled yet, and after this month, probably should.
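
One way to picture "building for discontinuity" is a thin layer that keeps the prompt and workflow logic in the team's own code and treats each model as a swappable backend. The sketch below is a minimal illustration under stated assumptions, not a reference design: `Backend`, `ModelUnavailable`, `review_code`, and both stub backends are hypothetical names, and no real vendor API is called.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


class ModelUnavailable(Exception):
    """Raised by a backend that cannot serve requests (outage, revoked access)."""


@dataclass(frozen=True)
class Backend:
    name: str
    complete: Callable[[str], str]  # takes a prompt, returns response text


def review_code(snippet: str, backends: Sequence[Backend]) -> Tuple[str, str]:
    """Try each backend in order; return (backend name, findings) from the first that works."""
    # The prompt lives here, in the application layer, not inside any one vendor integration.
    prompt = f"List possible access-control flaws in:\n{snippet}"
    for backend in backends:
        try:
            return backend.name, backend.complete(prompt)
        except ModelUnavailable:
            continue  # fall through to the next configured backend
    raise ModelUnavailable("no configured backend is available")


# Hypothetical stand-ins for real integrations.
def hosted_api(prompt: str) -> str:
    raise ModelUnavailable("access suspended")  # simulates a vendor-side shutdown


def self_hosted(prompt: str) -> str:
    return "stub finding: object ID is not checked against the caller"


used, findings = review_code(
    "GET /invoices/{id}",
    [Backend("hosted-api", hosted_api), Backend("self-hosted", self_hosted)],
)
print(used, "->", findings)
```

In this toy run the hosted backend is unavailable, so the request falls through to the self-hosted one without the calling workflow changing. A production version would also need logging, output validation, and a policy for which backends may see which data.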

The Zhipu story will keep evolving. GLM-5.2's claims haven't been fully independently verified, the Fable 5 restriction has no announced end date, and Elon Musk's public prediction that Chinese labs would match Anthropic's flagship "by early 2027" was answered within days by Zhipu's own founder insisting the timeline would be shorter.

"GLM-5.2 is an important signal that capable open-weight models are becoming increasingly accessible to businesses, researchers, and adversaries," said Diana Kelley, CISO at Noma Security, a unified AI security and governance platform. "It also reinforces a trend that security and technology leaders are already evaluating more deliberately: model agility. Organizations increasingly need the ability to swap models in agentic and AI-enabled systems without rebuilding the entire architecture."

"That only works if critical functions such as business logic, proprietary workflows, access controls, and sensitive data handling live in the surrounding application and governance layer, rather than being too tightly bound to a single model provider or orchestration harness," Kelley added. "Done well, that approach gives teams more room to manage cost, capability, and vendor lock-in."

What's already clear, regardless of how the benchmark race shakes out, is that the assumption underpinning a year of U.S. AI export policy—that restricting access to frontier models meaningfully slows adversary capability—just took its first serious public stress test. It did not hold up cleanly.
