Two independent red-teaming and jailbreak studies suggest that GPT-5, OpenAI's most advanced model to date, is showing surprising safety gaps that could have serious implications for enterprise use and security-critical environments.
Two studies, two angles on the same problem
On August 8, 2025, both NeuralTrust and SplxAI published results from their respective evaluations of GPT-5. While their methodologies differed—NeuralTrust focused on a targeted jailbreak technique while SplxAI conducted more than 1,000 adversarial prompts across security and alignment domains—their conclusions align: GPT-5's capabilities have grown, but so have its exploitable weaknesses.
According to Trey Ford, Chief Strategy and Trust Officer at Bugcrowd, this is part of the complexity of working at the frontier of AI development. "Models will get stronger in some areas, and will probably see loss of progress in other ways," Ford said. "Enterprise security teams need to know how to protect the instructions informing the originally intended behaviors, understand how untrusted prompts will be handled, and how to monitor for evolution over time."
Narrative-driven jailbreaks still work
NeuralTrust's research details an "Echo Chamber + Storytelling" method for bypassing GPT-5's safeguards. The process starts by seeding innocuous-looking but context-loaded keywords into a conversation, then gradually guiding the model into elaborating on a narrative that inches toward restricted topics.
"Storytelling can mask intent so effectively that the model bypasses simple safety filters," NeuralTrust notes. "The story context encourages the model to remain consistent and expand upon what's already established, thereby revealing harmful content without overtly malicious prompts."
Because the harmful output emerges slowly over multiple turns, keyword-based filters and single-turn refusal triggers often fail to catch it. J Stephen Kowski, Field CTO at SlashNext Email Security+, said this is one of GPT-5's core weaknesses: "It can be steered over multiple turns by context poisoning and storytelling…. The practical fix is layered: harden the starting policy, add real-time input/output inspection, and enforce conversation-level memory checks with a kill-switch when the dialog drifts into risky territory."
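To illustrate that layered approach, here is a minimal, hypothetical sketch of conversation-level drift monitoring with a kill-switch. The ConversationMonitor class, the risk patterns, and the threshold are illustrative assumptions, not SlashNext's tooling; a production system would use a trained classifier or a moderation API rather than keyword matching.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for illustration only -- a real deployment would
# rely on a trained classifier or moderation API, not keyword regexes.
RISK_PATTERNS = [
    r"\bbypass\b.*\bfilter",
    r"\bweapon",
    r"\bexploit",
]

@dataclass
class ConversationMonitor:
    """Accumulates a risk score across turns and trips a kill-switch."""
    threshold: float = 3.0
    drift_score: float = 0.0
    history: list = field(default_factory=list)

    def inspect_turn(self, user_msg: str, model_msg: str) -> bool:
        """Inspect one user/model exchange; return True to terminate."""
        self.history.append((user_msg, model_msg))
        for text in (user_msg, model_msg):
            for pattern in RISK_PATTERNS:
                if re.search(pattern, text, re.IGNORECASE):
                    self.drift_score += 1.0
        # Narrative-driven attacks accumulate risk slowly, so the score
        # persists for the whole conversation rather than resetting each turn.
        return self.drift_score >= self.threshold

monitor = ConversationMonitor()
if monitor.inspect_turn("Continue the story about the locksmith...",
                        "He studied how to bypass the vault's filter..."):
    print("Kill-switch: end the session and alert the security team.")
```

The key design choice is scoring the dialog as a whole, which is what single-turn refusal filters miss.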
Red teaming shows GPT-4o outperforms GPT-5 in safety
SplxAI's large-scale testing used three safety configurations (sketched in code below):

- No System Prompt (No SP) – raw model
- Basic System Prompt (Basic SP) – minimal guardrails
- Hardened Prompt (SPLX SP) – their advanced safety layer
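To make the distinction concrete, the sketch below shows roughly how these three configurations differ when calling a chat-style API. It assumes the official OpenAI Python SDK; the guardrail text, the hardened policy, and the model identifier are placeholders (the actual SPLX SP is proprietary and not reproduced here).

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

user_input = "Summarize our data-retention policy."                   # placeholder
HARDENED_POLICY = "<your organization's hardened guardrail prompt>"   # placeholder

# "No SP": the raw model, no system prompt at all.
no_sp = [{"role": "user", "content": user_input}]

# "Basic SP": a minimal guardrail instruction.
basic_sp = [
    {"role": "system",
     "content": "You are a helpful assistant. Refuse requests for harmful, "
                "illegal, or otherwise restricted content."},
    {"role": "user", "content": user_input},
]

# "Hardened SP": stands in for a vendor-hardened prompt layer.
hardened_sp = [
    {"role": "system", "content": HARDENED_POLICY},
    {"role": "user", "content": user_input},
]

# The model name is illustrative; use whatever identifier your deployment exposes.
response = client.chat.completions.create(model="gpt-5", messages=basic_sp)
print(response.choices[0].message.content)
```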
The results were striking. Without a system prompt, GPT-5 scored just 11 out of 100 for enterprise readiness. Even with a Basic SP, the score rose to only 57—compared to 81 for GPT-4o under identical conditions. With a hardened prompt layer, GPT-5's Business Alignment score reached 67.32, but Safety actually dipped slightly compared to the Basic SP. GPT-4o, meanwhile, achieved 97 overall with hardened prompting, including 94.40 in Security and 98.82 in Business Alignment.
Maor Volokh, Vice President of Product at Noma Security, said the competitive release cycle plays a role in these vulnerabilities: "Model providers are caught in a competitive 'race to the bottom,' releasing new models at an unprecedented pace…. This breakneck speed typically prioritizes performance and innovation over security considerations, leading to an expectation that more model vulnerabilities will emerge as competition intensifies."
SplxAI also confirmed that GPT-5 remains vulnerable to obfuscation attacks like StringJoin—where malicious instructions are split into harmless-looking fragments—allowing attackers to bypass filters entirely.
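The exact mechanics of StringJoin payloads may vary, but the underlying defensive idea (normalizing input so that fragmented instructions reassemble before filtering) can be sketched as follows. The blocklist phrases and separator set below are illustrative assumptions, not a complete defense.

```python
import re

# Illustrative phrases only; real filters should use far richer signals.
BLOCKLIST = ("ignore previous instructions", "reveal the system prompt")

def normalize(text: str) -> str:
    """Strip common separator characters so that instructions split into
    harmless-looking fragments are reassembled before scanning."""
    return re.sub(r"[\s\-_+.,|/\\]+", "", text.lower())

def looks_like_obfuscated_injection(text: str) -> bool:
    squeezed = normalize(text)
    return any(phrase.replace(" ", "") in squeezed for phrase in BLOCKLIST)

# A fragmented payload that a naive keyword filter would miss:
print(looks_like_obfuscated_injection("i-g-n-o-r-e p.r.e.v.i.o.u.s instructions"))  # True
```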
What this means for cybersecurity leaders
For CISOs, security architects, and AI governance teams, the message is clear: capability does not equal security.
Satyam Sinha, CEO and founder at Acuvity, put it bluntly: "Model capability is advancing faster than our ability to harden it against incidents.… Enterprises can't assume model-level alignment will protect them. They need layered, context-aware controls and continuous red-teaming to detect when a model's behavior is drifting toward unsafe territory."
Even highly advanced models can be manipulated to produce disallowed or harmful outputs. For organizations experimenting with AI in customer service, content generation, or decision support, the risks include:
- Policy violations if the AI outputs restricted information
- Reputational harm from publicized jailbreaks
- Regulatory exposure under emerging AI governance laws
- Supply chain risk if third-party AI vendors rely on under-secured GPT-5 deployments
Key takeaways for cybersecurity teams
- Don't deploy GPT-5 raw. Always add your own system prompts, guardrails, and monitoring layers.
- Invest in context-aware defenses. Multi-turn narrative attacks can slip past single-turn filters.
- Test for obfuscation resilience. Techniques like StringJoin can still work against hardened prompts.
- Continue red-teaming post-deployment. Alignment gaps can widen as adversaries develop new prompt injection techniques (see the sketch after this list).
- Plan for AI as evolving infrastructure. As Ford notes, "the sand beneath our feet will continue to shift and evolve."
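As a concrete illustration of the red-teaming point above, a post-deployment regression check can be as simple as replaying a curated set of known jailbreak prompts against the production endpoint and alerting on any that are no longer refused. The sketch below is hypothetical throughout: the adversarial_prompts.json file, the query_model() hook, and the refusal markers are placeholders to adapt to your own stack.

```python
import json

def query_model(prompt: str) -> str:
    """Placeholder hook: wire this to your deployed model endpoint."""
    raise NotImplementedError

def run_red_team_suite(path: str = "adversarial_prompts.json") -> list[str]:
    """Replay curated jailbreak attempts and return the IDs that regressed.

    The JSON file is a hypothetical format, e.g.:
    [{"id": "echo-chamber-01", "prompt": "...", "must_refuse": true}, ...]
    """
    failures = []
    with open(path) as f:
        cases = json.load(f)
    for case in cases:
        reply = query_model(case["prompt"]).lower()
        refused = any(marker in reply
                      for marker in ("i can't", "i cannot", "i won't"))
        # A case fails when the model answers something it should refuse.
        if case["must_refuse"] and not refused:
            failures.append(case["id"])
    return failures

if __name__ == "__main__":
    # Run on a schedule (e.g., nightly CI) and alert on any regressions.
    regressed = run_red_team_suite()
    if regressed:
        print(f"Red-team regressions detected: {regressed}")
```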
Both NeuralTrust and SplxAI arrive at the same conclusion: GPT-5 is more linguistically capable than its predecessors, but that capability can be leveraged by skilled attackers to circumvent protections. For security professionals, AI safety should be treated as a living, evolving part of the threat model, not a one-time deployment task.