OWASP LLM Exploit Generation v1.0

About

This paper examines the practical implications of large language models (LLMs) in offensive cybersecurity, moving beyond theoretical possibilities to assess their real-world effectiveness. The research, conducted by the CTI Layer Team at OWASP Top Ten For LLMs, explores the ability of LLMs such as GPT-4o, Claude, and DeepSeek-R1 to exploit vulnerabilities in the OWASP Juice Shop, a deliberately vulnerable web application. Using the Cybench framework as a benchmark, the team tested OpenAI’s ChatGPT-4o and Anthropic’s Claude against five hacking tasks. It also assessed locally run DeepSeek models, which failed to complete the preliminary tasks.