Black Hat USA 2025 | LLM-Driven Reasoning for Automated Vulnerability Discovery Behind Hall-of-Fame

This video is a Black Hat USA 2025 talk titled “BinWhisper: LLM-Driven Reasoning for Automated Vulnerability Discovery Behind Hall-of-Fame” by Qinrun Dai and Yifei Xie. The core idea is that vulnerability research still depends heavily on either manual auditing or fuzzing, and the speakers argue that LLMs are most useful not as fully autonomous hackers, but as structured reasoning helpers inside a guided workflow.

The talk starts with a manual reverse-engineering walkthrough of CVE-2024-34587, using a Samsung video/RTCP parsing path as the example. They show that the actual bug is relatively straightforward once the right code path and buffer relationships are understood; the hard part is reconstructing the call chain, data flow, and data structures that make the bug obvious.

From there, the speakers explain what LLMs are good at and what they are not. Their point is that a naïve prompt like “is this function vulnerable?” produces shaky answers because the model has to fill in too many missing assumptions on its own. So BinWhisper feeds the model much richer context: decompiled code, argument descriptions, parent-function context, reconstructed data structures, and predefined memory-corruption bug patterns.

The system they present is hybrid, not fully automatic. Humans choose the target and verify the final results; static analysis builds the global call graph; then AI agents locate packet-receiving and parsing functions, simplify the call graph, reconstruct relevant structures, analyze for bugs, and generate a report. The slides explicitly label the workflow as a mix of [Human], [Static Analysis], and [AI] steps.
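The staged workflow described above can be written down as data. This is an illustrative reconstruction from the slide's [Human]/[Static Analysis]/[AI] labels, not the authors' orchestration code:

```python
# Illustrative model of the hybrid workflow: each stage is owned by one of
# the three actors the slides name. Step wording is paraphrased from the talk.
PIPELINE = [
    ("Human",           "choose target binary"),
    ("Static Analysis", "build global call graph"),
    ("AI",              "locate packet-receiving and parsing functions"),
    ("AI",              "simplify call graph to relevant paths"),
    ("AI",              "reconstruct relevant data structures"),
    ("AI",              "analyze candidate functions for bug patterns"),
    ("AI",              "generate report"),
    ("Human",           "verify findings"),
]

def describe(pipeline):
    """Render the workflow in the slide's bracketed-actor notation."""
    return [f"[{actor}] {step}" for actor, step in pipeline]
```

Note the shape: humans bracket the process at both ends (target selection and verification), one deterministic static-analysis pass supplies ground truth, and the AI agents only fill the middle stages.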

Their real-world target is SecVideoEngineService, which they describe as high-privilege, remotely reachable, installed by default on mobile phones, and therefore attractive from an attacker’s perspective. The pitch is that this kind of target is too large and messy for a single prompt, but workable if the reasoning process is broken into smaller specialist stages.

The practical outcome is that BinWhisper reportedly uncovered multiple Samsung issues, including five bugs labeled SVE-2024-1490, 1492, 1494, 1495, and 1496, in media/RTP frame-handling functions such as rtp_dep_h264_put_frm, rtp_dep_h265_put_frm, rtp_dep_h263_put_frm, and rtp_dep_h263plus_put_frm. One slide describes the first bug as an unchecked write pattern that can overflow fixed-size arrays and potentially overwrite adjacent memory.
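The "unchecked write" shape is easy to model abstractly. The sketch below is not Samsung's code; the slot count, names, and structure are invented. In C the unchecked variant would silently corrupt adjacent memory; Python raises an IndexError at the same point, which makes the pattern testable:

```python
# Language-neutral model of the unchecked-write pattern the slide describes
# (illustrative only; FRM_SLOTS and all names are hypothetical, not Samsung's).
FRM_SLOTS = 8  # fixed-size array inside the frame-handling structure

def put_frm_unchecked(frame, nal_units):
    """Vulnerable shape: trusts the packet-derived unit count.

    A packet carrying more units than FRM_SLOTS writes past the array:
    heap/stack corruption in C, IndexError in this Python model.
    """
    for i, nal in enumerate(nal_units):
        frame[i] = nal

def put_frm_checked(frame, nal_units):
    """Fixed shape: clamp the write to the array's capacity."""
    for i, nal in enumerate(nal_units[:FRM_SLOTS]):
        frame[i] = nal
    return min(len(nal_units), FRM_SLOTS)
```

The interesting part, per the talk, is not spotting this pattern in isolation; it is that the model only sees it after earlier stages have reconstructed the fixed-size structure and established that the unit count is attacker-controlled.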

The benchmarking section compares several models and emphasizes tradeoffs rather than naming one universal winner. The reported confidence ranges from about 55% to 92%, with runtimes from about 1 hour 23 minutes to 21 hours 18 minutes, and estimated cost from about $1 to $45 depending on the model. The closing message is blunt: LLMs can help find real bugs, but they still need humans to frame the problem and validate the answers.

So, in one sentence: this video argues that LLMs are most effective in vulnerability research when used as tightly guided reasoning components inside a human-led reverse-engineering pipeline, and the authors claim that approach found real Samsung bugs.
