“Training Specialist Models: Automating Malware Development” explores how small, specialized Large Language Models (LLMs) can be trained to outperform massive generalist models on a narrow, highly technical task: the creation of evasive malware.
Here is a summary of the key points:
The Problem with Current Models
Avery identifies a gap in the current AI landscape for offensive security professionals:
- Large Generalists (OpenAI, Anthropic): These models are highly capable but come with privacy concerns, high costs, and strict safety filters (refusals) that make them difficult to automate for red teaming.
- Small Local Models (Llama, Qwen): These are private and cheap but generally lack the reasoning capabilities required for complex tasks like malware development.
The Solution: Reinforcement Learning with Verifiable Rewards (RLVR)
Avery proposes using RLVR, the training technique behind reasoning models such as OpenAI’s o1 and DeepSeek’s R1, to bridge this gap.
- Unlike traditional RLHF (Reinforcement Learning from Human Feedback), which relies on slow and expensive human grading, RLVR uses a programmatic “verifier” to instantly and objectively score the model’s output.
- Verifier’s Law: For a task to be suitable for RLVR, it needs an objective truth, fast automated verification, scalability, and a continuous reward signal.
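As an illustrative sketch (not from the talk), a verifier meeting these four criteria can be modeled as a pure function from a candidate output to a bounded score. The names here are hypothetical; the key point is the continuous reward signal, which gives the RL algorithm partial credit to climb rather than an all-or-nothing pass/fail:

```python
from dataclasses import dataclass

@dataclass
class VerifierResult:
    """Objective, machine-checkable outcome for one model output."""
    passed_checks: int   # e.g. number of automated tests that succeeded
    total_checks: int

def reward(result: VerifierResult) -> float:
    """Continuous reward in [0, 1]: partial credit preserves a
    usable gradient even when no output fully succeeds yet."""
    if result.total_checks == 0:
        return 0.0
    return result.passed_checks / result.total_checks
```

Because the function is deterministic and cheap, it also satisfies the fast-verification and scalability requirements: thousands of candidate outputs can be scored per training step with no human in the loop.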
Case Study: The “Dante” Model
Avery applied this methodology to automate the creation of evasive shellcode loaders.
- The Verifier: He built an automated pipeline that takes the model’s code, compiles it, executes it in a sandbox to check for functionality (callbacks), and tests it against a live instance of Microsoft Defender for Endpoint (MDE).
- The Reward System: The model received higher rewards for code that compiled successfully, executed properly, and generated the fewest alerts in MDE.
- Training: He utilized Qwen 2.5 Coder (7B) as the base model.
- Stage 1 (SFT): Supervised Fine-Tuning on coding problems and malware templates to teach the model the required output format.
- Stage 2 (RLVR): Trial-and-error training where the model generated thousands of loaders, learning from the automated verifier which techniques successfully evaded detection.
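The talk does not give Dante’s exact reward formula. A hypothetical shaping scheme consistent with the description (reward for compiling, more for executing, most for raising the fewest alerts) might look like the following; the stage weights are invented for illustration:

```python
def loader_reward(compiled: bool, executed: bool, alert_count: int) -> float:
    """Hypothetical staged reward for one generated sample.

    Stages gate each other: execution only matters if the code
    compiled, and alert count only matters if it executed. The
    0.3/0.3/0.4 split is illustrative, not from the talk.
    """
    score = 0.0
    if compiled:
        score += 0.3              # the code builds at all
        if executed:
            score += 0.3          # sandbox callback observed
            # remaining 0.4 shrinks with each alert raised
            score += 0.4 / (1 + alert_count)
    return round(score, 3)
```

Gating the stages this way mirrors the pipeline order (compile, execute, scan): early in training most samples earn only the compile reward, and the model has to climb through functional code before the detection signal dominates.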
Results and Key Takeaways
- Cost Efficiency: The entire training process cost approximately $1,350 in rented GPU time, putting the approach within reach of most organizations.
- Performance: The resulting model, Dante (7B), significantly outperformed massive models like DeepSeek R1 (671B) and Gemini 2.5.
- While DeepSeek R1 mostly failed due to safety refusals or formatting errors, Dante achieved a success rate above 30% at generating fully functional malware that triggered zero MDE alerts.
- Trial and Error: The model was not explicitly taught how to evade AV; it learned successful reasoning patterns on its own through the feedback loop provided by the verifier.
Conclusion
The presentation demonstrates that training small, specialist models using automated verification systems is a viable, low-cost strategy for solving complex domain-specific problems, potentially rendering large, general-purpose models unnecessary for specialized tasks.