0000000000000bc0 <bbCallback>:
bc0: 90000102 adrp x2, 20000 <_exit@GLIBC_2.17>
bc4: f9404c43 ldr x3, [x2, #152]
bc8: b4000263 cbz x3, c14 <bbCallback+0x54>
bcc: d53bd042 mrs x2, tpidr_el0
bd0: a9bf7bfd stp x29, x30, [sp, #-16]!
bd4: 12003c01 and w1, w0, #0xffff
bd8: 910003fd mov x29, sp
bdc: 90000100 adrp x0, 20000 <_exit@GLIBC_2.17>
be0: f9403404 ldr x4, [x0, #104]
be4: 9101a000 add x0, x0, #0x68
be8: d63f0080 blr x4
bec: 78606844 ldrh w4, [x2, x0]
bf0: 53017c25 lsr w5, w1, #1
bf4: 78206845 strh w5, [x2, x0]
bf8: 4a040021 eor w1, w1, w4
bfc: 92403c21 and x1, x1, #0xffff
c00: 38616860 ldrb w0, [x3, x1]
c04: 11000400 add w0, w0, #0x1
c08: 38216860 strb w0, [x3, x1]
c0c: a8c17bfd ldp x29, x30, [sp], #16
c10: d65f03c0 ret
c14: d65f03c0 ret
c18: d503201f nop
c1c: d503201f nop
This code snippet is an implementation of the AFL (American Fuzzy Lop) coverage instrumentation callback for the AArch64 (ARM64) architecture.
Specifically, it behaves like AFL's __afl_maybe_log: a callback, here named bbCallback, that the instrumentation invokes at the start of each basic block to track code coverage during fuzzing.
High-Level Logic
The function performs the following standard AFL logic:
- Check Initialization: Checks if the shared memory bitmap (the “Map”) is initialized.
- Thread Local Storage (TLS): Retrieves a thread-local variable that stores the previous_location (derived from the ID of the last block visited).
- Edge Calculation: Computes a hash for the transition (edge) between the previous block and the current block using: index = current_location ^ previous_location. The hash is compact rather than unique, so different edges can collide.
- Coverage Recording: Increments the counter in the Map at that index.
- Update State: Updates previous_location to current_location >> 1 for the next block.
Step-by-Step Instruction Breakdown
1. Check if the Map exists
bc0: 90000102 adrp x2, 20000 ; Load page address of global variables
bc4: f9404c43 ldr x3, [x2, #152] ; Load the global pointer `__afl_area_ptr` (the Map) into x3
bc8: b4000263 cbz x3, c14 ; If x3 is NULL (0), jump to c14 (return immediately)
- The code loads the pointer to the shared memory coverage map. If the fuzzer hasn’t initialized this yet, it simply returns to avoid a crash.
2. TLS Setup & Context Saving
bcc: d53bd042 mrs x2, tpidr_el0 ; Move Thread Pointer ID Register (TLS base) into x2
bd0: a9bf7bfd stp x29, x30, [sp, #-16]! ; Push Frame Pointer (FP) and Link Register (LR) to stack
bd4: 12003c01 and w1, w0, #0xffff ; Mask the input (Current Block ID) to 16 bits. Keep in w1.
bd8: 910003fd mov x29, sp ; Set up the frame pointer
- w0 contains the Current Block ID (passed by the caller).
- tpidr_el0 is used to access thread-local variables. This is crucial when the fuzzed target is multi-threaded, because it ensures the previous_location variable is unique per thread.
3. Resolve TLS Offset for __afl_prev_loc
bdc: 90000100 adrp x0, 20000 ; Load address for GOT/Global area
be0: f9403404 ldr x4, [x0, #104] ; Load the helper's function pointer from the TLS descriptor at offset 104 (0x68)
be4: 9101a000 add x0, x0, #0x68 ; x0 = address of that same descriptor, passed as the helper's argument
be8: d63f0080 blr x4 ; Call helper function
- This block calls a helper function to determine the offset of the thread-local variable __afl_prev_loc.
- After blr, x0 contains the offset of __afl_prev_loc relative to the thread pointer (x2).
4. Load Previous Location & Calculate Edge
bec: 78606844 ldrh w4, [x2, x0] ; Load `prev_loc` (16-bit) from [TLS_Base + Offset]
bf0: 53017c25 lsr w5, w1, #1 ; Calculate `next_prev_loc` = (Current_ID >> 1)
bf4: 78206845 strh w5, [x2, x0] ; Store `next_prev_loc` back to TLS for the next time
bf8: 4a040021 eor w1, w1, w4 ; Calculate Edge Index: (Current_ID ^ prev_loc)
bfc: 92403c21 and x1, x1, #0xffff ; Ensure Index fits in 64KB (16 bits)
- AFL Algorithm Core:
  - Current Edge = Current_ID ^ Prev_ID.
  - Future Prev_ID = Current_ID >> 1.
- The shift (lsr) distinguishes the direction of the edge (i.e., A -> B creates a different hash than B -> A).
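To make the effect of the shift concrete, here is a small sketch that applies the two formulas above to a pair of made-up block IDs. The function names and the IDs 0x1234 and 0xABCD are illustrative assumptions, not anything taken from AFL itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers mirroring the two formulas above. */
uint16_t next_prev(uint16_t cur) { return (uint16_t)(cur >> 1); }

uint16_t edge_index(uint16_t prev, uint16_t cur) {
    return (uint16_t)(cur ^ prev);
}

/* Index recorded when control flows from block `from` to block `to`. */
uint16_t edge_between(uint16_t from, uint16_t to) {
    return edge_index(next_prev(from), to);
}

/* The same edge computed without the shift, for comparison. */
uint16_t edge_between_unshifted(uint16_t from, uint16_t to) {
    return edge_index(from, to);
}

/* Worked example with two made-up block IDs. */
void check_direction_example(void) {
    const uint16_t a = 0x1234, b = 0xABCD;
    assert(edge_between(a, b) != edge_between(b, a));   /* direction matters */
    assert(edge_between_unshifted(a, b) ==
           edge_between_unshifted(b, a));               /* plain XOR is symmetric */
    assert(edge_between_unshifted(a, a) == 0);          /* self-loops would all hit index 0 */
}
```

With these IDs, A -> B lands on index 0xA2D7 and B -> A on 0x47D2, while the unshifted XOR gives 0xB9F9 in both directions. The shift also keeps a self-loop A -> A (0x1B2E here) from collapsing to index 0.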
5. Update Coverage Map
c00: 38616860 ldrb w0, [x3, x1] ; Load byte from Map[Index]
c04: 11000400 add w0, w0, #0x1 ; Increment counter
c08: 38216860 strb w0, [x3, x1] ; Store byte back to Map[Index]
- This increments the “hit count” for this specific code path transition.
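Each map entry is a single byte, and the three instructions above show no saturation or carry handling, so the counter wraps around after 256 hits. A minimal sketch of that behavior follows; MAP_SIZE, record_hit, and hits_after are names invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAP_SIZE 65536u  /* 2^16 entries, matching the 16-bit index */

/* Mirrors ldrb / add #1 / strb: a plain, non-saturating 8-bit increment. */
void record_hit(uint8_t *map, uint16_t index) {
    map[index] = (uint8_t)(map[index] + 1);
}

/* Hit `index` n times on a zeroed map and return the final counter value. */
uint8_t hits_after(unsigned n, uint16_t index) {
    static uint8_t map[MAP_SIZE];
    memset(map, 0, sizeof map);
    for (unsigned i = 0; i < n; i++) record_hit(map, index);
    assert(map[index] == (uint8_t)n);  /* the counter is n modulo 256 */
    return map[index];
}
```

For example, hits_after(255, 42) returns 255, but hits_after(256, 42) returns 0, which looks the same as an edge that was never taken.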
6. Return
c0c: a8c17bfd ldp x29, x30, [sp], #16 ; Pop FP and LR
c10: d65f03c0 ret ; Return to the instrumented function
c14: d65f03c0 ret ; Early return path (if Map was NULL)
Summary
This is the assembly equivalent of the following C logic used in fuzzing:
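// w0 is the compile-time random ID for the current block
// Assumed declarations (names taken from the explanation above):
extern unsigned char *__afl_area_ptr;           // shared coverage map
extern __thread unsigned short __afl_prev_loc;  // per-thread previous location

void bbCallback(unsigned short cur_loc) {       // only the low 16 bits of w0 are used
    unsigned char *map = __afl_area_ptr;
    if (!map) return;
    // Access thread-local storage for previous location (16-bit, see ldrh/strh)
    unsigned short prev_loc = __afl_prev_loc;
    // Calculate edge index
    unsigned int index = (cur_loc ^ prev_loc) & 0xFFFF;
    // Update map
    map[index]++;
    // Update previous location for the next callback
    __afl_prev_loc = cur_loc >> 1;
}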
Here is an explanation of these three concepts (Thread Local Storage, thread-local variable, and tpidr_el0), moving from the high-level software concept down to the specific hardware implementation in ARM64.
1. Thread Local Storage (TLS) — The System
Thread Local Storage is a computer programming method that uses static or global memory local to a thread.
- The Problem: Normally, global variables are shared by all threads in a program. If Thread A changes a global variable, Thread B sees the change. This creates race conditions and requires locking (mutexes), which is slow.
- The Solution (TLS): TLS allows you to define a “global” variable where each thread gets its own unique copy.
- Analogy:
  - Global Memory: A whiteboard in the middle of an office. Everyone shares it. If you write on it, I see it.
  - TLS: A notebook given to every employee. Everyone has a notebook called “Notes,” but what I write in mine is completely private from what you write in yours.
2. Thread-Local Variable — The Programming Object
A thread-local variable is the specific variable that lives inside the TLS.
In C or C++, you declare them using keywords like __thread, _Thread_local, or thread_local.
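As a minimal sketch of what such a declaration buys you, the following uses C11 _Thread_local with POSIX threads. The names private_counter, worker, and run_two_workers are made up for the example; build it with something like gcc -pthread.

```c
#include <assert.h>
#include <pthread.h>

_Thread_local int private_counter = 0;  /* every thread gets its own copy */

/* Each worker bumps its own private counter n times and reports the result. */
void *worker(void *arg) {
    int n = *(int *)arg;
    for (int i = 0; i < n; i++) private_counter++;
    *(int *)arg = private_counter;  /* hand this thread's own value back */
    return NULL;
}

/* Run two workers with different counts and collect each thread's final value. */
void run_two_workers(int n_a, int n_b, int *out_a, int *out_b) {
    pthread_t a, b;
    int arg_a = n_a, arg_b = n_b;
    int rc = pthread_create(&a, NULL, worker, &arg_a);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, worker, &arg_b);
    assert(rc == 0);
    (void)rc;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    *out_a = arg_a;
    *out_b = arg_b;
}
```

Calling run_two_workers(3, 5, &a, &b) yields a == 3 and b == 5, and the calling thread's own private_counter is still 0 afterwards. With an ordinary global, both workers would have incremented the same variable.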
Classic Example: errno
In standard C programming, errno contains the error code of the last failed system call.
- If Thread A tries to open a missing file, errno becomes 2 (ENOENT).
- If Thread B is running successfully at the same time, its errno should remain 0.
- Therefore, errno is a thread-local variable. If it were a standard global variable, Thread B might incorrectly think it encountered an error because Thread A failed.
In the context of your previous AFL code snippet, the variable __afl_prev_loc is a thread-local variable. This ensures that if the program being fuzzed runs multiple threads, Thread A’s execution history doesn’t mix with Thread B’s history.
3. tpidr_el0 — The Hardware Register
This is the specific ARM64 (AArch64) CPU register used to implement TLS efficiently.
- Name Breakdown:
  - TPID: Thread Pointer / Thread ID (Arm's full name for the register is “EL0 Read/Write Software Thread ID Register”)
  - R: Register
  - EL0: Exception Level 0 (User Mode)
- Purpose: It holds the base address of the memory region allocated for the currently running thread.
When the Operating System switches threads (context switch), it updates tpidr_el0 to point to the new thread’s private data area.
How they work together (The Flow)
When your code wants to access a thread-local variable, the CPU performs the following steps (simplified):
- Get the Base: The CPU reads tpidr_el0 to find out “Where does the current thread’s private memory start?”
- Get the Offset: The compiler knows that Variable X is located, say, 16 bytes from the start of that memory block.
- Calculate Address: Target Address = tpidr_el0 + 16.
- Access: Read/Write to that address.
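The four steps above can be modeled in plain C by treating each thread's private area as a byte buffer. This is only a sketch: the buffers stand in for real TLS areas, the offset 16 is the made-up value from the list, and the function names are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 16-bit value at base + offset, like `ldrh w4, [x2, x0]`. */
uint16_t tls_load_u16(const unsigned char *base, size_t offset) {
    uint16_t value;
    memcpy(&value, base + offset, sizeof value);
    return value;
}

/* Write a 16-bit value at base + offset, like `strh w5, [x2, x0]`. */
void tls_store_u16(unsigned char *base, size_t offset, uint16_t value) {
    memcpy(base + offset, &value, sizeof value);
}

/* Two "threads" use the same offset but different bases, so they never collide. */
void demo_two_threads(void) {
    unsigned char block_a[32] = {0}, block_b[32] = {0};  /* stand-ins for per-thread areas */
    tls_store_u16(block_a, 16, 0x1111);
    tls_store_u16(block_b, 16, 0x2222);
    assert(tls_load_u16(block_a, 16) == 0x1111);
    assert(tls_load_u16(block_b, 16) == 0x2222);
}
```

The offset is identical for every thread; only the base changes, which is the value the operating system swaps into tpidr_el0 on a context switch.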
Applying it to your code snippet
Let’s look at lines bcc to bec from your snippet again:
bcc: d53bd042 mrs x2, tpidr_el0 ; 1. READ the Thread Pointer base address into x2
...
be8: d63f0080 blr x4 ; 2. CALL a helper to get the variable's OFFSET (puts it in x0)
bec: 78606844 ldrh w4, [x2, x0] ; 3. ACCESS memory at [Base (x2) + Offset (x0)]
Why was this necessary in the code?
AFL updates prev_loc (previous location) to track coverage. If the fuzzed target is multi-threaded, two threads running instrumented code simultaneously would overwrite each other’s prev_loc if it were a standard global variable, and the coverage map would record edges that no thread actually took. By using tpidr_el0 to access a thread-local copy, each thread tracks its own path independently.
“CALL a helper to get the variable’s OFFSET”
The helper function is necessary because of Dynamic Linking.
When you compile this code (likely as a shared library or a Position Independent Executable), the compiler does not know where the variable __afl_prev_loc will be located in memory relative to the thread pointer.
Here is the breakdown of why the helper is needed and what it does.
1. The Core Problem: “I don’t know where I am yet”
If you are writing a standard executable (like main.exe), the compiler can calculate exactly where every variable is. It can say “Variable X is at offset 16.”
However, this code is likely compiled as Position Independent Code (PIC). This means:
- This code might be inside a shared library (libafl.so).
- That library could be loaded into memory at any address.
- The library might be loaded after the program starts (using dlopen).
Because of this, the offset of __afl_prev_loc is unknown at compile time. The compiler cannot simply write add x0, x2, #16. It needs to ask the “runtime linker” where the variable ended up.
2. The Solution: TLS Descriptors (TLSDESC)
The specific instruction sequence you see (adrp, ldr, add, blr) is the signature of the TLSDESC (TLS Descriptors) model, which is the default for AArch64/ARM64.
It works like a “lazy question”:
- Preparation: adrp and add prepare the arguments. They point to a “descriptor” in the Global Offset Table (GOT).
- The Call (blr): You call the helper function stored in that descriptor.
- The Answer: The helper calculates just the offset for you and returns it in x0.
- The Access: You add that offset to your Thread Pointer (tpidr_el0) to find the variable.
3. Why is it a “Helper” and not just a standard function?
It is a special, highly optimized function provided by the dynamic linker (like ld-linux.so).
- It preserves registers: Notice that the code didn’t save x1, x2, or x3 before calling blr x4, even though all three are used afterwards. A standard function call would require saving these. The TLS helper guarantees it won’t mess up your temporary registers (except x0, which carries the result), making it very fast.
- It can change: When the program first starts, the helper might point to a “slow path” that calculates the offset and caches it. The next time you call it, it might point to a “fast path” that just returns the now-known constant.
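The descriptor-plus-helper idea, including the switch from a slow path to a fast path, can be sketched in ordinary C. This is a toy model under stated assumptions: struct tls_desc, both resolvers, and the offset 16 are invented for illustration and do not reflect the dynamic linker's real data layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a TLS descriptor: a resolver pointer plus one argument. */
struct tls_desc {
    ptrdiff_t (*resolver)(struct tls_desc *self);
    ptrdiff_t arg;
};

int slow_path_calls = 0;  /* counts how often the slow resolver ran */

/* Fast path: the offset is already cached in the descriptor. */
ptrdiff_t resolve_fast(struct tls_desc *self) {
    return self->arg;
}

/* Slow path: "compute" the offset once, cache it, and swap in the fast path. */
ptrdiff_t resolve_slow(struct tls_desc *self) {
    slow_path_calls++;
    self->arg = 16;  /* made-up offset; a real loader derives it from the TLS layout */
    self->resolver = resolve_fast;
    return self->arg;
}

/* What the call site does: load the pointer from the descriptor and call it. */
ptrdiff_t tls_offset(struct tls_desc *desc) {
    assert(desc != NULL && desc->resolver != NULL);
    return desc->resolver(desc);
}
```

Starting from a descriptor initialized as { resolve_slow, 0 }, the first tls_offset call runs the slow path and every later call goes straight to resolve_fast. The call site never changes, which is why the same adrp/ldr/add/blr sequence works in both cases.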
Summary
You need the helper because the variable’s location is dynamic.
- Static Build: “Go to offset 16.” (Fast, no helper needed)
- Dynamic Build (Your Code): “Call this helper; it will figure out where the variable lives right now, then tell me the offset.”
If this code were part of the main executable (e.g., the .exe file), you would be absolutely correct: the offset would be static/constant. This is called the Local Exec (LE) model.
However, the assembly you provided is using the General Dynamic (GD) model in its TLSDESC form. This happens because the compiler is being conservative. It assumes this code might end up in a Shared Library (.so).
Here is why the offset cannot be static in a Shared Library.
1. The “Train” Analogy
Imagine the Thread Local Storage (TLS) memory area as a long train attached to the Thread Pointer (TP).
- The Engine: The Thread Pointer (tpidr_el0).
- Car 1: The variables for the Main Executable.
- Car 2: The variables for Library A.
- Car 3: The variables for Library B.
The Problem for the Compiler
When you are compiling Library B, the compiler has no idea:
- How big Car 1 (the main app) will be.
- If Library A will be loaded before or after Library B.
- If Library B will be loaded at startup or loaded later (via dlopen).
Because the compiler doesn’t know “how many cars are in front of it,” it cannot hardcode the distance (offset) from the Engine (TP) to its own variables.
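A rough way to picture this in code: if the cars are laid out back to back, the distance to any car is the sum of everything in front of it. The sketch below assumes a contiguous layout after a small header and ignores alignment padding; the function name, the 16-byte header, and the block sizes are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of module `index`'s TLS block from the thread pointer, assuming the
 * blocks sit back to back after a header of `header_size` bytes.
 * Alignment padding is ignored to keep the model small. */
size_t module_tls_offset(const size_t *block_sizes, size_t count,
                         size_t index, size_t header_size) {
    assert(index < count);
    size_t offset = header_size;
    for (size_t i = 0; i < index; i++) offset += block_sizes[i];
    return offset;
}
```

With block sizes {32, 64, 8}, the third module starts at 16 + 32 + 64 = 112. If the main program's block grows to 4096 bytes, the same library's data moves to 4176, even though the library itself was not recompiled.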
2. The Dynamic Thread Vector (DTV)
In complex scenarios (like dlopen on Linux), the memory layout isn’t even a contiguous block (a single train). It looks more like this:
- tpidr_el0 points to a Thread Control Block.
- Inside that block, there is a pointer to an array called the DTV (Dynamic Thread Vector).
- The DTV is a list of pointers, one per loaded module (the slot numbers here are illustrative):
  - Slot 0 -> points to Main App TLS data
  - Slot 1 -> points to Lib A TLS data
  - Slot 2 -> points to Lib B TLS data
To find a variable in Library B, the code has to:
- Read tpidr_el0.
- Find the DTV.
- Read the pointer at Slot 2.
- Add the offset inside that specific block.
Since the dynamic linker assigns “Slot 2” at runtime, the compiler cannot know the final offset relative to tpidr_el0 at compile time. It requires the helper function to consult this bookkeeping and return the variable’s current offset.
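The lookup steps above amount to two levels of indirection. Here is a simplified model in C; struct tcb, its fields, and tls_var_addr are hypothetical names, and a real thread control block contains more than a DTV pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified thread control block: just a pointer to the DTV. */
struct tcb {
    unsigned char **dtv;  /* dtv[slot] -> start of that module's TLS block */
    size_t dtv_len;       /* number of slots in the DTV */
};

/* Address of a variable: pick the module's block, then add the offset in it. */
void *tls_var_addr(const struct tcb *tp, size_t slot, size_t offset) {
    assert(slot < tp->dtv_len && tp->dtv[slot] != NULL);
    return tp->dtv[slot] + offset;
}
```

Because each slot is filled in when its module is loaded, the blocks do not need to be adjacent in memory. That is what makes dlopen workable, and it is also why the final address cannot be a compile-time constant.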
3. “Linker Relaxation” (Why you might see static offsets anyway)
You mentioned that in your opinion it should be static. You are right in the sense that performance-wise, we want it to be static.
If you compile this code as a shared object (.so), the compiler generates the slow blr helper code (General Dynamic) because it has to be safe.
However, if this object file ends up in the main executable during the linking phase, the Linker is smart enough to rewrite the code! This is called Linker Relaxation.
The Linker sees:
- “Hey, this code asked for a dynamic offset lookup…”
- “…but I know this is the main executable.”
- “…so I know the offset is definitely 16.”
It will physically overwrite the assembly instructions:
- It replaces the blr (call) with nop (do nothing).
- It replaces the complex setup with a simple mov or add.
Summary
- Compiler’s View: “I don’t know where this code will live (Exe vs Lib), so I will generate the Helper Function code just to be safe.”
- Linker’s View: “I am building the final binary now. If I can prove the offset is static, I will optimize the assembly to remove the helper.”
The snippet you pasted contains the un-optimized, “safe” version (TLSDESC), implying it was compiled with -fPIC and the linker didn’t (or couldn’t) relax it to the Local Exec model.
AFL++ is not multi-threaded by default.
By design, a single instance of afl-fuzz is a single-threaded process that occupies exactly one CPU core.
If you run afl-fuzz once on a 64-core machine, it will use 1 core and leave the other 63 idle. To utilize your full hardware, you must manually launch multiple separate instances of the fuzzer (processes), not threads.
1. Why isn’t it multi-threaded?
AFL++ (and the original AFL) relies on determinism and global state.
- It uses a “Forkserver” mechanism where the fuzzer pauses the target at main(), then fork()s a new child process for every single input attempt.
- Managing this via threads would be incredibly complex because the “target” (the program you are fuzzing) might crash, hang, or corrupt memory. If the fuzzer were just a thread in the same process, a crash in the target would crash the fuzzer too.
- Process isolation ensures that when the target crashes (which is the goal!), the fuzzer survives to record the crash.
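The isolation argument can be demonstrated with plain POSIX calls. The sketch below is not AFL++'s forkserver; it is a minimal fork-per-run model, and run_isolated, well_behaved, and crashing are names invented for the example.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `target` in a child process. Returns the terminating signal number if
 * the child crashed, 0 if it exited normally, or -1 if fork/waitpid failed. */
int run_isolated(void (*target)(void)) {
    assert(target != NULL);
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {   /* child: run the target, then exit cleanly */
        target();
        _exit(0);
    }
    int status = 0;   /* parent: wait and classify the outcome */
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

void well_behaved(void) { }
void crashing(void) { abort(); }  /* raises SIGABRT in the child only */
```

run_isolated(crashing) returns SIGABRT to the parent, which keeps running and can immediately launch the next attempt. Had crashing() been called on a thread inside the same process, the abort would have taken the caller down with it.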
2. How to run it in “Parallel Mode”
Since it doesn’t spawn threads automatically, you use Distributed/Parallel Fuzzing by running multiple processes that share a folder to sync their findings.
You typically run one “Main” instance (deterministic) and many “Secondary” instances (randomized).
Example for a 4-core system:
- Core 1 (Main): ./afl-fuzz -i inputs -o syncdir -M fuzzer01 ./target_app (The -M flag tells it to be the Master/Main instance).
- Core 2 (Secondary): ./afl-fuzz -i inputs -o syncdir -S fuzzer02 ./target_app (The -S flag tells it to be a Secondary instance).
- Core 3 & 4 (More Secondaries):
  ./afl-fuzz -i inputs -o syncdir -S fuzzer03 ./target_app
  ./afl-fuzz -i inputs -o syncdir -S fuzzer04 ./target_app
All instances watch the syncdir directory. If fuzzer02 finds a new interesting path, fuzzer01 will see it in the folder and add it to its own queue.
3. Connection to your previous question (TLS)
This explains why the TLS instrumentation (tpidr_el0) you asked about earlier is so important.
Even though AFL++ itself is single-threaded, the Target Program you are fuzzing might be multi-threaded.
- If your target app spawns 4 threads, they will all try to write to the coverage map (the “Map” in your assembly code).
- Without TLS (Thread Local Storage), Thread A and Thread B in the target would overwrite each other’s prev_loc variable, corrupting the coverage data.
- The tpidr_el0 instrumentation ensures that even if the target is multi-threaded, the coverage reporting remains accurate for each thread.