AFL Coverage Instrumentation Callback

0000000000000bc0 <bbCallback>:
bc0:	90000102 	adrp	x2, 20000 <_exit@GLIBC_2.17>
bc4:	f9404c43 	ldr	x3, [x2, #152]
bc8:	b4000263 	cbz	x3, c14 <bbCallback+0x54>
bcc:	d53bd042 	mrs	x2, tpidr_el0
bd0:	a9bf7bfd 	stp	x29, x30, [sp, #-16]!
bd4:	12003c01 	and	w1, w0, #0xffff
bd8:	910003fd 	mov	x29, sp
bdc:	90000100 	adrp	x0, 20000 <_exit@GLIBC_2.17>
be0:	f9403404 	ldr	x4, [x0, #104]
be4:	9101a000 	add	x0, x0, #0x68
be8:	d63f0080 	blr	x4
bec:	78606844 	ldrh	w4, [x2, x0]
bf0:	53017c25 	lsr	w5, w1, #1
bf4:	78206845 	strh	w5, [x2, x0]
bf8:	4a040021 	eor	w1, w1, w4
bfc:	92403c21 	and	x1, x1, #0xffff
c00:	38616860 	ldrb	w0, [x3, x1]
c04:	11000400 	add	w0, w0, #0x1
c08:	38216860 	strb	w0, [x3, x1]
c0c:	a8c17bfd 	ldp	x29, x30, [sp], #16
c10:	d65f03c0 	ret
c14:	d65f03c0 	ret
c18:	d503201f 	nop
c1c:	d503201f 	nop

This code snippet is an implementation of the AFL (American Fuzzy Lop) coverage instrumentation callback for the AArch64 (ARM64) architecture.

Specifically, it appears to be the coverage hook, often named __afl_maybe_log or (as here) bbCallback, that the instrumentation (for example afl-gcc or afl-clang-fast) calls at the start of each basic block to track code coverage during fuzzing.

High-Level Logic

The function performs the following standard AFL logic:

Check Initialization: Checks if the shared memory bitmap (the “Map”) is initialized.
Thread Local Storage (TLS): Retrieves a thread-local variable that stores the previous_location (ID of the last block visited).
Edge Calculation: Computes a unique hash for the transition (edge) between the previous block and the current block using: index = current_location ^ previous_location.
Coverage Recording: Increments the counter in the Map at that index.
Update State: Updates previous_location to current_location >> 1 for the next block.

Step-by-Step Instruction Breakdown

1. Check if the Map exists

bc0: 90000102   adrp x2, 20000        ; Load page address of global variables
bc4: f9404c43   ldr  x3, [x2, #152]   ; Load the global pointer `__afl_area_ptr` (the Map) into x3
bc8: b4000263   cbz  x3, c14          ; If x3 is NULL (0), jump to c14 (return immediately)

The code loads the pointer to the shared memory coverage map. If the fuzzer hasn’t initialized this yet, it simply returns to avoid a crash.

2. TLS Setup & Context Saving

bcc: d53bd042   mrs  x2, tpidr_el0    ; Read tpidr_el0, the EL0 software thread ID register (TLS base), into x2
bd0: a9bf7bfd   stp  x29, x30, [sp, #-16]! ; Push Frame Pointer (FP) and Link Register (LR) to stack
bd4: 12003c01   and  w1, w0, #0xffff  ; Mask the input (Current Block ID) to 16 bits. Keep in w1.
bd8: 910003fd   mov  x29, sp          ; Set up the frame pointer

w0 contains the Current Block ID (passed by the caller).
tpidr_el0 is used to access thread-local variables. This matters when the fuzzed target is multi-threaded: it ensures the previous_location variable is unique per thread.

3. Resolve TLS Offset for `__afl_prev_loc`

bdc: 90000100   adrp x0, 20000        ; Load address for GOT/Global area
be0: f9403404   ldr  x4, [x0, #104]   ; Load a function pointer (likely a TLS descriptor helper)
be4: 9101a000   add  x0, x0, #0x68    ; Prepare argument for the helper
be8: d63f0080   blr  x4               ; Call helper function

This block calls a helper function to determine the offset of the thread-local variable __afl_prev_loc.
After blr, x0 contains the offset of __afl_prev_loc relative to the thread pointer (x2).

4. Load Previous Location & Calculate Edge

bec: 78606844   ldrh w4, [x2, x0]     ; Load `prev_loc` (16-bit) from [TLS_Base + Offset]
bf0: 53017c25   lsr  w5, w1, #1       ; Calculate `next_prev_loc` = (Current_ID >> 1)
bf4: 78206845   strh w5, [x2, x0]     ; Store `next_prev_loc` back to TLS for the next time
bf8: 4a040021   eor  w1, w1, w4       ; Calculate Edge Index: (Current_ID ^ prev_loc)
bfc: 92403c21   and  x1, x1, #0xffff  ; Ensure Index fits in 64KB (16 bits)

AFL Algorithm Core:
- Current Edge = Current_ID ^ Prev_ID.
- Future Prev_ID = Current_ID >> 1.
The shift (lsr) preserves the direction of the edge (i.e., A -> B creates a different hash than B -> A). It also keeps tight loops apart: without it, A -> A and B -> B would both hash to index 0.

5. Update Coverage Map

c00: 38616860   ldrb w0, [x3, x1]     ; Load byte from Map[Index]
c04: 11000400   add  w0, w0, #0x1     ; Increment counter
c08: 38216860   strb w0, [x3, x1]     ; Store byte back to Map[Index]

This increments the “hit count” for this specific code path transition.

6. Return

c0c: a8c17bfd   ldp  x29, x30, [sp], #16 ; Pop FP and LR
c10: d65f03c0   ret                   ; Return to the instrumented function
c14: d65f03c0   ret                   ; Early return path (if Map was NULL)

Summary

This is the assembly equivalent of the following C logic used in fuzzing:

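// Symbol names are inferred from AFL conventions; the disassembly does not name them.
extern unsigned char *__afl_area_ptr;           // shared coverage map (64 KB)
extern __thread unsigned short __afl_prev_loc;  // 16-bit, matching the ldrh/strh accesses

// w0 is the compile-time random ID for the current block
void bbCallback(unsigned int cur_loc) {
  unsigned char *map = __afl_area_ptr;
  if (!map) return;

  // Keep only the low 16 bits, as `and w1, w0, #0xffff` does
  cur_loc &= 0xFFFF;

  // Access thread-local storage for previous location
  unsigned int prev_loc = __afl_prev_loc;

  // Update previous location for the next callback
  // (the assembly stores it before touching the map)
  __afl_prev_loc = cur_loc >> 1;

  // Calculate edge index
  unsigned int index = (cur_loc ^ prev_loc) & 0xFFFF;

  // Update map
  map[index]++;
}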

Here is an explanation of these three concepts (Thread Local Storage, thread-local variable, and tpidr_el0), moving from the high-level software concept down to the specific hardware implementation in ARM64.

1. Thread Local Storage (TLS) — The System

Thread Local Storage is a computer programming method that uses static or global memory local to a thread.

The Problem: Normally, global variables are shared by all threads in a program. If Thread A changes a global variable, Thread B sees the change. This creates race conditions and requires locking (mutexes), which is slow.
The Solution (TLS): TLS allows you to define a “global” variable where each thread gets its own unique copy.
Analogy:
- Global Memory: A whiteboard in the middle of an office. Everyone shares it. If you write on it, I see it.
- TLS: A notebook given to every employee. Everyone has a notebook called “Notes,” but what I write in mine is completely private from what you write in yours.

2. Thread-Local Variable — The Programming Object

A thread-local variable is the specific variable that lives inside the TLS.

In C or C++, you declare them using keywords like __thread, _Thread_local, or thread_local.

Classic Example: errno
In standard C programming, errno contains the error code of the last failed system call.

If Thread A tries to open a missing file, errno becomes 2 (ENOENT).
If Thread B is running successfully at the same time, its errno should remain 0.
Therefore, errno is a thread-local variable. If it were a standard global variable, Thread B might incorrectly think it encountered an error because Thread A failed.

In the context of your previous AFL code snippet, the variable __afl_prev_loc is a thread-local variable. This ensures that if you are fuzzing with multiple threads, Thread A’s execution history doesn’t mix with Thread B’s history.

3. `tpidr_el0` — The Hardware Register

This is the specific ARM64 (AArch64) CPU register used to implement TLS efficiently.

Name Breakdown:
- TPID: Thread Pointer / Thread ID (Arm’s full name for the register is “EL0 Read/Write Software Thread ID Register”)
- R: Register
- EL0: Exception Level 0 (User Mode)
Purpose: It holds the base address of the memory region allocated for the currently running thread.

When the Operating System switches threads (context switch), it updates tpidr_el0 to point to the new thread’s private data area.

How they work together (The Flow)

When your code wants to access a thread-local variable, the generated instructions perform the following steps (simplified):

Get the Base: The code reads tpidr_el0 to find out “Where does the current thread’s private memory start?”
Get the Offset: The compiler knows that Variable X is located, say, 16 bytes from the start of that memory block.
Calculate Address: Target Address = tpidr_el0 + 16.
Access: Read/Write to that address.

Applying it to your code snippet

Let’s look at lines bcc to bec from your snippet again:

bcc: d53bd042   mrs  x2, tpidr_el0    ; 1. READ the Thread Pointer base address into x2
...
be8: d63f0080   blr  x4               ; 2. CALL a helper to get the variable's OFFSET (puts it in x0)
bec: 78606844   ldrh w4, [x2, x0]     ; 3. ACCESS memory at [Base (x2) + Offset (x0)]

Why was this necessary in the code?
AFL updates prev_loc (previous location) on every basic block. If the target is multi-threaded and prev_loc were a standard global variable, two threads running instrumented code simultaneously would overwrite each other’s prev_loc, and the map would record edges that no thread actually took. By using tpidr_el0 to access a thread-local copy, each thread tracks its own path independently.

“CALL a helper to get the variable’s OFFSET”

The helper function is necessary because of Dynamic Linking.

When you compile this code (likely as a shared library or a Position Independent Executable), the compiler does not know where the variable __afl_prev_loc will be located in memory relative to the thread pointer.

Here is the breakdown of why the helper is needed and what it does.

1. The Core Problem: “I don’t know where I am yet”

If you are writing a standard executable (like main.exe), the compiler can calculate exactly where every variable is. It can say “Variable X is at offset 16.”

However, this code is likely compiled as Position Independent Code (PIC). This means:

This code might be inside a shared library (libafl.so).
That library could be loaded into memory at any address.
The library might be loaded after the program starts (using dlopen).

Because of this, the offset of __afl_prev_loc is unknown at compile time. The compiler cannot simply write add x0, x2, #16. It needs to ask the “runtime linker” where the variable ended up.

2. The Solution: TLS Descriptors (TLSDESC)

The specific instruction sequence you see (adrp, ldr, add, blr) is the signature of the TLSDESC (TLS Descriptors) model, which is the default for AArch64/ARM64.

It works like a “lazy question”:

Preparation: adrp and add compute the address of a “descriptor” in the Global Offset Table (GOT) and leave it in x0 as the helper’s argument; ldr loads the helper’s address from that descriptor.
The Call (blr): You call the helper function stored in that descriptor.
The Answer: The helper calculates just the offset for you and returns it in x0.
The Access: You add that offset to your Thread Pointer (tpidr_el0) to find the variable.

3. Why is it a “Helper” and not just a standard function?

It is a special, highly optimized function provided by the dynamic linker (like ld-linux.so).

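It preserves registers: Notice that the code didn’t save x1, x2, or x3 before calling blr x4, even though it uses them afterwards. A standard function call would be free to overwrite them. The TLS helper guarantees it won’t touch your temporary registers (except x0, which carries the result), making it very fast. The blr itself still overwrites the link register x30, which is why the function pushes x29/x30 first.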
It can change: When the program first starts, the helper might point to a “slow path” that calculates the offset and caches it. The next time you call it, it might point to a “fast path” that just returns the now-known constant.

Summary

You need the helper because the variable’s location is dynamic.

Static Build: “Go to offset 16.” (Fast, no helper needed)
Dynamic Build (Your Code): “Call this helper; it will figure out where the variable lives right now, then tell me the offset.”

If this code were part of the main executable (e.g., the .exe file), you would be absolutely correct: the offset would be static/constant. This is called the Local Exec (LE) model.

However, the assembly you provided is using the General Dynamic (GD) or TLSDESC model. This happens because the compiler is being conservative. It assumes this code might end up in a Shared Library (.so).

Here is why the offset cannot be static in a Shared Library.

1. The “Train” Analogy

Imagine the Thread Local Storage (TLS) memory area as a long train attached to the Thread Pointer (TP).

The Engine: The Thread Pointer (tpidr_el0).
Car 1: The variables for the Main Executable.
Car 2: The variables for Library A.
Car 3: The variables for Library B.

The Problem for the Compiler

When you are compiling Library B, the compiler has no idea:

How big Car 1 (the main app) will be.
If Library A will be loaded before or after Library B.
If Library B will be loaded at startup or loaded later (via dlopen).

Because the compiler doesn’t know “how many cars are in front of it,” it cannot hardcode the distance (offset) from the Engine (TP) to its own variables.

2. The Dynamic Thread Vector (DTV)

In complex scenarios (like dlopen on Linux), the memory layout isn’t even a contiguous block (a single train). It looks more like this:

tpidr_el0 points to a Thread Control Block.
Inside that block, there is a pointer to an array called the DTV (Dynamic Thread Vector).
The DTV is a list of pointers:
- Slot 0 -> points to Main App TLS data
- Slot 1 -> points to Lib A TLS data
- Slot 2 -> points to Lib B TLS data

To find a variable in Library B, the code has to:

Read tpidr_el0.
Find the DTV.
Read the pointer at Slot 2.
Add the offset inside that specific block.

Since the dynamic linker assigns “Slot 2” at runtime, the compiler cannot know the final offset relative to tpidr_el0 at compile time. It requires the helper function to look up the DTV and calculate the current address.

3. “Linker Relaxation” (Why you might see static offsets anyway)

You mentioned that in your opinion it should be static. You are right in the sense that performance-wise, we want it to be static.

If you compile this code as a shared object (.so), the compiler generates the slow blr helper code (General Dynamic) because it has to be safe.

However, if this object file ends up in the main executable during the linking phase, the Linker is smart enough to rewrite the code! This is called Linker Relaxation.

The Linker sees:

“Hey, this code asked for a dynamic offset lookup…”
“…but I know this is the main executable.”
“…so I know the offset is definitely 16.”

It will physically overwrite the assembly instructions:

It replaces the blr (call) with nop (do nothing).
It replaces the complex setup with a simple mov or add.

Summary

Compiler’s View: “I don’t know where this code will live (Exe vs Lib), so I will generate the Helper Function code just to be safe.”
Linker’s View: “I am building the final binary now. If I can prove the offset is static, I will optimize the assembly to remove the helper.”

The snippet you pasted contains the un-optimized, “safe” version (TLSDESC), implying it was compiled with -fPIC and the linker didn’t (or couldn’t) relax it to the Local Exec model.

AFL++ is not multi-threaded by default.

By design, a single instance of afl-fuzz is a single-threaded process that occupies exactly one CPU core.

If you run afl-fuzz once on a 64-core machine, it will use 1 core and leave the other 63 idle. To utilize your full hardware, you must manually launch multiple separate instances of the fuzzer (processes), not threads.

1. Why isn’t it multi-threaded?

AFL++ (and the original AFL) relies on determinism and global state.

It uses a “Forkserver” mechanism where the fuzzer pauses the target at main(), then fork()s a new child process for every single input attempt.
Managing this via threads would be incredibly complex because the “target” (the program you are fuzzing) might crash, hang, or corrupt memory. If the fuzzer were just a thread in the same process, a crash in the target would crash the fuzzer too.
Process isolation ensures that when the target crashes (which is the goal!), the fuzzer survives to record the crash.

2. How to run it in “Parallel Mode”

Since it doesn’t spawn threads automatically, you use Distributed/Parallel Fuzzing by running multiple processes that share a folder to sync their findings.

You typically run one “Main” instance (deterministic) and many “Secondary” instances (randomized).

Example for a 4-core system:

Core 1 (Main): ./afl-fuzz -i inputs -o syncdir -M fuzzer01 ./target_app (The -M flag tells it to be the Master/Main instance).
Core 2 (Secondary): ./afl-fuzz -i inputs -o syncdir -S fuzzer02 ./target_app (The -S flag tells it to be a Secondary instance).
Core 3 & 4 (More Secondaries):
./afl-fuzz -i inputs -o syncdir -S fuzzer03 ./target_app
./afl-fuzz -i inputs -o syncdir -S fuzzer04 ./target_app

All instances watch the syncdir directory. If fuzzer02 finds a new interesting path, fuzzer01 will see it in the folder and add it to its own queue.

3. Connection to your previous question (TLS)

This explains why the TLS instrumentation (tpidr_el0) you asked about earlier is so important.

Even though AFL++ itself is single-threaded, the Target Program you are fuzzing might be multi-threaded.

If your target app spawns 4 threads, they will all try to write to the coverage map (the “Map” in your assembly code).
Without TLS (Thread Local Storage), Thread A and Thread B in the target would overwrite each other’s prev_loc variable, corrupting the coverage data.
The tpidr_el0 instrumentation ensures that even if the target is multi-threaded, each thread computes its edges from its own previous block, although all threads still write to the same shared map.
