https://www.npopov.com/2023/04/07/LLVM-middle-end-pipeline.html
Components of LLVM:
LLVM is designed with a three-phase architecture: Frontend, Middle-end, and Backend. Each component plays a crucial role in translating high-level source code into machine-executable code, while allowing modularity and optimizations across different languages and architectures.
1. Frontend
The frontend is responsible for:
- Parsing the high-level source code (written in languages like C, C++, Rust, etc.).
- Checking for syntax and semantic correctness.
- Generating the LLVM Intermediate Representation (IR), which is an abstract, platform-independent code representation.
Key Processes in the Frontend:
- Lexical Analysis (Tokenization): The source code is broken into meaningful chunks called tokens (e.g., keywords, identifiers, literals, and operators). This step converts the text into a structured format.
- Parsing (Syntax Analysis): After tokenization, the parser constructs an Abstract Syntax Tree (AST), a hierarchical tree that represents the grammatical structure of the program. The AST reflects how different code elements (like loops, function calls, or expressions) are nested and organized.
- Semantic Analysis: The compiler checks for semantic correctness — meaning whether the operations in the code are logically valid. For example, trying to add an integer and a string would cause a semantic error. Type-checking, scope resolution, and identifier binding occur here.
- Code Generation (to LLVM IR): The frontend converts the AST into LLVM IR. LLVM IR is a low-level, hardware-independent representation that is simpler than high-level languages but still structured enough to represent complex language constructs. This code is suitable for optimization and is used as the input for the next phase (Middle-end).
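To make these stages concrete, here is a small sketch: a trivial C function annotated with roughly what each frontend stage would produce for it. The tokens, AST shape, and IR description are simplified illustrations, not verbatim Clang output.
/* add.c -- a tiny function used to illustrate the frontend stages.
 *
 * Lexical analysis produces tokens roughly like:
 *   'int', 'add', '(', 'int', 'a', ',', 'int', 'b', ')', '{', 'return', 'a', '+', 'b', ';', '}'
 * Parsing builds an AST roughly like:
 *   FunctionDecl add -> ReturnStmt -> BinaryOperator '+' (a, b)
 * Semantic analysis checks that a and b are in scope and that int + int
 * is a well-typed operation.
 * Code generation then lowers the AST to LLVM IR: conceptually a single
 * integer add followed by a return. */
int add(int a, int b) {
    return a + b;
}
With Clang you can inspect the real output of each stage, for example with clang -Xclang -ast-dump for the AST and clang -S -emit-llvm for the LLVM IR.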
Popular LLVM Frontends:
- Clang: A frontend for C, C++, Objective-C.
- Rustc: The Rust compiler uses LLVM as its backend, generating LLVM IR from Rust source code.
- Swift: The Swift language frontend also uses LLVM for code generation and optimization.
- llvmlite / llvmpy: Python bindings for LLVM that let you build frontends for custom languages (like the SimpleLang example earlier).
2. Middle-end (IR Optimizations)
The middle-end of LLVM operates on the LLVM Intermediate Representation (IR) generated by the frontend. This phase is where most of the optimizations take place. These optimizations aim to improve performance, reduce code size, and enhance runtime efficiency, without changing the program’s functionality.
Key Characteristics of LLVM IR:
- Hardware-agnostic: LLVM IR is not tied to any specific machine architecture, making it portable across different platforms.
- Three Forms: LLVM IR comes in three equivalent forms — textual, binary bitcode, and in-memory data structures — enabling different usages and transformations.
- Textual: Human-readable format (as shown in the previous example).
- Bitcode: A compressed binary format for efficient storage and transfer.
- In-memory: Data structures that the LLVM compiler uses internally for processing and transformation.
Optimizations in the Middle-end:
- Dead Code Elimination: Removes code that is never executed or whose results are not used (e.g., unused variables or unreachable code).
- Constant Folding: Replaces constant expressions (like 5 + 3) with their computed result (8), which reduces the runtime workload.
- Common Subexpression Elimination (CSE): Identifies and eliminates duplicate calculations. For example, if the same expression is computed multiple times, it only computes it once and reuses the result.
- Loop Optimizations:
- Loop Unrolling: Expands loops to reduce overhead associated with loop control (e.g., reducing the number of iterations).
- Loop Invariant Code Motion: Moves code outside the loop if it doesn’t change during iterations.
- Inlining: Replaces function calls with the function’s actual code, reducing the overhead of the call at runtime (though this can increase the size of the code).
- Vectorization: Combines multiple scalar operations into a single vector operation to take advantage of SIMD (Single Instruction, Multiple Data) instructions, boosting performance on modern CPUs.
- Register Allocation: Determines which values should be kept in CPU registers (for faster access) and which should live in memory. In LLVM this is actually performed by the backend rather than the middle-end (see below).
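As a rough illustration, the following C function contains patterns that several of the passes above typically clean up. The comments describe what an optimizer like LLVM's is likely to do at -O2, though the exact result depends on pass ordering and heuristics.
/* opt_demo.c -- illustrative targets for common middle-end optimizations. */
int opt_demo(int n) {
    int unused = n * 7;          /* dead code: the result is never used          */
    int folded = 5 + 3;          /* constant folding: becomes 8 at compile time  */
    int a = n * folded + 1;
    int b = n * folded + 2;      /* CSE: n * folded is computed once and reused  */
    int sum = 0;
    for (int i = 0; i < 4; i++)  /* small constant trip count: may be unrolled   */
        sum += a + b;            /* a + b is loop-invariant: may be hoisted      */
    return sum;
}
You can watch these transformations happen by emitting the optimized IR with clang -O2 -S -emit-llvm opt_demo.c, or by running individual passes with the opt tool.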
LLVM’s middle-end offers over 200 built-in optimizations that can be applied at compile-time, link-time, or runtime (in the case of Just-in-Time (JIT) compilation).
3. Backend
The backend takes the optimized LLVM IR from the middle-end and translates it into machine-specific code. This is the final stage in the LLVM compilation process, where the IR is mapped to the underlying hardware architecture. The backend generates machine code that the target CPU can execute directly.
Key Responsibilities of the Backend:
- Instruction Selection: Converts LLVM IR into assembly or machine code specific to the target architecture (e.g., x86, ARM, RISC-V). Different architectures have different instruction sets, and the backend selects the most efficient instruction for each operation.
- Register Allocation: The backend handles the allocation of values to CPU registers. Efficient register allocation is critical for performance, as accessing registers is much faster than accessing memory.
- Instruction Scheduling: Arranges the instructions in an optimal order to minimize CPU stalls and pipeline hazards. Some CPUs have specific pipeline architectures, and ordering instructions properly can lead to more efficient execution.
- Code Generation: Produces assembly language or machine code (binary) that can be executed on the target architecture. If the goal is to create an executable file, the backend also handles the generation of object code, which can then be linked with libraries and other object files.
- Target-specific Optimizations: The backend may apply optimizations that are specific to the target architecture. For example, certain processors may have specialized instructions that the backend can take advantage of, such as vector instructions for multimedia processing.
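As a small example of instruction selection and target-specific optimization, the same C source can map to quite different instructions depending on the target; the details below are illustrative rather than guaranteed for any particular target or compiler version.
/* fma_demo.c -- the generated code depends on the target architecture.
 * On targets with a fused multiply-add instruction (for example, ARM
 * NEON or x86 CPUs with FMA support), the backend may select a single
 * fused multiply-add for the expression below; on others it emits a
 * separate multiply and add. Whether contraction happens can also
 * depend on flags such as -ffp-contract or -ffast-math. */
double fma_demo(double a, double b, double c) {
    return a * b + c;
}
Compiling this with clang -O2 -S for different targets (e.g., via --target=) shows how the backend chooses different instruction sequences for the same IR.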
Backend Output:
- Assembly code: Human-readable machine instructions specific to the CPU.
- Machine code: The binary executable code that can be loaded and run by the operating system on the CPU.
- Object files: Intermediate binary files that are linked together to create the final executable.
Popular LLVM Backends:
- x86: For Intel and AMD processors.
- ARM: For mobile devices and embedded systems.
- RISC-V: For open-source processors.
- PowerPC: For IBM’s Power architecture.
LLVM also supports Just-In-Time (JIT) compilation, where the backend converts LLVM IR to machine code dynamically during program execution, allowing for runtime optimizations.
Putting it All Together
- Frontend: Takes high-level source code and converts it to LLVM IR.
- Middle-end: Optimizes the LLVM IR, applying various transformations to improve performance, reduce memory usage, and optimize resource use.
- Backend: Converts the optimized IR into machine-specific code, taking into account the target architecture, registers, and instruction set.
For example, if you compile a C program with Clang (LLVM’s frontend for C/C++), the process looks like this:
- Clang translates the C code into LLVM IR.
- The LLVM optimizer applies various optimizations to the IR.
- The LLVM backend translates the optimized IR into machine code for the CPU (e.g., x86 or ARM).
- The machine code is then linked with libraries and converted into a final executable.
This modular and flexible architecture makes LLVM powerful, enabling support for a wide range of programming languages and hardware architectures, with highly optimized code output.
Key Features of LLVM
LLVM’s architecture and design have been crafted to support a wide variety of programming languages, architectures, and optimization techniques. Its key features, such as modular design, language agnosticism, and powerful optimizations, make it a versatile tool for compiler developers, language designers, and system architects. Below, we dive into these features.
1. Modular Design
LLVM is built with a modular architecture, which means it is composed of distinct libraries and components that work together to form the complete toolchain. This design allows each component to be reused, extended, or replaced without affecting the entire system.
Key Aspects of LLVM’s Modular Design:
Separation of Frontend, Middle-end, and Backend: Each phase of the compiler (frontend, IR optimization, backend) is handled by separate modules. This means you can develop a new language frontend and plug it into the existing LLVM infrastructure without needing to modify other parts of the system.
Reusable Libraries: LLVM is divided into multiple libraries that serve specific purposes. These include:
- libLLVMCore: Implements core data structures like LLVM IR.
- libLLVMMC: Handles machine code generation and target-specific backends.
- libLLVMTransforms: Provides a collection of optimization passes.
- libLLVMExecutionEngine: Powers Just-in-Time (JIT) compilation.
This modularity allows users to pick and choose only the libraries they need for their specific use case, ensuring that the system remains lightweight and customizable.
Extendable Components: If LLVM doesn’t support a specific feature we need (e.g., a new optimization or a custom target architecture), we can extend it by developing a new pass, backend, or transformation and integrating it with the existing system.
Custom Frontends: We can create our own language frontend that parses the source code and generates LLVM IR. This IR can then take advantage of LLVM’s middle-end optimizations and backend code generation. Languages like Rust, Swift, and Julia all have custom frontends that generate LLVM IR.
Example: Leveraging Modular Design
We could write a new language frontend that generates LLVM IR from a custom language (e.g., SimpleLang). Without needing to worry about optimization or code generation ourselves, we can then reuse LLVM’s middle-end and backend to compile it into efficient machine code for multiple platforms.
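As a minimal sketch of what the code-generation half of such a frontend could look like, the following program uses LLVM's C API (llvm-c/Core.h) to build a module containing a single square function and print its textual IR. It assumes an LLVM development installation; the module and function names are illustrative, and a real frontend would generate IR from a parsed AST rather than hard-coding it.
/* mini_frontend.c -- emit LLVM IR for: int square(int x) { return x * x; }
 * Build (assuming llvm-config is available):
 *   cc mini_frontend.c $(llvm-config --cflags --ldflags --libs core) -o mini_frontend */
#include <llvm-c/Core.h>
#include <stdio.h>

int main(void) {
    LLVMContextRef ctx = LLVMContextCreate();
    LLVMModuleRef mod = LLVMModuleCreateWithNameInContext("simplelang", ctx);

    /* Declare: i32 @square(i32) */
    LLVMTypeRef i32 = LLVMInt32TypeInContext(ctx);
    LLVMTypeRef params[] = { i32 };
    LLVMTypeRef fn_type = LLVMFunctionType(i32, params, 1, 0);
    LLVMValueRef square = LLVMAddFunction(mod, "square", fn_type);

    /* Build the body: %mul = mul i32 %x, %x ; ret i32 %mul */
    LLVMBasicBlockRef entry = LLVMAppendBasicBlockInContext(ctx, square, "entry");
    LLVMBuilderRef builder = LLVMCreateBuilderInContext(ctx);
    LLVMPositionBuilderAtEnd(builder, entry);
    LLVMValueRef x = LLVMGetParam(square, 0);
    LLVMValueRef mul = LLVMBuildMul(builder, x, x, "mul");
    LLVMBuildRet(builder, mul);

    /* Print the textual IR; a real frontend would instead hand the module
     * to LLVM's optimization and code-generation pipelines. */
    char *ir = LLVMPrintModuleToString(mod);
    printf("%s", ir);
    LLVMDisposeMessage(ir);

    LLVMDisposeBuilder(builder);
    LLVMDisposeModule(mod);
    LLVMContextDispose(ctx);
    return 0;
}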
2. Language Agnostic
LLVM’s language-agnostic nature means it doesn’t cater to any one programming language specifically but provides a platform for any language to be compiled. This is achieved through the use of LLVM Intermediate Representation (IR), a common, low-level code format that is designed to be flexible enough to represent various high-level language features.
Key Characteristics of LLVM’s Language Agnosticism:
- LLVM IR as the Common Denominator: LLVM IR is a platform-independent, typed intermediate language. It is rich enough to support the features of most modern programming languages, including:
- Static and dynamic types.
- High-level abstractions like objects and closures.
- Low-level operations like memory manipulation and hardware interaction.
By generating LLVM IR, we can compile languages as diverse as C++, Python, Rust, or even custom domain-specific languages (DSLs).
- Frontends for Multiple Languages: LLVM supports a wide variety of languages via different frontends, such as:
- Clang for C, C++, and Objective-C.
- Rustc for Rust.
- Swiftc for Swift.
- Emscripten for compiling C/C++ to WebAssembly (via LLVM).
- GHC (via its LLVM backend) for Haskell.
- The common factor is that these frontends translate high-level language constructs into LLVM IR, which the LLVM middle-end and backend then process.
- Custom Language Support: We can write a frontend for any language (or DSL) that we want to compile to LLVM. This could range from a general-purpose programming language to a highly specialized language tailored for a particular application domain (like GPU programming, AI model description, or scientific computation).
Example: Compiling a Custom Language to LLVM IR
If we design a new language like MatLang for mathematical computations, we can develop a frontend that translates MatLang expressions and constructs into LLVM IR. Once the IR is generated, we can use LLVM’s middle-end to optimize the code and the backend to generate machine-specific assembly code, leveraging LLVM’s built-in support for multiple architectures.
3. Optimization
One of LLVM’s most powerful features is its robust set of built-in optimizations. These optimizations can be applied at compile-time, link-time, or even run-time (Just-in-Time (JIT) compilation). This multi-phase optimization system ensures that the code runs efficiently on the target hardware, minimizing resource usage, improving performance, and optimizing memory usage.
Types of Optimizations in LLVM:
Compile-time Optimizations: These are applied when the source code is compiled into machine code. Examples include:
- Constant Folding: Simplifying constant expressions at compile-time (e.g., replacing 3 + 5 with 8).
- Dead Code Elimination: Removing code that is never executed or whose result is never used.
- Inlining: Replacing function calls with the actual code of the function to avoid the overhead of a call.
Link-time Optimizations (LTO): These optimizations happen when object files and libraries are linked together to form the final executable. The advantage of link-time optimization is that it has a broader view of the entire program and can apply more aggressive optimizations:
- Whole Program Optimization: Optimizations can be applied across different modules (files) since the linker has visibility into all the code. It can inline functions across module boundaries, eliminate unnecessary function calls, and more.
- Global Dead Code Elimination: Code that appears necessary in individual modules but is found to be unused in the context of the entire program can be removed at this stage.
Run-time Optimizations (JIT): Just-in-Time (JIT) compilation is a process where LLVM compiles code at runtime rather than at compile-time, making it possible to optimize code dynamically based on the execution context. This is useful in scenarios like virtual machines (e.g., in the implementation of languages like JavaScript or Python):
- Adaptive Optimization: The JIT compiler can analyze the code while it’s running and recompile hot paths (frequently executed code) to optimize their performance.
- Profile-guided Optimizations (PGO): By analyzing runtime data (which functions are called most often, which branches are taken), the JIT can optimize the code for those specific execution patterns.
Target-specific Optimizations: LLVM’s backend can apply optimizations tailored to specific hardware architectures:
- Instruction Scheduling: The backend reorders instructions to better fit the CPU pipeline architecture, reducing stalls and increasing instruction throughput.
- Vectorization: The compiler can take advantage of SIMD (Single Instruction, Multiple Data) instructions available on modern CPUs by grouping operations into vector operations, speeding up tasks like matrix multiplications or image processing.
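As a hedged illustration of the last point, the loop below is the kind of code LLVM's loop vectorizer turns into SIMD operations at -O2/-O3 on targets with vector units; whether and how it is vectorized depends on the target and on cost-model heuristics.
/* vec_demo.c -- a candidate for the loop vectorizer. The restrict
 * qualifiers tell the compiler that the arrays do not overlap, which
 * makes vectorization legal without runtime checks. */
void vec_demo(float *restrict dst, const float *restrict a,
              const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] * b[i];
}
Clang can report what the vectorizer did via optimization remarks, e.g. by compiling with -Rpass=loop-vectorize.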
Specific Optimization Passes:
- Dead Code Elimination: Removes instructions that are not needed (i.e., whose results are never used).
- Loop Unrolling: Expands loops to reduce the number of iterations and the associated overhead of loop control.
- Loop Invariant Code Motion: Moves code outside a loop if it doesn’t depend on the loop, reducing repeated computations.
- Tail Call Optimization: Reuses the caller’s stack frame for calls in tail position; for tail-recursive functions this effectively turns the recursion into iteration (reducing stack usage).
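For instance, the tail-recursive function below is the kind of code that tail call optimization targets; this is a sketch of the pattern, not a guarantee about what any particular LLVM version does with it.
/* tail_sum.c -- the recursive call is in tail position, so the optimizer
 * can reuse the current stack frame (or rewrite the recursion as a loop),
 * keeping stack usage constant regardless of n. */
int tail_sum(int n, int acc) {
    if (n == 0)
        return acc;
    return tail_sum(n - 1, acc + n);   /* tail call */
}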
Example: Compile-time Optimization
Consider this simple C code:
int square(int x) {
return x * x;
}
int main() {
int result = square(5);
return result;
}
LLVM can apply inlining and constant folding during compilation, replacing the call square(5) with 25 directly. The final code might simply return the constant 25, skipping the actual multiplication at runtime.
Summary of Key Features:
- Modular Design: The modularity of LLVM allows flexibility, enabling developers to work on individual components (frontends, optimizations, backends) independently. We can create a custom language frontend, plug it into LLVM, and benefit from existing optimizations and machine-specific code generation.
- Language Agnostic: LLVM IR serves as a common intermediate representation that makes it possible to compile a wide range of languages, from high-level scripting languages to low-level system languages. This allows LLVM to act as a powerful backend for many compilers.
- Optimization: LLVM’s rich set of optimizations can improve code performance at multiple stages (compile-time, link-time, and run-time). These optimizations help generate efficient machine code tailored to the target architecture, ensuring that programs run faster and use fewer resources.
LLVM: The middle-end optimization pipeline
Compilers are classically split into three parts: The front end, the middle end and the back end. The front end is what deals with programming-language specific analysis, such as parsing, type checking, etc.
The middle end performs optimizations that are independent (to the degree that this is possible) from both the input language and the output target. Finally, the back end performs target-specific optimizations and emits machine code.
This separation is what saves everyone from reinventing the wheel. New programming languages only need to implement the frontend. New CPU architectures only need to implement the backend. And indeed, while there are many, many different language frontends, they nearly always rely on LLVM for the middle and backends:
Frontend LLVM
/---------\ /-----------------------------------\
| clang | | |
| rustc | --(LLVM IR)--> | Middle end --(LLVM IR)--> Backend | --> Machine code
| ... | | |
\---------/ \-----------------------------------/
The three optimization pipelines
There are three kinds of optimization pipelines: The default (non-LTO) pipeline, the ThinLTO pipeline and the FatLTO pipeline. The default pipeline works by separately optimizing each module, without any special knowledge about other modules (apart from function/global declarations). “Module” here corresponds to “one source file” in C/C++ and one “codegen unit” in Rust.
# Default
Module 1 --(Optimize)--> Module 1'
Module 2 --(Optimize)--> Module 2'
Module 3 --(Optimize)--> Module 3'
The LTO pipelines are split into a pre-link and a post-link optimization pipeline. After the pre-link pipeline, ThinLTO will perform some lightweight cross-module analysis, and in particular import certain functions from other modules to make them eligible for inlining. However, the post-link optimization again works on individual modules.
# ThinLTO
cross-import
Module 1 --(ThinLTO pre-link)------\-/-----(ThinLTO post-link)--> Module 1'
Module 2 --(ThinLTO pre-link)-------X------(ThinLTO post-link)--> Module 2'
Module 3 --(ThinLTO pre-link)------/-\-----(ThinLTO post-link)--> Module 3'
Finally, FatLTO will simply merge all modules together after the pre-link pipeline and then run the post-link pipeline on a single, huge module:
# FatLTO
merge
Module 1 --(FatLTO pre-link)-------\
Module 2 --(FatLTO pre-link)---------------(FatLTO post-link)--> Module'
Module 3 --(FatLTO pre-link)-------/
While LTO stands for “link-time optimization”, it is not necessarily performed by the linker. For example, rustc will usually perform it as part of the compiler. The important property of LTO is that it allows cross-module optimization.
LTO is also typically accompanied by a restricted list of symbols that need to be exported (i.e. externally accessible), while everything else can be internalized (i.e. become an internal implementation detail). This allows for whole program optimization (WPO).
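A small two-file sketch of what this enables; the file names and the exact outcome are illustrative, but the pattern is the classic whole-program-optimization win.
/* helper.c */
int scale(int x) {       /* not 'static', so in a normal (non-LTO) build it */
    return x * 3;        /* must remain externally callable                 */
}

/* main.c */
int scale(int x);

int main(void) {
    /* With cross-module optimization (e.g., building both files with
     * -flto), the optimizer sees both modules at once: it can inline
     * scale() across the file boundary, internalize it because only main
     * needs to stay exported, and then delete the now-unused definition. */
    return scale(14);
}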
The default optimization pipeline
The optimization pipelines are defined in PassBuilderPipelines, and the default pipeline in particular is created by buildPerModuleDefaultPipeline(). I will mostly talk about the pipeline in terms of generalities, so take a look at that file if you want to inspect the precise pass ordering.
The pipeline consists of two parts: The module simplification and the module optimization pipelines.
The job of the simplification pipeline is to simplify and canonicalize the IR. The job of the optimization pipeline is to perform optimizations that may make the IR more complex or less canonical.
Examples of the latter are vectorization and runtime unrolling. These transforms will generally greatly increase IR size and make it harder to analyze, which is why it is important that they run in the late optimization pipeline.
Most other transforms are simplifications, including classics like SROA, CSE/GVN, LICM, instruction combining, etc. An interesting case is full unrolling, which, unlike runtime unrolling, is also part of the simplification pipeline, because it converts a loop into straight-line code, opening up additional simplification opportunities.
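A C illustration of the distinction; whether these loops are actually unrolled depends on cost models and the LLVM version, but the constant-trip-count case is the one eligible for full unrolling in the simplification pipeline.
/* Full unrolling: the trip count is a compile-time constant, so the loop
 * can be replaced by four straight-line additions, which later passes can
 * simplify further. */
int sum_fixed(const int *a) {
    int s = 0;
    for (int i = 0; i < 4; i++)
        s += a[i];
    return s;
}

/* Runtime unrolling: the trip count is only known at run time, so
 * unrolling duplicates the body and adds remainder handling. This grows
 * the IR without making it simpler, which is why it belongs in the late
 * optimization pipeline. */
int sum_dynamic(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}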
The LTO pipelines
The ThinLTO pipeline is where the simplification/optimization split really comes into action: The pre-link pipeline basically just runs the simplification part, while the post-link pipeline runs both simplification and optimization.
The key property of LTO optimization is that additional inlining may be performed in the post-link step. As such, it is important that optimizations that increase IR size or decrease canonicality are not performed during the pre-link phase.
For example, we do not want functions to get vectorized pre-link, because this might prevent them from being inlined and then simplified in more profitable ways.
For the same reason, some other minor adjustments to the pre-link phase are performed. For example, loop rotation will be suppressed if it would duplicate a call: That call might become inlinable post-link, which might render the duplication much more expensive than it would otherwise be.
Running the simplification pipeline over everything again post-link is not strictly necessary: Ideally, it would only be applied to functions where additional inlining actually happened. This is a possible optimization for the future.
As for the fat LTO pipeline … well, we don’t talk about the fat LTO pipeline. Its design is somewhere between non-sensical and non-existent. Fat LTO runs the module optimization pipeline pre-link (which is not a good idea for the same reasons as for ThinLTO), and the post-link pipeline is entirely home-grown and not based on the standard simplification/optimization pipelines.
The module simplification pipeline
Now that we have put things into context, let’s return to the module simplification pipeline, the core of which is the CGSCC inliner. CGSCC is short for call-graph strongly-connected components.
The basic idea is that if you have a call-graph consisting of callers and callees (callee = called function), you want to simplify the callees first, before you try to inline them into the callers (and then simplify those too). Simplifying callees before inlining makes the cost-modelling more accurate and reduces the amount of simplification that needs to be performed post-inlining.
This is what CGSCC passes do: They try to visit callees before callers. For cyclic (recursive) call-graphs, this is not possible, so they do the next best thing: Visit strongly-connected components of mutually recursive functions. There is no well-defined visitation order within one component, but at least we preserve the “callee before callers” property between components.
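A small C sketch of why the bottom-up order helps; the exact decisions are heuristic, but the shape of the argument is as follows.
/* The call graph is main -> helper -> leaf, so a CGSCC walk visits leaf
 * first. Once leaf() has been simplified (here it folds to the constant
 * 42), helper() is trivially cheap to inline, and after inlining and
 * simplification main() can collapse to 'return 42;'. If main() were
 * visited first, the inliner would have to cost-model calls into
 * not-yet-simplified callees, which is less accurate. */
static int leaf(void)   { return 6 * 7; }
static int helper(void) { return leaf(); }
int main(void)          { return helper(); }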
The CGSCC pipeline consists essentially of the inliner and the function simplification pipeline. In addition to that it contains a few other CGSCC passes, e.g. for argument promotion (argument-level SROA) and inference of function attributes.
The key bit is that inlining and simplification are interleaved, so that after calls into a function have been inlined, it will be simplified before becoming an inlining candidate itself. (This is another failure of the fat LTO pipeline, which does not interleave inlining and simplification.)
Before the CGSCC inliner, the module simplification pipeline performs some early cleanup. In particular this involves running a SimplifyCFG+SROA+EarlyCSE sequence on all functions, followed by some module-level passes, such as optimization of globals and IPSCCP, which performs inter-procedural constant and range propagation. I believe the purpose of the early function cleanup is to prepare for these module passes, and to make sure that non-trivial SCCs receive at least basic cleanup prior to inlining.
The function simplification pipeline
The function simplification pipeline is the very core of the entire optimization pipeline. Unfortunately, this is also the point where it becomes hard to make high-level statements about its underlying design philosophy.
There is a saying that goes something like this (I couldn’t find any source for it, but I’m also pretty sure I didn’t invent it):
Optimization is a science, phase ordering is an art.
There isn’t really a way to determine the correct ordering of optimization passes from first principles. It’s something that comes from experience, and incremental adjustments to handle this or that test case. In most cases, nobody will be able to tell you why a pass is scheduled in precisely the place it is.
The key constraint of phase ordering is compile-time. An easy solution to many phase ordering problems would be to just schedule a few extra pass runs, but this is rarely a worthwhile tradeoff. Fixing someone’s pet use case by scheduling an extra GVN run and thus making everyone pay an extra 5-10% compile-time is not a good idea.
A key part of compile-time reduction is analysis sharing. Passes that depend on the same expensive analysis will be scheduled as a group. For example, the jump threading and correlated value propagation passes are always scheduled as a pair, because they both depend on the expensive lazy value information analysis, which provides information on value ranges.
The function simplification pipeline contains two primary loop pipelines. The loop pipelines also work in an interleaved manner, where a sequence of loop passes is first applied to an inner-most loop, and then outer loops.
The primary distinction between the two loop pipelines is that the former uses and preserves MemorySSA (primarily to drive loop invariant code motion), while the latter does not. Once again, this is driven by analysis-sharing concerns.
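For example, loop invariant code motion, the main client of MemorySSA here, targets code like the following sketch.
/* The value '*scale * 2' does not change across iterations, so LICM can
 * compute it once before the loop. The restrict qualifiers assert that
 * 'a' and 'scale' do not alias, which is the kind of memory question
 * that MemorySSA helps the pass answer cheaply. */
void scale_all(int *restrict a, int n, const int *restrict scale) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] * (*scale * 2);
}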
Summary
In summary, this is roughly what LLVM’s optimization pipeline looks like:
|-------------------------- default ----------------------------|
|------------- pre-link --------------|
|------------------------- post-link ---------------------------|
/-------------------------------------\
| module simplification |
|-------------------------------------| /---------------------\
| | | module optimization |
| cgscc | |---------------------| /---------\
| /-------\ /-----------------------\ | | | | backend |
| | early | | inlining | | | vectorization | \---------/
| |cleanup| |function simplification| | | runtime unrolling |
| \-------/ \-----------------------/ | \---------------------/
| |
\-------------------------------------/
The cute “backend” box hides another optimization pipeline that I have ignored entirely here. Maybe I’ll talk about it some other time.
Finally, it’s worth mentioning that the optimization pipeline has a number of extension points where additional passes can be inserted either by the frontend, the backend or plugins. For example, frontends will usually add sanitizer passes using this mechanism.
It is also possible to have an entirely custom optimization pipeline. The standard pipeline is mostly tuned for Clang, but also works well for rustc, and I don’t think rustc would benefit substantially from a custom pipeline. However, when it comes to more exotic languages (e.g. Java and other GC languages) a custom optimization pipeline may be necessary.