LLVM Debug Dive – Why choice matters
So, at my current job I work a lot with LLVM. In case you don’t know it and don’t feel like looking it up – LLVM is a compiler framework that makes writing compilers much easier, at least in most cases. It is built with modularity in mind, so everyone can tweak it to their needs, drop any unwanted behaviors – or rather compose only the set of behaviors they actually want.
This post is about one of many corner cases where compromises were made, leaving you with no real way to solve your issue unless you resort to some dirty solutions.
Some time ago, while debugging an issue, I ran into a problem that showed up as really slow assembly generated from relatively simple input code. By relatively simple I mean code that should not have produced any additional loads or stores to memory: the input had fewer variables than the target architecture has registers, so a solution that keeps every value in a register (where values live by default) and never stores it to memory just to make room for computations should have been possible.
I looked at the parts of the compiler that could create patterns forcing register values to be stored in memory, but those components usually have restrictions that keep them from doing so, because on our architecture memory operations are quite expensive in terms of the time they take to complete (not counting the risk of eating into memory bandwidth that could serve the memory operations actually requested by the code). To simplify things I will introduce a term that people outside the compiler world may not know – “register pressure”. It is basically a statistic of the code that tells you how many values are live at the same time, i.e. the minimal number of registers required to hold the code’s variables without storing them anywhere else. The higher the register pressure, the more registers you need.
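To make that concrete, here is a tiny, purely hypothetical C++ snippet (nothing to do with the code I was actually debugging) with the set of live values annotated at each point; the largest of those sets is the register pressure of the snippet:
[code language="cpp"]
// Hypothetical illustration of register pressure: the number of values that
// are still needed later ("live") at a given point is the pressure there.
int pressure_demo(int a, int b, int c, int d) {
    // live here: a, b, c, d                      -> pressure 4
    int t1 = a + b;   // live after: t1, c, d     -> pressure 3
    int t2 = t1 * c;  // live after: t2, d        -> pressure 2
    int t3 = t2 - d;  // live after: t3           -> pressure 1
    return t3;        // peak pressure: 4, so 4 registers are enough
}
[/code]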
Going back to the main issue…
In fact the LLVM IR (LLVM’s internal intermediate representation), by the time it reached those components, already had register pressure well above the threshold at which we can still avoid storing values in memory, and none of the existing optimizations was able to reduce it.
After a lot of calls to this->dump() (the most used function when debugging LLVM 🙂 ) I found out that the problem was in an LLVM-provided optimization called InstructionCombining.
A bit more about InstructionCombining – it’s a very generic and common optimization that basically… combines instructions. For example (I’ll use a simplified version of the LLVM IR text representation to avoid confusion about unrelated details), this:
[code language=”cpp”]
%2 = add %1, 1
%3 = add %2, 1
[/code]
becomes:
[code]
%3 = add %1, 2
[/code]
Most of the combining relies on mathematical identities, and the general rule is that it tries to reduce the number of instructions.
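To give a feel for what such a rule might look like, here is a minimal, self-contained C++ sketch of the add-with-constant fold from the example above. It is my own toy model – the AddInst struct and combineAdds function are invented for this post – and not the real InstCombine code, which works on actual LLVM IR:
[code language="cpp"]
#include <optional>

// Toy representation of "%result = add %lhs, constant".
struct AddInst {
    int result;    // id of the value this instruction defines
    int lhs;       // id of the non-constant operand
    int constant;  // the constant operand
};

// Fold  %b = add %a, C2  where  %a = add %x, C1  into  %b = add %x, C1+C2.
// The fold relies on associativity of integer addition (overflow flags are
// ignored here) and removes one instruction when %a has no other users.
std::optional<AddInst> combineAdds(const AddInst& inner, const AddInst& outer) {
    if (outer.lhs != inner.result)
        return std::nullopt;  // the outer add does not consume the inner one
    return AddInst{outer.result, inner.lhs, inner.constant + outer.constant};
}
[/code]
With the example above, inner would be %2 = add %1, 1 and outer %3 = add %2, 1, and the fold produces %3 = add %1, 2.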
How is it implemented?
I will oversimplify it! It starts by adding the terminator instruction (pretty much the last instruction in a basic block) to a worklist and then loops over that worklist. The pseudocode looks roughly like this:
[code]
while worklist is not empty
    pop operation from worklist
    try to combine operation with its operands
    add operands of the resulting operation to worklist
    insert operation at the first insertion point in the basic block
    (the first insertion point is usually the beginning of the basic block)
[/code]
Quite simple. Actually, the whole loop is run over and over for as long as something in the code changed – quite useful, but it can also be dangerous.
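To make the pseudocode a bit more tangible, here is a minimal, self-contained C++ sketch of such a worklist-driven traversal. It is my own toy model – the Instr struct and the visit function are invented for this post, not the actual InstCombine code – and it only prints the visiting order instead of combining anything, but that order is exactly what will matter in a moment. Note the fifo flag, which decides whether the worklist behaves like a queue or like a stack:
[code language="cpp"]
#include <deque>
#include <iostream>
#include <string>
#include <vector>

// Toy model of an instruction: the value it defines and the values it uses.
struct Instr {
    std::string name;
    std::vector<std::string> operands;  // empty for the loads in our example
};

// Visit instructions starting from the last one in the block, pulling work
// either from the back of the worklist (LIFO, a stack) or from the front
// (FIFO, a queue). The real pass also avoids re-adding instructions that are
// already on the worklist; the example below has only single-use values, so
// this toy skips that detail.
void visit(const std::vector<Instr>& block, bool fifo) {
    auto findDef = [&](const std::string& name) -> const Instr* {
        for (const Instr& instr : block)
            if (instr.name == name)
                return &instr;
        return nullptr;  // operand is not defined in this block
    };

    std::deque<const Instr*> worklist;
    worklist.push_back(&block.back());  // start from the terminator

    while (!worklist.empty()) {
        const Instr* current;
        if (fifo) {
            current = worklist.front();  // queue behaviour
            worklist.pop_front();
        } else {
            current = worklist.back();   // stack behaviour
            worklist.pop_back();
        }
        std::cout << "visiting " << current->name << "\n";
        for (const std::string& op : current->operands)
            if (const Instr* def = findDef(op))
                worklist.push_back(def);  // operands become new work items
    }
}
[/code]
Feeding it the example block from the next section with fifo = false reproduces the stack visiting order shown below, and with fifo = true the queue order.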
And now to the root cause of my issue.
The worklist in this particular optimization could be any container that allows fast push and pop – a list, a queue, a stack, etc. In the LLVM implementation – at least in the version I was working with – the worklist was implemented as a stack. This causes some interesting side effects. Let’s take a look.
Let’s assume we have LLVM IR input like this (again, a simplified version):
[code]
%1 = load
%2 = load
%3 = add %1, %2
%4 = load
%5 = add %4, %3
%6 = load
%7 = add %6, %5
%8 = load
%9 = add %8, %7
[/code]
Loads cannot be combined (at least not by this optimization), so none of these instructions will actually be optimized – they will just be visited and re-inserted at the first insertion point (a kind of trivial code motion). Let’s walk through the pseudocode loop on this input:
[code]
(push %9 to worklist, Worklist { %9})
pop worklist: %9 = add %8, %7 (push %8 and %7 in this order to worklist, Worklist { %8, %7 })
pop worklist: %7 = add %6, %5 (push %6 and %5 in this order to worklist, Worklist { %8, %6, %5 })
pop worklist: %5 = add %4, %3 (push %4 and %3 in this order to worklist, Worklist { %8, %6, %4, %3 })
pop worklist: %3 = add %1, %2 (push %1 and %2 in this order to worklist, Worklist { %8, %6, %4, %1, %2 })
pop worklist: %2 = load (no operands – no new items to worklist, Worklist { %8, %6, %4, %1 })
pop worklist: %1 = load (no operands – no new items to worklist, Worklist { %8, %6, %4 })
pop worklist: %4 = load (no operands – no new items to worklist, Worklist { %8, %6 })
pop worklist: %6 = load (no operands – no new items to worklist, Worklist { %8 })
pop worklist: %8 = load (no operands – no new items to worklist, Worklist {})
loop finished, nothing changed (no optimizations made)
[/code]
After this optimization the code will look like this:
[code]
%8 = load
%6 = load
%4 = load
%1 = load
%2 = load
%3 = add %1, %2
%5 = add %4, %3
%7 = add %6, %5
%9 = add %8, %7
[/code]
As you can see, implementing the worklist as a stack caused the instructions within this basic block to change their order significantly, in a way that is not easy to undo with a separate pass. The code is still absolutely correct in terms of functionality, but we sacrificed low register pressure for no particular gain.
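To see how much was sacrificed, here is a hypothetical C-style rendering in C++ (load() is just a stand-in for the memory reads, invented for this illustration) of the reordered block next to the original one, with the live values annotated after each statement, assuming only the final result is needed afterwards:
[code language="cpp"]
int load();  // placeholder for the memory reads in the example

int reordered() {      // order produced by the stack-based worklist
    int v8 = load();   // live: v8                  -> pressure 1
    int v6 = load();   // live: v8, v6              -> pressure 2
    int v4 = load();   // live: v8, v6, v4          -> pressure 3
    int v1 = load();   // live: v8, v6, v4, v1      -> pressure 4
    int v2 = load();   // live: v8, v6, v4, v1, v2  -> pressure 5
    int v3 = v1 + v2;  // live: v8, v6, v4, v3      -> pressure 4
    int v5 = v4 + v3;  // live: v8, v6, v5          -> pressure 3
    int v7 = v6 + v5;  // live: v8, v7              -> pressure 2
    return v8 + v7;    // peak pressure: 5
}

int original() {       // original input order
    int v1 = load();   // live: v1                  -> pressure 1
    int v2 = load();   // live: v1, v2              -> pressure 2
    int v3 = v1 + v2;  // live: v3                  -> pressure 1
    int v4 = load();   // live: v3, v4              -> pressure 2
    int v5 = v4 + v3;  // live: v5                  -> pressure 1
    int v6 = load();   // live: v5, v6              -> pressure 2
    int v7 = v6 + v5;  // live: v7                  -> pressure 1
    int v8 = load();   // live: v7, v8              -> pressure 2
    return v8 + v7;    // peak pressure: 2
}
[/code]
In this tiny example the peak only goes from 2 to 5 and everything still fits in registers, but the same pattern on a longer accumulation chain grows with the number of loads.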
So let’s take a look at what this would look like if a queue instead of a stack were used to implement the worklist:
[code]
(push %9 to worklist, Worklist { %9})
pop worklist: %9 = add %8, %7 (push %8 and %7 in this order to worklist, Worklist { %8, %7 })
pop worklist: %8 = load (no operands – no new items to worklist, Worklist { %7 })
pop worklist: %7 = add %6, %5 (push %6 and %5 in this order to worklist, Worklist { %6, %5 })
pop worklist: %6 = load (no operands – no new items to worklist, Worklist { %5 })
pop worklist: %5 = add %4, %3 (push %4 and %3 in this order to worklist, Worklist { %4, %3 })
pop worklist: %4 = load (no operands – no new items to worklist, Worklist { %3 })
pop worklist: %3 = add %1, %2 (push %1 and %2 in this order to worklist, Worklist { %1, %2 })
pop worklist: %1 = load (no operands – no new items to worklist, Worklist { %2 })
pop worklist: %2 = load (no operands – no new items to worklist, Worklist {})
loop finished, nothing changed (no optimizations made)
[/code]
And corresponding code:
[code]
%2 = load
%1 = load
%3 = add %1, %2
%4 = load
%5 = add %4, %3
%6 = load
%7 = add %6, %5
%8 = load
%9 = add %8, %7
[/code]
WOW! Almost no difference in instruction order (only the first two loads swapped places), and the register pressure is unchanged.
Now let’s consider a case with more than 9 instructions and 5 loads – say around 60 instructions, like in some blend-effect shader, or worse, an unrolled loop that accumulates values. You can end up with register pressure so high that you run into situations that were never considered before, or are rarely tested.
Of course there are already passes that fix such situations, but they are not perfect, or they are very expensive in terms of computation time, which leads to longer compilation times. I’m not saying a queue would be the best option in every situation, but at least for the architecture I worked with it was – certainly much better than a stack. I only wanted to point out that if you’re writing a generic optimization, consider situations like this one and try not to build it in a way that falls apart on a less-than-ideal case – or at least give the user an option to customize it, so they don’t have to work around the issue.
Side note: I did not raise this issue on the LLVM community mailing list, nor have I provided a fix to the public repository, due to lack of time – so if you have the time to do so, go ahead. Just please credit me in some way 🙂
Also… First actual post!