I'm trying to create a basic decompiler and I'm having thoughts on how I can find the basic blocks to later build a CFG.
I'm using the sources at http://www.backerstreet.com/decompiler/basic_blocks.php as a guide and I'm having trouble understanding how a certain part of the algorithm can work reliably.
The part I'm wondering about is the calculation of the length of the block:
block = new Block at current_address
blockList += block
label = find label at address higher than current_address
branch = find branch at address higher than current_address
if label->address < branch->address_after_branch
block->length = label->address - block->address
current_address = label->address
goto next
end if
block->length = branch->address_after_branch - block->address
Now it makes sense that we first try to find the closest higher address label and see if it's lesser than the following instruction address of the next branch (or return) instruction so to not overshoot the length value.
My problem with that approach is if we analyze the code and follow the branches as we find them a scenario can appear that we first analyze a branch target that lets assume lands us at address 0x000eb330
and the instructions might lookup like this (an example taken from ghidra):
LAB_000eb330
000eb330 c0 05 00 b0 adrp x0,0x1a4000
000eb334 00 40 2d 91 add x0,x0,#0xb50
000eb338 01 00 80 d2 mov x1,#0x0
000eb33c 02 00 80 d2 mov x2,#0x0
000eb340 9a cf 01 94 bl __stubs::bla_bla
000eb344 f5 03 00 aa mov x21,x0
LAB_000eb348
000eb348 ff ce 01 94 bl __stubs::bla_bla2
000eb34c fb ce 01 94 bl __stubs::bla_bla3
000eb350 e0 03 15 aa mov x0,x21
000eb354 fd 7b 42 a9 ldp x29,x30,[sp, #0x20]
000eb358 f4 4f 41 a9 ldp x20,x19,[sp, #local_20]
000eb35c f6 57 c3 a8 ldp x22,x21,[sp], #0x30
000eb360 c0 03 5f d6 ret
and another branch target that might land us at 0x000eb348
as can be seen from the example.
Now if the first branch is prior to the second, then we analyze the first block first and there's no label that its address is lesser than the address of the ret
instruction meaning we'll identify the first block length more than it should be, and later we'll get to analyze the second jump target, we'll create a block for it and only then we'll have the information needed for the first block to get the next closest label, which is basically the next basic block which was not yet initialized. So how do I solve this problem?
I think this needs two passes, one for creating all basic blocks with their respective address, and only then iterate each block, so they'll have all other basic blocks starting address to know if there's a block with address less than the closest next branch instruction.
I wonder if this is the best way as it seems the algorithm presented in the website glosses this issue over.