Decompilation creating basic blocks

59 views Asked by At

I'm trying to create a basic decompiler and I'm having thoughts on how I can find the basic blocks to later build a CFG.

I'm using the sources at http://www.backerstreet.com/decompiler/basic_blocks.php as a guide and I'm having trouble understanding how a certain part of the algorithm can work reliably.

The part I'm wondering about is the calculation of the length of the block:

block = new Block at current_address
        blockList += block
        label = find label at address higher than current_address
        branch = find branch at address higher than current_address
        if label->address < branch->address_after_branch
            block->length = label->address - block->address
            current_address = label->address
        goto next
    end if
    block->length = branch->address_after_branch - block->address

Now it makes sense that we first try to find the closest higher address label and see if it's lesser than the following instruction address of the next branch (or return) instruction so to not overshoot the length value.

My problem with that approach is if we analyze the code and follow the branches as we find them a scenario can appear that we first analyze a branch target that lets assume lands us at address 0x000eb330 and the instructions might lookup like this (an example taken from ghidra):

                             LAB_000eb330                                                   
        000eb330 c0 05 00 b0     adrp                           x0,0x1a4000
        000eb334 00 40 2d 91     add                            x0,x0,#0xb50                                                                                       
        000eb338 01 00 80 d2     mov                            x1,#0x0
        000eb33c 02 00 80 d2     mov                            x2,#0x0
        000eb340 9a cf 01 94     bl                             __stubs::bla_bla                                                                                              
        000eb344 f5 03 00 aa     mov                            x21,x0
                             LAB_000eb348                                                    
        000eb348 ff ce 01 94     bl                             __stubs::bla_bla2
        000eb34c fb ce 01 94     bl                             __stubs::bla_bla3
        000eb350 e0 03 15 aa     mov                            x0,x21
        000eb354 fd 7b 42 a9     ldp                            x29,x30,[sp, #0x20]
        000eb358 f4 4f 41 a9     ldp                            x20,x19,[sp, #local_20]
        000eb35c f6 57 c3 a8     ldp                            x22,x21,[sp], #0x30
        000eb360 c0 03 5f d6     ret

and another branch target that might land us at 0x000eb348 as can be seen from the example.

Now if the first branch is prior to the second, then we analyze the first block first and there's no label that its address is lesser than the address of the ret instruction meaning we'll identify the first block length more than it should be, and later we'll get to analyze the second jump target, we'll create a block for it and only then we'll have the information needed for the first block to get the next closest label, which is basically the next basic block which was not yet initialized. So how do I solve this problem?

I think this needs two passes, one for creating all basic blocks with their respective address, and only then iterate each block, so they'll have all other basic blocks starting address to know if there's a block with address less than the closest next branch instruction.

I wonder if this is the best way as it seems the algorithm presented in the website glosses this issue over.

0

There are 0 answers