How to disassemble line by line from stdin?


My program outputs encoded instructions that look like this:

0x81FB4300000090
0x69FC4300000090
0x81FC4300000090
0x69FD4300000090
0x81FD4300000090
0x69FE4300000090
0x81FE4300000090
0x69FF4300000090
0x81FF4300000090
0x00054400000090
0x01054400000090
0x02054400000090
0x03054400000090
0x08054400000090
0x09054400000090
0x0A054400000090
0x0B054400000090
0x10054400000090
0x11054400000090
0x12054400000090
0x13054400000090
0x18054400000090
0x19054400000090
0x1A054400000090
0x1B054400000090
0x20054400000090
0x21054400000090
0x22054400000090
0x23054400000090
0x28054400000090
0x29054400000090
0x2A054400000090
0x2B054400000090
0x30054400000090
0x31054400000090
0x32054400000090
0x33054400000090
0x38054400000090
0x39054400000090
0x3A054400000090
0x3B054400000090
0x40054400000090
0x41054400000090
0x42054400000090
0x43054400000090
0x44054400000090
0x45054400000090
0x46054400000090
0x47054400000090

Each line above is an independent sequence of instructions and needs to be disassembled as a separate program. Each line contains 7 bytes of instructions. I can also output them directly in binary; in that case, every block of 7 bytes needs to be disassembled separately.

In the bash script that runs my program, I want to filter out lines that contain static jumps.

So, how can I disassemble each line separately from stdin? (I want to do something like ./my_C_program | the_disassembler | grep loopne)
I tried objdump, but it refuses to use /dev/stdin as an input file.

There are 2 answers

1
doug65536
time bash -c 'for i in $(cat insns.txt); do \
        echo ".quad $i" | \
        as --64 -o insn.o && \
        objdump --disassemble insn.o; \
    done'

It took 192ms on my machine. Never assume you know something is too slow.

They are a bunch of nop instructions with junk after them. Are they in the wrong order? .quad stores the value little-endian, so the most significant byte, which is written first in hex, ends up last in memory.

4
Peter Cordes

Since you say it would be too slow to fork a disassembler for each line, you need some way to separate one stream of disassembler output.

Un-hexdump your input using something like xxd -r -p (after stripping the 0x prefixes), pipe that through a disassembler, and pipe the disassembler output into a perl program or something. Or just grep with context: grep -C8 loopne prints the 8 lines before and after each match.
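As a rough sketch of that pipeline (not something tested against the asker's real data): the function names strip_prefix and disassemble_stream are made up for illustration, and the temporary file is an assumption made only because objdump wants a real file rather than stdin.

```shell
#!/bin/sh
# Strip the leading "0x" so each line is plain hex that `xxd -r -p` can read.
strip_prefix() {
    sed 's/^0[xX]//'
}

# Disassemble a raw byte stream from stdin as 64-bit x86 code.
# objdump does not read stdin, so the bytes go through a temporary file.
disassemble_stream() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    objdump -D -b binary -m i386:x86-64 "$tmp"
    rm -f "$tmp"
}

# Intended use (left as a comment so the functions can be sourced safely):
# ./my_C_program | strip_prefix | xxd -r -p | disassemble_stream | grep -C8 loopne
```

This runs the disassembler once for the whole stream, so the per-line fork cost goes away, but nothing here resynchronizes the disassembler at each 7-byte boundary.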


To aid in separating the output back into separate records: maybe add some kind of sentinel (like a UD2 instruction) that doesn't appear in any of your lines. Since you say the sequences might not end on an instruction boundary, a sentinel like 90 90 90 90 90 90 90 90 90 0F 0B should safely soak up any extra bytes. That's 9 bytes of NOPs, in case a sequence ends with the start of an instruction looking for an imm32 and a disp32 as part of the addressing mode. (And a 9th NOP for good measure, since I didn't check what 0x90 means as a ModRM or SIB byte).
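A minimal sketch of appending that sentinel to each hex line before converting to binary. The name add_sentinel is hypothetical; the byte sequence is the one given above, and the sketch relies on xxd -r -p ignoring the spaces between byte pairs.

```shell
#!/bin/sh
# Sentinel from the text: nine NOPs (90) followed by UD2 (0F 0B).
SENTINEL='90 90 90 90 90 90 90 90 90 0F 0B'

# Strip the "0x" prefix and append the sentinel bytes to every input line.
add_sentinel() {
    sed -e 's/^0[xX]//' -e "s/\$/ $SENTINEL/"
}

# Intended use (comment only):
# ./my_C_program | add_sentinel | xxd -r -p > blocks.bin
```

With 7-byte inputs, every record in the resulting stream is 18 bytes long, and each one should end with a ud2 line in the disassembly.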

If your sequences are all the same number of bytes, you could use that to look for address ranges.
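For illustration, here is one way the address could be mapped back to a block number, assuming objdump-style lines that start with a hex address and a colon. The name tag_blocks is hypothetical, and an instruction that straddles a boundary is simply attributed to the block it starts in.

```shell
#!/bin/sh
# Prefix each disassembly line with the index of the fixed-size block
# that its address falls into. $1 is the block size in bytes.
tag_blocks() {
    awk -v size="$1" '
        # Convert a hex string to a number (POSIX awk has no strtonum).
        function hex(s,    i, n) {
            n = 0
            s = tolower(s)
            for (i = 1; i <= length(s); i++)
                n = n * 16 + index("0123456789abcdef", substr(s, i, 1)) - 1
            return n
        }
        # Only lines of the form "  addr:<tab>bytes<tab>mnemonic" are tagged;
        # headers and blank lines are dropped.
        /^ *[0-9a-f]+:/ {
            addr = $1
            sub(/:$/, "", addr)
            printf "block %d\t%s\n", int(hex(addr) / size), $0
        }
    '
}

# Intended use (comment only), with 7-byte blocks:
# objdump -D -b binary -m i386:x86-64 blocks.bin | tag_blocks 7 | grep loopne
```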

And BTW, I'd suggest something like perl to make it easy to take multiple lines as a group that you can pattern match on.
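The same grouping idea can be sketched in awk instead of perl. This assumes every block ends with the ud2 sentinel described above; filter_blocks is a hypothetical name, and its argument is treated as an awk regular expression.

```shell
#!/bin/sh
# Print every sentinel-terminated block whose disassembly matches $1.
filter_blocks() {
    awk -v pat="$1" '
        { block = block $0 "\n" }      # accumulate the lines of the current block
        /ud2/ {                        # sentinel reached: the block is complete
            if (block ~ pat)
                printf "%s\n", block   # matching block, followed by a blank line
            block = ""
        }
    '
}

# Intended use (comment only):
# objdump -D -b binary -m i386:x86-64 blocks.bin | filter_blocks loopne
```

To drop the matching blocks instead of keeping them, the condition could be inverted to `block !~ pat`.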

If you need this to be efficient, you need to make sure you can separate the output of one disassembler stream back into separate blocks, or else you need to embed a disassembler into the process that generates these lines (and not print them as ASCII strings in the first place).

There's no completely-general way to do this that's also fast. You can't have your cake and eat it too. If this is a problem, you're going to have to make the number-generating program know more about x86 machine code.


The other option I can see is to create an object file with symbols marking the start of each block, but that would mean feeding the whole thing through an assembler, after turning each line into something like:

label1234: dq 0x11054400000090

This option seems bad, so I haven't tried to solve any byte-order issues. It probably uses a lot of memory, since most x86 assemblers are not one-pass, and probably aren't designed for assembling massive amounts of data with no jump instructions that require picking a short or long encoding.
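If someone did want to try it, generating the labelled source might look something like the sketch below. It emits NASM db lines rather than dq, which is my own assumption for keeping the bytes in the order they are written; gen_labels and the label naming are hypothetical.

```shell
#!/bin/sh
# Turn each "0x..." line into a labelled NASM data line, one byte per item,
# so the bytes are assembled in the same order as they appear in the hex string.
gen_labels() {
    awk '{
        hex = $1
        sub(/^0[xX]/, "", hex)
        line = sprintf("label%d: db ", NR)
        for (i = 1; i <= length(hex); i += 2)
            line = line (i > 1 ? ", " : "") "0x" substr(hex, i, 2)
        print line
    }'
}

# Intended use (comment only):
# ./my_C_program | gen_labels > blocks.asm
# nasm -f elf64 blocks.asm -o blocks.o && objdump -d blocks.o
```

In the disassembly, each block should then start under its own label heading, at the cost of running the whole input through an assembler.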