How does Capstone disassemble instructions? Does it disassemble them to raw assembly code?

I was trying to list all the x86 instructions (and hopefully those of the other available architectures) in 64-, 32- and 16-bit modes. I used Capstone, since I heard it was a good disassembly framework. I quickly wrote a small C++ program, but the output it gave was not what I wanted and made me question what I understand about assembly. First, here is my code:

#include <iostream>
#include <iomanip>
#include <capstone/capstone.h>
#pragma comment(lib, "capstone.lib")

#include <unordered_set>
#include <sstream>
#include <cstdlib>

class byteins {
private:
    size_t byteSize;
    static uint8_t* bytes;

public:
    byteins(int n) : byteSize(n) {
        bytes = static_cast<uint8_t*>(std::calloc(byteSize, sizeof(uint8_t)));
    }

    int increase() {
        for (int i = 0; i < byteSize; i = i + 1) {
            if (bytes[i] != 0xFF) {
                bytes[i] = bytes[i] + 1;
                return 0;  // Successfully increased
            }
        }

        return -1;  // Full
    }

    int getSize() { return byteSize; }
    void* getBytes() { return bytes; }

    ~byteins() {
        std::free(bytes);
    }
};

uint8_t* byteins::bytes = nullptr;

std::unordered_set<std::string> mnemonics;
std::unordered_set<std::string> operands;

std::unordered_set<std::string> assemins;

int main()
{
    int size = 8;

    byteins bytes(size);

    csh handle;
    cs_insn* insn;

    cs_open(CS_ARCH_X86, CS_MODE_64, &handle);


    
    for (; bytes.increase() != -1;)
    {
        size_t count = cs_disasm(handle, reinterpret_cast<uint8_t*>(bytes.getBytes()), bytes.getSize(), 0, 0, &insn);
            for (int i = 0; i < count; i = i + 1)
            {
                std::stringstream inst;
                if (mnemonics.find(insn[i].mnemonic) == mnemonics.end())
                {
                    mnemonics.insert(insn[i].mnemonic);
                }
                if (operands.find(insn[i].op_str) == operands.end())
                {
                    operands.insert(insn[i].op_str);
                }

                inst << insn[i].mnemonic << '\t' << insn[i].op_str;
            
                if (assemins.find(inst.str()) == assemins.end())
                {
                assemins.insert(inst.str());
  
                std::cout << std::hex << std::uppercase << *(reinterpret_cast<uint64_t*>(&insn[i].bytes)) << "\t";

                std::cout << inst.str() << "\n";
                }

            }

            cs_free(insn, count);
        
    }
    

    cs_close(&handle);

    return EXIT_SUCCESS;
}

I wanted to print all the instructions using all possible byte combinations. I wanted to print them out because I find assembly interesting and a list of all of them would be helpful, but the code seems to only print the instructions from 0x0 to 0xFFFF and not up to 0xFFFFFFFFFFFFFFFF, and I don't understand why. I know I should just read the official docs, but I find this way more convenient, since I can just look up whatever I am interested in in my free time. I feel like I am lacking additional info on how assembly works, how the disassembly works, or how the architecture works, so if you think so too, please do fill me in.

I tried all possible combinations: CS_ARCH_X86 with CS_MODE_16, CS_MODE_32 and CS_MODE_64, with different buffer lengths (1, 2, 3, 4, 5, 6, 7 and 8 bytes). I also tried asking ChatGPT, but the solution it gave just produced more errors. I reviewed the code multiple times but didn't see anything that could cause problems (but I am no professional, so I don't know).

Edit:

the code above produced this output:

1       add     dword ptr [rax], eax
0       add     byte ptr [rax], al
2       add     al, byte ptr [rax]
3       add     eax, dword ptr [rax]
4       add     al, 0
5       add     eax, 0
8       or      byte ptr [rax], al
9       or      dword ptr [rax], eax
A       or      al, byte ptr [rax]
B       or      eax, dword ptr [rax]
C       or      al, 0
D       or      eax, 0
F       sldt    word ptr [rax]
10      adc     byte ptr [rax], al
11      adc     dword ptr [rax], eax
12      adc     al, byte ptr [rax]
13      adc     eax, dword ptr [rax]
14      adc     al, 0
15      adc     eax, 0
18      sbb     byte ptr [rax], al
19      sbb     dword ptr [rax], eax
1A      sbb     al, byte ptr [rax]
1B      sbb     eax, dword ptr [rax]
1C      sbb     al, 0
1D      sbb     eax, 0
20      and     byte ptr [rax], al
21      and     dword ptr [rax], eax
22      and     al, byte ptr [rax]
23      and     eax, dword ptr [rax]
24      and     al, 0
25      and     eax, 0
26      add     byte ptr es:[rax], al
28      sub     byte ptr [rax], al
29      sub     dword ptr [rax], eax
2A      sub     al, byte ptr [rax]
2B      sub     eax, dword ptr [rax]
2C      sub     al, 0
2D      sub     eax, 0
2E      add     byte ptr cs:[rax], al
30      xor     byte ptr [rax], al
31      xor     dword ptr [rax], eax
32      xor     al, byte ptr [rax]
33      xor     eax, dword ptr [rax]
34      xor     al, 0
35      xor     eax, 0
36      add     byte ptr ss:[rax], al
38      cmp     byte ptr [rax], al
39      cmp     dword ptr [rax], eax
3A      cmp     al, byte ptr [rax]
3B      cmp     eax, dword ptr [rax]
3C      cmp     al, 0
3D      cmp     eax, 0
3E      add     byte ptr ds:[rax], al
41      add     byte ptr [r8], al
44      add     byte ptr [rax], r8b
45      add     byte ptr [r8], r8b
50      push    rax
51      push    rcx
52      push    rdx
53      push    rbx
54      push    rsp
55      push    rbp
56      push    rsi
57      push    rdi
58      pop     rax
59      pop     rcx
5A      pop     rdx
5B      pop     rbx
5C      pop     rsp
5D      pop     rbp
5E      pop     rsi
5F      pop     rdi
64      add     byte ptr fs:[rax], al
65      add     byte ptr gs:[rax], al
67      add     byte ptr [eax], al
68      push    0
69      imul    eax, dword ptr [rax], 0
6C      insb    byte ptr [rdi], dx
6D      insd    dword ptr [rdi], dx
6E      outsb   dx, byte ptr [rsi]
6F      outsd   dx, dword ptr [rsi]
70      jo      2
71      jno     2
72      jb      2
73      jae     2
74      je      2
75      jne     2
76      jbe     2
77      ja      2
78      js      2
79      jns     2
7A      jp      2
7B      jnp     2
7C      jl      2
7D      jge     2
7E      jle     2
7F      jg      2
80      add     byte ptr [rax], 0
81      add     dword ptr [rax], 0
84      test    byte ptr [rax], al
85      test    dword ptr [rax], eax
86      xchg    byte ptr [rax], al
87      xchg    dword ptr [rax], eax
88      mov     byte ptr [rax], al
89      mov     dword ptr [rax], eax
8A      mov     al, byte ptr [rax]
8B      mov     eax, dword ptr [rax]
8C      mov     word ptr [rax], es
8D      lea     eax, [rax]
8E      mov     es, word ptr [rax]
8F      pop     qword ptr [rax]
90      nop
91      xchg    ecx, eax
92      xchg    edx, eax
93      xchg    ebx, eax
94      xchg    esp, eax
95      xchg    ebp, eax
96      xchg    esi, eax
97      xchg    edi, eax
98      cwde
99      cdq
9B      wait
9C      pushfq
9D      popfq
9E      sahf
9F      lahf
A4      movsb   byte ptr [rdi], byte ptr [rsi]
A5      movsd   dword ptr [rdi], dword ptr [rsi]
A6      cmpsb   byte ptr [rsi], byte ptr [rdi]
A7      cmpsd   dword ptr [rsi], dword ptr [rdi]
A8      test    al, 0
A9      test    eax, 0
AA      stosb   byte ptr [rdi], al
AB      stosd   dword ptr [rdi], eax
AC      lodsb   al, byte ptr [rsi]
AD      lodsd   eax, dword ptr [rsi]
AE      scasb   al, byte ptr [rdi]
AF      scasd   eax, dword ptr [rdi]
B0      mov     al, 0
B1      mov     cl, 0
B2      mov     dl, 0
B3      mov     bl, 0
B4      mov     ah, 0
B5      mov     ch, 0
B6      mov     dh, 0
B7      mov     bh, 0
B8      mov     eax, 0
B9      mov     ecx, 0
BA      mov     edx, 0
BB      mov     ebx, 0
BC      mov     esp, 0
BD      mov     ebp, 0
BE      mov     esi, 0
BF      mov     edi, 0
C0      rol     byte ptr [rax], 0
C1      rol     dword ptr [rax], 0
C2      ret     0
C3      ret
C6      mov     byte ptr [rax], 0
C7      mov     dword ptr [rax], 0
C8      enter   0, 0
C9      leave
CA      retf    0
CB      retf
CC      int3
CD      int     0
CF      iretd
D0      rol     byte ptr [rax], 1
D1      rol     dword ptr [rax], 1
D2      rol     byte ptr [rax], cl
D3      rol     dword ptr [rax], cl
D7      xlatb
D8      fadd    dword ptr [rax]
D9      fld     dword ptr [rax]
DA      fiadd   dword ptr [rax]
DB      fild    dword ptr [rax]
DC      fadd    qword ptr [rax]
DD      fld     qword ptr [rax]
DE      fiadd   word ptr [rax]
DF      fild    word ptr [rax]
E0      loopne  2
E1      loope   2
E2      loop    2
E3      jrcxz   2
E4      in      al, 0
E5      in      eax, 0
E6      out     0, al
E7      out     0, eax
E8      call    5
E9      jmp     5
EB      jmp     2
EC      in      al, dx
ED      in      eax, dx
EE      out     dx, al
EF      out     dx, eax
F0      lock add        byte ptr [rax], al
F1      int1
F4      hlt
F5      cmc
F6      test    byte ptr [rax], 0
F7      test    dword ptr [rax], 0
F8      clc
F9      stc
FA      cli
FB      sti
FC      cld
FD      std
FE      inc     byte ptr [rax]
FF      inc     dword ptr [rax]
1FF     inc     dword ptr [rcx]
2FF     inc     dword ptr [rdx]
3FF     inc     dword ptr [rbx]
4FF     inc     dword ptr [rax + rax]
5FF     inc     dword ptr [rip]
6FF     inc     dword ptr [rsi]
7FF     inc     dword ptr [rdi]
8FF     dec     dword ptr [rax]
9FF     dec     dword ptr [rcx]
AFF     dec     dword ptr [rdx]
BFF     dec     dword ptr [rbx]
CFF     dec     dword ptr [rax + rax]
DFF     dec     dword ptr [rip]
EFF     dec     dword ptr [rsi]
FFF     dec     dword ptr [rdi]
10FF    call    qword ptr [rax]
11FF    call    qword ptr [rcx]
12FF    call    qword ptr [rdx]
13FF    call    qword ptr [rbx]
14FF    call    qword ptr [rax + rax]
15FF    call    qword ptr [rip]
16FF    call    qword ptr [rsi]
17FF    call    qword ptr [rdi]
18FF    call    ptr [rax]
19FF    call    ptr [rcx]
1AFF    call    ptr [rdx]
1BFF    call    ptr [rbx]
1CFF    call    ptr [rax + rax]
1DFF    call    ptr [rip]
1EFF    call    ptr [rsi]
1FFF    call    ptr [rdi]
20FF    jmp     qword ptr [rax]
21FF    jmp     qword ptr [rcx]
22FF    jmp     qword ptr [rdx]
23FF    jmp     qword ptr [rbx]
24FF    jmp     qword ptr [rax + rax]
25FF    jmp     qword ptr [rip]
26FF    jmp     qword ptr [rsi]
27FF    jmp     qword ptr [rdi]
28FF    jmp     ptr [rax]
29FF    jmp     ptr [rcx]
2AFF    jmp     ptr [rdx]
2BFF    jmp     ptr [rbx]
2CFF    jmp     ptr [rax + rax]
2DFF    jmp     ptr [rip]
2EFF    jmp     ptr [rsi]
2FFF    jmp     ptr [rdi]
30FF    push    qword ptr [rax]
31FF    push    qword ptr [rcx]
32FF    push    qword ptr [rdx]
33FF    push    qword ptr [rbx]
34FF    push    qword ptr [rax + rax]
35FF    push    qword ptr [rip]
36FF    push    qword ptr [rsi]
37FF    push    qword ptr [rdi]
45FF    inc     dword ptr [rbp]
4DFF    dec     dword ptr [rbp]
55FF    call    qword ptr [rbp]
5DFF    call    ptr [rbp]
65FF    jmp     qword ptr [rbp]
6DFF    jmp     ptr [rbp]
75FF    push    qword ptr [rbp]
C0FF    inc     eax
C1FF    inc     ecx
C2FF    inc     edx
C3FF    inc     ebx
C4FF    inc     esp
C5FF    inc     ebp
C6FF    inc     esi
C7FF    inc     edi
C8FF    dec     eax
C9FF    dec     ecx
CAFF    dec     edx
CBFF    dec     ebx
CCFF    dec     esp
CDFF    dec     ebp
CEFF    dec     esi
CFFF    dec     edi
D0FF    call    rax
D1FF    call    rcx
D2FF    call    rdx
D3FF    call    rbx
D4FF    call    rsp
D5FF    call    rbp
D6FF    call    rsi
D7FF    call    rdi
E0FF    jmp     rax
E1FF    jmp     rcx
E2FF    jmp     rdx
E3FF    jmp     rbx
E4FF    jmp     rsp
E5FF    jmp     rbp
E6FF    jmp     rsi
E7FF    jmp     rdi

But it seems to be missing instructions like

4801C3         add     rbx, rax

4839D8         cmp     rax, rbx

488D0500000000  lea     rax, [rip]

488B0500000000  mov     rax, [rip]

48890500000000  mov     [rip], rax

and other instructions that are not just 2 bytes long.

I apologize for my huge oversight in the 'increase' function. I have now rewritten the entire code, and it now does what I expected, though I still have more questions.

new code:

#include <iostream>
#include <memory>

#pragma warning(push)
#pragma warning(disable : 26812)
#include <capstone/capstone.h>
#pragma comment(lib, "capstone.lib")
#pragma warning(pop)

#include <cstdlib>
#include <string>
#include <sstream>
#include <unordered_set>

class instruction
{
private:

    size_t count = 0;
    size_t size;
    uint8_t* bytes;

#pragma warning(push)
#pragma warning(disable : 26812)
    cs_err error = CS_ERR_HANDLE;
#pragma warning (pop)
    csh hndl = { 0 };
    cs_arch arch;
    cs_mode mode;

    cs_insn* insn = {0};

    std::string mnemonic;
    std::string operand;

    void disasm()
    {
        cs_free(insn, count);

        // parenthesize the assignment: == binds tighter than =
        if ((count = cs_disasm(hndl, bytes, size, 0, 1, &insn)) == 1)
        {
            mnemonic = insn->mnemonic;
            operand = insn->op_str;
        }
    }
    
public:

#pragma warning(push)
#pragma warning(disable : 26812)
    instruction(size_t n, cs_arch arch, cs_mode mode) :size(n), bytes(static_cast<uint8_t*>(std::calloc(n, sizeof(uint8_t)))), mode(mode), arch(arch)
#pragma warning(pop)
    {
        error = cs_open(arch, mode, &hndl);
        if (error != CS_ERR_OK)
        {
            // handle error
        }

        disasm();

    }

    ~instruction()
    {
        if (hndl != 0) cs_close(&hndl);

        std::free(bytes);
    }

    int inc(int n = 0)
    {

        if (n < size)
        {
            if (bytes[n] < 0xFF)
            {
                bytes[n] = bytes[n] + 1;
                disasm();
                return 0;
            }
            else
            {
                bytes[n] = 0x00;
                return inc(n + 1);  // propagate the carry and its result
            }
        }
        else { return -1; }

    }

    std::string getIns()
    {
        if (insn == 0)return "";
        return mnemonic + '\t' + operand;
    }

    std::string getInsHex()
    {
        std::stringstream str;

        for (int i = 0; i < size; i = i + 1)
        {
            // pad each byte to two hex digits so that e.g. 0x05 prints as "05"
            if (bytes[i] < 0x10) str << '0';
            str << std::hex << std::uppercase << static_cast<int>(bytes[i]);
        }
        return str.str();
    }

    std::string getmne() { return mnemonic; }

};

std::unordered_set<std::string> insts;
std::unordered_set<std::string> mne;

int main()
{

    size_t size = 1;

    instruction ins(size, CS_ARCH_X86, CS_MODE_16);

    for (;(ins.inc() == 0);)
    {

        std::cout << ins.getInsHex() << "\n";

        if (insts.find(ins.getIns()) == insts.end())
        {
            if (mne.find(ins.getmne()) == mne.end())
            {
                mne.insert(ins.getmne());
            }

            insts.insert(ins.getIns());
            std::cout << '\t' << ins.getIns() << std::endl;
        }
    }

    //uint64_t i = 1;

    //for (const auto& str : mne) {
    //  std::cout << i << " : " << str << "\n";
    //  i += 1;
    //}


    return EXIT_SUCCESS;
}

I am still confused about how an architecture can take instructions larger than its bit mode suggests, for example how x86 in 32-bit mode can still take instructions that are shorter or longer than 32 bits. So I wonder whether the bit mode is more of a suggestion and things under the hood are more complicated, because I read somewhere that some instructions have extensions, which confused me even more. My question now is: how do these architectures maintain such backwards compatibility, and how can they take instructions that are larger or smaller than what the bit mode suggests?

Answer by Margaret Bloom

There is a (huge) bug in your increase method and a (huge) fundamental mistake in your approach.

Your increase method is not incrementing a byteSize-byte long integer. Instead, it's counting like this (for byteSize = 4):

0
1
...
ff
1ff
2ff
...
ffff
1ffff
...
ffffff
1ffffff
...
ffffffff

That's a total of 256*byteSize values out of 256^byteSize actual possible values.

It's easy to fix the counting: always increment bytes[i] and return 0 if after the increment it isn't 0.
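
A minimal sketch of that fix, treating the buffer as one little-endian integer and propagating the carry. The names `increment` and `buf` are made up for illustration and are not part of the code in the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Treats buf as a little-endian multi-byte counter and adds one.
// Returns false once the counter wraps around to all zeros.
bool increment(std::vector<std::uint8_t>& buf)
{
    for (std::size_t i = 0; i < buf.size(); ++i) {
        ++buf[i];          // always increment this byte
        if (buf[i] != 0)   // no wrap-around, so no carry: done
            return true;
        // the byte wrapped to 0: carry into the next byte
    }
    return false;          // every byte wrapped: all values were visited
}
```

Starting from an all-zero 2-byte buffer, this visits every value once before returning false. That is only practical for very small buffers, for the reasons given further down.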

With that weird counting, you are generating all the valid 1-byte opcodes (note that f0 is a lock prefix; Capstone is using the zero bytes in your 8-byte buffer to attach it to an instruction), but when you get to ff, your buffer is stuck with 0xff in the first byte, and that limits the possible instructions to inc, dec, jmp, call and push.
Note how many values are missing because they don't encode any instruction (e.g. ffd8).
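
To see why only those mnemonics show up after ff, here is a rough sketch of how the byte following opcode ff (the ModRM byte) selects the operation in 64-bit mode. The function name `ff_group` is hypothetical, the sketch ignores prefixes, and Capstone prints the far forms as `call ptr [...]` and `jmp ptr [...]`.

```cpp
#include <cstdint>
#include <string>

// Classifies the ModRM byte that follows opcode FF.
// Bits 7-6 are "mod", bits 5-3 are "reg" (the operation selector),
// bits 2-0 are "rm" (not needed to pick the operation).
std::string ff_group(std::uint8_t modrm)
{
    const unsigned mod = modrm >> 6;
    const unsigned reg = (modrm >> 3) & 7;
    switch (reg) {
        case 0:  return "inc";
        case 1:  return "dec";
        case 2:  return "call";
        case 3:  return mod == 3 ? "invalid" : "call far"; // far forms need a memory operand
        case 4:  return "jmp";
        case 5:  return mod == 3 ? "invalid" : "jmp far";
        case 6:  return "push";
        default: return "invalid";                         // FF /7 is not defined
    }
}
```

For example, `ff_group(0xD0)` gives "call" (ff d0 is `call rax` in the output above), while `ff_group(0xD8)` and `ff_group(0xFF)` both give "invalid", which is why ffd8 and ffff never appear.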

Once you get to ffff your buffer is stuck with these two bytes at the beginning, and they don't encode any instruction. Unless CS_OPT_SKIPDATA is enabled, Capstone stops at the first byte sequence it cannot decode, so from this point on cs_disasm returns 0 instructions for every buffer until the buffer is "full". Even with skipping enabled, it would only encounter the same byte sequences as before (ff01, ff02, ...).
Since you are only printing new instructions, the output stops.

Note that you are printing the opcode bytes as a single 64-bit number; since x86 is little-endian, this ends up reversing the order of the bytes when printed.
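
One way to avoid the reversal is to print the bytes one at a time, in memory order. A small sketch follows; `bytes_to_hex` is a made-up helper name, not a Capstone function.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Formats the first `size` bytes in memory order, two hex digits per byte.
std::string bytes_to_hex(const std::uint8_t* bytes, std::size_t size)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(size * 2);
    for (std::size_t i = 0; i < size; ++i) {
        out += digits[bytes[i] >> 4];    // high nibble
        out += digits[bytes[i] & 0x0F];  // low nibble
    }
    return out;
}
```

In the question's loop this could be called as `bytes_to_hex(insn[i].bytes, insn[i].size)`, which also limits the output to the bytes that actually belong to the decoded instruction.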

You are also parsing the same byte sequences over and over again. The sequence 0000 is present four times in the buffer, in the first 256 iterations.


The fundamental mistake is that you cannot treat x86 instructions as numbers; that would be a dead end.
You can easily see this: an instruction like mov rax, imm64 has the same prefix for all the possible 2^64 values of imm64. So you would have to enumerate all of them.
So, if your program could decode 1 billion instructions per second (spoiler: it can't) it would take 584 years just to enumerate all possible mov rax, imm64 forms.
And that's just one instruction.
When printed, each instruction is about 30 bytes. It would require 480 Exbibytes of storage to save the output of your program. Again, for the sole mov rax, imm64 instruction.
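
The arithmetic behind those two figures can be checked with a few lines. This is only a back-of-the-envelope sketch; the function names are invented for illustration, and a year is taken as 365.25 days.

```cpp
#include <cmath>

// Number of distinct imm64 values: 2^64 (exactly representable as a double).
double imm64_values()
{
    return std::ldexp(1.0, 64);
}

// Years needed to enumerate them at `per_second` decodes per second.
double years_to_enumerate(double per_second)
{
    const double seconds_per_year = 365.25 * 24 * 60 * 60;
    return imm64_values() / per_second / seconds_per_year;
}

// Storage in exbibytes (2^60 bytes) at `bytes_per_line` bytes of text per instruction.
double exbibytes_of_output(double bytes_per_line)
{
    return imm64_values() * bytes_per_line / std::ldexp(1.0, 60);
}
```

`years_to_enumerate(1e9)` comes out at a little over 584, and `exbibytes_of_output(30)` is 480.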

You cannot enumerate x86 instructions by generating all the possible 8-byte sequences, let alone the 15-byte sequences that are needed for the full range of the decoding space.


It's not clear why you want to do this.
If you need the list of assembly instructions, Intel SDM Volume 2 (and the ISA Extension manual) has it.
If you want to find undocumented instructions, you need a better approach, and you certainly cannot use Capstone as the verifier.
If you want to fuzz Capstone, you may want to read about proper fuzzing and use a different approach (note that Capstone seems to be fuzzed automatically by their build pipeline).