Could this shift/jump be faster than switch...case statement?

255 views Asked by At

I'm trying to optimize a branch (a switch...case like) at its max to emulate an X CPU on an x86 CPU. I thought of this: In memory I'll load blocks of x86 opCodes with a fixed length of 0x100 bytes, like this:

first block 
0
...[my code, jump at 0x10000, nop nop nop until 0x9F...]
0x9F
second block 
0x100
...
019F
third block
0x200
...
0x29F
...
etc, etc
...
0x10000

which will be finite, start at memory $0 (+ maybe some offset) and end at $0xFFFF (like a "rom" of 0x10000 size). Now each time an X CPU opCode is being fetched and emulated I'll do this: shift it left by 8 bits and jump to that location. Execute this, and continue my program flow normally. My questions are: 1) Is this even possible to be so tight with those opCode blocks? 2) Was that a common practice in the past?

1

There are 1 answers

12
Ira Baxter On BEST ANSWER

If you are branching across 256 opcodes through a switch block, you're going to be doing an indirect jump which the CPU cannot predict well, and that will get you a pipeline break on every opcode.

If the work to emulate an opcode is fair size, then this pipeline break may not matter a lot. I suspect it does; a "load register" opcode is simulated by essentially just a memory read, which isn't lot of work.

You might buy some visible improvement, by adding special tests just before the switch block, that check for the two or three most frequent opcodes (probably LOAD, CMP, JMP conditional) [If opcode = JMP then ...] These tests the CPU can typically predict well. If you do this, measure, measure, measure.

A sleazier trick is to amortize the cost of the pipeline break across multiple instructions if you can. If the machine has a lot of single-byte opcodes, you might consider doing a 65536 way branch across the next two opcode bytes. (Now you have to code a lot of switch cases, but many of them are essentially the same. Wonder if your compiler can handle it?) In the abstract, this cuts the pipeline break costs by a factor of two. For a specific machine with a very regular instruction set this may not buy you a lot.

However, you may not have a lot of single-byte opcodes, but you may need to decode one or more bytes for each instruction. The x86 is like this (prefix, opcode, MODRM, SIB, offset...). The big switch case should work pretty well for this.

It probably is good to align each switch case on a cache line boundary. If the instruction emulation is simple, the code will likely fit in the cache line and so the memory sees only one fetch for that cache line. If you don't do this, your instruction emulations will have a higher chance of crossing a cache line boundary, raising the memory fetch costs to two. This may not matter for frequently executed instructions, but code for rarely executed instructions may fall out the cache. This will help when you actually encounter one of these.

Last bit of advice: measure, measure, measure.