Is there any way to make this ESP32 code faster? I'm out of ideas at this point

221 views Asked by At

I'm working on an ESP32-based hard drive emulator for a vintage computer (the Apple Lisa) and I've been having some issues with the performance of my code. My emulator works fine on one model of the Lisa, but the other Lisa model performs its hard disk operations much faster, and the emulator just can't keep up. The particular portion of the code that's being too slow can be found below:


#define busOffset 12 // The parallel bus starts on ESP32 pin 12.
#define CMDPin 21 // CMD is on ESP32 pin 21.
#define STRBPin 25 // STRB is on ESP32 pin 25.

uint32_t IRAM_ATTR prevState = 1; // Variables used for falling-edge detection on the strobe signal.
uint32_t IRAM_ATTR currentState = 1;
uint8_t blockData[532]; // The 532-byte hard disk block that we're currently working with.
volatile uint32_t IRAM_ATTR continueLoop = true;
byte *pointer;

void IRAM_ATTR exitLoopISR(){ // An ISR that clears the continueLoop flag when it's time to stop the data transfer.
  continueLoop = false;
}

void emulatorRead(){
  pointer = blockData; // Make our pointer point to the start of the blockData array.

  REG_WRITE(GPIO_OUT_W1TS_REG, *pointer << busOffset); // Put the first data byte on the bus and increment the value of the pointer.
  REG_WRITE(GPIO_OUT_W1TC_REG, ((byte)~*pointer++ << busOffset)); 

  attachInterrupt(CMDPin, exitLoopISR, FALLING); // Attach a falling-edge interrupt to CMD so that whenever the host computer lowers CMD, exitLoopISR will be called and will make us exit the data transfer loop.

  while(continueLoop){ // Keep transferring data until the host computer lowers CMD.
    currentState = bitRead(REG_READ(GPIO_IN_REG), STRBPin); // Check the current state of the data strobe.
    if(currentState == 0 and prevState == 1){ // If we're on the falling edge of the strobe, it's time to put the next data byte on the bus.
      REG_WRITE(GPIO_OUT_W1TS_REG, *pointer << busOffset); // So go ahead and put it on the bus. And then increment our blockData pointer to the next byte.
      REG_WRITE(GPIO_OUT_W1TC_REG, ((byte)~*pointer++ << busOffset));
    }
    prevState = currentState; // Set the previous strobe state to the current strobe state to prepare for the next iteration of the loop.
  }

  detachInterrupt(CMDPin); // Detach the CMD pin interrupt now that we're done.
}

I've included the declarations of the variables that are used within the emulatorRead function as well as the ISR that the data transfer loop uses, as well as the snippet of emulatorRead that's giving me problems. Basically, it's supposed to grab the next byte out of the 532-byte blockData array and put it onto the parallel bus each time it detects a falling edge on the data strobe. It's supposed to keep doing this until the CMD pin goes low (which is what the ISR is for). And yes, I know that we'll run outside the bounds of the blockData array if the emulator receives more than 532 strobe pulses before CMD goes low, but that's something I can worry about once I get the darn thing working in the first place! The problem is that the strobe is just too fast on this model of the Lisa and the emulator misses a couple strobe pulses. From my testing, it usually only misses one or two of them out of 532, but this is obviously enough to screw everything up!

I've tried a bunch of stuff, but nothing has made the code fast enough to work properly. I tried changing the compiler optimization flag from -Os to -Ofast in the platforms.txt file to no avail and I also tried using interrupts to detect the falling edges of the strobe pulses and to write the data to the parallel bus, but that actually ended up being even slower than the polling method that I'm using here! With the interrupt method, it would usually miss 3 or 4 strobe pulses, while the polling method only misses 1 or 2. I've also tried putting certain variables (prevState, currentState, and continueLoop) into IRAM in an attempt to speed things up, but this didn't seem to work either. I had to make all of these variables 32-bit, even if they don't need to be, because IRAM seems to require all quantities stored within it to be 32 bits long (I kept getting kernel panics with a LoadStoreError whenever I used non-32-bit quantities).

So can anyone think of any more optimizations that I could make to actually get this working? I'm out of ideas for speeding up the loop at this point, so I'd really appreciate some help!

2

There are 2 answers

4
Marcus Müller On

changing to interrupt service routines and back takes time. This is one of the cases where using an Interrupt doesn't help you at all – you're checking the state of your pin all the time, anyways.

So, either,

  • don't use that interrupt at all, you're polling the state of that pin anyways, so you're winning nothing by not checking it in your loop, or
  • don't use the loop at all, but do the two register writes in the ISR.

The first would have lower latency at the cost of using 100% of your CPU while you're waiting for that edge (which you're using currently already), and the second would leave your CPU unused and allow you to do something else in the mean time, while still having slightly better latency than your current approach.

0
Nilsie On

First I have to agree with @Marcus Müller that attaching and detaching interrupt like this generally induce unneeded overhead.

That aside, seeing the assembler output is usually a very good tool when programming close to the hardware. The compiler may or may not be cleaver when it comes to generating tight code. For example, bitRead() looks like a function and not an inline MACRO. Would it help to instead use:

#define STRBPin 2^25 // == 0x02000000 or 0b10 0000 0000 0000 0000 0000 0000 I think
...
currentState = REG_READ(GPIO_IN_REG) & STRBPin;

and then follow up with:

if(currentState == 0 and prevState == 0x02000000)

Perhaps?

It may also help to first do the busOffset shifting for the entire blockData[532] before starting to wright to any registers.

I'm also a bit confused by (but it's late night for me so...):

REG_WRITE(GPIO_OUT_W1TS_REG, *pointer << busOffset); 
REG_WRITE(GPIO_OUT_W1TC_REG, ((byte)~*pointer++ << busOffset));

Why are you actually doing here? Why are yo shifting the same byte twice? The compiler should remove the 2nd but without the assembler output, I wouldn't gamble on it.

Also, the variables int32_t IRAM_ATTR prevState; and uint32_t IRAM_ATTR currentState are probaly better located within the scope of the emulatorRead() function. That may help the compiler to use registers instead of RAM for their respective storage. The same goes for pointer.