How to prevent "partial write" data corruption during power loss?


In an embedded environment (using MSP430), I have seen some data corruption caused by partial writes to non-volatile memory. This seems to be caused by power loss during a write (to either FRAM or info segments).

I am validating data stored in these locations with a CRC.

My question is, what is the correct way to prevent this "partial write" corruption? Currently, I have modified my code to write to two separate FRAM locations. So, if one write is interrupted causing an invalid CRC, the other location should remain valid. Is this a common practice? Do I need to implement this double write behavior for any non-volatile memory?


There are 4 answers

3
Clifford On BEST ANSWER

A simple solution is to maintain two versions of the data (in separate pages for flash memory): the current version and the previous version. Each version has a header comprising a sequence number and a word that validates the sequence number, for example simply the one's complement of the sequence number:

---------
|  seq  |
---------
| ~seq  |
---------
|       |
| data  |
|       |
---------

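The critical thing is that, when the data is written, the seq and ~seq words are written last. That way an interrupted write can never leave a half-written block that looks like the newest valid version.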

On start-up you read the data that has the highest valid sequence number (accounting for wrap-around perhaps - especially for short sequence words). When you write the data, you overwrite and validate the oldest block.
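The scheme above could be sketched roughly as follows. This is only an illustration, not the exact code from any project: the two blocks are simulated with a RAM array (`nv_slot`), and `DATA_SIZE`, `nv_write`, `nv_read` and `nv_newest` are hypothetical names. It also assumes a 16-bit word write is atomic on the target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DATA_SIZE 8u /* hypothetical payload size */

typedef struct {
    uint16_t seq;             /* sequence number, written last */
    uint16_t nseq;            /* one's complement of seq, written last */
    uint8_t  data[DATA_SIZE];
} slot_t;

/* Stand-in for two FRAM blocks or flash pages (plain RAM in this sketch). */
slot_t nv_slot[2];

static int slot_valid(const slot_t *s)
{
    return s->nseq == (uint16_t)~s->seq;
}

/* Nonzero if a is newer than b, tolerating 16-bit wrap-around. */
static int seq_newer(uint16_t a, uint16_t b)
{
    uint16_t diff = (uint16_t)(a - b);
    return diff != 0u && diff < 0x8000u;
}

/* Index of the newest valid slot, or -1 if neither header validates. */
int nv_newest(void)
{
    int v0 = slot_valid(&nv_slot[0]);
    int v1 = slot_valid(&nv_slot[1]);
    if (v0 && v1)
        return seq_newer(nv_slot[1].seq, nv_slot[0].seq) ? 1 : 0;
    if (v0) return 0;
    if (v1) return 1;
    return -1;
}

/* Overwrite the older slot: payload first, header words last. */
void nv_write(const uint8_t *data)
{
    int cur = nv_newest();
    slot_t *s = &nv_slot[(cur == 0) ? 1 : 0];
    uint16_t seq = (cur < 0) ? 1u : (uint16_t)(nv_slot[cur].seq + 1u);

    assert(data != NULL);
    memcpy(s->data, data, DATA_SIZE);
    /* On real hardware, make sure the payload write has completed here. */
    s->seq  = seq;
    s->nseq = (uint16_t)~seq;
}

/* Copy the newest valid payload into out; returns 0 on success, -1 if none. */
int nv_read(uint8_t *out)
{
    int cur = nv_newest();
    if (cur < 0) return -1;
    memcpy(out, nv_slot[cur].data, DATA_SIZE);
    return 0;
}
```

If power fails while the payload of the older slot is being overwritten, that slot still carries its old, lower sequence number, so the reader keeps choosing the intact slot.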

The solution you are already using is valid so long as the CRC is written last, but it lacks simplicity and imposes a CRC calculation overhead that may not be necessary or desirable.

On FRAM you have no concern about endurance, but it is an issue for flash memory and EEPROM. In that case I use a write-back cache: the data is maintained in RAM, and whenever it is modified a timer is started (or restarted if it is already running). When the timer expires, the data is written. This prevents bursts of writes from thrashing the memory, and it is useful even on FRAM since it minimises the software overhead of data writes.
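A minimal sketch of that write-back idea, assuming a periodic tick (for example from a timer interrupt) drives `cache_tick()`. All names here (`cache_set`, `cache_tick`, `FLUSH_DELAY_TICKS`, `nv_copy`) are made up for illustration, and the non-volatile memory is simulated with a RAM array:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SIZE        8u /* hypothetical size of the cached data */
#define FLUSH_DELAY_TICKS 5u /* hypothetical delay before write-back */

static uint8_t  cache[CACHE_SIZE]; /* working copy in RAM */
static uint16_t ticks_left;        /* 0 means the timer is stopped */

uint8_t  nv_copy[CACHE_SIZE];      /* stand-in for the non-volatile copy */
unsigned nv_write_count;           /* counts simulated NV writes */

/* Modify the cached data and start (or restart) the write-back timer. */
void cache_set(unsigned idx, uint8_t value)
{
    assert(idx < CACHE_SIZE);
    if (cache[idx] == value)
        return;                    /* no change, no write needed */
    cache[idx] = value;
    ticks_left = FLUSH_DELAY_TICKS;
}

/* Call once per timer tick; writes the data when the timer expires. */
void cache_tick(void)
{
    if (ticks_left != 0u && --ticks_left == 0u) {
        /* A real system would use its power-fail-safe write here. */
        memcpy(nv_copy, cache, CACHE_SIZE);
        nv_write_count++;
    }
}
```

The trade-off is that changes made within the delay window are lost if power fails before the timer expires, so the delay has to suit the application.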

0
peter_mcc On

We've used something similar to Clifford's answer but written in one write operation. You need two copies of the data and alternate between them. Use an incrementing sequence number so that effectively one location has even sequence numbers and one has odd.

Write the data like this (in one write command if you can):

---------
|  seq  |
---------
|       |
| data  |
|       |
---------
|  seq  |
---------

When you read it back make sure both the sequence numbers are the same - if they are not then the data is invalid. At startup read both locations and work out which one is more recent (taking into account the sequence number rolling over).
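Here is a rough sketch of that layout, with the storage simulated in RAM. The names (`rec_t`, `nv_rec`, `rec_write`, `rec_read`, `rec_newest`) and the payload size are hypothetical. Note one assumption: blank memory where both sequence words happen to match (all zeros, for instance) passes the check, so a real design would need to initialise both locations once.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DATA_SIZE 8u /* hypothetical payload size */

typedef struct {
    uint16_t head;            /* sequence number before the data */
    uint8_t  data[DATA_SIZE];
    uint16_t tail;            /* same sequence number after the data */
} rec_t;

/* Stand-in for the two non-volatile locations (plain RAM in this sketch). */
rec_t nv_rec[2];

static int rec_valid(const rec_t *r)
{
    return r->head == r->tail;
}

/* Index of the more recent valid record, or -1 if both are invalid. */
int rec_newest(void)
{
    int v0 = rec_valid(&nv_rec[0]);
    int v1 = rec_valid(&nv_rec[1]);
    if (v0 && v1) {
        /* Unsigned difference handles the sequence number rolling over. */
        uint16_t diff = (uint16_t)(nv_rec[1].head - nv_rec[0].head);
        return (diff != 0u && diff < 0x8000u) ? 1 : 0;
    }
    if (v0) return 0;
    if (v1) return 1;
    return -1;
}

/* Write to the location that does not hold the newest record. */
void rec_write(const uint8_t *data)
{
    int cur = rec_newest();
    rec_t *r = &nv_rec[(cur == 0) ? 1 : 0];
    uint16_t seq = (cur < 0) ? 0u : (uint16_t)(nv_rec[cur].head + 1u);

    assert(data != NULL);
    /* Written front to back, as one write command would do it. */
    r->head = seq;
    memcpy(r->data, data, DATA_SIZE);
    r->tail = seq;
}

/* Copy the newest valid payload into out; returns 0 on success, -1 if none. */
int rec_read(uint8_t *out)
{
    int cur = rec_newest();
    if (cur < 0) return -1;
    memcpy(out, nv_rec[cur].data, DATA_SIZE);
    return 0;
}
```

Because writes alternate between the two locations and the sequence number goes up by one each time, one location ends up with the even numbers and the other with the odd ones.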

5
DrRobotNinja On

Our engineering team takes a two-pronged approach to this problem: solve it in hardware and software!

The first is a diode and capacitor arrangement to provide a few milliseconds of power during a brown-out. If we notice we've lost external power, we prevent the code from starting any non-volatile writes.

Second, our data is critical for operation and updates often, and we don't want to wear out our non-volatile flash storage (it only supports a limited number of writes). So we actually store the data 16 times in flash and protect each record with a CRC. On boot, we find the newest valid record and then start our erase/write cycles from there.

We've never seen data corruption since implementing our frankly paranoid system.

Update:

I should note that our flash is external to our CPU, so the CRC helps validate the data if there is a communication glitch between the CPU and the flash chip. Furthermore, if we experience several glitches in a row, the multiple writes protect against data loss.
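For illustration only, a rotating 16-record store along these lines might look like the sketch below. It is not the team's actual code: the per-record sequence counter used to find the newest record, the CRC variant (CRC-16/CCITT-FALSE), the record layout and all names are assumptions, and the flash (including its erase step) is simulated with a RAM array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_SLOTS 16 /* number of record slots, as in the answer */
#define DATA_SIZE 8u /* hypothetical payload size */

typedef struct {
    uint32_t seq;             /* assumed counter used to find the newest record */
    uint8_t  data[DATA_SIZE];
    uint16_t crc;             /* CRC over seq and data */
} record_t;

/* Stand-in for the external flash (plain RAM, no erase modelled). */
record_t flash_sim[NUM_SLOTS];

/* Bitwise CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF. */
uint16_t crc16_ccitt(const uint8_t *p, size_t len)
{
    uint16_t crc = 0xFFFFu;
    while (len--) {
        crc ^= (uint16_t)((uint16_t)*p++ << 8);
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}

static int record_valid(const record_t *r)
{
    return crc16_ccitt((const uint8_t *)r, offsetof(record_t, crc)) == r->crc;
}

/* Index of the newest record with a good CRC, or -1 if none is valid. */
int log_newest(void)
{
    int best = -1;
    for (int i = 0; i < NUM_SLOTS; i++) {
        if (!record_valid(&flash_sim[i]))
            continue;
        if (best < 0 || flash_sim[i].seq > flash_sim[best].seq)
            best = i;
    }
    return best;
}

/* Write the next record into the slot after the newest valid one. */
void log_append(const uint8_t *data)
{
    int cur = log_newest();
    record_t r;

    assert(data != NULL);
    memset(&r, 0, sizeof r);
    r.seq = (cur < 0) ? 1u : flash_sim[cur].seq + 1u;
    memcpy(r.data, data, DATA_SIZE);
    r.crc = crc16_ccitt((const uint8_t *)&r, offsetof(record_t, crc));
    flash_sim[(cur + 1) % NUM_SLOTS] = r;
}
```

Rotating through the slots spreads the wear, and a record damaged by a torn write or a bus glitch simply fails its CRC, so the boot scan falls back to the previous good record.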

0
Indeep On

Always store data in some kind of protocol, such as START_BYTE, total number of bytes to write, data, END_BYTE. Before writing to external or internal memory, always check the power monitor registers or the ADC. If a write is cut short, the END byte will be missing or corrupt, so the entry will not be valid once the whole protocol is checked. A simple checksum is not a good idea; use a CRC16 instead if you want to include a check value in your protocol.
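A small sketch of such a framing, under assumptions: the marker values `0xA5` and `0x5A`, the one-byte length field and the function names are all invented for illustration. The CRC16 field mentioned above is left out to keep it short; it could be added just before the END byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_START    0xA5u /* hypothetical START_BYTE value */
#define FRAME_END      0x5Au /* hypothetical END_BYTE value */
#define FRAME_OVERHEAD 3u    /* START + length + END */

/* Build START | len | data | END in buf; returns frame size, or 0 if too small. */
size_t frame_encode(uint8_t *buf, size_t cap, const uint8_t *data, uint8_t len)
{
    size_t total = (size_t)len + FRAME_OVERHEAD;

    assert(buf != NULL && data != NULL);
    if (cap < total)
        return 0;
    buf[0] = FRAME_START;
    buf[1] = len;
    memcpy(&buf[2], data, len);
    buf[2u + len] = FRAME_END; /* last byte written, so a cut-off write lacks it */
    return total;
}

/* Validate the whole frame; returns the payload length, or -1 if invalid. */
int frame_decode(const uint8_t *buf, size_t buflen, uint8_t *out, size_t outcap)
{
    uint8_t len;

    assert(buf != NULL && out != NULL);
    if (buflen < FRAME_OVERHEAD || buf[0] != FRAME_START)
        return -1;
    len = buf[1];
    if ((size_t)len + FRAME_OVERHEAD > buflen || len > outcap)
        return -1;
    if (buf[2u + len] != FRAME_END)
        return -1;
    memcpy(out, &buf[2], len);
    return (int)len;
}
```

On its own, the END byte only catches writes that stopped early; it does not detect corrupted bytes in the middle of the payload, which is where the CRC16 would help.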