Software memory testing for bus failures


I have a board with quite a few flash chips, some of which show intermittent failures. Standard memory tests do not point to any specific problem addresses; all they show is that certain chips fail intermittently under mechanical and thermal stress.

Suspecting the actual connections and not the flash cells themselves, I'm looking for a way to test the parallel bus for address or data pin errors.

There are some memory tests, but they apply better to RAM than to flash memory (http://www.ganssle.com/testingram.htm). Specifically, writing a value to parallel flash takes a sequence of bus writes, so a write/verify failure could just as easily come from the write operation, and that could involve any pin on the bus.

Ideas welcome...


There are 4 answers

0
Michael On BEST ANSWER

I've decided to design a memory pattern from which I think I can deduce both data and address errors. The concept is to use values that differ significantly from one another as key indicators of possible read errors, and to detect a failure on one pin at a time.

The test reads alternately from only the bottom and top addresses (0x000000 and 0x3FFFFF; my chip has 22 address lines). In those locations I put 0xFF and 0x00 respectively (the chip is byte wide). The idea is that every address line and every data line flips between consecutive reads, so a fault on any one of them shows up. (All other values in the flash differ from both 0x00 and 0xFF by at least 3 bits.)

There are 44 addresses that a single pin failure could send me to in error: 22 that differ from 0x000000 in one address bit and 22 that differ from 0x3FFFFF in one address bit. In each of those addresses I put one of 22 values that identifies which of the 22 address pins was flipped. These values differ from each other by at least 2 bits and from 00 and FF by at least 3 bits. (I tried for 3 bits different from each other, but with 8 bits I could only get 14 values.)

        07,0B,0D,0E,16,1A,1C,1F,25,29,2C,
        2F,34,38,3D,3E,43,49,4A,4F,52,58

In the remaining addresses I put a repeating pattern of six values: 33, 55, 66, 99, AA, CC (each at least 3 bits different from all other values). value(address) = nicePattern[(number of bits set in address) % 6];
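The pattern described above can be sketched as a small Python model. It is only an illustration, not the code used on the board: the names `expected_value`, `MARKERS`, and `NICE_PATTERN` are made up here, and assigning the marker values to address pins 0 through 21 in list order, with the same marker at both the low and the high neighbor of each pin, is an assumption.

```python
ADDR_BITS = 22
TOP = (1 << ADDR_BITS) - 1  # 0x3FFFFF

# Marker values from the answer. ASSUMPTION: list index == address pin number.
MARKERS = [
    0x07, 0x0B, 0x0D, 0x0E, 0x16, 0x1A, 0x1C, 0x1F, 0x25, 0x29, 0x2C,
    0x2F, 0x34, 0x38, 0x3D, 0x3E, 0x43, 0x49, 0x4A, 0x4F, 0x52, 0x58,
]
NICE_PATTERN = [0x33, 0x55, 0x66, 0x99, 0xAA, 0xCC]


def expected_value(addr):
    """Return the byte the test image holds at `addr` (hypothetical helper)."""
    if addr == 0:
        return 0xFF
    if addr == TOP:
        return 0x00
    ones = bin(addr).count("1")
    if ones == 1:
        # One address pin flipped away from 0x000000.
        return MARKERS[addr.bit_length() - 1]
    if ones == ADDR_BITS - 1:
        # One address pin flipped away from 0x3FFFFF.
        return MARKERS[(addr ^ TOP).bit_length() - 1]
    # Everything else: pattern indexed by the number of set address bits.
    return NICE_PATTERN[ones % 6]
```

With this model, `expected_value(0x000001)` and `expected_value(0x3FFFFE)` both return the marker for address pin 0.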

I tested this and have collected hundreds of intermittent failure incidents, synchronized to the mechanical stress:

  • single bit errors detectable
  • double bit errors deducible (Explainable by a combination of frequent single bit errors)
  • 3 or more bit errors (generally inconclusive)

Even though some of the chips had 3 failing pins, 70% of the incidents were single bit (they usually didn't fail at the same time)
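How a single incident might be classified under this scheme, as a hedged sketch: `classify` is a hypothetical helper, and the pin numbering follows the assumed list order of the marker values. Because the markers are at least 3 bits away from 0x00 and 0xFF, one flipped data bit can never be mistaken for an address-pin marker.

```python
# Marker values from the answer. ASSUMPTION: list index == address pin number.
MARKERS = [
    0x07, 0x0B, 0x0D, 0x0E, 0x16, 0x1A, 0x1C, 0x1F, 0x25, 0x29, 0x2C,
    0x2F, 0x34, 0x38, 0x3D, 0x3E, 0x43, 0x49, 0x4A, 0x4F, 0x52, 0x58,
]


def classify(expected, actual):
    """Classify one read of 0x000000 (expected 0xFF) or 0x3FFFFF (expected 0x00)."""
    if actual == expected:
        return "ok"
    if actual in MARKERS:
        # The read landed on a neighboring address: one address pin flipped.
        return "address pin %d" % MARKERS.index(actual)
    diff = expected ^ actual
    if bin(diff).count("1") == 1:
        # Right address, exactly one data bit wrong.
        return "data pin %d" % (diff.bit_length() - 1)
    # Two or more simultaneous faults need the incident history to untangle.
    return "inconclusive"
```

For example, reading 0xFB where 0xFF was expected is reported as `data pin 2`, and reading 0x07 is reported as `address pin 0`.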

The testing group is now using this to identify which specific connections are failing.

2
old_timer On

The typical memory tests are there to do that. I prefer a pseudo-random pattern (deterministic, generated with an LFSR) to the 0xAA, 0x55, 0xFF, 0x00 tests. This allows for an address bus test as well as a data bus test in two passes (repeat inverted). I say typical in the sense that they wiggle each data and address bit through both states and vary the states of signals relative to their neighbors. As for pounding on a RAM to create thermal or other stresses: you can't write very fast to a flash, so you can't really do fast write/read cycles.
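A minimal sketch of such a deterministic pseudo-random pattern, assuming a 16-bit Galois LFSR with the commonly published tap mask 0xB400 (period 65535) and a byte-wide memory. The function names and the seed are illustrative choices, not something the answer specifies.

```python
def lfsr16(state):
    """Advance a 16-bit Galois LFSR by one step (tap mask 0xB400)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state


def lfsr_period(seed):
    """Count the steps until the LFSR returns to `seed`."""
    state = lfsr16(seed)
    steps = 1
    while state != seed:
        state = lfsr16(state)
        steps += 1
    return steps


def pattern(n, seed=0xACE1, invert=False):
    """Return n pattern bytes; the same seed always gives the same bytes."""
    out = []
    state = seed
    for _ in range(n):
        state = lfsr16(state)
        byte = state & 0xFF
        out.append(byte ^ 0xFF if invert else byte)
    return out
```

The first pass programs `pattern(n)` and verifies it; the second pass repeats with `invert=True`, so every data bit at every address is checked in both states.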

Flash creates another problem: writing and then immediately reading back isn't that interesting. You want to write, then read back later (hours, days, weeks) to determine whether the part is actually holding its data.

When you say thermal or mechanical stress, do you mean that it fails only while it is above X degrees, or that thermal stress has left it broken all the time after the event? Likewise with mechanical stress: does the part fail while vibrating or under load but work again once the stress is relieved, or has the stress done permanent damage that can be detected whether or not the part is under stress?

Now, although you can't do fast write/read cycles, you can punish a flash by reading heavily. I have seen read-disturb problems caused by constant reading of one block or location. It is not necessarily something you have time to do for every location, but you might fill the flash with a pseudo-random pattern and concentrate on one location for a while (minutes, tens of minutes). If you have a part that you know is bad, see whether this accelerates detection of the problem, and whether any location will work or only certain ones. Another thing to try is to read all the locations repetitively for hours or days, or to let the part sit for hours, days, or weeks, and then do a read pass without an erase or write to see whether it has lost anything.

Unfortunately, as you probably know, each new failure case takes its own research project and the development of a new test.

0
user3115137 On

First step to test a memory is the data bus test.

  • In this test, the data bus wiring is tested to confirm that the value placed on the data bus by the processor is correctly received by the memory device at the other end.
  • An obvious way to test is to write all possible data values and verify them.
  • Each bit can be tested independently.
  • To perform the walking 1s test, write the first data value given in the table, verify by reading it back, write the second value, verify, and so on.
  • When you reach the end of the table, the test is complete.
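A minimal sketch of the walking 1s test, assuming an 8-bit data bus, so the values written are 0x01, 0x02, 0x04, and so on up to 0x80. The `write` and `read` callbacks are hypothetical: on real parallel flash, `write` would have to hide the erase and program command sequence. `FakeBus` is a made-up simulation used only to show the test reacting to a stuck data line.

```python
def walking_ones(write, read, width=8, addr=0):
    """Return the data bit positions whose walking-1 value failed to read back."""
    bad = []
    for bit in range(width):
        value = 1 << bit
        write(addr, value)
        if read(addr) != value:
            bad.append(bit)
    return bad


class FakeBus:
    """Simulated byte-wide memory; `stuck_low` masks data lines that always read 0."""

    def __init__(self, stuck_low=0):
        self.cells = {}
        self.stuck_low = stuck_low

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        return self.cells.get(addr, 0xFF) & ~self.stuck_low & 0xFF
```

With `bus = FakeBus(stuck_low=0x08)`, calling `walking_ones(bus.write, bus.read)` reports bit 3 as the failing data line.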

0
D Krueger On

In the linked article Jack Ganssle says: "Critical to this [test], and every other RAM test algorithm, is that you write the pattern to all of RAM before doing the read test."

Since reading should be isolated from writing, testing the flash is easier. Perform the writing portion of the tests while the system is not under stress. Then perform the reading portion with the system under stress. By recording the address, expected value, and actual value in enough error cases, you should be able to determine the source of the errors.
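One way the recorded error cases could be analyzed, as a hedged sketch: XOR the expected and actual values of each incident and count how often each data bit is wrong. The log format (tuples of address, expected, actual) and the name `tally_bit_errors` are assumptions made for illustration.

```python
from collections import Counter


def tally_bit_errors(log):
    """Count wrong data bits over (address, expected, actual) records."""
    counts = Counter()
    for _addr, expected, actual in log:
        diff = expected ^ actual
        bit = 0
        while diff:
            if diff & 1:
                counts[bit] += 1
            diff >>= 1
            bit += 1
    return counts
```

A data bit that dominates the tally across unrelated addresses points at that data line; errors that cluster at addresses differing in one particular address bit point at an address line instead.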

If the system never fails when doing the above, you can then perform the whole test while under stress. Any errors that appear are most likely write errors.