The gist of the problem is: in what ways can a userland app get corrupted while it is running, other than hardware failure?
Hardware rig: ARM9 (at91sam9xe), with NAND flash holding the Linux kernel, the FS and the userland app.
We had an app running on embedded Linux on an ARM9 (at91sam9xe). There were no problems for a couple of months, but then one ARM board suddenly reported being unable to execute the app.
When the app was executed, it crashed with the following dump:
pgd = c16b8000
[00000020] *pgd=215a0031, *pte=00000000, *ppte=00000000
Pid: 349, comm: console
CPU: 0 Not tainted (2.6.30.4-uc0 #280)
PC is at 0x4e000
LR is at 0x673e0
pc : [<0004e000>] lr : [<000673e0>] psr: 60000010
sp : bec6a728 ip : bec6acb4 fp : bec6ac9c
r10: 000bd9f8 r9 : 00000000 r8 : 00000000
r7 : 00000000 r6 : bec6acb4 r5 : 00000000 r4 : fbad2084
r3 : ffffffff r2 : bec6acb4 r1 : 00000025 r0 : 0009eab0
Flags: nZCv IRQs on FIQs on Mode USER_32 ISA ARM Segment user
Control: 0005317f Table: 216b8000 DAC: 00000015
[<c02ec3b0>] (show_regs+0x0/0x50) from [<c02f11a8>] (__do_user_fault+0x9c/0xa8)
r5:0000000b r4:c1696360
[<c02f110c>] (__do_user_fault+0x0/0xa8) from [<c02f1344>] (do_page_fault+0x114/0x244)
r7:00010000 r6:c1696360 r5:c15a62e0 r4:c1c5fde0
[<c02f1230>] (do_page_fault+0x0/0x244) from [<c02ea284>] (do_DataAbort+0x3c/0xa0)
[<c02ea248>] (do_DataAbort+0x0/0xa0) from [<c02eae00>] (ret_from_exception+0x0/0x10)
Exception stack(0xc1683fb0 to 0xc1683ff8)
3fa0: 0009eab0 00000025 bec6acb4 ffffffff
3fc0: fbad2084 00000000 bec6acb4 00000000 00000000 00000000 000bd9f8 bec6ac9c
3fe0: bec6acb4 bec6a728 000673e0 0004e000 60000010 ffffffff
I tried addr2line to see where it crashed, but it pointed to crtstuff.c. crtstuff.c is not part of our app; it's related to GCC, I think.
I feared my executable was corrupted, so I ran a diff between the file on NAND and the file from my PC. There were differences, which shouldn't happen. On top of that, in almost all of the differing locations the file on NAND contained 0x00 instead of the value it should contain.
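For anyone who wants to repeat that comparison, here is a minimal sketch of it in Python. It is not the tool I actually used; the function name diff_runs and the output format are made up for illustration. It groups differing bytes into contiguous runs and counts how many bytes in each run read back as 0x00 in the suspect copy, which makes a "zeroed-out" pattern easy to spot.

```python
import sys


def diff_runs(good: bytes, suspect: bytes):
    """Return (offset, length, zero_count) for each contiguous run of
    differing bytes. zero_count is how many bytes of the run are 0x00
    in the suspect copy. Only the common length of both inputs is compared."""
    runs = []
    start = None
    zeros = 0
    n = min(len(good), len(suspect))
    for i in range(n):
        if good[i] != suspect[i]:
            if start is None:
                # A new run of differences begins here.
                start, zeros = i, 0
            if suspect[i] == 0:
                zeros += 1
        elif start is not None:
            # The run ended on the previous byte.
            runs.append((start, i - start, zeros))
            start = None
    if start is not None:
        runs.append((start, n - start, zeros))
    return runs


# Hypothetical command-line use: script.py <known-good file> <file copied off NAND>
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], "rb") as f:
        good = f.read()
    with open(sys.argv[2], "rb") as f:
        suspect = f.read()
    for offset, length, zeros in diff_runs(good, suspect):
        print(f"0x{offset:08x}: {length} byte(s) differ, {zeros} read as 0x00")
    if len(good) != len(suspect):
        print(f"sizes differ: {len(good)} vs {len(suspect)} bytes")
```

If most runs report a zero count equal to their length, the damage looks like missing data rather than random bit flips.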
What I really want to know is: how can a userland app get corrupted, other than by hardware failure?
Cause: the NAND flash was always writeable, so our hypothesis was that the power went out at a moment when something was being written to flash.
Solution: we moved our FS to RAM, and we mount part of the NAND partition as writeable only when there is a need to write something. NAND write protect was controlled via a hardware pin, so that writes were enabled only when there was a write request from the app.
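The "writeable only when needed" part can be sketched roughly as below. This is an illustration under assumptions, not our actual code: it assumes the partition is normally mounted read-only, that its filesystem supports `mount -o remount`, and that `/mnt/data` is the mount point (a hypothetical path). The helper name writable is also made up, and toggling the hardware write-protect pin is board-specific and left out.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def writable(mount_point, run=subprocess.check_call):
    """Remount mount_point read-write for the duration of the with-block,
    then flush and remount it read-only again, even if the block raises.
    `run` is injectable so the command sequence can be checked without root."""
    run(["mount", "-o", "remount,rw", mount_point])
    try:
        yield
    finally:
        # Flush pending writes before closing the write window.
        run(["sync"])
        run(["mount", "-o", "remount,ro", mount_point])


# Hypothetical usage (needs root on the target):
#
#     with writable("/mnt/data"):
#         with open("/mnt/data/settings.cfg", "w") as f:
#             f.write("key=value\n")
```

The point of the design is that the window in which a power cut can hit an in-progress write shrinks to the duration of the with-block, instead of the whole uptime of the device.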