Embedded Linux userland app suddenly started crashing


The gist of the problem: other than hardware failure, what can cause a userland app to become corrupted while it is running?

Hardware rig: ARM9 (AT91SAM9XE) with NAND flash holding the Linux kernel, the FS, and the userland app.

We had an app running on embedded Linux on an ARM9 (AT91SAM9XE). There were no problems for a couple of months, but then one ARM unit suddenly reported being unable to execute the app.

When the app was executed, it crashed with the following dump:

pgd = c16b8000
[00000020] *pgd=215a0031, *pte=00000000, *ppte=00000000

Pid: 349, comm:              console
CPU: 0    Not tainted  (2.6.30.4-uc0 #280)
PC is at 0x4e000
LR is at 0x673e0
pc : [<0004e000>]    lr : [<000673e0>]    psr: 60000010
sp : bec6a728  ip : bec6acb4  fp : bec6ac9c
r10: 000bd9f8  r9 : 00000000  r8 : 00000000
r7 : 00000000  r6 : bec6acb4  r5 : 00000000  r4 : fbad2084
r3 : ffffffff  r2 : bec6acb4  r1 : 00000025  r0 : 0009eab0
Flags: nZCv  IRQs on  FIQs on  Mode USER_32  ISA ARM  Segment user
Control: 0005317f  Table: 216b8000  DAC: 00000015
[<c02ec3b0>] (show_regs+0x0/0x50) from [<c02f11a8>] (__do_user_fault+0x9c/0xa8)
 r5:0000000b r4:c1696360
[<c02f110c>] (__do_user_fault+0x0/0xa8) from [<c02f1344>] (do_page_fault+0x114/0x244)
 r7:00010000 r6:c1696360 r5:c15a62e0 r4:c1c5fde0
[<c02f1230>] (do_page_fault+0x0/0x244) from [<c02ea284>] (do_DataAbort+0x3c/0xa0)
[<c02ea248>] (do_DataAbort+0x0/0xa0) from [<c02eae00>] (ret_from_exception+0x0/0x10)
Exception stack(0xc1683fb0 to 0xc1683ff8)
3fa0:                                     0009eab0 00000025 bec6acb4 ffffffff 
3fc0: fbad2084 00000000 bec6acb4 00000000 00000000 00000000 000bd9f8 bec6ac9c 
3fe0: bec6acb4 bec6a728 000673e0 0004e000 60000010 ffffffff     

I tried addr2line to see where it crashed, but it pointed to crtstuff.c. crtstuff.c is not part of our app; I think it is related to GCC.

I feared my executable was corrupted, so I ran a diff between the file on NAND and the copy on my PC. There were differences, which shouldn't happen. Moreover, almost all of the differing bytes were 0x00 instead of the values they should contain.
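For anyone wanting to repeat that comparison, here is a minimal sketch in Python of a byte-level diff that also counts how many of the differing bytes read back as 0x00. The function names `summarize_diff` and `compare_files` are hypothetical, and this is not the tool used in the original investigation, just one way to get the same kind of summary.

```python
from pathlib import Path


def summarize_diff(reference: bytes, suspect: bytes, max_offsets: int = 16) -> dict:
    """Compare two binary images byte by byte (hypothetical helper).

    Only the overlapping length is compared; a length mismatch is
    reported separately in "size_mismatch".
    """
    common = min(len(reference), len(suspect))
    # Offsets at which the suspect copy differs from the known-good copy.
    diffs = [i for i in range(common) if reference[i] != suspect[i]]
    # How many of those differing bytes are 0x00 in the suspect copy.
    zeroed = sum(1 for i in diffs if suspect[i] == 0x00)
    return {
        "size_mismatch": len(reference) != len(suspect),
        "diff_count": len(diffs),
        "zeroed_count": zeroed,
        "first_offsets": diffs[:max_offsets],
    }


def compare_files(reference_path: str, suspect_path: str) -> dict:
    """Read both files fully into memory and summarize their differences."""
    return summarize_diff(
        Path(reference_path).read_bytes(),
        Path(suspect_path).read_bytes(),
    )
```

Calling `compare_files` with the PC build and the copy pulled off the NAND returns the number of differing bytes, how many of them are zero, and the first few offsets, which can then be checked against the binary's sections.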

What I really want to know is: how can a userland app get corrupted, other than through hardware failure?

Cause: The NAND flash was always writable, so we hypothesized that power went out at a moment when something was being written to flash.

Solution: We moved our FS to RAM and mount part of the NAND partition as writable only when something needs to be written. NAND write protect is controlled via a hardware pin, so writes are enabled only when there is a write request from the app.
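The "writable only when needed" part can be sketched roughly as follows in Python. This is an illustration under assumptions, not the code we shipped: the `writable` helper and the `/mnt/data` mount point are hypothetical, the hardware write-protect pin is not modeled, and it assumes the standard `mount -o remount,rw` / `remount,ro` options work for the filesystem in use.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def writable(mount_point, run=subprocess.run):
    """Remount mount_point read-write for the duration of the block.

    `run` defaults to subprocess.run and can be replaced for testing.
    The mount point is normally kept read-only; it is switched back to
    read-only even if the body raises an exception.
    """
    run(["mount", "-o", "remount,rw", mount_point], check=True)
    try:
        yield
    finally:
        # Flush pending writes before dropping back to read-only.
        run(["sync"], check=True)
        run(["mount", "-o", "remount,ro", mount_point], check=True)
```

Usage would look like `with writable("/mnt/data"): ...` around the code that writes the file. Keeping the read-write window this short reduces the time during which a power cut can hit an in-progress write, though it does not eliminate it.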
