How do I proceed with memory, .so filenames and hex offsets

92 views Asked by At

Don't flame me for this but it's genuine. I am writing a multi threaded python application that runs for a very long time, typically 2-3 hours with 10 processes. This machine isn't slow it's just a lot of calculations.

The issue is that sometimes the application will hang about 85-90% of the way there because of outside tools.

I've broken this test up into smaller pieces that can then run successfully but the long running program hangs.

for example let's say I have to analyze some data on a list that 100,000,000 items long.

Breaking it up into twenty 5,000,000 lists all the smaller parts runs to completion.

Trying to do the 100,000,000 project it hangs towards the end. I use some outside tools that I cannot change so I am just trying to see what's going on.

I setup Dtrace and run

sudo dtrace -n 'syscall:::entry / execname == "python2.7" / { @[ustack()] = count() }'

on my program right when it hangs and I get an output like the code sample below.

          libc.so.7`__sys_recvfrom+0xa
          _socket.so`0x804086ecd
          _socket.so`0x8040854ac
          libpython2.7.so.1`PyEval_EvalFrameEx+0x52d7
          libpython2.7.so.1`PyEval_EvalCodeEx+0x665
          libpython2.7.so.1`0x800b3317d
          libpython2.7.so.1`PyEval_EvalFrameEx+0x4e2f
          libpython2.7.so.1`0x800b33250
          libpython2.7.so.1`PyEval_EvalFrameEx+0x4e2f
          libpython2.7.so.1`PyEval_EvalCodeEx+0x665
          libpython2.7.so.1`0x800abb5a1
          libpython2.7.so.1`PyObject_Call+0x64
          libpython2.7.so.1`0x800aa3855
          libpython2.7.so.1`PyObject_Call+0x64
          libpython2.7.so.1`PyEval_EvalFrameEx+0x4de2
          libpython2.7.so.1`PyEval_EvalCodeEx+0x665
          libpython2.7.so.1`0x800abb5a1
          libpython2.7.so.1`PyObject_Call+0x64
          libpython2.7.so.1`0x800aa3855
          libpython2.7.so.1`PyObject_Call+0x64

that code just repeats over and over. I tried looking into the Dtrace python probes but those seems busted two sides from Tuesday so this might be the closest that I'll get.

My question, I have a fuzzy idea that libpython2.7.so.1 is the shared library that holds the function pyObject_Call at an hex offset of 0x64

Is that right?

How can I decipher this? I don't know what to even call this so that I can google for answers or guides.

1

There are 1 answers

1
Robert Harris On BEST ANSWER

You should probably start by reading Showing the stack trace from a running Python application.

Your specific question was about the interpretation of DTrace's ustack() action and so this reply may be more than you need. This is because one of the design principles of DTrace is to show the exact state of a system. So, even though you're interested in the Python aspect of your program, DTrace is revealing its underlying implementation.

The output you've presented is a stack, which is a way of describing the state of a thread at a specific point in its execution. For example, if you had the code

void c(void) { pause(); }
void b(void) { c(); }
void a(void) { b(); }

and you asked for a stack whilst execution was within pause() then you might see something like

pause()
c()
b()
a()

Whatever tool you use will find the current instruction and its enclosing function before finding the "return address", i.e. the point to which that function will eventually return; repeating this procedure yields a stack. Thus, although the stack should be read from the top to the bottom as a series of return addresses, it's typically read in the other direction as a series of callers. Note that subtleties in the way that the program's corresponding instructions are assembled mean that this second interpretation can sometimes be misleading.

To extend the example above, it's likely that a(), b() and c() are all present within the same library --- and that there may be functions with the same names in other libraries. Thus it's useful to display, for each function, the object to which it belongs. Thus the stack above could become

libc.so`pause()
libfoo.so`c()
libfoo.so`b()
libfoo.so`a()

This goes some way towards allowing a developer to identify how a program ended up in a particular state: function c() in libfoo has called pause(). However, there's more to be done: if c() looked like

void c() {
    pause();
    pause();
}

then in which call to pause() is the program waiting?

The functions a(), b() and c() will be sequences of instructions that will typically occupy a contiguous region of memory. Calling one of the functions involves little more than making a note of where to return when finished (i.e. the return address) and then jumping to whichever memory address corresponds to the function's start. Functions' start addresses and sizes are recorded in a "symbol table" that is embedded in the object; it's by reading this table that a debugger is able to find the function that contains a given location such as a return address. Thus a specific point within a function can be described by an offset, usually expressed in hex, from the start. So an even better version of the stack above might be

libc.so`pause()+0x12
libfoo.so`c()+0x42
libfoo.so`b()+0x12
libfoo.so`a()+0x12

At this point, the developer can use a "disassembler" on libfoo.so to display the instructions within c(); comparison with c()'s source code would allow him to reveal the specific line from which the call to pause() was made.

Before concluding this description of stacks, it's worth making one more observation. Given the presence of sufficient "debug data" in a library such as libfoo, a better debugger would be able to go the extra mile and display the the source code file name and line number instead of the hexadecimal offset for each "frame" in the stack.

So now, to return to the stack in your question, libpython(2.7.so.1) is a library whose functions perform the job of executing a Python script. Functions in the Python script are converted into executable instructions on the fly, so my guess is that the fragment

libpython2.7.so.1`0x800b33250
libpython2.7.so.1`PyEval_EvalFrameEx+0x4e2f
libpython2.7.so.1`PyEval_EvalCodeEx+0x665

means that PyEval_EvalFrameEx() is functionality within libpython itself that calls a Python function (i.e. something written in Python) that resides in memory near the address 0x800b33250. A simple debugger can see that this address belongs to libpython but won't find a corresponding entry in the library's symbol table; left with no choice, it simply prints the "raw" address.

So, you need to look at the Python script so see what it's doing but, unfortunately, there's no indication of the names of the functions in the Python component of the stack.

There are a few ways to proceed. The first is to find a version of libpython, if one exists, with a "DTrace helper". This is some extra functionality that lets DTrace see the state of the Python program itself in addition to the surrounding implementation. The result is that each Python frame would be annotated with the corresponding point in the Python source code.

Another, if you're on Solaris, is to use pstack(1); this has native support for Python.

Finally, try a specific Python debugger.

It's also worth pointing out that your dtrace invocation will show you all the stacks seen, sorted by popularity, whenever the program "python2.7" makes a system call. From your description, this probably isn't what you want. If you're trying to understand the behaviour of a hang then you probably want to start with a single snapshot of the python2.7 process at the time of the hang.