We have a .NET 2.0 application that runs as a service, to which multiple clients connect, mainly through .NET Remoting. The service crashes with an OutOfMemory exception at the client site in production, so they are currently forced to restart it every day or so to avoid unexpected crashes.
Previously, I've successfully resolved a couple of memory leaks in managed code (a static collection that never released the objects stored in it, and another case where the logical thread count kept increasing). So I'm pretty familiar with capturing memory dumps and searching them with WinDbg + SOS.
In this case, however, private bytes are rising while bytes in all heaps remain stable, which indicates a memory leak in unmanaged code. I received a crash dump with the actual OOM exception, which makes it more obvious:
After checking Tess Ferrandez's blog about dealing with unmanaged-code leaks in .NET applications, as well as some other resources around the net, I've excluded problems such as large numbers of dynamic assemblies, a common XmlSerializer issue, and 3rd party native DLLs (there aren't any). However, there is a good number of P/Invokes around. Moving forward, checking the heaps returned the following:
The 2nd command returned all the entries as well. Now, according to some stuff I read, I should run !heap -p -a to get the stack, but all I get is
Which, according to this question, indicates incorrect gflags usage. However, starting the service locally and attaching the debugger to it is currently not an option. Long story short, I have to set up an environment with a configuration and load similar to the client's to get it done, and this is not ready.
So, I'm quite stuck. I don't know where to go from here, or even whether I am using the right approach to troubleshoot this issue. Any pointers are more than welcome.
Edit #1: The service calls Thread.Abort on threads that use external resources, specifically database connections through Oracle's ODP.NET provider. Could that be a cause of a leak in the native heap?
That's just a perfect storm of conditions to have this kind of problem. Either of them is quite sufficient by itself to cause an uncontrolled memory leak. It will certainly look like everything is dandy on your dev machine. But then it surely isn't exercising the kind of data loads that the real machine is dealing with.
You are not going to get a lot of help from tooling, not the .NET kind anyway. Definitely start by getting rid of the thread aborts completely; that just never ends well. You gave no context for why you are aborting threads, so it is hard to give specific advice. Definitely pursue setting up a database with fake auto-generated data so you can check the outcomes of code tweaks in the comfort of your cubicle. You need a lot of data because otherwise it doesn't leak fast enough to notice.
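As a rough illustration of what "getting rid of the thread aborts" could look like, here is a minimal sketch of a cooperatively stopped worker. It assumes the work can be split into short units that check a stop signal between them. The class and method names (`CooperativeWorker`, `DoOneUnitOfWork`) are hypothetical and not taken from the service in question, and it sticks to APIs that exist in .NET 2.0.

```csharp
using System;
using System.Threading;

// Hypothetical worker; all names here are illustrative assumptions.
public class CooperativeWorker
{
    // Signalled when the owner wants the worker to finish.
    private readonly ManualResetEvent _stopRequested = new ManualResetEvent(false);
    private volatile bool _cleanedUp;
    private Thread _thread;

    public bool CleanedUp { get { return _cleanedUp; } }

    public void Start()
    {
        _thread = new Thread(Run);
        _thread.IsBackground = true;
        _thread.Start();
    }

    // Requests a stop and waits up to timeoutMs for the thread to exit.
    // Returns true if the thread finished within the timeout.
    public bool Stop(int timeoutMs)
    {
        _stopRequested.Set();
        return _thread.Join(timeoutMs);
    }

    private void Run()
    {
        try
        {
            // WaitOne serves as both the pause between units and the stop check.
            while (!_stopRequested.WaitOne(10, false))
            {
                DoOneUnitOfWork();
            }
        }
        finally
        {
            // Reached on the normal exit path, never in the middle of a unit of work.
            _cleanedUp = true;
        }
    }

    private void DoOneUnitOfWork()
    {
        // Placeholder: one short database call, with the connection and
        // command each wrapped in a using block so they are always disposed.
    }
}
```

The point of the design is that the thread only ever exits between units of work, so any connection opened inside a unit gets disposed by its own `using` block. A caller would use `worker.Start()` and later `worker.Stop(5000)`, treating a `false` return as a hung worker to investigate instead of something to abort.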
If you still have trouble then you need tools like gflags.exe and umdh.exe, available in the Debugging Tools for Windows distribution, nowadays part of the Windows SDK. Treat them as a last resort: they only work well with debugging symbols, and Oracle is not the kind of company that ever makes that easy. Their ecosystem is friendly to move-aside-I'll-fix-your-problem highly paid consultants, which could work out too if you find the right one.