Strategies for debugging numerical stability issues?

802 views Asked by At

I'm trying to write an implementation of Wilson's spectral density factorization algorithm [1] for Python. The algorithm iteratively factorizes a [QxQ] matrix function into its square root (it's sort of an extension of the Newton-Raphson square-root finder for spectral density matrices).

The problem is that my implementation only converges for matrices of size 45x45 and smaller. So after 20 iterations, the summed squared difference between matrices is about 2.45e-13. However, if I make an input of size 46x46, it does not converge until the 100th or so iteration. For 47x47 or larger, the matrices never converge; the error fluctuates between 100 and 1000 for about 100 iterations, and then starts to grow very quickly.

How would you go about trying to debug something like this? There doesn't appear to be any specific point at which it goes crazy, and the matrices are too large for me to actually attempt to do the calculation by hand. Does anyone have tips / tutorials / heuristics for find bizarre numerical bugs like this?

I've never dealt with anything like this before but I'm hoping some of you have...

Thanks, - Dan

[1] G. T. Wilson. "The Factorization of Matricial Spectral Densities". SIAM J. Appl. Math (Vol 23, No. 4, Dec. 1972)

2

There are 2 answers

1
robince On

I would recommend asking this question on the scipy-user mailing list, perhaps with an example of your code. Generally the people on the list seem to be highly experienced with numerical computation and are really helpful, just following the list is an education in itself.

Otherwise, I'm afraid I don't have any ideas... If you think it is a numerical precision/floating point rounding issue, the first thing you could try is bump all the dtypes up to float128 and see if makes any difference.

1
Alex Martelli On

Interval arithmetic can help, but I'm not sure if performance will be sufficient to actually allow meaningful debugging at the matrix sizes of your interest (you have to figure on a couple orders of magnitude worth of slowdown, what between replacing highly-HW-helped "scalar" floating point operations with SW-heavy "interval" ones, and adding the checks about which intervals are growing too wide, when, where, and why).