Using ND4J and Flink, I have a process function that receives a POJO, runs a bunch of linalg NDArray math to calculate a result, and outputs another POJO. When running the program on the cluster with the Linux CPU backend, both with and without AVX-512, memory usage only ever goes up. It looks like a memory leak coming from the ND4J calculations in the process function. I'm not keeping any references outside that method, so there is no reason for the memory not to be released. The GC is called, but it doesn't release much of the memory. I also tried the workspace feature, but it didn't change anything.
I have tried changing the GC, changing heap / off-heap sizes, setting Bytedeco's maxbytes and maxphysicalbytes, and using workspaces, but nothing helps.
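Roughly, the process function is shaped like this (a simplified sketch: the real POJOs are replaced by double[], the math is a placeholder, and the workspace name is arbitrary):

```java
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.nd4j.linalg.api.memory.MemoryWorkspace;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

// Simplified stand-in for the real job: the actual POJOs are replaced by double[].
public class MathProcessFunction extends ProcessFunction<double[], double[]> {

    @Override
    public void processElement(double[] values, Context ctx, Collector<double[]> out) throws Exception {
        // All intermediate NDArrays are created inside a workspace, so their off-heap
        // buffers should be recycled when the try-with-resources scope closes.
        try (MemoryWorkspace ws = Nd4j.getWorkspaceManager()
                .getAndActivateWorkspace("PROCESS_WS")) {   // workspace id is arbitrary
            INDArray x = Nd4j.create(values);
            INDArray result = x.mul(2.0).add(1.0);          // placeholder for the real math
            // detach() copies the result out of the workspace before its buffer is reused
            out.collect(result.detach().toDoubleVector());
        }
    }
}
```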
Here are some suggestions I can think of:
Release Resources: Make sure you release memory explicitly when it's no longer needed. In Flink, use ValueState.clear() or ListState.clear() to release state-related resources.
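For instance, a minimal sketch of clearing keyed state once it is no longer needed (the state name, types, and threshold are made up for illustration):

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class RunningSumFunction extends KeyedProcessFunction<String, Double, Double> {

    private transient ValueState<Double> sumState;

    @Override
    public void open(Configuration parameters) {
        sumState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("runningSum", Double.class));
    }

    @Override
    public void processElement(Double value, Context ctx, Collector<Double> out) throws Exception {
        double sum = (sumState.value() == null ? 0.0 : sumState.value()) + value;
        out.collect(sum);
        if (sum > 1_000_000) {
            // Explicitly drop this key's state once it is no longer needed,
            // so the state backend can release the associated memory.
            sumState.clear();
        } else {
            sumState.update(sum);
        }
    }
}
```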
Check for Native Backend Configuration: Depending on the specific computations you are performing, you might need to configure the ND4J native backend to better suit your needs. This might involve switching between the CPU and GPU backends or fine-tuning the native backend settings.
To configure the native backend, you can set environment variables. For example, OMP_NUM_THREADS controls how many OpenMP threads the native CPU backend uses for its math routines.
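If the growth is in off-heap memory held by ND4J rather than in the JVM heap, it may also be worth tuning ND4J's memory manager from code. A minimal sketch (the 10-second window is just an assumption to tune for your workload):

```java
import org.nd4j.linalg.factory.Nd4j;

public class Nd4jBackendTuning {
    public static void main(String[] args) {
        // Confirm which backend was actually loaded on the cluster (e.g. nd4j-native vs nd4j-cuda).
        System.out.println("ND4J backend: " + Nd4j.getBackend().getClass().getName());

        // Let ND4J periodically invoke System.gc() (here at most every 10 seconds) so that
        // off-heap buffers whose owning INDArrays are unreachable get deallocated sooner.
        Nd4j.getMemoryManager().setAutoGcWindow(10_000);
        Nd4j.getMemoryManager().togglePeriodicGc(true);
    }
}
```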
Adjust these values according to your system's capabilities and requirements.
Flink Off-Heap State: If you are using Flink state and it is large, check whether Flink is configured to use off-heap state storage. Off-heap state helps reduce the JVM heap footprint of your Flink application.
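A sketch of switching to the RocksDB state backend, which keeps keyed state off-heap (this assumes Flink 1.13+ with the flink-statebackend-rocksdb dependency on the classpath; the same thing can be set via state.backend: rocksdb in flink-conf.yaml):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OffHeapStateJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB keeps keyed state in native (off-heap) memory and on local disk
        // instead of on the JVM heap.
        env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // ... build the actual pipeline here ...

        env.execute("off-heap-state-example");
    }
}
```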
If nothing works for you:
Consider Profiling and Monitoring: use profiling tools and monitoring to understand memory usage patterns and pin down the leak. Tools like VisualVM, YourKit, or Flink's built-in metrics can help identify where memory is not being released as expected. HTH.
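Since the suspicion here is off-heap memory held by ND4J/JavaCPP rather than the heap, one cheap monitoring option is to expose JavaCPP's allocation counters as Flink metrics. A rough sketch (the metric names and the pass-through function are just for illustration):

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;
import org.bytedeco.javacpp.Pointer;

// Pass-through map function whose only job is to export JavaCPP's memory counters
// as Flink metrics, so off-heap growth shows up in the Flink UI / metrics reporter.
public class OffHeapMetricsMap extends RichMapFunction<double[], double[]> {

    @Override
    public void open(Configuration parameters) {
        // Bytes currently tracked by JavaCPP's deallocators (the value capped by maxbytes).
        getRuntimeContext().getMetricGroup()
                .gauge("javacppTrackedBytes", (Gauge<Long>) Pointer::totalBytes);
        // Resident memory of the whole process (the value capped by maxphysicalbytes).
        getRuntimeContext().getMetricGroup()
                .gauge("javacppPhysicalBytes", (Gauge<Long>) Pointer::physicalBytes);
    }

    @Override
    public double[] map(double[] value) {
        return value;  // pass-through
    }
}
```

If the physical-bytes gauge keeps climbing while the tracked-bytes gauge stays flat, the growth is likely outside JavaCPP's accounting, which points more towards allocator fragmentation or another native library than towards leaked INDArrays.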