Changing the saga example from camel-spring-boot-examples/saga so that it executes many successful saga transactions causes the narayana-lra coordinator to fail and restart with a java.lang.OutOfMemoryError: Java heap space.

I am using the Camel Spring Boot examples (org.apache.camel.springboot.example) version 3.20.0.

pom.xml

    <parent>
        <groupId>org.apache.camel.springboot</groupId>
        <artifactId>spring-boot</artifactId>
        <version>3.20.0</version>
    </parent>

    <groupId>org.apache.camel.springboot.example</groupId>
    <artifactId>examples</artifactId>
    <name>Camel SB :: Examples</name>
    <description>Camel Examples</description>
    <packaging>pom</packaging>
    <properties>
        <camel-version>3.20.0</camel-version>
        <skip.starting.camel.context>false</skip.starting.camel.context>
        <javax.servlet.api.version>4.0.1</javax.servlet.api.version>
        <jkube-maven-plugin-version>1.9.1</jkube-maven-plugin-version>
        <kafka-avro-serializer-version>5.2.2</kafka-avro-serializer-version>
        <reactor-version>3.2.16.RELEASE</reactor-version>
        <testcontainers-version>1.16.3</testcontainers-version>
        <hapi-structures-v24-version>2.3</hapi-structures-v24-version>
        <narayana-spring-boot-version>2.6.3</narayana-spring-boot-version>
    </properties>

changed docker-compose

version: "3.9"
services:
  lra-coordinator:
    image: "quay.io/jbosstm/lra-coordinator:7.0.0.Final-3.2.2.Final"
    network_mode: "host"
    deploy:
      resources:
        limits:
          memory: 400M
    environment:
      - 'JAVA_TOOL_OPTIONS=-Dquarkus.log.level=DEBUG 
        -Dcom.sun.management.jmxremote=true
        -Dcom.sun.management.jmxremote.port=7091
        -Dcom.sun.management.jmxremote.ssl=false 
        -Dcom.sun.management.jmxremote.authenticate=false'

  amq-broker:
    image: "registry.redhat.io/amq7/amq-broker-rhel8:7.10"
    environment:
      - AMQ_USER=admin
      - AMQ_PASSWORD=admin
      - AMQ_REQUIRE_LOGIN=true
    ports:
      - "8161:8161"
      - "61616:61616"

changed SagaRoute

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.camel.ProducerTemplate;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.model.rest.RestParamType;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class SagaRoute extends RouteBuilder {

    private static final String DIRECT_SAGA = "direct:saga";
    @Autowired
    private ProducerTemplate producerTemplate;

    private final ExecutorService executor = Executors.newFixedThreadPool(10);

    @Override
    public void configure() throws Exception {

        rest().get("/perf")
                .param().type(RestParamType.query).name("n").dataType("int").required(true).endParam()
                .to("direct:perf");

        from("direct:perf")
            .process(exchange -> {
                Integer limit = exchange.getMessage().getHeader("n", Integer.class);
                for (int i = 0; i < limit; i++) {
                    int finalI = i;
                    executor.submit(() -> producerTemplate.sendBodyAndHeader(DIRECT_SAGA, finalI, "id", finalI));
                }
            });

        rest().post("/saga")
                .param().type(RestParamType.query).name("id").dataType("int").required(true).endParam()
                .to(DIRECT_SAGA);

        from(DIRECT_SAGA)
                .saga()
                .compensation("direct:cancelOrder")
                    .log("Executing saga #${header.id} with LRA ${header.Long-Running-Action}")
                    .setHeader("payFor", constant("train"))
                    .to("activemq:queue:{{example.services.train}}?exchangePattern=InOut" +
                            "&replyTo={{example.services.train}}.reply")
                    .log("train seat reserved for saga #${header.id} with payment transaction: ${body}")
                    .setHeader("payFor", constant("flight"))
                    .to("activemq:queue:{{example.services.flight}}?exchangePattern=InOut" +
                            "&replyTo={{example.services.flight}}.reply")
                    .log("flight booked for saga #${header.id} with payment transaction: ${body}")
                .setBody(header("Long-Running-Action"))
                .end();

        from("direct:cancelOrder")
                .log("Transaction ${header.Long-Running-Action} has been cancelled due to flight or train failure");

    }

}

changed PaymentRoute https://github.com/apache/camel-spring-boot-examples/blob/camel-spring-boot-examples-3.20.0/saga/saga-payment-service/src/main/java/org/apache/camel/example/saga/PaymentRoute.java

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.model.SagaPropagation;
import org.springframework.stereotype.Component;

@Component
public class PaymentRoute extends RouteBuilder {

    @Override
    public void configure() throws Exception {

                from("activemq:queue:{{example.services.payment}}")
                .routeId("payment-service")
                .saga()
                    .propagation(SagaPropagation.MANDATORY)
                    .option("id", header("id"))
                    .compensation("direct:cancelPayment")
                    .log("Paying ${header.payFor} for order #${header.id}")
                    .setBody(header("JMSCorrelationID"))
                    .log("Payment ${header.payFor} done for order #${header.id} with payment transaction ${body}")
                .end();

        from("direct:cancelPayment")
                .routeId("payment-cancel")
                .log("Payment for order #${header.id} has been cancelled");
    }
}

I execute run-local.sh to start the services.

Then I execute the command below, passing the number of transactions to run:

curl http://localhost:8084/api/perf?n=100000

Doing this, Java Flight Recorder reports the following problem:

The live set on the heap seems to increase with a speed of about 26,2 KiB per second during the recording.
An analysis of the reference tree found 1 leak candidates. The main candidate is java.util.concurrent.ConcurrentHashMap$Node[] 
Referenced by this chain:
java.util.concurrent.ConcurrentHashMap.table
io.narayana.lra.coordinator.domain.service.LRAService.participants
io.narayana.lra.coordinator.internal.LRARecoveryModule.service
java.lang.Object[]
java.util.Vector.elementData
com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery._recoveryModules

(Flight Recorder diagnosis screenshot)

and these errors appear in the Narayana LRA log:

2023-08-15 19:16:35,438 DEBUG [org.jbo.res.rea.com.cor.AbstractResteasyReactiveContext] (executor-thread-59) Restarting handler chain for exception exception: java.lang.OutOfMemoryError: Java heap space

2023-08-15 19:16:37,729 DEBUG [org.jbo.res.rea.com.cor.AbstractResteasyReactiveContext] (executor-thread-59) Attempting to handle unrecoverable exception: java.lang.OutOfMemoryError: Java heap space

2023-08-15 19:16:38,327 DEBUG [io.ver.ext.web.RoutingContext] (executor-thread-59) RoutingContext failure (500): java.lang.OutOfMemoryError: Java heap space

The field that holds the data causing the memory leak is

io.narayana.lra.coordinator.domain.service.LRAService.participants

public class LRAService {
    private final Map<String, String> participants = new ConcurrentHashMap<>();

In the LRAService class I could not find where items are removed from this map.
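
To make the pattern concrete: the growth boils down to a map that is only ever written to. A minimal sketch of that pattern, with hypothetical names (this is not Narayana's actual code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration only: an entry is put into the map for every LRA/saga that is
// started, and nothing removes it, so N successful sagas leave N live entries.
class ParticipantRegistry {

    private final Map<String, String> participants = new ConcurrentHashMap<>();

    void enlist(String lraId, String participantUri) {
        participants.put(lraId, participantUri);
    }

    // A cleanup call such as participants.remove(lraId) on LRA completion is
    // what I could not find.
}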

Is this a bug? A misconfiguration of Narayana LRA? A bug in Apache Camel's saga support?

Thanks a lot.


There are 2 answers

Answer from user3078393:

I received an answer on the Narayana Zulip chat:

https://narayana.zulipchat.com/#narrow/stream/323714-users/topic/apache.20camel.20saga.20causes.20Narayana.20-LRA.20memory.20leak/near/386195429

Michael Musgrove: Well spotted, the participant should be removed here when the transaction finishes (and updated if participants move).

We'll get an issue tracker raised for you to monitor the fix.

Michael Musgrove: You may track our progress using issue https://issues.redhat.com/browse/JBTM-3795

Content of the JIRA issue:

Description: The LRA module maintains a map of participants [1] which should be cleaned up when an LRA finishes [2] (and updated if a participant wants to be notified on a different endpoint).

[1] https://github.com/jbosstm/narayana/blob/7.0.0.Final/rts/lra/coordinator/src/main/java/io/narayana/lra/coordinator/domain/service/LRAService.java#L46

[2] https://github.com/jbosstm/narayana/blob/7.0.0.Final/rts/lra/coordinator/src/main/java/io/narayana/lra/coordinator/domain/service/LRAService.java#L195
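
In map terms, the cleanup described in the card would amount to something like this (again hypothetical names, not the actual Narayana patch):

// Hypothetical sketch of the described fix, completing the illustration above.
class ParticipantRegistryWithCleanup {

    private final java.util.Map<String, String> participants =
            new java.util.concurrent.ConcurrentHashMap<>();

    void enlist(String lraId, String participantUri) {
        participants.put(lraId, participantUri);
    }

    // Remove the entry when the LRA finishes, so the map stops growing.
    void finish(String lraId) {
        participants.remove(lraId);
    }

    // Replace the entry if the participant wants to be notified on a different endpoint.
    void move(String lraId, String newParticipantUri) {
        participants.replace(lraId, newParticipantUri);
    }
}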

Answer from Mark Sg:

As written in the Zulip thread, this issue is related to Netty direct memory not respecting the Docker memory limit. To fix the OOM error you could set

JAVA_TOOL_OPTIONS='-Dio.netty.maxDirectMemory=0'

as an environment variable. You might also make it work by limiting io.netty.maxDirectMemory to a value lower than the Docker container memory limit (e.g. -Dio.netty.maxDirectMemory=100m).
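
For example, in the docker-compose file from the question this could be set on the lra-coordinator service as shown below (the Netty property is the only addition; the other flags and the 400M limit are carried over from the question, and 0 can be replaced by a smaller explicit value):

  lra-coordinator:
    image: "quay.io/jbosstm/lra-coordinator:7.0.0.Final-3.2.2.Final"
    network_mode: "host"
    deploy:
      resources:
        limits:
          memory: 400M
    environment:
      # Added: cap/disable Netty's direct memory as suggested above.
      - 'JAVA_TOOL_OPTIONS=-Dio.netty.maxDirectMemory=0
        -Dquarkus.log.level=DEBUG
        -Dcom.sun.management.jmxremote=true
        -Dcom.sun.management.jmxremote.port=7091
        -Dcom.sun.management.jmxremote.ssl=false
        -Dcom.sun.management.jmxremote.authenticate=false'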