ActiveMQ Artemis cluster does not redistribute messages after one instance crash

I have an ActiveMQ Artemis cluster in Kubernetes with 3 master/slave pairs:

activemq-artemis-master-0                               1/1     Running
activemq-artemis-master-1                               1/1     Running
activemq-artemis-master-2                               1/1     Running
activemq-artemis-slave-0                                0/1     Running
activemq-artemis-slave-1                                0/1     Running
activemq-artemis-slave-2                                0/1     Running

I am using a Spring Boot JmsListener to consume messages sent to a wildcard queue, as follows.

    import java.util.concurrent.TimeUnit;

    import lombok.extern.log4j.Log4j2;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.jms.annotation.JmsListener;
    import org.springframework.messaging.handler.annotation.Header;
    import org.springframework.stereotype.Component;

    @Component
    @Log4j2
    public class QueueListener {
      @Autowired
      private ListenerControl listenerControl;

      @JmsListener(id = "queueListener0", destination = "QUEUE.service2.*.*.*.notification")
      public void add(String message, @Header("sentBy") String sentBy, @Header("sentFrom") String sentFrom, @Header("sentAt") Long sentAt) throws InterruptedException {
        log.info("---QUEUE[notification]:  message={}, sentBy={}, sentFrom={}, sentAt={}",
                message, sentBy, sentFrom, sentAt);

        // Simulate slow processing; the delay is controlled externally
        TimeUnit.MILLISECONDS.sleep(listenerControl.getDuration());
      }
    }
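
For completeness, the sending side looks roughly like this; this is a simplified sketch with placeholder values rather than the actual producer code. It sends to one of the concrete destinations matched by the wildcard above and sets the headers the listener reads:

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.jms.core.JmsTemplate;
    import org.springframework.stereotype.Component;

    @Component
    public class NotificationSender {
      @Autowired
      private JmsTemplate jmsTemplate;

      public void send(String message) {
        // A concrete destination matched by QUEUE.service2.*.*.*.notification
        String destination = "QUEUE.service2.tech-drive2.188100000059.thai.notification";
        jmsTemplate.convertAndSend(destination, message, m -> {
          // Placeholder header values; the real producer sets more (e.g. accessToken)
          m.setStringProperty("sentBy", "[email protected]");
          m.setStringProperty("sentFrom", "service2");
          m.setLongProperty("sentAt", System.currentTimeMillis());
          return m;
        });
      }
    }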

There were 20 messages sent to the queue, and master-1 was the delivering node. When 5 messages had been consumed, I killed the master-1 node to simulate a crash. I saw slave-1 start running and then yield back to master-1 after Kubernetes respawned it. The listener threw a JMSException saying the connection was lost, and it tried to reconnect. I then saw it successfully connect to master-0 (the queue was created there and its consumer count was > 0). However, the queue on master-0 was empty, while the same queue on master-1 still had 15 messages and no consumer attached to it. I waited for a while, but the 15 messages were never delivered. I am not sure why redistribution did not kick in.

The attributes of the wildcard queue on master-1 looked like this when it came back online after the crash (I manually replaced the value of the accessToken field since it contains sensitive information):

Attribute   Value
Acknowledge attempts    0
Address QUEUE.service2.*.*.*.notification
Configuration managed   false
Consumer count  0
Consumers before dispatch   0
Dead letter address DLQ
Delay before dispatch   -1
Delivering count    0
Delivering size 0
Durable true
Durable delivering count    0
Durable delivering size 0
Durable message count   15
Durable persistent size 47705
Durable scheduled count 0
Durable scheduled size  0
Enabled true
Exclusive   false
Expiry address  ExpiryQueue
Filter  
First message age   523996
First message as json   [{"JMSType":"service2","address":"QUEUE.service2.tech-drive2.188100000059.thai.notification","messageID":68026,"sentAt":1621957145988,"accessToken":"REMOVED","type":3,"priority":4,"userID":"ID:56c7b509-bd6f-11eb-a348-de0dacf99072","_AMQ_GROUP_ID":"tech-drive2-188100000059-thai","sentBy":"[email protected]","durable":true,"JMSReplyTo":"queue://QUEUE.service2.tech-drive2.188100000059.thai.notification","__AMQ_CID":"e4469ea3-bd62-11eb-a348-de0dacf99072","sentFrom":"service2","originalDestination":"QUEUE.service2.tech-drive2.188100000059.thai.notification","_AMQ_ROUTING_TYPE":1,"JMSCorrelationID":"c329c733-1170-440a-9080-992a009d87a9","expiration":0,"timestamp":1621957145988}]
First message timestamp 1621957145988
Group buckets   -1
Group count 0
Group first key 
Group rebalance false
Group rebalance pause dispatch  false
Id  119
Last value  false
Last value key  
Max consumers   -1
Message count   15
Messages acknowledged   0
Messages added  15
Messages expired    0
Messages killed 0
Name    QUEUE.service2.*.*.*.notification
Object Name org.apache.activemq.artemis:broker="activemq-artemis-master-1",component=addresses,address="QUEUE.service2.\*.\*.\*.notification",subcomponent=queues,routing-type="anycast",queue="QUEUE.service2.\*.\*.\*.notification"
Paused  false
Persistent size 47705
Prepared transaction message count  0
Purge on no consumers   false
Retroactive resource    false
Ring size   -1
Routing type    ANYCAST
Scheduled count 0
Scheduled size  0
Temporary   false
User    f7bcdaed-8c0c-4bb5-ad03-ec06382cb557

The attributes of the wildcard queue on master-0 looked like this:

Attribute   Value
Acknowledge attempts    0
Address QUEUE.service2.*.*.*.notification
Configuration managed   false
Consumer count  3
Consumers before dispatch   0
Dead letter address DLQ
Delay before dispatch   -1
Delivering count    0
Delivering size 0
Durable true
Durable delivering count    0
Durable delivering size 0
Durable message count   0
Durable persistent size 0
Durable scheduled count 0
Durable scheduled size  0
Enabled true
Exclusive   false
Expiry address  ExpiryQueue
Filter  
First message age   
First message as json   [{}]
First message timestamp 
Group buckets   -1
Group count 0
Group first key 
Group rebalance false
Group rebalance pause dispatch  false
Id  119
Last value  false
Last value key  
Max consumers   -1
Message count   0
Messages acknowledged   0
Messages added  0
Messages expired    0
Messages killed 0
Name    QUEUE.service2.*.*.*.notification
Object Name org.apache.activemq.artemis:broker="activemq-artemis-master-0",component=addresses,address="QUEUE.service2.\*.\*.\*.notification",subcomponent=queues,routing-type="anycast",queue="QUEUE.service2.\*.\*.\*.notification"
Paused  false
Persistent size 0
Prepared transaction message count  0
Purge on no consumers   false
Retroactive resource    false
Ring size   -1
Routing type    ANYCAST
Scheduled count 0
Scheduled size  0
Temporary   false
User    f7bcdaed-8c0c-4bb5-ad03-ec06382cb557

The Artemis version in use is 2.17.0. Here is my cluster config from master-0's broker.xml. The configuration is the same on the other brokers, except that the connector-ref is changed to match each broker:

<?xml version="1.0"?>
<configuration xmlns="urn:activemq" xmlns:xi="http://www.w3.org/2001/XInclude" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:activemq /schema/artemis-configuration.xsd">
  <core xmlns="urn:activemq:core" xsi:schemaLocation="urn:activemq:core ">
    <name>activemq-artemis-master-0</name>
    <persistence-enabled>true</persistence-enabled>
    <journal-type>ASYNCIO</journal-type>
    <paging-directory>data/paging</paging-directory>
    <bindings-directory>data/bindings</bindings-directory>
    <journal-directory>data/journal</journal-directory>
    <large-messages-directory>data/large-messages</large-messages-directory>
    <journal-datasync>true</journal-datasync>
    <journal-min-files>2</journal-min-files>
    <journal-pool-files>10</journal-pool-files>
    <journal-device-block-size>4096</journal-device-block-size>
    <journal-file-size>10M</journal-file-size>
    <journal-buffer-timeout>100000</journal-buffer-timeout>
    <journal-max-io>4096</journal-max-io>
    <disk-scan-period>5000</disk-scan-period>
    <max-disk-usage>90</max-disk-usage>
    <critical-analyzer>true</critical-analyzer>
    <critical-analyzer-timeout>120000</critical-analyzer-timeout>
    <critical-analyzer-check-period>60000</critical-analyzer-check-period>
    <critical-analyzer-policy>HALT</critical-analyzer-policy>
    <page-sync-timeout>2244000</page-sync-timeout>
    <acceptors>
      <acceptor name="artemis">tcp://0.0.0.0:61616?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;amqpMinLargeMessageSize=102400;protocols=CORE,AMQP,STOMP,HORNETQ,MQTT,OPENWIRE;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpDuplicateDetection=true</acceptor>
      <acceptor name="amqp">tcp://0.0.0.0:5672?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=AMQP;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpMinLargeMessageSize=102400;amqpDuplicateDetection=true</acceptor>
      <acceptor name="stomp">tcp://0.0.0.0:61613?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=STOMP;useEpoll=true</acceptor>
      <acceptor name="hornetq">tcp://0.0.0.0:5445?anycastPrefix=jms.queue.;multicastPrefix=jms.topic.;protocols=HORNETQ,STOMP;useEpoll=true</acceptor>
      <acceptor name="mqtt">tcp://0.0.0.0:1883?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=MQTT;useEpoll=true</acceptor>
    </acceptors>
    <security-settings>
      <security-setting match="#">
        <permission type="createNonDurableQueue" roles="amq"/>
        <permission type="deleteNonDurableQueue" roles="amq"/>
        <permission type="createDurableQueue" roles="amq"/>
        <permission type="deleteDurableQueue" roles="amq"/>
        <permission type="createAddress" roles="amq"/>
        <permission type="deleteAddress" roles="amq"/>
        <permission type="consume" roles="amq"/>
        <permission type="browse" roles="amq"/>
        <permission type="send" roles="amq"/>
        <permission type="manage" roles="amq"/>
      </security-setting>
    </security-settings>
    <address-settings>
      <address-setting match="activemq.management#">
        <dead-letter-address>DLQ</dead-letter-address>
        <expiry-address>ExpiryQueue</expiry-address>
        <redelivery-delay>0</redelivery-delay>
        <!-- with -1 only the global-max-size is in use for limiting -->
        <max-size-bytes>-1</max-size-bytes>
        <message-counter-history-day-limit>10</message-counter-history-day-limit>
        <address-full-policy>PAGE</address-full-policy>
        <auto-create-queues>true</auto-create-queues>
        <auto-create-addresses>true</auto-create-addresses>
        <auto-create-jms-queues>true</auto-create-jms-queues>
        <auto-create-jms-topics>true</auto-create-jms-topics>
      </address-setting>
      <address-setting match="#">
        <dead-letter-address>DLQ</dead-letter-address>
        <expiry-address>ExpiryQueue</expiry-address>
        <redistribution-delay>60000</redistribution-delay>
        <redelivery-delay>0</redelivery-delay>
        <max-size-bytes>-1</max-size-bytes>
        <message-counter-history-day-limit>10</message-counter-history-day-limit>
        <address-full-policy>PAGE</address-full-policy>
        <auto-create-queues>true</auto-create-queues>
        <auto-create-addresses>true</auto-create-addresses>
        <auto-create-jms-queues>true</auto-create-jms-queues>
        <auto-create-jms-topics>true</auto-create-jms-topics>
      </address-setting>
    </address-settings>
    <addresses>
      <address name="DLQ">
        <anycast>
          <queue name="DLQ"/>
        </anycast>
      </address>
      <address name="ExpiryQueue">
        <anycast>
          <queue name="ExpiryQueue"/>
        </anycast>
      </address>
    </addresses>
    <cluster-user>clusterUser</cluster-user>
    <cluster-password>aShortclusterPassword</cluster-password>
    <connectors>
      <connector name="activemq-artemis-master-0">tcp://activemq-artemis-master-0.activemq-artemis-master.svc.cluster.local:61616</connector>
      <connector name="activemq-artemis-slave-0">tcp://activemq-artemis-slave-0.activemq-artemis-slave.svc.cluster.local:61616</connector>
      <connector name="activemq-artemis-master-1">tcp://activemq-artemis-master-1.activemq-artemis-master.svc.cluster.local:61616</connector>
      <connector name="activemq-artemis-slave-1">tcp://activemq-artemis-slave-1.activemq-artemis-slave.svc.cluster.local:61616</connector>
      <connector name="activemq-artemis-master-2">tcp://activemq-artemis-master-2.activemq-artemis-master.svc.cluster.local:61616</connector>
      <connector name="activemq-artemis-slave-2">tcp://activemq-artemis-slave-2.activemq-artemis-slave.svc.cluster.local:61616</connector>
    </connectors>
    <cluster-connections>
      <cluster-connection name="activemq-artemis">
        <connector-ref>activemq-artemis-master-0</connector-ref>
        <retry-interval>500</retry-interval>
        <retry-interval-multiplier>1.1</retry-interval-multiplier>
        <max-retry-interval>5000</max-retry-interval>
        <initial-connect-attempts>-1</initial-connect-attempts>
        <reconnect-attempts>-1</reconnect-attempts>
        <message-load-balancing>ON_DEMAND</message-load-balancing>
        <max-hops>1</max-hops>
        <!-- scale-down>true</scale-down -->
        <static-connectors>
          <connector-ref>activemq-artemis-master-0</connector-ref>
          <connector-ref>activemq-artemis-slave-0</connector-ref>
          <connector-ref>activemq-artemis-master-1</connector-ref>
          <connector-ref>activemq-artemis-slave-1</connector-ref>
          <connector-ref>activemq-artemis-master-2</connector-ref>
          <connector-ref>activemq-artemis-slave-2</connector-ref>
        </static-connectors>
      </cluster-connection>
    </cluster-connections>
    <ha-policy>
      <replication>
        <master>
          <group-name>activemq-artemis-0</group-name>
          <quorum-vote-wait>12</quorum-vote-wait>
          <vote-on-replication-failure>true</vote-on-replication-failure>
          <!--we need this for auto failback-->
          <check-for-live-server>true</check-for-live-server>
        </master>
      </replication>
    </ha-policy>
  </core>
  <core xmlns="urn:activemq:core">
    <jmx-management-enabled>true</jmx-management-enabled>
  </core>
</configuration>

From another Stack Overflow answer, I understand that my high-availability topology is redundant, and I am planning to remove the slaves. However, I don't think the slaves are the cause of message redistribution not working. Is there a config that I am missing to handle an Artemis node crash?

Update 1: As Justin suggested, I tried a cluster of 2 Artemis nodes without HA.

activemq-artemis-master-0                              1/1     Running            0          27m
activemq-artemis-master-1                              1/1     Running            0          74s

The following is the broker.xml of the 2 Artemis nodes. The only differences between them are the node name and the journal-buffer-timeout:

<?xml version="1.0"?>

<configuration xmlns="urn:activemq" xmlns:xi="http://www.w3.org/2001/XInclude" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:activemq /schema/artemis-configuration.xsd">
  <core xmlns="urn:activemq:core" xsi:schemaLocation="urn:activemq:core ">
    <name>activemq-artemis-master-0</name>
    <persistence-enabled>true</persistence-enabled>
    <journal-type>ASYNCIO</journal-type>
    <paging-directory>data/paging</paging-directory>
    <bindings-directory>data/bindings</bindings-directory>
    <journal-directory>data/journal</journal-directory>
    <large-messages-directory>data/large-messages</large-messages-directory>
    <journal-datasync>true</journal-datasync>
    <journal-min-files>2</journal-min-files>
    <journal-pool-files>10</journal-pool-files>
    <journal-device-block-size>4096</journal-device-block-size>
    <journal-file-size>10M</journal-file-size>
    <journal-buffer-timeout>100000</journal-buffer-timeout>
    <journal-max-io>4096</journal-max-io>
    <disk-scan-period>5000</disk-scan-period>
    <max-disk-usage>90</max-disk-usage>
    <critical-analyzer>true</critical-analyzer>
    <critical-analyzer-timeout>120000</critical-analyzer-timeout>
    <critical-analyzer-check-period>60000</critical-analyzer-check-period>
    <critical-analyzer-policy>HALT</critical-analyzer-policy>
    <page-sync-timeout>2244000</page-sync-timeout>
    <acceptors>
      <acceptor name="artemis">tcp://0.0.0.0:61616?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;amqpMinLargeMessageSize=102400;protocols=CORE,AMQP,STOMP,HORNETQ,MQTT,OPENWIRE;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpDuplicateDetection=true</acceptor>
      <acceptor name="amqp">tcp://0.0.0.0:5672?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=AMQP;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpMinLargeMessageSize=102400;amqpDuplicateDetection=true</acceptor>
      <acceptor name="stomp">tcp://0.0.0.0:61613?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=STOMP;useEpoll=true</acceptor>
      <acceptor name="hornetq">tcp://0.0.0.0:5445?anycastPrefix=jms.queue.;multicastPrefix=jms.topic.;protocols=HORNETQ,STOMP;useEpoll=true</acceptor>
      <acceptor name="mqtt">tcp://0.0.0.0:1883?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=MQTT;useEpoll=true</acceptor>
    </acceptors>
    <security-settings>
      <security-setting match="#">
        <permission type="createNonDurableQueue" roles="amq"/>
        <permission type="deleteNonDurableQueue" roles="amq"/>
        <permission type="createDurableQueue" roles="amq"/>
        <permission type="deleteDurableQueue" roles="amq"/>
        <permission type="createAddress" roles="amq"/>
        <permission type="deleteAddress" roles="amq"/>
        <permission type="consume" roles="amq"/>
        <permission type="browse" roles="amq"/>
        <permission type="send" roles="amq"/>
        <permission type="manage" roles="amq"/>
      </security-setting>
    </security-settings>
    <cluster-user>ClusterUser</cluster-user>
    <cluster-password>longClusterPassword</cluster-password>
    <connectors>
      <connector name="activemq-artemis-master-0">tcp://activemq-artemis-master-0.activemq-artemis-master.ncp-stack-testing.svc.cluster.local:61616</connector>
      <connector name="activemq-artemis-master-1">tcp://activemq-artemis-master-1.activemq-artemis-master.ncp-stack-testing.svc.cluster.local:61616</connector>
    </connectors>
    <cluster-connections>
      <cluster-connection name="activemq-artemis">
        <connector-ref>activemq-artemis-master-0</connector-ref>
        <retry-interval>500</retry-interval>
        <retry-interval-multiplier>1.1</retry-interval-multiplier>
        <max-retry-interval>5000</max-retry-interval>
        <initial-connect-attempts>-1</initial-connect-attempts>
        <reconnect-attempts>-1</reconnect-attempts>
        <use-duplicate-detection>true</use-duplicate-detection>
        <message-load-balancing>ON_DEMAND</message-load-balancing>
        <max-hops>1</max-hops>
        <static-connectors>
          <connector-ref>activemq-artemis-master-0</connector-ref>
          <connector-ref>activemq-artemis-master-1</connector-ref>
        </static-connectors>
      </cluster-connection>
    </cluster-connections>
    <address-settings>
      <address-setting match="activemq.management#">
        <dead-letter-address>DLQ</dead-letter-address>
        <expiry-address>ExpiryQueue</expiry-address>
        <redelivery-delay>0</redelivery-delay>
        <max-size-bytes>-1</max-size-bytes>
        <message-counter-history-day-limit>10</message-counter-history-day-limit>
        <address-full-policy>PAGE</address-full-policy>
        <auto-create-queues>true</auto-create-queues>
        <auto-create-addresses>true</auto-create-addresses>
        <auto-create-jms-queues>true</auto-create-jms-queues>
        <auto-create-jms-topics>true</auto-create-jms-topics>
      </address-setting>
      <address-setting match="#">
        <dead-letter-address>DLQ</dead-letter-address>
        <expiry-address>ExpiryQueue</expiry-address>
        <redistribution-delay>60000</redistribution-delay>
        <redelivery-delay>0</redelivery-delay>
        <max-size-bytes>-1</max-size-bytes>
        <message-counter-history-day-limit>10</message-counter-history-day-limit>
        <address-full-policy>PAGE</address-full-policy>
        <auto-create-queues>true</auto-create-queues>
        <auto-create-addresses>true</auto-create-addresses>
        <auto-create-jms-queues>true</auto-create-jms-queues>
        <auto-create-jms-topics>true</auto-create-jms-topics>
      </address-setting>
    </address-settings>
    <addresses>
      <address name="DLQ">
        <anycast>
          <queue name="DLQ"/>
        </anycast>
      </address>
      <address name="ExpiryQueue">
        <anycast>
          <queue name="ExpiryQueue"/>
        </anycast>
      </address>
    </addresses>
  </core>
  <core xmlns="urn:activemq:core">
    <jmx-management-enabled>true</jmx-management-enabled>
  </core>
</configuration>

With this setup, I still got the same result: after the Artemis node crashed and came back, the leftover messages were not moved to the other node.

Update 2: I tried a non-wildcard queue as Justin suggested but still got the same behavior. One difference I noticed is that with the non-wildcard queue the consumer count is only 1, compared to 3 in the wildcard case. Here are the attributes of the old queue after the crash:

Acknowledge attempts    0
Address QUEUE.service2.tech-drive2.188100000059.thai.notification
Configuration managed   false
Consumer count  0
Consumers before dispatch   0
Dead letter address DLQ
Delay before dispatch   -1
Delivering count    0
Delivering size 0
Durable true
Durable delivering count    0
Durable delivering size 0
Durable message count   15
Durable persistent size 102245
Durable scheduled count 0
Durable scheduled size  0
Enabled true
Exclusive   false
Expiry address  ExpiryQueue
Filter  
First message age   840031
First message as json   [{"JMSType":"service2","address":"QUEUE.service2.tech-drive2.188100000059.thai.notification","messageID":8739,"sentAt":1621969900922,"accessToken":"DONOTDISPLAY","type":3,"priority":4,"userID":"ID:09502dc0-bd8d-11eb-b75c-c6609f1332c9","_AMQ_GROUP_ID":"tech-drive2-188100000059-thai","sentBy":"[email protected]","durable":true,"JMSReplyTo":"queue://QUEUE.service2.tech-drive2.188100000059.thai.notification","__AMQ_CID":"c292b418-bd8b-11eb-b75c-c6609f1332c9","sentFrom":"service2","originalDestination":"QUEUE.service2.tech-drive2.188100000059.thai.notification","_AMQ_ROUTING_TYPE":1,"JMSCorrelationID":"90b783d0-d9cc-4188-9c9e-3453786b2105","expiration":0,"timestamp":1621969900922}]
First message timestamp 1621969900922
Group buckets   -1
Group count 0
Group first key 
Group rebalance false
Group rebalance pause dispatch  false
Id  606
Last value  false
Last value key  
Max consumers   -1
Message count   15
Messages acknowledged   0
Messages added  15
Messages expired    0
Messages killed 0
Name    QUEUE.service2.tech-drive2.188100000059.thai.notification
Object Name org.apache.activemq.artemis:broker="activemq-artemis-master-0",component=addresses,address="QUEUE.service2.tech-drive2.188100000059.thai.notification",subcomponent=queues,routing-type="anycast",queue="QUEUE.service2.tech-drive2.188100000059.thai.notification"
Paused  false
Persistent size 102245
Prepared transaction message count  0
Purge on no consumers   false
Retroactive resource    false
Ring size   -1
Routing type    ANYCAST
Scheduled count 0
Scheduled size  0
Temporary   false
User    6e25e08b-9587-40a3-b7e9-146360539258

and here are the attributes of the new queue:

Attribute   Value
Acknowledge attempts    0
Address QUEUE.service2.tech-drive2.188100000059.thai.notification
Configuration managed   false
Consumer count  1
Consumers before dispatch   0
Dead letter address DLQ
Delay before dispatch   -1
Delivering count    0
Delivering size 0
Durable true
Durable delivering count    0
Durable delivering size 0
Durable message count   0
Durable persistent size 0
Durable scheduled count 0
Durable scheduled size  0
Enabled true
Exclusive   false
Expiry address  ExpiryQueue
Filter  
First message age   
First message as json   [{}]
First message timestamp 
Group buckets   -1
Group count 0
Group first key 
Group rebalance false
Group rebalance pause dispatch  false
Id  866
Last value  false
Last value key  
Max consumers   -1
Message count   0
Messages acknowledged   0
Messages added  0
Messages expired    0
Messages killed 0
Name    QUEUE.service2.tech-drive2.188100000059.thai.notification
Object Name org.apache.activemq.artemis:broker="activemq-artemis-master-1",component=addresses,address="QUEUE.service2.tech-drive2.188100000059.thai.notification",subcomponent=queues,routing-type="anycast",queue="QUEUE.service2.tech-drive2.188100000059.thai.notification"
Paused  false
Persistent size 0
Prepared transaction message count  0
Purge on no consumers   false
Retroactive resource    false
Ring size   -1
Routing type    ANYCAST
Scheduled count 0
Scheduled size  0
Temporary   false
User    6e25e08b-9587-40a3-b7e9-146360539258

1 Answer

Justin Bertram (accepted answer):

I've taken your simplified configuration with just 2 nodes, using a non-wildcard queue and a redistribution-delay of 0, and I reproduced the behavior you're seeing on my local machine (i.e. without Kubernetes). I believe I see why the behavior is what it is, but in order to understand it you first must understand how redistribution works in the first place.

In a cluster, every time a consumer is created, the node on which it is created notifies every other node in the cluster about the consumer. If other nodes have messages in their corresponding queue but no consumers, those nodes redistribute their messages to the node with the consumer (assuming message-load-balancing is ON_DEMAND and the redistribution-delay is >= 0).
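
For reference, these are the two settings that gate that behavior. Here's a minimal excerpt of the relevant bits of broker.xml (your config already has both; you use a 60000 millisecond redistribution-delay, while I used 0 in my reproduction):

<address-settings>
  <address-setting match="#">
    <!-- a value >= 0 enables redistribution; 0 redistributes immediately -->
    <redistribution-delay>0</redistribution-delay>
  </address-setting>
</address-settings>
<cluster-connections>
  <cluster-connection name="activemq-artemis">
    <connector-ref>activemq-artemis-master-0</connector-ref>
    <!-- ON_DEMAND only moves messages to nodes that actually have consumers -->
    <message-load-balancing>ON_DEMAND</message-load-balancing>
    <max-hops>1</max-hops>
  </cluster-connection>
</cluster-connections>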

In your case, however, the node with the messages is actually down when the consumer is created on the other node, so it never receives the notification about that consumer. Therefore, once the node restarts it doesn't know about the consumer and does not redistribute its messages.

I see you've opened ARTEMIS-3321 to enhance the broker to deal with this situation. However, that will take time to develop and release (assuming the change is approved). My recommendation in the meantime would be to configure your client's reconnection, which is discussed in the documentation, e.g.:

tcp://127.0.0.1:61616?reconnectAttempts=30

Given the default retryInterval of 2000 milliseconds, that gives the broker to which the client was originally connected 1 minute (30 attempts × 2000 ms) to come back up before the client gives up reconnecting and throws an exception, at which point the application can completely re-initialize its connection as it is doing now.

Since you're using Spring Boot, be sure to use version 2.5.0, as it contains this change, which will allow you to specify the full broker URL rather than just the host and port.
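
For example, with Spring Boot 2.5.0 you could put the full URL (including the reconnection parameters) into application.properties; the host below is just your master-0 service address, used for illustration:

# connect to an external broker rather than an embedded one
spring.artemis.mode=native
# full broker URL, including client reconnection parameters
spring.artemis.broker-url=tcp://activemq-artemis-master-0.activemq-artemis-master.svc.cluster.local:61616?reconnectAttempts=30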

Lastly, keep in mind that shutting the node down gracefully will short-circuit the client's reconnect and trigger your application to re-initialize the connection, which is not what we want here. Be sure to kill the node ungracefully (e.g. using kill -9 <pid>).
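
In a Kubernetes environment like yours, one way to approximate an ungraceful kill (an assumption on my part; any method that terminates the JVM without a clean shutdown will do) is a forced pod deletion:

kubectl delete pod activemq-artemis-master-1 --grace-period=0 --force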