Katta Index deployment issue

510 views Asked by At

I am seeing Deployment failed on a 3-Node Katta cluster. When I trigger katta deployment through command line, it tries to deploy index 3-times and fails with the message "failed to deploy index 'katta_index_1' in 374635 ms".

Deploy related log is as follows:

13/12/09 03:48:19 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=master.domain.com,node1.domain.com,node2.domain.com sessionTimeout=30000 watcher=org.I0Itec.zkclient.ZkClient@5c74c3aa
13/12/09 03:48:19 INFO zookeeper.ClientCnxn: Opening socket connection to server node1.domain.com/<node1-ip>:2181
13/12/09 03:48:19 INFO zookeeper.ClientCnxn: Socket connection established to node1.domain.com/<node1-ip>:2181, initiating session
13/12/09 03:48:19 INFO zookeeper.ClientCnxn: Session establishment complete on server node1.domain.com/<node1-ip>:2181, sessionid = 0x242d67304660002, negotiated timeout = 30000
13/12/09 03:48:19 INFO zkclient.ZkClient: zookeeper state changed (SyncConnected)
...................................13/12/09 03:49:30 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 20000ms for sessionid 0x242d67304660002, closing socket connection and attempting reconnect
13/12/09 03:49:30 INFO zkclient.ZkClient: zookeeper state changed (Disconnected)
13/12/09 03:49:31 INFO zookeeper.ClientCnxn: Opening socket connection to server node2.domain.com/<node2-ip>:2181
13/12/09 03:49:31 INFO zookeeper.ClientCnxn: Socket connection established to node2.domain.com/<node2-ip>:2181, initiating session
13/12/09 03:49:31 INFO zookeeper.ClientCnxn: Session establishment complete on server node2.domain.com/<node2-ip>:2181, sessionid = 0x242d67304660002, negotiated timeout = 30000
13/12/09 03:49:31 INFO zkclient.ZkClient: zookeeper state changed (SyncConnected)
13/12/09 03:49:31 WARN client.IndexDeployFuture: Reconnecting IndexDeployFuture
....................13/12/09 03:50:10 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 20000ms for sessionid 0x242d67304660002, closing socket connection and attempting reconnect
13/12/09 03:50:10 INFO zkclient.ZkClient: zookeeper state changed (Disconnected)
13/12/09 03:50:10 INFO zookeeper.ClientCnxn: Opening socket connection to server master.domain.com/<master-ip>:2181
13/12/09 03:50:10 INFO zookeeper.ClientCnxn: Socket connection established to master.domain.com/<master-ip>:2181, initiating session
13/12/09 03:50:10 INFO zookeeper.ClientCnxn: Session establishment complete on server master.domain.com/<master-ip>:2181, sessionid = 0x242d67304660002, negotiated timeout = 30000
13/12/09 03:50:10 INFO zkclient.ZkClient: zookeeper state changed (SyncConnected)
13/12/09 03:50:10 WARN client.IndexDeployFuture: Reconnecting IndexDeployFuture
13/12/09 03:50:30 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 20000ms for sessionid 0x242d67304660002, closing socket connection and attempting reconnect
13/12/09 03:50:30 INFO zkclient.ZkClient: zookeeper state changed (Disconnected)
13/12/09 03:50:31 INFO zookeeper.ClientCnxn: Opening socket connection to server node1.domain.com/<node1-ip>:2181
13/12/09 03:50:31 INFO zookeeper.ClientCnxn: Socket connection established to node1.domain.com/<node1-ip>:2181, initiating session
13/12/09 03:50:31 INFO zookeeper.ClientCnxn: Session establishment complete on server node1.domain.com/<node1-ip>:2181, sessionid = 0x242d67304660002, negotiated timeout = 30000
13/12/09 03:50:31 INFO zkclient.ZkClient: zookeeper state changed (SyncConnected)
13/12/09 03:50:51 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 20000ms for sessionid 0x242d67304660002, closing socket connection and attempting reconnect
13/12/09 03:50:51 INFO zkclient.ZkClient: zookeeper state changed (Disconnected)
13/12/09 03:50:51 INFO zookeeper.ClientCnxn: Opening socket connection to server node2.domain.com/<node2-ip>:2181
13/12/09 03:50:51 INFO zookeeper.ClientCnxn: Socket connection established to node2.domain.com/<node2-ip>:2181, initiating session
13/12/09 03:50:51 INFO zookeeper.ClientCnxn: Session establishment complete on server node2.domain.com/<node2-ip>:2181, sessionid = 0x242d67304660002, negotiated timeout = 30000
13/12/09 03:50:51 INFO zkclient.ZkClient: zookeeper state changed (SyncConnected)
.13/12/09 03:51:09 WARN client.IndexDeployFuture: Reconnecting IndexDeployFuture
13/12/09 03:51:09 WARN client.IndexDeployFuture: Reconnecting IndexDeployFuture
...........................................................................................................................................................................................................
failed to deploy index 'katta_index_1' in 374635 ms
13/12/09 03:54:34 WARN protocol.InteractionProtocol: following components still connected:[net.sf.katta.client.IndexDeployFuture@604e280c]
13/12/09 03:54:34 INFO zkclient.ZkEventThread: Terminate ZkClient event thread.
13/12/09 03:54:34 INFO zookeeper.ClientCnxn: EventThread shut down
13/12/09 03:54:34 INFO zookeeper.ZooKeeper: Session: 0x242d67304660002 closed

There is not much helpful information in startMaster log related to this issue, which is as follows:

13/12/09 03:51:13 INFO master.Master: master.domain.com_cb1ea8a3-6b5e-463e-8ef5-ae9a01e1fd61 became master with 0 waiting master operations
13/12/09 03:51:13 INFO master.Master: start managing nodes...
13/12/09 03:51:13 INFO master.Master: found following nodes connected: [master.domain.com:20000, node2.domain.com:20000, node1.domain.com:20000]
13/12/09 03:51:13 INFO master.OperatorThread: starting...
13/12/09 03:51:13 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 03:52:03 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 03:52:53 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 03:53:43 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 03:54:33 INFO master.OperatorThread: SAFE MODE: leaving safe mode with 3 connected nodes
13/12/09 03:54:33 INFO master.OperationRegistry: watch operation 'IndexDeployOperation:65d9e449:katta_index_1' for node operations [master.domain.com:20000-operation-0000000003, node1.domain.com:20000-operation-0000000003, node2.domain.com:20000-operation-0000000003]
13/12/09 03:54:33 INFO protocol.InteractionProtocol: unregistering component net.sf.katta.master.OperationWatchdog@441aa8dc: {net.sf.katta.master.Master@626f50a8=[AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live]}
13/12/09 03:54:33 INFO master.AbstractIndexOperation: deployment of index katta_index_1 complete
13/12/09 03:54:34 INFO master.OperationWatchdog: watch for IndexDeployOperation:65d9e449:katta_index_1 finished
13/12/09 03:54:34 INFO master.OperatorThread: executing operation 'CheckIndicesOperation:58ca40d2'
13/12/09 03:54:34 INFO master.OperatorThread: executing operation 'RemoveObsoleteShardsOperation:783614ab:master.domain.com:20000'
13/12/09 03:54:34 INFO master.OperatorThread: executing operation 'RemoveObsoleteShardsOperation:48274f3c:node2.domain.com:20000'
13/12/09 03:54:34 INFO master.OperatorThread: executing operation 'RemoveObsoleteShardsOperation:1550a7d:node1.domain.com:20000'
13/12/09 03:54:34 INFO master.OperatorThread: executing operation 'RemoveObsoleteShardsOperation:4419391c:master.domain.com:20000'
13/12/09 03:54:34 INFO master.OperatorThread: executing operation 'RemoveObsoleteShardsOperation:519a9fe8:master.domain.com:20000'
13/12/09 03:54:35 INFO master.OperatorThread: executing operation 'BalanceIndexOperation:9f4152a:katta_index_1'
13/12/09 03:54:35 INFO master.AbstractIndexOperation: balancing shards for index 'katta_index_1'
13/12/09 03:54:35 INFO master.OperationRegistry: watch operation 'BalanceIndexOperation:9f4152a:katta_index_1' for node operations [master.domain.com:20000-operation-0000000000, node1.domain.com:20000-operation-0000000000, node2.domain.com:20000-operation-0000000000]
13/12/09 03:54:50 INFO protocol.InteractionProtocol: unregistering component net.sf.katta.master.OperationWatchdog@22f37b06: {net.sf.katta.master.Master@626f50a8=[AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live]}
13/12/09 03:54:50 INFO master.AbstractIndexOperation: balancing of index katta_index_1 complete
13/12/09 03:54:50 INFO master.OperationWatchdog: watch for BalanceIndexOperation:9f4152a:katta_index_1 finished
13/12/09 03:55:24 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6666ms for sessionid 0x342d68098830007, closing socket connection and attempting reconnect
13/12/09 03:56:13 INFO zkclient.ZkClient: zookeeper state changed (Disconnected)
13/12/09 03:56:13 INFO zookeeper.ClientCnxn: Opening socket connection to server node1.domain.com/<node1-ip>:2181
13/12/09 03:56:15 INFO protocol.InteractionProtocol: unregistering component net.sf.katta.master.OperationWatchdog@22f37b06: {net.sf.katta.master.Master@626f50a8=[AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live]}
13/12/09 03:56:15 INFO master.OperatorThread: operator thread stopped
13/12/09 03:56:15 INFO zookeeper.ClientCnxn: Socket connection established to node1.domain.com/<node1-ip>:2181, initiating session
13/12/09 03:56:15 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x342d68098830007 has expired, closing socket connection
13/12/09 03:56:15 INFO zkclient.ZkClient: zookeeper state changed (Expired)
13/12/09 03:56:15 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=master.domain.com,node1.domain.com,node2.domain.com sessionTimeout=10000 watcher=org.I0Itec.zkclient.ZkClient@3a860d49
13/12/09 03:56:15 INFO zookeeper.ClientCnxn: EventThread shut down
13/12/09 03:56:15 INFO zookeeper.ClientCnxn: Opening socket connection to server node1.domain.com/<node1-ip>:2181
13/12/09 03:56:15 INFO zookeeper.ClientCnxn: Socket connection established to node1.domain.com/<node1-ip>:2181, initiating session
13/12/09 03:56:15 INFO zookeeper.ClientCnxn: Session establishment complete on server node1.domain.com/<node1-ip>:2181, sessionid = 0x242d67304660006, negotiated timeout = 10000
13/12/09 03:56:15 INFO zkclient.ZkClient: zookeeper state changed (SyncConnected)
13/12/09 03:56:15 INFO protocol.InteractionProtocol: master.domain.com_cb1ea8a3-6b5e-463e-8ef5-ae9a01e1fd61 started as master
13/12/09 03:56:15 INFO protocol.InteractionProtocol: master 'master.domain.com_cb1ea8a3-6b5e-463e-8ef5-ae9a01e1fd61' published
13/12/09 03:56:15 INFO upgrade.UpgradeRegistry: version of distribution 0.7.dev
13/12/09 03:56:15 INFO upgrade.UpgradeRegistry: version of cluster 0.7.dev
13/12/09 03:56:15 INFO master.Master: master.domain.com_cb1ea8a3-6b5e-463e-8ef5-ae9a01e1fd61 became master with 0 waiting master operations
13/12/09 03:56:15 INFO master.Master: start managing nodes...
13/12/09 03:56:15 INFO master.Master: found following nodes connected: [master.domain.com:20000, node1.domain.com:20000]
13/12/09 03:56:15 INFO master.OperatorThread: starting...
13/12/09 03:56:15 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 03:57:05 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 03:57:55 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 03:58:45 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 03:59:35 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 04:00:25 INFO master.OperatorThread: SAFE MODE: leaving safe mode with 3 connected nodes
13/12/09 04:00:25 INFO master.OperatorThread: executing operation 'CheckIndicesOperation:6b114461'
13/12/09 04:00:25 INFO master.OperatorThread: executing operation 'RemoveObsoleteShardsOperation:7a8b0be7:master.domain.com:20000'

I am guessing this is some timeout related issue. Can some one help me with this issue? Thanks in advance.

1

There are 1 answers

0
kn_pavan On

This issue was specific to our Katta cluster. I have figured out that Disk I/O speeds were very low on these nodes. Katta master was waiting for nodes to reply back with deploy status in the "Negotiated Time" when the reply was not received after the wait time, Katta was failing.

As the disk write speed was low, Katta Nodes were busy and responding to Zookeeper cluster's queries and thus causing the master to restart/to go into Safemode "SAFE MODE: No nodes available or state unstable within the last 200000 ms."

We resolved the disk write issue by replacing disk volumes with different mounts and Katta deploy issue also got resolved.