AWS IoT Python Device SDK Shadow Update Timeouts

2.9k views Asked by At

I have been developing AWS IoT device shadow updates for our devices for some time. We are using the AWS IoT Python device SDK located at https://github.com/aws/aws-iot-device-sdk-python on an ARM distribution of Linux (made with Yocto/Bitbake) and Python 2.7.3 (which is the only version available provided by our hardware vendor for the Yocto we use for our ARM board).

I downloaded the sample shadow update code from the AWS repo, modified it to suite our needs, and put it on our device build. It works, most of the time, I would say, but has a failure rate that I am still uncomfortable with. Note, in this code, I added the "OfflinePublishQueueing" setting to handle when it takes a while for the device to connect. This was recommended by one site to prevent "PublishQueueDisabled" errors that we were getting. Here's the initialization code. I will add that I know that the certificates we are using are good, otherwise there would be no success ever.

self.AWSIoTMQTTShadowClient = AWSIoTMQTTShadowClient("InCoIoTShadowUpdate")
self.AWSIoTMQTTShadowClient.configureEndpoint(endpoint, 8883)
self.AWSIoTMQTTShadowClient.configureCredentials(rootCAPath, privateKeyFile, 
jointCertificateFile)
# AWSIoTMQTTShadowClient configuration
self.AWSIoTMQTTShadowClient.configureAutoReconnectBackoffTime(1, 32, 20)
self.AWSIoTMQTTShadowClient.configureConnectDisconnectTimeout(10)  # 10 sec
self.AWSIoTMQTTShadowClient.configureMQTTOperationTimeout(15)  # 15 sec
MQTTClient = self.AWSIoTMQTTShadowClient.getMQTTConnection()
MQTTClient.configureOfflinePublishQueueing(5, DROP_OLDEST)

And later on, the code to update the shadow, noting that the ThingId is set elsewhere from local device config.

def ConnectAndUpdate(self):

    deviceState = InovaIoTDeviceState()
    log = logging.getLogger("InovaIoTDeviceClient:connectAndUpdate")

    # Connect and subscribe to AWS IoT
    try:
        self.AWSIoTMQTTShadowClient.connect()
    except (connectError, connectTimeoutException):
        log.error("Error connecting to AWS IoT service")
        return False

    # Create a deviceShadow with persistent subscription
    updateBot = self.AWSIoTMQTTShadowClient.createShadowHandlerWithName(self.ThingId, True)
    JSONPayload = deviceState.GetDeviceShadowDocument()

    try:
        updateBot.shadowUpdate(JSONPayload, self.ShadowUpdateCallback, 15)
    # This is bad if these errors are thrown, probably either an initial device registration failure
    # or global problem with Inova AWS IoT console configuration or lambda function
    except (publishError, subscribeError):

        log.error('Publish or subscribe error..')
        return False

    except (publishTimeoutException, subscribeTimeoutException):

        # It is possible we are here due to a temporary snafu in AWS
        log.error('Publish or subscribe timeout..')
        return False

    except publishQueueDisabledException:

        # From time to time, AWS will randomly disconnect.
        log.error('Publish Queue disabled..')
        return False

    while not self.ResponseReceived:
        time.sleep(1)

    # reset state
    self.ResponseReceived = False

    try:
        self.AWSIoTMQTTShadowClient.disconnect()
    except (disconnectError, disconnectTimeoutException):
        log.error('Error attempting to disconnect')

    return self.UpdateSuccess

and finally, the callback code

# Custom MQTT message callback
def ShadowUpdateCallback(self, payload, responseStatus, token):
    log = logging.getLogger("InovaIoTDeviceClient:shadowUpdateCallback")
    if responseStatus == "timeout":
        log.info("Shadow update timeout")
        self.UpdateSuccess = False
    elif responseStatus == "accepted":
        log.info("Shadow update successful")
        self.successive_errors = 0
        self.UpdateSuccess = True
    elif responseStatus == "rejected":
        log.info("Rejected shadow update")
        self.UpdateSuccess = False

    self.ResponseReceived = True

Here are the various errors we get, which are relatively common. I would say that the success rate of the shadow update is only around 60%. Here's a timeout reported from AWS:

2017-09-06 11:15:13: (INFO:AWSIoTPythonSDK.core.shadow.deviceShadow) 
Subscribed to update accepted/rejected topics for deviceShadow: 
qqpba4fgsfazl2zfgqq8zkavj (Line:372)
2017-09-06 11:15:13: (INFO:AWSIoTPythonSDK.core.protocol.mqttCore) Offline 
publish request detected. (Line:343)
2017-09-06 11:15:13: (INFO:AWSIoTPythonSDK.core.protocol.mqttCore) Try 
queueing up this request... (Line:347)
2017-09-06 11:15:28: (INFO:AWSIoTPythonSDK.core.shadow.deviceShadow) Shadow 
request with token: InCoIoTShadowUpdate_qqpba4fgsfazl2zfgqq8zkavj_0_acbxa 
has timed out. (Line:202)
2017-09-06 11:15:28: (INFO:InCoIoTDeviceClient:shadowUpdateCallback) Shadow 
update timeout (Line:188)

Here's a timeout error from the actual shadow update call

2017-09-06 12:40:11: (INFO:AWSIoTPythonSDK.core.protocol.mqttCore) Connected 
to AWS IoT. (Line:302)
2017-09-06 12:40:27: (ERROR:AWSIoTPythonSDK.core.protocol.mqttCore) No 
feedback detected for subscribe request 1. Timeout and failed. (Line:413)
2017-09-06 12:40:27: (ERROR:InCoIoTDeviceClient:connectAndUpdate) Publish 
or subscribe timeout.. (Line:147)
2017-09-06 12:40:27: (INFO:cycle) Unsuccessful shadow update... (Line:173)

So those are the timeouts, which are essentially all the errors. As an additional wrinkle, we wiresharked our device's connection and publish attempts, not to see the content (since it is TLS 1.2 encrypted) but to see the connection behavior. We noted that the behavior we see from the wireshark output is that the AWS endpoint is served by at least 8 different IP addresses. In the timeout cases, we always see a connection happen somewhere. In the case that we subscribe and publish to the shadow update topic, but receive a timeout, it only connects to one of the addresses. In the subscribe or publish timeout, it tries three different addresses.

It may be that all of this is on AWS and the performance at their end, but I wondered if anyone had seen this and been able to work around. Upping the timeout values doesn't seem to help.

1

There are 1 answers

0
Kevin Hirst On BEST ANSWER

So we've figured this out. A.) We upgraded to the recently released 1.2.0 AWS IoT Python SDK, and B.) It seems like they fixed a bug that allowed me to use an online callback when the MQTT CONNACK is received. So I changed my code to this:

    try:
        self.AWSIoTMQTTShadowClient.onOnline = self.onConnect
        self.AWSIoTMQTTShadowClient.connect()
    except (connectError, connectTimeoutException):
        log.error("Error connecting to AWS IoT service")
        return False

    while not self.is_aws_connected:
        self.seconds_waited_for_conn_aws += 1
        if self.seconds_waited_for_conn_aws > 30:
            log.error("Timeout waiting for connected status.")
            return False
        time.sleep(1)

    # Create a deviceShadow with persistent subscription
    updateBot = self.AWSIoTMQTTShadowClient.createShadowHandlerWithName(self.ThingId, True)

    try:
        updateBot.shadowUpdate(JSONPayload, self.ShadowUpdateCallback, 15)
    # This is bad if these errors are thrown, probably either an initial device registration failure
    # or global problem with Inova AWS IoT console configuration or lambda function
    except (publishError, subscribeError):

        log.error('Publish or subscribe error..')
        return False

    except (publishTimeoutException, subscribeTimeoutException):

        # It is possible we are here due to a temporary snafu in AWS
        log.error('Publish or subscribe timeout..')
        return False

    except publishQueueDisabledException:

        # From time to time, AWS will randomly disconnect.
        log.error('Publish Queue disabled..')
        return False

    while not self.ResponseReceived:
        time.sleep(1)

    # reset state
    self.ResponseReceived = False
    return self.UpdateSuccess

def onConnect(self):
    log = logging.getLogger("InovaIoTDeviceClient:onConnect")
    log.info("Callback from AWS layer on connect.")
    self.is_aws_connected = True

def runOnce(self):
    try:
        log = logging.getLogger("cycle")
        if not self.ConnectAndUpdate():
            log.info("Unsuccessful shadow update...")
    finally:
        # only here to make sure we disconnect
        try:
            self.AWSIoTMQTTShadowClient.disconnect()
            self.is_aws_connected = False
        except (disconnectError, disconnectTimeoutException):
            log.error('additional error attempting to disconnect')

As you see, I am setting the "onOnline" callback in the first line of code and waiting for that status to complete. As it was before, I was often in the process of waiting for the system to be marked as STABLE in the AWS SDK underpinnings (from workers.py in AWS SDK)

def _dispatch_connack(self, mid, rc):
    status = self._client_status.get_status()
    self._logger.debug("Dispatching [connack] event")
    if self._need_recover():
        if ClientStatus.STABLE != status:  # To avoid multiple connack dispatching
            self._logger.debug("Has recovery job")
            clean_up_debt = Thread(target=self._clean_up_debt)
            clean_up_debt.start()
    else:
        self._logger.debug("No need for recovery")
        self._client_status.set_status(ClientStatus.STABLE)

Apparently you cannot count on the MQTT system to be fully connected and ready on the other side of the connect() call, but you can if you wait for onOnline event. This is why we previously had to make the configureOfflinePublishQueueing() call. And why we would see this in the logs...

2017-09-19 14:45:13: (INFO:AWSIoTPythonSDK.core.protocol.mqtt_core) Offline request detected! (Line:313)

So thankfully this is now working much better. Incidentally, the older code was MUCH more successful (98% success rate on desktop Linux VMs, 45% success rate on small ARM board) on higher powered processors, so I think their "offline publishing" system was experiencing race conditions on our small ARM processor board.