I have been developing AWS IoT device shadow updates for our devices for some time. We are using the AWS IoT Python device SDK at https://github.com/aws/aws-iot-device-sdk-python on an ARM Linux distribution (built with Yocto/Bitbake) with Python 2.7.3, the only version our hardware vendor provides for the Yocto release we use on our ARM board.
I downloaded the sample shadow update code from the AWS repo, modified it to suit our needs, and put it in our device build. It works most of the time, but has a failure rate I am still uncomfortable with. Note that in this code I added the "OfflinePublishQueueing" setting to handle the case where the device takes a while to connect; one site recommended this to prevent the "PublishQueueDisabled" errors we were getting. Here's the initialization code. I will add that I know the certificates we are using are good; otherwise we would never succeed at all.
self.AWSIoTMQTTShadowClient = AWSIoTMQTTShadowClient("InCoIoTShadowUpdate")
self.AWSIoTMQTTShadowClient.configureEndpoint(endpoint, 8883)
self.AWSIoTMQTTShadowClient.configureCredentials(rootCAPath, privateKeyFile,
                                                 jointCertificateFile)
# AWSIoTMQTTShadowClient configuration
self.AWSIoTMQTTShadowClient.configureAutoReconnectBackoffTime(1, 32, 20)
self.AWSIoTMQTTShadowClient.configureConnectDisconnectTimeout(10) # 10 sec
self.AWSIoTMQTTShadowClient.configureMQTTOperationTimeout(15) # 15 sec
MQTTClient = self.AWSIoTMQTTShadowClient.getMQTTConnection()
MQTTClient.configureOfflinePublishQueueing(5, DROP_OLDEST)
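For clarity, the behavior that configureOfflinePublishQueueing(5, DROP_OLDEST) requests can be modeled with a bounded deque. This is only an illustration of the drop-oldest semantics, not the SDK's internal implementation:

```python
from collections import deque

# Model of configureOfflinePublishQueueing(5, DROP_OLDEST): while offline,
# up to 5 publish requests are held; when a sixth arrives, the oldest is
# evicted. A deque with maxlen drops from the opposite end automatically.
offline_queue = deque(maxlen=5)
for i in range(7):
    offline_queue.append("msg%d" % i)

print(list(offline_queue))  # -> ['msg2', 'msg3', 'msg4', 'msg5', 'msg6']
```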
And later on, here is the code that updates the shadow; note that ThingId is set elsewhere from the local device config:
def ConnectAndUpdate(self):
    # Exception classes come from AWSIoTPythonSDK.exception.AWSIoTExceptions
    deviceState = InovaIoTDeviceState()
    log = logging.getLogger("InovaIoTDeviceClient:connectAndUpdate")
    # Connect and subscribe to AWS IoT
    try:
        self.AWSIoTMQTTShadowClient.connect()
    except (connectError, connectTimeoutException):
        log.error("Error connecting to AWS IoT service")
        return False
    # Create a deviceShadow with persistent subscription
    updateBot = self.AWSIoTMQTTShadowClient.createShadowHandlerWithName(self.ThingId, True)
    JSONPayload = deviceState.GetDeviceShadowDocument()
    try:
        updateBot.shadowUpdate(JSONPayload, self.ShadowUpdateCallback, 15)
    # If these errors are thrown, it is bad: probably either an initial
    # device-registration failure or a global problem with the Inova AWS IoT
    # console configuration or Lambda function
    except (publishError, subscribeError):
        log.error('Publish or subscribe error..')
        return False
    except (publishTimeoutException, subscribeTimeoutException):
        # It is possible we are here due to a temporary snafu in AWS
        log.error('Publish or subscribe timeout..')
        return False
    except publishQueueDisabledException:
        # From time to time, AWS will randomly disconnect.
        log.error('Publish Queue disabled..')
        return False
    while not self.ResponseReceived:
        time.sleep(1)
    # reset state
    self.ResponseReceived = False
    try:
        self.AWSIoTMQTTShadowClient.disconnect()
    except (disconnectError, disconnectTimeoutException):
        log.error('Error attempting to disconnect')
    return self.UpdateSuccess
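One caveat about the wait loop above: polling a plain ResponseReceived attribute with time.sleep(1) spins forever if the callback never fires. A sketch of an alternative using threading.Event with a timeout (the ShadowWaiter name and plumbing are my own, not from the SDK):

```python
import threading

class ShadowWaiter(object):
    """Sketch: replace the ResponseReceived polling flag with an Event."""

    def __init__(self):
        self.response_event = threading.Event()

    def on_response(self):
        # Call this from the shadow update callback instead of setting a bool
        self.response_event.set()

    def wait_for_response(self, timeout=30):
        # Block until the callback fires or the timeout elapses; clear the
        # event afterwards so the next update cycle starts fresh.
        fired = self.response_event.wait(timeout)
        self.response_event.clear()
        return fired
```

With this, the update path becomes `if not waiter.wait_for_response(30): return False` instead of an unbounded sleep loop, so a lost response cannot hang the device indefinitely.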
And finally, the callback code:
# Custom MQTT message callback
def ShadowUpdateCallback(self, payload, responseStatus, token):
    log = logging.getLogger("InovaIoTDeviceClient:shadowUpdateCallback")
    if responseStatus == "timeout":
        log.info("Shadow update timeout")
        self.UpdateSuccess = False
    elif responseStatus == "accepted":
        log.info("Shadow update successful")
        self.successive_errors = 0
        self.UpdateSuccess = True
    elif responseStatus == "rejected":
        log.info("Rejected shadow update")
        self.UpdateSuccess = False
    self.ResponseReceived = True
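On the "rejected" branch, the payload delivered to the callback is the shadow error document, which carries "code" and "message" fields explaining why AWS IoT refused the update; logging those makes rejections much easier to diagnose. A small helper sketch (the function name is mine):

```python
import json

def parse_rejection(payload):
    # The shadow ".../update/rejected" error document includes "code" and
    # "message" fields describing why the update was refused.
    try:
        doc = json.loads(payload)
    except ValueError:  # json raises ValueError on malformed input (2.7-safe)
        return (None, payload)
    return (doc.get("code"), doc.get("message"))
```

In the rejected branch above, this would be used as `code, message = parse_rejection(payload)` followed by `log.info("Rejected shadow update: code=%s message=%s", code, message)`.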
Here are the various errors we get, which are relatively common; I would put the shadow update success rate at only around 60%. Here's a timeout reported from AWS:
2017-09-06 11:15:13: (INFO:AWSIoTPythonSDK.core.shadow.deviceShadow)
Subscribed to update accepted/rejected topics for deviceShadow:
qqpba4fgsfazl2zfgqq8zkavj (Line:372)
2017-09-06 11:15:13: (INFO:AWSIoTPythonSDK.core.protocol.mqttCore) Offline
publish request detected. (Line:343)
2017-09-06 11:15:13: (INFO:AWSIoTPythonSDK.core.protocol.mqttCore) Try
queueing up this request... (Line:347)
2017-09-06 11:15:28: (INFO:AWSIoTPythonSDK.core.shadow.deviceShadow) Shadow
request with token: InCoIoTShadowUpdate_qqpba4fgsfazl2zfgqq8zkavj_0_acbxa
has timed out. (Line:202)
2017-09-06 11:15:28: (INFO:InCoIoTDeviceClient:shadowUpdateCallback) Shadow
update timeout (Line:188)
Here's a timeout error from the actual shadow update call:
2017-09-06 12:40:11: (INFO:AWSIoTPythonSDK.core.protocol.mqttCore) Connected
to AWS IoT. (Line:302)
2017-09-06 12:40:27: (ERROR:AWSIoTPythonSDK.core.protocol.mqttCore) No
feedback detected for subscribe request 1. Timeout and failed. (Line:413)
2017-09-06 12:40:27: (ERROR:InCoIoTDeviceClient:connectAndUpdate) Publish
or subscribe timeout.. (Line:147)
2017-09-06 12:40:27: (INFO:cycle) Unsuccessful shadow update... (Line:173)
So those are the timeouts, which account for essentially all of the errors. As an additional wrinkle, we captured our device's connection and publish attempts with Wireshark, not to see the content (it is TLS 1.2 encrypted) but to observe the connection behavior. The capture shows that the AWS endpoint is served by at least eight different IP addresses, and in the timeout cases a connection always succeeds somewhere. When we subscribe and publish to the shadow update topic but the update times out, the client connects to only one of the addresses; in the subscribe or publish timeout case, it tries three different addresses.
It may be that all of this comes down to AWS and performance at their end, but I wondered if anyone had seen this and been able to work around it. Raising the timeout values doesn't seem to help.
So we've figured this out. A) We upgraded to the recently released 1.2.0 AWS IoT Python SDK, and B) it appears they fixed a bug, which allowed me to use an online callback that fires when the MQTT CONNACK is received. So I changed my code to this:
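In rough outline, the change looks like this. This is a sketch rather than the exact production code: the onOnline callback attribute is what the v1.2.0 shadow client exposes, but the Event-based wait shown here is illustrative plumbing of my own:

```python
import threading

# Register an onOnline callback (available on the v1.2.0 shadow client) and
# block on an Event until the MQTT CONNACK has actually arrived, instead of
# trusting connect() alone.
online_event = threading.Event()

def on_online_callback():
    # The SDK fires this once the client is genuinely online.
    online_event.set()

def wait_until_online(timeout=30):
    # Returns True once onOnline has fired, False if we time out first.
    return online_event.wait(timeout)

# Usage against the real client (not executed here):
#   self.AWSIoTMQTTShadowClient.onOnline = on_online_callback
#   self.AWSIoTMQTTShadowClient.connect()
#   if not wait_until_online(30):
#       return False  # never came online; skip this update cycle
```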
As you can see, I set the "onOnline" callback in the first line of code and wait for that status before proceeding. Previously, the publish was often racing against the connection still waiting to be marked STABLE in the AWS SDK underpinnings (workers.py in the AWS SDK).
Apparently you cannot count on the MQTT system being fully connected and ready on the other side of the connect() call, but you can if you wait for the onOnline event. This is why we previously had to make the configureOfflinePublishQueueing() call, and why we would see the "Offline publish request detected" messages in the logs above.
So thankfully this is now working much better. Incidentally, the older code was MUCH more successful on higher-powered processors (a 98% success rate on desktop Linux VMs versus 45% on the small ARM board), so I suspect the SDK's "offline publishing" system was hitting race conditions on our small ARM processor board.