We have a simple ETL process that extracts data from an API into a document DB, which we would like to implement with Azure Functions. In brief, the process is: take a ~16,500-line file, extract an ID from each line (Function 1), build a URL for each ID (Function 2), hit an API with each URL (Function 3), and store the response in a document DB (Function 4). We are using queues for inter-function communication, and we are seeing timeouts in the first function.
Function 1 (index.js)
module.exports = function (context, odsDataFile) {
    context.log('JavaScript blob trigger function processed blob \n Name:', context.bindingData.odaDataFile, '\n Blob Size:', odsDataFile.length, 'Bytes');

    const odsCodes = [];
    odsDataFile.split('\n').forEach((line) => {
        const columns = line.split(',');
        if (columns[12] === 'A') {
            odsCodes.push({
                odsCode: columns[0],
                orgType: 'pharmacy',
            });
        }
    });

    context.bindings.odsCodes = odsCodes;
    context.log(`A total of ${odsCodes.length} ODS codes have been sent to the queue.`);
    context.done();
};
function.json
{
    "bindings": [
        {
            "type": "blobTrigger",
            "name": "odaDataFile",
            "path": "input-ods-data",
            "connection": "connecting-to-services_STORAGE",
            "direction": "in"
        },
        {
            "type": "queue",
            "name": "odsCodes",
            "queueName": "ods-org-codes",
            "connection": "connecting-to-services_STORAGE",
            "direction": "out"
        }
    ],
    "disabled": false
}
Full code here
This function works fine when the number of IDs is in the hundreds, but it times out when the number is in the tens of thousands. Building the ID array takes milliseconds and the function body completes, but adding the items to the queue takes many minutes and eventually hits the default timeout of 5 minutes.
I am surprised that simply populating the queue takes so long, and that the function timeout includes time spent on tasks outside the function body (i.e. queue population). Is this to be expected? Are there more performant ways of doing this?
We are running under the Consumption (Dynamic) Plan.
I did some testing of this from my local machine and found that it takes ~200ms to insert a message into the queue, which is expected. So if you have 17k messages to insert sequentially, the total time is roughly:
17,000 messages * 200ms = 3,400,000ms or ~56 minutes
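That back-of-the-envelope estimate, worked out:

```javascript
// Rough sequential-insert estimate: 17k messages at ~200ms each.
const messages = 17000;
const msPerInsert = 200;
const totalMs = messages * msPerInsert;  // 3,400,000 ms
const minutes = totalMs / 1000 / 60;     // ~56.7 minutes
console.log(totalMs, minutes.toFixed(1));
```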
The latency may be a bit quicker when running from the cloud, but you can see how this would jump over 5 minutes pretty quickly when you are inserting that many messages.
If message ordering isn't crucial, you could insert the messages in parallel. One caveat, though: you'd need to do the queue inserts yourself using the Storage SDK rather than the queue output binding, because the binding doesn't expose the IAsyncCollector interface to you, so it does it all behind the scenes. Here's an example of batching up the inserts 200 at a time; with 17k messages, this took under a minute in my quick test.
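A minimal sketch of that batching approach; `insertOne` here is a placeholder for whatever performs a single queue insert and returns a Promise:

```javascript
// Insert items in parallel batches: fire off `batchSize` inserts at once,
// wait for the whole batch to settle, then start the next batch.
function insertInBatches(items, batchSize, insertOne) {
    let chain = Promise.resolve(0);
    for (let i = 0; i < items.length; i += batchSize) {
        const batch = items.slice(i, i + batchSize);
        chain = chain.then((count) =>
            Promise.all(batch.map(insertOne)).then(() => count + batch.length)
        );
    }
    return chain; // resolves with the total number of items inserted
}
```

With the `azure-storage` package, `insertOne` would wrap `queueService.createMessage` in a Promise, e.g. `(msg) => new Promise((resolve, reject) => queueService.createMessage('ods-org-codes', JSON.stringify(msg), (err) => err ? reject(err) : resolve()))`.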