MarkLogic 8 - XQuery write large result set to a file efficiently


UPDATE: See MarkLogic 8 - Stream large result set to a file - JavaScript - Node.js Client API for someone's answer on how to do this in JavaScript. This question is specifically about XQuery.

I have a web application that consumes REST services hosted in Node.js.

Node simply proxies the request to XQuery, which then queries MarkLogic. These queries already have paging set up and work fine in the normal case, returning a page of data to the UI.

I need an export feature such that when I put a URL parameter of export=all on a request, it no longer looks up a single page.

At that point it should get the whole result set, even if it's a million records, and save it to a file.

The actual request needs to return immediately saying, "We will notify you when your download is ready."

One suggestion was to use xdmp:spawn to call the XQuery in the background which would save the results to a file. My actual HTTP request could then return immediately.

For the spawn piece, I think the idea is that I run my query with different options to get all results instead of one page. Then I would loop through the data, build a string variable, and call xdmp:save with it.
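A minimal sketch of that approach might look like the following. The module path /app/export-task.xqy, the variable names, and the target file path are all assumptions for illustration, not part of the question; the two sections are separate modules shown in one listing.

```xquery
(: ---- caller: returns immediately after queueing the task ---- :)
xquery version "1.0-ml";
xdmp:spawn(
  "/app/export-task.xqy",
  (xs:QName("query-text"),  $query-text,      (: assumed search terms :)
   xs:QName("target-file"), "/space/exports/result.txt"))

(: ---- /app/export-task.xqy: hypothetical spawned module ---- :)
xquery version "1.0-ml";
declare variable $query-text  as xs:string external;
declare variable $target-file as xs:string external;

(: Build one big string from all matching documents and save it.
   Note: this materializes the whole result in memory first. :)
xdmp:save(
  $target-file,
  text {
    fn:string-join(
      for $doc in cts:search(fn:doc(), cts:word-query($query-text))
      return xdmp:quote($doc),
      "&#10;")
  })
```

Because xdmp:spawn queues the task and returns, the original HTTP request can respond right away. The string-building step is exactly where the memory concern below comes in.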

Some questions: is this a good idea? Is there a better way? If I loop through the result set and it happens to be very large (gigabytes), it could cause memory issues.

Is there no way to directly stream the results to a file in XQuery?

Note: Another idea I had was to intercept the request at the proxy (Node) layer, do an xdmp:estimate to get the record count, and then loop through, querying each page and flushing it to disk. In this case I would need some way to return my request immediately yet process in the background in Node, which seems to have some ideas here: http://www.pubnub.com/blog/node-background-jobs-async-processing-for-async-language/
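The Node-layer paging idea can be sketched roughly as below. For clarity this is synchronous and the fetchPage function is an injected stand-in; a real version would call MarkLogic's REST endpoint asynchronously and write each chunk to an fs.createWriteStream so only one page is ever held in memory.

```javascript
// Hypothetical sketch: page through the full result set and flush
// each page to a sink, one page in memory at a time.
function exportAll(fetchPage, writeChunk, pageSize) {
  let start = 1;   // MarkLogic paging is 1-based
  let total = 0;
  for (;;) {
    const page = fetchPage(start, pageSize); // e.g. a REST search call
    if (page.length === 0) break;            // past the last page
    writeChunk(page.join("\n") + "\n");      // flush this page to disk
    total += page.length;
    start += pageSize;                       // next page's start index
  }
  return total; // number of records exported
}
```

An xdmp:estimate up front would let you report progress as a fraction of the expected count, but the loop itself only needs the empty-page check to know when to stop.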


1 Answer

ehennum answered:

One possible strategy would be to use a self-spawning task that, on each iteration, gets the next page of the results for a query.
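A sketch of such a self-spawning task follows. The module path /export/page.xqy, the word query, and the variable names are assumptions for illustration; fn:subsequence works well here because cts:search results are lazy, so each run only materializes its own page.

```xquery
(: ---- /export/page.xqy: hypothetical self-spawning pager ---- :)
xquery version "1.0-ml";
declare variable $start     as xs:unsignedLong external;
declare variable $page-size as xs:unsignedLong external;

let $page := fn:subsequence(
               cts:search(fn:doc(), cts:word-query("example")),
               $start, $page-size)
return
  if (fn:empty($page))
  then ()  (: finished - nothing left to process :)
  else (
    (: handle this page, e.g. send it somewhere with xdmp:http-post :)
    xdmp:spawn("/export/page.xqy",
      (xs:QName("start"),     $start + $page-size,
       xs:QName("page-size"), $page-size))
  )
```

Each iteration is its own short task, so no single transaction ever holds more than one page of results.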

Instead of saving the results directly to a file, however, you might want to consider using xdmp:http-post() to send each page to a server:

http://docs.marklogic.com/xdmp:http-post?q=xdmp:http-post&v=8.0&api=true
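A hedged sketch of the per-page POST, assuming $page holds the current page of results and that a Node.js endpoint at the URL shown (an assumption) appends each request body to the export file:

```xquery
(: Hypothetical: POST one page of results to a Node.js collector. :)
xdmp:http-post(
  "http://localhost:3000/export/append",
  <options xmlns="xdmp:http">
    <headers>
      <content-type>text/plain</content-type>
    </headers>
  </options>,
  text {
    fn:string-join(
      for $result in $page
      return xdmp:quote($result),
      "&#10;")
  })
```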

In particular, the server could be a Node.js server that appends each page, as it arrives, to a file or any other data sink.

That way, Node.js could handle the long-running asynchronous IO with minimal load on the database server.

When a self-spawned task hits the end of the query, it can again use an HTTP request to notify Node.js to close the file and report that the export is finished.

Hoping that helps,