Cursor not found while reading all documents from a collection

667 views Asked by At

I have a collection student and I want this collection as list in Python, but unfortunately I got the following error CursorNextError: [HTTP 404][ERR 1600] cursor not found. Is there an option to read a 'huge' collection without an error?

from arango import ArangoClient

# Initialize the ArangoDB client.
client = ArangoClient()

# Connect to database as  user.
db = client.db(<db>, username=<username>, password=<password>)

print(db.collections())
students = db.collection('students')
#students.all()

students = db.collection('handlingUnits').all()
list(students)
[OUT] CursorNextError: [HTTP 404][ERR 1600] cursor not found

students = list(db.collection('students'))
[OUT] CursorNextError: [HTTP 404][ERR 1600] cursor not found
2

There are 2 answers

1
mpoeter On

By default, AQL queries generate the complete result, which is then held in memory, and provided batch by batch. So the cursor is simply fetching the next batch of the already calculated result. In most of the cases this is fine, but if your query produces a huge result set, then this can take a long time and will require a lot of memory.

As an alternative you can create a streaming cursor. See https://www.arangodb.com/docs/stable/http/aql-query-cursor-accessing-cursors.html and check the stream option. Streaming cursors calculate the next batch on demand and are therefore better suited to iterate a large collection.

3
hYg-Cain On

as suggested in my comment, if raising the ttl is not an option (what I wouldn't do either) I would get the data in chunks instead of all at once. In most cases you don't need the whole collection anyway, so maybe think of limiting that first. Do you really need all documents and all their fields? That beeing said I have no experience with arango, but this is what I would do:

entries = db.collection('students').count() # get total amount of documents in collection
limit=100 # blocksize you want to request
yourlist = [] # final output
for x in range(int(entries/limit) + 1):
    block = db.collection('students').all(skip=x*limit, limit=100)
    yourlist.extend(block) # assuming block is of type list. Not sure what arango returns

something like this. (Based on the documentation here: https://python-driver-for-arangodb.readthedocs.io/_/downloads/en/dev/pdf/)

Limit your request to a reasonable amount and then skip this amount with your next request. You have to check if this "range()" thing works like that you might have to think of a better way of defining the number of iterations you need. This also assumes arango sorts the all() function per default.

So what is the idea?

  1. determin the number of entries in the collection.
  2. based on that determin how many requests you need (f.e. size=1000 -> 10 blocks each containing 100 entries)
  3. make x requests where you skip the blocks you already have. First iteration entries 1-100; second iteration 101-200, third iteration 201-300 etc.