Neo4j, long linked lists, timeouts, - should I use an index instead? (Neo 1.9)

261 views Asked by At

I'm using a linked list to keep track of a user's activities, but I want to keep track of every activity, not just the last 20 or so.. So over time, a user might have 5000 nodes in the list.. (Neo 1.9, Neography Gem, ROR 3.2...)

When I use cypher to traverse the list, it does well as long as the number of hops is under 40 or so... but.. with thousands of nodes, 40 isn't gonna cut it.. If I use a larger number, Neo basically gets tied up for quite some time..

The issue is this.. say a user liked a photo a while ago, say 1000 actions ago before the most recent, and for some odd reason they want to unlike a photo.. Well, I have to snip that node out of the activity linked list. Problem is, I have to locate that activity node that is buried 1000 nodes deep.. So the 40 hops that seems to not freeze things up isn't going to work at all

When I run the following cypher query, it just takes minutes, if it finishes at all.. (:ACTIVITIES_FIRST_NODE IS the head of the linked list)

START viewer_who_liked=node(2675)
MATCH viewer_who_liked-[:ACTIVITIES_FIRST_NODE|ACTIVITIES_NEXT_NODE*]-activity_list_node-[:LIKES_PHOTO]->object 
RETURN id(activity_list_node)

It doesn't freeze up if I change it to [:ACTIVITIES_FIRST_NODE|ACTIVITIES_NEXT_NODE*1..30] but that won't find the node I'm looking for.. And the where function is useless as it seems to be evaluated after the traversal of the activities nodes.. I've used the with clause to no avail.. I also tried step and limiting - no dice.. it always times out...

Now, if I use limit and skip it doesn't bomb but it also doesn't return the node I'm looking for unless I plan on iterating through chunks of the list.... but I'm looking for one result, and quickly, and don't want to iterate through sets of records.. (This works fine for an activity feed that is paged, but not looking for a needle in a haystack.. I need the needle in the haystack..)

So, I was wondering, should I put the activities nodes on an index (in addition to being on the linked list - the linked list is great for activity feeds etc), and search the index to retrieve the proper activity node, then perform the removal of the target node? Or am I overlooking something here/ doing something wrong..

1

There are 1 answers

4
jjaderberg On BEST ANSWER

When the user 'un-likes' the picture, do you have enough information to retrieve the 'object' node right away? Could you bind both user and object, follow the incoming [:LIKES_PHOTO] relationships that the object node has to some activity node, and see which of those activity nodes connect to the right user node? Something like

START photo=node(5762), user=node(2675)
MATCH photo<-[r:LIKES_PHOTO]-(like_event), user-[:FIRST]->(first_event)
WHERE first_event-[:NEXT*1..1000]->like_event
DELETE r

I think the pattern in the WHERE clause above would be cheaper to match than a pattern from the user to all the photos he has liked, testing properties to find the right one.


Some other things to try with your query and model:

1) Declare the direction on the variable length relationship, i.e. -[:NEXT*1..100]-> . Even if the traversal can only progress along one relationship for each step, it is still significantly faster when the direction is explicit. Testing on a similar structure to yours, adding direction reduced the number of db hits for the TraversalMatcher by 75% (below).

2) If you can narrow the possible depth down to a some limited range, do so. It's faster to query (a)-[:NEXT*90..110]->(b)-[:DATA]->(c) than (a)-[:NEXT*..100]->(b)-[:DATA]->(c). You could do some kind of iterative search or paging of your stream this way, though I haven't tried it.

3) If some relationship only occurs once (your [:ACTIVITIES_FIRST_NODE] perhaps) then don't include it in the variable depth portion of the pattern. On a structure similar to yours profiling a match with these patterns had the TraversalMatcher hit the db as follows

(a)-[:DATA|NEXT*45..55]-(b)         //_db_hits=212366
(a)-[:DATA|NEXT*45..55]->(b)        //_db_hits=53566
(a)-[:NEXT*45..55]-()-[:DATA]-(b)   //_db_hits=418
(a)-[:NEXT*45..55]->()-[:DATA]->(b) //_db_hits=107

4) If matching (a)-[:NEXT*..1000]->(b) doesn't return, matching shortestPath( (a)-[:NEXT*..1000]->(b) ) might. I'm not sure why, but when I tested it the former pattern ran off a cliff while shortestPath() returned in ~200ms. It may have something to do with finding some vs. all matching paths, and maybe depth vs. breadth first traversal order, but I haven't looked into it.

5) If the series get too long, split them up. Either create a new series per day/week or for every Nth member, or relate the user directly to steps in the series at some interval (also per some time or amount. This would be an analogue solution to indexing the activity nodes like you mentioned).