Returning a count of and all complete paths in Neo4j

745 views Asked by At

Being an absolute noob in neo4j and having had very generous help with a previous question, I thought I'd try my luck once again as I'm still struggling.

The example scenario is that of students that enters a house and walks from one room to another. The journey doesn't have to start or end at a particular room but the order of sequence that a student enters a room is important.

What I want to find out is all the complete paths that students have taken along with a count of how many times the path in question was taken. Below is the sample data and what I've tried (thanks to the answer of a previous question along with a series of blog posts):

the file dorm.csv

ID|SID|EID|ROOM|ENTERS|LEAVES
1|1|12|BLUE|1/01/2015 11:00|4/01/2015 10:19
2|2|18|GREEN|1/01/2015 12:11|1/01/2015 12:11
3|2|18|YELLOW|1/01/2015 12:11|1/01/2015 12:20
4|2|18|BLUE|1/01/2015 12:20|5/01/2015 10:48
5|3|28|GREEN|1/01/2015 18:41|1/01/2015 18:41
6|3|28|YELLOW|1/01/2015 18:41|1/01/2015 21:00
7|3|28|BLUE|1/01/2015 21:00|9/01/2015 9:30
8|4|36|BLUE|1/01/2015 19:30|3/01/2015 11:00
9|5|40|GREEN|2/01/2015 19:08|2/01/2015 19:08
10|5|40|ORANGE|2/01/2015 19:08|3/01/2015 2:43
11|5|40|PURPLE|3/01/2015 2:43|4/01/2015 16:44
12|6|48|GREEN|3/01/2015 11:52|3/01/2015 11:52
13|6|48|YELLOW|3/01/2015 11:52|3/01/2015 17:45
14|6|48|RED|3/01/2015 17:45|7/01/2015 10:00

creating nodes for Student, Room and Visit where Visit is the event of a student entering a room uniquely identified by the ID property

CREATE CONSTRAINT ON (student:Student) ASSERT student.studentID IS UNIQUE;
CREATE CONSTRAINT ON (room:Room) ASSERT room.roomID IS UNIQUE;
CREATE CONSTRAINT ON (visit:Visit) ASSERT visit.visitID IS UNIQUE;


USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///dorm.csv" as line fieldterminator '|'
MERGE (student:Student {studentID: line.SID})
MERGE (room:Room {roomID: line.ROOM})
MERGE (visit:Visit {visitID: line.ID, roomID: line.ROOM, studentID: line.SID, ticketID: line.EID})
create (student)-[:VERB]->(visit)-[:OBJECT]->(room)

Creating a PREV relationship allows the ordering or sequencing that the student travels in. This uses data in the file dormprev.csv. If a student has only visited a single room, this ID will not appear in the dormprev file as its purpose is to link/chain visits. Data as below

ID|PREV_ID|EID
3|2|18
4|3|18
6|5|28
7|6|28
10|9|40
11|10|40
13|12|48
14|13|48

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///dormprev.csv" as line fieldterminator '|'
MATCH (new:Visit {visitID: line.ID})
MATCH (old:Visit {visitID: line.PREV_ID})
MERGE (new)-[:PREV]->(old)

I can view all student journeys by the below query

MATCH (student:Student)-[:VERB]->(visit:Visit)-[:OBJECT]-(room:Room)
RETURN student, visit, room

However, I have no idea how to return all of the rooms in a complete path.

if I run this query

MATCH p = (:Visit)<-[:PREV]-(:Visit) return p

I can see that it, for example, for student ID 2 returns Green and Yellow and then Yellow and Blue as a separate pair - I want to view that as Green, Yellow, Blue

This also means that if I run the below query:

MATCH p = (:Visit)<-[:PREV]-(:Visit)
WITH p, EXTRACT(v IN NODES(p) | v.roomID) AS rooms
UNWIND rooms AS stays
WITH p, COUNT(DISTINCT stays) AS distinct_stays
WHERE distinct_stays = LENGTH(NODES(p))
RETURN EXTRACT(v in NODES(p) | v.roomID), count(p)
ORDER BY count(p) DESC

it will return a count of those pairings rather than count of "whole paths" if that makes sense.

For example, SID 2 and SID 3 both visit rooms GREEN, YELLOW, BLUE in that order. SID 5 visits GREEN, ORANGE, PURPLE in that order.

What I'm hoping to see is:

[GREEN, YELLOW, BLUE] 2
[GREEN, ORANGE, PURPLE] 1

etc. Is that possible with the above model and if so can anyone please help point me in the right direction? The number of rooms that are visited is not guaranteed and can be anything from one to *. However, if only one room is visited, that's not really of interest and so is the reason why I thought this model might make sense (again, stolen from a blog post series).

I don't know if the above makes sense but any help would be much appreciated - this makes for an excellent use case and would be really useful.

Thank you for your kind help.

2

There are 2 answers

3
Tom Geudens On BEST ANSWER

What I think you are looking for is variable path length. And you can accomplish that by merely changing this in your query (note the asterisk) :

MATCH p = (:Visit)<-[:PREV*]-(:Visit)

Do allow me a couple of further remarks. Yes, I understand the convenience of having roomID and studentID in the Visit node (keeps this specific query quite a bit simpler), but you are ignoring the whole point of having relationships in the first place (in fact, if you do it this way there's currently actually no point in having the Student and Room nodes at all) and you are going to have trouble maintaining them. Secondly ... if we are going to be splitting the proverbial 3rd normal form hairs ;-), then the relations for a Visit should actually be created as follows (note the direction of the relationships) :

CREATE (student)-[:VERB]->(visit)<-[:OBJECT]-(room)

Other than that I must say you're moving very fast :-)

Hope this helps, Tom

4
InverseFalcon On

Building a bit on Tom's suggestions, you might consider an alternate model doing away with :Visit nodes completely, and making your relationship types a bit more focused, like this:

(:Student)-[:VISITED]->(:Room)

You can set entered and left properties on the :VISITED relationship, which will allow you to order the relationships (and corresponding :Rooms) in visited order.

Here's an alternate import that will do this, using APOC Procedures (you'll have to install the correct version corresponding with your Neo4j version) to parse out timestamps from your date strings.

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///dorm.csv" as line fieldterminator '|'
MERGE (student:Student {studentID: line.SID})
MERGE (room:Room {roomID: line.ROOM})
WITH student, room, apoc.date.parse(line.ENTERS, 'ms', 'MM/dd/yyyy HH:mm') as entered, apoc.date.parse(line.LEAVES, 'ms', 'MM/dd/yyyy HH:mm') as left
CREATE (student)-[r:VISITED]->(room)
SET r.entered = entered, r.left = left

And now your query to get all paths and the number of students who have taken those paths becomes very easy:

MATCH (s:Student)-[v:VISITED]->(r:Room)
WHERE size((s)-[:VISITED]->()) > 1
WITH s, r
ORDER BY v.entered ASC
WITH s, collect(r.roomID) as rooms
RETURN rooms, count(s)