Neo4j - get the distinct count of clusters that contain each value


I am trying to get the distinct count of clusters that contain a certain value from a list of values.

Dataset:

CREATE 
(a1: node {tag: "a"}),
(a2: node {tag: "a"}),
(a3: node {tag: "a"}),
(a4: node {tag: "a"}),
(b1: node {tag: "b"}),
(b2: node {tag: "b"}),
(c1: node {tag: "c"}),
(a1)-[:LINKS_TO]->(a2),
(a4)-[:LINKS_TO]->(b1),
(a4)-[:LINKS_TO]->(c1)


I would like to get a distinct count of clusters for each distinct value of tag.

  • a: 3.
    • There are 4 nodes that have tag: a, but they are in 3 distinct clusters: clusters 1, 2 and 3
  • b: 2.
    • b appears in 2 distinct clusters: clusters 3 and 4
  • c: 1.
    • c appears in 1 distinct cluster: cluster 3

I attempted to get a distinct list of tag values and a list of clusters with the query below, but I am not sure how to join the two lists to get the expected distinct counts.

MATCH (node)
WITH collect(distinct node.tag) AS tag_list, collect(node) AS clusters
RETURN tag_list, clusters

Many thanks in advance!


There are 2 answers

Finbar Good (best answer)

Here is a way that doesn't depend on a cluster attribute:

MATCH (n:node) 
WHERE NOT EXISTS { (n)--+(m:node) WHERE n.tag = m.tag AND id(n) > id(m) }
RETURN n.tag AS tag, count(*) AS numClusters

Result:

tag numClusters
"a" 3
"b" 2
"c" 1

This works because, within each cluster, the query keeps only the node with the lowest id among the nodes that share a tag, so each cluster is counted exactly once per tag.
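
To see which nodes the filter keeps, a variant of the query above can list the surviving node ids per tag instead of counting them. This is only a sketch: representativeIds is a column name made up here, the actual ids depend on your database, and the --+ quantifier needs a recent Neo4j 5 release.

```cypher
// Same filter as above: keep a node only if no connected node
// with the same tag has a smaller id.
MATCH (n:node)
WHERE NOT EXISTS { (n)--+(m:node) WHERE n.tag = m.tag AND id(n) > id(m) }
// List the kept nodes per tag; the size of each list is numClusters.
RETURN n.tag AS tag, collect(id(n)) AS representativeIds
ORDER BY tag
```

Each list should contain one id per cluster in which the tag appears.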

cybersam

If, as your question originally stated, each node contains a cluster property, this simple query:

MATCH (n:node)
RETURN n.tag AS tag, COLLECT(DISTINCT n.cluster) AS clusters

gets the result:

╒═══╤═════════╕
│tag│clusters │
╞═══╪═════════╡
│"a"│[1, 2, 3]│
├───┼─────────┤
│"b"│[3, 4]   │
├───┼─────────┤
│"c"│[3]      │
└───┴─────────┘
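
If only the counts are needed, as in the question, the same idea can be sketched with count(DISTINCT ...) instead of collecting the values. This assumes every node has a cluster property; numClusters is just an illustrative column name.

```cypher
// Count the distinct cluster values per tag instead of listing them.
MATCH (n:node)
RETURN n.tag AS tag, count(DISTINCT n.cluster) AS numClusters
ORDER BY tag
```

With the lists shown above, that would give 3 for "a", 2 for "b" and 1 for "c".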

On the other hand, if there is no cluster property, then take a look at this answer to see how to use the Neo4j Graph Data Science Library's WCC algorithm to efficiently calculate the componentId for each node (where a "component" is the same as what you call a "cluster"). (But @FinbarGood's answer should work fine if you do not have much data and all clusters are small.)

Once you have created the GDS projection, like this:

CALL gds.graph.project(
  'myGraph',
  'node',
  'LINKS_TO'
)

you can get your desired results this way:

CALL gds.wcc.stream('myGraph') YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).tag AS tag, COLLECT(DISTINCT componentId) AS clusters
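
If you want the number of clusters per tag rather than the list of component ids, a minimal sketch is to count the distinct componentId values instead. It assumes the 'myGraph' projection from above still exists; numClusters is an illustrative column name.

```cypher
// Stream the WCC component of every projected node,
// then count the distinct components per tag.
CALL gds.wcc.stream('myGraph') YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).tag AS tag, count(DISTINCT componentId) AS numClusters
ORDER BY tag
```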

Afterwards, you should drop the GDS projection to free up server memory.
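
As a sketch, assuming the projection is still named 'myGraph' as above, dropping it could look like this:

```cypher
// Remove the in-memory projection; the stored graph is not affected.
CALL gds.graph.drop('myGraph') YIELD graphName
RETURN graphName
```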