store temp variables in neo4j


I have some Cypher queries that I execute against my Neo4j database. The queries have this form:

MATCH p=(j:JOB)-[r:HAS|STARTS]->(s:URL)-[r1:VISITED]->(t:URL) 
WHERE j.job_id =5000 and r1.origin='iframe' and r1.job_id=5000 AND NOT (t.netloc =~ 'VERY_LONG_LIST')   
RETURN count(r1) AS number_iframes;

In case that is hard to follow, here is a much simpler query of the same shape:

MATCH (s:WORD)
WHERE NOT (s.text=~"badword1|badword2|badword3")
RETURN s

I am basically trying to match some words against a specific list.

The problem is that this list is very large. The query above is for job_id=5000 alone, and I have more than 20000 jobs, so if my whitelist is 1 MB long I will end up with very large queries. I tried 500 jobs and ended up with a 200 MB query file.

I was trying to execute these queries using transactions from py2neo, but this won't be feasible because the POST request will be very large and will time out. As a result, I thought of using

neo4j-shell -file <queries_file> 

However, as you can see, the file is very large because of the large whitelist. So my question is: is there any way to store this "whitelist" in a variable in Neo4j using Cypher? I wish there were something similar to this:

SAVE $whitelist="word1,word2,word3,word4,word5...."

MATCH p=(j:JOB)-[r:HAS|STARTS]->(s:URL)-[r1:VISITED]->(t:URL) 
    WHERE j.job_id =5000 and r1.origin='iframe' and r1.job_id=5000 AND NOT (t.netloc =~ $whitelist) 
    RETURN count(r1) AS number_iframes;

There is 1 answer

Answered by Michael Hunger

What datatype is your netloc?

If you have an index on netloc you can also use t.netloc IN {list} where {list} is a parameter provided from the outside.
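As a rough illustration of the parameter approach, the sketch below keeps the query text constant and sends the whitelist separately as a list. It assumes the whitelist entries are literal netloc values rather than regex patterns, so that `NOT ... IN` can stand in for the regex test. It writes the parameters as `$whitelist` and `$job_id`, the newer spelling of the `{list}` style used above. The helper name `build_params` is made up for this example.

```python
# Sketch only: the whitelist travels as a parameter, so the query text
# stays small and identical for every job.
QUERY = """
MATCH (j:JOB)-[:HAS|STARTS]->(:URL)-[r1:VISITED]->(t:URL)
WHERE j.job_id = $job_id
  AND r1.origin = 'iframe'
  AND r1.job_id = $job_id
  AND NOT t.netloc IN $whitelist
RETURN count(r1) AS number_iframes
"""


def build_params(job_id, whitelist_text):
    """Turn a comma-separated whitelist string into query parameters.

    Hypothetical helper: assumes entries are plain values separated by commas.
    """
    hosts = [h.strip() for h in whitelist_text.split(",") if h.strip()]
    return {"job_id": job_id, "whitelist": hosts}


params = build_params(5000, "word1, word2,word3")

# With a driver, QUERY and params would be sent together, for example
# session.run(QUERY, params) with the official Neo4j Python driver.
```

Only `params` changes between jobs, so the 20000 queries no longer each carry their own copy of the list inside the query string.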

Such large regular expressions will not be fast. What exactly do your regexp and netloc format look like? Perhaps you can change that into a split plus an indexed list lookup?

In general, you can also provide a regexp as an outside parameter.

You can also use "IN" + index for job_ids.

You can also run a separate job that tags the jobs within your whitelist with a label, and use that label for additional filtering, e.g. already in the MATCH.

Why do you have to check this twice? Isn't it enough that the job has id=5000?
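One possible reading of the tagging idea is sketched below. It rests on several assumptions: the label goes on the `URL` nodes whose `netloc` is whitelisted, a single whitelist is shared by all jobs, and the entries are literal values. The label name `WHITELISTED`, the batch size, and the helper names are invented for this example.

```python
# Sketch only: tag whitelisted URL nodes once, then filter on the label.
TAG_QUERY = """
MATCH (t:URL)
WHERE t.netloc IN $whitelist
SET t:WHITELISTED
"""

COUNT_QUERY = """
MATCH (j:JOB)-[:HAS|STARTS]->(:URL)-[r1:VISITED]->(t:URL)
WHERE j.job_id = $job_id
  AND r1.origin = 'iframe'
  AND NOT t:WHITELISTED
RETURN count(r1) AS number_iframes
"""


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def tagging_jobs(whitelist, batch_size=1000):
    """Pair TAG_QUERY with one parameter dict per batch of the whitelist."""
    return [(TAG_QUERY, {"whitelist": batch})
            for batch in chunks(whitelist, batch_size)]


jobs = tagging_jobs(["a.com", "b.com", "c.com"], batch_size=2)
```

Batching keeps each tagging request small. Once the label is in place, the per-job query only needs `job_id` as a parameter.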

j.job_id =5000 and r1.job_id=5000