I have a corpus of 44940 articles. Each article has an id, a title, and a list of references (the other articles it cites). The schema of the corpus looks something like this:
+---+-----+----------+
| id|title|       ref|
+---+-----+----------+
|id1|    a|[id4, id3]|
|id2|    b|    [id10]|
|id3|    c|[id1, id9]|
|id4|    d|     [id2]|
|id9|    e|     [id3]|
+---+-----+----------+
My goal is to build, for each article in the corpus, a network whose vertices represent articles and whose edges represent reference links. For example, the graph for article 'id1' would be:
id1 -- id4
id1 -- id3
id2 -- id10
id3 -- id1
id3 -- id9
id4 -- id2
id9 -- id3
As shown in the example above, article 'id1' references article 'id4' and so on.
I used PySpark to read the corpus and GraphFrames to construct the graph.
My code:
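from pyspark.sql import functions as F
from graphframes import GraphFrame

# vertices: one row per article; edges: one row per (article, reference) pair
sources = df.select('id', 'title')
edges = df.select('id', F.explode('ref').alias('dst')).withColumnRenamed('id', 'src')
g = GraphFrame(sources, edges)
# user query
query_id = 'id1'
# edges that start at the queried article
query_df = g.edges.filter("src == '%s'" % query_id).withColumnRenamed('dst', 'dst1').withColumnRenamed('src', 'src1')
# outer join of all edges against the articles referenced by the query
res = g.edges.join(query_df, query_df.dst1 == g.edges.src, "outer").select('src', 'dst')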
The result:
+---+----+
|src| dst|
+---+----+
|id1| id4|
|id1| id3|
|id2|id10|
|id3| id1|
|id3| id9|
|id4| id2|
|id9| id3|
+---+----+
It worked well for this example. However, when I query article 'id2', which references article 'id10', and 'id10' does not exist in the corpus, it returned:
+----+----+
| src| dst|
+----+----+
| id1| id4|
| id1| id3|
|null|null|
| id2|id10|
| id3| id1|
| id3| id9|
| id4| id2|
| id9| id3|
+----+----+
This is wrong. Since 'id10' does not appear in the src column, I want it to return something like this:
+----+----+
| src| dst|
+----+----+
| id2|id10|
+----+----+
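To make the expected behaviour precise, here is a minimal plain-Python sketch (no Spark) of what I am after. It assumes the result should contain every reference link reachable from the query article, which is my reading of the two expected outputs above. The `corpus` dict and the `reachable_edges` helper are hypothetical names used only for illustration; they are not part of my Spark code.

```python
from collections import deque

# Hypothetical in-memory copy of the sample corpus: id -> list of referenced ids.
corpus = {
    "id1": ["id4", "id3"],
    "id2": ["id10"],
    "id3": ["id1", "id9"],
    "id4": ["id2"],
    "id9": ["id3"],
}


def reachable_edges(corpus, query_id):
    """Return the (src, dst) pairs reachable from query_id by following references."""
    edges = []
    seen = {query_id}
    queue = deque([query_id])
    while queue:
        src = queue.popleft()
        # Articles missing from the corpus (e.g. id10) have no outgoing references.
        for dst in corpus.get(src, []):
            edges.append((src, dst))
            if dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return edges


print(reachable_edges(corpus, "id2"))
print(sorted(reachable_edges(corpus, "id1")))
```

For 'id2' this yields only the single edge id2 -> id10, and for 'id1' it yields the seven edges from the first result table, with no null row in either case.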
What should I do? Are there any other solutions to this problem?