How to create a citation network of articles using GraphFrames?


I have a corpus of 44940 articles. Each article has an id, a title, and a list of references (the other articles it cites). The corpus looks something like this:

+---+-----+----------+
| id|title|       ref|
+---+-----+----------+
|id1|    a|[id4, id3]|
|id2|    b|    [id10]|
|id3|    c|[id1, id9]|
|id4|    d|     [id2]|
|id9|    e|     [id3]|
+---+-----+----------+

My goal is to build, for each article in the corpus, a network whose vertices represent articles and whose edges represent reference links. For example, the graph for article 'id1' would be:

id1 -- id4 
id1 -- id3 
id2 -- id10 
id3 -- id9
id4 -- id2 
id9 -- id3

As shown in the example above, article 'id1' references article 'id4' and so on.
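The edge list above can be derived from the table by emitting one (source, destination) pair per entry in the ref column. The following is a minimal plain-Python sketch of that step, not Spark code; the `corpus` literal and the helper name `to_edges` are assumptions made for illustration.

```python
# Sample corpus copied from the table above.
corpus = [
    {"id": "id1", "title": "a", "ref": ["id4", "id3"]},
    {"id": "id2", "title": "b", "ref": ["id10"]},
    {"id": "id3", "title": "c", "ref": ["id1", "id9"]},
    {"id": "id4", "title": "d", "ref": ["id2"]},
    {"id": "id9", "title": "e", "ref": ["id3"]},
]


def to_edges(rows):
    """Return one (src, dst) pair for every reference of every article."""
    return [(row["id"], dst) for row in rows for dst in row["ref"]]


edges = to_edges(corpus)
for src, dst in edges:
    print(src, "--", dst)
```

Because 'id3' also cites 'id1', this sketch yields a directed pair ('id3', 'id1') in addition to the pairs listed above.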

I used PySpark to read the corpus and GraphFrames to construct the graph.

My code :

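from pyspark.sql import functions as F
from graphframes import GraphFrame

# vertices: one row per article
sources = df.select('id', 'title')
# edges: one (src, dst) row per reference
edges = df.select('id', F.explode('ref').alias('dst')).withColumnRenamed('id', 'src')
g = GraphFrame(sources, edges)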

# user query 
query_id = 'id1'
query_df = g.edges.filter("src == '%s'" % query_id).withColumnRenamed('dst', 'dst1').withColumnRenamed('src', 'src1')
res = g.edges.join(query_df, query_df.dst1 == g.edges.src, "outer").select('src', 'dst')

The result :

+---+----+
|src| dst|
+---+----+
|id1| id4|
|id1| id3|
|id2|id10|
|id3| id1|
|id3| id9|
|id4| id2|
|id9| id3|
+---+----+

This worked for the example above. However, when I query article 'id2', which references article 'id10', and 'id10' does not exist in the corpus, it returns:

+----+----+
| src| dst|
+----+----+
| id1| id4|
| id1| id3|
|null|null| 
| id2|id10|
| id3| id1|
| id3| id9|
| id4| id2|
| id9| id3|
+----+----+

This is wrong. Since 'id10' does not appear in the src column, it should return (or at least I want it to return) something like this:

+----+----+
| src| dst|
+----+----+
| id2|id10|
+----+----+
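The null row is consistent with how a full outer join behaves: rows on either side that find no partner are kept and padded with nulls. The query row (id2, id10) matches no src, so after selecting only src and dst it appears as (null, null). Every row of g.edges is kept as well, which would explain why the whole edge list comes back regardless of the query. The plain-Python sketch below (not Spark; `outer_join_src_dst` is a hypothetical helper that imitates the join and the select from the code above) reproduces both outputs under that reading.

```python
# Edge list of the sample corpus.
edges = [
    ("id1", "id4"), ("id1", "id3"), ("id2", "id10"), ("id3", "id1"),
    ("id3", "id9"), ("id4", "id2"), ("id9", "id3"),
]


def outer_join_src_dst(all_edges, query_id):
    """Imitate: edges FULL OUTER JOIN query ON dst1 == src, then SELECT src, dst."""
    query = [(s, d) for s, d in all_edges if s == query_id]  # (src1, dst1) rows
    rows = []
    matched = set()
    for src, dst in all_edges:
        hits = [i for i, (_, dst1) in enumerate(query) if dst1 == src]
        matched.update(hits)
        # A left row is emitted once per match, or once with nulls on the right.
        rows.extend([(src, dst)] * max(1, len(hits)))
    # Unmatched query rows have no left partner, so src and dst are null.
    rows.extend((None, None) for i in range(len(query)) if i not in matched)
    return rows


rows_id1 = outer_join_src_dst(edges, "id1")
rows_id2 = outer_join_src_dst(edges, "id2")
print(len(rows_id1), len(rows_id2))
```

For 'id1' both query rows find a partner, so no null row appears and the result looks correct by coincidence.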

What should I do? Are there any other solutions to this problem?
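One way to read the expected outputs is that the network for a query article consists of every citation edge reachable from it by following references. Under that assumption, a breadth-first traversal gives the desired result: the full edge list for 'id1' and only id2 -> id10 for 'id2'. The sketch below is plain Python rather than Spark, and `citation_subgraph` is a hypothetical helper name; on a large corpus the same frontier expansion would have to be expressed with repeated DataFrame joins or a graph library traversal.

```python
from collections import deque

# Edge list of the sample corpus.
edges = [
    ("id1", "id4"), ("id1", "id3"), ("id2", "id10"), ("id3", "id1"),
    ("id3", "id9"), ("id4", "id2"), ("id9", "id3"),
]


def citation_subgraph(all_edges, query_id):
    """Collect every edge reachable from query_id by following references."""
    outgoing = {}
    for src, dst in all_edges:
        outgoing.setdefault(src, []).append(dst)

    seen = {query_id}
    queue = deque([query_id])
    result = []
    while queue:
        node = queue.popleft()
        # Articles missing from the corpus (e.g. 'id10') have no outgoing edges.
        for dst in outgoing.get(node, []):
            result.append((node, dst))
            if dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return result


sub_id1 = citation_subgraph(edges, "id1")
sub_id2 = citation_subgraph(edges, "id2")
print(sub_id2)
```

Because each article is expanded only once, cycles such as id3 -> id9 -> id3 do not cause an infinite loop.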
