Filter a TTL file too large for rdflib to parse in-memory otherwise


I am working with a large TTL file (20 GB). When I try to read it using rdflib, I get the following error:

killed

To avoid this, I am trying to create a smaller file from it using the grep command.


The sample data is yagoTransitiveType.ttl; the beginning contains lines that look like the following:


@base <http://yago-knowledge.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<1908_St._Louis_Browns_season>  rdf:type        <wikicat_St._Louis_Browns_seasons> .
<1908_St._Louis_Browns_season>  rdf:type        <wordnet_abstraction_100002137> .
<A1086_road>    rdf:type        <wikicat_Roads_in_the_United_Kingdom> .
<A1086_road>    rdf:type        <wordnet_artifact_100021939> .

I want to keep only the lines in the header at the top, or the ones that contain wordnet_.


What I've tried so far is:

grep "wordnet_" yagoTransitiveType.ttl >wordnet_yagoTransitiveType.ttl

The problem is that the filtered file no longer contains the initial prefix declarations (yago: and others), so rdflib is not able to parse it.

import rdflib
g = rdflib.Graph()
g.parse('yagoTransitiveType.ttl', format='ttl')

How can I fix this, either by adding the first 10 lines back after running the grep command, or in some other way?

2

There are 2 answers

1
Charles Duffy On

As I understand it, you want to retain either of two types of data:

  • Data that doesn't contain rdf:type
  • Data that contains both rdf:type and wordnet_ (in that order)

Less efficiently (as it reads the input file twice), this could look like:

{
  grep -v rdf:type yagoTransitiveType.ttl
  grep -Ee 'rdf:type.*wordnet' yagoTransitiveType.ttl
} >filtered.ttl

...or, trading robustness for efficiency: this still opens the file twice, but the first pass reads only the first 10 lines. It depends on you to provide an accurate count of how many lines to keep:

keep_count=10
{
  head -n "$keep_count" yagoTransitiveType.ttl
  grep -Ee 'rdf:type.*wordnet' yagoTransitiveType.ttl
} >filtered.ttl

Note that in both of the above, we provide the input filename twice so the second tool starts over from the beginning of the file, rather than sharing an input file handle and counting on the first tool to leave the cursor at the right place within the file. We'll see alternatives to that later.


As a more efficient alternative (reading the input file only once and performing both operations with the same tool):

awk '
  ! /rdf:type/ { print; next }
  /rdf:type/ && /wordnet_/ { print }
' <yagoTransitiveType.ttl >filtered.ttl

awk is a full-featured programming language; it thus can perform arbitrary operations, whereas grep does nothing but regex matching against individual lines.
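Since the question is ultimately about feeding the result to Python, the same single-pass rule can also be sketched in plain Python without loading the file into memory. This is only an illustration of the awk logic above, not a replacement for it; the names keep_line, filter_ttl and filter_file are made up for this example, and it assumes (as the sample data suggests) that each statement sits on one line.

```python
from typing import Iterable, TextIO


def keep_line(line: str) -> bool:
    """Mirror the awk rules: keep every line without rdf:type,
    and keep rdf:type lines only when they also mention wordnet_."""
    if "rdf:type" not in line:
        return True
    return "wordnet_" in line


def filter_ttl(src: Iterable[str], dst: TextIO) -> int:
    """Stream lines from src to dst one at a time; return how many were kept."""
    kept = 0
    for line in src:
        if keep_line(line):
            dst.write(line)
            kept += 1
    return kept


def filter_file(src_path: str, dst_path: str) -> int:
    """Hypothetical convenience wrapper: filter one file into another."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        return filter_ttl(src, dst)
```

Calling filter_file('yagoTransitiveType.ttl', 'filtered.ttl') would then produce the same kind of output as the awk command. Iterating over the file object reads one line at a time, so memory use stays small regardless of the input size.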


If you really want to just copy the first 10 lines unfiltered, and then use grep to filter the rest, that could instead look like:

keep_count=10
{
  for ((i=0; i<keep_count; i++)); do
    IFS= read -r line && printf '%s\n' "$line"
  done
  grep -e wordnet_
} <yagoTransitiveType.ttl >filtered.ttl

All of these will work to generate a filtered.ttl.
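The last shell variant above (copy the first lines unfiltered, then filter the rest from the same file handle) can be sketched in Python as well. This is a minimal illustration under the same assumption as the shell version, namely that you know how many header lines to keep; the function name copy_header_then_filter and its parameters are invented for this example.

```python
from itertools import islice
from typing import TextIO


def copy_header_then_filter(src: TextIO, dst: TextIO,
                            keep_count: int = 10,
                            needle: str = "wordnet_") -> None:
    """Copy the first keep_count lines as-is, then keep only lines containing needle."""
    # islice consumes at most keep_count lines from the handle and no more,
    # so the read position is left exactly where the filtering should start.
    dst.writelines(islice(src, keep_count))
    # Continue on the same handle: only the remaining lines are filtered.
    dst.writelines(line for line in src if needle in line)
```

With src opened on yagoTransitiveType.ttl and dst opened for writing on filtered.ttl, this reads the input once, in the same spirit as sharing a single stdin between read and grep in the shell version.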

0
Ed Morton On

Is this what you're trying to do?

$ grep -E '^@|wordnet_' yagoTransitiveType.ttl
@base <http://yago-knowledge.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<1908_St._Louis_Browns_season>  rdf:type        <wordnet_abstraction_100002137> .
<A1086_road>    rdf:type        <wordnet_artifact_100021939> .