I am working with a large Turtle (.ttl) file of 20 GB. When I try to read it using rdflib, I get the error:
killed
To avoid this, I am trying to create a smaller file from it using the grep command.
The sample data is yagoTransitiveType.ttl; the beginning contains lines that look like the following:
@base <http://yago-knowledge.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<1908_St._Louis_Browns_season> rdf:type <wikicat_St._Louis_Browns_seasons> .
<1908_St._Louis_Browns_season> rdf:type <wordnet_abstraction_100002137> .
<A1086_road> rdf:type <wikicat_Roads_in_the_United_Kingdom> .
<A1086_road> rdf:type <wordnet_artifact_100021939> .
I want to keep only the header lines at the top and the lines that contain wordnet_.
What I've tried so far is:
grep "wordnet_" yagoTransitiveType.ttl >wordnet_yagoTransitiveType.ttl
The problem is that the filtered file no longer contains the initial prefix declarations (like yago: and others), so rdflib is not able to parse it.
import rdflib
g = rdflib.Graph()
g.parse('yagoTransitiveType.ttl', format='ttl')
How can I fix this, either by adding the first 10 lines back after running the grep command, or in some other way?
As I understand it, you want to retain either of two types of data:
- Header lines, which do not contain rdf:type
- Lines that contain both rdf:type and wordnet_ (in that order)
Less efficiently (as it reads the input file twice), this could look like:
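A minimal sketch of the two-pass approach, assuming header lines can be recognized by the absence of rdf:type and that wordnet_ always follows rdf:type on a data line, as in the sample. The function name filter_two_pass is hypothetical, chosen only for this illustration.

```shell
# Hypothetical helper (the name is an assumption made for this sketch).
# Pass 1 prints every line that does NOT contain rdf:type (the header).
# Pass 2 prints the rdf:type lines whose object mentions wordnet_.
filter_two_pass() {
    grep -v 'rdf:type' "$1"
    grep 'rdf:type.*wordnet_' "$1"
}

# Usage:
#   filter_two_pass yagoTransitiveType.ttl > filtered.ttl
```

Because the first grep selects by what a line lacks, no header line count has to be known in advance.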
...or, trading robustness for efficiency (reading the file twice, but on the first pass reading only the first 10 lines -- but depending on you to provide an accurate count of how many lines to keep):
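A sketch of this variant, using head for the header and grep for everything else. The function name filter_head_then_grep and its second argument (the header line count) are assumptions made for illustration.

```shell
# Hypothetical helper (the name and argument order are assumptions).
# $1 = input file, $2 = number of header lines to copy verbatim.
filter_head_then_grep() {
    head -n "$2" "$1"      # first pass: reads only the first $2 lines
    grep 'wordnet_' "$1"   # second pass: scans the whole file
}

# Usage:
#   filter_head_then_grep yagoTransitiveType.ttl 10 > filtered.ttl
```

If the count is too small, header lines are lost. If one of the first lines also contains wordnet_, it is emitted twice.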
Note that in both of the above, we provide the input filename twice so the second tool starts over from the beginning of the file, rather than sharing an input file handle and counting on the first tool to leave the cursor at the right place within the file. We'll see alternatives to that later.
As a more efficient alternative (reading the input file only once and performing both operations with the same tool):
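A sketch of the single-pass version, applying the same two conditions as above in one awk program. The function name filter_awk is hypothetical.

```shell
# Hypothetical helper (the name is an assumption made for this sketch).
# A line is printed if it has no rdf:type (header) or if it has rdf:type
# followed later on the same line by wordnet_.
filter_awk() {
    awk '!/rdf:type/ || /rdf:type.*wordnet_/' "$1"
}

# Usage:
#   filter_awk yagoTransitiveType.ttl > filtered.ttl
```

An awk pattern with no action block prints the matching line, so no explicit print is needed.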
awk is a full-featured programming language; it thus can perform arbitrary operations, whereas grep does nothing but regex matching against individual lines.
If you really want to just copy the first 10 lines unfiltered, and then use grep to filter the rest, that could instead look like:
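One way to sketch this with a single shared file handle is to consume the header with the shell's read builtin, which stops exactly at the end of each line, and then let grep continue from that position. The function name filter_single_handle and its arguments are assumptions made for illustration. The read loop is used here instead of head because head is not guaranteed on every platform to leave the file offset immediately after the last line it printed.

```shell
# Hypothetical helper (the name and argument order are assumptions).
# $1 = input file, $2 = number of header lines to copy verbatim.
filter_single_handle() {
    n=$2
    {
        i=0
        # Copy the first $n lines unchanged, one read call per line.
        while [ "$i" -lt "$n" ] && IFS= read -r line; do
            printf '%s\n' "$line"
            i=$((i + 1))
        done
        # grep inherits the same stdin and sees only the remaining lines.
        grep 'wordnet_'
    } < "$1"
}

# Usage:
#   filter_single_handle yagoTransitiveType.ttl 10 > filtered.ttl
```

The file is opened once, so the header lines are never rescanned by grep.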
All of these will work to generate a filtered.ttl.