imported .owl files have #'s in prefixes vs original rdf4j triplestore


When I import the dump "PathwayCommons12.All.BIOPAX.owl.gz" (linked from this page) of this Virtuoso triplestore into a local store, a "#" ends up inserted after the prefix of various URIs.

In particular, the following query runs on the original endpoint:

# Query 1
PREFIX bp: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX pfx: <http://pathwaycommons.org/pc12/>

SELECT ?pw
WHERE {
  ?pw a bp:Pathway .
  VALUES ?pw { pfx:Pathway_c2fd3d95c8c65552a0514393ede60c37 }
}

But to get it to run on the local endpoint (the imported OWL dump), I have to append a "#" to the pfx: namespace, like this:

# Query 2
PREFIX bp: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX pfx: <http://pathwaycommons.org/pc12/#>

SELECT ?pw
WHERE {
  ?pw a bp:Pathway .
  VALUES ?pw { pfx:Pathway_c2fd3d95c8c65552a0514393ede60c37 }
}

Note that Query 1 works only on the original endpoint, while Query 2 works only on the local endpoint.

What is going on here?

Answer by Jeen Broekstra (best answer)

If we look at the first few lines of that massive RDF/XML file, we see:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:owl="http://www.w3.org/2002/07/owl#"
 xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
 xmlns:bp="http://www.biopax.org/release/biopax-level3.owl#"
 xml:base="http://pathwaycommons.org/pc12/">
<owl:Ontology rdf:about="">
 <owl:imports rdf:resource="http://www.biopax.org/release/biopax-level3.owl#" />
</owl:Ontology>

<bp:ExperimentalForm rdf:ID="ExperimentalForm_ee10aeab-1129-49ad-8217-4193f4fbf7e0">
 <bp:comment rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">[ExperimentalFormVocabulary_bait]</bp:comment>
 <bp:experimentalFormDescription rdf:resource="#ExperimentalFormVocabulary_701737e5cf53d06134cbd3ee59611827" />
</bp:ExperimentalForm>

Note the value of the rdf:ID attribute here: "ExperimentalForm_ee10aeab-1129-49ad-8217-4193f4fbf7e0". This is a relative URI, and it needs to be resolved against the base URI (which is declared in the document header: "http://pathwaycommons.org/pc12/"). How this resolution is supposed to happen is described in section 2.14 of the RDF/XML syntax specification:

The rdf:ID attribute on a node element (not property element, that has another meaning) can be used instead of rdf:about and gives a relative IRI equivalent to # concatenated with the rdf:ID attribute value. So for example if rdf:ID="name", that would be equivalent to rdf:about="#name".

(emphasis mine)

Example 16 in the specification illustrates this further.

What it comes down to is that in parsing this RDF/XML, the values supplied as rdf:ID attributes all resolve to http://pathwaycommons.org/pc12/#<ID>. So the result you're getting in GraphDB is correct for the given input. Why it is different in the Virtuoso endpoint I don't know: either they used a different input file, or they have a bug in their parser, or whatever tool was used to produce this dump file contains a bug.
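The resolution step can be reproduced outside any triplestore. The following is a minimal sketch in Python, assuming that the standard library's `urljoin` is an adequate stand-in for the reference resolution an RDF/XML parser performs; the helper names `resolve_rdf_id` and `resolve_rdf_about` are made up for illustration, and the pathway identifier is taken from the question.

```python
from urllib.parse import urljoin

# xml:base as declared in the header of the dump
BASE = "http://pathwaycommons.org/pc12/"


def resolve_rdf_id(base: str, rdf_id: str) -> str:
    """rdf:ID="x" is equivalent to rdf:about="#x", resolved against the base."""
    return urljoin(base, "#" + rdf_id)


def resolve_rdf_about(base: str, about: str) -> str:
    """rdf:about="x" is resolved against the base as-is, with no '#' added."""
    return urljoin(base, about)


pathway = "Pathway_c2fd3d95c8c65552a0514393ede60c37"

# What a conforming parser produces for rdf:ID (matches Query 2):
print(resolve_rdf_id(BASE, pathway))
# http://pathwaycommons.org/pc12/#Pathway_c2fd3d95c8c65552a0514393ede60c37

# What the dump's author presumably intended (matches Query 1):
print(resolve_rdf_about(BASE, pathway))
# http://pathwaycommons.org/pc12/Pathway_c2fd3d95c8c65552a0514393ede60c37
```

The only difference between the two results is the "#", which is exactly the difference between the two prefix declarations in the question.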

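It is probably safe to say that the intent of whoever created the dump file was that rdf:ID="ExperimentalForm_ee10aeab-1129-49ad-8217-4193f4fbf7e0" would resolve to the IRI http://pathwaycommons.org/pc12/ExperimentalForm_ee10aeab-1129-49ad-8217-4193f4fbf7e0 (that is, without the added # character). There are several ways to fix this in the file: either replace all occurrences of rdf:ID with rdf:about (and drop the leading # from the rdf:resource values that refer to those nodes, such as "#ExperimentalFormVocabulary_701737e5cf53d06134cbd3ee59611827" above), or else don't rely on relative URI resolution and write the full URI in an rdf:about attribute. Note that the full URI cannot go into rdf:ID itself, since an rdf:ID value must be a plain XML name.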
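For a file this size, the first fix could be applied as a streaming text rewrite before import. What follows is only a rough sketch, not a tested migration script. It assumes that every attribute sits on a single line, uses double quotes, and uses the `rdf` prefix, as in the excerpt above, and that rdf:ID is only used on node elements. The function names and the output file name are hypothetical.

```python
import gzip
import re

# rdf:ID="X"  ->  rdf:about="X"  (so X resolves to <base>X instead of <base>#X)
RDF_ID = re.compile(r'rdf:ID\s*=\s*"([^"]*)"')
# rdf:resource="#X"  ->  rdf:resource="X"  (keep references pointing at the same nodes)
FRAGMENT_REF = re.compile(r'rdf:resource\s*=\s*"#([^"]*)"')


def fix_line(line: str) -> str:
    """Rewrite one line of RDF/XML; absolute IRIs are left untouched."""
    line = RDF_ID.sub(r'rdf:about="\1"', line)
    return FRAGMENT_REF.sub(r'rdf:resource="\1"', line)


def fix_dump(src: str, dst: str) -> None:
    """Stream a gzipped dump line by line so it never has to fit in memory."""
    with gzip.open(src, "rt", encoding="utf-8") as fin, \
         gzip.open(dst, "wt", encoding="utf-8") as fout:
        for line in fin:
            fout.write(fix_line(line))


# Hypothetical usage (the output name is made up):
# fix_dump("PathwayCommons12.All.BIOPAX.owl.gz", "PathwayCommons12.fixed.owl.gz")
```

After importing the rewritten file, Query 1 should be the one that matches locally. It is worth spot-checking a few lines of the output first, since a regular expression is not a real XML parser.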