How to return specific variable from SPARQL Federated Query (Service keyword)?


I'm using a federated query to retrieve some information from a remote server, but I don't want to retrieve every variable (SELECT *) that I work with inside the federated query; I want to return just the count variable. How can I do that?

Code:

SERVICE <https://sparql.uniprot.org/sparql/> {
    ?sub_bp (rdfs:subClassOf|owl:someValuesFrom)* ?bp_iri .
    ?protein up:classifiedWith ?sub_bp.
    ?protein up:organism <http://purl.uniprot.org/taxonomy/10090> .
}

If it were not a federated query, I would write it like this:

SELECT distinct (count(distinct ?protein) as ?count) WHERE {

  ?sub_bp (rdfs:subClassOf|owl:someValuesFrom)* ?bp_iri .
  ?protein up:classifiedWith ?sub_bp.
  ?protein up:organism <http://purl.uniprot.org/taxonomy/10090> .

}

But in the federated query I cannot select variables, so is there a way to do what I want?

** EDIT 1 **

After @TallTed's response, I noticed that I had skipped some details to keep the question simple, but they turn out to be important, so I will describe the whole situation.

I have a local data set containing triples about biological processes and genes. I have to count how many genes are related to each biological process and divide that number by the total number of proteins identified in Uniprot for the same biological process (and its "children").

To do this, I first query my local data set, counting the genes for each biological process, and then I run a federated query to count all the proteins identified in Uniprot for each biological process (and its "children").

The full SPARQL code:

PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX uniprot:    <http://purl.uniprot.org/core/>
PREFIX up:<http://purl.uniprot.org/core/>
PREFIX owl:<http://www.w3.org/2002/07/owl#> 

SELECT DISTINCT ?bp_iri ?bp_count (count(distinct ?protein) as ?bp_total) ((?bp_count / ?bp_total) as ?divided) WHERE {

    { 
        SELECT DISTINCT ?bp_iri (COUNT(?bp_iri) as ?bp_count) WHERE{
            ?genes_iri a uniprot:Gene .
            ?genes_iri obo:RO_0000056 ?bp_iri .
        }group by ?bp_iri order by DESC(?bp_count)

    }

    SERVICE silent <https://sparql.uniprot.org/sparql/> {
        ?sub_bp (rdfs:subClassOf|owl:someValuesFrom)* ?bp_iri .
        ?protein up:classifiedWith ?sub_bp.
        ?protein up:organism <http://purl.uniprot.org/taxonomy/10090> .
    }

}group by ?bp_iri ?bp_count ?bp_total order by DESC(?divided)

When I run this query using Jena ARQ (a query engine), the variable ?bp_iri is replaced at the moment of the HTTP request by a specific biological process IRI (one HTTP request for each biological process), as shown in the image below:

[Image: SPARQL explain of the federated query]

Note that in the explain image, the federated query is selecting everything (*). The problem is that I don't want to retrieve all the relations I'm dealing with in the federated query; I just want to retrieve the count, but count is an aggregate function that is only allowed in the SELECT clause. (I don't want to retrieve all the relations because this query returns A LOT of triples, on the order of tens of thousands and sometimes millions, and it's not necessary to have them on my computer just to count them.)

To solve this, I tried to create a subquery inside the federated query to select only the count (?bp_total) and not all the triples. Code used:

SERVICE silent <https://sparql.uniprot.org/sparql/> {
    {
        SELECT (count(distinct ?protein) as ?bp_total) WHERE {
            ?sub_bp (rdfs:subClassOf|owl:someValuesFrom)* ?bp_iri .
            ?protein up:classifiedWith ?sub_bp.
            ?protein up:organism <http://purl.uniprot.org/taxonomy/10090> .
        }
    }
}

Running the explain again, I noticed that when I put a subquery inside the federated query, the variable ?bp_iri is not replaced by the biological process IRI as shown in the image below:

[Image: Explain of the subquery inside the federated query]

Considering this, how can I retrieve only the count from a federated query?

Sorry about the long post.

Answer by TallTed

As in the question "Using Wikidata label service in federated queries", include some of the things that are nominally optional...

Note -- your remote query must actually execute on the remote endpoint, else you will get varying errors.

This is the query you're trying to run on the Uniprot endpoint --

PREFIX    up: <http://purl.uniprot.org/core/> 
PREFIX taxon: <http://purl.uniprot.org/taxonomy/> 
PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX   owl: <http://www.w3.org/2002/07/owl#> 

SELECT (COUNT(DISTINCT ?protein) AS ?count) 
WHERE
  {
    ?sub_bp  (rdfs:subClassOf|owl:someValuesFrom)*  ?bp_iri .
    ?protein  up:classifiedWith  ?sub_bp .
    ?protein  up:organism        taxon:10090 .
  }

That gets an error --

Query evaluation exception.

: SPARQL execute failed:[PREFIX up: PREFIX taxon: PREFIX rdfs: PREFIX owl: SELECT (COUNT(DISTINCT ?protein) AS ?count) WHERE { ?sub_bp (rdfs:subClassOf|owl:someValuesFrom)* ?bp_iri . ?protein up:classifiedWith ?sub_bp . ?protein up:organism taxon:10090 . }] Exception:virtuoso.jdbc4.VirtuosoException: TN...: Exceeded 1000000000 bytes in transitive temp memory. use t_distinct, t_max or more T_MAX_memory options to limit the search or increase the pool

-- but that's not due to a syntax error; it's due to the ZeroOrMorePath property path over the rdfs:subClassOf or owl:someValuesFrom properties ((rdfs:subClassOf|owl:someValuesFrom)*) that you're querying, which has to try MANY possibilities.

If you limit the depth of that path, the Uniprot endpoint can handle it, and you can run it through Federated SPARQL.

Here's a reduced depth query (which I arbitrarily tried with 3 "ZeroOrOnePath") --

PREFIX    up: <http://purl.uniprot.org/core/> 
PREFIX taxon: <http://purl.uniprot.org/taxonomy/> 
PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX   owl: <http://www.w3.org/2002/07/owl#> 

SELECT (COUNT(DISTINCT ?protein) AS ?count) 
WHERE
  {
    ?sub_bp  (rdfs:subClassOf|owl:someValuesFrom)? 
             / (rdfs:subClassOf|owl:someValuesFrom)? 
             / (rdfs:subClassOf|owl:someValuesFrom)?   ?bp_iri .
    ?protein  up:classifiedWith  ?sub_bp .
    ?protein  up:organism        <http://purl.uniprot.org/taxonomy/10090> .
  }

-- that got a result --

count
"77633"^^xsd:int

-- which I found was the same result I got with the path reduced to a single level --

PREFIX    up: <http://purl.uniprot.org/core/> 
PREFIX taxon: <http://purl.uniprot.org/taxonomy/> 
PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX   owl: <http://www.w3.org/2002/07/owl#> 

SELECT (COUNT(DISTINCT ?protein) AS ?count) 
WHERE
  {
    ?sub_bp  (rdfs:subClassOf|owl:someValuesFrom)?  ?bp_iri .
    ?protein  up:classifiedWith  ?sub_bp .
    ?protein  up:organism        <http://purl.uniprot.org/taxonomy/10090> .
  }

I just ran this query through URIBurner.com (which permits Federated SPARQL for authenticated users) --

PREFIX    up: <http://purl.uniprot.org/core/> 
PREFIX taxon: <http://purl.uniprot.org/taxonomy/> 
PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX   owl: <http://www.w3.org/2002/07/owl#> 

SELECT *
WHERE
  {
    SERVICE <https://sparql.uniprot.org/sparql>
      {
        SELECT (COUNT(DISTINCT ?protein) AS ?count) 
        WHERE
          {
            ?sub_bp  (rdfs:subClassOf|owl:someValuesFrom)?  ?bp_iri .
            ?protein  up:classifiedWith  ?sub_bp .
            ?protein  up:organism        <http://purl.uniprot.org/taxonomy/10090> .
          }
      }
  }

That still produces an error --

Virtuoso HTCLI Error HC001: Read Error in HTTP Client

-- which suggests that different settings are in play on the Uniprot server when you go through their web query form (which uses JDBC against their SPARQL server) than when you go straight through HTTP, as with Federated SPARQL.

I think the solution you need is a local Uniprot mirror, or a connection to the public Uniprot instance that has different permissions/settings than the primary public endpoint.
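
A note on the subquery attempt in the question: in SPARQL 1.1, only the variables a subquery projects are visible outside it. The subquery there projects only ?bp_total, so the ?bp_iri inside it is unrelated to the outer ?bp_iri, which would explain why the engine did not substitute it. The following is a minimal, untested sketch of the full query with ?bp_iri projected and grouped inside the SERVICE block, so that only one row per biological process comes back. It assumes the single-level path from above, it writes the local class as up:Gene (the same IRI as uniprot:Gene in the question), and it counts ?genes_iri rather than ?bp_iri in the local subquery, which gives the same number because both are bound in every row. Whether local ?bp_iri values are sent to the remote endpoint is up to the query engine; without them the endpoint has to count for every biological process, so this may still run into the limits shown above.

```sparql
PREFIX obo:  <http://purl.obolibrary.org/obo/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX up:   <http://purl.uniprot.org/core/>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

SELECT ?bp_iri ?bp_count ?bp_total ((?bp_count / ?bp_total) AS ?divided)
WHERE
  {
    # Local part: count the genes for each biological process.
    {
      SELECT ?bp_iri (COUNT(?genes_iri) AS ?bp_count)
      WHERE
        {
          ?genes_iri  a               up:Gene .
          ?genes_iri  obo:RO_0000056  ?bp_iri .
        }
      GROUP BY ?bp_iri
    }

    # Remote part: ?bp_iri is projected so it joins with the local rows,
    # and only the aggregated count leaves the endpoint.
    SERVICE <https://sparql.uniprot.org/sparql>
      {
        SELECT ?bp_iri (COUNT(DISTINCT ?protein) AS ?bp_total)
        WHERE
          {
            ?sub_bp   (rdfs:subClassOf|owl:someValuesFrom)?  ?bp_iri .
            ?protein  up:classifiedWith  ?sub_bp .
            ?protein  up:organism        <http://purl.uniprot.org/taxonomy/10090> .
          }
        GROUP BY ?bp_iri
      }
  }
ORDER BY DESC(?divided)
```

Because both counts are already aggregated in their own subqueries, the outer query needs no GROUP BY of its own; it only joins on ?bp_iri and divides.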