Reading RDF does not work


I am trying to read a FOAF file:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class Testbed {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();

        try {
            model.read("http://www.csail.mit.edu/~lkagal/foaf", "RDF/XML");
        }
        catch(Exception ex) {
            System.out.println(ex.toString());
        }
    }
}

I am getting the following exception:

org.apache.jena.riot.RiotException: [line: 1, col: 50] White spaces are required between publicId and systemId.

I do not understand what this exception means. How can I fix it? Am I using the wrong format (the file does not look like "TURTLE" or any other format)?

My environment (Windows 10 x64, apache-jena-3.1.1):

java version "1.8.0_112"
Java(TM) SE Runtime Environment (build 1.8.0_112-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode)

Answer by Loris Securo (best answer)

The URL http://www.csail.mit.edu/~lkagal/foaf redirects to http://people.csail.mit.edu/lkagal/foaf. That redirect is the cause of the error.

The problem was already reported and fixed in the development branch of Jena (bug [JENA-1263]).

Analysis

Apache Jena uses Apache HttpClient for connection handling. Jena 3.1.0 uses HttpClient 4.2.6, which was updated to HttpClient 4.5.2 in Jena 3.1.1.

As @potame pointed out, the issue is not present in Jena 3.1.0. The reason is that 3.1.0 creates a connection which by default supports various features, including automatically following redirects (it uses new SystemDefaultHttpClient()).

By contrast, when HttpClient was updated for Jena 3.1.1, the code was changed to create a more minimal type of connection that does not follow redirects (it uses HttpClients.createMinimal()).
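To confirm that a URL answers with a redirect rather than with the RDF document, it can be requested once with redirect following switched off. The following is a minimal sketch using only the JDK; the class name RedirectProbe and the method probe are hypothetical names chosen for illustration and are not part of Jena.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class RedirectProbe {

    /**
     * Sends one GET request WITHOUT following redirects and summarises
     * the status code, the Content-Type and the Location header.
     */
    public static String probe(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // behave like a client that cannot redirect
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();
            String contentType = conn.getContentType();
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            return "status=" + status
                    + ", contentType=" + contentType
                    + ", location=" + location;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "http://www.csail.mit.edu/~lkagal/foaf";
        try {
            System.out.println(probe(url));
        } catch (UncheckedIOException e) {
            System.out.println("request failed: " + e.getCause());
        }
    }
}
```

A 3xx status together with a Location header means that the body is a redirect notice, not the FOAF data, so a client that does not follow redirects will hand that notice to the parser.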

What happens is that, instead of reaching your foaf file, it just retrieves the redirect message which is:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://people.csail.mit.edu/lkagal/foaf">here</a>.</p>
<hr>
<address>Apache/2.2.16 (Debian) Server at www.csail.mit.edu Port 80</address>
</body></html>

and then tries to parse it with Apache Xerces which is actually the one that throws the exception (you can see that by using ex.printStackTrace() instead of System.out.println(ex.toString())):

...
at org.apache.xerces.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:282)
at org.apache.xerces.impl.XMLScanner.reportFatalError(XMLScanner.java:1467)
at org.apache.xerces.impl.XMLScanner.scanExternalID(XMLScanner.java:1001)
...
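The parse failure can be reproduced without Jena or the network by feeding the first lines of such a redirect page to an XML parser. The sketch below uses the JDK's built-in parser, which is derived from Xerces but is not the same library that Jena bundles, so the exact message text may differ; the class name DoctypeDemo and its members are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class DoctypeDemo {

    // An HTML 2.0 DOCTYPE has a public identifier but no system identifier.
    // HTML allows that; XML requires whitespace and a system identifier next.
    public static final String REDIRECT_PAGE =
            "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n"
            + "<html><head>\n"
            + "<title>301 Moved Permanently</title>\n"
            + "</head><body></body></html>\n";

    /** Parses the text as XML; returns null on success or the fatal parse error. */
    public static SAXParseException parseError(String text) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(text)));
            return null;
        } catch (SAXParseException e) {
            return e;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SAXParseException e = parseError(REDIRECT_PAGE);
        System.out.println(e == null
                ? "parsed without error"
                : "[line: " + e.getLineNumber() + ", col: " + e.getColumnNumber() + "] "
                        + e.getMessage());
    }
}
```

The parser stops inside the DOCTYPE declaration on line 1, at the point where the HTML 2.0 declaration ends without a system identifier, which is consistent with the scanExternalID frame in the stack trace above.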

Solutions

  • use the direct URL, http://people.csail.mit.edu/lkagal/foaf
  • use a previous version of Jena
  • use the development branch of Jena
  • provide Jena with your own redirect-capable HTTP client, to be used instead of the default one; you can do so by calling HttpOp.setDefaultHttpClient (HttpOp is in org.apache.jena.riot.web, HttpClientBuilder in org.apache.http.impl.client) before calling model.read, for example:

    HttpOp.setDefaultHttpClient(HttpClientBuilder.create().build());
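
For the first option, the direct URL can also be found programmatically by following the Location headers by hand and passing the result to model.read. This is only a sketch using the JDK; the class name FinalUrl, the method resolve and the hop limit are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class FinalUrl {

    /** Follows at most maxHops redirects manually and returns the last URL reached. */
    public static String resolve(String url, int maxHops) {
        try {
            URI current = URI.create(url);
            for (int hop = 0; hop <= maxHops; hop++) {
                HttpURLConnection conn =
                        (HttpURLConnection) current.toURL().openConnection();
                conn.setInstanceFollowRedirects(false); // inspect each hop ourselves
                conn.setRequestMethod("GET");
                int status = conn.getResponseCode();
                String location = conn.getHeaderField("Location");
                conn.disconnect();

                boolean redirect = location != null
                        && (status == 301 || status == 302 || status == 303
                            || status == 307 || status == 308);
                if (!redirect) {
                    return current.toString();
                }
                // Location may be relative, so resolve it against the current URL.
                current = current.resolve(location);
            }
            throw new IllegalStateException("more than " + maxHops + " redirects");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "http://www.csail.mit.edu/~lkagal/foaf";
        try {
            System.out.println(resolve(url, 5));
        } catch (UncheckedIOException e) {
            System.out.println("request failed: " + e.getCause());
        }
    }
}
```

The returned string can then be used in place of the original URL in the call to model.read.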