Java OutOfMemoryError in Apache Jena using TDB

Hi, I've been using Jena for a project, and now I am trying to query a graph and store the results in plain files for batch processing with Hadoop.

I open a TDB dataset and then query it in pages using LIMIT and OFFSET.

I output files with 100000 triples per file.

However, around the 10th file performance degrades; at the 15th file it drops by a factor of 3, and by the 22nd file it is down to 1%.

My query is:

SELECT DISTINCT ?S ?P ?O WHERE {?S ?P ?O .} LIMIT 100000 OFFSET X

The method that queries and writes to a file is shown in the next code block:

public boolean copyGraphPage(int size, int page, String tdbPath, String query, String outputDir, String fileName) throws IllegalArgumentException {
        boolean retVal = true;
        if (size == 0) {
            throw new IllegalArgumentException("The size of the page should be bigger than 0");
        }
        long offset = ((long) size) * page;
        Dataset ds = TDBFactory.createDataset(tdbPath);
        ds.begin(ReadWrite.READ);
        // Append the paging clause to the base query.
        String queryString = query + " LIMIT " + size + " OFFSET " + offset;
        QueryExecution qExec = QueryExecutionFactory.create(queryString, ds);
        ResultSet resultSet = qExec.execSelect();
        List<String> resultVars;
        if (resultSet.hasNext()) {
            resultVars = resultSet.getResultVars();
            String fullyQualifiedPath = joinPath(outputDir, fileName, "txt");
            try (BufferedWriter bwr = new BufferedWriter(new OutputStreamWriter(new BufferedOutputStream(
                    new FileOutputStream(fullyQualifiedPath)), "UTF-8"))) {
                while (resultSet.hasNext()) {
                    QuerySolution next = resultSet.next();
                    // One triple per line: subject, predicate and object separated by spaces.
                    StringBuilder sb = new StringBuilder();
                    sb.append(next.get(resultVars.get(0)).toString()).append(" ").
                            append(next.get(resultVars.get(1)).toString()).append(" ").
                            append(next.get(resultVars.get(2)).toString());
                    bwr.write(sb.toString());
                    bwr.newLine();
                }
                // The try-with-resources flushes and closes the writer on exit.
            } catch (IOException e) {
                e.printStackTrace();
            }
            resultVars = null;
        } else {
            retVal = false;
        }
        // Close the execution and end the read transaction on every path,
        // not only when the page was non-empty.
        qExec.close();
        ds.end();
        ds.close();
        qExec = null;
        resultSet = null;
        ds = null;
        return retVal;
    }

The assignments to null are there because I didn't know whether there was a leak somewhere.

However, after the 22nd file the program fails with the following message:

java.lang.OutOfMemoryError: GC overhead limit exceeded

    at org.apache.jena.ext.com.google.common.cache.LocalCache$EntryFactory$2.newEntry(LocalCache.java:455)
    at org.apache.jena.ext.com.google.common.cache.LocalCache$Segment.newEntry(LocalCache.java:2144)
    at org.apache.jena.ext.com.google.common.cache.LocalCache$Segment.put(LocalCache.java:3010)
    at org.apache.jena.ext.com.google.common.cache.LocalCache.put(LocalCache.java:4365)
    at org.apache.jena.ext.com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:5077)
    at org.apache.jena.atlas.lib.cache.CacheGuava.put(CacheGuava.java:76)
    at org.apache.jena.tdb.store.nodetable.NodeTableCache.cacheUpdate(NodeTableCache.java:205)
    at org.apache.jena.tdb.store.nodetable.NodeTableCache._retrieveNodeByNodeId(NodeTableCache.java:129)
    at org.apache.jena.tdb.store.nodetable.NodeTableCache.getNodeForNodeId(NodeTableCache.java:82)
    at org.apache.jena.tdb.store.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:50)
    at org.apache.jena.tdb.store.nodetable.NodeTableInline.getNodeForNodeId(NodeTableInline.java:67)
    at org.apache.jena.tdb.store.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:50)
    at org.apache.jena.tdb.solver.BindingTDB.get1(BindingTDB.java:122)
    at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
    at org.apache.jena.sparql.engine.binding.BindingProjectBase.get1(BindingProjectBase.java:52)
    at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
    at org.apache.jena.sparql.engine.binding.BindingProjectBase.get1(BindingProjectBase.java:52)
    at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
    at org.apache.jena.sparql.engine.binding.BindingBase.hashCode(BindingBase.java:201)
    at org.apache.jena.sparql.engine.binding.BindingBase.hashCode(BindingBase.java:183)
    at java.util.HashMap.hash(HashMap.java:338)
    at java.util.HashMap.containsKey(HashMap.java:595)
    at java.util.HashSet.contains(HashSet.java:203)
    at org.apache.jena.sparql.engine.iterator.QueryIterDistinct.getInputNextUnseen(QueryIterDistinct.java:106)
    at org.apache.jena.sparql.engine.iterator.QueryIterDistinct.hasNextBinding(QueryIterDistinct.java:70)
    at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
    at org.apache.jena.sparql.engine.iterator.QueryIterSlice.hasNextBinding(QueryIterSlice.java:76)
    at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
    at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
    at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
    at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
    at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)

Disconnected from the target VM, address: '127.0.0.1:57723', transport: 'socket'

Process finished with exit code 255

The memory profiler shows an increase in memory usage after each page is queried:

[two screenshots of a memory profiler]

It is clear that the Jena LocalCache is filling up. I have changed Xmx to 2048m and Xms to 512m, with the same result; nothing changed.

Do I need more memory?

Do I need to clear something?

Do I need to stop the program and do it in parts?

Is my query wrong?

Does the OFFSET have anything to do with it?

I read in some old mailing-list posts that you can turn the cache off, but I could not find any way to do it. Is there a way to turn the cache off?

I know it is a very difficult question, but I appreciate any help.

There are 2 answers

AndyS (BEST ANSWER)

It is clear that Jena LocalCache is filling up

This is the TDB node cache - it usually needs 1.5G (2G is better) per dataset itself. This cache persists for the lifetime of the JVM.

A 2G Java heap is small by today's standards. If you must use a small heap, you can try running in 32-bit mode (called "Direct mode" in TDB), but this is less performant (mainly because the node cache is smaller, and this dataset has enough nodes to cause cache churn with a small cache).

The node cache is the main cause of the heap exhaustion, but the query is also consuming memory elsewhere, per query, in DISTINCT.

DISTINCT is not necessarily cheap. It needs to remember everything it has seen to know whether a new row is the first occurrence or already seen.
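
In outline, a DISTINCT iterator has to do something like the following. This is a simplified illustration of the principle, not Jena's actual implementation (that is the QueryIterDistinct visible in the stack trace above):

import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

class DistinctCost {
    // Emits each row once, in order of first appearance. The 'seen' set only
    // ever grows: after N distinct rows have been emitted, all N are still
    // held on the heap until the iteration finishes.
    static void printDistinct(Iterator<String> rows) {
        Set<String> seen = new HashSet<>();
        while (rows.hasNext()) {
            String row = rows.next();
            if (seen.add(row)) {   // add() returns true only on first occurrence
                System.out.println(row);
            }
        }
    }
}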

Apache Jena does optimize some cases of this (as a TopN query), but the cutoff for the optimization is 1000 by default. See OpTopN in the code.

Otherwise it collects all the rows seen so far. The further through the dataset you go, the more is held in the node cache and the more is held in the DISTINCT filter.
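
If every triple in the store is unique anyway, so that DISTINCT is not really needed, one way around both costs is to stream the dataset in a single pass and split the output into files as you go, instead of re-running the query with a growing OFFSET. A minimal sketch of that idea (it still exercises the node cache as described above; the class name, file naming and query are illustrative, not taken from the question):

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.List;

import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.query.ResultSet;
import org.apache.jena.tdb.TDBFactory;

public class SinglePassExport {

    // Streams every triple once and writes 'pageSize' triples per output file,
    // so there is no OFFSET re-scan and no DISTINCT set to hold on the heap.
    public static void export(String tdbPath, String outputDir, int pageSize) throws IOException {
        Dataset ds = TDBFactory.createDataset(tdbPath);
        ds.begin(ReadWrite.READ);
        try (QueryExecution qExec = QueryExecutionFactory.create(
                "SELECT ?S ?P ?O WHERE { ?S ?P ?O }", ds)) {
            ResultSet rs = qExec.execSelect();
            List<String> vars = rs.getResultVars();
            BufferedWriter out = null;
            long row = 0;
            try {
                while (rs.hasNext()) {
                    if (row % pageSize == 0) {           // roll over to a new file
                        if (out != null) out.close();
                        String path = outputDir + "/page-" + (row / pageSize) + ".txt";
                        out = new BufferedWriter(new OutputStreamWriter(
                                new FileOutputStream(path), "UTF-8"));
                    }
                    QuerySolution qs = rs.next();
                    out.write(qs.get(vars.get(0)) + " " + qs.get(vars.get(1))
                            + " " + qs.get(vars.get(2)));
                    out.newLine();
                    row++;
                }
            } finally {
                if (out != null) out.close();
            }
        } finally {
            ds.end();
            ds.close();
        }
    }
}

Each file then costs one forward scan of its own rows, so writing the 22nd file is no more expensive than writing the first.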

Do I need more memory?

Yes, more heap. The sensible minimum is 2G per TDB dataset, plus whatever Java itself requires (say, 0.5G), plus space for your program and query workspace; for this setup that points at something like -Xmx4g rather than 2048m.

Irwan Hendra

You seem to have a memory leak somewhere. This is just a guess, but try this:

TDBFactory.release(ds);

REF: https://jena.apache.org/documentation/javadoc/tdb/org/apache/jena/tdb/TDBFactory.html#release-org.apache.jena.query.Dataset-
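
Applied to the method in the question, that would mean releasing the dataset once each page has been written, along these lines (a sketch only; release invalidates the Dataset, so a fresh one must be created for the next page):

Dataset ds = TDBFactory.createDataset(tdbPath);
try {
    ds.begin(ReadWrite.READ);
    // ... run the paged query and write the output file ...
    ds.end();
} finally {
    // Releases TDB's internal resources, including caches, for this location.
    TDBFactory.release(ds);
}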