I'm generating Turtle triples; the full dataset is already about 2 GB. For most testing I work on a small sample of a few thousand triples, and periodically I attempt a test on the full dataset. It never loads all the way, but it tells me whether there are errors.
My quick test is to load the .ttl file into Protege (I'm using Protege 5.2, the Windows version). There are no errors in the small samples, but when I try larger samples, Protege reads in the .ttl file I generated and tells me there's an error:
```
Level: INFO  Time: 1504111914814 Message: ------------------------------- Loading Ontology -------------------------------
Level: INFO  Time: 1504111914815 Message: Loading ontology from file:/C:/Projects/gdelt/sample.ttl
Level: INFO  Time: 1504112075814 Message: Finished loading file:/C:/Projects/gdelt/sample.ttl
Level: ERROR Time: 1504112075818 Message: An error occurred whilst loading the ontology at GC overhead limit exceeded. Cause: {}
Level: INFO  Time: 1504112075819 Message: Loading for ontology and imports closure successfully completed in 160995 ms
```
It can take a very long time to load these sample files, and then Protege only tells me there was an error, with no indication of where the problem was. So my current method of debugging is binary search: generate a file half as large, see if there is an error, split the difference, check for an error again, and repeat until I've narrowed it down to a few lines where I can easily spot the problem. This is really tedious. Is there a way to get Protege to report the line where it puked?
If not, perhaps there is another tool I can use to check the syntax of the triples I generate?
The error in your log is not a syntax error: "GC overhead limit exceeded" is an out-of-memory condition. It is not raised in the parser, so there is no line number to report. How many lines can be loaded within your memory limit can only be guessed at by successive attempts.
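For genuine syntax errors, a streaming parser such as Apache Jena's RIOT can check the file without ever holding it all in memory, and it reports the line and column of the first bad token. Here is a minimal sketch, assuming Jena is on the classpath (the same check is available from the command line as `riot --validate sample.ttl`):

```java
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.RiotException;
import org.apache.jena.riot.system.StreamRDFLib;

public class CheckTurtle {
    public static void main(String[] args) {
        try {
            // Stream the triples to a null sink: the file is fully parsed
            // (and therefore syntax-checked) but never loaded into memory.
            RDFParser.source("C:/Projects/gdelt/sample.ttl")
                     .parse(StreamRDFLib.sinkNull());
            System.out.println("No syntax errors found.");
        } catch (RiotException e) {
            // Jena includes the line and column of the offending token.
            System.err.println(e.getMessage());
        }
    }
}
```

Because nothing is kept in memory, this works at any file size, so you can validate the full 2 GB dataset directly instead of binary-searching over samples.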
The best workaround is to increase the -Xmx parameter value.
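Assuming the standard Protege 5.x Windows distribution, the JVM options are read from `Protege.l4j.ini` next to `Protege.exe` (check your install for the exact location). Raising the maximum heap to, say, 4 GB looks something like:

```
-Xms500M
-Xmx4G
```

Even then, loading a 2 GB Turtle file will typically need a heap several times the file size, so for routine syntax checking a streaming validator like the one sketched above is the more practical option.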