To give some context, I'm currently working on a dialog ontology, and I have a Python script that generates a dialog representation based on the classes and properties in the ontology, meaning that my script outputs one .rdf file per dialog. For each dialog, I have to perform inference to determine the class of each utterance in the dialog. In my case, it is necessary to process each dialog individually.
My question is: I'm using the HermiT and Pellet reasoners, which I call through the owlready2 functions `sync_reasoner()` and `sync_reasoner_pellet()` respectively, and it takes approximately 0.7 seconds to process each dialog. In a scenario where I have plenty of dialogues, is there a way to optimize this, i.e. reduce the computation time? Note that I can't run my code on a CUDA GPU, because otherwise I would lose the full texts, which I don't want. Any suggestions to optimize the process?
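For reference, here is roughly what my current loop looks like (a simplified sketch; the `dialogs/` folder is a placeholder for wherever my .rdf files live):

```python
import glob
import os
from owlready2 import World, sync_reasoner

# Reason over each dialog file one at a time (~0.7 s per dialog).
for path in glob.glob("dialogs/*.rdf"):
    world = World()  # a fresh world per dialog, so inferences don't mix
    onto = world.get_ontology("file://" + os.path.abspath(path)).load()
    with onto:
        sync_reasoner(world)  # HermiT; sync_reasoner_pellet(world) for Pellet
    # ...extract the inferred class of each utterance from onto here...
```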
Reasoning is notorious for taking a long time to complete, and our hands are generally tied when it comes to the time complexity of graph algorithms. Your best bet here is probably to parallelize the operation.
As you have it now, HermiT and Pellet are running on a single thread, processing one statement at a time. If you have 10 cores, for example, you're currently using only one of them to process your data.
Fortunately, each dialog file doesn't depend on the next. This allows you to spawn several processes at once (using Python's `multiprocessing` module), each reasoning over a particular RDF file.

Summary
- To reduce the time it takes to reason over a single file: scale vertically (get a more powerful CPU).
- To reduce the time it takes to reason across your whole dataset: scale horizontally.
Come up with a way to use multiprocessing to reason over your files in parallel. Some people batch their files into folders and give each process the responsibility of handling the files in one folder (easiest). Some people use a job queue system (think Celery + Redis) where each process asks the queue which file needs to be worked on next. Some use event-driven techniques, etc.
If it's a small project, I would make N folders, where N is the number of cores you have minus one. Then I'd divide the total file count by N and put that many files in each folder. Then I'd spawn N processes, each handling a different folder. Each process loads the ontology and reasons over the files in its respective folder. A sketch of this pattern is shown below.
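Here is a minimal sketch of that idea using a `multiprocessing.Pool` instead of physical folders; the pool hands out file paths to workers, which has the same effect without moving files around. The `dialogs/` directory is a placeholder, and each worker loads its file into its own owlready2 `World` so the processes share no state:

```python
import glob
import multiprocessing
import os
from owlready2 import World, sync_reasoner

def reason_over(path):
    """Load one dialog file into its own world and run the reasoner on it."""
    world = World()  # isolated triple store, nothing shared between workers
    onto = world.get_ontology("file://" + os.path.abspath(path)).load()
    with onto:
        sync_reasoner(world)  # or sync_reasoner_pellet(world) for Pellet
    # Return plain-Python results only; owlready2 objects don't pickle well.
    return path

if __name__ == "__main__":
    files = glob.glob("dialogs/*.rdf")  # placeholder location
    n_procs = max(1, multiprocessing.cpu_count() - 1)  # leave one core free
    with multiprocessing.Pool(n_procs) as pool:
        for done in pool.imap_unordered(reason_over, files):
            print("finished", done)
```

The only things crossing process boundaries are file paths going in and whatever plain results you return, so there's no locking or shared ontology to worry about. If you prefer the folder layout described above, the same `reason_over` function works with each process iterating over its own folder instead.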