I have a list of person names like, for example, this excerpt (Person is the column name):
Person
"Wilson, Charles; Harris Arthur"
"White, D.
Arthur Harris"
Note that the persons are written in different ways (with and without a comma, and in different name orders) and are separated differently (by a semicolon or by a line break).
I would like to use the RDF Mapping Language (RML, https://rml.io/) to create the following RDF without cleaning (or changing) the input data:
:Wilson a foaf:Person;
foaf:firstName "Charles";
foaf:lastName "Wilson" .
:Harris a foaf:Person;
foaf:firstName "Arthur";
foaf:lastName "Harris" .
:White a foaf:Person;
foaf:firstName "D.";
foaf:lastName "White" .
Note that Arthur Harris is mentioned twice in the input data, but only a single RDF resource is created.
I use the Function Ontology (FnO, https://fno.io/) and created a custom Java method. Depending on the mode argument, it returns a list with one property per person (e.g. only the URIs or only the first names).
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public static List<String> getPersons(String value, String mode) {
    // Return an empty list for missing input; the null check on value
    // avoids a NullPointerException in value.trim().
    if (mode == null || value == null || value.trim().isEmpty())
        return Arrays.asList();
    List<String> results = new ArrayList<>();
    for (Person p : getAllPersons(value)) {
        // An empty mode is treated like "URI".
        if (mode.trim().isEmpty() || mode.equals("URI")) {
            results.add("http://example.org/person/" + p.getLastName());
        } else if (mode.equals("firstName")) {
            results.add(p.getFirstName());
        } else if (mode.equals("lastName")) {
            results.add(p.getLastName());
        } else if (mode.equals("fullName")) {
            results.add(p.getFullName());
        }
    }
    return results;
}
Assume that the getAllPersons method correctly extracts the persons from a given string, like the ones above.
In order to extract multiple persons from one cell, I call the getPersons function in a subjectMap like this:
:tripleMap a rr:TriplesMap .
:tripleMap rml:logicalSource :ExampleSource .
:tripleMap rr:subjectMap [
    fnml:functionValue [
        rr:predicateObjectMap [
            rr:predicate fno:executes ;
            rr:objectMap [ rr:constant cf:getPersons ]
        ] ;
        rr:predicateObjectMap [
            rr:predicate grel:valueParameter ;
            rr:objectMap [ rml:reference "Person" ] # the column name
        ] ;
        rr:predicateObjectMap [
            rr:predicate grel:valueParameter2 ;
            rr:objectMap [ rr:constant "URI" ] # the mode
        ]
    ] ;
    rr:termType rr:IRI ;
    rr:class foaf:Person
] .
I use RMLMapper (https://github.com/RMLio/rmlmapper-java); however, it only allows one subject to be returned per record, see https://github.com/RMLio/rmlmapper-java/blob/master/src/main/java/be/ugent/rml/Executor.java#L292 .
That is why I wrote a List<ProvenancedTerm> getSubjects(Term triplesMap, Mapping mapping, Record record, int i) method and replaced the original method accordingly.
This leads to the following result:
:Wilson a foaf:Person .
:Harris a foaf:Person .
:White a foaf:Person .
I know that this extension is incompatible with the RML specification https://rml.io/specs/rml/ where the following is stated:
It [a triples map] must have exactly one subject map that specifies how to generate a subject for each row/record/element/object of the logical source (database/CSV/XML/JSON data source accordingly).
If I then also want to add the first name or the last name, the following predicateObjectMap could be added:
:tripleMap rr:predicateObjectMap [
    rr:predicate foaf:firstName ;
    rr:objectMap [
        fnml:functionValue [
            rr:predicateObjectMap [
                rr:predicate fno:executes ;
                rr:objectMap [ rr:constant cf:getPersons ]
            ] ;
            rr:predicateObjectMap [
                rr:predicate grel:valueParameter ;
                rr:objectMap [ rml:reference "Person" ] # the column name
            ] ;
            rr:predicateObjectMap [
                rr:predicate grel:valueParameter2 ;
                rr:objectMap [ rr:constant "firstName" ] # the mode
            ]
        ]
    ]
] .
Because a predicateObjectMap is evaluated for each subject and multiple subjects are now returned, every person resource gets the first name of every person. To make this clearer, the result looks like this:
:Wilson a foaf:Person;
foaf:firstName "Charles" ;
foaf:firstName "Arthur" ;
foaf:firstName "D." .
:Harris a foaf:Person;
foaf:firstName "Charles" ;
foaf:firstName "Arthur" ;
foaf:firstName "D." .
:White a foaf:Person;
foaf:firstName "Charles" ;
foaf:firstName "Arthur" ;
foaf:firstName "D." .
My question is: Is there a solution or work-around in RML for multiple complex entities (e.g. persons having first and last names) in one data element (cell) of the input without cleaning (or changing) the input data?
Maybe this issue is related to my question: https://www.w3.org/community/kg-construct/track/issues/3
It would also be fine if such a use case is not meant to be solved by a mapping framework like RML. If this is the case, what could be alternatives? For example, a handcrafted extraction pipeline that generates RDF?
As far as I am aware, what you are trying to do is not possible using FnO functions and join conditions.
However, you could try specifying a clever rml:query or rml:iterator that splits the complex values before they reach the RMLMapper. Whether this is possible depends on the specific source database, though.
For instance, if the source is a SQL Server database, you could use the function STRING_SPLIT. Or, if it is a PostgreSQL database, you could use STRING_TO_ARRAY together with unnest. (Since different separators are used in the data, you may have to call STRING_SPLIT or STRING_TO_ARRAY once for each separator.)
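To make the intended effect of such a split concrete, here is a minimal Java sketch of the transformation only; it is not an RML or SQL solution. It assumes the separators seen in the sample data (a semicolon or a line break), and the class PersonCellSplitter and its split method are hypothetical names, not part of RML or the RMLMapper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of RML or the RMLMapper.
class PersonCellSplitter {

    // Assumed separators, taken from the sample data: ';' and line breaks.
    private static final String SEPARATORS = "[;\\r\\n]+";

    // Turns one cell value into one string per person.
    static List<String> split(String cell) {
        List<String> persons = new ArrayList<>();
        if (cell == null) {
            return persons;
        }
        for (String part : cell.split(SEPARATORS)) {
            String trimmed = part.trim();
            // Skip empty fragments, e.g. from a leading separator.
            if (!trimmed.isEmpty()) {
                persons.add(trimmed);
            }
        }
        return persons;
    }

    public static void main(String[] args) {
        System.out.println(split("Wilson, Charles; Harris Arthur"));
        System.out.println(split("White, D.\nArthur Harris"));
    }
}
```

If the query or iterator delivered each of these strings as a record of its own, a triples map would only have to generate one subject per record, so the first and last name would stay attached to the right person.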
If you provide more information about the underlying database, I can update this answer with an example.
(Note: I contribute to RML and its technologies.)