I have a list of person names, for example this excerpt (Person is the column name):

Person
"Wilson, Charles; Harris Arthur"
"White, D.
Arthur Harris"

Note that the persons within a cell are written in different ways (with or without a comma, in different name orders) and are separated differently (by a semicolon or by a line break).

I would like to use the RDF Mapping Language https://rml.io/ to create the following RDF without cleaning (or changing) the input data:

:Wilson a foaf:Person;
    foaf:firstName "Charles";
    foaf:lastName "Wilson" .

:Harris a foaf:Person;
    foaf:firstName "Arthur";
    foaf:lastName "Harris" .

:White a foaf:Person;
    foaf:firstName "D.";
    foaf:lastName "White" .

Note that Arthur Harris is mentioned twice in the input data, but only a single RDF resource is created.

I use the Function Ontology (FnO, https://fno.io/) and created a custom Java method. Depending on the mode argument, it returns a list containing one property of each person (e.g. only the URIs or only the first names).

public static List<String> getPersons(String value, String mode) {
    // Guard against a null cell value as well as a null mode or empty cell
    if(value == null || mode == null || value.trim().isEmpty())
        return Arrays.asList();

    List<String> results = new ArrayList<>();
    for(Person p : getAllPersons(value)) {
        if(mode.trim().isEmpty() || mode.equals("URI")) {
            results.add("http://example.org/person/" + p.getLastName());
        } else if(mode.equals("firstName")) {
            results.add(p.getFirstName());
        } else if(mode.equals("lastName")) {
            results.add(p.getLastName());
        } else if(mode.equals("fullName")) {
            results.add(p.getFullName());
        }
    }

    return results;
}

Assume that the getAllPersons method correctly extracts the persons from a given string, like the ones above. In order to extract multiple persons from one cell I call the getPersons function in a subjectMap like this:

:tripleMap a rr:TriplesMap .
:tripleMap rml:logicalSource :ExampleSource .
:tripleMap rr:subjectMap [
    fnml:functionValue [

        rr:predicateObjectMap [
            rr:predicate fno:executes ;
            rr:objectMap [ rr:constant cf:getPersons ]
        ] ;
        rr:predicateObjectMap [
            rr:predicate grel:valueParameter ;
            rr:objectMap [ rml:reference "Person" ] # the column name
        ] ;
        rr:predicateObjectMap [
            rr:predicate grel:valueParameter2 ;
            rr:objectMap [ rr:constant "URI" ] # the mode
        ]
    ];
    rr:termType rr:IRI ;
    rr:class foaf:Person
] .

I use RMLMapper (https://github.com/RMLio/rmlmapper-java); however, it only allows one subject to be returned for each record, see https://github.com/RMLio/rmlmapper-java/blob/master/src/main/java/be/ugent/rml/Executor.java#L292 . That is why I wrote a List<ProvenancedTerm> getSubjects(Term triplesMap, Mapping mapping, Record record, int i) method and used it in place of the original single-subject lookup. This leads to the following result:

:Wilson a foaf:Person .

:Harris a foaf:Person .

:White a foaf:Person .

I know that this extension is incompatible with the RML specification https://rml.io/specs/rml/ where the following is stated:

It [a triples map] must have exactly one subject map that specifies how to generate a subject for each row/record/element/object of the logical source (database/CSV/XML/JSON data source accordingly).

To also add the first name (and, analogously, the last name), a predicateObjectMap like the following could be added:

:tripleMap rr:predicateObjectMap [
    rr:predicate foaf:firstName;
    rr:objectMap [
        fnml:functionValue [

            rr:predicateObjectMap [
                rr:predicate fno:executes ;
                rr:objectMap [ rr:constant cf:getPersons ]
            ] ;
            rr:predicateObjectMap [
                rr:predicate grel:valueParameter ;
                rr:objectMap [ rml:reference "Person" ] # the column name
            ] ;
            rr:predicateObjectMap [
                rr:predicate grel:valueParameter2 ;
                rr:objectMap [ rr:constant "firstName" ] # the mode
            ]
        ]
    ]
] .

Because a predicateObjectMap is evaluated for each subject and multiple subjects are now returned, every person resource gets the first name of every person. To make this clearer, the output looks like this:

:Wilson a foaf:Person;
    foaf:firstName "Charles" ;
    foaf:firstName "Arthur" ;
    foaf:firstName "D." .

:Harris a foaf:Person;
    foaf:firstName "Charles" ;
    foaf:firstName "Arthur" ;
    foaf:firstName "D." .

:White a foaf:Person;
    foaf:firstName "Charles" ;
    foaf:firstName "Arthur" ;
    foaf:firstName "D." .
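The difference between what happens and what is wanted can be shown outside of RML. The following is only a minimal Java sketch of the two pairing strategies for a single record; the class and method names (SubjectObjectPairing, crossProduct, pairedByIndex) are hypothetical and are not part of RMLMapper, and URIs are abbreviated to prefixed names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SubjectObjectPairing {

    // Mimics the patched mapper: each object value generated for a record
    // is attached to every subject generated for that record.
    static List<String> crossProduct(List<String> subjects, List<String> firstNames) {
        List<String> triples = new ArrayList<>();
        for (String subject : subjects) {
            for (String name : firstNames) {
                triples.add(subject + " foaf:firstName \"" + name + "\" .");
            }
        }
        return triples;
    }

    // Desired behaviour: the i-th object value belongs to the i-th subject.
    static List<String> pairedByIndex(List<String> subjects, List<String> firstNames) {
        if (subjects.size() != firstNames.size()) {
            throw new IllegalArgumentException("subjects and values must have the same length");
        }
        List<String> triples = new ArrayList<>();
        for (int i = 0; i < subjects.size(); i++) {
            triples.add(subjects.get(i) + " foaf:firstName \"" + firstNames.get(i) + "\" .");
        }
        return triples;
    }

    public static void main(String[] args) {
        // Values getPersons would return for the first cell in "URI" and
        // "firstName" mode (assumed here for illustration).
        List<String> subjects = Arrays.asList(":Wilson", ":Harris");
        List<String> firstNames = Arrays.asList("Charles", "Arthur");

        crossProduct(subjects, firstNames).forEach(System.out::println);   // 4 triples
        pairedByIndex(subjects, firstNames).forEach(System.out::println);  // 2 triples
    }
}
```

The index-based pairing is what the desired output needs, but a subject map and an object map are evaluated independently, so the mapping has no way to express "the i-th object for the i-th subject".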

My question is: Is there a solution or work-around in RML for multiple complex entities (e.g. persons having first and last names) in one data element (cell) of the input without cleaning (or changing) the input data?

Maybe this issue is related to my question: https://www.w3.org/community/kg-construct/track/issues/3

It would also be fine if such a use case is not meant to be solved by a mapping framework like RML. If this is the case, what could be alternatives? For example, a handcrafted extraction pipeline that generates RDF?


There are 2 answers

Thomas Delva (best answer)

As far as I am aware, what you are trying to do is not possible using FnO functions and join conditions.

However, what you could try is specifying a clever rml:query or rml:iterator which splits the complex values before they reach the RMLMapper. Whether this is possible depends on the specific source database, though.

For instance, if the source is a SQL Server database, you could use the function STRING_SPLIT. Or if it is a PostgreSQL database, you could use STRING_TO_ARRAY together with unnest. (Since different separators are used in the data, you may have to call STRING_SPLIT or STRING_TO_ARRAY once for each separator.)
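If the source is a plain file rather than a database, the same splitting idea could be applied in a small step that produces a derived one-person-per-row view and leaves the original file untouched. What follows is only a minimal Java sketch, assuming that semicolons and line breaks are the only separators, as in the excerpt from the question; the class and method names (PersonCellSplitter, splitCell, splitColumn) are hypothetical and are not part of RML or RMLMapper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PersonCellSplitter {

    // Assumed separators: a semicolon or any line break (\R, Java 8+).
    private static final String SEPARATORS = ";|\\R";

    // Splits one multi-person cell into one trimmed string per person.
    static List<String> splitCell(String cell) {
        List<String> persons = new ArrayList<>();
        if (cell == null) {
            return persons;
        }
        for (String part : cell.split(SEPARATORS)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                persons.add(trimmed);
            }
        }
        return persons;
    }

    // Flattens a whole column so that every resulting row holds one person.
    static List<String> splitColumn(List<String> cells) {
        List<String> rows = new ArrayList<>();
        for (String cell : cells) {
            rows.addAll(splitCell(cell));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList(
                "Wilson, Charles; Harris Arthur",
                "White, D.\nArthur Harris");
        // Prints one person string per line.
        splitColumn(column).forEach(System.out::println);
    }
}
```

With one person per row, a standard triples map with exactly one subject per record is sufficient, and the functions for first and last name only need to return a single value each.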

If you provide more information about the underlying database, I can update this answer with an example.

(Note: I contribute to RML and its technologies.)

David Chaves

As I understand it, you have a normalization problem (multi-valued cells). What you are effectively asking for is a dataset in first normal form (1NF), see: https://en.wikipedia.org/wiki/First_normal_form

To address these common heterogeneity problems in CSV files, you can use CSV on the Web (CSVW) annotations, a W3C recommendation. More specifically, the property you are asking for in this case is csvw:separator (https://www.w3.org/TR/tabular-data-primer/#sequence-values).

However, there are not many parsers for CSVW, and the semantics of its properties for generating RDF are not very clear. We have been working on a solution that combines CSVW and RML+FnO to generate virtual knowledge graphs from tabular data (it also takes a SPARQL query as input and does not transform the input dataset to RDF). The output of our proposal is a well-formed database with a standard [R2]RML mapping, so any [R2]RML-compliant engine could be used to answer queries or to materialize the knowledge graph. We do not currently support the materialization step, but it is on our to-do list.

You can take a look at the contribution (under review right now): http://www.semantic-web-journal.net/content/enhancing-virtual-ontology-based-access-over-tabular-data-morph-csv

Website: https://morph.oeg.fi.upm.es/tool/morph-csv