Extracting skos:closeMatch from RDF/XML using GREL in OpenRefine


This is a picture of my OpenRefine project. I need to extract every skos:closeMatch URI from an RDF/XML column into a separate column in OpenRefine.

This is my RDF/XML code:

<rdf:RDF xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/1999/02/22-rdf-schema#" xmlns:cs="http://purl.org/vocab/changeset/schema#" xmlns:skosxl="http://www.w3.org/2008/05/skos-xl#">
  <rdf:Description rdf:about="http://id.loc.gov/authorities/subjects/sh85145648">
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
    <skos:prefLabel xml:lang="en">Water-supply</skos:prefLabel>
    <skosxl:altLabel>
      <rdf:Description>
        <rdf:type rdf:resource="http://www.w3.org/2008/05/skos-xl#Label"/>
        <skosxl:literalForm xml:lang="en">Availability, Water</skosxl:literalForm>
      </rdf:Description>
    </skosxl:altLabel>
    <skosxl:altLabel>
      <rdf:Description>
        <rdf:type rdf:resource="http://www.w3.org/2008/05/skos-xl#Label"/>
        <skosxl:literalForm xml:lang="en">Water availability</skosxl:literalForm>
      </rdf:Description>
    </skosxl:altLabel>
    <skosxl:altLabel>
      <rdf:Description>
        <rdf:type rdf:resource="http://www.w3.org/2008/05/skos-xl#Label"/>
        <skosxl:literalForm xml:lang="en">Water resources</skosxl:literalForm>
      </rdf:Description>
    </skosxl:altLabel>
    <skos:closeMatch rdf:resource="http://www.yso.fi/onto/yso/p9967"/>
    <skos:closeMatch rdf:resource="http://id.worldcat.org/fast/1172350"/>
    <skos:closeMatch rdf:resource="http://www.wikidata.org/entity/Q1061108"/>
    <skos:closeMatch rdf:resource="http://id.worldcat.org/fast/1172350"/>
    <skos:closeMatch rdf:resource="http://www.wikidata.org/entity/Q1061108"/>
    <skos:closeMatch rdf:resource="http://www.yso.fi/onto/yso/p9967"/>
    <skos:changeNote>
      <cs:ChangeSet>
        <cs:subjectOfChange rdf:resource="http://id.loc.gov/authorities/subjects/sh85145648"/>
        <cs:creatorName rdf:resource="http://id.loc.gov/vocabulary/organizations/dlc"/>
        <cs:createdDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">1986-02-11T00:00:00</cs:createdDate>
        <cs:changeReason rdf:datatype="http://www.w3.org/2001/XMLSchema#string">new</cs:changeReason>
      </cs:ChangeSet>
    </skos:changeNote>
    <skos:changeNote>
      <cs:ChangeSet>
        <cs:subjectOfChange rdf:resource="http://id.loc.gov/authorities/subjects/sh85145648"/>
        <cs:creatorName rdf:resource="http://id.loc.gov/vocabulary/organizations/dlc"/>
        <cs:createdDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-11-17T07:36:37</cs:createdDate>
        <cs:changeReason rdf:datatype="http://www.w3.org/2001/XMLSchema#string">revised</cs:changeReason>
      </cs:ChangeSet>
    </skos:changeNote>
  </rdf:Description>
</rdf:RDF>

I tried the expression value.parseHtml().select('skos|closematch') to add a column based on the RDF/XML column, but it doesn't work.

There are 2 answers

Tom Morris (best answer)

Your code is pretty close. Were you examining the display of the preview column to help guide you?

Your code returns an array of six XML elements. The things that you're missing are:

  • an iterator - forEach()
  • a function to fetch the value of the attribute - htmlAttr()
  • something to convert the array to a single value which can be stored in the column - join()

Altogether it'll look like:

forEach(value.parseHtml().select('skos|closeMatch'), element, element.htmlAttr('rdf:resource')).join('|')

I built this from the inside out. I started with a single element, value.parseHtml().select('skos|closeMatch')[0], to see what it looked like, then added .htmlAttr('rdf:resource'), and finally wrapped the whole thing in forEach(...).join('|'). You can of course choose whatever delimiter you find most useful.

Update: your data has duplicates, so you might want to add .uniques() like:

forEach(value.parseHtml().select('skos|closeMatch'), element, element.htmlAttr('rdf:resource')).uniques().join('|')
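
To sanity-check what the column should contain, the same three steps (iterate over the elements, read the rdf:resource attribute, de-duplicate and join) can be reproduced outside OpenRefine. The following is only a rough sketch in plain Python using the standard library's ElementTree; the function name close_match_uris and the shortened sample are made up for illustration, and it is not what OpenRefine runs internally.

```python
import xml.etree.ElementTree as ET

# Namespace URIs taken from the xmlns declarations in the question's RDF/XML.
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def close_match_uris(rdf_xml, delimiter="|"):
    """Hypothetical helper: collect unique skos:closeMatch targets as one string."""
    root = ET.fromstring(rdf_xml)
    uris = []
    # Step 1: iterate over every skos:closeMatch element (the forEach part).
    for element in root.iter("{%s}closeMatch" % SKOS):
        # Step 2: read the rdf:resource attribute (the htmlAttr part).
        uri = element.get("{%s}resource" % RDF)
        # Drop duplicates, keeping the first occurrence of each URI.
        if uri is not None and uri not in uris:
            uris.append(uri)
    # Step 3: collapse the list into a single cell value (the join part).
    return delimiter.join(uris)


# Shortened sample based on the question's data, including one duplicate.
sample = """<rdf:RDF xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://id.loc.gov/authorities/subjects/sh85145648">
    <skos:closeMatch rdf:resource="http://www.yso.fi/onto/yso/p9967"/>
    <skos:closeMatch rdf:resource="http://id.worldcat.org/fast/1172350"/>
    <skos:closeMatch rdf:resource="http://www.wikidata.org/entity/Q1061108"/>
    <skos:closeMatch rdf:resource="http://id.worldcat.org/fast/1172350"/>
  </rdf:Description>
</rdf:RDF>"""

result = close_match_uris(sample)
print(result)
```

For this sample the sketch yields the three distinct URIs joined by '|', which is the shape of value you would expect to see in the new column.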

RolfBly

What is your desired result? I copied your code into OpenRefine's Clipboard import and selected rdf:Description as the first XML element. I assume the code in your question is just a short sample and that you in fact have several rdf:Description elements inside the rdf:RDF element, so you get a record for each rdf:Description.

This is what I get in the Configure parsing options pane...

screenshot1

And this is what I get when I do Create Project and switch to row mode.

Screenshot2

Is the third column what you mean by this?

all the instances of skos:closeMatch URIs from an RDF/XML column into a separate column in OpenRefine.

If not, please clarify by editing your question.