I'm trying to take biological data saved in .csv format and convert it into a specific XML format defined by the Darwin Core standard (an extension of Dublin Core). The data are arranged as rows of observation records, with column headers in the first row. I need to repackage the data using Darwin Core standard XML tags and a basic XML tree/schema. The purpose is to standardize the data so it can be loaded readily into any kind of database program.
I am a biologist, so I'm fairly new to computer programming and code. I would like to write something in R or Excel that does this repackaging step automatically, so I don't have to manually re-enter thousands of records.
I have tried using the developer tools in Excel 365 to save the .csv as an .xml file, but it seems I would first have to develop the XML tree or schema in a text editor. Also, the XML add-ons I would use appear to be no longer available. I have downloaded the free text editor Brackets (build 1.14) to write some simple XML. I also have RStudio version 1.1.419 and R version 3.4.3, with the XML package installed, to potentially write a script. I've read up on all the Darwin Core terms and on basic XML syntax and rules, but I don't really know where to start.
This is an example of the data in simple .csv format:
type,institutionCode,collectionCode,catalogNumber,scientificName,individualCount,datasetID
PhysicalObject,ANSP,PH,123,"Cryptantha gypsophila Reveal & C.R. Broome",12,urn:lsid:tim.lsid.tdwg.org:collections:1
PhysicalObject,ANSP,PH,124,"Buxbaumia piperi",2,urn:lsid:tim.lsid.tdwg.org:collections:1
This is what the records should look like as an end product:
<?xml version="1.0"?>
<dwr:SimpleDarwinRecordSet
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://rs.tdwg.org/dwc/xsd/simpledarwincore/ http://rs.tdwg.org/dwc/xsd/tdwg_dwc_simple.xsd"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:dwc="http://rs.tdwg.org/dwc/terms/"
xmlns:dwr="http://rs.tdwg.org/dwc/xsd/simpledarwincore/">
<dwr:SimpleDarwinRecord>
<dcterms:type>PhysicalObject</dcterms:type>
<dwc:institutionCode>ANSP</dwc:institutionCode>
<dwc:collectionCode>PH</dwc:collectionCode>
<dwc:catalogNumber>123</dwc:catalogNumber>
<dwc:scientificName>Cryptantha gypsophila Reveal &amp; C.R. Broome</dwc:scientificName>
<dwc:individualCount>12</dwc:individualCount>
<dwc:datasetID>urn:lsid:tim.lsid.tdwg.org:collections:1</dwc:datasetID>
</dwr:SimpleDarwinRecord>
<dwr:SimpleDarwinRecord>
<dcterms:type>PhysicalObject</dcterms:type>
<dwc:institutionCode>ANSP</dwc:institutionCode>
<dwc:collectionCode>PH</dwc:collectionCode>
<dwc:catalogNumber>124</dwc:catalogNumber>
<dwc:scientificName>Buxbaumia piperi</dwc:scientificName>
<dwc:individualCount>2</dwc:individualCount>
<dwc:datasetID>urn:lsid:tim.lsid.tdwg.org:collections:1</dwc:datasetID>
</dwr:SimpleDarwinRecord>
</dwr:SimpleDarwinRecordSet>
This can be done in a number of ways. Here, I go for the stringi solution because it's easy to read what the inputs are.
The code below imports the data, writes the first part of the XML, then writes a SimpleDarwinRecord for each line, and finally writes the last part of the file. The unlink call is there to remove any existing output file before anything is appended to it. If indentation matters (apparently it doesn't), you may need to tweak the template a bit. This could also be done using a Jinja2 template and Python.
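For the Python route mentioned above, here is a rough sketch that follows the same steps (header, one record per CSV row, footer) using only the standard library instead of Jinja2. It is an illustration under assumptions, not a validated Darwin Core exporter: the function names csv_to_dwc and record_to_xml are made up for this example, it assumes that "type" is the only dcterms column and that every other column is a dwc term, and it skips empty cells rather than writing empty elements. No schema validation is performed.

```python
import csv
import io
from xml.sax.saxutils import escape

# Assumption: only "type" lives in the dcterms namespace; every other
# column in the sample data is treated as a Darwin Core (dwc) term.
DCTERMS_COLUMNS = {"type"}

# First part of the file, copied from the desired output in the question.
HEADER = """<?xml version="1.0"?>
<dwr:SimpleDarwinRecordSet
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://rs.tdwg.org/dwc/xsd/simpledarwincore/ http://rs.tdwg.org/dwc/xsd/tdwg_dwc_simple.xsd"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:dwc="http://rs.tdwg.org/dwc/terms/"
xmlns:dwr="http://rs.tdwg.org/dwc/xsd/simpledarwincore/">
"""

# Last part of the file.
FOOTER = "</dwr:SimpleDarwinRecordSet>\n"


def record_to_xml(row):
    """Render one CSV row (a dict of column -> value) as a SimpleDarwinRecord."""
    lines = ["<dwr:SimpleDarwinRecord>"]
    for column, value in row.items():
        # Skip surplus unnamed cells and empty values (an assumption of this sketch).
        if column is None or not value:
            continue
        prefix = "dcterms" if column in DCTERMS_COLUMNS else "dwc"
        # escape() converts &, < and > to entities so the XML stays well-formed.
        lines.append(f"<{prefix}:{column}>{escape(value)}</{prefix}:{column}>")
    lines.append("</dwr:SimpleDarwinRecord>")
    return "\n".join(lines) + "\n"


def csv_to_dwc(csv_file, xml_file):
    """Read records from an open CSV file and write the record set as XML."""
    xml_file.write(HEADER)
    for row in csv.DictReader(csv_file):  # first row is used as the header
        xml_file.write(record_to_xml(row))
    xml_file.write(FOOTER)


if __name__ == "__main__":
    # In-memory demo using the two sample rows from the question.
    sample = io.StringIO(
        "type,institutionCode,collectionCode,catalogNumber,scientificName,individualCount,datasetID\n"
        'PhysicalObject,ANSP,PH,123,"Cryptantha gypsophila Reveal & C.R. Broome",12,urn:lsid:tim.lsid.tdwg.org:collections:1\n'
        'PhysicalObject,ANSP,PH,124,"Buxbaumia piperi",2,urn:lsid:tim.lsid.tdwg.org:collections:1\n'
    )
    result = io.StringIO()
    csv_to_dwc(sample, result)
    print(result.getvalue())
```

To convert real files, pass open file handles instead of the in-memory buffers, for example open("records.csv", newline="") for reading and open("records.xml", "w") for writing (both file names are placeholders). Opening the output in "w" mode overwrites any earlier file, so no separate clean-up step is needed.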
This is the result: