<- updated for completeness (thanks to hrbrmstr for pointing it out)->
I'm trying to extract some data from PubMed, and I've been following the example from here (relevant diagram here). A redacted version of my data looks like this:
<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID Version="1">11841882</PMID>
<Article PubModel="Print">
<PublicationTypeList>
<PublicationType UI="D002363">Case Reports</PublicationType>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
</Article>
<MeshHeadingList>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D006323">Heart Arrest</DescriptorName>
<QualifierName MajorTopicYN="Y" UI="Q000188">drug therapy</QualifierName>
<QualifierName MajorTopicYN="N" UI="Q000401">mortality</QualifierName>
<QualifierName MajorTopicYN="N" UI="Q000628">therapy</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
</PubmedArticle>
<PubmedArticle>
<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID Version="1">11841881</PMID>
<Article PubModel="Print">
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
</Article>
<MeshHeadingList>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D000368">Aged</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
</PubmedArticle>
</PubmedArticleSet>
So far, I've been able to nicely extract the PublicationTypes using the following code (please run the code in the top segment at the end of this post first):
utilAtype <- function(x) {
  PMID <- xmlValue(x[[1]][[1]])
  PublicationType <- sapply(xmlChildren(x[["Article"]][["PublicationTypeList"]], omitNodeTypes = "XMLInternalTextNode"), xmlValue)
  data.frame(PMID = PMID, PublicationType = PublicationType, stringsAsFactors = FALSE)
}
PMIDAType <- xpathApply(hdisease, '//MedlineCitation', utilAtype)
PMIDAType <- do.call(rbind, PMIDAType)
PMID PublicationType
11841882 Case Reports
11841882 Journal Article
11841881 Journal Article
However, using a similar approach on the MeshHeadings returns only the first heading for each article; sapply appears to skip the remaining subnodes, as shown below:
PMID LName
11841882 Cardiopulmonary Resuscitation
-Other entries for 11841882 missing-
11841881 Aged
I would appreciate it if anyone could explain why this happens. The way it's done in the sample suggests that this approach should have worked without issues. Please see the code below for reference.
require("XML")
xmlfile <- xmlParse("file.xml", useInternalNodes = TRUE)
hdisease <- xmlRoot(xmlfile)
utilMesh <- function(x) {
  PMID <- xmlValue(x[[1]][[1]])
  MHead <- ifelse(is.null(x[["MeshHeadingList"]]), NA,
                  sapply(xmlChildren(x[["MeshHeadingList"]], omitNodeTypes = "XMLInternalTextNode"),
                         function(z) xmlValue(z[["DescriptorName"]])))
  data.frame(PMID = PMID, MHead = MHead, stringsAsFactors = FALSE)
}
PMIDMesh <- xpathApply(hdisease, '//MedlineCitation', utilMesh)
PMIDMesh <- do.call(rbind, PMIDMesh)
c <- nrow(PMIDMesh)
row.names(PMIDMesh) <- 1:c
nrow(table(PMIDMesh))
write.csv(PMIDMesh, "Mesh1.csv")
I would use xpath instead, maybe...
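As a rough illustration of what an XPath-based version might look like, here is a minimal sketch using only the XML package from the question. The helper name `mesh_for_citation` and the inline `xml_text` string (a trimmed copy of the question's sample data) are assumptions made for the example and are not part of the original post. It also sidesteps what looks like the likely cause of the missing rows: `ifelse()` returns a value the same length as its test, which is length one in `utilMesh`, so only the first element of the `sapply()` result survives. A plain `if` does not truncate its result.

```r
library(XML)

# Trimmed copy of the question's sample data, inlined so the sketch runs on its own.
xml_text <- '<PubmedArticleSet>
<PubmedArticle><MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID Version="1">11841882</PMID>
<MeshHeadingList>
<MeshHeading><DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName></MeshHeading>
<MeshHeading><DescriptorName MajorTopicYN="N" UI="D006323">Heart Arrest</DescriptorName>
<QualifierName MajorTopicYN="Y" UI="Q000188">drug therapy</QualifierName></MeshHeading>
</MeshHeadingList>
</MedlineCitation></PubmedArticle>
<PubmedArticle><MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID Version="1">11841881</PMID>
<MeshHeadingList>
<MeshHeading><DescriptorName MajorTopicYN="N" UI="D000368">Aged</DescriptorName></MeshHeading>
<MeshHeading><DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName></MeshHeading>
</MeshHeadingList>
</MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: one data frame per MedlineCitation node.
mesh_for_citation <- function(citation) {
  pmid <- xmlValue(citation[["PMID"]])
  # Relative XPath, evaluated with this citation as the context node.
  heads <- xpathSApply(citation, "./MeshHeadingList/MeshHeading/DescriptorName", xmlValue)
  # Plain if (not ifelse) so a multi-element result is kept whole;
  # articles without any MeshHeading get a single NA row.
  if (length(heads) == 0) heads <- NA_character_
  data.frame(PMID = pmid, MHead = heads, stringsAsFactors = FALSE)
}

PMIDMesh <- do.call(rbind, xpathApply(doc, "//MedlineCitation", mesh_for_citation))
print(PMIDMesh)
```

With the real file, replacing `xmlParse(xml_text, asText = TRUE)` with `xmlParse("file.xml")` should be the only change needed. Selecting `PMID` by name rather than by position (`x[[1]][[1]]`) is a design choice here, so the lookup does not depend on the order of child nodes.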