I am trying to import an XML file from a URL:
library(xml2)
x <- read_xml('https://ftp.ncbi.nlm.nih.gov/pub/GTR/data/gtr_ftp.xml.gz')
Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html, :
Start tag expected, '<' not found [4]
According to the documentation I should be able to pass a URL for a .gz file and it will be uncompressed. If I download the file, unzip it locally, and then use read_xml it works fine. This is a pretty large file (~ 2 GB unzipped) and so I am not sure if that is a problem over a connection. Any thoughts on how I can read this directly from a connection?
The catch is that the documentation says "Local paths ending in .gz, .bz2, .xz, .zip will be automatically uncompressed" (emphasis added). The logic seems to be in the xml2:::path_to_connection function. URLs are not automatically uncompressed, only local files on disc.
The read_xml function will use the curl package to work with URLs if installed. If you have that package, you can wrap the download call with gzcon to do the decoding. Assuming you have enough RAM, you could try:
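A minimal sketch of that approach, assuming the xml2 and curl packages are installed. The helper name read_xml_gz_url is made up for illustration and is not part of either package. It wraps a curl connection in gzcon so the bytes are decompressed as they are read, then passes the connection to read_xml:

```r
library(xml2)

# Hypothetical helper (not part of xml2 or curl): read a gzipped XML file
# straight from a URL, decompressing it on the fly.
read_xml_gz_url <- function(url) {
  # Wrap the curl connection so the gzip stream is decoded while reading.
  con <- gzcon(curl::curl(url))
  # Closing the gzcon wrapper also closes the underlying connection.
  on.exit(close(con))
  # read_xml accepts a connection; the whole document is loaded into memory.
  read_xml(con)
}

# Usage with the file from the question (large download, roughly 2 GB unzipped):
# x <- read_xml_gz_url('https://ftp.ncbi.nlm.nih.gov/pub/GTR/data/gtr_ftp.xml.gz')
```

Nothing is streamed to the parser incrementally here: read_xml still builds the full document in memory, so this only saves the manual download and unzip step, not RAM.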