Download multiple files from an online directory


I have the following website: https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod and I want to download all the files from 2021 to 2023. Once you open the website you can choose between different folders, but for now I only want to focus on the 2023 one and download all the files in that folder.

I've tried using loops and the rvest package, to no avail. I want to be able to download all files in the 2023 folder, but I can't find my way around the code. Please help.

Extra Info:

The code I used is very basic, since I'm just starting to work on more complex tasks in R. Here is what I tried:

IOD <- read_html("https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod?path=Post%20Operaci%C3%B3n%2FReportes%2FIEOD%2F2023%2F")
urls <- IOD %>% 
  html_nodes('context-menu-list-context-menu-root') %>%    # get all `area` nodes
  html_attr('href') %>%    # get the link attribute of each node
  sub('.htm$', '.zip', .) %>%    # change file suffix
  paste0('https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod', .)    # append to base URL

# create a directory for it all
dir <- file.path(tempdir(), 'COES')
dir.create(dir)

lapply(urls, function(url) download.file(url, file.path(dir, basename(url))))

# check it's there
list.files(dir)

When I ran that code, the output was:

IOD <- read_html("https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod?path=Post%20Operaci%C3%B3n%2FReportes%2FIEOD%2F2023%2F")
urls <- IOD %>% 
  html_nodes('context-menu-list-context-menu-root') %>%    # get all `area` nodes
  html_attr('href') %>%    # get the link attribute of each node
  sub('.htm$', '.zip', .) %>%    # change file suffix
  paste0('https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod', .)    # append to base URL

# create a directory for it all
dir <- file.path(tempdir(), 'COES')
dir.create(dir)
# Warning message:
# In dir.create(dir) :
#   'C:\Users\RCV\AppData\Local\Temp\Rtmp0AUH8C\COES' already exists

lapply(urls, function(url) download.file(url, file.path(dir, basename(url))))
# trying URL 'https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod'
# Content type 'text/html; charset=utf-8' length 48185 bytes (47 KB)
# downloaded 47 KB
# [[1]]
# [1] 0

# check it's there
list.files(dir)
# [1] "Ieod"  "Ieod#"

Honestly, I'm lost on what to do. Sorry if this is kind of a basic question.

Answered by Grzegorz Sapijaszko

My attempt with rvest::read_html_live():
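A note on the output in the question first. The selector 'context-menu-list-context-menu-root' has no leading . or #, so CSS reads it as an element name, and it most likely matches nothing; the listing also appears to be filled in by JavaScript, which read_html() does not execute. With no matches, html_attr() returns an empty character vector, and paste0() recycles that empty argument to "", leaving just the base URL. That would explain why a single HTML page was saved as "Ieod". The base-R sketch below reproduces this with a stand-in empty vector (hrefs is a hypothetical name, not something from the original code):

```r
# Stand-in for what html_attr() returns when the selector matches no nodes
hrefs <- character(0)

base <- "https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod"

# paste0() treats a zero-length argument as "", so one URL (the base) survives
urls <- paste0(base, sub(".htm$", ".zip", hrefs))

length(urls)    # one element, even though no link was found
basename(urls)  # the name download.file() was given as the destination file

# A simple guard that reports the problem instead of downloading the page itself
if (length(hrefs) == 0) {
  message("No links matched the selector; nothing to download.")
}
```

Checking length(hrefs) before building URLs avoids silently downloading the listing page instead of the files.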

First of all, we have to get the monthly links:

url <- "https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod?path=Post%20Operaci%C3%B3n%2FReportes%2FIEOD%2F2023%2F"

ses <- rvest::read_html_live(url)
# ses$view()

months <- ses |>
  rvest::html_elements(xpath = "//li[contains(@id, \"Post Operación/Reportes/IEOD/2023\")]") |>
  rvest::html_attr("id")

months <- paste0("https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod?path=", months)

months[[1]]
# [1] "https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod?path=Post Operación/Reportes/IEOD/2023/12_Diciembre/"

Now, for each month you have to get the daily links (below is just an example for one month; you should extend it using lapply or another form of iteration):

ses <- rvest::read_html_live(months[[1]])

days <- ses |>
  rvest::html_elements(xpath = "//a[contains(@id, \"Post Operación/Reportes/IEOD/2023/\")]") |>
  rvest::html_attr("id")

daily_urls <- paste0("https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod?path=", days)
daily_urls[[1]]
# [1] "https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod?path=Post Operación/Reportes/IEOD/2023/12_Diciembre/31/"

Now we have a link to a particular day (31) in a particular month (December). We have to extract the table from this page, like this:

ses <- rvest::read_html_live(daily_urls[[1]])

t <- ses |>
  rvest::html_elements(xpath = "//*[@id=\"tbDocumentLibrary\"]") |>
  rvest::html_table() |>
  purrr::pluck(1)

Then build the URLs of the individual files:

paste0("https://www.coes.org.pe/portal/browser/download?url=", days[[1]], t$Nombre)
# [1] "https://www.coes.org.pe/portal/browser/download?url=Post Operación/Reportes/IEOD/2023/12_Diciembre/31/Dom_3112.pdf"               
# [2] "https://www.coes.org.pe/portal/browser/download?url=Post Operación/Reportes/IEOD/2023/12_Diciembre/31/Anexo6_CMgCP_3112.zip"      
# [3] "https://www.coes.org.pe/portal/browser/download?url=Post Operación/Reportes/IEOD/2023/12_Diciembre/31/Anexo5_Manttoejec_3112.xls" 
# [4] "https://www.coes.org.pe/portal/browser/download?url=Post Operación/Reportes/IEOD/2023/12_Diciembre/31/Anexo4_Hop_3112.xlsx"       
# [5] "https://www.coes.org.pe/portal/browser/download?url=Post Operación/Reportes/IEOD/2023/12_Diciembre/31/Anexo3_RPFyRSF_3112.xlsx"   
# [6] "https://www.coes.org.pe/portal/browser/download?url=Post Operación/Reportes/IEOD/2023/12_Diciembre/31/Anexo2_Hidrologia_3112.xlsx"
# [7] "https://www.coes.org.pe/portal/browser/download?url=Post Operación/Reportes/IEOD/2023/12_Diciembre/31/Anexo1_Resumen_3112.xlsx"   
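The URLs printed above contain spaces and a non-ASCII character (ó), so they may need percent-encoding before being handed to download.file(). The following is a minimal sketch, not a tested recipe for this site: build_download_url() and download_one() are hypothetical helper names, and it is an assumption that the download endpoint accepts a percent-encoded url parameter in the same way the listing page accepts an encoded path parameter.

```r
# Endpoint taken from the URLs printed above
download_base <- "https://www.coes.org.pe/portal/browser/download?url="

# Hypothetical helper: percent-encode the folder path plus file name.
# reserved = TRUE also encodes "/", as in the listing URL from the question.
build_download_url <- function(day_path, file_name) {
  encoded <- utils::URLencode(paste0(day_path, file_name), reserved = TRUE)
  paste0(download_base, encoded)
}

# Hypothetical helper: download one file into dest_dir and return its local path.
# mode = "wb" writes in binary mode, which matters for pdf/xlsx/zip on Windows.
download_one <- function(day_path, file_name, dest_dir) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(dest_dir, file_name)
  utils::download.file(build_download_url(day_path, file_name), dest, mode = "wb")
  dest
}

# "\u00f3" is the accented o, written as an escape so the script is encoding-safe
day_path <- "Post Operaci\u00f3n/Reportes/IEOD/2023/12_Diciembre/31/"
build_download_url(day_path, "Dom_3112.pdf")
```

With the objects from the steps above, a loop such as for (f in t$Nombre) download_one(days[[1]], f, file.path(tempdir(), "COES")) would then fetch the files of one day, assuming the server accepts the encoded form.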

Please note that this requires another iteration. In total there are four levels of iteration: 1/ year, 2/ month, 3/ day, 4/ single files.
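The four levels can be sketched as nested loops. This is an outline under assumptions, not a tested crawler: collect_file_paths() and the live_* helpers are hypothetical names, the helpers generalise the XPath expressions from the steps above by substituting the current path (which may need adjusting to the actual page structure), and the 2021 and 2022 paths are assumed to follow the same pattern as 2023.

```r
listing_base <- "https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod?path="

# Hypothetical helper: ids of <tag> elements whose id contains `path`
live_ids <- function(path, tag) {
  ses <- rvest::read_html_live(paste0(listing_base, path))
  ses |>
    rvest::html_elements(xpath = sprintf("//%s[contains(@id, \"%s\")]", tag, path)) |>
    rvest::html_attr("id")
}
live_months <- function(year_path) live_ids(year_path, "li")
live_days   <- function(month_path) live_ids(month_path, "a")

# Hypothetical helper: file names from the "Nombre" column of a day's table
live_files <- function(day_path) {
  ses <- rvest::read_html_live(paste0(listing_base, day_path))
  tab <- ses |>
    rvest::html_elements(xpath = "//*[@id=\"tbDocumentLibrary\"]") |>
    rvest::html_table() |>
    purrr::pluck(1)
  tab$Nombre
}

# Walk year -> month -> day -> file and return "<day path><file name>" strings
collect_file_paths <- function(year_paths, list_months, list_days, list_files) {
  out <- character(0)
  for (year in year_paths) {                # 1/ year
    for (month in list_months(year)) {      # 2/ month
      for (day in list_days(month)) {       # 3/ day
        files <- list_files(day)            # 4/ single files
        # Skip empty days: paste0() would otherwise return the bare day path
        if (length(files) > 0) out <- c(out, paste0(day, files))
      }
    }
  }
  out
}

# Not run here, since it needs a browser session and network access:
# years <- paste0("Post Operaci\u00f3n/Reportes/IEOD/", 2021:2023, "/")
# files <- collect_file_paths(years, live_months, live_days, live_files)
```

Passing the listing functions as arguments keeps the traversal separate from the scraping, so the loop can be tried with small stub functions before pointing it at the live site.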