How to detect and tabulate data from an excel file?

141 views Asked by At

I have an Excel file (Available at Google Drive) with data which was saved with a very strange format in order to get printed easily:

Non-relational data, available at Google Drive

And every table repeats daily for over 5 years. I need to analyze this data and tried to get a relational format in order to load it in R/Python-like tools and get only 5 columns:

DATE CLIENT NAME TEST MEASUREMENT TESTER
01-01-2023 JOHN SMITH METABOLYTE A 0.01 PhD. IMA GU
01-01-2023 JOHN SMITH METABOLYTE B 10 PhD. IMA GU
01-01-2023 JOHN SMITH PCR NEGATIVE PhD. IMA GU
01-01-2023 JOHN SMITH MUTATION +++ PhD. IMA GU
01-01-2023 ALBUS DUMBLE PREGNANT NEGATIVE TECH. GUIVER
01-01-2023 ALBUS DUMBLE GLUCOSE 121 TECH. GUIVER
02-01-2023 MAYDAY JUNE METABOLYTE A 0.01 PhD. IMA GU
02-01-2023 JOHN SMITH METABOLYTE A 0.01 TECH. GUIVER
02-01-2023 JOHN SMITH METABOLYTE B 10 TECH. GUIVER
02-01-2023 JOHN SMITH PCR NEGATIVE TECH. GUIVER
02-01-2023 JOHN SMITH MUTATION +++ TECH. GUIVER

So, in order to get a conversion from non-relational data to relational table I have applied text-mining techniques available at this GitHub repo. But, basically, have converted everything into one column with tidyr::pivot_longer(). Is there any optimal function or method to detect and tabulate this kind of data, or should i try to do it with a loop (+843 files)?

1

There are 1 answers

2
Grzegorz Sapijaszko On BEST ANSWER

My attempt is based on fact, that the entries are formatted identically, so we can use kind of 'moving window'.

a <- openxlsx::read.xlsx(xlsxFile = "/home/sapi/Downloads/ENERO_2023_prueba.xlsx",
                    colNames = FALSE
                    )

Now we have to define data frame for data storage. Comments like # [2,2] +0, +1 corresponds to row and column of a (loaded excel).

entry <- data.frame(
  NOMBRE = character(),    # [2,2] +0, +1
  FECHA = character(),     # [2,6] +0, +5
  MUESTRA = character(),   # [3,3] +1, +2
  place = character(),     # [3,5] +1, +4
  COLOR = character(),     # [6,3] +4, +2
  ASPECTO = character(),   # [7,3] +5< +2
  DENSIDAD = double(),     # [8,3] +6, +2
  PH = character(),        # [9,3] +7, +2
# ...
  LEUCOCITOS = character(),#[19,3] +17, +2
  BACTERIAS = character(), # [6,7] +4, +6
  PIOCITOS = character()   # [7,7] +5, +6
# ...
)

Now we have to find all rows with NOMBRE

nombre_rows <- which(a[,"X1"] == "NOMBRE")

and use it in loop like:

for (i in 1:length(nombre_rows)) {
  x <- nombre_rows[i]
  nombre_cols <- which(a[x,] == "NOMBRE") # the same for columns
  for (j in 1:length(nombre_cols)) {
    y <- nombre_cols[j]

    entry <- data.frame(
      NOMBRE = a[x, y+1],
      FECHA = a[x, y+5],
      MUESTRA = a[x+1, y+2],
      place = a[x+1, y+4],
      COLOR = a[x+4, y+2],
      ASPECTO = a[x+5, y+2],
      DENSIDAD = a[x+6, y+2],
      PH = a[x+7, y+2],
      # ...
      LEUCOCITOS = a[x+17, y+2],
      BACTERIAS = a[x+4, y+6],
      PIOCITOS = a[x+5, y+6]
      # ...
    ) |> rbind(entry)
  }
}

And finally the data:

head(entry)
#>             NOMBRE   FECHA MUESTRA            place      COLOR      ASPECTO
#> 1      RUANO EDITH 44957.0   ORINA         CEXTERNA  AMARILLO   LIG.TURBIO 
#> 2    CUNIN ELVIRA  44957.0   ORINA HOSPÌTALIZACION   AMARILLO       TURBIO 
#> 3 LOACHAMIN MARIA  44957.0   ORINA         CEXTERNA  AMARILLO  TRANSPARENTE
#> 4    MANZANO RAUL  44957.0   ORINA         CEXTERNA  AMARILLO   LIG.TURBIO 
#> 5    MERCHAN IVAN  44957.0   ORINA      HIDRATACION ANARANJADO      TURBIO 
#> 6   ACERO ANTHONY  44957.0   ORINA         CEXTERNA   AMARILLO  LIG.TURBIO 
#>   DENSIDAD  PH LEUCOCITOS    BACTERIAS     PIOCITOS
#> 1   1005.0 8.0   NEGATIVO            +    1-2/CAMPO
#> 2   1020.0 5.0        +++           ++ CAMPO LLENO 
#> 3   1005.0 6.0   NEGATIVO OCASIONALES    1-2/CAMPO 
#> 4   1010.0 7.0   NEGATIVO            +   3-7/CAMPO 
#> 5   1015.0 6.0         ++           ++ 50-60/CAMPO 
#> 6   1010.0 5.0   NEGATIVO            +   0-2/CAMPO

Created on 2023-06-18 with reprex v2.0.2

You should extend the entry data frame to grab all variables from your data. And then loop it through all excels you have.