read.fst() crashes R: workaround needed to detect a corrupted file before calling read.fst()


This issue is already open on the fst package's GitHub, but the package author no longer seems to be actively maintaining it.

Meanwhile, I need a workaround for this crash, which is reproducible and occurs regularly, but only on a small subset of my files. I am trying to find a way to detect a corrupted .fst file without actually reading it, because the crash stops all further processing in my script.

Here is a sample corrupted fst file that you can download and try to open with fst::read.fst(). Your R session is likely to crash. If it does not crash and you just get an error, you are lucky (I have tried on an Ubuntu R server as well as on Mac OS with the latest R 4.2, and the R session crashes every time). It may not crash in specific situations, but the question still stands. (For details of the error message, please see the GitHub issue link above.)

I want some way to detect if a file is clean or corrupted before running read.fst().

And yes, I have tried tryCatch() but the crash still occurs.
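
For context, tryCatch() only handles R conditions, so a crash that takes down the whole R process never reaches it. One way to contain the crash, sketched here under assumptions, is to attempt the read in a throwaway Rscript child process and look only at its exit status. The helper name fst_reads_cleanly and the 60-second timeout are hypothetical, and the check is only as fast as a full read of the file.

```r
# Hypothetical helper: try fst::read.fst() in a separate R process so that
# a crash kills only the child, not the calling session.
fst_reads_cleanly <- function(path, timeout = 60) {
  rscript <- file.path(R.home("bin"), "Rscript")
  # The child reads the file named by its first trailing argument.
  expr <- "invisible(fst::read.fst(commandArgs(trailingOnly = TRUE)[1]))"
  status <- suppressWarnings(
    system2(rscript,
            args = c("--vanilla", "-e", shQuote(expr), shQuote(path)),
            stdout = FALSE, stderr = FALSE, timeout = timeout)
  )
  # Exit status 0 means the child finished without an error, crash or timeout.
  identical(as.integer(status), 0L)
}
```

A script could then call `if (fst_reads_cleanly(f)) df <- fst::read.fst(f)` and skip files for which the probe returns FALSE. A FALSE result does not distinguish between a crash, an ordinary R error, a missing file, or a timeout.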

Perhaps scanning the header of the file's raw data in octal / raw mode would help detect unexpected characters, such as null characters, that are causing the crash. But I leave it to you, the experts, to find a way.
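
As a starting point for that idea, the first bytes of a file can be read with base R alone, without going through fst. This is only a sketch: the helper name peek_header and the 64-byte window are hypothetical, and I do not know which header bytes (if any) reliably distinguish a clean file from a corrupted one, so the output would have to be compared by eye against a known-good file.

```r
# Hypothetical helper: return the first n bytes of a file as a raw vector.
peek_header <- function(path, n = 64L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readBin(con, what = "raw", n = n)
}

# Hypothetical summary: file size plus how many of the first n bytes are null.
header_summary <- function(path, n = 64L) {
  bytes <- peek_header(path, n)
  list(size       = file.size(path),
       bytes_read = length(bytes),
       null_bytes = sum(bytes == as.raw(0)))
}
```

Null bytes are normal in binary formats, so a non-zero count proves nothing by itself. The point is only to compare the numbers and the raw bytes between files that read fine and files that crash.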

UPDATE: Waldi found that, surprisingly, a column-wise read.fst() does not crash. However, there are a few problems with this approach.

  1. The column data comes back corrupted. In the file I tested, the last 2 of the 4 columns (rv and av) are corrupted. Output as follows:
> fst::read.fst("corrupted.fst",columns = c("termid","ts","rv","av"),from = 1,to = 10)
       termid                  ts            rv            av
1  1204011660                <NA> 4.646816e-310 4.646816e-310
2  1204011660 2022-07-21 07:52:43 4.646816e-310 4.646816e-310
3  1204011660 2022-08-18 16:37:19 4.646816e-310 4.646816e-310
4  1204011660 2022-08-18 16:37:20 4.646835e-310 4.646835e-310
5  1204011660 2022-08-18 16:37:50 4.646817e-310 4.646817e-310
6  1204011660 2022-08-18 16:38:13 4.646817e-310 4.646817e-310
7  1204011660 2022-08-18 16:38:43 4.646817e-310 4.646817e-310
8  1204011660 2022-08-18 16:39:13 4.646817e-310 4.646817e-310
9  1204011660 2022-08-18 16:39:15 4.646819e-310 4.646819e-310
10 1204011660 2022-08-18 16:39:45 4.646830e-310 4.646830e-310

  2. The response time tanks: it takes almost 6 seconds (elapsed) just to output the first 10 rows.
> system.time(fst::read.fst("corrupted.fst",columns = c("termid","ts","rv","av"),from = 1,to = 10))
   user  system elapsed 
  2.069   3.874   5.940 
  3. A column-wise read crashes on other corrupted files I have, so it is not a reliable method.
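
On the first point: the rv and av values in the output above (around 4.6e-310) are subnormal doubles, smaller in magnitude than .Machine$double.xmin, which is an unlikely range for real measurements. For files that do survive a partial read, a heuristic check could look like the sketch below. The helper name has_subnormals is hypothetical, and this flags only this particular symptom, not corruption in general.

```r
# Hypothetical heuristic: TRUE if a double vector contains non-zero values
# whose magnitude is below the smallest normalized double.
has_subnormals <- function(x) {
  is.double(x) &&
    any(x != 0 & abs(x) < .Machine$double.xmin, na.rm = TRUE)
}
```

Applied to a data frame df returned by the partial read, `vapply(df, has_subnormals, logical(1))` gives one flag per column.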

I am waiting for a faster & more reliable solution.
