This issue has already been opened on the fst package's GitHub, but the package author no longer seems to be actively maintaining it.
Meanwhile, I need a workaround for this crash, which is reproducible and occurs regularly, but only on a small subset of my files. I am trying to find a way to detect a corrupted .fst file without actually reading it, because the crash stops all further processing in my script.
Here is a sample corrupted fst file that you can download and try to open using fst::read.fst(). Your R session is likely to crash.
If your R session does not crash and you just get an error, then you are lucky (I have tried on an Ubuntu R server as well as on macOS with the latest R 4.2, and the R session crashes every time). It may not crash in specific situations, but the question still remains. (For details of the error message, please see the GitHub issue link above.)
I want some way to detect whether a file is clean or corrupted before running read.fst().
And yes, I have tried tryCatch(), but the crash still occurs.
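For context, tryCatch() only intercepts R-level conditions; when the R process itself crashes (for example, a segfault in compiled code), no handler gets a chance to run. One possible way to contain that, sketched here under assumptions, is to attempt the read in a throwaway child R process and inspect its exit status. run_in_child() and fst_reads_ok() are hypothetical helper names, not part of fst, and the sketch assumes that Rscript lives in R.home("bin") and that fst is installed for the child process. Note that this still reads the file, just not in the main session, so it may be slow for large files.

```r
# Hypothetical helper: evaluate R code in a separate Rscript process.
# Returns TRUE if the child exits with status 0, FALSE otherwise
# (R error, non-zero quit, or a crash of the child process).
run_in_child <- function(expr_text,
                         rscript = file.path(R.home("bin"), "Rscript")) {
  status <- system2(rscript, c("-e", shQuote(expr_text)),
                    stdout = FALSE, stderr = FALSE)
  identical(as.integer(status), 0L)
}

# Hypothetical wrapper: try a full read of `path` in the child process.
fst_reads_ok <- function(path) {
  # deparse() turns the path into a quoted R string literal
  run_in_child(sprintf("invisible(fst::read.fst(%s))", deparse(path)))
}
```

With a character vector of paths called files (an assumed name), Filter(fst_reads_ok, files) would keep only the files whose read completed in the child process.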
Perhaps scanning the file's header in octal / raw mode could help detect unexpected content, such as null characters, that is causing the crash. But I leave it to the experts to find a way.
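As a starting point for that idea, the sketch below only shows how to pull the leading bytes of a file with base R and summarize them. It makes no claim about the actual fst file layout, which is not verified here. peek_bytes() is a hypothetical helper name, and the choice of 64 bytes is arbitrary. Null bytes are normal in binary files, so the count is only meaningful when compared against a known-good file with the same columns.

```r
# Hypothetical helper: read the first `n` bytes of a file without parsing it.
peek_bytes <- function(path, n = 64L) {
  bytes <- readBin(path, what = "raw", n = n)
  list(
    size   = file.size(path),          # total file size in bytes
    head   = bytes,                    # leading bytes as a raw vector
    n_null = sum(bytes == as.raw(0))   # number of 0x00 bytes in that prefix
  )
}
```

Comparing peek_bytes("corrupted.fst")$head with the same prefix from a file that reads cleanly may show whether the two diverge early, although a matching prefix does not prove the rest of the file is intact.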
UPDATE: Waldi found that, surprisingly, a column-wise read.fst() does not crash. However, there are a few problems with this approach.
- The column data is corrupted. In the file I tested, the last 2 of the 4 columns are corrupted. The output is as follows:
> fst::read.fst("corrupted.fst",columns = c("termid","ts","rv","av"),from = 1,to = 10)
termid ts rv av
1 1204011660 <NA> 4.646816e-310 4.646816e-310
2 1204011660 2022-07-21 07:52:43 4.646816e-310 4.646816e-310
3 1204011660 2022-08-18 16:37:19 4.646816e-310 4.646816e-310
4 1204011660 2022-08-18 16:37:20 4.646835e-310 4.646835e-310
5 1204011660 2022-08-18 16:37:50 4.646817e-310 4.646817e-310
6 1204011660 2022-08-18 16:38:13 4.646817e-310 4.646817e-310
7 1204011660 2022-08-18 16:38:43 4.646817e-310 4.646817e-310
8 1204011660 2022-08-18 16:39:13 4.646817e-310 4.646817e-310
9 1204011660 2022-08-18 16:39:15 4.646819e-310 4.646819e-310
10 1204011660 2022-08-18 16:39:45 4.646830e-310 4.646830e-310
- The response time tanks: it takes more than 5 seconds just to output the first 10 rows.
> system.time(fst::read.fst("corrupted.fst",columns = c("termid","ts","rv","av"),from = 1,to = 10))
user system elapsed
2.069 3.874 5.940
- A column-wise read still crashes on other corrupted files I have, so it is not a reliable method.
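The values around 4.6e-310 in the output above are subnormal doubles, far below .Machine$double.xmin, which genuine measurements would rarely produce. A rough sanity check along those lines could flag such columns in cases where a read does complete. has_subnormal() is a hypothetical helper name, and the check is a heuristic only: it assumes the real data never contains such tiny non-zero values.

```r
# Hypothetical helper: TRUE if a double vector contains non-zero values
# smaller in magnitude than the smallest normalized double.
has_subnormal <- function(x) {
  if (!is.double(x)) return(FALSE)
  v <- unclass(x)  # drop classes such as POSIXct so abs() works
  any(v != 0 & abs(v) < .Machine$double.xmin, na.rm = TRUE)
}
```

For a data frame df returned by read.fst(), vapply(df, has_subnormal, logical(1)) would give one flag per column.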
I am looking for a faster and more reliable solution.