`read_fwf` and `vroom_fwf` accidentally skipping first lines?


I'm sure I'm doing something silly, but I can't quite figure it out. Both read_fwf and vroom_fwf return data that lacks one line (the first line, to be precise) when importing fixed-width files.

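There are two files: [test.csv](https://github.com/tidyverse/vroom/files/12156786/test.csv) (the column names and widths) and [vroom_fwf_test.txt](https://github.com/tidyverse/vroom/files/12156789/vroom_fwf_test.txt) (the fixed-width data).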

Suppose that both the fixed-width file and the CSV file are stored at the root directory. The code I used is

library(dplyr)
library(vroom)
library(data.table)

test <- fread(
  "test.csv",
  strip.white = TRUE, header = FALSE, blank.lines.skip = TRUE
) %>%
  filter(!is.na(V2)) %>%
  mutate(V1 = gsub(" |\\(", ".", gsub("\\)", "", V1)))
  
## gives one line
vroom::vroom_fwf(
  "vroom_fwf_test.txt", fwf_widths(test$V3, test$V1),
  n_max = 1000, col_types = cols(.default = "c"), id = "file_name"
)

This will only produce one row of data. But there are two lines in this raw file, as evidenced by

writeLines(read_lines(path)) ## two lines

which produces two lines as expected. If I leave only one line in the raw data, it'll produce zero imported rows.

The existing example in the manual, on the other hand, produces three lines as it should!

fwf_sample <- readr_example("fwf-sample.txt")
writeLines(read_lines(fwf_sample)) ## three lines
read_fwf(fwf_sample, fwf_widths(c(20, 10, 12), c("name", "state", "ssn"))) ## three lines as it should

I am not sure where I've gone wrong. My session info is as follows:

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] sf_1.0-9           censusxy_1.1.1     tidygeocoder_1.0.5
 [4] foreign_0.8-83     lubridate_1.9.0    timechange_0.1.1  
 [7] data.table_1.14.8  vroom_1.6.0        janitor_2.1.0     
[10] readxl_1.4.1       assertthat_0.2.1   here_1.0.1        
[13] stringi_1.7.8      forcats_0.5.2      stringr_1.5.0     
[16] dplyr_1.1.0        purrr_1.0.0        readr_2.1.3       
[19] tidyr_1.2.1        tibble_3.1.8       ggplot2_3.4.0     
[22] tidyverse_1.3.2    plyr_1.8.8         MASS_7.3-58.1     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9          class_7.3-20        rprojroot_2.0.3    
 [4] utf8_1.2.2          R6_2.5.1            cellranger_1.1.0   
 [7] backports_1.4.1     reprex_2.0.2        e1071_1.7-12       
[10] httr_1.4.4          pillar_1.8.1        rlang_1.0.6        
[13] googlesheets4_1.0.1 rstudioapi_0.14     googledrive_2.0.0  
[16] bit_4.0.5           munsell_0.5.0       proxy_0.4-27       
[19] broom_1.0.2         compiler_4.2.2      modelr_0.1.10      
[22] pkgconfig_2.0.3     tidyselect_1.2.0    fansi_1.0.3        
[25] crayon_1.5.2        tzdb_0.3.0          dbplyr_2.2.1       
[28] withr_2.5.0         grid_4.2.2          jsonlite_1.8.4     
[31] gtable_0.3.1        lifecycle_1.0.3     DBI_1.1.3          
[34] magrittr_2.0.3      units_0.8-1         scales_1.2.1       
[37] KernSmooth_2.23-20  cli_3.6.0           renv_0.16.0        
[40] fs_1.5.2            snakecase_0.11.0    xml2_1.3.3         
[43] ellipsis_0.3.2      generics_0.1.3      vctrs_0.5.2        
[46] tools_4.2.2         bit64_4.0.5         glue_1.6.2         
[49] hms_1.1.2           parallel_4.2.2      colorspace_2.0-3   
[52] gargle_1.2.1        classInt_0.4-8      rvest_1.0.3        
[55] haven_2.5.1   

Has anybody encountered a similar problem? Thank you very much.

(Opened as an issue on vroom's GitHub repository: https://github.com/tidyverse/vroom/issues/503)


Edit for @jay.sf: yes, it works! But when the .txt file is on my local machine, it behaves differently (I've attached the screenshot and the code used there). Perhaps it's a line-ending problem of some sort?


## Both test.csv and vroom_fwf_test.txt are in the local root directory
library(dplyr)
library(vroom)
library(data.table)

test <- fread(
  "test.csv",
  strip.white = TRUE, header = FALSE, blank.lines.skip = TRUE
) %>%
  filter(!is.na(V2)) %>%
  mutate(V1 = gsub(" |\\(", ".", gsub("\\)", "", V1)))

## Originally submitted code: this gives only one line
vroom::vroom_fwf(
  file = "vroom_fwf_test.txt",
  col_positions = fwf_widths(test$V3, test$V1),
  n_max = 1000, col_types = cols(.default = "c"), id = "file_name"
)

## @jay.sf's code: this gives two lines, yes
vroom::vroom_fwf(
  file = "https://github.com/tidyverse/vroom/files/12156789/vroom_fwf_test.txt",
  col_positions = with(
    read.csv(
      "https://github.com/tidyverse/vroom/files/12156786/test.csv",
      header = F
    ), vroom::fwf_widths(V3, V1)
  ),
  n_max = 1000,
  col_types = vroom::cols(.default = "c"),
  id = "file_name"
)

## This gives only one line again (it's the same file)
## The only difference from @jay.sf's code is that the file is local
vroom::vroom_fwf(
  file = "vroom_fwf_test.txt",
  col_positions = with(
    read.csv(
      "https://github.com/tidyverse/vroom/files/12156786/test.csv",
      header = F
    ), vroom::fwf_widths(V3, V1)
  ),
  n_max = 1000,
  col_types = vroom::cols(.default = "c"),
  id = "file_name"
)

## This gives two lines, so the problem is the .txt file?
vroom::vroom_fwf(
  file = "https://github.com/tidyverse/vroom/files/12156789/vroom_fwf_test.txt",
  col_positions = fwf_widths(test$V3, test$V1),
  n_max = 1000,
  col_types = vroom::cols(.default = "c"),
  id = "file_name"
)
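
One way to probe the line-ending guess raised in the edit above is to look at the raw bytes of the local file rather than at what a reader function returns. The sketch below uses only base R; `count_line_endings` is a hypothetical helper name, not part of vroom or readr, and the two temporary files are made-up stand-ins for the real data. It only reports which line endings a file contains and does not by itself show that line endings cause the missing row.

```r
# Hypothetical helper (not part of vroom or readr): tally line-ending styles
# by inspecting the raw bytes of a file.
count_line_endings <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  n <- length(bytes)
  is_cr <- bytes == as.raw(0x0d)  # carriage return, "\r"
  is_lf <- bytes == as.raw(0x0a)  # line feed, "\n"
  # Shifted copies let each byte be compared with its neighbour
  next_is_lf <- c(is_lf[-1], FALSE)
  prev_is_cr <- c(FALSE, is_cr[-n])
  list(
    crlf = sum(is_cr & next_is_lf),      # Windows-style "\r\n"
    lf_only = sum(is_lf & !prev_is_cr),  # Unix-style "\n"
    cr_only = sum(is_cr & !next_is_lf),  # bare "\r"
    ends_with_newline = n > 0 && (is_lf[n] || is_cr[n])
  )
}

# Made-up two-line files, one per line-ending style, written byte for byte
crlf_file <- tempfile(fileext = ".txt")
lf_file <- tempfile(fileext = ".txt")
writeBin(charToRaw("row one\r\nrow two\r\n"), crlf_file)
writeBin(charToRaw("row one\nrow two\n"), lf_file)

crlf_counts <- count_line_endings(crlf_file)
lf_counts <- count_line_endings(lf_file)
str(crlf_counts)
str(lf_counts)
```

Calling `count_line_endings("vroom_fwf_test.txt")` on the local copy, and again on a freshly downloaded copy, would show whether the two files actually differ in their line endings or in a missing final newline.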

There are 2 answers

1
Roman Luštrik On

Your code is not exactly reproducible. I had to load dplyr and define path and n_max to get the code running.

When I import using vroom_fwf, I get two lines.

> vroom::vroom_fwf(
+   file = "~/Downloads/vroom_fwf_test.txt",
+   col_positions = fwf_widths(test$V3, test$V1),
+   n_max = Inf, col_types = cols(.default = "c"), id = "file_name"
+ )
# A tibble: 2 × 33                                                                                                                     
  file_name          COUNTY.CODE PRECINCT VUID  LAST.NAME FIRST.NAME MIDDLE.NAME FORMER.LAST.NAME SUFFIX GENDER DOB   PERM.HOUSE.NUMBER
  <chr>              <chr>       <chr>    <chr> <chr>     <chr>      <chr>       <chr>            <chr>  <chr>  <chr> <chr>            
1 ~/Downloads/vroom… 999         9        9999… TESTER    TEST       TES         NA               NA     M      1999… 999              
2 ~/Downloads/vroom… 999         9        9999… TESTER    TEST       TES         NA               NA     M      1999… 999              
# ℹ 21 more variables: PERM.DESIGNATOR <chr>, PERM.DIRECTIONAL.PREFIX <chr>, PERM.STREET.NAME <chr>, PERM.STREET.TYPE <chr>,
#   PERM.DIRECTIONAL.SUFFIX <chr>, PERM.UNIT.NUMBER <chr>, PERM.UNIT.TYPE <chr>, PERM.CITY <chr>, PERM.ZIPCODE <chr>,
#   MAILING.ADDRESS.1 <chr>, MAILING.ADDRESS.2 <chr>, MAILING.CITY <chr>, MAILING.STATE <chr>, MAILING.ZIPCODE <chr>,
#   EDR..EFFECTIVE.DATE.OF.REGISTRATION <chr>, STATUS.CODE <chr>, HISPANIC.SURNAME.FLAG <chr>, ELECTION.DATE <chr>,
#   ELECTION.TYPE <chr>, ELECTION.PARTY <chr>, ELECTION.VOTING.METHOD <chr>
0
jay.sf On

I'm not sure if you need all those packages.

with(read.csv('https://github.com/tidyverse/vroom/files/12156786/test.csv', header=FALSE),
     setNames(read.fwf('https://github.com/tidyverse/vroom/files/12156789/vroom_fwf_test.txt', V3), V1)) |> 
  lapply(trimws) |> as.data.frame()
#   COUNTY.CODE PRECINCT       VUID LAST.NAME FIRST.NAME MIDDLE.NAME FORMER.LAST.NAME SUFFIX GENDER      DOB
# 1         999        9 9999999999    TESTER       TEST         TES             <NA>   <NA>      M 19999999
# 2         999        9 9999999999    TESTER       TEST         TES             <NA>   <NA>      M 19999999
#   PERM.HOUSE.NUMBER PERM.DESIGNATOR PERM.DIRECTIONAL.PREFIX PERM.STREET.NAME PERM.STREET.TYPE PERM.DIRECTIONAL.SUFFIX
# 1               999            <NA>                       T          TESTERS               ST                    <NA>
# 2               999            <NA>                       T          TESTERS               ST                    <NA>
#   PERM.UNIT.NUMBER PERM.UNIT.TYPE PERM.CITY PERM.ZIPCODE MAILING.ADDRESS.1 MAILING.ADDRESS.2 MAILING.CITY MAILING.STATE
# 1             <NA>           <NA>    TESTER        99999         TS TESTER              <NA>       TESTER            TS
# 2             <NA>           <NA>    TESTER        99999         TS TESTER              <NA>       TESTER            TS
#   MAILING.ZIPCODE EDR..EFFECTIVE.DATE.OF.REGISTRATION. STATUS.CODE HISPANIC.SURNAME.FLAG ELECTION.DATE ELECTION.TYPE
# 1      99999-9999                             99999999        TRUE                  <NA>      99999999            TS
# 2      99999-9999                             99999999        TRUE                  <NA>      99999999            TS
#   ELECTION.PARTY ELECTION.VOTING.METHOD
# 1           <NA>                     TS
# 2           <NA>                     TS