I have 40+ CSV files, each around 400 MB. I need to read these 40+ big CSV files, do some manipulation and formatting on them (such as standardizing date formats and splitting dates into month, day, etc.), and combine them into a single data frame. From previous posts I found that the quickest way to read these CSV files is `fread`, but even with `fread` it took approx. 14 seconds per file, which leaves me with a pretty significant runtime. I tried using SQLite through RSQLite for a single CSV file:
```r
library(DBI)
library(RSQLite)

setwd("raw_data/sqldatabase")
db <- dbConnect(SQLite(), dbname = "test_db.sqlite")  # will create the db if not present
dbWriteTable(conn = db, name = "your_table", value = "testdata.csv", row.names = FALSE, header = TRUE)
```
However, even using SQLite it took a considerable amount of time. What can be used to quickly read 40+ big CSV files into a "space" in which manipulating them is very fast?
If I were to upload the data to a database once, and manipulation were very fast from then on, I would still be fine, but the final dataset (once the merge is complete) is expected to be 25+ GB. So I am trying to find the most efficient way to manipulate the data.
One alternative might be a "parquet datamart". The premise here:
- Read each CSV file individually, do whatever munging/formatting it needs, and save it to a single `.parquet` file, or to multiple `.parquet` files after splitting on one or more indexable (categorical/ordinal) columns.
- Access the data with `arrow::open_dataset`, `dplyr`, and their lazy evaluation. While this does not allow you to read the entire dataset into memory at one time, it does give you access to all of the data in smaller chunks, as if it were one big dataset.
- This is still compatible with `data.table` for the in-memory work, using `arrow` to access the data lazily. While my example below uses `data.table`, it is not required, and in fact it introduces an extra step or two in order to convert the sample data from `tibble` to `data.table`. I suggest it due to the large nature of the data and because you tagged it, not because it is required.

Some things to note for both examples:
- `ds` reflects all 336,776 rows of data, though the object itself is rather small (just an environment with references to the files and the metadata in them).
- Nothing is pulled into memory until you "realize" it with `collect()`. Don't try to do this on the whole data unless you know that it can fit into memory.
- The sample data was read in as a `data.table` (from `fread`), and `write_parquet` keeps several of the frame's attributes (including that), so when we realize the data below, it will be a `data.table`.
- While the collected data is a `data.table`, changes to that collected data will not migrate back to the parquet file itself. This means that if you do something like `collect(ds)[, newcol := 1]`, doing another `collect(ds)` will not have `newcol` in it. It's worth noting that `.parquet` files are immutable once written: they cannot be updated or appended to.

Practical example:
I'll use `nycflights13::flights`. The data contains 336,776 rows of "Airline on-time data for all flights departing NYC in 2013". For the sake of this example, I'll split the data randomly into 5 frames and save them as CSV files.

For the two examples below, I'm not going to do any data filtering/formatting/augmentation; other than highlighting where one would do this, I'll assume you know what you need to do to each individual CSV file before saving.
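A minimal sketch of how such sample CSVs could be produced (the `csv/` directory, the `set.seed` value, and the 5-way random split are my own illustrative choices):

```r
library(data.table)

flights <- as.data.table(nycflights13::flights)
set.seed(42)                                    # just for a reproducible split
grp <- sample(1:5, nrow(flights), replace = TRUE)
dir.create("csv", showWarnings = FALSE)
for (i in 1:5) {
  fwrite(flights[grp == i], sprintf("csv/%d.csv", i))  # 5 CSVs standing in for your 40+
}
```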
Plan 1: no indexes
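A sketch of the conversion step, assuming the CSV files live in `csv/` and the parquet files go into a `datamart/` directory (both directory names are illustrative):

```r
library(data.table)
library(arrow)

dir.create("datamart", showWarnings = FALSE)
for (f in list.files("csv", pattern = "\\.csv$", full.names = TRUE)) {
  dat <- fread(f)
  # ... do your per-file munging here: fix date formats, add month/day columns, etc. ...
  write_parquet(dat, file.path("datamart", sub("\\.csv$", ".parquet", basename(f))))
}
```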
Reading in the data:
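For instance (a sketch, assuming the `datamart/` directory written above):

```r
library(arrow)
library(dplyr)

ds <- open_dataset("datamart")   # lazy: only file paths and metadata, no rows in memory
ds                               # prints the schema, not the data

# operate lazily, then realize just the (smaller) result with collect()
ds %>%
  filter(month == 1, day == 1) %>%
  collect()
```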
Plan 2: using `year` and `month` as nested subdirectories

In your data, an indexable field might be a component of a date (as with `year` and `month` here) or some other categorical/ordinal column with a manageable number of levels.
There is a balance between convenience and usability: if a categorical variable has 20,000 possible values, then it might be too many, and one will lose much efficiency. The more directories/files found within the subdirectories, the longer it will take to call `open_dataset` before you can do something with it. I don't have easy metrics for this.

Note: one might be able to use `write_dataset`, the counterpart to our `open_dataset` above; it handles `partitioning=` in the same fashion. However, if you are not certain that each level within the partition fields is unique to a file (for instance, in my sample data I have `month == 1` in all CSV files), then each CSV file read would overwrite some data from the previous write. In this case, as I'll demonstrate here, I'll write to the subdirectories manually.
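A sketch of that manual write, again assuming the `csv/` inputs from above and a `datamart2/` output directory (names are illustrative). Note that `.SD` inside the grouped `j` excludes the grouping columns, so the files themselves will not contain `year`/`month`; the `year=`/`month=` path names carry that information instead:

```r
library(data.table)
library(arrow)

for (f in list.files("csv", pattern = "\\.csv$", full.names = TRUE)) {
  dat <- fread(f)
  # ... per-file munging/formatting here ...
  dat[, {
    outdir <- file.path("datamart2",
                        sprintf("year=%d", .BY$year), sprintf("month=%d", .BY$month))
    dir.create(outdir, recursive = TRUE, showWarnings = FALSE)
    # name each chunk after its source CSV so sibling files never overwrite each other
    write_parquet(.SD, file.path(outdir, sub("\\.csv$", ".parquet", basename(f))))
    NULL
  }, by = .(year, month)]
}
```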
Reading the data is the same as in Plan 1, though note that the index columns `year` and `month` are now the last two columns instead of the first two:
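For instance (a sketch, assuming the `datamart2/` directory written above; arrow detects the hive-style `year=`/`month=` paths automatically):

```r
library(arrow)
library(dplyr)

ds2 <- open_dataset("datamart2")   # partition columns come from the path names
ds2                                # schema: original columns first, then year and month

# filters on the partition columns only touch the matching subdirectories
ds2 %>%
  filter(year == 2013, month == 1) %>%
  collect()
```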
Notes:

- At some point you may want to read everything within one `year=?/month=?` subdirectory and combine all of the data into one file. Remember my mention of "many files, slow `open_dataset`"? Combining sibling `.parquet` files might help solve this. Again, this is only possible if you can read all of one subdir set into memory at a time. It might not be necessary; over to you.
- The `year=` path naming implicitly creates the indexable column. (If you read one of the `1.parquet` files directly, you'll note that it does not have `year` or `month` in it.) If this is not desired, you can create plainer path names (e.g., `2013/12/1.parquet`) and use `open_dataset("datamart2", partitioning = c("year", "month"))`.
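A sketch of that first consolidation note, under the assumption that a single `year=?/month=?` subdirectory fits comfortably in memory (paths are illustrative):

```r
library(arrow)
library(dplyr)

subdir <- "datamart2/year=2013/month=1"
combined <- open_dataset(subdir) %>% collect()      # realize this one subdir only
unlink(list.files(subdir, pattern = "\\.parquet$", full.names = TRUE))
write_parquet(combined, file.path(subdir, "combined.parquet"))
```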