Objective
I'm building a data lake; the general flow looks like NiFi -> Storage -> ETL -> Storage -> Data Warehouse.
The general rule for a data lake is: no pre-processing at the ingestion stage. All processing should happen at the ETL stage, so you keep provenance over both the raw and the processed data.
Issue
The source system sends corrupted CSV files: besides the header and the data, the first two lines are always free-format metadata we'll never use. Only a single table is affected, and at the moment the corrupted CSV is used by a single Spark job (let's call it X).
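For illustration, a file of this shape might look like the following (the contents are invented; only the structure matters):

```
Export generated by LegacySystem on 2019-03-14   <- free-format metadata, line 1
batch=1234;operator=svc_export                   <- free-format metadata, line 2
id,name,amount
1,foo,10.50
2,bar,7.25
```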
Question
Is it a good approach to remove those two lines at the NiFi layer? See option 3 under "Workarounds".
Workarounds
- Handle the corrupted records inside Spark job X. IMHO, this is a bad approach, because we are going to use that file with other tools in the future (data governance schema crawlers, maybe some Athena/ADLA-like engines over ADLS/S3), which means the corrupted-records handling logic would have to be implemented in multiple places. (A sketch of what this handling could look like in Spark follows this list.)
- Fix the corrupted files at the ETL layer and store them in a "fixed" layer. All downstream activities (ETL, data governance, MPP engines) would then work only with the "fixed" layer instead of the "raw" layer. This sounds like overhead to me: creating a whole new layer for a single CSV.
- Fix the files (remove the first two lines from the CSV) at the NiFi layer. This means the "raw" storage layer will always contain readable data. IMHO, this is good because it's simple and the handling logic lives in one place.
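For concreteness, a rough sketch of what workaround 1 (handling inside job X) could look like, assuming PySpark and a hypothetical S3 path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job_x").getOrCreate()

raw_path = "s3://datalake-raw/source_system/table_x/"  # hypothetical location

# Read whole files so the first two free-format lines of *each* file
# can be dropped reliably (assumes individual files fit in memory).
lines = (spark.sparkContext
         .wholeTextFiles(raw_path)
         .flatMap(lambda kv: kv[1].splitlines()[2:]))

# Parse the remaining lines as CSV; PySpark's reader also accepts an RDD of
# strings. With several input files, repeated header rows would still need
# to be filtered out before this step.
df = spark.read.option("header", "true").csv(lines)
```

Every other consumer of the raw file would have to duplicate this kind of logic, which is exactly the objection above.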
First of all, I think your question is brilliant, and from the way you lay out your reasoning I'd say you already have your answer.
As you mention, the general rule is that there should be no pre-processing at the ingestion stage. This is the philosophical bottom line, and all the hype has grown around this easy-to-oversimplify idea.
If we check AWS's definition of what a data lake is, it is a basic definition, but let's use it as an "appeal to authority": they say clearly that you can store data "as-is".
In the same link, a bit below, they say more along the same lines.
In general, my way of looking at it is that throwing everything in there to follow the rule of "no pre-processing" is an attempt to be more Catholic than the Pope, or maybe a general tendency to oversimplify the rules. I believe the idea of "as is", and its power, goes more in the direction of not doing data filtering or transformation at ingestion, on the assumption that we don't really know all the possible future use cases, so having raw data is good and scalable. But it doesn't mean that keeping data we know is corrupted is good: quality is always a requirement for data, and at every stage the data should be at least readable.
This takes me to the next thought: one frequently repeated idea is that a data lake allows schema-on-read (AWS, Intuit, IBM, O'Reilly). That being so, it makes sense to keep, as much as possible, something with some kind of schema, if we don't want to overcomplicate the life of everyone who will potentially want to use it; otherwise we could in some cases render it useless, as the overhead of using it can be discouraging. Actually, the O'Reilly article mentioned above, called "The Death of Schema-on-Read", talks about exactly the complexity added by the lack of governance. So I guess removing some chaos will help the success of the data lake.
So far my position is quite clear to me -it was not that clear when I started writing this answer- but I will try to wrap up with one last reference, an article I have read a few times. Published in gartner.com's press room as early as 2014, it is called "Beware of the Data Lake Fallacy". The whole article is quite interesting, but I will highlight its warning about what happens to ungoverned data lakes over time.
I agree with that warning. It is fun at the beginning: save everything, watch your S3 bucket get populated, even run a few queries in Athena or Presto or some Spark jobs over lots of gzip files, and feel that we live in a magical time. But then this small contamination arrives, and we accept it, and one day the S3 buckets are not 10 but 100, the small exceptions are not 2 but 20, there are too many things to keep in mind, and everything gets messier and messier.
Ultimately this is opinion-based, but I would say usable data will make your future self happier.
That said, let me go through your options:
Handle the corrupted records inside Spark job X. You said it: that would be hating yourself and your team, condemning them to work that could be avoided.
Fix the corrupted files at the ETL layer and store them in a "fixed" layer. You said it too: too much overhead. You will be continually tempted to delete the first layer; in fact, I forecast you would end up with a lifecycle policy that automatically gets rid of old objects to save cost.
Fix the files (remove the first two lines) at the NiFi layer. This seems neat and honest; no one can tell you "that is crazy". The only thing you need to make sure of is that the data you will delete is not business-related and that there is no possible future use for it that you cannot foresee now. Even then, I would play it safe in some way, for example by keeping the stripped lines somewhere rather than discarding them outright.
Personally, I love data lakes, and I love the philosophy behind them, but I also like to question everything case by case. I have lots of data in flat files, JSON, and CSV, and a lot of production workload based on it. But the most beautiful part of my data lake is not really the purely unprocessed data: we found it extremely powerful to do a first minimal cleanup and, when possible -for data that is fundamentally inserts rather than updates-, to also convert it to Parquet or ORC and even compress it with Snappy. And I can tell you that I really enjoy using that data, even running queries on it directly. Raw data, yes, but usable.
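For reference, a minimal sketch of that kind of cleanup-and-convert step, assuming PySpark and hypothetical bucket paths (not my actual pipeline):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw_to_curated").getOrCreate()

# Hypothetical raw layer location containing plain CSV files.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://datalake-raw/some_table/"))

# Hypothetical curated location: columnar Parquet, snappy-compressed
# (snappy is also Spark's default codec for Parquet).
(df.write
   .mode("overwrite")
   .option("compression", "snappy")
   .parquet("s3://datalake-curated/some_table/"))
```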