Our data loads into HDFS daily, partitioned by a date column. The issue is that each partition contains only small files, less than 50 MB each, so reading the data from all of these partitions to load it into the next table takes hours. How can we address this issue?
I'd suggest running an end-of-day job that coalesces/combines the small files into significantly larger files, before Spark reads them for the downstream load.
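A minimal sketch of such a compaction job, assuming Parquet files and hypothetical paths (tune the target file count to the actual size of each partition):

```scala
import org.apache.spark.sql.SparkSession

object CompactDailyPartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compact-daily-partition").getOrCreate()

    // Hypothetical locations; replace with your actual HDFS paths.
    val partitionPath = "hdfs:///data/events/date=2023-01-01"
    val compactedPath = partitionPath + "_compacted"

    val df = spark.read.parquet(partitionPath)

    // Aim for ~128-256 MB files: derive this from the partition's total
    // size divided by the desired file size.
    val targetFiles = 8

    df.repartition(targetFiles)
      .write
      .mode("overwrite")
      .parquet(compactedPath)

    // Then swap the compacted output in place of the original small files
    // (e.g. with hdfs dfs -mv / -rm), or point downstream jobs at compactedPath.
    spark.stop()
  }
}
```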
For further reading, the Cloudera blog post/docs "Partition Management in Hadoop" discuss several techniques for addressing exactly this problem.
Pick whichever technique from the Cloudera blog best matches your requirements. Hope this helps!
Another good option is open-source Delta Lake; if you are on Databricks, use their Delta Lake offering to get a richer set of features.
Example Maven coordinates are shown below.
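For example, the delta-core coordinates as an sbt dependency (the version is illustrative; match the Delta release to your Spark/Scala version, e.g. delta-core 0.6.x for Spark 2.4.x and 1.0.x for Spark 3.1.x):

```scala
// build.sbt -- version is illustrative; check the Delta Lake compatibility matrix
libraryDependencies += "io.delta" %% "delta-core" % "0.6.1"
```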
Using Delta Lake you can insert/update/delete the data as you want, which reduces maintenance steps.
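For instance, with the DeltaTable API (a spark-shell style sketch; the table path, column names, and the `updatesDf` DataFrame are assumptions):

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions._

// Hypothetical Delta table path and column names
val table = DeltaTable.forPath(spark, "hdfs:///data/events_delta")

// Delete rows matching a predicate
table.delete("date = '2023-01-01' AND source = 'bad_feed'")

// Update rows in place: set status = 'PROCESSED' where status = 'PENDING'
table.update(expr("status = 'PENDING'"), Map("status" -> lit("PROCESSED")))

// Upsert new rows from updatesDf (a DataFrame with a matching schema)
table.as("t")
  .merge(updatesDf.as("s"), "t.id = s.id")
  .whenMatched.updateAll()
  .whenNotMatched.insertAll()
  .execute()
```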
See also: Compacting Small Files in Delta Lakes.
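That post covers the compaction pattern: rewrite a partition into fewer files in place. Roughly (a sketch with assumed path, predicate, and file count; on Databricks, the OPTIMIZE command does this for you):

```scala
// Rewrite one date partition of a Delta table into fewer, larger files
val path = "hdfs:///data/events_delta"   // hypothetical table path
val predicate = "date = '2023-01-01'"    // partition being compacted
val numFiles = 8                         // target number of output files

spark.read
  .format("delta")
  .load(path)
  .where(predicate)
  .repartition(numFiles)
  .write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", predicate)   // only replace the matching partition
  .option("dataChange", "false")       // mark as a rewrite, not new data
  .save(path)
```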