Apache Hudi
Basics
- A data lake usually stores data in file-based storage.
- Apache Hudi is a table format (a storage layer on top of those files) that supports updates, deletes, and ACID properties on a dataset.
- This functionality is not available from plain file formats such as Parquet or ORC on their own.
- It also helps with data versioning and rollback.
Upserts
- For upserts, Hudi rewrites only the part file that contains the updated record, not the complete partition. (A partition is a directory that groups records sharing the same partition value; a part file is a single data file inside that directory.)
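
To make the part-file idea concrete, here is a minimal toy model in plain Python. It is not Hudi's API: the `upsert` helper, the dictionary layout, and the file names are all hypothetical, and the rule of placing new records in the smallest part file is an assumption made only to keep the sketch short.

```python
# Toy model of a single partition: part-file name -> {record_key: row}.
# Hypothetical names throughout; this only illustrates the idea and is not Hudi code.

def upsert(partition, updates):
    """Apply updates and return the names of the part files that were rewritten."""
    rewritten = set()
    for key, row in updates.items():
        # Locate the part file that already holds this record key, if any.
        target = next(
            (name for name, rows in partition.items() if key in rows), None
        )
        if target is None:
            # Assumption: a new record goes to the smallest existing part file.
            target = min(partition, key=lambda name: len(partition[name]))
        # Build a new copy of that one file's contents, mimicking a rewrite.
        partition[target] = {**partition[target], key: row}
        rewritten.add(target)
    return rewritten


partition = {
    "part-0001.parquet": {"a": {"value": 1}, "b": {"value": 2}},
    "part-0002.parquet": {"c": {"value": 3}, "d": {"value": 4}},
}

rewritten = upsert(partition, {"c": {"value": 30}})
print(rewritten)  # {'part-0002.parquet'}
```

Only `part-0002.parquet` is touched; `part-0001.parquet` stays exactly as it was, even though both files belong to the same partition.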
Queries
The two main query types are:
- Snapshot queries: return the latest data as of the most recent commit.
- Incremental queries: return only the data written after a given commit time.
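
The difference between the two query types can be sketched with a toy commit timeline in plain Python. This is an illustration only, not Hudi's API: the `commits` list, the timestamp strings, and both function names are hypothetical, and the sketch assumes the given commit time is exclusive.

```python
# Toy commit timeline: (commit_time, {record_key: value}).
# Hypothetical data; real Hudi commits live on a timeline inside the table.
commits = [
    ("20240101000000", {"a": 1, "b": 2}),
    ("20240102000000", {"b": 20, "c": 3}),
    ("20240103000000", {"a": 10}),
]


def snapshot_query(commits):
    """Return the latest value of every record across all commits."""
    table = {}
    for _, changes in sorted(commits, key=lambda c: c[0]):
        table.update(changes)
    return table


def incremental_query(commits, begin_time):
    """Return the latest value of records changed strictly after begin_time."""
    changed = {}
    for commit_time, changes in sorted(commits, key=lambda c: c[0]):
        if commit_time > begin_time:
            changed.update(changes)
    return changed


latest = snapshot_query(commits)
since_first = incremental_query(commits, "20240101000000")
print(latest)
print(since_first)
```

Here the snapshot result contains all three records, while an incremental query starting from the last commit time returns nothing, because no data was written after it. In Spark the query type is normally selected through the `hoodie.datasource.query.type` read option; check the Hudi documentation for the option names that apply to your version.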
Table types
There are two table types:
- Copy on Write: stores data in Parquet and merges updates synchronously during the write.
- Merge on Read: stores data in a combination of columnar (e.g. Parquet) and row-based (e.g. Avro) files.
  - Updates are written to delta files.
  - Mostly used for near-real-time (NRT) or real-time workloads.
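
The trade-off between the two table types can be sketched as a toy model in plain Python. The two classes below are hypothetical and do not correspond to any Hudi class: a dictionary stands in for the columnar base file, and a list of dictionaries stands in for the row-based delta files.

```python
class CopyOnWriteTable:
    """Toy model: every upsert merges immediately and rewrites the base file."""

    def __init__(self, rows):
        self.base = dict(rows)       # stands in for a Parquet base file
        self.base_rewrites = 0

    def upsert(self, updates):
        # Synchronous merge at write time: produce a new version of the base.
        self.base = {**self.base, **updates}
        self.base_rewrites += 1

    def read(self):
        # Reads are cheap: the base file is already up to date.
        return dict(self.base)


class MergeOnReadTable:
    """Toy model: upserts are appended to delta files and merged when read."""

    def __init__(self, rows):
        self.base = dict(rows)       # stands in for a Parquet base file
        self.delta_files = []        # stands in for row-based (Avro) delta files

    def upsert(self, updates):
        # Cheap write: append the changes, leave the base file untouched.
        self.delta_files.append(dict(updates))

    def read(self):
        # The merge happens here, at read time.
        merged = dict(self.base)
        for delta in self.delta_files:
            merged.update(delta)
        return merged


rows = {"a": 1, "b": 2}
cow = CopyOnWriteTable(rows)
mor = MergeOnReadTable(rows)
for table in (cow, mor):
    table.upsert({"b": 20, "c": 3})

print(cow.read() == mor.read())  # True
```

Both tables return the same data, but they pay the merge cost at different points: Copy on Write pays it on every write, while Merge on Read keeps writes cheap and defers the merge to the reader, which is why it suits near-real-time ingestion. In Spark the table type is normally chosen at write time through the `hoodie.datasource.write.table.type` option (`COPY_ON_WRITE` or `MERGE_ON_READ`); confirm the exact option names against the Hudi documentation for your version.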