Apache Hudi
Basics
- A data lake usually stores data in file-based storage.
- Apache Hudi is a table format layered on top of file formats such as Parquet. It adds support for updates, deletes, and ACID guarantees on the dataset.
- Plain file formats such as Parquet or ORC do not provide this functionality on their own.
- It also helps with data versioning and rollback.
Upserts
- For upserts, Hudi rewrites only the part file(s) that contain the updated records instead of the complete partition. (A partition is a directory that groups records sharing a partition value; a part file is one data file inside that directory.)
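The file-level rewrite can be pictured with a toy model. This is a minimal sketch in plain Python, not Hudi code: the `Partition` type, the `upsert` function, and the `part-new` file name are all hypothetical, and a partition is modelled as a dict of part-file names to record dicts.

```python
from typing import Dict, Set

# Toy model (hypothetical): a partition maps part-file names to {record_key: value}.
Partition = Dict[str, Dict[str, str]]


def upsert(partition: Partition, updates: Dict[str, str],
           new_file: str = "part-new") -> Set[str]:
    """Apply updates in place and return the names of the part files rewritten."""
    rewritten: Set[str] = set()
    remaining = dict(updates)
    for name, records in list(partition.items()):
        # Find the updates whose keys already live in this part file.
        hits = {k: v for k, v in remaining.items() if k in records}
        if hits:
            # Only this part file is rewritten; the others stay untouched.
            partition[name] = {**records, **hits}
            rewritten.add(name)
            for k in hits:
                del remaining[k]
    if remaining:
        # Keys not found anywhere are inserts and go to a new part file.
        partition[new_file] = remaining
        rewritten.add(new_file)
    return rewritten


partition = {"part-0": {"a": "1", "b": "2"}, "part-1": {"c": "3"}}
touched = upsert(partition, {"c": "30"})
```

Here only `part-1` is rewritten, while `part-0` is left exactly as it was, which is the saving compared with rewriting the whole partition.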
Queries
The two main query types are:
- Snapshot queries: return the latest committed state of the table.
- Incremental queries: return only the data written after a given commit time.
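With the Spark datasource, the query type is normally selected through reader options. The sketch below only builds the option dictionaries; the helper name `hudi_read_options`, the commit time, and the table path are hypothetical, and the option keys are written as I understand them from the Hudi Spark datasource configuration, so they should be checked against the Hudi version in use.

```python
from typing import Dict, Optional


def hudi_read_options(begin_instant: Optional[str] = None) -> Dict[str, str]:
    """Snapshot options by default; incremental options if a commit time is given."""
    if begin_instant is None:
        return {"hoodie.datasource.query.type": "snapshot"}
    return {
        "hoodie.datasource.query.type": "incremental",
        # Only data committed after this instant is returned (assumed key name).
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


snapshot_opts = hudi_read_options()
incremental_opts = hudi_read_options("20240101000000")  # hypothetical commit time

# Assuming a SparkSession `spark` with the Hudi bundle available and a
# hypothetical table path:
# df = spark.read.format("hudi").options(**incremental_opts).load("/tmp/hudi/trips")
```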
Table types
There are two table types:
- Copy on Write: stores data in Parquet and merges updates synchronously during the write.
- Merge on Read: stores data as columnar base files (e.g. Parquet) plus row-based delta files (e.g. Avro). Updates are written to the delta files, so this type is mostly used for near-real-time (NRT) or real-time workloads.
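The trade-off between the two table types can be sketched with a toy model. This is plain Python under simplifying assumptions, not Hudi's implementation: both classes and their attributes are hypothetical, the base file is a dict, and the delta files are a list of dicts.

```python
from typing import Dict, List


class CopyOnWriteTable:
    """Toy model: updates are merged into the base data at write time."""

    def __init__(self, base: Dict[str, int]) -> None:
        self.base = dict(base)
        self.rewrites = 0

    def upsert(self, updates: Dict[str, int]) -> None:
        # Synchronous merge: a new version of the base data is produced per write.
        self.base = {**self.base, **updates}
        self.rewrites += 1

    def read(self) -> Dict[str, int]:
        # Reads are cheap: the base data is already merged.
        return dict(self.base)


class MergeOnReadTable:
    """Toy model: updates are appended to delta logs and merged at read time."""

    def __init__(self, base: Dict[str, int]) -> None:
        self.base = dict(base)
        self.delta_log: List[Dict[str, int]] = []

    def upsert(self, updates: Dict[str, int]) -> None:
        # Writes are cheap: append to the delta log, leave the base untouched.
        self.delta_log.append(dict(updates))

    def read(self) -> Dict[str, int]:
        # Reads pay the merge cost: apply deltas in commit order over the base.
        merged = dict(self.base)
        for delta in self.delta_log:
            merged.update(delta)
        return merged


cow = CopyOnWriteTable({"a": 1})
mor = MergeOnReadTable({"a": 1})
for table in (cow, mor):
    table.upsert({"a": 2, "b": 3})
```

Both tables return the same data on read. The difference is where the merge work happens: at write time for Copy on Write, at read time for Merge on Read, which is why the latter suits low-latency ingestion.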
References
- https://medium.com/@parth09/apache-hudi-the-basics-5c1848ca12e0
- https://medium.com/apache-hudi-blogs/employing-the-right-indexes-for-fast-updates-deletes-in-apache-hudi-814d863635f6