2022-05-15

Apache Hudi

Basics

A data lake usually stores data in file-based storage.
Apache Hudi is a table format layered on top of those files that adds support for updates, deletes, and ACID properties on a dataset.
This functionality is not available in plain file formats such as Parquet or ORC.
It also helps with data versioning and rollback.

Upserts

For upserts, Hudi rewrites only the part file that contains the updated record instead of the complete partition. (A partition is a directory that groups records by partition value; a part file is a single data file inside that directory.)
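To make the part-file idea concrete, here is a minimal toy model in plain Python. It is not Hudi's API: the `Partition` type, the `upsert` function, the file names, and the rule that new keys go to the smallest part file are all assumptions made for illustration.

```python
from typing import Dict, List

# Toy model: a partition maps part-file names to the records they hold
# (record key -> record payload). None of this is Hudi's real API.
Partition = Dict[str, Dict[str, dict]]


def upsert(partition: Partition, key: str, record: dict) -> List[str]:
    """Apply one upsert and return the names of the part files rewritten."""
    for file_name, records in partition.items():
        if key in records:
            # Update: rewrite only the part file that already holds the key.
            partition[file_name] = {**records, key: record}
            return [file_name]
    # Insert: the key is new, so (by assumption) add it to the smallest file.
    target = min(partition, key=lambda name: len(partition[name]))
    partition[target] = {**partition[target], key: record}
    return [target]


partition: Partition = {
    "part-0001.parquet": {"u1": {"name": "Ann"}, "u2": {"name": "Bob"}},
    "part-0002.parquet": {"u3": {"name": "Cid"}},
}
rewritten = upsert(partition, "u3", {"name": "Cyd"})
```

Only the part file holding `u3` is replaced; the other file in the same partition is left untouched, which is the saving compared with rewriting the whole partition.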

Queries

Two query types are covered here:

Snapshot queries: Return the latest data as of the most recent commit
Incremental queries: Return only the data written after a given commit time
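The difference between the two query types can be sketched with a toy commit timeline in plain Python. The `Commit` type, the function names, and the timestamp strings are hypothetical and chosen only for illustration; this is not how Hudi is queried in practice.

```python
from typing import Dict, List, Tuple

# Toy commit timeline: (commit_time, {record_key: value}) in commit order.
# Commit times are sortable strings, used here purely for illustration.
Commit = Tuple[str, Dict[str, int]]


def snapshot_query(timeline: List[Commit]) -> Dict[str, int]:
    """Latest value of every record across all commits."""
    state: Dict[str, int] = {}
    for _, changes in timeline:
        state.update(changes)
    return state


def incremental_query(timeline: List[Commit], begin_time: str) -> Dict[str, int]:
    """Latest value of the records changed strictly after begin_time."""
    changed: Dict[str, int] = {}
    for commit_time, changes in timeline:
        if commit_time > begin_time:
            changed.update(changes)
    return changed


timeline: List[Commit] = [
    ("20220515100000", {"a": 1, "b": 2}),
    ("20220515110000", {"b": 20}),
    ("20220515120000", {"c": 3}),
]
latest = snapshot_query(timeline)
delta = incremental_query(timeline, "20220515100000")
```

Here `latest` holds every record at its newest value, while `delta` holds only the records touched by the two later commits, which is what a downstream job would need to process.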

Table types

There are two table types:

Copy on Write: Stores data in Parquet and merges updates synchronously during the write
Merge on Read: Stores data in a combination of columnar (e.g. Parquet) and row-based (e.g. Avro) files
1. Updates are written to delta files
2. Mostly used for near-real-time (NRT) or real-time workloads
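The trade-off between the two table types can be sketched with a toy model in plain Python. The class names, the `base_rewrites` counter, and the `compact` step are assumptions of this sketch rather than Hudi's API; real tables work on files, not in-memory dictionaries.

```python
from typing import Dict, List, Tuple


class CopyOnWriteTable:
    """Toy model: every update rewrites the base file at write time."""

    def __init__(self, base: Dict[str, int]) -> None:
        self.base = dict(base)
        self.base_rewrites = 0

    def write(self, key: str, value: int) -> None:
        # The merge happens synchronously: build a new version of the base file.
        self.base = {**self.base, key: value}
        self.base_rewrites += 1

    def read(self) -> Dict[str, int]:
        return dict(self.base)


class MergeOnReadTable:
    """Toy model: updates are appended to a delta log and merged when read."""

    def __init__(self, base: Dict[str, int]) -> None:
        self.base = dict(base)
        self.delta_log: List[Tuple[str, int]] = []
        self.base_rewrites = 0

    def write(self, key: str, value: int) -> None:
        # Cheap write: append only, the base file is left untouched.
        self.delta_log.append((key, value))

    def read(self) -> Dict[str, int]:
        # Merge base and delta records at query time, in write order.
        merged = dict(self.base)
        for key, value in self.delta_log:
            merged[key] = value
        return merged

    def compact(self) -> None:
        # Fold the delta log into a new base file and clear the log.
        self.base = self.read()
        self.delta_log.clear()
        self.base_rewrites += 1


cow = CopyOnWriteTable({"a": 1})
mor = MergeOnReadTable({"a": 1})
for table in (cow, mor):
    table.write("a", 2)
    table.write("b", 3)
```

Both tables return the same data on read, but the copy-on-write table has rewritten its base twice while the merge-on-read table has rewritten nothing yet. That cheaper write path is why merge on read suits near-real-time ingestion, at the cost of extra merge work when reading.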
