Kartikeya Sharma

Data Systems Paradigms

Extract: Scrape raw data from the source systems, e.g., transactions, sensors, log files, experiments, tables, bytestreams, etc.

Transform: Apply a series of rules or functions to wrangle the data into the target schema(s)/format(s)

Load: Write the transformed data into a data storage solution

ETL (Traditional Warehouses)

Extract (or scrape) from an API or log file, transform into a common schema/format, and load in parallel into a "data warehouse". A minimal sketch of the three steps is below.
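A minimal ETL sketch in Python, assuming a newline-delimited JSON log (`events.log`) as the source and SQLite standing in for the warehouse; the file name and the `user_id`/`amount` schema are made up for illustration.

```python
import json
import sqlite3

def extract(path):
    """Extract: read raw events from a newline-delimited JSON log."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def transform(events):
    """Transform: wrangle raw events into a fixed (user_id, amount) schema."""
    return [
        (e["user_id"], float(e["amount"]))
        for e in events
        if "user_id" in e and "amount" in e  # drop malformed rows
    ]

def load(rows, db_path="warehouse.db"):
    """Load: write the already-transformed rows into the storage layer."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS transactions (user_id TEXT, amount REAL)")
    con.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    con.commit()
    con.close()

# The defining property of ETL: transform happens *before* load.
load(transform(extract("events.log")))
```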

ELT (e.g. Snowflake)

Extract (or scrape) from an API or log file, load with minimal upfront transformation, and run the transformations in SQL inside the warehouse

Faster to get going and more scalable, but requires more data-warehousing knowledge (and may be more expensive).
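The same pipeline in ELT order, as a sketch: raw payloads are loaded first, and the transformation happens in SQL inside the store. SQLite stands in here for a warehouse like Snowflake, assuming a build with the JSON1 functions enabled; `events.log` and the extracted fields are the same illustrative assumptions as above.

```python
import sqlite3

con = sqlite3.connect("warehouse.db")

# Load first: dump the raw JSON lines into the warehouse untransformed.
con.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
with open("events.log") as f:
    con.executemany("INSERT INTO raw_events VALUES (?)", ((line,) for line in f))

# Transform later, in SQL, inside the warehouse itself.
con.execute("""
    CREATE TABLE IF NOT EXISTS transactions AS
    SELECT json_extract(payload, '$.user_id') AS user_id,
           CAST(json_extract(payload, '$.amount') AS REAL) AS amount
    FROM raw_events
    WHERE json_extract(payload, '$.amount') IS NOT NULL
""")
con.commit()
con.close()
```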

ET (Data Lakes)

No need to "manage" the data up front. Extract directly into a data lake; transform later for specific use cases.

Data is dumped in cheaply and massaged as needed for various use cases. Usually code-centric (e.g., Spark), as in the sketch below.
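A code-centric sketch of the ET pattern in PySpark: raw JSON sits in the lake untouched, and each use case transforms it on read. The bucket paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("et-sketch").getOrCreate()

# Extract: raw JSON was dumped into the lake as-is; schema is inferred on read.
raw = spark.read.json("s3://my-lake/raw/events/")

# Transform (later, per use case): aggregate spend per user for one consumer.
spend = (
    raw.where(F.col("amount").isNotNull())
       .groupBy("user_id")
       .agg(F.sum("amount").alias("total_spend"))
)
spend.write.mode("overwrite").parquet("s3://my-lake/curated/user_spend/")
```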

Data Warehouses ~ 1990s

Data Lakes ~ 2010s

> [!NOTE] ETLT & the Lakehouse
> Modern solutions are likely many-to-many. Sometimes start with a data lake: empower data scientists to work on ad-hoc use cases, and allow for datasets that "graduate" to a carefully managed warehouse. Some datasets may be loaded directly into a data warehouse.

Databricks offers a Lakehouse, which makes managing such a system much easier.
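A sketch of the lakehouse idea using the open-source Delta Lake project (which Databricks builds on), assuming the `delta-spark` package is installed and on the classpath; the bucket paths are again hypothetical.

```python
from pyspark.sql import SparkSession

# Spark session wired up for open-source Delta Lake (delta-spark package).
spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

raw = spark.read.json("s3://my-lake/raw/events/")  # hypothetical lake path

# Writing as Delta gives the same cheap lake files warehouse-style
# guarantees: ACID commits, schema enforcement, time travel.
raw.write.format("delta").mode("overwrite").save("s3://my-lake/curated/events_delta/")
spark.read.format("delta").load("s3://my-lake/curated/events_delta/").show(5)
```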

Important Considerations

Data Discovery & Data Assessment
Data Quality & Integrity
Application Metadata
Behavioral Metadata
Change Metadata
Operationalization (Ops)
Feedback