Extract: Scrape raw data from all the source systems, e.g., transactions, sensors, log files, experiments, tables, bytestreams, etc.
Transform: Apply a series of rules or functions to wrangle the data into the target schema(s)/format(s)
Load: Load data into a data storage solution
ETL (Traditional Warehouses)
Extract or scrape from APIs or log files,
transform into a common schema/format,
load in parallel into the “data warehouse”
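A minimal sketch of this flow in Python, assuming a JSON-lines log as the source; SQLite stands in for the warehouse and the schema is purely illustrative.

```python
import json, sqlite3
from datetime import datetime

# --- Extract: pull raw events (here, two hard-coded JSON-lines log records) ---
raw_lines = [
    '{"user": "u1", "amount": "19.99", "ts": "2024-01-03T10:00:00Z"}',
    '{"user": "u2", "amount": "5.00",  "ts": "2024-01-03T11:30:00Z"}',
]
events = [json.loads(line) for line in raw_lines]

# --- Transform: coerce every event into the warehouse's common schema ---
rows = [
    (
        e["user"],
        float(e["amount"]),
        datetime.fromisoformat(e["ts"].replace("Z", "+00:00")).isoformat(),
    )
    for e in events
]

# --- Load: write the conformed rows into the "warehouse" (SQLite stand-in) ---
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (user TEXT, amount REAL, ts TEXT)")
con.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)
print(con.execute("SELECT COUNT(*), SUM(amount) FROM transactions").fetchone())
```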
ELT (e.g. Snowflake)
Extract or scrape from APIs or log files,
Load without doing a lot of transformation,
with transformations done in SQL
Faster to get going, and more scalable, but requires more data warehousing knowledge (& may be more expensive).
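A minimal ELT sketch along the same lines, with SQLite again standing in for a cloud warehouse such as Snowflake (table names are illustrative): the raw records are landed as-is, and the transformation happens afterwards in SQL.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# --- Extract + Load: dump raw strings straight into a landing table ---
con.execute("CREATE TABLE raw_transactions (user TEXT, amount TEXT, ts TEXT)")
con.executemany(
    "INSERT INTO raw_transactions VALUES (?, ?, ?)",
    [("u1", "19.99", "2024-01-03T10:00:00Z"),
     ("u2", "not-a-number", "2024-01-03T11:30:00Z")],
)

# --- Transform (in SQL): cast, filter, and reshape on top of the raw data ---
con.execute("""
    CREATE VIEW transactions AS
    SELECT user,
           CAST(amount AS REAL) AS amount,
           ts
    FROM raw_transactions
    WHERE CAST(amount AS REAL) > 0      -- rows that fail the cast become 0 and are dropped
""")
print(con.execute("SELECT user, amount FROM transactions").fetchall())
```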
ET (Data Lakes)
No need to “manage” data up front: extract directly into a data lake, then transform later for specific use cases.
Data is dumped in cheaply and massaged as needed for various use-cases
Usually code-centric (Spark)
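A sketch of the “extract now, transform later” pattern in PySpark (assumes pyspark is installed; the lake paths and field names are illustrative).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("et-sketch").getOrCreate()

# Extract + land: raw JSON logs sit in the lake as-is
# (in practice this is an object-store path like s3://lake/raw/events/).
raw = spark.read.json("lake/raw/events/")

# Transform on demand, only for this use case: daily revenue per user.
daily_revenue = (
    raw.withColumn("day", F.to_date("ts"))
       .groupBy("user", "day")
       .agg(F.sum("amount").alias("revenue"))
)

# The derived dataset goes back into the lake for others to reuse.
daily_revenue.write.mode("overwrite").parquet("lake/curated/daily_revenue/")
```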
Data Warehouses ~ 1990s
- “Single source of truth”: A central, organized repository of data used for analytics throughout an enterprise.
- Design the uber-schema up front: all of the rectangular tables you’d ever want.
- Extract from trusted sources
- Transform to warehouse schema using custom tools
- Load data warehouse
- Old-school ETL solution: Informatica
- Warehouses expect structure (see the schema sketch below)
- Transformation is costly, not just in compute time but in engineering time
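A sketch of the schema-up-front, schema-on-write style a warehouse expects; SQLite stands in for the warehouse, and the tables and constraints are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# The warehouse schema is declared before any data arrives.
con.executescript("""
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,      -- keys and types fixed up front
        name       TEXT NOT NULL,
        price      REAL CHECK (price >= 0)
    );
    CREATE TABLE sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES products(product_id),
        quantity   INTEGER CHECK (quantity > 0),
        sold_at    TEXT NOT NULL
    );
""")

# Loads that violate the schema are rejected at write time ("schema on write").
con.execute("INSERT INTO products VALUES (1, 'widget', 9.99)")
try:
    con.execute("INSERT INTO products VALUES (2, 'broken widget', -5.00)")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```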
Data Lake ~ 2010s
- Emerged during the Hadoop/Spark revolution
- “Landing zone”: unconstrained storage for any and all data
- Data is then analyzed on demand
- Extract into files/storage
- Load into storage (easy!)
- Transform on demand for any use.
- Create new files in the lake, cataloging them as they are created for reuse (see the sketch below)
- Often code-centric
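A sketch of transforming on demand and cataloging new files as they are created, in plain standard-library Python rather than any specific lake tool; the paths and catalog format are illustrative.

```python
import csv, json, time
from pathlib import Path

LAKE = Path("lake")
CATALOG = LAKE / "_catalog.jsonl"   # one JSON entry per derived dataset

def register(path: Path, source: str, description: str) -> None:
    """Append a catalog entry so the new file can be discovered and reused."""
    entry = {"path": str(path), "source": source,
             "description": description, "created_at": time.time()}
    CATALOG.parent.mkdir(parents=True, exist_ok=True)
    with CATALOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def build_user_totals(raw_events: Path) -> Path:
    """Transform only when a use case needs it: per-user totals from raw events."""
    totals: dict[str, float] = {}
    with raw_events.open() as f:
        for line in f:
            event = json.loads(line)
            totals[event["user"]] = totals.get(event["user"], 0.0) + float(event["amount"])

    out = LAKE / "curated" / "user_totals.csv"
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user", "total"])
        writer.writerows(sorted(totals.items()))

    # New file goes into the catalog as soon as it exists.
    register(out, source=str(raw_events), description="total amount per user")
    return out
```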
[!NOTE] ETLT & the Lakehouse
Modern solutions are likely many-to-many. Sometimes start with a data lake. Empower data scientists to work on ad-hoc use cases. Allow datasets that “graduate” to a carefully managed warehouse. Some datasets may be loaded directly into a data warehouse.
Databricks uses a Lakehouse, which makes managing such a system much easier.
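A sketch of a dataset “graduating” from ad-hoc lake files into a managed warehouse table, assuming pyspark and an existing `analytics` database; the path, checks, and table name are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("graduate-sketch").getOrCreate()

# The ad-hoc dataset a data scientist built in the lake...
candidate = spark.read.parquet("lake/curated/daily_revenue/")

# ...passes some basic checks before being promoted...
assert candidate.filter("revenue < 0").count() == 0, "negative revenue rows"

# ...and is then written as a managed, governed table in the warehouse layer.
candidate.write.mode("overwrite").saveAsTable("analytics.daily_revenue")
```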
Important Considerations
Data Discovery, Data Assessment
- Ad-Hoc: End-users land data, explore it, label it
- Systematic: Crawl/index the data lake for files (see the sketch after this list)
- E.g., for CSV/JSON files
- Very content-centric: really a form of analytics/prediction
- Try to figure out what type of data you have.
- AI + People!
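A sketch of the systematic flavor: crawl the lake, guess each file’s format, and peek at its columns. The paths and heuristics are illustrative; real discovery tools go much further (sampling, type inference, ML-based labeling).

```python
import csv, json
from pathlib import Path

def profile(path: Path) -> dict:
    """Guess format and column names for one file by inspecting its content."""
    info = {"path": str(path), "bytes": path.stat().st_size,
            "format": "unknown", "columns": []}
    head = path.read_text(errors="replace")[:4096]
    lines = head.splitlines()
    first_line = lines[0] if lines else ""

    if first_line.lstrip().startswith("{"):          # looks like JSON lines
        info["format"] = "jsonl"
        try:
            info["columns"] = sorted(json.loads(first_line).keys())
        except json.JSONDecodeError:
            pass
    elif "," in first_line:                          # looks like CSV with a header row
        info["format"] = "csv"
        info["columns"] = next(csv.reader([first_line]))
    return info

# Crawl/index: one catalog row per file found in the lake.
index = [profile(p) for p in Path("lake").rglob("*") if p.is_file()]
for row in index:
    print(row)
```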
Data Quality & Integrity
- Boolean integrity checks
- Often specified by people, also “mined” by AI
- Data changes ALL the time, especially from clients.
- Enforced: can “reject” or “sequester” records that violate a check (see the sketch after this list)
- E.g., no two products may have the same product ID!
- Data entities (e.g., students, courses, employees for a university)
- Relationships between data
- Constraints
- Data Lineage – where did it come from?
- Audit Trails of Usage – who ran this job, and what did it do?
- Version info for all the above
- Timestamps
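A sketch of enforced Boolean integrity checks with reject/sequester semantics, assuming pandas is installed; the column names, checks, and the simple `source` lineage field are illustrative.

```python
import pandas as pd

products = pd.DataFrame({
    "product_id": [1, 2, 2, 3],
    "price":      [9.99, 5.00, 5.00, -1.00],
    "source":     ["feed_a", "feed_a", "feed_b", "feed_b"],  # simple lineage: where each row came from
})

# Boolean integrity checks (typically written by people, sometimes "mined" by AI).
is_duplicate   = products["product_id"].duplicated(keep="first")  # no two products share an ID
negative_price = products["price"] < 0
violations     = is_duplicate | negative_price

# Enforce: clean rows continue into the pipeline; violating rows are
# "sequestered" for review instead of silently disappearing.
accepted    = products[~violations]
sequestered = products[violations].copy()
sequestered["reason"] = is_duplicate[violations].map(
    lambda dup: "duplicate product_id" if dup else "negative price"
)

print(accepted)
print(sequestered)
```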
Operationalization (Ops):
- When do jobs kick off, and what do they do?
- How are tests registered, exceptions handled, people alerted?
- How do experiments “graduate” into processes?
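A sketch of these ops questions in plain Python rather than a real orchestrator such as Airflow; the registries, schedule strings, and alert channel are illustrative stand-ins.

```python
import logging, traceback

logging.basicConfig(level=logging.INFO)
JOBS, TESTS = {}, {}

def job(name, schedule):
    """Register a job and when it should kick off (e.g. '02:00 daily')."""
    def wrap(fn):
        JOBS[name] = {"run": fn, "schedule": schedule}
        return fn
    return wrap

def test(job_name):
    """Register a data test that must pass after the job runs."""
    def wrap(fn):
        TESTS.setdefault(job_name, []).append(fn)
        return fn
    return wrap

def alert(message):
    # Stand-in for paging / Slack / email.
    logging.error("ALERT: %s", message)

def run(name):
    """Run one job, run its registered tests, and alert a person if anything breaks."""
    try:
        JOBS[name]["run"]()
        for check in TESTS.get(name, []):
            assert check(), f"test {check.__name__} failed for {name}"
    except Exception:
        alert(f"job {name} failed:\n{traceback.format_exc()}")

@job("load_transactions", schedule="02:00 daily")
def load_transactions():
    logging.info("loading transactions...")

@test("load_transactions")
def row_count_positive():
    return True   # e.g. SELECT COUNT(*) > 0 against the loaded table

run("load_transactions")
```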
Feedback:
- Some outputs are datasets in their own right: if you produce a table, that’s also data!
- Many are new processes that generate new data feeds!
- ML models: Constantly yielding predictions.
- Compare old predictions to new predictions?
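A sketch of closing the loop on model predictions: treat each run’s predictions as data and compare old to new. The values and drift threshold are illustrative.

```python
import statistics

old_predictions = [0.71, 0.64, 0.69, 0.73, 0.68]   # yesterday's model scores
new_predictions = [0.52, 0.49, 0.55, 0.47, 0.51]   # today's model scores

old_mean = statistics.mean(old_predictions)
new_mean = statistics.mean(new_predictions)

# If the prediction distribution shifts sharply, feed that back into the
# pipeline: alert, retrain, or investigate the upstream data feeds.
if abs(new_mean - old_mean) > 0.1:
    print(f"prediction drift: mean moved from {old_mean:.2f} to {new_mean:.2f}")
```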