Data Warehousing - Data Lake Cheat Sheet (DRAFT) by [deleted]

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Data Lake

The term “data lake” has many definitions throughout the industry, ranging from a dumping ground for “to-be-used” data to a more or less traditional enterprise data warehouse (EDW) approach implemented on a big data platform. We define the data lake as an analytic system that allows data consolidation and analytic access with tunable governance. It consists of a distributed, scalable file system or object store, such as HDFS (Hadoop Distributed File System) or Amazon S3, coupled with one or more fit-for-purpose query and processing engines such as Apache Spark, Drill, Impala, and Presto.

Landing Area

This layer stores source data in its full fidelity. It lowers the barriers to onboarding new data, allowing early analytic access for new insights and providing the raw materials for “to-be” data products. Only very basic governance policies are required, in the form of metadata (very often a partitioning scheme) and information lifecycle management (security and disposition).
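As a rough illustration of what "metadata as a partitioning scheme" plus lifecycle rules might look like, here is a minimal Python sketch. It assumes a Hive-style `key=value` directory layout, which is a common convention but not something this cheat sheet prescribes. The names `LandingPolicy`, `landing_path`, and `is_expired`, and the partition keys `source` and `ingest_date`, are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class LandingPolicy:
    """Hypothetical minimal governance record for one landed source."""
    source: str          # name of the originating system
    retention_days: int  # disposition: how long raw files are kept
    restricted: bool     # security: whether access is limited


def landing_path(root: str, policy: LandingPolicy, ingest_date: date, filename: str) -> str:
    """Build a partitioned path; the file itself is stored unmodified."""
    return "/".join([
        root.rstrip("/"),
        f"source={policy.source}",
        f"ingest_date={ingest_date.isoformat()}",
        filename,
    ])


def is_expired(policy: LandingPolicy, ingest_date: date, today: date) -> bool:
    """True once a partition has outlived its retention window."""
    return today > ingest_date + timedelta(days=policy.retention_days)
```

With this layout, the path alone tells a query engine where the data came from and when it arrived, and the retention check gives a simple rule for disposition. Real deployments typically keep such policies in a catalog rather than in code.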

Data Lake

Data may be graduated from the landing area to the data lake. This data has basic governance policies, including data quality, retention, and metadata. It often has standard views or projections, allowing users to interact via familiar tools such as SQL, data exploration, and business intelligence tools.
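One way to picture graduation is as a basic quality gate followed by a standard projection. The sketch below is illustrative only: the completeness rule, the function names `graduate` and `project`, and the record layout are all assumptions, and a real lake would normally express this in its query engine rather than in plain Python.

```python
from dataclasses import dataclass


@dataclass
class GraduationResult:
    accepted: list  # records that passed the quality check
    rejected: list  # records held back in the landing area


def graduate(records, required_fields):
    """Split landed records by a basic completeness check (hypothetical rule)."""
    accepted, rejected = [], []
    for rec in records:
        complete = all(rec.get(f) not in (None, "") for f in required_fields)
        (accepted if complete else rejected).append(rec)
    return GraduationResult(accepted, rejected)


def project(records, columns):
    """A 'standard view': keep only the agreed columns, in a fixed order."""
    return [{c: rec.get(c) for c in columns} for rec in records]


landed = [
    {"id": 1, "email": "a@example.com", "raw_blob": "..."},
    {"id": 2, "email": ""},  # fails the completeness check
]
result = graduate(landed, required_fields=["id", "email"])
view = project(result.accepted, columns=["id", "email"])
```

Rejected records stay in the landing area at full fidelity, so nothing is lost by applying the gate.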

Data Science Workspace

This is the foundry of new data products. The work of data science may result in new data products, including new EDW facts.

Big Data Warehouse

This layer is fully governed, providing accurate, reliable, and well-defined information to data consumers. The big data warehouse may be hosted alongside the broader data lake, or combined with traditional relational or MPP (massively parallel processing) database technology.

Big Data Pyramid

The big data pyramid illustrates the different layers of the lake and what they represent from a data consumption and governance view.

Warehouse vs Lake