
Snowflake SnowPro Core Certification Cheat Sheet (DRAFT)

Study guide for certification preparation

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Snowflake Database Architecture

1. Service Layer: Accepts SQL requests from users, coordinates queries, manages transactions, and returns results. It holds a cached copy of query results.
2. Compute Layer: Where the actual SQL is executed across the nodes of a virtual warehouse (VW). It holds a cache of the data queried (referred to as Local Disk I/O; SSD storage in reality). All data held is temporary and persists only while the VW is active.
3. Storage Layer: Provides long-term storage of data (Remote Disk, implemented on cloud storage such as Amazon S3 or Azure Blob Storage).

Virtual Warehouses

Overview
• Provide the compute resources for executing SQL SELECT statements and DML operations (DELETE, INSERT, UPDATE, COPY INTO)

• Can be started, stopped, or resized at any time; running queries are not affected, only new queries

• Two types: Standard and Snowpark-optimized
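A minimal sketch of creating and resizing a warehouse (the warehouse name my_wh is a hypothetical example):

  -- Create a warehouse that suspends itself when idle and resumes on demand
  CREATE WAREHOUSE my_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 300      -- seconds of inactivity before suspending
    AUTO_RESUME = TRUE;

  -- Resize at any time; running queries keep their current resources,
  -- only new queries use the new size
  ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'MEDIUM';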

Credit Usage and Billing

• Snowflake utilizes per-second billing (with a 60-second minimum each time the warehouse starts), so warehouses are billed only for the credits they actually consume

• For a multi-cluster warehouse, the number of credits billed is calculated based on the warehouse size and the number of clusters that run within the time period
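A rough worked example, assuming the commonly published credit rates (X-Small = 1 credit/hour, doubling with each size, so Medium = 4 credits/hour):

• A Medium warehouse running for 30 minutes consumes about 4 × 0.5 = 2 credits.
• A Medium multi-cluster warehouse with 3 clusters running for 30 minutes consumes about 4 × 3 × 0.5 = 6 credits.
• A Medium warehouse that runs for only 10 seconds is still billed for the 60-second minimum: 4 × 60/3600 ≈ 0.07 credits.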

Data Loading

• Data loading performance is not necessarily improved by increasing the size of a warehouse. It is influenced more by the number of files being loaded (and the size of each file) than by the size of the warehouse.

• Warehouse size can impact the amount of time required to execute queries, especially for more complex queries. Larger warehouses have more compute resources available to process queries, but are not necessarily faster for small, basic queries.

Query Processing and Concurrency
• The number of queries that a warehouse can process concurrently is determined by the size and complexity of each query

• If queries are queuing more than desired, create another warehouse and redirect the queries to it manually. Resizing a warehouse can enable limited scaling for query concurrency, but this is mainly intended for improving query performance.

Multi-cluster Warehouses

• Multi-cluster warehouses are recommended to enable fully automated scaling for concurrency

• Only available for Enterprise Edition (or higher)

• Up to 10 clusters

• Support all of the same properties and actions as a single warehouse

Mode:
• Maximized: same value for both the max and min # of clusters
- effective for statically controlling the available compute resources

• Auto-scale: different values for the max and min # of clusters
- dynamically manages the load based on a scaling policy that determines when to automatically start or shut down additional clusters (see the example below)
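A minimal sketch of the two modes, assuming a hypothetical warehouse name (MIN_CLUSTER_COUNT, MAX_CLUSTER_COUNT and SCALING_POLICY are the relevant properties):

  CREATE WAREHOUSE my_mc_wh
    WAREHOUSE_SIZE    = 'MEDIUM'
    MIN_CLUSTER_COUNT = 1            -- min != max => auto-scale mode
    MAX_CLUSTER_COUNT = 4            -- up to 10 clusters allowed
    SCALING_POLICY    = 'STANDARD'   -- or 'ECONOMY'
    AUTO_SUSPEND      = 300
    AUTO_RESUME       = TRUE;

  -- Setting MIN_CLUSTER_COUNT equal to MAX_CLUSTER_COUNT gives maximized mode instead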

> NOTE: Multi-cluster warehouses: best for scaling resources to improve concurrency for users/queries. Resizing the warehouse: best for improving the performance of slow-running queries or data loading

Scale Up vs. Scale Out
• Scale up: resizing a warehouse.

• Scale out: adding clusters to a multi-cluster warehouse (requires Snowflake Enterprise Edition or higher).
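A brief sketch contrasting the two (the warehouse name is illustrative):

  -- Scale up: a bigger warehouse for slow, complex queries or large data loads
  ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'X-LARGE';

  -- Scale out: more clusters for higher query concurrency (Enterprise Edition or higher)
  ALTER WAREHOUSE my_wh SET MAX_CLUSTER_COUNT = 6;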

Warehouse Size

Larger warehouse sizes (5X-Large and 6X-Large) are generally available in all Amazon Web Services (AWS) regions, and are in preview in the US Government regions (requires FIPS support on ARM) and Azure regions

Scaling Policy (Auto-scale Mode)

Standard (default)
• Strategy: Favours starting additional clusters over conserving credits
• Starts: The first cluster starts immediately when one or more queries are detected in the queue. Each successive cluster waits 20 seconds to start after the prior one has started.
• Shuts down: After 2-3 consecutive successful checks (performed at 1-minute intervals), which determine whether the load on the least-loaded cluster could be redistributed

Economy
• Strategy: Favours keeping running clusters fully loaded rather than starting additional clusters
• Starts: Only if the system estimates there is enough query load to keep the cluster busy for at least 6 minutes
• Shuts down: After 5-6 consecutive successful checks (performed at 1-minute intervals), which determine whether the load on the least-loaded cluster could be redistributed

Snowpark-optimized Warehouses

Provide 16x the memory per node compared to a standard Snowflake VW

• Recommended for workloads that have large memory requirements, e.g. ML training use cases using a stored procedure on a single VW node

• Benefit Snowpark workloads that utilize UDFs or UDTFs

• Snowpark-optimized warehouses are not supported for the X-Small or Small warehouse sizes

• Snowpark-optimized warehouses are available in all regions across AWS, Azure, and Google Cloud.
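A minimal sketch of creating one (the warehouse name is illustrative):

  CREATE WAREHOUSE snowpark_wh
    WAREHOUSE_SIZE = 'MEDIUM'                 -- X-Small and Small are not supported
    WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED';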

Caching

1. Result Cache (Service Layer): Holds the results of every query executed in the past 24 hours. These are available across VWs, so query results are returned to any user who executes the same query, provided the underlying data has not changed.

2. Local Disk Cache (Virtual Warehouses): Used to cache the data used by SQL queries. Whenever data is needed for a given query it is retrieved from Remote Disk storage and cached in SSD and memory.

3. Remote Disk (DB Storage): Holds the long-term storage. This level is responsible for data resilience.

> NOTE: Results are retained for 24 hours, but the clock is reset every time the query is re-executed, up to a limit of 31 days, after which the query must read from the remote disk again.
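A small sketch of the session parameter that controls result-cache reuse (useful e.g. when benchmarking warehouse performance):

  ALTER SESSION SET USE_CACHED_RESULT = FALSE;  -- do not reuse persisted query results
  ALTER SESSION SET USE_CACHED_RESULT = TRUE;   -- default behaviour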

System Performance Tuning Best Practices
Auto-Suspend: By default, Snowflake auto-suspends a virtual warehouse (along with its SSD cache) after 10 minutes of idle time. It is best to leave this setting alone.

Scale up for large data volumes: For a sequence of large queries over massive (multi-terabyte) data volumes, scale up to improve query performance

Scale down - but not too soon: Tune the warehouse size dynamically (once the large task has completed), but don't keep adjusting it, or you'll lose the benefit.
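A minimal sketch of the relevant settings (the warehouse name is illustrative; 600 seconds matches the 10-minute default mentioned above):

  ALTER WAREHOUSE my_wh SET AUTO_SUSPEND = 600;   -- seconds of idle time before suspending
  ALTER WAREHOUSE my_wh SET AUTO_RESUME = TRUE;   -- resume automatically when a query arrives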

Databases, Tables & Views

Micro-partitions
• All data in Snowflake tables is automatically divided into micro-partitions, each containing between 50 MB and 500 MB of uncompressed data
• Tables are partitioned using the ordering of the data as it is inserted/loaded.

Benefits of Micro-partitions
• Don't need to be explicitly defined beforehand or maintained by users (they are derived automatically)
• Small in size, which enables efficient DML and fine-grained pruning for faster queries
• Can overlap in their range of values, which, combined with their uniformly small size, helps prevent skew.
• Columns are stored independently (columnar storage), which allows efficient scanning of individual columns

> NOTE: Snowflake does not prune micro-partitions based on a predicate with a subquery, even if the subquery results in a constant

Data Clustering
• Clustering metadata recorded for each micro-partition as it is created is then used to avoid unnecessary scanning of micro-partitions during querying

Actions performed for queries on the table
1. First, prune micro-partitions that are not needed for the query.
2. Then, prune by column within the remaining micro-partitions.

Clustering Information Maintained
• The total # of micro-partitions
• The # of micro-partitions containing values that overlap with each other (in a specified subset of table columns).
• The depth of the overlapping micro-partitions.

Clustering Key
As DML occurs on large tables, the data might no longer cluster optimally. To improve the clustering of the underlying table micro-partitions automatically, Snowflake lets you designate one or more table columns as a clustering key

Use a clustering key when:
• You require the fastest possible response times, regardless of cost.
• The improved query performance offsets the credits required to cluster and maintain the table.
• Queries on the table are running slower than expected
• The clustering depth for the table is large.

> NOTE: Clustering is generally most cost-effective for tables that are queried frequently and do not change frequently
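A minimal sketch of defining a clustering key and inspecting the clustering metadata listed above (table and column names are hypothetical):

  -- Designate a clustering key on an existing large table
  ALTER TABLE sales CLUSTER BY (sale_date, region);

  -- Inspect clustering metadata: partition counts, overlaps, and depth
  SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date, region)');
  SELECT SYSTEM$CLUSTERING_DEPTH('sales');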

Comparison of Table Types

• After creation, transient/temporary tables cannot be converted to any other table type.
• The Fail-safe period is not configurable for any table type.
• Transient and temporary tables have no Fail-safe period (no additional data storage charges beyond the Time Travel retention period)
• Transient tables: a good option for managing the cost of very large transitory data, but the data in these tables cannot be recovered after the Time Travel retention period passes.
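A minimal sketch of creating the non-permanent table types (table names are illustrative):

  CREATE TEMPORARY TABLE tmp_load (id INT, payload VARIANT);        -- exists only for the session
  CREATE TRANSIENT TABLE staging_events (id INT, payload VARIANT);  -- persists, but no Fail-safe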

Search Optimization Service

• Improves the performance of certain types of lookup and analytical queries that use an extensive set of predicates for filtering
• Features that are available (Enterprise Edition or higher):
- Column configuration
- Support for substring + regex searches
- Support for fields in VARIANT, OBJECT and ARRAY columns
- Support for geospatial functions with GEOGRAPHY objects
- Selective point lookup queries on tables (returning only a small # of distinct rows)

• One of the ways to optimize query performance (others: clustering a table & creating materialised views - clustered/unclustered)

>> Works best to improve the performance of a query when the table is frequently queried on columns other than the primary cluster key

Queries that benefit from SO
• Run for a few seconds or longer
• At least one of the columns accessed through the query filter operation has at least 100k distinct values
• Equality or IN predicates; substring & regex searches, etc.

Not supported by SO
• External tables, materialised views, columns defined with a COLLATE clause, column concatenation, analytical expressions, casts on table columns

Costs of the SO service
• The service creates a search access path data structure that requires space
- The storage cost depends on the # of distinct values in the table (the size is approx. 1/4 of the original table's size); worst-case scenario = same size as the original table
• Adding the service and maintaining it consume additional resources
- The cost is higher when there is high churn (large volumes of data in the table change) and is proportional to the amount of data ingested

>> The costs are proportional to the # of tables on which SO is enabled, the # of distinct values in those tables, and the amount of data that changes in these tables
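A minimal sketch of enabling the service (table and column names are hypothetical):

  -- Enable search optimization for the whole table
  ALTER TABLE events ADD SEARCH OPTIMIZATION;

  -- Or enable it only for specific columns / search methods
  ALTER TABLE events ADD SEARCH OPTIMIZATION ON EQUALITY(user_id), SUBSTRING(url);

  -- Remove it if the cost no longer pays off
  ALTER TABLE events DROP SEARCH OPTIMIZATION;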

Options for Optimising Query Performance

Materialised Views

• Like a view that is frozen in place, but when changes are detected, Snowflake refreshes it automatically
• Designed to improve query performance for workloads composed of common, repeated query patterns.
• It incurs additional costs.

When to use it?
• Query results contain a small # of rows and/or columns relative to the base table
• Query results contain results that require significant processing
• The query is on an external table
• The view's base table does not change frequently

Advantages of Materialised Views
• Improve the performance of queries that use the same subquery results repeatedly
• A background service updates the materialised view automatically after changes are made to the base table
• Data accessed through materialised views is always current

>> We can't put a materialised view DIRECTLY on top of staged data. But if we put an external table between them, we CAN put a materialised view over staged data

>> INFORMATION_SCHEMA.VIEWS does not show materialized views. Materialized views are shown by INFORMATION_SCHEMA.TABLES
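A minimal sketch (the base table and view names are hypothetical; aggregates such as SUM over a single table are supported):

  CREATE MATERIALIZED VIEW daily_totals AS
    SELECT sale_date, SUM(amount) AS total
    FROM sales
    GROUP BY sale_date;

  -- Queries against daily_totals are kept current by the background refresh service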
 

Loading Data

Best Practices in File Sizing
• The number of load operations that run in parallel cannot exceed the number of data files to be loaded. Aim for data files of roughly 100-250 MB (or larger) compressed to optimise loading - the same applies to Snowpipe

• Loading very large files (e.g. 100 GB or larger) is not recommended. If you must do so, carefully consider the ON_ERROR copy option value.

• Aborting or skipping a file because of a small number of errors could cause delays and wasted credits, and you risk no portion of the file being committed if data loading takes more than 24 hours.

• Smaller files mean smaller processing overhead. Split large files by line to avoid records that span chunks

• The VARIANT data type (semi-structured data) has a 16 MB size limit on individual rows

• Snowpipe costs involve an overhead to manage files in the load queue (staging data) in addition to resource consumption (loading data).

• Creating a new data file roughly once per minute is a best practice, as it provides a good balance between cost (i.e. resources spent on Snowpipe queue management and the actual load) and performance (i.e. load latency)

Best Practices in File Formats
Delimited Text Files
• UTF-8 is the default character set
• Fields that contain delimiter characters or carriage returns should be enclosed in quotes
• Use a consistent number of columns in each row

Semi-structured Data Files
• Extract semi-structured data elements containing "null" values into relational columns before loading them; alternatively, set STRIP_NULL_VALUES to TRUE if the "null" values indicate only missing values (see the file format sketch below)
• Ensure each unique element stores values of a single native data type (string or number)

Numeric Data Files
• Avoid embedded characters, e.g. commas
• A number that includes a fractional component should be separated from the whole-number portion by a decimal point

> NOTE: Snowflake checks temporal data values at load time. Invalid date, time, and timestamp values (e.g. 0000-00-00) produce an error
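A minimal sketch of a JSON file format using the STRIP_NULL_VALUES option mentioned above (the format name is illustrative):

  CREATE FILE FORMAT my_json_format
    TYPE = 'JSON'
    STRIP_NULL_VALUES = TRUE;   -- drop "null" values that only indicate missing data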

Best Practices in Organising Data by Path
• Both internal and external stage references can include a path (or prefix, in AWS terminology)
• It is recommended to partition the data into logical paths that include identifying details such as geographical location or other source identifiers
• Narrow the path to the most granular level that includes your data for improved data load performance.

Bulk Loading

Bulk Loading from a Local File System

The COPY command is used to bulk load data from a local file system into tables via an internal (i.e. Snowflake-managed) stage, in two steps:

1. Upload (i.e. stage) one or more data files to a Snowflake stage (a named internal stage, or a table/user stage) using the PUT command.

2. Use the COPY INTO command to load the contents of the staged file(s) into the table.

NOTE: A virtual warehouse must be running to execute this step, as it provides the compute resources to perform the actual insertion of rows into the table
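A minimal two-step sketch run from SnowSQL (the stage, file, and table names are hypothetical; PUT gzip-compresses files by default):

  -- Step 1: stage the local file in a named internal stage
  PUT file:///tmp/sales.csv @my_stage;

  -- Step 2: load the staged (now compressed) file into the target table; requires a running warehouse
  COPY INTO sales
    FROM @my_stage/sales.csv.gz
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);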

By default, each user and table in Snowflake is automatically allocated an internal stage for staging data files to be loaded. In addition, you can create named internal stages.

3 types of internal stages are supported: User, Table and Named

• User Stage
- referenced using @~
- cannot be altered or dropped
- does not support setting file format options; they must be specified as part of the COPY INTO command

NOT suitable when:
1. Multiple users require access to the files.
2. The current user does not have INSERT privileges on the tables the data will be loaded into.

• Table Stage
- has the same name as the table; referenced using @%tablename
- cannot be altered or dropped
- does not support transforming data while loading it
- requires you to be the table owner (i.e. have the role with the OWNERSHIP privilege on the table)

NOT suitable when:
1. You need to copy the data in the files into multiple tables

• Named Stage
- provides the greatest degree of flexibility for data loading
- users with the appropriate privileges on the stage can load data into any table
- privileges can be granted to or revoked from roles; in addition, ownership of the stage can be transferred to another role
- recommended when you plan regular data loads that could involve multiple users and/or tables
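A minimal sketch of referencing each stage type (the table and stage names are hypothetical):

  LIST @~;                 -- user stage of the current user
  LIST @%sales;            -- table stage of the SALES table
  CREATE STAGE my_stage;   -- named internal stage
  LIST @my_stage;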

Snowpipe

Loads data from files according to the COPY statement defined in a referenced pipe (a named, first-class Snowflake object that contains a COPY statement) as soon as the files are available in a stage.

• All data types are supported, including semi-structured data types such as JSON and Avro

• 2 mechanisms for detecting the staged files:
- Automating Snowpipe using cloud messaging
- Calling Snowpipe REST endpoints

• Generally loads older files first, but there is no guarantee that files are loaded in the same order they are staged

• Uses file-loading metadata associated with each pipe object to prevent reloading the same files

• It is difficult to estimate latency, as many factors can affect Snowpipe loads: file formats and sizes, and the complexity of COPY statements (including the SELECT statement used for transformations), can all impact the amount of time required

• Charges are calculated at per-second/per-core granularity

• Check usage/charges in 2 ways:
- Account -> Usage/Bill
- SQL -> Information Schema -> PIPE_USAGE_HISTORY
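A minimal auto-ingest pipe sketch (the stage, pipe, and table names are hypothetical; AUTO_INGEST = TRUE relies on cloud storage event notifications to an external stage):

  CREATE PIPE my_pipe
    AUTO_INGEST = TRUE
  AS
    COPY INTO raw_events
    FROM @my_ext_stage
    FILE_FORMAT = (TYPE = 'JSON');

  -- Review Snowpipe usage via the Information Schema table function
  SELECT * FROM TABLE(INFORMATION_SCHEMA.PIPE_USAGE_HISTORY());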

Bulk vs. Continuous Loading

Mechanism
• Bulk loading: loads batches of data already in cloud storage, or copies data files from a local machine, using the COPY command
• Snowpipe: loads small volumes of data within minutes after files are staged, to ensure near-real-time availability

Compute Resources
• Bulk loading: requires a user-specified warehouse to execute COPY statements
• Snowpipe: uses Snowflake-provided resources (i.e. a serverless compute model)

Authentication
• Bulk loading: relies on the security options supported by the client for authenticating and initiating a user session
• Snowpipe: when calling the REST endpoints, requires key pair authentication with a JSON Web Token (JWT)

Load History
• Bulk loading: stored in the metadata of the target table for 64 days; available upon completion of the COPY statement as the statement output
• Snowpipe: stored in the metadata of the pipe for 14 days; must be requested from Snowflake via a REST endpoint, SQL table function, or ACCOUNT_USAGE view

Transactions
• Bulk loading: always loaded in a single transaction; data is inserted into the table alongside any other SQL statements submitted manually by users
• Snowpipe: loads are combined or split into a single transaction or multiple transactions based on the number and size of the rows in each data file; rows of partially loaded files (based on the ON_ERROR copy option setting) can also be combined or split into one or more transactions

Cost
• Bulk loading: billed for the amount of time each virtual warehouse is active
• Snowpipe: billed according to the compute resources used in the Snowpipe warehouse while loading the files

Semi-structured Data

Data that does not conform to the standards of traditional structured data; it has two main characteristics: nested data structures and the lack of a fixed schema.

• Snowflake can import semi-structured data from JSON, Avro, ORC, Parquet, and XML formats and store it in data types such as VARIANT, ARRAY or OBJECT

- A VARIANT can contain any other data type, including an ARRAY or an OBJECT (used to build and store hierarchical data)

- An ARRAY or OBJECT can directly contain a VARIANT (and thus can indirectly contain any other data type, including itself)

> NOTE: a Snowflake OBJECT corresponds to a “dictionary” or a “map”. A Snowflake OBJECT is not an “object” in the sense of OOP. Curly braces indicate an OBJECT, which contains key-value pairs

• Semi-structured data can be stored in a single column or split into multiple columns.

Loading Semi-structured Data
• You can explicitly specify all, some, or none of the structure when you load and store semi-structured data

• If your data is a set of key-value pairs, you can load it into a column of type OBJECT.

• If your data is an array, you can load it into a column of type ARRAY.

• If the data is complex or an individual value requires more than about 16 MB of storage space, then you can, for instance, split the data into multiple columns, and some of those columns can contain an explicitly-specified hierarchy of data types.

> NOTE: When we split the data across multiple columns, we may use the detect-and-retrieve feature, but it is currently limited to Apache Parquet, Apache Avro, and ORC files.

Querying Semi-structured Data
• Operations that are supported:
- Accessing an element in an array.
- Retrieving a specified value from a key-value pair in an OBJECT.
- Traversing the levels of a hierarchy stored in a VARIANT.

• The query output is enclosed in double quotes because the query output is VARIANT, not VARCHAR. (The : operator and the subsequent . and [] operators always return VARIANT values containing strings.)

• Flatten the array to retrieve all instances of a child element in a repeating array.

• Use the GET/GET_PATH functions to extract a value from a VARIANT column
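A minimal querying sketch, assuming a hypothetical table RAW_EVENTS with a VARIANT column V holding JSON such as {"user": {"id": 7}, "tags": ["a", "b"]}:

  -- Traverse with : and . (output is VARIANT, hence the double quotes; cast to remove them)
  SELECT v:user.id, v:user.id::NUMBER AS id_number FROM raw_events;

  -- Access an array element, then flatten the whole array
  SELECT v:tags[0] FROM raw_events;
  SELECT f.value::STRING AS tag
  FROM raw_events, LATERAL FLATTEN(input => v:tags) f;

  -- GET / GET_PATH equivalents
  SELECT GET(v, 'tags'), GET_PATH(v, 'user.id') FROM raw_events;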

Security Principles

Federated Authentication
User authentication is separated from user access through the use of one or more external entities that provide independent authentication of user credentials (e.g. Single Sign-On (SSO))

• Consists of 2 components: a service provider (SP) and an identity provider (IdP)
- Snowflake serves as the SP; the IdP provides these services to Snowflake: creating and maintaining user credentials and other profile information, and authenticating users for SSO access to the SP.

• 2 types of IdP: native Snowflake support is provided for Okta and ADFS; in addition, Snowflake supports most SAML 2.0-compliant vendors as an IdP: Google G Suite, Microsoft Azure Active Directory, OneLogin & Ping Identity PingOne

>> For a web-based IdP (e.g. Okta), closing the browser tab/window does not necessarily end the IdP session. If a user’s IdP session is still active, they can still access Snowflake until the IdP session times out.

Multi-Factor Authentication (MFA)
• MFA is enabled on a per-user basis; to use MFA, users must enroll themselves
• It is recommended that all users with the ACCOUNTADMIN role be required to use MFA.
• The Duo Push authentication mechanism is used when a user is enrolled in MFA

>> MFA token caching can be combined with connection caching in federated single sign-on: ensure that the ALLOW_ID_TOKEN parameter is set to true in tandem with the ALLOW_CLIENT_MFA_CACHING parameter (see the sketch below).
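A minimal sketch of the two account-level parameters mentioned above:

  ALTER ACCOUNT SET ALLOW_CLIENT_MFA_CACHING = TRUE;  -- cache MFA tokens on the client
  ALTER ACCOUNT SET ALLOW_ID_TOKEN = TRUE;            -- connection caching for federated SSO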

Network Policy
• Allows restricting access to your account based on user IP address. It enables you to create an IP allowed list, as well as an IP blocked list, if desired.

• A network policy is not enabled until it is activated at the account or individual user level.

>> Only security administrators (i.e. users with the SECURITYADMIN role) or higher, or a role with the global CREATE NETWORK POLICY privilege, can create network policies. Ownership of a network policy can be transferred to another role.

• If a network policy is activated for an individual user, the user-level network policy takes precedence.
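A minimal sketch of creating and activating one (the policy name and IP ranges are illustrative):

  CREATE NETWORK POLICY corp_policy
    ALLOWED_IP_LIST = ('192.168.1.0/24')
    BLOCKED_IP_LIST = ('192.168.1.99');

  -- Activate at the account level (or per user with ALTER USER ... SET NETWORK_POLICY)
  ALTER ACCOUNT SET NETWORK_POLICY = corp_policy;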

Snowflake Session
• A session is independent of an identity provider (i.e. IdP) session
• After the idle session timeout (with a maximum of 4 hours of inactivity), the user must authenticate to Snowflake again.
• A session policy can modify the idle session timeout period
• To replace a session policy that is already set for an account or user, unset the existing session policy first and then set the new one
• The classic web interface tracks user activity and sends a heartbeat to an internal Snowflake monitor. This heartbeat is recorded every three minutes and is not configurable.

Access Control
• Access control privileges determine who can access and perform operations on specific objects in Snowflake
• Discretionary Access Control (DAC): each object has an owner, who can in turn grant access to that object.
• Role-based Access Control (RBAC): access privileges are assigned to roles, which are in turn assigned to users.

>> A role owner (i.e. the role that has the OWNERSHIP privilege on the role) does not inherit the privileges of the owned role. Privilege inheritance is only possible within a role hierarchy.

• System-defined roles are created with privileges related to account management. As a best practice, it is not recommended to mix account-management privileges and entity-specific privileges in the same role.
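A minimal RBAC sketch (the role, user, and object names are hypothetical):

  CREATE ROLE analyst;
  GRANT USAGE ON DATABASE sales_db TO ROLE analyst;
  GRANT USAGE ON SCHEMA sales_db.public TO ROLE analyst;
  GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE analyst;
  GRANT ROLE analyst TO USER alice;   -- the user now inherits the role's privileges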