What is it?
Snowflake is a managed SaaS cloud data platform, typically used to warehouse relational data and serve analytics; functionally it is comparable to Redshift or BigQuery. As a SaaS solution it significantly reduces the high operational overhead typical of running a data warehouse. Snowflake decouples storage and compute, and is designed as a multi-cluster, shared-data system providing a single source of truth and on-demand, parallel data retrieval at scale. With zero-copy clones, data can be copied and shared internally and externally at no extra storage cost; metadata tracks the lineage and any subsequent updates. Rather than adapting an existing database engine, Snowflake was built from the ground up on its own purpose-built, SQL-native engine.
How do you use it?
With its cloud-native, managed architecture it is simple to implement data pipelines, warehouse data, create data lakes, develop data applications, and build data exchanges to share governed data easily and securely. This can be done through the web console, or using Snowflake's own CLI tool, SnowSQL.
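As a rough sketch of the workflow, once connected via SnowSQL everything is driven by plain SQL statements (the account identifier, user, and object names below are placeholders):

```sql
-- After connecting with the SnowSQL CLI, e.g. `snowsql -a <account_identifier> -u <user>`,
-- objects are created and queried with ordinary SQL:
USE ROLE SYSADMIN;
CREATE DATABASE IF NOT EXISTS analytics;
USE DATABASE analytics;

-- Session context can be inspected at any time.
SELECT CURRENT_ROLE(), CURRENT_DATABASE(), CURRENT_WAREHOUSE();
```

The same statements run identically from the web console's worksheet interface.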
Snowflake must be hosted on a specific cloud provider (AWS, Azure, or GCP) and region, though the functionality remains the same across regions, and features are developed rapidly. This differs from the approach taken by AWS, which rolls out new services to specific regions first. A Snowflake account can also be integrated with an external cloud provider to access and ingest data, regardless of the cloud provider that hosts the Snowflake account itself.
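For instance, an account hosted on one provider can read from another provider's object storage via a storage integration. A minimal sketch, assuming an S3 bucket and IAM role that you would substitute with your own (all names and the ARN here are hypothetical):

```sql
-- A storage integration delegates authentication to a cloud IAM role,
-- so no long-lived keys are stored in Snowflake.
CREATE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-access'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/raw/');

-- An external stage then exposes that location for loading and querying.
CREATE STAGE raw_stage
  STORAGE_INTEGRATION = s3_int
  URL = 's3://my-bucket/raw/';
```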
Who is it for?
Snowflake is for anyone who needs to store relational and semi-structured data at any scale, with the intention of querying that data or connecting it to analytics tooling. It is very useful for secure data-sharing, both internally and externally to an organisation. If a customer needs a data warehouse with a low total cost of ownership and minimal ongoing operational overhead, it is an ideal choice. Storage costs are low, only marginally above those of the underlying cloud provider's own storage.
When should you use it?
This service is ideal for warehousing relational data, as well as for providing data to dashboards, analysts and data scientists. It can be used in either an ETL or ELT framework.
If your data is primarily held as a key-value store then a service like DynamoDB may be a better fit, though Snowflake is capable of handling semi-structured data with its VARIANT data type, which stores up to 16MB per value.
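A brief sketch of how VARIANT columns sit alongside relational ones (table and field names are illustrative):

```sql
-- A VARIANT column holds semi-structured data (JSON here) next to relational columns.
CREATE TABLE events (
  event_id NUMBER,
  payload  VARIANT
);

-- PARSE_JSON converts a JSON string into a VARIANT value.
INSERT INTO events
  SELECT 1, PARSE_JSON('{"user": "alice", "action": "login", "tags": ["web", "mobile"]}');

-- Dot and bracket notation query into the semi-structured payload,
-- with :: casting the extracted values to SQL types.
SELECT payload:user::STRING  AS user_name,
       payload:tags[0]::STRING AS first_tag
FROM events;
```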
The technology has a low barrier to entry, as warehouses (in Snowflake, a warehouse is a compute cluster) and storage/data objects are created and altered using SQL commands; under the hood, Snowflake manages the data, metadata, and infrastructure.
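To illustrate, both compute and storage objects are provisioned with short SQL statements (names and sizing here are placeholder choices, not recommendations):

```sql
-- A warehouse is compute: sized, auto-suspended, and auto-resumed independently of storage.
CREATE WAREHOUSE transform_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60        -- suspend after 60 seconds idle, so compute billing stops
  AUTO_RESUME = TRUE;      -- wake automatically when a query arrives

-- Storage objects are created the same way.
CREATE DATABASE sales;
CREATE SCHEMA sales.staging;
CREATE TABLE sales.staging.orders (
  order_id NUMBER,
  amount   NUMBER(10,2)
);
```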
With a simple but strong role-based access control it is trivial to give access and grant permissions to administrators, engineers, and consumers (both internal and external).
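A minimal sketch of that access model, assuming a warehouse, database, and user that already exist (all names here are hypothetical):

```sql
-- Security admin creates roles and assigns them to users.
USE ROLE SECURITYADMIN;
CREATE ROLE analyst;
GRANT ROLE analyst TO USER alice;

-- Object privileges are then granted to the role, not the user.
GRANT USAGE ON WAREHOUSE reporting_wh TO ROLE analyst;
GRANT USAGE ON DATABASE sales TO ROLE analyst;
GRANT USAGE ON SCHEMA sales.public TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales.public TO ROLE analyst;
```

Because privileges attach to roles, adding or removing a consumer is a single `GRANT ROLE` or `REVOKE ROLE` statement.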
There are several editions depending on your business and feature requirements (for example, multi-cluster compute). Billing is simple and cost-effective as there is no physical hardware to maintain. All account types are billed per byte of compressed data stored; compute is billed per second, with a one-minute minimum each time a warehouse resumes.
Why should you use it?
- Data is encrypted end-to-end.
- There is no hardware to maintain.
- Storage and compute are decoupled which is ideal for a business at any scale.
- Highly scalable, with a range of compute instance sizes and multi-cluster computing. Behind the scenes, data is compressed and stored in services like S3, meaning storage is cheap and, for all intents and purposes, infinite. Storage can also be pre-purchased in bulk at a discount.
- Easy to manage permissions and grants with role-based access control: security admins manage users and roles, sysadmins manage warehouses, databases, and schemas, and the account admin has super-user powers.
- Zero-copy clones allow copies of data to be made which reference back to the original data. This reduces costs by removing the need to store additional data, and ensures a single source of truth. Costs for extra storage are only incurred when cloned data is operated on and a specific record updated. This is ideal for providing staging or production data to development environments.
- Data can be shared securely internally and externally as a zero-copy clone. To share data across regions a single replication must occur. After that point, data can be cloned and shared within that region as many times as required.
- Secure views allow queried data to be shared without revealing the business logic.
- Materialised views automatically update as their source tables are updated - this is ideal for a single-source-of-truth at the analytics and presentation layer. This data is physically stored, quick to access, and contributes to storage costs.
- Transactions satisfy the ACID properties (atomicity, consistency, isolation, and durability).
- High level of governance and compliance across industries including financial and healthcare (SOC 1 Type II, SOC 2 Type II, PCI-DSS, HIPAA, ISO/IEC 27001, FedRAMP Moderate).
- Documentation is very clear and concise, providing example code for each cloud platform.
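The cloning and secure-view points above can be sketched in a few statements (database, schema, and column names are hypothetical):

```sql
-- Zero-copy clone: a full development copy of production that shares the
-- underlying storage; new storage is consumed only for rows that change.
CREATE DATABASE dev_db CLONE prod_db;

-- Secure view: consumers can query the results without being able to
-- inspect the view definition or the underlying business logic.
CREATE SECURE VIEW sales.public.regional_totals AS
  SELECT region, SUM(amount) AS total
  FROM sales.public.orders
  GROUP BY region;

GRANT SELECT ON VIEW sales.public.regional_totals TO ROLE analyst;
```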
Currently the best and most widely accepted tooling to deploy and manage a Snowflake account is Terraform. Terraform uses providers: plugins that translate configuration into API calls against other systems, in Snowflake's case via the Go connector. The Chan Zuckerberg Initiative created the open-source Snowflake Terraform provider, which allows management of infrastructure and role-based access control. Terraform allows versioned, consistent, stateful, automated deployments; it is a common tool for managing cloud infrastructure platforms such as AWS and Azure.
Are there any common gotchas?
Not all hosting providers are feature complete; the differences between AWS, Azure, and GCP can be found here. AWS typically receives features first, and you can also sign up for an early-release account to test-drive updates.