The availability of on-demand storage and compute capacity has transformed the data warehouse (DWH) market as we know it today. We field a number of questions from our customers about how to optimize the cost of cloud data warehouses and the techniques that can be used. In this post, I would like to look at the key factors that drive DWH costs and the intervention options available for each.
Before we dive in, here are four key factors that impact the cost of cloud DWH:
- Data Ingestion: BigQuery does not charge for batch data ingestion, but it does charge for streaming ingestion. Snowflake charges for data ingestion based on the compute time used.
- Data Storage: Every cloud DWH charges for storage. In BigQuery and Snowflake, storage is billed separately, while in some other DWH solutions, storage and compute are combined.
- Queries: BigQuery is completely managed, and the price is based on the amount of data scanned. In Snowflake, the cost is based on the compute time used by queries.
- Number of nodes: Snowflake charges for the size and the number of instances. BigQuery is completely managed and therefore does not have the concept of instance types.
As we look at cost optimization techniques for each of these categories, we will also consider the controls each product provides to avoid billing surprises.
BigQuery:
Data Ingestion:
Batch data ingestion is not charged in BigQuery; however, streaming inserts are charged at $0.010 / 200 MB. Unless changes need to be propagated in real time, or our analytics use cases are genuinely real-time, we should consider replacing streaming ingestion with a batch process.
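As a rough illustration, here is a minimal sketch using the google-cloud-bigquery Python client that loads a file from Cloud Storage as a batch job instead of streaming rows in; the bucket, file, and table names are placeholders.

```python
# A minimal sketch of replacing streaming inserts with a free batch load from
# Cloud Storage. The URI and table ID below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # infer the schema from the file
)

# Batch loads from GCS are not billed as ingestion, unlike streaming inserts.
load_job = client.load_table_from_uri(
    "gs://example-bucket/events/2020-01-01.csv",  # placeholder URI
    "my_project.analytics.events",                # placeholder table ID
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
```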
Data Storage:
BigQuery charges $0.020 / GB / month for active storage and half of that ($0.010 / GB / month) for long-term storage. Data that has not been edited for 90 days is automatically considered long-term storage, and BigQuery manages this designation itself. Querying a table does not alter its long-term status, and query performance is not degraded. To save on costs, we could store new data in separate tables instead of updating existing ones, since edits bring tables back to active storage pricing.
Alternatively, Google Cloud Storage's Coldline class can be used for data that changes infrequently (at most once a quarter). Coldline storage costs $0.004 / GB / month, significantly less than BigQuery. Data can be batch loaded into BigQuery as required and deleted once utilized.
Table expiration rules let BigQuery delete transient tables automatically and save on storage costs. For instance, for a weekly sales report, the expiration on the underlying staging table can be set to seven days, so only the weekly stats are retained.
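A hedged sketch of this with the Python client; the staging table name below is a placeholder, and the expiration could equally be set as a default on the dataset.

```python
# Set a seven-day expiration on a transient staging table so BigQuery deletes
# it automatically. The table ID is a placeholder.
import datetime
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my_project.reporting.weekly_sales_staging")  # placeholder

# Expire the underlying staging data seven days from now.
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=7)
client.update_table(table, ["expires"])
```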
BigQuery also supports federated queries, where a query runs against the original data source instead of replicating the data into BigQuery. However, the storage costs of the current federated-query sources – Cloud Bigtable ($0.17 / GB), Cloud Storage ($0.020 / GB), Google Drive ($0.020 / GB), and Cloud SQL ($0.170 / GB for SSD) – are higher than BigQuery's own storage costs.
Data Queries
BigQuery charges for the amount of data scanned, not the amount of data returned. This is why LIMIT clauses do not reduce the cost of a query.
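Before running an expensive query, a dry run reports the bytes it would scan without billing anything. A minimal sketch with the Python client; the table and query are placeholders.

```python
# Estimate how many bytes a query would scan before paying for it.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query_job = client.query(
    "SELECT order_id, amount FROM `my_project.sales.orders` WHERE order_date = '2020-01-01'",
    job_config=job_config,
)
# No data is scanned or billed for a dry run; only the estimate is returned.
print(f"This query would process {query_job.total_bytes_processed} bytes")
```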
There are several ways to optimize query performance.
All cloud DWHs, including BigQuery, store data in a columnar format. It is therefore best to specify the exact columns the query needs rather than using SELECT *, which reads every column and can increase the data scanned by orders of magnitude.
Since data is stored in a columnar format, filter clauses against multiple columns mean that each of those columns will be read. Complex, multi-stage queries that return very specific data therefore do not necessarily cost less: the bytes scanned in the first stage of query execution determine the cost, which makes filtering as early as possible a prudent move.
Partitioning data is another great way to reduce scanned data. Today, tables can be partitioned on ingestion time, a date column, or a timestamp column. This does not have to be a constraint, however: one trick is to map the distinct values of a column we would like to partition on to custom date stamps. For instance, each region code could be mapped to a specific date value, and the table partitioned on that.
Clustering can further reduce the amount of data scanned. Data can be clustered on up to four columns, and BigQuery then uses block pruning to skip blocks that cannot match the query.
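As a sketch of both ideas, the snippet below creates a table partitioned on a date column and clustered on two columns using the Python client; the table name, schema, and column choices are placeholders.

```python
# Create a table partitioned on a DATE column and clustered on up to four columns.
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("region", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
    bigquery.SchemaField("order_date", "DATE"),
]

table = bigquery.Table("my_project.sales.orders_partitioned", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(field="order_date")  # prune by date
table.clustering_fields = ["region", "customer_id"]  # block pruning within each partition

client.create_table(table)
```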
Caching also improves performance and cost: queries served from cached results are not billed. Cached results are kept for up to 24 hours and are reused as long as the source data has not changed.
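A small sketch of how to confirm this from the Python client: cache_hit and total_bytes_billed on a finished query job show whether the 24-hour cache was used (the table is a placeholder).

```python
# Check whether a finished query was served from the results cache.
from google.cloud import bigquery

client = bigquery.Client()
job = client.query("SELECT COUNT(*) FROM `my_project.sales.orders`")  # placeholder table
job.result()  # wait for completion

print("served from cache:", job.cache_hit)
print("bytes billed:", job.total_bytes_billed)  # 0 when the cache is used
```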
Controls
The “Maximum bytes billed” setting in BigQuery helps avoid billing surprises. It prevents situations where someone inadvertently runs an open-ended query against a petabyte of data and runs up significant costs.
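A minimal sketch of this guard with the Python client; the 10 GB cap and the query are placeholders.

```python
# Fail queries that would scan more than a fixed number of bytes instead of billing them.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024**3)  # 10 GB cap

query_job = client.query(
    "SELECT * FROM `my_project.sales.orders`",  # an open-ended query we want to guard
    job_config=job_config,
)
query_job.result()  # raises an error if the scan would exceed the cap
```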
Snowflake
Data Ingestion
Snowflake charges for data ingestion based on the compute time used to load the data. In a benchmark provided by Snowflake comparing the various ingestion formats, gzipped CSV files offer the best performance.
| Source Format | Target Layout | Load Time (sec) | TB/Hr (uncompressed) |
| --- | --- | --- | --- |
| CSV (Gzipped) | Structured | 1104 | 15.4 |
| Parquet (Snappy comp) | Semi-structured | 3518 | 4.8 |
| Parquet (Snappy comp) | Structured | 3095 | 5.4 |
| ORC (Snappy comp) | Semi-structured | 3845 | 4.4 |
| ORC (Snappy comp) | Structured | 2820 | 6.0 |
Many data ingestion processes we write do not take advantage of all the nodes in the warehouse. For instance, loading a single large file does not engage multiple nodes, and takes a similar amount of time on an XS, M, or L warehouse. If we instead break the file into multiple smaller files of about 16 MB each, we can fully leverage parallelism across the nodes and drastically reduce ingestion time.
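A hedged sketch in plain Python of splitting a large CSV into roughly 16 MB pieces on line boundaries before staging them; the file names are placeholders, and in practice the pieces would also be compressed before upload.

```python
# Split one large CSV into ~16 MB chunks so Snowflake can load them in parallel.
CHUNK_BYTES = 16 * 1024 * 1024  # ~16 MB per output file, as suggested above

def split_csv(path: str, prefix: str) -> None:
    with open(path, "rb") as src:
        header = src.readline()
        part, written, out = 0, 0, None
        for line in src:
            # Start a new chunk once the current one reaches the size limit.
            if out is None or written >= CHUNK_BYTES:
                if out:
                    out.close()
                part += 1
                out = open(f"{prefix}_{part:04d}.csv", "wb")
                out.write(header)
                written = len(header)
            out.write(line)
            written += len(line)
        if out:
            out.close()

split_csv("big_export.csv", "big_export_part")  # placeholder file names
```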
Similarly, for tables with a large number of columns, vertically splitting the file, ingesting the slices in parallel, and then bringing them back together in views is often a better option.
Since the start and end times of data ingestion processes are known, we can programmatically resume the warehouse for the load and suspend it afterwards to reduce warehouse usage.
Also, use COPY rather than INSERT INTO, as COPY is optimized for bulk loads.
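Putting the last two points together, here is a hedged sketch using snowflake-connector-python that resumes the warehouse, bulk-loads the staged files with COPY, and suspends the warehouse again; the account, credentials, warehouse, stage, and table names are all placeholders.

```python
# A load window: resume the warehouse, COPY the split files, then suspend.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="loader", password="...",  # placeholder credentials
    warehouse="LOAD_WH", database="SALES", schema="PUBLIC",
)
cur = conn.cursor()
try:
    cur.execute("ALTER WAREHOUSE LOAD_WH RESUME IF SUSPENDED")
    # COPY is optimized for bulk loads; the split files load in parallel.
    cur.execute("""
        COPY INTO orders
        FROM @orders_stage
        PATTERN = '.*big_export_part_.*[.]csv'
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
    cur.execute("ALTER WAREHOUSE LOAD_WH SUSPEND")  # stop accruing credits
finally:
    cur.close()
    conn.close()
```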
Data Storage:
Snowflake storage costs center on proper lifecycle management of data. We can differentiate between three types of tables – temporary, transient, and permanent – and Snowflake applies Time Travel and Fail-safe storage to each of them differently.
| | Temporary | Transient | Permanent |
| --- | --- | --- | --- |
| Description | Tied to an individual session and persists only for the duration of the session | Designed for transitory data that needs to be maintained beyond each session | Designed for data that requires the highest level of data protection and recovery; the default when creating tables |
| Time Travel | Yes | Yes | Yes |
| Fail Safe | No | No | Yes |
Enabling Time Travel on our data keeps multiple copies of it as the underlying data changes, which adds storage costs. If we can regenerate the data from our base ELT processes within acceptable SLAs, we can choose not to use Time Travel, or take periodic snapshots of the data instead.
Clearly designating tables as transient or temporary also helps reduce Fail-safe storage costs.
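A hedged sketch of such a designation with snowflake-connector-python: a transient staging table with Time Travel retention set to zero, so neither Time Travel nor Fail-safe storage accrues for data we can regenerate. Names and credentials are placeholders.

```python
# Create a transient staging table with no Time Travel retention (and no Fail-safe).
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="etl", password="...")
cur = conn.cursor()
cur.execute("USE SCHEMA SALES.STAGING")  # placeholder database and schema
cur.execute("""
    CREATE TRANSIENT TABLE IF NOT EXISTS orders_stg (
        order_id STRING,
        amount NUMBER(12, 2),
        order_date DATE
    )
    DATA_RETENTION_TIME_IN_DAYS = 0
""")
cur.close()
conn.close()
```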
Data Queries:
Like BigQuery, Snowflake stores data in a columnar format, so selecting specific columns instead of SELECT * reduces the data Snowflake scans.
Common query problems in Snowflake are using UNION where UNION ALL would do, exploding joins, and queries that are too large for memory. Each of these can significantly degrade query performance, and the Snowflake query profiler can be used to identify them.
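One way to find candidates for profiling is the ACCOUNT_USAGE.QUERY_HISTORY view, whose spill columns flag queries that are too large for memory. A hedged sketch follows; credentials are placeholders, and reading ACCOUNT_USAGE requires appropriate privileges.

```python
# Surface recent queries that spilled to disk so they can be opened in the query profiler.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="admin", password="...")
cur = conn.cursor()
cur.execute("""
    SELECT query_id,
           total_elapsed_time / 1000 AS seconds,
           bytes_spilled_to_local_storage,
           bytes_spilled_to_remote_storage
    FROM snowflake.account_usage.query_history
    WHERE start_time > DATEADD(day, -7, CURRENT_TIMESTAMP())
      AND (bytes_spilled_to_local_storage > 0 OR bytes_spilled_to_remote_storage > 0)
    ORDER BY bytes_spilled_to_remote_storage DESC
    LIMIT 20
""")
for row in cur.fetchall():
    print(row)
cur.close()
conn.close()
```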
Caching can be leveraged to reduce or eliminate compute time when the same results are reused. For a BI tool where users execute the same queries repeatedly, tuning the warehouse's auto-suspend time helps keep the cache warm.
Similarly, materialized views can be leveraged when multiple queries use the same sub-query results. In these cases, building a materialized view of the sub-query and running user queries against it is cost effective.
Snowflake usually does a great job of auto-clustering tables. However, if performance degrades consistently, we can define user-defined clustering on up to four columns. Clustering helps with block pruning and reduces query compute time.
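The sketch below combines these three ideas with snowflake-connector-python: a longer auto-suspend so the warehouse cache stays warm for BI users, a materialized view for a shared sub-query, and a user-defined clustering key. Materialized views are an Enterprise-edition feature, and all object names are placeholders.

```python
# Auto-suspend tuning, a materialized view for a shared sub-query, and a clustering key.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="admin", password="...")
cur = conn.cursor()

# Keep the BI warehouse warm for 10 minutes so repeated dashboard queries hit the cache.
cur.execute("ALTER WAREHOUSE BI_WH SET AUTO_SUSPEND = 600")

# Precompute a sub-query shared by many dashboards.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS sales.public.daily_revenue AS
    SELECT order_date, region, SUM(amount) AS revenue
    FROM sales.public.orders
    GROUP BY order_date, region
""")

# Define a clustering key (up to four columns) when auto-clustering is not enough.
cur.execute("ALTER TABLE sales.public.orders CLUSTER BY (order_date, region)")

cur.close()
conn.close()
```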
Resource Control:
Snowflake allows granular control of your virtual warehouses: credit usage can be capped, and notifications can be sent when certain thresholds are breached.
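Concretely, this is done with resource monitors. A hedged sketch follows; it requires the ACCOUNTADMIN role, and the quota, thresholds, and warehouse name are placeholders.

```python
# Cap monthly credit usage, notify at 80%, and suspend the warehouse at 100%.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="accountadmin", password="...")
cur = conn.cursor()
cur.execute("""
    CREATE OR REPLACE RESOURCE MONITOR monthly_cap
    WITH CREDIT_QUOTA = 100
         FREQUENCY = MONTHLY
         START_TIMESTAMP = IMMEDIATELY
         TRIGGERS ON 80 PERCENT DO NOTIFY
                  ON 100 PERCENT DO SUSPEND
""")
cur.execute("ALTER WAREHOUSE BI_WH SET RESOURCE_MONITOR = monthly_cap")
cur.close()
conn.close()
```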
Instance Types:
Unlike BigQuery, which is completely managed, in Snowflake we need to choose the instance types and the number of instances ourselves. Snowflake supports both vertical scaling (scale up) and horizontal scaling (scale out), and choosing the right combination for the use case helps optimize the number of instances. For instance, if data loads happen at a particular time of day and come from a single source, scaling the warehouse up for the load period and then suspending it can be optimal. If, on the other hand, users run simple queries throughout the day, a combination of multiple small instances can follow the demand pattern, rather than overprovisioning with larger instances.
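A hedged sketch of both patterns with snowflake-connector-python: scaling a load warehouse up and back down around a known load window, and a small multi-cluster warehouse (an Enterprise-edition feature) that scales out with daytime concurrency. Warehouse names and sizes are placeholders.

```python
# Scale up for a load window, scale back down and suspend; scale out for concurrency.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="admin", password="...")
cur = conn.cursor()

# Scale up for the nightly load, then shrink and suspend.
cur.execute("ALTER WAREHOUSE LOAD_WH SET WAREHOUSE_SIZE = 'LARGE'")
# ... run the load here ...
cur.execute("ALTER WAREHOUSE LOAD_WH SET WAREHOUSE_SIZE = 'XSMALL'")
cur.execute("ALTER WAREHOUSE LOAD_WH SUSPEND")

# Scale out: several small clusters that follow daytime query concurrency.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS BI_WH
    WITH WAREHOUSE_SIZE = 'SMALL'
         MIN_CLUSTER_COUNT = 1
         MAX_CLUSTER_COUNT = 4
         SCALING_POLICY = 'STANDARD'
         AUTO_SUSPEND = 300
         AUTO_RESUME = TRUE
""")
cur.close()
conn.close()
```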
Summary
When you work through the various cost heads with real data, you will see that compute costs (or query costs in BigQuery) exceed storage costs by orders of magnitude. It is therefore best to start by optimizing compute costs on Snowflake and query costs on Google BigQuery before looking at storage optimization.
Apart from this, also consider the overall costs of the DWH, including managing infrastructure. Google BigQuery scores over Snowflake in administrative costs. Being a fully managed product, BigQuery completely eliminates the need for instance planning, node management and optimization.
Applying the techniques discussed here to optimize the overall cost of the solution helps us take advantage of both the agility of cloud warehouses and their cost benefits.
References:
- Google BigQuery best practices: https://cloud.google.com/blog/products/data-analytics/cost-optimization-best-practices-for-bigquery
- TCO analysis: https://services.google.com/fh/files/blogs/esg_economic_validation_google_bigquery_vs_cloud-based-edws-september_2019.pdf
- Snowflake best practices: https://www.youtube.com/watch?v=RD5_zYBl3Ss
- Saama Analytics Webinar: https://saamaanalytics.com/events/webinar-best-practices-for-managing-optimizing-snowflake-cloud-data-warehouse/
- Cloud cost optimization: https://www.atscale.com/blog/reduce-cost-and-increase-efficiency-on-your-cloud-data-warehouse/
- How to slash Snowflake costs by at least 30%: https://bryteflow.com/how-to-slash-snowflake-costs-by-at-least-30/