Optimizing Google Cloud Datastore (Firestore) Costs
Sardine uses Google Cloud Datastore to power part of our investigative UI. Datastore is a highly scalable managed NoSQL database with a flexible interface. Datastore is very important to our infrastructure, so we are always looking to optimize it.
As we scale, we are noticing Datastore costs increasing substantially as well. This led us to dive deeper into understanding the types of costs associated with Datastore and how we could optimize these costs with an eye towards our future scaling of our infrastructure.
Google also has a product called Firestore which is now positioned as the next generation of Datastore. When we got started, native Firestore was not GA. We are currently using Firestore with Datastore compatibility.
Understanding Datastore pricing
Our first step in this investigation was understanding the pricing for Datastore. Datastore charges for:
- Read Request
- Write Request
- Delete Request
- Storage
After the team’s analysis, we determined that most of Sardine’s cost is from data storage primarily because we have a lot of data (and indexes).
Understanding Datastore’s Indexing
Since storage (and indexing) is where we needed to focus on, we wanted to understand datastore’s indexing system. Since this would help us optimize datastore costs.
Built-in Indexes
Datastore is a flexible document store (similar to MongoDB), and by default, all columns are indexed. This means, if we are not careful, we’ll end up indexing lots of fields and drive up our costs.
Composite Indexes
To support queries with multiple filters, we must add composite indexes. Datastore also requires indexes for sorting, so to support the following query, we will need a composite index of (customer_id, timestamp)
SELECT * FROM transactions WHERE customer_id=1 ORDER BY timestamp ASC
Indexes are ordered, to support timestamp DESC
sort, we will need another index.
Analyzing Index usage
Google cloud provides various metadata to understand index usage. We have to look at both built-in and composite index usage to get the correct usage data.
Built-in Indexes
To analyze built-in index usage, the team ran the following query:
SELECT * FROM __Stat_Kind_BuiltinIndex__
This query reveals statistics about each built-in index, including:
- Index size
- Last used time
- Last update time
We found out that we accidentally indexed list fields for each session, which consumed a lot of spaces. And those indexes were not even used.
Composite Indexes
To analyze composite indexes, we ran the following query:
SELECT * FROM __Stat_Kind_CompositeIndex__
This query reveals statistics about each built-in index, including:
- Index size
- Last used time
- Last update time
Dropping indexes
Now that we identified the expensive indexes, the team can start the process of dropping them. For composite indexes, since we manage them in terraform, it’s easy to drop. Built-in indexes are more complex, because they are per-object.
When our dataflow job writes data to datastore we have code like the following:
entity.putProperties("score",
DatastoreHelper.makeValue(element.getScore()).build());
The code doesn’t mention anything about index. However, in this example, score
field is indexed, because by default all fields are indexed.
entity.putProperties("score",
// explicitly set excludeFromIndexes flag
DatastoreHelper.makeValue(element.getScore()).setExcludeFromIndexes(true).build());
To avoid confusion we built a wrapper function buildIndexedValue
and buildUnindexedValue
so the code is more explicit like the following:
entity.putProperties("transaction_id",
buildIndexedValue(element.getTransactionId());
entity.putProperties("score",
buildUnindexedValue(element.getScore());
Additionally, we had to unindex fields from billions of existing records. There is no easy way to perform this as far as we know. So, we wrote a python script that reads data, and writes back with an updated excludeFromIndex
flag.
Conclusion
Datastore costs often grow quietly through unnecessary indexes. By doing the following you can get noticeable cost savings:
- Auditing index stats (
__Stat_Kind_BuiltinIndex__
,__Stat_Kind_CompositeIndex__
), - Dropping what’s unused indexes.
- Explicitly marking non-queryable fields with
excludeFromIndexes=true.
Let us know what you think about this approach. Be sure to investigate your data storage solution to look for cost savings and optimization!
If you’re someone who is excited about optimizing cloud usage, apply here!