In the second part of the article, we saw a simple data migration from Azure Cosmos DB to Atlas using the mongodump and mongorestore commands.
In this article, we look further at more complex aspects of migration: indexing, partitioning, sizing, and consistency.
Indexing is an inherent part of any database. Indexes reduce the overhead of collection scans when looking up documents. It is very important to identify and index only the required fields, because each index creates a separate tree structure that consumes RAM and CPU while documents are created, updated, or deleted.
By default, MongoDB indexes the _id field. MongoDB offers various types of indexes for different purposes:
- Single Field Index
- Compound Index
- Multikey Index
- Geospatial Index
- Text Index
- Hashed Index
- Others: unique, partial, sparse, and TTL indexes
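To illustrate a few of these index types, the following mongosh snippet creates indexes on a hypothetical orders collection (the collection and field names are assumptions for illustration, not from the original application):

```javascript
// Run inside a mongosh session; `orders` is a hypothetical collection.

// Single field index on customerId
db.orders.createIndex({ customerId: 1 });

// Compound index supporting queries that filter on customerId and sort by orderDate
db.orders.createIndex({ customerId: 1, orderDate: -1 });

// Multikey index: created automatically when the indexed field (tags) holds arrays
db.orders.createIndex({ tags: 1 });

// TTL index: documents expire 30 days after their createdAt timestamp
db.orders.createIndex({ createdAt: 1 }, { expireAfterSeconds: 2592000 });

// Partial index: only documents matching the filter expression are indexed
db.orders.createIndex(
  { status: 1 },
  { partialFilterExpression: { status: "open" } }
);
```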
Like MongoDB, Azure Cosmos DB automatically indexes the _id field (enabling the wildcard option creates indexes on all fields). It has special indexes for spatial data, arrays, and nested documents, and, most importantly, it shards data automatically.
When migrating data from Azure Cosmos DB to Atlas, whether to carry over the indexes as-is or create only the required ones should be a conscious decision based on the data usage pattern.
Note: if the source application has a few indexes created explicitly, their definitions are captured by mongodump and recreated in the target by mongorestore.
Database scaling is an important challenge for any database engine to handle. In MongoDB, scaling is handled through a process called sharding. MongoDB shards data at the collection level.
A sharded cluster consists of three components:
- mongos (query router): acts as the interface between the application and the sharded cluster.
- config servers: store metadata about the shards, which mongos uses to perform targeted or scatter-gather queries.
- shards: data is distributed and stored across the shards, enabling horizontal scaling. MongoDB distributes data across shards in chunks.
Data is split across shards using a shard key, and selecting an optimal shard key is critical because it determines query performance at runtime. There are two shard key strategies:
- Ranged
- Hashed
In Azure Cosmos DB, sharding is referred to as partitioning, and it is enabled only when creating an unlimited collection; the portal prompts for the partition key at creation time. Like chunks in MongoDB, Azure Cosmos DB maintains logical partitions based on the specified partition key.
Migrating data from Azure Cosmos DB to MongoDB with sharding can be done in one of two ways:
- Create the sharded cluster first and import the data directly into it.
- Import the data into a replica set without sharding, then enable sharding on the database and shard the required collection. The balancer will distribute the data across the shards in evenly sized chunks.
As a prerequisite, set up the sharded cluster in Atlas. The decision on the shard keys should be made carefully.
Use mongoexport and mongoimport to get the data into the sharded Atlas cluster.
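As a sketch of the second approach, the following mongosh commands enable sharding after the import (the retail database, orders collection, and customerId shard key are hypothetical; on Atlas, the connected user needs the appropriate privileges):

```javascript
// Run in mongosh against the Atlas sharded cluster.

// Enable sharding on the migrated database
sh.enableSharding("retail");

// Create a hashed index on the chosen shard key field
db.getSiblingDB("retail").orders.createIndex({ customerId: "hashed" });

// Shard the collection on the hashed key; the balancer then
// redistributes existing data across the shards in chunks
sh.shardCollection("retail.orders", { customerId: "hashed" });

// Inspect shard and chunk distribution
sh.status();
```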
According to the CAP theorem, a distributed system cannot guarantee consistency, availability, and partition tolerance all at the same time.
In MongoDB, reads from the primary are strongly consistent by default, while reads from secondaries are eventually consistent. This behaviour can be influenced by read and write concerns. A write request can specify a write concern, which dictates how many replica set members must acknowledge the write, ensuring the durability of write operations.
Similarly, read requests can specify one of four read concerns: “local”, “available”, “majority”, and “linearizable”.
With “local”, irrespective of the write concern, data is returned from the queried instance without any guarantee that it has been durably committed to the other replicas.
The “majority” read concern provides more consistent data, returning only writes that have been acknowledged by a majority of the nodes.
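These concerns can be set per operation. A minimal mongosh sketch, assuming a hypothetical retail.orders namespace:

```javascript
// Run in mongosh; `retail.orders` is a hypothetical namespace.
const coll = db.getSiblingDB("retail").orders;

// Write concern "majority": acknowledged only after a majority of
// replica set members have applied the write (durable).
coll.insertOne(
  { _id: 1, status: "open" },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
);

// Read concern "local": returns the instance's latest data, which may
// be rolled back if it was never majority-committed.
coll.find({ _id: 1 }).readConcern("local");

// Read concern "majority": returns only majority-committed data.
coll.find({ _id: 1 }).readConcern("majority");
```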
As a part of application migration, the required read/write consistency levels should be chosen after thoughtful consideration. There is no rule of thumb governing the selection of one over another; it is suggested that we analyse the use case and select the appropriate consistency level.
It is important to estimate sizing before deployment, but the estimate need not be exact, since the cluster can be scaled after migration. We can go ahead with the anticipated cluster size, then monitor performance and throughput and increase or decrease the cluster size without any downtime.
To size the MongoDB instance appropriately, consider the following metrics:
- Total number of CRUD operations (create, read, update, and delete)
- Active connections
- Total number of multi-document transactions
- Size of indexes
- Total number and size of documents
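A back-of-the-envelope working-set estimate can combine some of these metrics. The formula and every number below are illustrative assumptions, not an official MongoDB sizing rule:

```javascript
// Rough sizing sketch: working set ≈ all indexes plus the "hot"
// fraction of documents that is accessed frequently.
function estimateWorkingSetGB({ docCount, avgDocBytes, indexBytes, hotDataFraction }) {
  const dataBytes = docCount * avgDocBytes;
  const workingSetBytes = indexBytes + dataBytes * hotDataFraction;
  return workingSetBytes / 1024 ** 3;
}

// Hypothetical workload figures for illustration only.
const estimate = estimateWorkingSetGB({
  docCount: 50_000_000,       // total documents
  avgDocBytes: 2_048,         // ~2 KB average document size
  indexBytes: 8 * 1024 ** 3,  // 8 GB of indexes
  hotDataFraction: 0.2,       // 20% of data accessed frequently
});

console.log(estimate.toFixed(1), "GB"); // prints 27.1 GB
```

The RAM of the chosen cluster tier should comfortably hold this working set; the remaining metrics (operations per second, connections, transactions) drive the CPU and IOPS side of the sizing.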
Sizing estimation can be done with a simple tool we at PeerIslands designed on top of an open-source project, taking some of the above parameters into account.
In the next part of the article, we will see how ongoing changes can be captured from Cosmos DB using change data capture (CDC).