MongoDB Shard Cluster

We all pick some database to store, query and fetch data. The real question is which type of database best supports the project at hand, and that question goes deeper than "SQL vs. NoSQL". For example, if I need a flexible schema with recursive graph queries, I go with MongoDB, a NoSQL database that stores data as JSON documents made up of field and value pairs.

{
  "name": "Angelina",
  "place": "Bangalore",
  // field: value
}

The document structure is not fixed, and queries are themselves JSON documents, which makes them easy to compose.

What is Database Sharding?

Consider a scenario where I have a standalone MongoDB instance running on one machine, holding the data of 2 million users. My business is approaching the break-point of that machine and will likely surpass 2.5 million users soon. So I decide to split the database in two, with each machine storing one part of the data.

The system capacity is now doubled!
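
To make the arithmetic behind that claim concrete, here is a minimal sketch. It treats the 2.5 million figure from the scenario as the capacity of a single machine; that limit and the function names are illustrative assumptions, not MongoDB settings.

```javascript
// Assumption: one machine comfortably holds about 2.5 million user records.
const CAPACITY_PER_MACHINE = 2500000;

// Total capacity grows linearly with the number of machines.
function totalCapacity(machines) {
  return machines * CAPACITY_PER_MACHINE;
}

// Smallest number of machines able to hold the given number of users.
function machinesNeeded(users) {
  return Math.max(1, Math.ceil(users / CAPACITY_PER_MACHINE));
}

console.log(totalCapacity(2));        // capacity after splitting in two
console.log(machinesNeeded(2000000)); // today's 2 million users
console.log(machinesNeeded(3000000)); // after growing past the break-point
```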

What is MongoDB Sharding?

MongoDB uses sharding to support deployments with very large data sets by distributing the data across multiple machines. This results in high-throughput operations.

Components:

  1. Shards: Each shard contains a subset of the data of the sharded cluster.
  2. Mongos (query router): Provides an interface between client applications and the sharded cluster.
  3. Config-servers: Config servers store metadata and configuration settings for the cluster.

Sharding is based on the shard key: a field, or a set of fields, that you select from the documents of the target collection.
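
How these components cooperate can be sketched in a few lines. This is a toy model, not MongoDB's implementation: the chunk table stands in for the metadata kept by the config servers, `route` plays the role of mongos, and the shard names and range boundaries are made up for illustration.

```javascript
// Toy stand-in for config-server metadata: each chunk covers the shard key
// range [min, max) and lives on one shard. Names and bounds are hypothetical.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shard-a" },
  { min: 1000, max: 2000, shard: "shard-b" },
  { min: 2000, max: Infinity, shard: "shard-c" },
];

// Toy stand-in for mongos: read the shard key field from the document and
// return the shard whose chunk range contains that value.
function route(doc, shardKeyField) {
  const value = doc[shardKeyField];
  const chunk = chunks.find((c) => value >= c.min && value < c.max);
  return chunk.shard;
}

console.log(route({ _id: 42, name: "Angelina", place: "Bangalore" }, "_id"));
console.log(route({ _id: 1500, name: "Angelina", place: "Bangalore" }, "_id"));
```

The client only ever talks to the router; which shard actually stores a document follows from the shard key value alone.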

Types of Sharding:

  1. Ranged Sharding: Dividing data into contiguous ranges determined by the shard key values.
  2. Hashed Sharding: Uses a hashed index to partition data across your sharded cluster.
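
The difference between the two types shows up clearly with sequential keys. The sketch below is a simplified model under stated assumptions: three shards, ranged boundaries at every 1000 keys, and a multiplicative hash that is only a stand-in for the hash function MongoDB really uses for hashed indexes.

```javascript
const SHARDS = 3;

// Ranged: contiguous key ranges map to shards. Assumption for this sketch:
// keys start at 1 and each range is 1000 keys wide.
function rangedShard(key) {
  return Math.min(SHARDS - 1, Math.floor((key - 1) / 1000));
}

// Hashed: hash the key first, then split the 32-bit hash space into three
// contiguous ranges. The multiplicative hash is a stand-in, not MongoDB's.
function hashedShard(key) {
  const hash = Math.imul(key, 2654435761) >>> 0; // unsigned 32-bit value
  return Math.floor((hash / 4294967296) * SHARDS);
}

// Count how many of the keys 1..n each shard would receive.
function distribution(pickShard, n) {
  const counts = new Array(SHARDS).fill(0);
  for (let key = 1; key <= n; key++) counts[pickShard(key)]++;
  return counts;
}

console.log("ranged:", distribution(rangedShard, 1000));
console.log("hashed:", distribution(hashedShard, 1000));
```

With the keys 1 to 1000, the ranged variant sends every document to the first shard, while the hashed variant spreads them over all three. That is why a hashed key is a common choice for monotonically increasing values.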

The demo below shows how to enable hashed sharding for a database, with "_id" as the shard key.

Deploy a Shard cluster

1. Config Server replica set

For a production deployment, deploy a config server replica set with at least three members. Below is the MongoDB configuration for the config servers.

sharding:
  clusterRole: configsvr
replication:
  replSetName: <replica_set_name>

Deploying a config server replica set with 3 members, i.e. 1 primary and 2 secondary members.

config-server replica set name: "test-config-server"
...
mongod --configsvr --replSet test-config-server --dbpath /data --bind_ip localhost,<hostname(s)|ip address(es)>
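
Once the config server mongod processes are running, the replica set has to be initiated once from a shell connected to one of the members. The sketch below builds the configuration document for `rs.initiate()`. The host names reuse the cfg1 to cfg3 example.net placeholders from the mongos command in step 3, and the helper name is my own, so adjust both to your setup.

```javascript
// Build a replica set configuration document for the config servers.
// The helper name and the host list are illustrative assumptions.
function configServerReplSet(name, hosts) {
  return {
    _id: name,        // must match the --replSet value passed to mongod
    configsvr: true,  // marks this set as a config server replica set
    members: hosts.map((host, index) => ({ _id: index, host: host })),
  };
}

const config = configServerReplSet("test-config-server", [
  "cfg1.example.net:27019",
  "cfg2.example.net:27019",
  "cfg3.example.net:27019",
]);

// Inside the mongo shell the `rs` helper exists, so the set gets initiated;
// anywhere else the script only prints the document.
if (typeof rs !== "undefined") {
  rs.initiate(config);
} else {
  console.log(JSON.stringify(config, null, 2));
}
```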

2. Shard Replica set

Below is the mongod configuration for each shard replica set.

sharding:
  clusterRole: shardsvr
replication:
  replSetName: <replica_set_name>

Create three MongoDB shard replica sets, each with a primary, a secondary and an arbiter member; you can add more replica members if needed. Give each shard its own name.

mongod --shardsvr --replSet testing-shard-1 --dbpath <path> --bind_ip localhost,<hostname(s)|ip address(es)>

mongod --shardsvr --replSet testing-shard-2 --dbpath <path> --bind_ip localhost,<hostname(s)|ip address(es)> 

mongod --shardsvr --replSet testing-shard-3 --dbpath <path> --bind_ip localhost,<hostname(s)|ip address(es)>
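
Once the mongod processes of a shard are up, that replica set has to be initiated from a shell connected to one of them. Here is a minimal sketch of the configuration document for a shard with a primary, a secondary and an arbiter. The host names and the helper name are hypothetical placeholders.

```javascript
// Build the configuration document for one shard replica set made of two
// data-bearing members and one arbiter. Hosts and helper name are made up.
function shardReplSet(name, dataHosts, arbiterHost) {
  const members = dataHosts.map((host, index) => ({ _id: index, host: host }));
  // An arbiter votes in elections but stores no data.
  members.push({ _id: members.length, host: arbiterHost, arbiterOnly: true });
  return { _id: name, members: members };
}

const shard1 = shardReplSet(
  "testing-shard-1",
  ["shard1-a.example.net:27018", "shard1-b.example.net:27018"],
  "shard1-arb.example.net:27018"
);

// `rs` only exists inside the mongo shell; elsewhere just print the document.
if (typeof rs !== "undefined") {
  rs.initiate(shard1); // run once, connected to a member of this shard
} else {
  console.log(JSON.stringify(shard1, null, 2));
}
```

The same helper can be called with testing-shard-2 and testing-shard-3 and their own hosts.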

3. Mongos for the shard cluster

When running mongos, add the setting below to its configuration file, i.e. the config server replica set name and at least one member of that replica set.

sharding:
  configDB: <configReplSetName>/<cfg-server1-ip>:<port>,<cfg-server2-ip>:<port>,<cfg-server3-ip>:<port>

Running mongos:

mongos --configdb test-config-server/cfg1.example.net:27019,cfg2.example.net:27019,cfg3.example.net:27019 --bind_ip localhost,<hostname(s)|ip address(es)>

4. Connect to the Shard Cluster

mongo --host <hostname> --port <port>

4.1 Add Shards to the Cluster

sh.addShard("testing-shard-1/<hostname(s)|ip address(es)>")
sh.addShard("testing-shard-2/<hostname(s)|ip address(es)>")
sh.addShard("testing-shard-3/<hostname(s)|ip address(es)>")
...
sh.addShard("<shard-name>/<hostname(s)|ip address(es)>")
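
If there are many shards, the connection strings can be generated instead of typed by hand. The following is a small sketch; the host names are hypothetical, and the `sh` guard only exists so that the snippet also runs outside the mongo shell.

```javascript
// Hypothetical shard layout: replica set name -> member hosts.
const shards = {
  "testing-shard-1": ["shard1-a.example.net:27018", "shard1-b.example.net:27018"],
  "testing-shard-2": ["shard2-a.example.net:27018", "shard2-b.example.net:27018"],
  "testing-shard-3": ["shard3-a.example.net:27018", "shard3-b.example.net:27018"],
};

// sh.addShard() expects "<replSetName>/<host1>,<host2>,...".
function shardConnectionString(name, hosts) {
  return name + "/" + hosts.join(",");
}

const connectionStrings = Object.keys(shards).map((name) =>
  shardConnectionString(name, shards[name])
);

connectionStrings.forEach((connectionString) => {
  if (typeof sh !== "undefined") {
    sh.addShard(connectionString); // inside a shell connected to mongos
  } else {
    console.log(connectionString); // outside the shell: just show it
  }
});
```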

4.2 Create database

Creating users database

use users
...
use <db name>

4.3 Create collection

Creating collection with name indianUsers

db.createCollection("indianUsers")
...
db.createCollection("<collectionName>")

4.4 Check indexes for collection

By default, a collection only has the index {_id: 1}; sharding on that index would be range-based sharding.

db.indianUsers.getIndexes()
...
db.<collectionName>.getIndexes()

Output:

[
  {
    "key": {
      "_id": 1
    },
    "name": "_id_",
    "ns": "users.indianUsers",
    "v": 2
  }
]

4.5 Create hashed index

We are going with hash-based sharding, so create an index with {_id: "hashed"}.

db.indianUsers.createIndex({_id: "hashed"})
...
db.<collectionName>.createIndex({_id: "hashed"})
...
// Check indexes again
db.indianUsers.getIndexes()

Output:

[
  {
    "key": {
      "_id": 1
    },
    "name": "_id_",
    "ns": "users.indianUsers",
    "v": 2
  },
  {
    "key": {
      "_id": "hashed"
    },
    "name": "_id_hashed",
    "ns": "users.indianUsers",
    "v": 2
  }
]

4.6 Enable sharding for database

sh.enableSharding('users')
...
sh.enableSharding('<db name>')

4.7 Enable hash based sharding for collection

sh.shardCollection("users.indianUsers", {_id: "hashed"})
...
sh.shardCollection("<db name>.<collectionName>", {_id: "hashed"})

4.8 Check collection distribution

db.indianUsers.getShardDistribution()
...
db.<collectionName>.getShardDistribution()

Output:

Shard testing-shard-1-v2 at testing-shard-1-v2/<replica-server-ip:port>
 data : 0B docs : 0 chunks : 2
 estimated data per chunk : 0B
 estimated docs per chunk : 0

Shard testing-shard-3-v2 at testing-shard-3-v2/<replica-server-ip:port>
 data : 0B docs : 0 chunks : 2
 estimated data per chunk : 0B
 estimated docs per chunk : 0

Shard testing-shard-2-v2 at testing-shard-2-v2/<replica-server-ip:port>
 data : 0B docs : 0 chunks : 2
 estimated data per chunk : 0B
 estimated docs per chunk : 0

Totals
 data : 0B docs : 0 chunks : 6
 Shard testing-shard-1-v2 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
 Shard testing-shard-3-v2 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
 Shard testing-shard-2-v2 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
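
The six empty chunks in this output can be pictured as six contiguous slices of the hashed key space, two per shard. The sketch below is only a mental model. It assumes a signed 64-bit hash space cut into equal-width slices and hands consecutive slices to the same shard, whereas the real boundaries and placement are decided by MongoDB.

```javascript
// Assumption: hashed shard key values are signed 64-bit integers.
const MIN_HASH = -(2n ** 63n);
const MAX_HASH = 2n ** 63n - 1n;

// Cut the hash space into contiguous slices, `chunksPerShard` per shard.
function initialChunks(shardNames, chunksPerShard) {
  const count = BigInt(shardNames.length * chunksPerShard);
  const width = (MAX_HASH - MIN_HASH + 1n) / count; // integer division
  const chunks = [];
  for (let i = 0n; i < count; i++) {
    const min = MIN_HASH + i * width;
    // The last slice absorbs the remainder left by the integer division.
    const max = i === count - 1n ? MAX_HASH : min + width - 1n;
    const shard = shardNames[Math.floor(Number(i) / chunksPerShard)];
    chunks.push({ min: min, max: max, shard: shard });
  }
  return chunks;
}

const chunks = initialChunks(
  ["testing-shard-1-v2", "testing-shard-2-v2", "testing-shard-3-v2"],
  2
);
chunks.forEach((c) => console.log(c.shard + ": [" + c.min + ", " + c.max + "]"));
```

Every possible hash value falls into exactly one slice, so any document inserted later already has a chunk, and therefore a shard, waiting for it.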

4.9 Insert Data into the collection

Use a simple for loop to insert some data into the collection.

for (var i = 1; i <= 1000; i++) {
	db.indianUsers.insert(
		{
			'first': 'helloWorld',
			'rollno': i
		}
	) 
}

Output:

WriteResult({ "nInserted" : 1 })

// This inserts 1000 documents into the collection.
// Check collection count
db.indianUsers.count()
...
db.<collectionName>.count()

Output: 1000
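
The loop above sends one insert per document. A possible alternative, sketched here, builds the same 1000 documents in memory and sends them with a single `insertMany()` call. The `db` guard is only there so that the snippet also runs outside the mongo shell.

```javascript
// Build the same 1000 documents that the loop above inserts one by one.
const docs = [];
for (let i = 1; i <= 1000; i++) {
  docs.push({ first: "helloWorld", rollno: i });
}

// `db` only exists inside the mongo shell; elsewhere just report the size.
if (typeof db !== "undefined") {
  db.indianUsers.insertMany(docs);
} else {
  console.log("prepared " + docs.length + " documents");
}
```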

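Check the shard distribution again. The data is now divided across the 3 shards: because the shard key is hashed, mongos routes each insert to the shard that owns the matching chunk, and the balancer keeps the chunks evenly spread across the shards.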

db.indianUsers.getShardDistribution()
...
db.<collectionName>.getShardDistribution()

Output:

Shard testing-shard-2-v2 at testing-shard-2-v2/<replica-server-ip:port>
 data : 17KiB docs : 324 chunks : 2
 estimated data per chunk : 8KiB
 estimated docs per chunk : 162

Shard testing-shard-1-v2 at testing-shard-1-v2/<replica-server-ip:port>
 data : 17KiB docs : 332 chunks : 2
 estimated data per chunk : 8KiB
 estimated docs per chunk : 166

Shard testing-shard-3-v2 at testing-shard-3-v2/<replica-server-ip:port>
 data : 18KiB docs : 344 chunks : 2
 estimated data per chunk : 9KiB
 estimated docs per chunk : 172

Totals
 data : 53KiB docs : 1000 chunks : 6
 Shard testing-shard-2-v2 contains 32.4% data, 32.4% docs in cluster, avg obj size on shard : 55B
 Shard testing-shard-1-v2 contains 33.2% data, 33.2% docs in cluster, avg obj size on shard : 55B
 Shard testing-shard-3-v2 contains 34.39% data, 34.39% docs in cluster, avg obj size on shard : 55B
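
To judge how even this distribution is, the document counts from the output can be compared with a perfectly even split. The numbers below are copied from the output above; the variable names are my own.

```javascript
// Document counts per shard, copied from the getShardDistribution() output.
const docsPerShard = {
  "testing-shard-1-v2": 332,
  "testing-shard-2-v2": 324,
  "testing-shard-3-v2": 344,
};

const counts = Object.keys(docsPerShard).map((name) => docsPerShard[name]);
const total = counts.reduce((sum, n) => sum + n, 0);
const ideal = total / counts.length; // share per shard in a perfect split
const spread = Math.max(...counts) - Math.min(...counts);

// Largest deviation from the ideal share, as a percentage of the ideal.
const worstDeviation =
  (Math.max(...counts.map((n) => Math.abs(n - ideal))) / ideal) * 100;

console.log("total docs: " + total);
console.log("docs between fullest and emptiest shard: " + spread);
console.log("worst deviation from an even split: " + worstDeviation.toFixed(1) + "%");
```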

Conclusion

MongoDB makes sharding and scaling more manageable than many other popular databases.

I hope this article was able to shed some light on what sharding is in MongoDB.

Lavanya Jain