Recently in MongoDB Category

Backup in mongodb replica-set configurations

There are multiple ways to take backups of mongodb is different configuraitions, one of the configuration that I have been involved recently is replica-sets. When mongodb is running in replica-set configuration, there is a single primary node and multiple secondary nodes. To take backup of the replica-set we can either do a mongodump of one of the nodes or shutdown one of the secondary nodes and take file copies, since in a replica-set all nodes have the same data (except arbiter). Lets see how we could deal with mongodump method of taking backup.

Sync the primary node so that all writes are flused to disk and lock the database for writes, doing a fsync allows for all writes to be persisted to the disk.

use admin
db.fsyncLock()

now we can issue the mongodump command, there are many more options for mongodump that can be changed, using defaults here

mongodump -h node_name  --out /data/backups/backup_file_name

once the mongodump command is done we can unlock the database so that writes can be issued.

use admin
db.fsyncUnlock()

The downside of this approach is that the primary is not available for writes, but reads are fine. If by any chance there is write issued, all reads are also blocked after that, which is pretty drastic. The other option is to operate on one of the secondaries, this allows us to keep the primary available for write and read, the secondary can be used for backup purposes. Which node is PRIMARY or SECONDARY can be dynamically determined by running some javascript on the command line

#!/usr/bin/env ruby
require 'rubygems'
require 'json'
mongo_nodes = JSON.parse `mongo node_name --quiet 
              --eval "printjson(rs.status().members.map(
                        function(m) { return 
                       {'name':m.name, 'stateStr':m.stateStr} }))"`
primary_node = mongo_nodes.detect { 
                   |member| member['stateStr'] == 'PRIMARY'
                    }

This dynamic script allows us to find the node that we want to take backup from, either the primary or secondary.

Replica sets in MongoDB

Replica sets is a feature of MongoDB for Automatic Failover, in this setup there is a primary server and the rest are secondary servers. If the primary server goes down, the rest of the secondary servers choose a new primary via an election process, each server can also be assigned number of votes, so that you can decide the next primary based on data-center location, machine properties etc, you can also start mongo database processes that act only as election tie-breakers these are known as arbiters, these arbiters will never have data, but just act as agents that break the tie.

All operations are directed at the primary server, the primary server writes the operations to its operation log (also known as opslog), the secondary servers get updates from the primary server. The data is written to the primary server and later replicated to the other secondary servers, so when the write happens at the primary and before the write is replicated to the secondary servers, if the primary server goes down you will loose the data that was written to the primary but never replicated to the secondary servers, you can get around this by specifying how many servers should have the data, before the write is considered good

db.runCommand( { getlasterror : 1 , w : 3 } )
in the above command, you are saying that the write to the database is considered good, only if the write has been propagated to at least 3 servers, off course doing this for every write is going to be very expensive, so you should batch all your writes for a user action and then issue getlasterror

This is how you start the the mongod servers, in a replica set, the can run on any machine any port, as long as they can talk to each other over the network and all of them have the same "--replSet" parameter, in the example below its "prod"

mongod --replSet prod --port 27017 --dbpath /data/node1 
mongod --replSet prod --port 27027 --dbpath /data/node2 
mongod --replSet prod --port 27037 --dbpath /data/node3 

Once the three servers are up, you have to create a replica configuration as shown below, if you use localhost as a server name, then all the members of the replica set have to be on localhost, if the mongo servers are on different servers, you should use distinct machine names and not localhost for anyone of them, once the replica config is defined, you then initiate the replica using the configuration as shown below

replica_config = {_id: 'prod', members: [
                          {_id: 0, host: 'localhost:27017'},
                          {_id: 1, host: 'localhost:27027'},
                          {_id: 2, host: 'localhost:27037'}]}
#Now initiate the replica_config
rs.initiate(replica_config);
When you are connecting to a replica set, you have to connect to atleast one server which is alive, using the ruby driver you can connect to more than one server using the "multi" method, one part you should be careful about is, lets say you define all the servers in the replica set as your connection string, but one of the members of the replica set is down, you will get connection failures, so the best thing to do is give members of the replica set that are up and the drivers will discover the other servers when they come online or go offline. Here is a sample ruby program to find a doc in a loop.
#!/usr/bin/env ruby
require 'mongo'
begin
  @connection = Mongo::Connection.multi([
                                       ['localhost',27017],
                                       ['localhost',27027],
                                       ['localhost',27037]])
  @collection = @connection.db("sales").collection("products")
  product = { "name" => "Refactoring", 
              "code" => "023XX3",
              "type" => "book", 
              "in_stock" => 100}
  @collection.insert(product)
  100.times do
    sleep 0.5
    begin
  	  product = @collection.find_one "code" => "023XX3"
  	  puts "Found Book: "+product["name"]
  	rescue Exception => e
    	puts e.message
    	next
  	end
  end
end
While the ruby program is running, you can kill the current primary and you will see that the program gets connection exceptions, while the replica set is figuring out the next master, once the next master is picked, the program starts going about its way finding the same data from the newly elected primary, here is a screen cast of the replica sets in action. Replica Sets screencast

Schema less databases and its ramifications.

In the No-SQL land schema-less is a power full feature that is advertised a lot, schema-less basically means you don't have to worry about column names and table names in a traditional sense, if you want to change the column name you just start saving the data using the new column name Lets say you have a document database like mongoDB and you have JSON document as shown below.

{  "_id":"4bc9157e201f254d204226bf",
   "FIRST_NAME":"JOHN",
   "MIDDLE_NAME":"D",
   "LAST_NAME":"DOE",
   "CREATED":"2010-10-12"
}

You have some corresponding code to read the documents from the database and lets say you lots of data in the database in the order of millions of documents. If you want to change the name of some attributes or columns at this point and the new JSON would look like

{  "_id":"4bc9157e201f254d204226bf",
   "first_name":"JOHN",
   "middle_name":"D",
   "last_name":"DOE",
   "created":"2010-10-12"
}

You will have to either change every document in the database to match the new attribute names or you have to make sure you code can handle both types of attribute names like

   first_name = doc["first_name"] 
   first_name = doc["FIRST_NAME"] unless !first_name.nil? 
   middle_name = doc["middle_name"]  
   middle_name = doc["MIDDLE_NAME"] unless !middle_name.nil?
   last_name = doc["last_name"]
   last_name = doc["LAST_NAME"] unless !last_name.nil?

This attribute name change also affects the indexes created on mongoDB, since the attribute name change is not across all the documents, an Index created on

   db.people.ensureIndex({first_name:1})

will not index documents where the attribute name is FIRST_NAME, so you have to create another index for this new attribute name

   db.people.ensureIndex({FIRST_NAME:1})

As you can see this gets really complicated if you do multiple refactorings, over a period of time. So when you hear schema less make sure you understand the ramifications of refactoring the attribute names at will and its effect on the code base and the database.

Schema design in a document database

We are using MongoDB on our project, since mongo is document store, schema design is somewhat different, when you are using traditional RDBMS data stores, one thinks about tables and rows, while using a document database you have to think about the schema in a some what different way. Lets say, we want to save a customer object, when using a RDBMS we would come up with Customer, Address, Phone, Email. They are related to each other as shown below. customer.jpg When doing a document database, the schema design actually does not change much, the Customer document contains an array of Addresses, a one to many relationship. You will not need the FK columns or the Primary Key columns on the child tables, since the child rows are embedded in the parent object. The JSON object below shows how the data would look.
{
"_id" : ObjectId("4bd8ae97c47016442af4a580"),
"customerid" : 99999,
"name" : "Foo Sushi Inc",
"type" : "Good",
"since" : "12/12/2001",
"addresses" : [{
		"address" : "4821 Big Street",
		"city" : "Stone",			
		"state" : "IL",
		"country" : "USA"
	},
	{	"address" : "1248 Barlow Ln",
		"city" : "Hedgestone",			
		"country" : "UK"
	}		
],
"emails" : [ 
	{"email" : "foousa@sushi.com"},
	{"email" : "foouk@sushi.com"}
],
"phones" : [ 
	{"phone" : "773-7777-7777"},
	{"phone" : "020-6666-6666"}
]
}
So Instead of 1 Row for customer, 2 rows for address, phone and email each, you get one Customer document. If you want to query for customers in USA. Using RDBMS you would do
SELECT customer.name FROM customer, address 
WHERE customer.customerid = address.customerid 
AND address.country="USA"
The same query in mongo would look like
db.customers.find({"addresses.country":"USA"},{"name":true})
where customers is the collection in which we store our customers.

My experience with MongoDB

The current project I'm on is using MongoDB. MongoDB is a document based database, it stores JSON objects as BSON (Binary JSON objects). MongoDB provides a middle ground between the traditional RDBMS and the NOSql databases out there, it provides for indexes, dynamic queries, replication, map reduce and auto sharding, its open source and can be downloaded here, starting up mongodb is pretty easy.
./mongod --dbpath=/user/data/db
is all you need, where /user/data/db is the path where you want mongo to create its data files. There are many other options that you can use to customize the mongo instance. Each mongo instance has databases and each database has many collections, mapping back to oracle, mongo database is a oracle schema and mongo collection is a oracle table, The difference is, each collection can hold any type of object, basically every row can be different. An example connection to the database using java looks like this
   Mongo mongo = new Mongo("localhost");
   db = mongo.getDB("mydatabase");
If the "mydatabase" does not exist, it will be created. When you want to put objects in the database, you need to have a collection which holds the objects.
   users = db.getCollection("applicationusers");
if the "applicationusers" collection does not exist, it will be created, at this point you are ready to put objects into the collection.
    BasicDBObject userDocument = new BasicDBObject();
    userDocument.put("name", "jack");
    userDocument.put("type", "super");
    users.insert(userDocument);
You create a document by using the BasicDBObject and put attribute names and their values, in the above example "name" is the attribute and "jack" is the value, the users.insert takes the document and inserts it into the collection "users". At this point you have a JSON object put into the database. You can query for the object using the mongo query tool or the rest full api mongo provides using the flag --rest, when you start mongodb, visiting http://127.0.0.1:28017/mydatabase/users/ should give you
{
  "offset" : 0,
  "rows": [
    { "_id" : { "$oid" : "4bc9157e201f254d204226bf" }, "name" : "jack", "type" : "super" }
  ],
  "total_rows" : 1 ,
  "query" : {} ,
  "millis" : 0
}
Every object you insert, gets a auto generated id, more about update, delete and complex objects in next blog post.

All Content Copyright © 2012 Pramod Sadalage. All Rights Reserved.