I sharded my mongoDB cluster by hashed _id. I checked the index size, there lies an _id_hashed index which is taking much space:
"indexSizes" : { "_id_" : 14060169088, "_id_hashed" : 9549780576 },
mongoDB manual says that an index on the sharded key is created if you shard a collection. I guess that is the reason the _id_hashed index is out there.
My question is : what is the _id_hashed index for if I only query document by the _id field? can I delete it? as it takes too much space.
ps: it seems mongoDB use the _id index when query, not the _id_hashed index. execution plan for a query:
"clusteredType" : "ParallelSort", "shards" : { "rs1/192.168.62.168:27017,192.168.62.181:27017" : [ { "cursor" : "BtreeCursor _id_", "isMultiKey" : false, "n" : 0, "nscannedObjects" : 0, "nscanned" : 1, "nscannedObjectsAllPlans" : 0, "nscannedAllPlans" : 1, "scanAndOrder" : false, "indexOnly" : false, "nYields" : 0, "nChunkSkips" : 0, "millis" : 0, "indexBounds" : { "start" : { "_id" : "spiderman_task_captainStatus_30491467_2387600" }, "end" : { "_id" : "spiderman_task_captainStatus_30491467_2387600" } }, "server" : "localhost:27017" } ] }, "cursor" : "BtreeCursor _id_", "n" : 0, "nChunkSkips" : 0, "nYields" : 0, "nscanned" : 1, "nscannedAllPlans" : 1, "nscannedObjects" : 0, "nscannedObjectsAllPlans" : 0, "millisShardTotal" : 0, "millisShardAvg" : 0, "numQueries" : 1, "numShards" : 1, "indexBounds" : { "start" : { "_id" : "spiderman_task_captainStatus_30491467_2387600" }, "end" : { "_id" : "spiderman_task_captainStatus_30491467_2387600" } }, "millis" : 574
MongoDB uses a range based sharding approach. If you choose to use hashed based sharding, you must have a hashed index on the shard key and cannot drop it since it will be used to determine shard to use for any subsequent queries ( note that there is an open ticket to allow you to drop the _id index once hashed indexes are allowed to be unique SERVER-8031 ).
As to why the query appears to be using the _id index rather than the _id_hashed index - I ran some tests and I think the optimizer is choosing the _id index because it is unique and results in a more efficient plan. You can see similar behavior if you shard on another key that has a pre-existing unique index.