Recently , I found a spike cpu use(over 200%-400% ) in our standalone mongo(4.2.2 version) which deployed with docker on a 16g , 6 core cpu machine .
[root@localhost ~]# free -h
total used free shared buff/cache available
Mem: 15G 6.1G 1.0G 17M 8.4G 9.0G
Swap: 11G 528M 11G
top - 14:02:41 up 54 days, 21:50, 0 users, load average: 2.25, 1.06, 0.80
Tasks: 9 total, 1 running, 8 sleeping, 0 stopped, 0 zombie
%Cpu(s): 75.4 us, 2.1 sy, 0.0 ni, 20.7 id, 0.0 wa, 0.0 hi, 1.8 si, 0.0 st
KiB Mem : 16266496 total, 203948 free, 6697376 used, 9365172 buff/cache
KiB Swap: 12386300 total, 11849724 free, 536576 used. 9212400 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 mongodb 20 0 5431804 3.403g 9344 S 441.2 21.9 26714:21 mongod
2300 root 20 0 1005132 10320 4476 S 0.7 0.1 0:00.49 mongostat
1023 root 20 0 18492 2136 1672 S 0.0 0.0 0:00.03 bash
1404 root 20 0 18600 2140 1676 S 0.0 0.0 0:00.03 bash
1615 root 20 0 18496 2124 1672 S 0.0 0.0 0:00.03 bash
1662 root 20 0 38856 2064 1396 S 0.0 0.0 3:31.04 top
1952 root 20 0 18496 2156 1680 S 0.0 0.0 0:00.06 bash
2255 root 20 0 18496 2128 1660 S 0.0 0.0 0:00.02 bash
2319 root 20 0 38724 1812 1308 R 0.0 0.0 0:00.01 top
Later we figured out it was the batch insert operation (unorder) that cause the problem , more specifically , is the index created on collection that we operate on . I say so because i drop the index and cpu usage drop to 50% , and back to 200-300% when i recreate it (still with batch insert enable). Down below is a fraction of my mongostat result
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn time
554 *0 1036 *0 0 3|0 6.5% 79.5% 0 5.18G 3.40G 0|0 1|0 548k 62.5k 60 Nov 25 14:02:58.107
769 2 2599 *0 2 2|0 6.3% 79.4% 0 5.18G 3.40G 0|0 1|0 925k 869k 60 Nov 25 14:02:59.109
555 72 1049 *0 6 4|0 6.2% 79.3% 0 5.18G 3.40G 0|0 1|0 646k 547k 60 Nov 25 14:03:00.107
720 514 1985 *0 72 0|0 5.9% 79.3% 0 5.18G 3.40G 0|0 1|0 1.66m 5.64m 60 Nov 25 14:03:01.107
553 40 2735 *0 7 2|0 5.6% 79.1% 0 5.18G 3.40G 0|0 1|0 1.01m 863k 60 Nov 25 14:03:02.109
438 3 1379 *0 3 1|0 5.3% 78.9% 0 5.18G 3.40G 0|0 1|0 561k 854k 60 Nov 25 14:03:03.107
143 *0 892 *0 0 6|0 5.0% 78.8% 0 5.18G 3.40G 0|0 1|0 198k 48.4k 60 Nov 25 14:03:04.108
*0 96 404 *0 9 7|0 4.7% 78.7% 0 5.18G 3.40G 0|0 1|0 197k 1.47m 60 Nov 25 14:03:05.108
1699 530 2325 *0 77 3|0 5.4% 79.0% 0 5.18G 3.40G 0|0 1|0 2.69m 5.54m 60 Nov 25 14:03:06.107
874 2 3640 *0 2 1|0 5.4% 78.9% 0 5.18G 3.40G 0|0 1|0 1.02m 874k 60 Nov 25 14:03:07.107
The index i mentions above is a single field index , and i really dont know why the index has such a huge impact on the cpu usage . From my point of view , insert operation consumer I/O rather than cpu . If mongo should traverse the index to locate the exact position to insert new document , then the number inserted isn't really big(around 1000 , seem from mongostat) , and i dont think mongo could't process it properly and thus has such a huge overhead on cpu
db.getCollection("mycollection").createIndex({code:1});
Im not familiar with mongo , maybe i miss or misunderstand something really happen during mongo insert operation with index. And i dont know if there is some setting or variable to help to tune the performance. Thanks advanced if anyone knows how to fix this problem !!!
New Edit: supply some info
no strange docker restriction to mongo
docker run -p 27017:27017 --restart always -v /data/mongo1/db:/data/db -v /etc/localtime:/etc/localtime --name mongo1 -m 8G -d mongo
thread info during high cpu usage period
top - 10:25:59 up 55 days, 18:13, 0 users, load average: 2.29, 2.62, 2.04
Threads: 99 total, 5 running, 94 sleeping, 0 stopped, 0 zombie
%Cpu(s): 76.3 us, 2.0 sy, 0.0 ni, 19.8 id, 0.1 wa, 0.0 hi, 1.8 si, 0.0 st
KiB Mem : 16266496 total, 742292 free, 5017916 used, 10506288 buff/cache
KiB Swap: 12386300 total, 11807484 free, 578816 used. 10891612 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31 mongodb 20 0 5431804 1.772g 9228 R 99.7 11.4 3301:11 mongod
33 mongodb 20 0 5431804 1.772g 9228 R 99.3 11.4 3619:33 mongod
34 mongodb 20 0 5431804 1.772g 9228 R 99.0 11.4 3506:26 mongod
38 mongodb 20 0 5431804 1.772g 9228 R 96.7 11.4 1034:06 WTCheck.tThread
32 mongodb 20 0 5431804 1.772g 9228 S 16.6 11.4 3625:46 mongod
2450 mongodb 20 0 5431804 1.772g 9228 S 3.0 11.4 0:21.02 conn1431
2462 mongodb 20 0 5431804 1.772g 9228 S 2.7 11.4 0:59.76 conn1443
2452 mongodb 20 0 5431804 1.772g 9228 S 2.3 11.4 0:53.96 conn1433
I set slow log threshold to 50 ms, and get some slow log related to insert
db.getCollection("").insert( {
op: "insert",
ns: "stock_market_a.t_a_stock_trade12",
command: {
insert: "t_a_stock_trade12",
ordered: false,
$db: "stock_market_a"
},
ninserted: NumberInt("1"),
keysInserted: NumberInt("2"),
numYield: NumberInt("0"),
locks: {
ParallelBatchWriterMode: {
acquireCount: {
r: NumberLong("2")
}
},
ReplicationStateTransition: {
acquireCount: {
w: NumberLong("2")
}
},
Global: {
acquireCount: {
r: NumberLong("1"),
w: NumberLong("1")
}
},
Database: {
acquireCount: {
w: NumberLong("1")
}
},
Collection: {
acquireCount: {
w: NumberLong("1")
}
},
Mutex: {
acquireCount: {
r: NumberLong("2")
}
}
},
flowControl: {
acquireCount: NumberLong("1")
},
storage: { },
responseLength: NumberInt("45"),
protocol: "op_msg",
millis: NumberInt("100"),
ts: ISODate("2021-11-26T02:54:46.171Z"),
client: "192.168.1.128",
allUsers: [ ],
user: ""
} );
Down below are server parameters info obtained by executing db.adminCommand( { getParameter: '*'} ) . Here I did change maxIndexBuildMemoryUsageMegabytes to 2048 ,and keep other default setting unchange , but it didn't solve the problem .
db.getCollection("").insert( {
AlwaysRecordTraffic: "",
KeysRotationIntervalSec: NumberInt("7776000"),
ShardingTaskExecutorPoolHostTimeoutMS: NumberInt("300000"),
ShardingTaskExecutorPoolMaxConnecting: NumberInt("2"),
ShardingTaskExecutorPoolMaxSize: NumberInt("32767"),
ShardingTaskExecutorPoolMinSize: NumberInt("1"),
ShardingTaskExecutorPoolRefreshRequirementMS: NumberInt("60000"),
ShardingTaskExecutorPoolRefreshTimeoutMS: NumberInt("20000"),
ShardingTaskExecutorPoolReplicaSetMatching: "matchPrimaryNode",
TransactionRecordMinimumLifetimeMinutes: NumberInt("30"),
adaptiveServiceExecutorIdlePctThreshold: NumberInt("60"),
adaptiveServiceExecutorMaxQueueLatencyMicros: NumberInt("500"),
adaptiveServiceExecutorRecursionLimit: NumberInt("8"),
adaptiveServiceExecutorReservedThreads: NumberInt("-1"),
adaptiveServiceExecutorRunTimeJitterMillis: NumberInt("500"),
adaptiveServiceExecutorRunTimeMillis: NumberInt("5000"),
adaptiveServiceExecutorStuckThreadTimeoutMillis: NumberInt("250"),
allowRolesFromX509Certificates: true,
allowSecondaryReadsDuringBatchApplication: true,
authSchemaVersion: NumberInt("5"),
authenticationMechanisms: [
"MONGODB-X509",
"SCRAM-SHA-1",
"SCRAM-SHA-256"
],
authorizationManagerCacheSize: NumberInt("100"),
authorizationManagerPinnedUsers: [ ],
authorizationManagerPinnedUsersRefreshIntervalMillis: NumberLong("1000"),
bgSyncOplogFetcherBatchSize: NumberInt("13981010"),
cachePressureThreshold: NumberInt("95"),
checkCachePressurePeriodSeconds: NumberInt("5"),
clientCursorMonitorFrequencySecs: NumberInt("4"),
cloudFreeMonitoringEndpointURL: "https://cloud.mongodb.com/freemonitoring/mongo",
clusterAuthMode: "undefined",
collectionClonerBatchSize: NumberInt("0"),
collectionClonerUsesExhaust: true,
connPoolMaxConnsPerHost: NumberInt("200"),
connPoolMaxInUseConnsPerHost: NumberInt("2147483647"),
connPoolMaxShardedConnsPerHost: NumberInt("200"),
connPoolMaxShardedInUseConnsPerHost: NumberInt("2147483647"),
createRollbackDataFiles: true,
cursorTimeoutMillis: NumberLong("600000"),
diagnosticDataCollectionDirectoryPath: "/data/db/diagnostic.data",
diagnosticDataCollectionDirectorySizeMB: NumberInt("200"),
diagnosticDataCollectionEnabled: true,
diagnosticDataCollectionFileSizeMB: NumberInt("10"),
diagnosticDataCollectionPeriodMillis: NumberInt("1000"),
diagnosticDataCollectionSamplesPerChunk: NumberInt("300"),
diagnosticDataCollectionSamplesPerInterimUpdate: NumberInt("10"),
disableIndexSpecNamespaceGeneration: false,
disableJavaScriptJIT: true,
disableLogicalSessionCacheRefresh: false,
disableNonSSLConnectionLogging: false,
disableNonTLSConnectionLogging: false,
disabledSecureAllocatorDomains: [ ],
enableElectionHandoff: true,
enableFlowControl: true,
enableHybridIndexBuilds: true,
enableIndexBuildsCoordinatorForCreateIndexesCommand: false,
enableLocalhostAuthBypass: true,
enableTestCommands: false,
failIndexKeyTooLong: true,
featureCompatibilityVersion: {
version: "4.2"
},
flowControlDecayConstant: 0.5,
flowControlFudgeFactor: 0.95,
flowControlMaxSamples: NumberInt("1000000"),
flowControlMinTicketsPerSecond: NumberInt("100"),
flowControlSamplePeriod: NumberInt("1000"),
flowControlTargetLagSeconds: NumberInt("10"),
flowControlThresholdLagPercentage: 0.5,
flowControlTicketAdderConstant: NumberInt("1000"),
flowControlTicketMultiplierConstant: 1.05,
flowControlWarnThresholdSeconds: NumberInt("10"),
forceRollbackViaRefetch: false,
globalConnPoolIdleTimeoutMinutes: NumberInt("2147483647"),
heapProfilingEnabled: false,
heapProfilingSampleIntervalBytes: NumberLong("262144"),
honorSystemUmask: false,
initialSyncOplogBuffer: "collection",
initialSyncOplogBufferPeekCacheSize: NumberInt("10000"),
initialSyncOplogFetcherBatchSize: NumberInt("13981010"),
internalDocumentSourceCursorBatchSizeBytes: NumberInt("4194304"),
internalDocumentSourceGroupMaxMemoryBytes: NumberLong("104857600"),
internalDocumentSourceLookupCacheSizeBytes: NumberInt("104857600"),
internalDocumentSourceSortMaxBlockingSortBytes: NumberLong("104857600"),
internalGeoNearQuery2DMaxCoveringCells: NumberInt("16"),
internalGeoPredicateQuery2DMaxCoveringCells: NumberInt("16"),
internalInsertMaxBatchSize: NumberInt("64"),
internalLookupStageIntermediateDocumentMaxSizeBytes: NumberLong("104857600"),
internalProhibitShardOperationRetry: false,
internalQueryAllowShardedLookup: false,
internalQueryAlwaysMergeOnPrimaryShard: false,
internalQueryCacheDisableInactiveEntries: false,
internalQueryCacheEvictionRatio: 10,
internalQueryCacheFeedbacksStored: NumberInt("20"),
internalQueryCacheListPlansNewOutput: false,
internalQueryCacheSize: NumberInt("5000"),
internalQueryCacheWorksGrowthCoefficient: 2,
internalQueryDisableExchange: false,
internalQueryEnumerationMaxIntersectPerAnd: NumberInt("3"),
internalQueryEnumerationMaxOrSolutions: NumberInt("10"),
internalQueryExecMaxBlockingSortBytes: NumberInt("33554432"),
internalQueryExecYieldIterations: NumberInt("128"),
internalQueryExecYieldPeriodMS: NumberInt("10"),
internalQueryFacetBufferSizeBytes: NumberInt("104857600"),
internalQueryForceIntersectionPlans: false,
internalQueryIgnoreUnknownJSONSchemaKeywords: false,
internalQueryMaxScansToExplode: NumberInt("200"),
internalQueryPlanEvaluationCollFraction: 0.3,
internalQueryPlanEvaluationMaxResults: NumberInt("101"),
internalQueryPlanEvaluationWorks: NumberInt("10000"),
internalQueryPlanOrChildrenIndependently: true,
internalQueryPlannerEnableHashIntersection: false,
internalQueryPlannerEnableIndexIntersection: true,
internalQueryPlannerGenerateCoveredWholeIndexScans: false,
internalQueryPlannerMaxIndexedSolutions: NumberInt("64"),
internalQueryProhibitBlockingMergeOnMongoS: false,
internalQueryProhibitMergingOnMongoS: false,
internalQueryS2GeoCoarsestLevel: NumberInt("0"),
internalQueryS2GeoFinestLevel: NumberInt("23"),
internalQueryS2GeoMaxCells: NumberInt("20"),
internalValidateFeaturesAsMaster: true,
javascriptProtection: false,
journalCommitInterval: NumberInt("100"),
jsHeapLimitMB: NumberInt("1100"),
localLogicalSessionTimeoutMinutes: NumberInt("30"),
logComponentVerbosity: {
verbosity: NumberInt("0"),
accessControl: {
verbosity: NumberInt("-1")
},
command: {
verbosity: NumberInt("-1")
},
control: {
verbosity: NumberInt("-1")
},
executor: {
verbosity: NumberInt("-1")
},
geo: {
verbosity: NumberInt("-1")
},
index: {
verbosity: NumberInt("-1")
},
network: {
verbosity: NumberInt("-1"),
asio: {
verbosity: NumberInt("-1")
},
bridge: {
verbosity: NumberInt("-1")
},
connectionPool: {
verbosity: NumberInt("-1")
}
},
query: {
verbosity: NumberInt("-1")
},
replication: {
verbosity: NumberInt("-1"),
election: {
verbosity: NumberInt("-1")
},
heartbeats: {
verbosity: NumberInt("-1")
},
initialSync: {
verbosity: NumberInt("-1")
},
rollback: {
verbosity: NumberInt("-1")
}
},
sharding: {
verbosity: NumberInt("-1"),
shardingCatalogRefresh: {
verbosity: NumberInt("-1")
}
},
storage: {
verbosity: NumberInt("-1"),
recovery: {
verbosity: NumberInt("-1")
},
journal: {
verbosity: NumberInt("-1")
}
},
write: {
verbosity: NumberInt("-1")
},
ftdc: {
verbosity: NumberInt("-1")
},
tracking: {
verbosity: NumberInt("-1")
},
transaction: {
verbosity: NumberInt("-1")
}
},
logLevel: NumberInt("0"),
logicalSessionRefreshMillis: NumberInt("300000"),
maxAcceptableLogicalClockDriftSecs: NumberLong("31536000"),
maxBSONDepth: NumberInt("200"),
maxIndexBuildDrainBatchSize: NumberInt("1000"),
maxIndexBuildDrainMemoryUsageMegabytes: NumberInt("16"),
maxIndexBuildMemoryUsageMegabytes: NumberInt("2048"),
maxLogSizeKB: NumberInt("10"),
maxNumberOfTransactionOperationsInSingleOplogEntry: NumberInt("2147483647"),
maxOplogTruncationPointsAfterStartup: NumberLong("100"),
maxOplogTruncationPointsDuringStartup: NumberLong("100"),
maxSessions: NumberInt("1000000"),
maxSyncSourceLagSecs: NumberInt("30"),
maxTargetSnapshotHistoryWindowInSeconds: NumberInt("5"),
maxTransactionLockRequestTimeoutMillis: NumberInt("5"),
migrateCloneInsertionBatchDelayMS: NumberInt("0"),
migrateCloneInsertionBatchSize: NumberInt("0"),
migrationLockAcquisitionMaxWaitMS: NumberInt("500"),
minMillisBetweenSnapshotWindowDec: NumberInt("500"),
minMillisBetweenSnapshotWindowInc: NumberInt("500"),
minOplogTruncationPoints: NumberLong("10"),
notablescan: false,
numInitialSyncAttempts: NumberInt("10"),
numInitialSyncCollectionCountAttempts: NumberInt("3"),
numInitialSyncCollectionFindAttempts: NumberInt("3"),
numInitialSyncConnectAttempts: NumberInt("10"),
numInitialSyncListCollectionsAttempts: NumberInt("3"),
numInitialSyncListDatabasesAttempts: NumberInt("3"),
numInitialSyncListIndexesAttempts: NumberInt("3"),
numInitialSyncOplogFindAttempts: NumberInt("3"),
opensslCipherConfig: "",
opensslDiffieHellmanParameters: "",
oplogFetcherInitialSyncMaxFetcherRestarts: NumberInt("10"),
oplogFetcherSteadyStateMaxFetcherRestarts: NumberInt("1"),
oplogInitialFindMaxSeconds: NumberInt("60"),
oplogRetriedFindMaxSeconds: NumberInt("2"),
oplogTruncationPointSizeMB: NumberInt("0"),
orphanCleanupDelaySecs: NumberInt("900"),
periodicNoopIntervalSecs: NumberInt("10"),
priorityTakeoverFreshnessWindowSeconds: NumberInt("2"),
quiet: false,
rangeDeleterBatchDelayMS: NumberInt("20"),
rangeDeleterBatchSize: NumberInt("0"),
recoverFromOplogAsStandalone: false,
replBatchLimitBytes: NumberInt("104857600"),
replBatchLimitOperations: NumberInt("5000"),
replElectionTimeoutOffsetLimitFraction: 0.15,
replWriterThreadCount: NumberInt("16"),
reportOpWriteConcernCountersInServerStatus: false,
reservedServiceExecutorRecursionLimit: NumberInt("8"),
roleGraphInvalidationIsFatal: false,
rollbackRemoteOplogQueryBatchSize: NumberInt("2000"),
rollbackTimeLimitSecs: NumberInt("86400"),
saslHostName: "04d109525aa0",
saslServiceName: "mongodb",
saslauthdPath: "",
scramIterationCount: NumberInt("10000"),
scramSHA256IterationCount: NumberInt("15000"),
scriptingEngineInterruptIntervalMS: NumberInt("1000"),
shardedConnPoolIdleTimeoutMinutes: NumberInt("2147483647"),
skipCorruptDocumentsWhenCloning: false,
skipShardingConfigurationChecks: false,
skipShellCursorFinalize: false,
snapshotWindowAdditiveIncreaseSeconds: NumberInt("2"),
snapshotWindowMultiplicativeDecrease: 0.75,
sslMode: "disabled",
sslWithholdClientCertificate: false,
startupAuthSchemaValidation: true,
suppressNoTLSPeerCertificateWarning: false,
syncdelay: 60,
synchronousServiceExecutorRecursionLimit: NumberInt("8"),
takeUnstableCheckpointOnShutdown: false,
taskExecutorPoolSize: NumberInt("1"),
tcmallocAggressiveMemoryDecommit: NumberInt("0"),
tcmallocMaxTotalThreadCacheBytes: NumberLong("1073741824"),
testingSnapshotBehaviorInIsolation: false,
tlsMode: "disabled",
tlsWithholdClientCertificate: false,
traceExceptions: false,
traceWriteConflictExceptions: false,
trafficRecordingDirectory: "",
transactionLifetimeLimitSeconds: NumberInt("60"),
transactionSizeLimitBytes: NumberLong("9223372036854775807"),
ttlMonitorEnabled: true,
ttlMonitorSleepSecs: NumberInt("60"),
useReadOnceCursorsForIndexBuilds: true,
waitForSecondaryBeforeNoopWriteMS: NumberInt("10"),
waitForStepDownOnNonCommandShutdown: true,
watchdogPeriodSeconds: NumberInt("-1"),
wiredTigerConcurrentReadTransactions: NumberInt("128"),
wiredTigerConcurrentWriteTransactions: NumberInt("128"),
wiredTigerCursorCacheSize: NumberInt("-100"),
wiredTigerEngineRuntimeConfig: "",
wiredTigerFileHandleCloseIdleTime: NumberInt("100000"),
wiredTigerFileHandleCloseMinimum: NumberInt("250"),
wiredTigerFileHandleCloseScanInterval: NumberInt("10"),
wiredTigerMaxCacheOverflowSizeGB: 0,
wiredTigerSessionCloseIdleTimeSecs: NumberInt("300"),
writePeriodicNoops: true,
ok: 1
} );
New discovery
When i drop the original index(built on a 6 digit string field ,say 688280, and i call it "code" which has a bad cardinality) ,and create a new one(built on a 17 digit time string,say 20211207150727460,and i call it "time" which has a good cardinality) ,the cpu usage drop significantly.
Thus , I guess the situation has something to do with the Cardinality of the index . Or maybe the sequentiality of the inserted data(time field has a better sequentiality) ?
Nevertheless , keepping the sequentiality of "code" matters for business query reason, so at the monent, i change the single field index to a composited index(add field _id to increase the cardinality and create a new one: code_1__id_-1). Check the cpu usage again , the result is of course not as good as "none index" or "time index" , but it also decline to around 100%(compare to previous 300%-400% usage) and sometime spike to 200% for a few seconds.
Anyway,this is a temporary solution and i still don't get the "point". If anyone knows anything ,pls offer some help, thanks in advance!!!
In general it is best you do the initial batch insert first and create the index in background later to keep your database still available during index creation: