I am studying the MIT 6.824 course, and the lab2 of the course is to implement the basic Raft consensus algorithm. The environment for my course labs is WSL2 of Windows operating system, and the Linux distribution is Ubuntu22.04.2 LTS, the version of Go is go1.20.1 linux/amd64, CPU is intel i7-12650h.
In my code implementation, after a cluster member is elected as Leader, a logSyncer Goroutine will be started for each members in the cluster, to regularly check whether there are new log entries that need to be replicated to the followers or send heartbeats.
The code logic of logSyncer is as follows:
func (rf *Raft) logSyncer(server int, expectTerm int) {
noOpsCnt := -1
for {
if rf.killed() {
return
}
rf.mu.Lock()
if rf.currentTerm != expectTerm {
rf.mu.Unlock()
return
}
curLastLogIndex := rf.getGlobalIndex(len(rf.log) - 1)
if curLastLogIndex >= rf.nextIdx[server] || (noOpsCnt+1)%5 == 0 {
// the condition to step into this block:
// 1. time has passed a Heartbeat period (5*20ms ~ 5*30ms)
// 2. new log entries already exist and can be sent to other followers
rf.syncEntriesToFollower(server, expectTerm)
noOpsCnt = 0
} else {
// no-ops, just increment counter
noOpsCnt++
}
rf.mu.Unlock()
// before starting a new for-loop, sleep 20~30 ms
randSleepTime := GetRandom(20, 30)
time.Sleep(time.Duration(randSleepTime))
}
}
However, when I start 40 sub-processes and continuously execute the lab test program about the log replication function multiple times, I may encounter the following puzzling situation every 2500 times, which ultimately leads to the failure of the test: command line execute test program
In the normal process, the leader will send a heartbeat to the follower every once in a while (I set it to 100~150ms) to ensure that the follower will not time out and start a new round of elections. However, in the failed case, there will be a time interval of 500 ms ~ 1s where the leader does not send a heartbeat, causing a follower to start a new round of election.
Then I begin the difficult process of debugging what I think there was some bugs in my implementation. I initially suspect that the rf.mu.Lock() operation can't return in a short time because the mutex is held while performing blocking operations (channel send, rpc or sleep, etc.)
I add a check() Goroutine to the above code. When the lock is not acquired for a long time(I set it to 200ms), it will print the stack frames of all Goroutines. But to my disappointment, running the failed test case doesn't trigger a panic and print the Goroutine function stack.
for{
//...
done := make(chan struct{}, 1)
check := func() {
ticker := time.NewTicker(200 * time.Millisecond)
defer ticker.Stop()
select {
case <-done:
return
case <-ticker.C:
// print Goroutines stack
buf := make([]byte, 24576) // 24 KB
buf = buf[:runtime.Stack(buf, true)]
fmt.Println(string(buf))
panic("spend more than 200ms to acquire rf.mu")
}
}
go check()
rf.mu.Lock()
// let check() exit
done<-struct{}{}
// ...
}
Later, I add more log information:
func (rf *Raft) logSyncer(server int, expectTerm int) {
noOpsCnt := -1
tPrevEndLoop := time.Now().UnixMilli()
for {
tBeginLoop := time.Now().UnixMilli()
// 1. print tBeginLoop - prevEndLoop
if rf.killed() {
return
}
tKill := time.Now().UnixMilli()
// 2. print tKill - tBeginLoop
// start check()
// ...
rf.mu.Lock()
// exit check()
// ...
tLock := time.Now().UnixMilli()
// 3. print tLock - tKill
curLastLogIndex := rf.getGlobalIndex(len(rf.log) - 1)
if curLastLogIndex >= rf.nextIdx[server] || (noOpsCnt+1)%5 == 0 {
// the condition to step into this block:
// 1. time has passed a Heartbeat period (5*20ms ~ 5*30ms)
// 2. new log entries already exist and can be sent to other followers
rf.syncEntriesToFollower(server, expectTerm)
noOpsCnt = 0
} else {
// no-ops, just increment counter
noOpsCnt++
}
rf.mu.Unlock()
tBeforeSleep := time.Now().UnixMilli()
// before starting a new for-loop, sleep 20~30 ms
randSleepTime := GetRandom(20, 30)
time.Sleep(time.Duration(randSleepTime))
tAfterSleep := time.Now().UnixMilli()
// 4. print tAfterSleep - tBeforeSleep
tPrevEndLoop = tAfterSleep
}
}
I add multiple time points and print the difference between them:
tPrevEndLoop: the end moment of the previous for-loop.tBeginLoop: the start moment of the current for-loop.tKill: the moment to check the survival status of the raft instance.tLock: the moment whenlogSyncerGoroutine acquires the mutex lockrf.mu.tBeforeSleep,tAfterSleep: the moment whenlogSyncerstarts and ends sleeping after executing a loop.
What I don't expect is that there would be an interval of up to 1s between the end of a for-loop and the execution of a new for-loop. This is incredible for code that does not include network and disk I/O!
017048 TEST Server 0 Leader Term 1, logSyncer-> Server 2, noOpsCnt 3,
(tBeginLoop - tPrevEndLoop) takes 1038 ms, (tKill- tBeginLoop) takes 0 ms, (tLock- tKill) takes 0 ms, (tAfterSleep - tBeforeSleep) takes 26ms
I would like to ask my friends, what may have happened to cause the above situation to happen?