How to pinpoint a function that is hanging


I have a simple Go program:

package main

import (
    "fmt"
    "net/http"
    _ "net/http/pprof"
    "os/exec"
)

func bar() {
    // wait 5 seconds
    cmd := exec.Command("sleep", "5")
    err := cmd.Run()
    if err != nil {
        panic(err)
    }
}

func fooHandler(w http.ResponseWriter, r *http.Request) {
    bar()
    fmt.Fprintf(w, "Hello, world!")
}

func main() {
    http.HandleFunc("/foo", fooHandler)
    http.ListenAndServe(":8080", nil)
}

All I want to do is use Go's trace feature to pinpoint the function causing the 5-second delay. However, I can't figure out how to do this.

I've run

curl -o trace.out http://localhost:8080/debug/pprof/trace?seconds=20

And then hit the /foo endpoint. That generates a trace.out file, which I can view with go tool trace trace.out. I can see this:

(screenshot of the trace timeline in go tool trace)

And indeed, around the 9-second mark the visualizer shows a thread taking 5 seconds to complete. But I can't find a stack trace or anything else that points to bar() as the culprit using up all the time. Am I using the wrong tool here?


There are 2 answers

coxley answered:

You have a couple options:

  1. Define custom trace regions with runtime/trace
  2. Use golang.org/x/net/trace and use the real-time visualizer

Option 1 is best if you want to dig into traces with the de facto go tool trace or with dominikh/gotraceui.

Option 2 is quick to start, but I haven't used it outside of sample programs. Here's how we can bolt it onto your example:

package main

import (
    "context"
    "fmt"
    "net/http"
    _ "net/http/pprof"
    "os/exec"

    "golang.org/x/net/trace"
)

func bar(ctx context.Context) {
    // FromContext returns ok=false if no trace is attached; calling
    // LazyPrintf on a nil Trace would panic, so check it.
    tr, ok := trace.FromContext(ctx)
    if ok {
        tr.LazyPrintf("bar")
    }

    cmd := exec.Command("sleep", "5")
    err := cmd.Run()
    if err != nil {
        panic(err)
    }
}

func fooHandler(w http.ResponseWriter, r *http.Request) {
    tr := trace.New("foo", r.URL.Path)
    defer tr.Finish()

    ctx := trace.NewContext(r.Context(), tr)
    bar(ctx)

    fmt.Fprintf(w, "Hello, world!")
}

func main() {
    // This pulls in /debug/requests and /debug/events
    mux := http.DefaultServeMux

    mux.HandleFunc("/foo", fooHandler)
    http.ListenAndServe(":8080", mux)
}

After issuing a request to /foo, we can open localhost:8080/debug/requests in the browser to see the following:

Screenshot of traces

Gillespie answered:

In case it wasn't obvious, my toy example was a surrogate for a real problem: a cold start delay in Cloud Run. During a cold start, the first request would take 20 seconds, but subsequent requests were fast.

One gotcha was that my webapp was behind a load balancer route, so I had to remap the default pprof trace endpoint:

import (
    "net/http"
    pprofhttp "net/http/pprof"
)

func main() {
    // ...
    mux.Handle("/trace", http.HandlerFunc(pprofhttp.Trace))
    // ...
}

Once I did that, I was indeed able to use curl -o trace.out 'https://example.com/mywebapp/trace?seconds=60' to debug the problem. The key for me was to ignore the event timeline view and focus on the goroutine analysis and profiles.


In the goroutine analysis, I discovered that 21 seconds were being tied up in "network wait":

(screenshot of the goroutine analysis view)

So then I went to the network profile and pretty far down found:

(screenshot of the network wait profile)

That led me to the problematic function.