BigQuery golang client not using storage API


In my bigquery client, I am enabling the storage read client, as per the documentation and examples at https://github.com/googleapis/google-cloud-go/blob/f2b13307a85e81e278476ea51a359cbb3974667a/bigquery/examples_test.go#L165-L186

My code is basically identical:

    err = bigQueryClient.BQ.EnableStorageReadClient(context.Background(), option.WithCredentialsFile(o.GoogleServiceAccountCredentialFile))
    // [...]

    it, err := query.Read(context.TODO())
    if err != nil {
        log.WithError(err).Error("error querying test status from bigquery")
        errs = append(errs, err)
        return status, errs
    }

    for {
        testStatus := apitype.ComponentTestStatusRow{}
        err := it.Next(&testStatus)
        if err == iterator.Done {
            break
        }
        // ...
    }

I then run a query that returns about 150,000 rows (roughly 150 MB) and iterate over them. This takes about 30 seconds, which seems very slow; a comparable query against Postgres takes about 3 seconds.

I was trying to figure out whether my app is using the Storage API at all; my understanding is that it's a gRPC-based protocol. Profiling the Go app with pprof shows a lot of time being spent in JSON decoding inside the BigQuery client, and tcpdump shows it making lots of HTTPS connections.

This makes me think it's not using the Storage API. Is there a way to confirm that, or to force the client to use the faster Storage API if it's not?

1 Answer

Answer from stbenjam:

While debugging and trying to use the Storage API directly, I discovered my account was missing the permission required to use it (bigquery.readsessions.create). EnableStorageReadClient in the Go library doesn't return an error in that case; everything just silently falls back to the much slower REST API.

Reproducer code

package main

import (
    "context"
    "errors"
    "flag"
    "fmt"
    "log"
    "time"

    "cloud.google.com/go/bigquery"
    "google.golang.org/api/iterator"
    "google.golang.org/api/option"
)

func main() {
    ctx := context.Background()

    serviceAccountPath := flag.String("service-account", "", "Path to service account JSON file")
    project := flag.String("project", "", "Project")
    dataset := flag.String("dataset", "", "Dataset")
    table := flag.String("table", "", "Table")
    flag.Parse()

    client, err := bigquery.NewClient(ctx, *project, option.WithCredentialsFile(*serviceAccountPath))
    if err != nil {
        log.Fatalf("Failed to create client: %v", err)
    }

    // Enable Storage API usage for fetching data
    if err := client.EnableStorageReadClient(ctx, option.WithCredentialsFile(*serviceAccountPath)); err != nil {
        log.Fatalf("Failed to enable storage read client: %v", err)
    }

    query := client.Query(fmt.Sprintf("SELECT * FROM `%s.%s.%s`", *project, *dataset, *table))

    start := time.Now()
    it, err := query.Read(ctx)
    if err != nil {
        log.Fatalf("Failed to read query: %v", err)
    }

    log.Printf("processing records...")
    var rows [][]bigquery.Value
    for {
        var row []bigquery.Value
        err := it.Next(&row)
        if errors.Is(err, iterator.Done) {
            break
        }
        if err != nil {
            log.Fatalf("Failed during iteration: %v", err)
        }
        rows = append(rows, row)
    }

    elapsed := time.Since(start)
    log.Printf("Time taken: %s\n", elapsed)
}

Results using an account with and without that permission:

$ go run bigquery.go -service-account $HOME/tokens/no-read-session.json -project XXXX -dataset XXXX -table junit_precomputed_20231004_20231031_414
Time taken: 1m50.318955958s

$ go run bigquery.go -service-account $HOME/tokens/with-read-session.json -project XXXX -dataset XXXX -table junit_precomputed_20231004_20231031_414
Time taken: 17.922398s
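On the IAM side, the missing bigquery.readsessions.create permission is carried by the predefined BigQuery Read Session User role. A sketch of granting it at the project level; the project and service-account names are placeholders:

```shell
# Grant roles/bigquery.readSessionUser (which includes
# bigquery.readsessions.create) to the service account.
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/bigquery.readSessionUser"
```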

I filed a bug against the Go client library about the silent fallback:

https://github.com/googleapis/google-cloud-go/issues/9102