I'm new to Go and am trying to take advantage of its concurrency to build a basic scraper that extracts the title, meta description, and meta keywords from URLs.
I am able to print the results to the terminal with the concurrency intact, but can't figure out how to write the output to CSV. I've tried every variation I could think of with my limited knowledge of Go, and most of them end up breaking the concurrency, so I'm losing my mind a bit.
My code and URL input file are below. Thanks in advance for any tips!
// file name: metascraper.go
package main

import (
    // import standard libraries
    "encoding/csv"
    "fmt"
    "io"
    "log"
    "os"
    "time"

    // import third party libraries
    "github.com/PuerkitoBio/goquery"
)

func csvParsing() {
    file, err := os.Open("data/sample.csv")
    checkError("Cannot open file ", err)
    if err != nil {
        // err is printable
        // elements passed are separated by space automatically
        fmt.Println("Error:", err)
        return
    }
    // automatically call Close() at the end of the current method
    defer file.Close()

    reader := csv.NewReader(file)
    // options are available at:
    // http://golang.org/src/pkg/encoding/csv/reader.go?s=3213:3671#L94
    reader.Comma = ';'
    lineCount := 0

    fileWrite, err := os.Create("data/result.csv")
    checkError("Cannot create file", err)
    defer fileWrite.Close()

    writer := csv.NewWriter(fileWrite)
    defer writer.Flush()

    for {
        // read just one record
        record, err := reader.Read()
        // end-of-file is returned in err
        if err == io.EOF {
            break
        } else if err != nil {
            fmt.Println("Error:", err)
            return
        }

        go func(url string) {
            // fmt.Println(msg)
            doc, err := goquery.NewDocument(url)
            if err != nil {
                checkError("No URL", err)
            }

            metaDescription := make(chan string, 1)
            pageTitle := make(chan string, 1)

            go func() {
                // time.Sleep(time.Second * 2)
                // use CSS selectors found with the browser inspector
                // for each, use index and item
                pageTitle <- doc.Find("title").Contents().Text()
                doc.Find("meta").Each(func(index int, item *goquery.Selection) {
                    if item.AttrOr("name", "") == "description" {
                        metaDescription <- item.AttrOr("content", "")
                    }
                })
            }()

            select {
            case res := <-metaDescription:
                resTitle := <-pageTitle
                fmt.Println(res)
                fmt.Println(resTitle)
                // Have been trying to output to CSV here but it's not working
                // writer.Write([]string{url, resTitle, res})
                // err := writer.WriteString(`res`)
                // checkError("Cannot write to file", err)
            case <-time.After(time.Second * 2):
                fmt.Println("timeout 2")
            }
        }(record[0])

        fmt.Println()
        lineCount++
    }
}

func main() {
    csvParsing()
    // pause before the program finishes so we can see the output
    var input string
    fmt.Scanln(&input)
}

func checkError(message string, err error) {
    if err != nil {
        log.Fatal(message, err)
    }
}
The data/sample.csv input file with URLs:
http://jonathanmh.com
http://keshavmalani.com
http://google.com
http://bing.com
http://facebook.com
In the code you supplied, you had commented out the following code:

    writer.Write([]string{url, resTitle, res})
This code is correct, except for one issue. Earlier in the function, you have the following code:

    defer fileWrite.Close()
This code causes the file writer to close once your csvParsing() func exits. Because you've closed the file writer with that defer, you are unable to write to it from your concurrent function.

Solution: you'll need to use defer fileWrite.Close() inside your concurrent func, or something similar, so that you do not close the file writer before you have written to it.
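For reference, here is a minimal sketch of that "something similar": a sync.WaitGroup keeps csvParsing from returning until every goroutine has finished, so the deferred Flush and Close only run afterwards, and a sync.Mutex serializes the Write calls, since csv.Writer is not safe for concurrent use. Neither of those is in your original code, and the channel/timeout select is omitted to keep the sketch short:

// metascraper.go (sketch)
package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "log"
    "os"
    "sync"

    "github.com/PuerkitoBio/goquery"
)

func csvParsing() {
    file, err := os.Open("data/sample.csv")
    checkError("Cannot open file", err)
    defer file.Close()

    reader := csv.NewReader(file)
    reader.Comma = ';'

    fileWrite, err := os.Create("data/result.csv")
    checkError("Cannot create file", err)
    defer fileWrite.Close() // runs last, after the flush below

    writer := csv.NewWriter(fileWrite)
    defer writer.Flush() // runs after wg.Wait(), so no rows are lost

    var (
        wg sync.WaitGroup // tracks the in-flight scrape goroutines
        mu sync.Mutex     // csv.Writer is not goroutine-safe
    )

    for {
        record, err := reader.Read()
        if err == io.EOF {
            break
        } else if err != nil {
            fmt.Println("Error:", err)
            return
        }

        wg.Add(1)
        go func(url string) {
            defer wg.Done()

            doc, err := goquery.NewDocument(url)
            if err != nil {
                log.Println("No URL:", err)
                return
            }

            title := doc.Find("title").Contents().Text()
            description := ""
            doc.Find("meta").Each(func(_ int, item *goquery.Selection) {
                if item.AttrOr("name", "") == "description" {
                    description = item.AttrOr("content", "")
                }
            })

            // serialize access to the shared csv.Writer
            mu.Lock()
            defer mu.Unlock()
            if err := writer.Write([]string{url, title, description}); err != nil {
                log.Println("Cannot write to file:", err)
            }
        }(record[0])
    }

    // Block until every goroutine has written its row; only then do the
    // deferred writer.Flush() and fileWrite.Close() run.
    wg.Wait()
}

func main() {
    csvParsing()
}

func checkError(message string, err error) {
    if err != nil {
        log.Fatal(message, err)
    }
}

With this structure every row the goroutines produce lands in data/result.csv before the file is closed, and the fmt.Scanln pause in main is no longer needed because csvParsing itself waits.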