I'm still learning Go and was working on the web crawler exercise linked here. The main part I implemented is shown below; the other parts are unchanged and can be found in the link.
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
    // TODO: Fetch URLs in parallel.
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    cache.Set(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    for _, u := range urls {
        if !cache.Get(u) {
            fmt.Println("Next:", u)
            Crawl(u, depth-1, fetcher) // I want to parallelize this
        }
    }
}

func main() {
    Crawl("https://golang.org/", 4, fetcher)
}
type SafeCache struct {
    v   map[string]bool
    mux sync.Mutex
}

// Set marks a URL as fetched.
func (c *SafeCache) Set(key string) {
    c.mux.Lock()
    c.v[key] = true
    c.mux.Unlock()
}

// Get reports whether a URL has already been fetched.
// The lock is needed here too, since the map may be read
// while another goroutine is writing to it.
func (c *SafeCache) Get(key string) bool {
    c.mux.Lock()
    defer c.mux.Unlock()
    return c.v[key]
}

var cache = SafeCache{v: make(map[string]bool)}
When I run the code above, the results are as expected:
found: https://golang.org/ "The Go Programming Language"
Next: https://golang.org/pkg/
found: https://golang.org/pkg/ "Packages"
Next: https://golang.org/cmd/
not found: https://golang.org/cmd/
Next: https://golang.org/pkg/fmt/
found: https://golang.org/pkg/fmt/ "Package fmt"
Next: https://golang.org/pkg/os/
found: https://golang.org/pkg/os/ "Package os"
However, when I tried to parallelize the crawler (on the line with a comment in the program above) by changing Crawl(u, depth-1, fetcher) to go Crawl(u, depth-1, fetcher), the results were not as I expected:
found: https://golang.org/ "The Go Programming Language"
Next: https://golang.org/pkg/
Next: https://golang.org/cmd/
I thought adding the go keyword would be as straightforward as it seems, but I'm not sure what went wrong and am confused about how best to approach this problem. Any advice would be appreciated. Thank you in advance!
Your program is most likely exiting before the crawlers finish their work. One approach would be for Crawl to use a sync.WaitGroup and wait for all of its sub-crawlers to finish. For example:
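Here is a minimal sketch of that idea. It reuses your SafeCache and assumes the rest of the exercise code (the Fetcher interface, the fetcher variable, and the fmt and sync imports) is unchanged; each call to Crawl registers one WaitGroup entry per goroutine it starts and waits for them all before returning:

// Crawl uses fetcher to recursively crawl pages starting with url,
// to a maximum of depth, fetching URLs in parallel.
func Crawl(url string, depth int, fetcher Fetcher) {
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    cache.Set(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)

    var wg sync.WaitGroup
    for _, u := range urls {
        if !cache.Get(u) {
            fmt.Println("Next:", u)
            wg.Add(1)
            go func(u string) {
                defer wg.Done()
                Crawl(u, depth-1, fetcher)
            }(u) // pass u so each goroutine gets its own copy of the loop variable
        }
    }
    // Wait for every sub-crawler spawned above to return, so the outermost
    // Crawl (and therefore main) does not exit while goroutines are still running.
    wg.Wait()
}

Note that Get followed by Set is still not atomic, so two goroutines could occasionally race to fetch the same URL; a stricter version would combine the check and the insert under a single lock, but the WaitGroup is what fixes the early exit you are seeing.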