2024-07-12
한어Русский языкEnglishFrançaisIndonesianSanskrit日本語DeutschPortuguêsΕλληνικάespañolItalianoSuomalainenLatina
Web scrapers are tools that automatically extract data from websites. They are widely used in data collection, search engine optimization, market research, and other fields. This article will detail how to use Go 1.19 to implement a simplified site template automated scraping tool to help developers efficiently collect data.
Before you begin, make sure you have Go 1.19 installed on your system. You can check the Go version with the following command:
go version
If you don't have Go installed yet, you can install it from Go Official Website Download and install the latest version.
The basic workflow of a web crawler is as follows:
In the Go language, there are several popular crawler frameworks, such as:
This article will mainly use Colly and Goquery for web crawling and content parsing.
We will design a simplified site template automated crawler, the basic process is as follows:
First, create a new Go project:
mkdir go_scraper
cd go_scraper
go mod init go_scraper
Then, install Colly and Goquery:
go get -u github.com/gocolly/colly
go get -u github.com/PuerkitoBio/goquery
Next, write a simple crawler to crawl web content:
package main
import (
"fmt"
"github.com/gocolly/colly"
)
func main() {
// 创建一个新的爬虫实例
c := colly.NewCollector()
// 设置请求时的回调函数
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL.String())
})
// 设置响应时的回调函数
c.OnResponse(func(r *colly.Response) {
fmt.Println("Visited", r.Request.URL)
fmt.Println("Response:", string(r.Body))
})
// 设置错误处理的回调函数
c.OnError(func(r *colly.Response, err error) {
fmt.Println("Error:", err)
})
// 设置HTML解析时的回调函数
c.OnHTML("title", func(e *colly.HTMLElement) {
fmt.Println("Title:", e.Text)
})
// 开始爬取
c.Visit("http://example.com")
}
Running the above code will fetch the content of http://example.com and print the title of the webpage.
In order to extract the required data from the web page, we need to use Goquery to parse the HTML content. The following example shows how to use Goquery to extract links and text from a web page:
package main
import (
"fmt"
"github.com/gocolly/colly"
"github.com/PuerkitoBio/goquery"
)
func main() {
c := colly.NewCollector()
c.OnHTML("body", func(e *colly.HTMLElement) {
e.DOM.Find("a").Each(func(index int, item *goquery.Selection) {
link, _ := item.Attr("href")
text := item.Text()
fmt.Printf("Link #%d: %s (%s)n", index, text, link)
})
})
c.Visit("http://example.com")
}
To improve the efficiency of the crawler, we can use the concurrency function of Colly:
package main
import (
"fmt"
"github.com/gocolly/colly"
"github.com/PuerkitoBio/goquery"
"log"
"time"
)
func main() {
c := colly.NewCollector(
colly.Async(true), // 启用异步模式
)
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 2, // 设置并发数
Delay: 2 * time.Second,
})
c.OnHTML("body", func(e *colly.HTMLElement) {
e.DOM.Find("a").Each(func(index int, item *goquery.Selection) {
link, _ := item.Attr("href")
text := item.Text()
fmt.Printf("Link #%d: %s (%s)n", index, text, link)
c.Visit(e.Request.AbsoluteURL(link))
})
})
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL.String())
})
c.OnError(func(r *colly.Response, err error) {
log.Println("Error:", err)
})
c.Visit("http://example.com")
c.Wait() // 等待所有异步任务完成
}
Save the captured data to a local file or database. Here we take a CSV file as an example:
package main
import (
"encoding/csv"
"fmt"
"github.com/gocolly/colly"
"github.com/PuerkitoBio/goquery"
"log"
"os"
"time"
)
func main() {
file, err := os.Create("data.csv")
if err != nil {
log.Fatalf("could not create file: %v", err)
}
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
c := colly.NewCollector(
colly.Async(true),
)
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 2,
Delay: 2 * time.Second,
})
c.OnHTML("body", func(e *colly.HTMLElement) {
e.DOM.Find("a").Each(func(index int, item *goquery.Selection) {
link, _ := item.Attr("href")
text := item.Text()
fmt.Printf("Link #%d: %s (%s)n", index, text, link)
writer.Write([]string{text, link})
c.Visit(e.Request.AbsoluteURL(link))
})
})
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL.String())
})
c.OnError(func(r *colly.Response, err error) {
log.Println("Error:", err)
})
c.Visit("http://example.com")
c.Wait()
}
In order to improve the stability of the crawler, we need to handle request errors and implement a retry mechanism:
package main
import (
"fmt"
"github.com/gocolly/colly"
"github.com/PuerkitoBio/goquery"
"log"
"os"
"time"
)
func main() {
file, err := os.Create("data.csv")
if err != nil {
log.Fatalf("could not create file: %v", err)
}
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
c := colly.NewCollector(
colly.Async(true),
colly.MaxDepth(1),
)
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 2,
Delay: 2 * time.Second,
})
c.OnHTML("body", func(e *colly.HTMLElement) {
e.DOM.Find("a").Each(func(index int, item *goquery.Selection) {
link, _ := item.Attr("href")
text := item.Text()
fmt.Printf("Link #%d: %s (%s)
n", index, text, link)
writer.Write([]string{text, link})
c.Visit(e.Request.AbsoluteURL(link))
})
})
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL.String())
})
c.OnError(func(r *colly.Response, err error) {
log.Println("Error:", err)
// 重试机制
if r.StatusCode == 0 || r.StatusCode >= 500 {
r.Request.Retry()
}
})
c.Visit("http://example.com")
c.Wait()
}
The following example shows how to scrape the titles and links of a news website and save them to a CSV file:
package main
import (
"encoding/csv"
"fmt"
"github.com/gocolly/colly"
"log"
"os"
"time"
)
func main() {
file, err := os.Create("news.csv")
if err != nil {
log.Fatalf("could not create file: %v", err)
}
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
writer.Write([]string{"Title", "Link"})
c := colly.NewCollector(
colly.Async(true),
)
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 5,
Delay: 1 * time.Second,
})
c.OnHTML(".news-title", func(e *colly.HTMLElement) {
title := e.Text
link := e.ChildAttr("a", "href")
writer.Write([]string{title, e.Request.AbsoluteURL(link)})
fmt.Printf("Title: %snLink: %sn", title, e.Request.AbsoluteURL(link))
})
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL.String())
})
c.OnError(func(r *colly.Response, err error) {
log.Println("Error:", err)
if r.StatusCode == 0 || r.StatusCode >= 500 {
r.Request.Retry()
}
})
c.Visit("http://example-news-site.com")
c.Wait()
}
To avoid being blocked by the target website, you can use a proxy:
c.SetProxy("http://proxyserver:port")
By setting the user agent, disguise as a different browser:
c.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
You can use Colly's extension library Colly-Redis to implement distributed crawlers:
import (
"github.com/gocolly/redisstorage"
)
func main() {
c := colly.NewCollector()
redisStorage := &redisstorage.Storage{
Address: "localhost:6379",
Password: "",
DB: 0,
Prefix: "colly",
}
c.SetStorage(redisStorage)
}
For dynamic web pages, you can use a headless browser such as chromedp:
import (
"context"
"github.com/chromedp/chromedp"
)
func main() {
ctx, cancel := chromedp.NewContext(context.Background())
defer cancel()
var res string
err := chromedp.Run(ctx,
chromedp.Navigate("http://example.com"),
chromedp.WaitVisible(`#some-element`),
chromedp.InnerHTML(`#some-element`, &res),
)
if err != nil {
log.Fatal(err)
}
fmt.Println(res)
}
Through the detailed introduction of this article, we learned how to use Go 1.19 to implement a simplified site template automated crawler. We started with the basic crawler design process, gradually delved into key links such as HTML parsing, concurrent processing, data storage, and error handling, and demonstrated how to crawl and process web page data through specific code examples.
Go language's powerful concurrent processing capabilities and rich third-party libraries make it an ideal choice for building efficient and stable web crawlers. Through continuous optimization and expansion, more complex and advanced crawler functions can be achieved, providing solutions for various data collection needs.
I hope this article can provide you with valuable references for implementing web crawlers in Go language and inspire you to explore and innovate more in this field.