Tips and tricks
The go.net/html package used by goquery requires that the HTML document is UTF-8 encoded. When you know that the encoding of the HTML page is not UTF-8, you can use the iconv package to convert it to UTF-8 (there are various implementations of the iconv API, see godoc.org for other options):
$ go get -u github.com/djimenez/iconv-go
and then:
// Load the URL
res, err := http.Get(url)
if err != nil {
	// handle error
}
defer res.Body.Close()

// Convert the designated charset HTML to UTF-8 encoded HTML.
// `charset` being one of the charsets known by the iconv package.
utfBody, err := iconv.NewReader(res.Body, charset, "utf-8")
if err != nil {
	// handle error
}

// Pass the UTF-8 reader to goquery
doc, err := goquery.NewDocumentFromReader(utfBody)
if err != nil {
	// handle error
}
// use doc...
Thanks to github user @YuheiNakasaka.
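The `charset` value passed to the converter has to come from somewhere, and one common source is the response's Content-Type header. The following is a minimal sketch using only the standard library. `charsetFromContentType` is a hypothetical helper (not part of goquery or iconv), and falling back to "utf-8" when no charset is declared is an assumption.

```go
package main

import (
	"fmt"
	"mime"
)

// charsetFromContentType is a hypothetical helper: it extracts the charset
// parameter from a Content-Type header value such as
// "text/html; charset=Shift_JIS". It returns "utf-8" when the header is
// empty, malformed, or declares no charset (an assumed default).
func charsetFromContentType(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "utf-8"
	}
	if cs := params["charset"]; cs != "" {
		return cs
	}
	return "utf-8"
}

func main() {
	fmt.Println(charsetFromContentType("text/html; charset=Shift_JIS"))
	fmt.Println(charsetFromContentType("text/html"))
}
```

The returned name is whatever the server declared, so it may still need to be mapped to a name the converter recognizes. Pages can also declare their charset in a meta tag instead of the header, which this sketch does not look at.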
Actually, the official go.text repository covers this use case too, see its godoc page for the details.
User @jayme-github used the following to guess a page's charset (when the charset is unknown) using x/text/encoding and x/net/html/charset:
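// detectContentCharset guesses the charset from the first 1024 bytes of r.
// Peek does not consume the bytes, so r can still be read from the start.
func detectContentCharset(r *bufio.Reader) string {
	// Peek also returns an error when fewer than 1024 bytes are available,
	// so use whatever data it returned instead of checking the error.
	if data, _ := r.Peek(1024); len(data) > 0 {
		if _, name, certain := charset.DetermineEncoding(data, ""); certain {
			return name
		}
	}
	return "utf-8"
}

// DecodeHTMLBody returns a decoding reader of the HTML body for the specified `charset`.
// If `charset` is empty, DecodeHTMLBody tries to guess the encoding from the content.
func DecodeHTMLBody(body io.Reader, charset string) (io.Reader, error) {
	if charset == "" {
		// Wrap the body once and keep reading from the wrapper, so the
		// bytes buffered during detection are not lost.
		r := bufio.NewReader(body)
		charset = detectContentCharset(r)
		body = r
	}
	e, err := htmlindex.Get(charset)
	if err != nil {
		return nil, err
	}
	if name, _ := htmlindex.Name(e); name != "utf-8" {
		body = e.NewDecoder().Reader(body)
	}
	return body, nil
}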
Also, charset.NewReader could be handy:
import (
	"net/http"

	"github.com/PuerkitoBio/goquery"
	"golang.org/x/net/html/charset"
)

func main() {
	resp, err := http.Get("https://example.com")
	if err != nil {
		// handle error
	}
	defer resp.Body.Close()

	ct := resp.Header.Get("Content-Type")
	bodyReader, err := charset.NewReader(resp.Body, ct)
	if err != nil {
		// handle error
	}

	// bodyReader converts the content of resp.Body to UTF-8 if needed
	doc, err := goquery.NewDocumentFromReader(bodyReader)
	if err != nil {
		// handle error
	}
	// use the doc as usual
}
Contributed by Github user @nanmu42
goquery is great for handling normal HTML pages, but when most of the page is built dynamically using JavaScript, there's not much it can do. There are various options when faced with this problem. You can find a code example using otto, a JavaScript interpreter written in Go, in this gist. Thanks to github user @cryptix.
If all you need is a normal for loop over all nodes in the current selection, where Map/Each-style iteration is not necessary, you can use the following:
sel := Doc().Find(".selector")
for i := range sel.Nodes {
	single := sel.Eq(i)
	// use `single` as a selection of 1 node
}
Thanks to github user @jmoiron.