Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expressions with functions are not goroutine-safe #43

Closed
a13xb opened this issue Jan 1, 2020 · 2 comments
Closed

Expressions with functions are not goroutine-safe #43

a13xb opened this issue Jan 1, 2020 · 2 comments

Comments

@a13xb
Copy link

a13xb commented Jan 1, 2020

I'm using this library to batch parse a large number of HTML documents in parallel (via antchfx/htmlquery) and it will always result either in a nil pointer dereference or wrong result.

The issue is reuse of query objects due to shallow cloning of functionQuery objects.

  1. htmlquery calls xpath.Expr.Select, which creates a new node iterator containing a cloned query.
  2. The query is cloned recursively.
  3. Eventually functionQuery is cloned but the clone is shallow, because the function is actually a closure that captured references to the original query objects when it was constructed.
  4. At this point the all cloned queries will reference the same original child query nodes that will be mutated concurrently when Select is called (for example attributeQuery.iterator in attributeQuery.Select).

Here is a self-contained example with htmlquery, which uses a global cache of xpath.Expr objects, that hits the problem fairly quickly on my machine.

package main

import (
	"github.com/antchfx/htmlquery"
	"golang.org/x/net/html"
	"log"
	"strings"
)

func work() {
	doc, err := html.Parse(strings.NewReader(`
		<html>
			<head><title>Hello</title></head>
			<body>
				<div class="c1 c2">A</div>
				<div class="c2 c3">B</div>
				<div class="c3 c1">C</div>
			</body>
		</html>
	`))
	if err != nil {
		log.Fatal(err)
	}
	ns := htmlquery.Find(doc, `//div[contains(@class,"c1")]/text()`)
	if len(ns) == 0 {
		log.Fatal("text not found")
	}
	elems := []string{}
	for _, n := range ns {
		elems = append(elems, htmlquery.InnerText(n))
	}
	log.Printf(`%d elems: %v`, len(elems), elems)
}

func main() {
	for i := 0; i < 4; i++ {
		go func() {
			for {
				work()
			}
		}()
	}
	select {}
}

Note that it will run correctly if you remove the contains clause, and just use =, for example.

My current workaround is to disable query cache in htmlquery, but what really needs to be done is to refactor functionQuery so it can be cloned correctly.

zhengchun added a commit that referenced this issue Jan 3, 2020
allows clone about `functionQuery` input arguments which implemented `query` interface.
@zhengchun
Copy link
Contributor

Thanks for your feedback. I added a new method to decide whether clone input argument. This might help.

@zhengchun
Copy link
Contributor

Closed, fixed in v1.1.4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants