Decide what kind of tree cssselect2 should work on #1

SimonSapin · 2014-04-07T15:49:32Z

Here is a brain dump:

One of the design goals of cssselect2 is to allow WeasyPrint to move away from lxml (in order to run on PyPy, and have one less non-obvious-to-install dependency.) See Kozea/WeasyPrint#64.

The obvious alternative is ElementTree. If cssselect2 works with ElementTree documents, it should also work with lxml documents.

However, ElementTree elements do not have lxml’s .getparent(), .iterancestors(), or .getroot() methods. Given an element, there is no way to go up the tree. This was a design decision made for ElementTree to avoid reference cycles, at a time when CPython (version 1.x) did not have a cycle collector. (And letting reference counting leak entire trees would have been kinda bad.) Nowadays, adding parent references would change the semantics of tree mutations and thus is not done for backwards compatibility.
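To make the missing upward links concrete, here is a minimal sketch of the usual workaround with the standard library: build a child-to-parent map in one pass, then walk it. The sample document and the `iter_ancestors` helper are made up for illustration. Note that the map is a snapshot and goes stale if the tree is mutated afterwards.

```python
import xml.etree.ElementTree as ET

# Sample document, invented for this illustration.
root = ET.fromstring("<html><body><p class='note'>hi</p></body></html>")

# ElementTree elements have no parent pointer, so record one externally:
# every element is visited once and each of its children is mapped to it.
parent_map = {child: parent for parent in root.iter() for child in parent}


def iter_ancestors(element):
    """Yield the ancestors of `element`, nearest first (hypothetical helper)."""
    parent = parent_map.get(element)
    while parent is not None:
        yield parent
        parent = parent_map.get(parent)


p = root.find("body/p")
ancestor_tags = [a.tag for a in iter_ancestors(p)]
print(ancestor_tags)
```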

To efficiently match entire stylesheets against a document, cssselect2 wants to match Selectors right-to-left, which requires going up the tree to find an element’s parent or ancestor. It also wants to cache a bunch of stuff (such as a parsed set of classes, or a language based on the lang attribute of ancestors) on the objects representing elements. In both ElementTree and lxml, Element objects do not accept new attributes.

cssselect2 currently (c0a1e70) works around this by having ElementWrapper objects that are created and destroyed on the fly when traversing the tree. As of Selectors Level 3, matching against an element only requires access to its ancestors, previous siblings, and the previous siblings of its ancestors.
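As a rough illustration of the wrapper idea (this is not the actual cssselect2 code; the `Wrapper` class, its method names, and `matches_descendant` are all hypothetical), the sketch below creates wrappers on the fly during traversal, gives each one a link to its parent wrapper, caches the parsed class set, and checks a selector shaped like `div p.note` right-to-left.

```python
import xml.etree.ElementTree as ET


class Wrapper:
    """Hypothetical wrapper around an ElementTree element."""

    def __init__(self, element, parent=None, index=0):
        self.element = element
        self.parent = parent    # parent Wrapper, or None for the root
        self.index = index      # position among the parent's children
        self._classes = None    # cache, filled on first access

    @property
    def classes(self):
        # Parse the class attribute once per wrapper and remember it.
        if self._classes is None:
            self._classes = frozenset(self.element.get("class", "").split())
        return self._classes

    def iter_children(self):
        for i, child in enumerate(self.element):
            yield Wrapper(child, parent=self, index=i)

    def iter_ancestors(self):
        wrapper = self.parent
        while wrapper is not None:
            yield wrapper
            wrapper = wrapper.parent

    def iter_previous_siblings(self):
        if self.parent is None:
            return
        for i in range(self.index - 1, -1, -1):
            yield Wrapper(self.parent.element[i], parent=self.parent, index=i)


def matches_descendant(wrapper, ancestor_tag, tag, class_name):
    """Right-to-left check of `ancestor_tag tag.class_name`."""
    # Test the rightmost compound selector first, then walk up.
    if wrapper.element.tag != tag or class_name not in wrapper.classes:
        return False
    return any(a.element.tag == ancestor_tag for a in wrapper.iter_ancestors())


def iter_subtree(wrapper):
    """Yield `wrapper` and all its descendants in document order."""
    yield wrapper
    for child in wrapper.iter_children():
        yield from iter_subtree(child)


root = Wrapper(ET.fromstring(
    "<div><h1/><p class='note big'/><span><p/></span></div>"))
matched = [w.element.tag for w in iter_subtree(root)
           if matches_descendant(w, "div", "p", "note")]
note_p = next(w for w in iter_subtree(root) if "note" in w.classes)
previous = [s.element.tag for s in note_p.iter_previous_siblings()]
```

Because each wrapper here is thrown away after use, its cached class set is lost with it; keeping wrappers alive for the lifetime of the tree would let such caches be reused.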

However, new features in Selectors Level 4 will require arbitrary access to the whole tree for matching. ElementWrapper objects can still be created lazily, but we’ll probably want (for efficiency) to keep all of them in memory until we drop the whole tree.

At that point, if we have our own objects for representing the entire tree, do we even need to keep the ElementTree objects around? Would it make more sense to design and use a new tree API? My experiments with tinydom show that it’s not too hard to integrate with expat’s XML parser and html5lib’s HTML parser to build trees in a new API. WeasyPrint has very minimal needs (access to the tree after parsing can be reduced to a single in-order traversal, with no mutation.)

But I also want other projects using cssselect to move to cssselect2. (Because XPath sucks and cssselect is broken in subtle but hard-to-fix ways.) And these projects probably want more than WeasyPrint from a tree API: various kinds of traversal, mutation, serialization… I don’t really want to maintain a new tree API doing all that.

But the current idea of having element wrapper objects still requires starting from a wrapper of the root element and doing traversal with the wrappers, so it’s kind of a new tree API already anyway.

So, there is this big decision to make before cssselect2 is really usable.

To anyone who may want to use a CSS Selector matching library, what do you expect from that library? Please comment here.

tabatkins · 2014-04-11T04:44:37Z

At minimum, I expect it to select nodes correctly from some sufficiently useful tree structure, where "sufficiently useful" is basically what you described - traversal, mutation, and serialization. If cssselect2 just returns nodes from something equivalent to or weaker than ElementTree, I won't be able to use it, and will have to stick with cssselect1 and its limitations.

SimonSapin · 2014-04-11T08:30:48Z

So ElementTree is too weak? What’s missing?

tabatkins · 2014-04-11T17:44:21Z

Parent links, as you mention. Several of Bikeshed's features involve searching the ancestors of a node, such as figuring out the type of a dfn (see the treeAttr() function in htmlhelpers.py). Dealing with headings is even more complicated, as I have to search backwards in the sibling lists of the ancestors too.

kovidgoyal · 2015-02-17T12:04:54Z

The "best" way is:

Since Python has a cycle collector, why not extend lxml to allow every node to refer to an arbitrary Python object?

This will probably require lxml to be built with its own private copy of libxml2. I don't know if the lxml maintainers would be willing to do that.

A compromise (less efficient) way would be to require support for user-defined XPath functions. lxml already has that. It should be possible to map the CSS selectors if the XPath expressions can make use of arbitrary Python functions.

As for caching, one would have to use something like ElementWrapper.

SimonSapin · 2015-02-17T12:13:03Z

Having lxml associate an arbitrary Python object to every node would definitely help, but it currently doesn’t and I’m not willing to invest the time and energy myself to make this happen.

I’m not sure how XPath functions are relevant here. The whole point of cssselect2 (as opposed to cssselect) is to implement Selectors directly, without translating them to XPath.

kovidgoyal · 2015-02-17T12:19:50Z

Presumably, the reason you want to not use XPath is because you cannot express some CSS selectors in XPath. If that were no longer true, there is no reason to not use XPath anymore. Hence the suggestion for using functions. That way you can (relatively easily) get full CSS compliance, with vanilla lxml.

kovidgoyal · 2015-02-17T12:22:53Z

Incidentally, I looked at WeasyPrint, looks painful. Is it really planning to implement the CSS box model from scratch? Why not just use a browser engine? It is perfectly possible to do pagination, links, etc. with a browser engine using the CSS 3 columns module (see the PDF output in calibre).

SimonSapin · 2015-02-17T12:35:37Z

> Presumably, the reason you want to not use XPath is because you cannot express some CSS selectors in XPath.

The reason is that it’s a terrible, terrible idea. I’ve been there. At first it seems "easier" than actually implementing Selectors, and it’s fine to get 80% of Selectors, but then there are so many things that are hard to get right that it’s just not worth it.

If you’re interested in the XPath approach, please see scrapy/cssselect#48. But note that cssselect2 is not cssselect, and please keep each discussion topic in the tracker of the relevant project, creating new issues if needed.

> relatively easily

[citation needed]

> Is [WeasyPrint] really planning to implement the CSS box model from scratch?

Not planning to, because it already has.

> Why not just use a browser engine?

For a variety of reasons that are completely out of scope for this issue. Please don’t hijack this discussion.

> It is perfectly possible to do pagination, links, etc. with a browser engine using the CSS 3 columns module (see the PDF output in calibre).

Again, [citation needed]. And again, this is out of scope. If you want to keep discussing WeasyPrint design choices, please take it elsewhere.

kovidgoyal · 2015-02-18T01:07:58Z

Woah, easy, there is no need to fly off the handle, I am only trying to help. I was trying to point out to you that your reasons for abandoning cssselect don't make much sense (to me). Namely:

1. It is actually possible to implement CSS selectors with XPath using lxml's support for custom XPath functions: http://lxml.de/extensions.html
2. There is no way to achieve the goals you've set out for cssselect2 without sacrificing interoperability with widely used tree implementations:
   a) You need a custom tree implementation
   b) Or you need to extend existing tree implementations in ways that are not simple/widely applicable
3. Your main motivation for abandoning lxml appears to be because you want to use PyPy, presumably for enhanced performance. However, you will get much better performance and fidelity to the ever evolving web standards by using a real browser engine. And a real browser engine is perfectly capable of supporting pagination, links, etc. in PDF output. You ask for a citation, I gave you one: install calibre, run

ebook-convert file.html .pdf --override-profile-size --paper-size a4

It uses PyQt's QtWebKit bindings, supports pagination, links, PDF outlines, custom header and footer templates with JavaScript, does not need a running X server, works on all major OSes, etc, etc.

Anyway, from your tone, you don't seem to be interested in my ideas. All the best and thanks for cssselect and tinycss (both of which I maintain custom versions of already).

SimonSapin · 2015-02-18T06:12:58Z

Alright, I’m sorry for my tone in the earlier message.

> It is actually possible to implement CSS selectors with XPath using lxml's support for custom XPath functions http://lxml.de/extensions.html

You’re just asserting this “actually possible” without supporting it. I spent a lot of time trying to make this work, and I’ve come to believe that it’s either not possible, or that it would be so twisted that it wouldn’t be worthwhile: the original “translating to XPath seems easier than actually implementing matching” idea would be completely out of the window.

But I’d be happy to see a fix for e.g. scrapy/cssselect#12 and be proven wrong.

> However, you will get much better performance and fidelity to the ever evolving web standards by using a real browser engine.

Thank you very much for asserting that two years of my work is not “real”. That said, I won’t discuss motivations for WeasyPrint’s existence any more here, it’s way off topic.

kovidgoyal · 2015-02-18T06:38:25Z

No problem, online discussions often turn nasty simply because of a lack of emotional cues, as I'm sure you are aware :)

As I understand it, lxml's XPath Python extensions allow you to define a custom Python function that is passed an element node, and can then do anything you like with that node, including iterating over it, accessing its parents/siblings/descendants/attributes/etc. If that understanding is correct, it is trivially true that you can use extension functions to implement any CSS selector. But it's been a while since I actually used that API, so maybe I am misremembering. If I find the time, I will look at fixing your bug. However, that will be a while, as that particular selector class is not currently useful for my use case, so I would need to justify taking the time away from my day job.

And let me say that I agree, using XPath is not the nicest solution. However, it has the great virtue of allowing cssselect to be used with any XML tree implementation, which, IMO, is not a property to be thrown away lightly.

As for WeasyPrint, don't get me wrong, it's a great project, and I think that it is important to have an HTML rendering stack apart from the major browsers. However, it seems pretty self-evident that a non-browser rendering engine is always going to lag behind in terms of support for evolving standards, simply because of the manpower disparity if for no other reason. That does not make it worthless, just not necessarily the optimal solution for rendering to PDF. All I'm saying is, be aware that one can actually generate proper PDF output from a browser, despite the lacklustre default PDF output implementations in the major browsers.

kovidgoyal · 2015-02-18T06:39:27Z

Oh and if you wish to take this discussion elsewhere, I'll be happy to do so, just let me know where.

tabatkins · 2015-02-18T16:38:47Z

For serious, WeasyPrint is irrelevant to this discussion, stop bringing it up. It's one of the many projects using CSSSelect1; it's only marginally relevant here as a user of the library, and because Simon actually wrote it, so he's familiar with what features it needs. Beyond that it doesn't need to be discussed.

If you have actual answers to the question at the top of this issue, please give feedback.

kovidgoyal · 2015-02-18T16:47:36Z

Sigh, whatever, I am done with this. Goodbye and good luck. Enjoy the deafening silence in this thread.

kovidgoyal · 2015-02-20T11:11:03Z

FYI: I implemented a replacement for cssselect that works with vanilla lxml trees, uses caches, and implements the full CSS Level 3 spec (including all the bits that cssselect gets wrong). The only caveats are that I dropped namespace support and used case-insensitive tag/attribute names, because for my use case implementing those would be an unnecessary perf hit.

The code is designed in a way that should make it trivial to adapt to any other kind of tree implementation (you just have to override half a dozen small methods in one class).

https://github.com/kovidgoyal/calibre/blob/master/src/css_selectors/select.py#L86

It uses the parser from cssselect (so thanks for that) and also passes the full cssselect test suite (with a few modifications for tests that were wrong, and dropping tests for functionality I don't implement: namespaces, :contains(), [a!=b]).

This is of course new code, and there may well be bugs. In particular, I have not really optimized the performance, and the caching strategy is very simple-minded.

Feel free to use the code if you want (it is GPLv3 licensed) or just use the idea to implement your own cssselect2.

kovidgoyal · 2015-02-20T11:12:25Z

Also note that most of the select tests are compared against running the same queries with WebKit, to give me extra confidence I didn't muck it up.

SimonSapin · 2015-02-20T11:18:22Z

> Feel free to use the code if you want (it is GPLv3 licensed)

You’re aware that cssselect and cssselect2 are BSD-licensed, right? I have no intention of changing this.

kovidgoyal · 2015-02-20T11:19:07Z

Which is why I said, use the idea if you don't want to use the code.

kovidgoyal · 2015-02-20T11:20:38Z

All I ask is that if you do use the idea, leave a note in the source file, crediting me.

SimonSapin · 2015-02-20T11:20:48Z

I don’t know if that’s how the GPL works, and I’m not interested in spending the energy to find out.

kovidgoyal · 2015-02-20T11:21:18Z

OK, whatever, I tried.

bukzor · 2015-02-23T04:02:41Z

... back to the question: "To anyone who may want to use a CSS Selector matching library, what do you expect from that library? Please comment here."

We only use cssselect via pyquery, and only use pyquery to do some simple read-only assertions about our markup.

Directly to your design problem, I think you want to keep a lazily-constructed mirror document that contains any data structures necessary to implement your matches. The tricky bit comes when you must ensure that the mirror document never becomes stale, since the original document is mutable. I think Clojure is onto something with their immutable-by-default philosophy...
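A tiny sketch of what such a mirror and its staleness problem could look like, using only the standard library. The `Mirror` class and its `invalidate` method are hypothetical names for illustration, and the explicit invalidation is just one assumed strategy, since ElementTree does not notify anyone when a tree changes.

```python
import xml.etree.ElementTree as ET


class Mirror:
    """Hypothetical lazily built side table of per-element data."""

    def __init__(self, root):
        self.root = root
        self._parents = None  # built on first use

    def parent(self, element):
        if self._parents is None:
            # One pass over the document: map each child to its parent.
            self._parents = {c: p for p in self.root.iter() for c in p}
        return self._parents.get(element)

    def invalidate(self):
        # Must be called by whoever mutates the document.
        self._parents = None


root = ET.fromstring("<a><b/></a>")
mirror = Mirror(root)
b = root[0]
first = mirror.parent(b)        # builds the mirror

c = ET.SubElement(b, "c")       # mutate the original document
stale = mirror.parent(c)        # the mirror does not know about <c> yet

mirror.invalidate()
fresh = mirror.parent(c)        # rebuilt, now consistent again
```

With an immutable document the `invalidate` step would never be needed, which is the appeal of the immutable-by-default approach.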

liZe · 2017-10-02T09:19:26Z

Version 0.2.x has been released and is used in WeasyPrint with a tree designed for its needs. @bukzor could it be used in pyquery as is?

reedstrm · 2017-10-11T15:56:27Z

I've got my own fork of an earlier version of cssselect2, over at http://github.com/Connexions/cssselect2, in support of a CSS based transformational extension, to take the place of finding people who know XSLT. :-) I added an extension capability, which might be upstreamable. Mostly posting here to say "yes, there are others interested in cssselect2". We're not 100% happy with version 1.0 of our CSSt (as is typical with any new thing, now that we have actually used it for more than toy examples), so are planning a rewrite soonish. I'll take a look at minimum at rebasing against your new version, and perhaps offer a PR for the extension mechanism. Time frame - end of year/mid January.

SimonSapin mentioned this issue Apr 7, 2014

Run on PyPy Kozea/WeasyPrint#64

Closed

SimonSapin changed the title ~~Decide what kind of tree should cssselect2 work on~~ Decide what kind of tree cssselect2 should work on Apr 7, 2014

SimonSapin mentioned this issue Dec 5, 2014

*:first-of-type and friends are not implemented yet scrapy/cssselect#4

Open

SimonSapin mentioned this issue Dec 19, 2014

:nth-last-child selector incorrectly starts at 0 instead of 1 scrapy/cssselect#46

Closed

SimonSapin mentioned this issue Feb 17, 2015

Project maintenance scrapy/cssselect#48

Closed

SimonSapin mentioned this issue Apr 19, 2016

release to pypi? #2

Closed

SimonSapin mentioned this issue Jun 12, 2016

Documentation not found #4

Closed

liZe mentioned this issue Jul 6, 2017

Major changes for WeasyPrint #5

Merged

liZe closed this as completed Nov 5, 2018
