Decide what kind of tree cssselect2 should work on #1
At minimum, I expect it to select nodes correctly from some sufficiently useful tree structure, where "sufficiently useful" is basically what you described - traversal, mutation, and serialization. If cssselect2 just returns nodes from something equivalent to or weaker than ElementTree, I won't be able to use it, and will have to stick with cssselect1 and its limitations. |
So ElementTree is too weak? What’s missing? |
Parent links, as you mention. Several of Bikeshed's features involve searching the ancestors of a node, such as figuring out the type of a dfn (see the |
The "best" way is: since Python has a cycle collector, why not extend lxml to allow every node to refer to an arbitrary Python object? This will probably require lxml to be built with its own private copy of libxml2. I don't know if the lxml maintainers would be willing to do that. A compromise (less efficient) way would be to require support for user-defined XPath functions. lxml already has that. It should be possible to map the CSS selectors if the XPath expressions can make use of arbitrary Python functions. As for caching, one would have to use something like ElementWrapper. |
Having lxml associate an arbitrary Python object to every node would definitely help, but it currently doesn’t and I’m not willing to invest the time and energy myself to make this happen. I’m not sure how XPath functions are relevant here. The whole point of cssselect2 (as opposed to cssselect) is to implement Selectors directly, without translating them to XPath. |
Presumably, the reason you want to not use XPath is because you cannot express some CSS selectors in XPath. If that were no longer true, there is no reason to not use XPath, anymore. Hence the suggestion for using functions. That way you can (relatively easily) get full CSS compliance, with vanilla lxml. |
Incidentally, I looked at WeasyPrint, looks painful. Is it really planning to implement the CSS box model from scratch? Why not just use a browser engine? It is perfectly possible to do pagination, links, etc. with a browser engine using the CSS 3 columns module (see the PDF output in calibre). |
The reason is that it’s a terrible, terrible idea. I’ve been there. At first it seems "easier" than actually implementing Selectors, and it’s fine to get 80% of Selectors, but then there are so many things that are hard to get right that it’s just not worth it. If you’re interested in the XPath approach, please see scrapy/cssselect#48. But note that cssselect2 is not cssselect, and please keep each discussion topic in the tracker of the relevant project, creating new issues if needed.
[citation needed]
Not planning to, because it already has.
For a variety of reasons that are completely out of scope for this issue. Please don’t hijack this discussion.
Again, [citation needed]. And again, this is out of scope. If you want to keep discussing WeasyPrint design choices, please take it elsewhere. |
Woah, easy, there is no need to fly off the handle, I am only trying to help. I was trying to point out to you that your reasons for abandoning cssselect don't make much sense (to me). Namely:
ebook-convert file.html .pdf --override-profile-size --paper-size a4 It uses PyQt's QtWebKit bindings, supports pagination, links, PDF outlines, custom header and footer templates with JavaScript, does not need a running X server, works on all major OSes, etc. Anyway, from your tone, you don't seem to be interested in my ideas. All the best and thanks for cssselect and tinycss (both of which I maintain custom versions of already). |
Alright, I’m sorry for my tone in the earlier message.
You’re just asserting this “actually possible” without supporting it. I spent a lot of time trying to make this work, and I’ve come to believe that it’s either not possible, or that it would be so twisted that it wouldn’t be worthwhile: the original “translating to XPath seems easier than actually implementing matching” idea would be completely out of the window. But I’d be happy to see a fix for e.g. scrapy/cssselect#12 and be proven wrong.
Thank you very much for asserting that two years of my work is not “real”. That said, I won’t discuss motivations for WeasyPrint’s existence any more here, it’s way off topic. |
No problem, online discussions often turn nasty simply because of a lack of emotional cues, as I'm sure you are aware :) As I understand it, lxml's XPath Python extensions allow you to define a custom Python function that is passed an element node, and can then do anything you like with that node, including iterating over it, accessing its parents/siblings/descendants/attributes/etc. If that understanding is correct, it is trivially true that you can use extension functions to implement any CSS selector. But it's been a while since I actually used that API, so maybe I am misremembering. If I find the time, I will look at fixing your bug. However, that will be a while, as that particular selector class is not currently useful for my use case, so I would need to justify taking the time away from my day job. And let me say that I agree, using XPath is not the nicest solution. However, it has the great virtue of allowing cssselect to be used with any XML tree implementation, which, IMO, is not a property to be thrown away lightly. As for WeasyPrint, don't get me wrong, it's a great project, and I think that it is important to have an HTML rendering stack apart from the major browsers. However, it seems pretty self-evident that a non-browser rendering engine is always going to lag behind in terms of support for evolving standards, simply because of the manpower disparity if for no other reason. That does not make it worthless, just not necessarily the optimal solution for rendering to PDF. All I'm saying is, be aware that one can actually generate proper PDF output from a browser, despite the lacklustre default PDF output implementations in the major browsers. |
Oh and if you wish to take this discussion elsewhere, I'll be happy to do so, just let me know where. |
For serious, WeasyPrint is irrelevant to this discussion, stop bringing it up. It's one of the many projects using CSSSelect1; it's only marginally relevant here as a user of the library, and because Simon actually wrote it, so he's familiar with what features it needs. Beyond that it doesn't need to be discussed. If you have actual answers to the question at the top of this issue, please give feedback. |
Sigh, whatever, I am done with this. Goodbye and good luck. Enjoy the deafening silence in this thread. |
FYI: I implemented a replacement for cssselect that works with vanilla lxml trees, uses caches, and implements the full CSS Level 3 spec (including all the bits that cssselect gets wrong). The only caveats are that I dropped namespace support and used case-insensitive tag/attribute names, because for my use case implementing those would be an unnecessary perf hit. The code is designed in a way that should make it trivial to adapt to any other kind of tree implementation (you just have to override half a dozen small methods in one class). https://github.com/kovidgoyal/calibre/blob/master/src/css_selectors/select.py#L86 It uses the parser from cssselect (so thanks for that) and also passes the full cssselect test suite (with a few modifications for tests that were wrong, and dropping tests for functionality I don't implement: namespaces, :contains(), [a!=b]). This is of course new code, and there may well be bugs. In particular, I have not really optimized the performance, and the caching strategy is very simple-minded. Feel free to use the code if you want (it is GPLv3 licensed) or just use the idea to implement your own cssselect2. |
Also note that most of the select tests are compared against running the same queries with WebKit, to give me extra confidence I didn't muck it up. |
You’re aware that cssselect and cssselect2 are BSD-licensed, right? I have no intention of changing this. |
Which is why I said, use the idea if you don't want to use the code. |
All I ask is that if you do use the idea, leave a note in the source file, crediting me. |
I don’t know if that’s how the GPL works, and I’m not interested in spending the energy to find out. |
OK, whatever, I tried. |
... back to the question: "To anyone who may want to use a CSS Selector matching library, what do you expect from that library? Please comment here." We only use cssselect via pyquery, and only use pyquery to do some simple read-only assertions about our markup. Directly to your design problem, I think you want to keep a lazily constructed mirror document that contains any data structures necessary to implement your matches. The tricky bit comes when you must ensure that the mirror document never becomes stale, since the original document is mutable. I think Clojure is onto something with its immutable-by-default philosophy... |
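To make the staleness concern concrete, here is a minimal sketch of a lazily built mirror over a standard-library ElementTree document. The `ParentMirror` class and its `parent_of` and `invalidate` methods are hypothetical names made up for this illustration; they are not part of cssselect, cssselect2, or pyquery.

```python
import xml.etree.ElementTree as ET


class ParentMirror:
    """Hypothetical lazily built child -> parent index for an ElementTree."""

    def __init__(self, root):
        self.root = root
        self._parents = None  # not built until the first lookup

    def parent_of(self, element):
        if self._parents is None:
            # One full traversal records the parent of every element.
            self._parents = {child: parent
                             for parent in self.root.iter()
                             for child in parent}
        return self._parents.get(element)

    def invalidate(self):
        # Nothing calls this automatically; the caller has to remember
        # to do it after every mutation of the underlying tree.
        self._parents = None


root = ET.fromstring("<a><b><c/></b></a>")
b = root.find("b")
c = b.find("c")
mirror = ParentMirror(root)
before = mirror.parent_of(c).tag   # the mirror is built here

b.remove(c)
root.append(c)                     # <c/> is now a direct child of <a>
stale = mirror.parent_of(c).tag    # the mirror still reports the old parent

mirror.invalidate()
fresh = mirror.parent_of(c).tag    # rebuilt from the mutated tree
```

In this sketch the mirror only becomes correct again after an explicit `invalidate()` call, which is exactly the bookkeeping that a mutable original document forces on the library or its users.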
Version 0.2.x has been released and is used in WeasyPrint with a tree designed for its needs. @bukzor could it be used in pyquery as is? |
I've got my own fork of an earlier version of |
Here is a brain dump:
One of the design goals of cssselect2 is to allow WeasyPrint to move away from lxml (in order to run on PyPy, and have one less non-obvious-to-install dependency.) See Kozea/WeasyPrint#64.
The obvious alternative is ElementTree. If cssselect2 works with ElementTree documents, it should also work with lxml documents.
However, ElementTree elements do not have lxml’s `.getparent()`, `.iterancestors()`, or `.getroot()` methods. Given an element, there is no way to go up the tree. This was a design decision made for ElementTree to avoid reference cycles, at a time when CPython (version 1.x) did not have a cycle collector. (And letting reference counting leak entire trees would have been kinda bad.) Nowadays, adding parent references would change the semantics of tree mutations and thus is not done for backwards compatibility.
To efficiently match entire stylesheets against a document, cssselect2 wants to match Selectors right-to-left, which requires going up the tree to find an element’s parent or ancestor. It also wants to cache a bunch of stuff (such as a parsed set of classes, or a language based on the `lang` attribute of ancestors) on the objects representing elements. In both ElementTree and lxml, `Element` objects do not accept new attributes.
cssselect2 currently (c0a1e70) works around this by having `ElementWrapper` objects that are created and destroyed on the fly when traversing the tree. As of Selectors Level 3, matching against an element only requires access to its ancestors, previous siblings, and the previous siblings of its ancestors.
However, new features in Selectors Level 4 will require arbitrary access to the whole tree for matching. `ElementWrapper` objects can still be created lazily, but we’ll probably want (for efficiency) to keep all of them in memory until we drop the whole tree.
At that point, if we have our own objects for representing the entire tree, do we even need to keep the ElementTree objects around? Would it make more sense to design and use a new tree API? My experiments with tinydom show that it’s not too hard to integrate with expat’s XML parser and html5lib’s HTML parser to build trees in a new API. WeasyPrint has very minimal needs (access to the tree after parsing can be reduced to a single in-order traversal, with no mutation.)
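As a rough illustration of the wrapper idea, the sketch below wraps standard-library ElementTree elements in objects that carry a parent link and a cached set of classes, then matches a descendant selector such as `.note p` right-to-left. It only shares a name with cssselect2's `ElementWrapper`; the attributes, methods, and the `matches_descendant` helper are assumptions made up for this example and do not reflect the code at c0a1e70.

```python
import xml.etree.ElementTree as ET


class ElementWrapper:
    """Hypothetical wrapper; not cssselect2's actual implementation."""

    def __init__(self, etree_element, parent=None):
        self.etree_element = etree_element
        self.parent = parent      # the upward link ElementTree itself lacks
        self._classes = None      # cache, filled on first access

    @property
    def classes(self):
        # Parse the class attribute once and keep the result on the wrapper.
        if self._classes is None:
            value = self.etree_element.get("class", "")
            self._classes = frozenset(value.split())
        return self._classes

    def iter_children(self):
        # Wrappers are created lazily, only as the traversal reaches them.
        for child in self.etree_element:
            yield ElementWrapper(child, parent=self)

    def iter_subtree(self):
        yield self
        for child in self.iter_children():
            yield from child.iter_subtree()

    def iter_ancestors(self):
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


def matches_descendant(wrapper, ancestor_class, subject_tag):
    """Match `.ancestor_class subject_tag` right-to-left."""
    # Test the rightmost compound selector first, then walk up the tree.
    if wrapper.etree_element.tag != subject_tag:
        return False
    return any(ancestor_class in a.classes for a in wrapper.iter_ancestors())


root = ElementWrapper(ET.fromstring(
    '<body><div class="note wide"><p>in</p></div><p>out</p></body>'))
matched = [w.etree_element.text for w in root.iter_subtree()
           if matches_descendant(w, "note", "p")]
```

Note that traversal has to start from a wrapper of the root element and continue through wrappers, which is why this already amounts to a small tree API of its own.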
But I also want other projects using cssselect to move to cssselect2. (Because XPath sucks and cssselect is broken in subtle but hard-to-fix ways.) And these projects probably want more than WeasyPrint from a tree API: various kinds of traversal, mutation, serialization… I don’t really want to maintain a new tree API doing all that.
But the current idea of having element wrapper objects still requires starting from a wrapper of the root element and doing traversal with the wrappers, so it’s kind of a new tree API already anyway.
So, there is this big decision to make before cssselect2 is really usable.
To anyone who may want to use a CSS Selector matching library, what do you expect from that library? Please comment here.