API: get ElementTree #34
Comments
What do you think of returning a Result object? So you can do the following:
>>> r = extract(doc)
>>> r.tree
<ElementTree instance at 0x...>
>>> r.nodes
[node1, node2, node3]
Do you want to write a PR implementing the functionality? :) |
Hi @eugene-eeo, give me until Monday to look into what you're proposing. The internship's been keeping me busy, but I'll squeeze this in somehow 😅 |
Straightforward: |
@bofm: 👍 I'm in favor of using the namedtuple approach, and returning the tree alongside the HtmlElements |
I don't see a reason why not, but I feel that a
>>> r = extract(doc)
>>> r.tree
<lxml.ElementTree>
>>> list(r)
[<Node>]
But once again I think this suggestion boils down to how minimalist the library should be. I am personally in favour of the Result object approach since it helps the user a little more. A nice compromise would probably be to inherit from the namedtuple and add our own |
@eugene-eeo: 👍 I am ok with this. While "minimalism" is cliché, it fits well with libextract. I don't think we require anything more than a |
I don't think it's a good idea to override |
While it is not a big overhead, imagine if the whole Python language were designed so that whenever you needed to iterate over some object you had to do:
for item in obj.iter:
    pass
I think that kind of illustrates my point :) Also the advantage is that it is more intuitive (depending on what you name it, I'm going with
extracted = extract(doc)
for item in extracted:
    print('#{0}'.format(item['id']))
Inheriting also allows us to add some docstrings in a nicer way:
class Result(namedtuple('Result', ['nodes', 'tree'])):
    """
    Describe the klass.
    """
    def __iter__(self):
        return iter(self.nodes) |
I'm afraid somebody might fall into this
from collections import namedtuple

class Result(namedtuple('Result', ['nodes', 'tree'])):
    """
    Describe the klass.
    """
    def __iter__(self):
        # Iteration yields the nodes, not the (nodes, tree) fields.
        return iter(self.nodes)

r = Result(('node1', 'node2'), 'tree')
print(r)
nodes, tree = r
print('nodes:', nodes)
print('tree:', tree)
# Result(nodes=('node1', 'node2'), tree='tree')
# nodes: node1
# tree: node2
after which they would need to go to the sources or the docs to realize that the |
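For anyone following along on Python 3, the pitfall above can be reproduced next to a plain namedtuple for comparison. This is only a sketch: `IterableResult` and `PlainResult` are hypothetical names chosen for illustration, not libextract's API, and the values are placeholder strings rather than real lxml objects.

```python
from collections import namedtuple


class IterableResult(namedtuple('IterableResult', ['nodes', 'tree'])):
    """Hypothetical result type whose iteration yields the nodes."""
    __slots__ = ()

    def __iter__(self):
        # Overrides tuple iteration, which tuple unpacking relies on.
        return iter(self.nodes)


# Plain namedtuple with the same fields and no overrides (also hypothetical).
PlainResult = namedtuple('PlainResult', ['nodes', 'tree'])

surprising = IterableResult(('node1', 'node2'), 'tree')
nodes, tree = surprising  # unpacks the two *nodes*, not the two fields

plain = PlainResult(('node1', 'node2'), 'tree')
plain_nodes, plain_tree = plain  # unpacks the fields as declared

print(nodes, tree)
print(plain_nodes, plain_tree)
```

Attribute access (`surprising.nodes`, `surprising.tree`) is unaffected by the override. With any number of nodes other than two, the unpacking line would raise `ValueError` instead of silently mixing things up.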
Fair enough 👍 I'd advocate for inheritance just to add the docstring as there seems to be no nice way of adding it currently... correct me if I'm wrong. |
The docstring is not a problem.
|
Oh, that was for the attributes, not for the class. Btw, |
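A minimal sketch of attaching docstrings without inheriting, assuming Python 3.5 or later, where the docstrings of a namedtuple class and of its fields can be assigned after creation. The docstring wording is illustrative only and not taken from libextract.

```python
from collections import namedtuple

Result = namedtuple('Result', ['nodes', 'tree'])

# Class-level docstring, assigned after the class is created.
Result.__doc__ = "Extraction result: the selected nodes and the parsed tree."

# Per-field docstrings.
Result.nodes.__doc__ = "Nodes selected by the extraction."
Result.tree.__doc__ = "Tree that the nodes belong to."

r = Result(nodes=['node1'], tree='tree')
nodes, tree = r  # plain tuple unpacking still yields the two fields
```

Because nothing is overridden here, unpacking and iteration keep the usual tuple behaviour, so this pairs naturally with the plain namedtuple option discussed above.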
What's the consensus on this? |
Yup. |
The api.extract function returns a generator of HtmlElement objects.
If you need to analyze the results of api.extract in relation to the HTML page, then it would be great to have a way to get the ElementTree object. This is required (for example) to get the XPath of an HtmlElement using etree.getpath(element) as described on http://lxml.de/xpathxslt.html#generating-xpath-expressions.
Currently I use the following lazy workaround: