API: get ElementTree #34

Open
bofm opened this issue Jun 6, 2015 · 14 comments

Comments

@bofm

bofm commented Jun 6, 2015

The api.extract function returns a generator of HtmlElement objects.
If you need to analyze the results of api.extract in relation to the HTML page, it would be great to have a way to get the ElementTree object. This is required, for example, to get the XPath of an HtmlElement using etree.getpath(element), as described at http://lxml.de/xpathxslt.html#generating-xpath-expressions.

Currently I use the following lazy workaround:

from functools import partial
from libextract._compat import BytesIO
from libextract.core import parse_html, pipeline, select, measure, rank, finalise

def extract(document, encoding='utf-8', count=None):
    if isinstance(document, bytes):
        document = BytesIO(document)

    crank = partial(rank, count=count) if count else rank

    etree = parse_html(document, encoding=encoding)
    yield etree
    yield pipeline(
        select(etree),
        (measure, crank, finalise)
        )

import requests

r = requests.get(url)
gen_extract = extract(r.content)
tree = next(gen_extract)        # first yield: the ElementTree
textnodes = next(gen_extract)   # second yield: the pipeline result
data_element = next(textnodes)  # <Element table at 0x36f1f60>
rows = data_element.iterfind('tr')
for row in rows:
    row_xpath = tree.getpath(row)
    print(row_xpath)

# /html/body/div[2]/div[1]/div[2]/table/tr[1]
# /html/body/div[2]/div[1]/div[2]/table/tr[2]
# /html/body/div[2]/div[1]/div[2]/table/tr[3]
# ...
@eugene-eeo
Contributor

What do you think of returning a Result object? So you can do the following:

>>> r = extract(doc)
>>> r.tree
<ElementTree instance at 0x...>
>>> r.nodes
[node1, node2, node3]

Do you want to write a PR implementing the functionality? :)

@rodricios
Contributor

Hi @eugene-eeo, give me until Monday to look into what you're proposing. The internship's been keeping me busy, but I'll squeeze this in somehow 😅

@bofm
Author

bofm commented Jun 7, 2015

Straightforward: Extracted = namedtuple('Extracted', 'nodes, tree').
https://github.com/bofm/libextract/blob/nodes-and-tree/libextract/api.py
The tests would need to be modified for a PR.
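
A minimal sketch of the namedtuple approach, using the field names given above. The `extract_sketch` function and its string placeholders are hypothetical and only illustrate the shape of the return value; this is not libextract's code.

```python
from collections import namedtuple

# Field names as proposed in this comment.
Extracted = namedtuple('Extracted', 'nodes, tree')


def extract_sketch(document):
    """Hypothetical stand-in for api.extract; not libextract's implementation."""
    tree = 'tree for ' + document          # placeholder for the parsed ElementTree
    nodes = ['node1', 'node2', 'node3']    # placeholder for the ranked HtmlElements
    return Extracted(nodes=nodes, tree=tree)


result = extract_sketch('doc')
nodes, tree = result   # plain tuple unpacking follows the field order
```

Callers can use either `result.tree` and `result.nodes` or positional unpacking, since a plain namedtuple keeps normal tuple behaviour.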

@rodricios
Contributor

@bofm: 👍 I'm in favor of using the namedtuple approach, and returning the tree alongside the HtmlElements

@eugene-eeo
Contributor

I don't see a reason why not, but I feel that a Result object is more intuitive as one can override some methods to allow the user to iterate over it:

>>> r = extract(doc)
>>> r.tree
<lxml.ElementTree>
>>> list(r)
[<Node>]

But once again, I think this suggestion boils down to how minimalist the library should be. I am personally in favour of the Result object approach since it helps the user a little more. A nice compromise would probably be to inherit from the namedtuple and add our own __iter__ method.

@rodricios
Contributor

@eugene-eeo: 👍 I am ok with this. While "minimalism" is cliché, it fits well with libextract.

I don't think we require anything more than a namedtuple inheritance at most, given that we aren't really providing anything more than an algorithm, at least at the moment.

@bofm
Author

bofm commented Jun 8, 2015

I don't think it's a good idea to override the __iter__ method. Given an object of a namedtuple subclass, it is not obvious that iterating over it produces nodes. It is not a big overhead to add one line of code: nodes = r.nodes.

@eugene-eeo
Contributor

While it is not a big overhead, imagine if the whole Python language were designed so that whenever you needed to iterate over some object you had to do:

for item in obj.iter:
    pass

I think that kind of illustrates my point :) Also the advantage is that it is more intuitive (depending on what you name it, I'm going with Result but if we agree on Extracted that's fine), and allows users to write quite expressive code:

extracted = extract(doc)
for item in extracted:
    print '#{0}'.format(item['id'])

Inheriting also allows us to add some docstrings in a nicer way:

class Result(namedtuple('Result', ['nodes', 'tree'])):
    """
    Describe the klass.
    """
    def __iter__(self):
        return iter(self.nodes)

@bofm
Author

bofm commented Jun 8, 2015

I'm afraid somebody might fall into this trap:

class Result(namedtuple('Result', ['nodes', 'tree'])):
    """
    Describe the klass.
    """
    def __iter__(self):
        return iter(self.nodes)

r = Result(('node1', 'node2'), 'tree')
print r
nodes, tree = r
print 'nodes:', nodes
print 'tree:', tree


# Result(nodes=('node1', 'node2'), tree='tree')
# nodes: node1
# tree: node2

after which they would need to go to the source or the docs to realize that __iter__ was overridden.
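
The same pitfall shows up in other places: only iteration is redirected, while `len()` and indexing still see the two tuple fields. A small sketch, reusing the `Result` class from this comment with placeholder string values:

```python
from collections import namedtuple


class Result(namedtuple('Result', ['nodes', 'tree'])):
    """namedtuple subclass whose iteration is redirected to the nodes."""
    def __iter__(self):
        return iter(self.nodes)


r = Result(('node1', 'node2', 'node3'), 'tree')

field_count = len(r)    # tuple length: the two fields, nodes and tree
iterated = list(r)      # goes through the overridden __iter__
first_field = r[0]      # indexing still returns the nodes field

# Unpacking uses iteration, so three nodes cannot fill two targets.
unpack_error = None
try:
    nodes, tree = r
except ValueError as exc:
    unpack_error = exc
```

So the object reports a length of 2 but iterates over three items, and unpacking either fails or silently assigns nodes to both names, depending on how many nodes there are.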

@eugene-eeo
Contributor

Fair enough 👍

I'd advocate for inheritance just to add the docstring as there seems to be no nice way of adding it currently... correct me if I'm wrong.

@bofm
Author

bofm commented Jun 8, 2015

The docstring is not a problem.

#python tip: How to customize a named tuple docstring: Grid = namedtuple('Grid', ['x', 'y']) Grid.x = property(Grid.x.fget, doc='abscissa')

— raymondh (@raymondh) April 26, 2015

@bofm
Author

bofm commented Jun 8, 2015

Oh, that was for the attributes, not for the class. Btw, __doc__ is writable, but only in Python 3. So yes, the subclass is the only easy way.
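
A short sketch of the Python 3 option mentioned here, assuming Python 3, where a class `__doc__` can be assigned after the class is created. The docstring wording is illustrative only.

```python
from collections import namedtuple

Extracted = namedtuple('Extracted', ['nodes', 'tree'])

# Python 3 only: on Python 2 this assignment raises AttributeError,
# which is why a subclass is the easy portable route.
Extracted.__doc__ = 'Extracted nodes together with the tree they came from.'

doc_text = Extracted.__doc__
```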

@rodricios
Contributor

What's the consensus on this? namedtuple subclass but no __iter__ override?

@eugene-eeo
Contributor

Yup.
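
Putting the agreed points together, a hedged sketch of what the outcome could look like: a namedtuple subclass that carries a docstring and does not override `__iter__`. The name `Result`, the docstring wording, and the placeholder values are assumptions; the thread leaves the final name open.

```python
from collections import namedtuple


class Result(namedtuple('Result', ['nodes', 'tree'])):
    """Extracted nodes plus the tree they belong to (illustrative wording)."""
    __slots__ = ()   # no per-instance __dict__, same footprint as a plain namedtuple


r = Result(nodes=('node1', 'node2'), tree='tree')

# Without an __iter__ override, unpacking follows the field order.
nodes, tree = r

# Iterating over the nodes takes one explicit attribute access.
collected = [node for node in r.nodes]
```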

3 participants