GitHub - NLPchina/python-readability: fast python port of arc90's readability tool, updated to match latest readability.js!

python-readability

Given a html document, it pulls out the main body text and cleans it up.

This is a python port of a ruby port of arc90's readability project.

Installation

It's easy using pip, just run:

$ pip install readability-lxml

Usage

>> import requests
>> from readability import Document
>>
>> response = requests.get('http://example.com')
>> doc = Document(response.text)
>> doc.title()
>> 'Example Domain'
>> doc.summary()
>> u'<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n    domain in examples without prior coordination or asking for permission.</p>
\n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
\n</body>\n</div></body></html>'

Change Log

0.7.1 Support for Python 3.7 . Fixed a slowdown when processing documents with lots of spaces.
0.7 Improved HTML5 tags handling. Fixed stripping unwanted HTML nodes (only first matching node was removed before).
0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6
0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
0.4 Added Videos loading and allowed more images per paragraph
0.3 Added Document.encoding, positive_keywords and negative_keywords

Licensing

This code is under the Apache License 2.0 license.

Thanks to

Latest readability.js
Ruby port by starrhorne and iterationlabs
Python port by gfxmonk
Decruft effort <http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/> to move to lxml
"BR to P" fix from readability.js which improves quality for smaller texts
Github users contributions.

Name		Name	Last commit message	Last commit date
Latest commit History 156 Commits
readability		readability
tests		tests
.gitignore		.gitignore
.travis.yml		.travis.yml
Makefile		Makefile
README.rst		README.rst
requirements.txt		requirements.txt
setup.py		setup.py
tox.ini		tox.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

python-readability

Installation

Usage

Change Log

Licensing

Thanks to

About

Releases

Packages

Languages

NLPchina/python-readability

Folders and files

Latest commit

History

Repository files navigation

python-readability

Installation

Usage

Change Log

Licensing

Thanks to

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages