Releases: AndyTheFactory/newspaper4k
Minor bug fix
Fixes dependency issues for Python >= 3.11. The NumPy version was incompatible with Colab; this is now fixed.
Also, there was a typo in the Nepali language code - it was "np" instead of "ne". This is now fixed.
Version 0.9.3 Article Parsing improvements and huge jump in multi language support (support for over 40 languages added)
Massive improvements in multi-language capabilities. Added over 40 new languages and completely reworked the language module. Much easier to add new languages now. Additionally, added support for Google News as a source. You can now search and parse news based on keywords, topic, location or website.
Integrated cloudscraper as an optional dependency. If installed, it will be used as a layer over requests. Cloudscraper tries to bypass Cloudflare protection.
We now use two evaluation datasets: the one from scrapinghub and one created by us from the top 200 most popular websites. This will help keep track of future improvements and give a clear view of the impact of the changes.
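The optional-dependency behaviour described above can be sketched roughly as follows. This is only an illustration of the general fallback pattern, not the actual newspaper4k implementation; `load_first_available` and `make_session` are hypothetical helper names introduced here.

```python
import importlib


def load_first_available(module_names):
    """Import and return the first module in module_names that is installed."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of these modules are installed: {module_names}")


def make_session():
    """Hypothetical helper: prefer cloudscraper, fall back to plain requests."""
    backend = load_first_available(["cloudscraper", "requests"])
    if backend.__name__ == "cloudscraper":
        # create_scraper() returns a session object that behaves like requests.Session
        return backend.create_scraper()
    return backend.Session()
```

Because the cloudscraper session is a drop-in replacement for a requests session, the calling code does not need to know which backend was picked.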
We see a steady improvement from version 0.9.0 up to 0.9.3. The evaluation results are available in the documentation. The evaluation dataset is also available in the following repository: Article Extraction Dataset
- You can now install languages that need special packages as optional dependencies
- Google News is now fully integrated into the scraping process.
- You can now pickle sources and articles, which makes it easier to save and resume a scraping run
- Bumped minimum python version support to Python 3.8
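As a rough illustration of the save-and-recover workflow that pickling enables, the sketch below writes a batch of parsed results to disk and reads it back. It assumes the objects behave like ordinary picklable Python objects; plain dicts stand in for real article objects here, and `save_checkpoint` / `load_checkpoint` are hypothetical helper names, not part of the library.

```python
import pickle
import tempfile
from pathlib import Path


def save_checkpoint(articles, path):
    """Write the scraped objects to disk so a later run can pick them up."""
    with open(path, "wb") as fh:
        pickle.dump(articles, fh)


def load_checkpoint(path):
    """Read back objects written by save_checkpoint."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    # Plain dicts used as stand-ins for parsed articles (assumption for this sketch)
    parsed = [{"url": "https://example.com/a", "title": "A", "text": "..."}]
    with tempfile.TemporaryDirectory() as tmp:
        checkpoint = Path(tmp) / "articles.pkl"
        save_checkpoint(parsed, checkpoint)
        restored = load_checkpoint(checkpoint)
    print(restored == parsed)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.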
Version 0.9.2 some major changes in document parsing
- You can now use the module as a command line interface (CLI). Usage: `python -m newspaper --url https://www.test.com`. More information in the documentation.
- I have added an evaluation script against a dataset from scrapinghub. This will help keep track of future improvements.
- Better handling of multithreaded requests. The previous version had a bug that could lead to a deadlock. I implemented ThreadPoolExecutor from the concurrent.futures module, which is more stable. The previous `news_pool` was replaced with a `fetch_news()` function.
- Caching is now much more flexible. You can disable it completely or for a single request.
- You can now use the `newspaper.article()` function for convenience. It will create, download and parse an article in one step. It takes all the parameters of the `Article` class.
- Sites protected by Cloudflare are better detected and raise an exception. The reason will be in the exception message.
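The ThreadPoolExecutor approach mentioned in the list above can be sketched in general terms like this. It is not the library's `fetch_news()` implementation; `fetch_all` and the `download` callable are hypothetical names used only to show the pattern of submitting one task per URL and collecting results without letting a single failure block the others.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, download, threads=4):
    """Run download(url) for every URL on a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    raised while downloading it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Map each future back to the URL it was submitted for
        futures = {pool.submit(download, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep the failure next to the URL instead of aborting the batch
                results[url] = exc
    return results
```

The `with` block waits for all submitted tasks to finish before returning, so the pool is always shut down cleanly even when some downloads fail.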
Version 0.9.1 code refactoring and bugfixes
New feature:
- version bump (f7107be)
- tests: Add test case for (592f6f6)
- parse: added possibility to follow "read more" links in articles (0720de1)
- Allow to pass any requests parameter to the Article constructor. You can now pass verify=False in order to ignore certificate errors (issue #462) (5ff5d27)
- parse: extended data parsing of json-ld metadata (issue #518) (fc413af)
- tests: added script to create test cases (9df8c16)
- parse: added tag for date detection issue #835 (41152eb)
- parse: added og:regDate to known date tags (dc35e29)
- tests: convert unittest to pytest (45c4e8d)
Bugs fixed:
- typing annotation for set python 3.8 (895343f)
- parse: improve meta tag content for articles and pubdate (37bb0b7)
- parse: 📝 improved author detection. improved video links detection (23c547f)
- parse: ensured that clean_doc/doc to clean_top_node are on the same DOM. And doc/top_node on the same DOM. (6874d05)
- small changes, replace os.path with pathlib (5598d95)
- parse: use one file of stopwords for english, the one in the standard folder #503 (6bdf813)
- parse: better author parsing based on issue #493 (f93a9c2)
- parse: make the url date parsing stricter. Issue #514 (0cc1e83)
- parse: replace \n with space in sentence split (Issue #506) (3ccb87c)
- parsing: catch url errors resulting from parsed image links (9140a04)
- correct python versions in pipeline (7e671df)
- gitignore update (8855f00)
First release after the fork
First release after the fork. This release is based on the 0.1.7 release of the original newspaper3k project. I jumped the version number so that it is clear that this is a fork and not the original project.
New feature:
- tests: starting moving tests to pytest (f294a01) (by Andrei)
- parser: add yoast schema parse for date extraction (39a5cff) (by Andrei)
Bugs fixed:
- docs: update README.md (d5f9209) (by Andrei)
- feed_url parsing, issue #915 (ec2d474) (by Andrei)
- better content detection. added and tag as candidate for content parent_node (447a429) (by Andrei)
- close pickle files - PR #938 (d7608da) (by Andrei)
- parsing: improved publication date extraction (4d137eb) (by Andrei)
- some linter errors, whitespaces and spelling (79553f6) (by Andrei)