Releases · AndyTheFactory/newspaper4k

18 Mar 21:56

AndyTheFactory

0.9.3.1

9989040

Minor bug fix

Fixes for dependencies on Python >= 3.11. The required NumPy version was incompatible with Colab; this is now fixed.

The Nepali language code also had a typo: it was "np" instead of "ne". This is now fixed.

18 Mar 00:10

AndyTheFactory

0.9.3

741fcb3

Version 0.9.3: article parsing improvements and a huge jump in multi-language support (over 40 languages added)

Massive improvements in multi-language capabilities. Added over 40 new languages and completely reworked the language module, so adding new languages is now much easier. Additionally, added support for Google News as a source. You can now search and parse news based on keywords, topic, location or website.
Integrated cloudscraper as an optional dependency. If installed, it is used as a layer over requests. Cloudscraper tries to bypass Cloudflare protection.
We now use two evaluation datasets: the one from scrapinghub and one we created from the top 200 most popular websites. This helps to keep track of future improvements and gives a clear view of the impact of changes.
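
The optional-dependency behaviour described above (use cloudscraper when it is installed, otherwise plain requests) follows a common Python pattern. The following is a minimal sketch of that pattern, not newspaper4k's actual code: the helper names `pick_http_backend` and `make_session` are hypothetical and exist only for illustration.

```python
import importlib.util


def pick_http_backend(preferred="cloudscraper", fallback="requests"):
    """Return the preferred module name if it is importable, else the fallback.

    Hypothetical helper for illustration; not part of newspaper4k.
    """
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


def make_session():
    """Create an HTTP session from cloudscraper if available, else from requests."""
    if pick_http_backend() == "cloudscraper":
        import cloudscraper  # optional dependency

        # create_scraper() returns a requests-compatible session object
        return cloudscraper.create_scraper()

    import requests

    return requests.Session()
```

Because the check happens at call time, installing or removing cloudscraper changes the behaviour without any code or configuration change.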

We see a steady improvement from version 0.9.0 up to 0.9.3. The evaluation results are available in the documentation. The evaluation dataset is also available in the following repository: Article Extraction Dataset

You can now install languages that need special packages as optional dependencies
Google News is fully integrated in the scraping process.
You can now pickle sources and articles, which makes it easier to save and resume a scraping run
Bumped the minimum supported Python version to 3.8
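
Saving and restoring scraping state with pickle could look roughly like the sketch below. It assumes only what the notes state, namely that sources and articles are picklable. The dict is a stand-in for a real article object so that the example runs without network access, and `save_state` / `load_state` are hypothetical helper names.

```python
import pickle
import tempfile
from pathlib import Path


def save_state(obj, path):
    """Serialize any picklable object (e.g. an article or source) to disk."""
    with Path(path).open("wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_state(path):
    """Load a previously saved object. Only unpickle files you trust."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Stand-in for a parsed article; a real run would pass the article object itself.
snapshot = {"url": "https://www.test.com", "title": "Example", "parsed": True}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "article.pkl"
    save_state(snapshot, target)
    restored = load_state(target)
```

Pickle files can execute arbitrary code when loaded, so this approach suits your own saved runs rather than files received from others.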

14 Jan 11:36

AndyTheFactory

0.9.2

97fdcb0

Version 0.9.2: some major changes in document parsing

You can now use the module as a command line interface (CLI). Usage: python -m newspaper --url https://www.test.com. More information in the documentation.
I have added an evaluation script against a dataset from scrapinghub. This will help to keep track of future improvements.
Better handling of multithreaded requests. The previous version had a bug that could lead to a deadlock. I implemented ThreadPoolExecutor from the concurrent.futures module, which is more stable. The previous news_pool was replaced with a fetch_news() function.
Caching is now much more flexible. You can disable it completely or for one request.
You can now use the newspaper.article() function for convenience. It creates, downloads and parses an article in one step. It takes all the parameters of the Article class.
Sites protected by Cloudflare are better detected and raise an exception. The reason is given in the exception message.
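
The move to ThreadPoolExecutor mentioned above can be illustrated with a generic sketch of the pattern. This is not the library's fetch_news() implementation: `fetch_all` and `fake_download` are hypothetical names, and the download function is a local stand-in so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, download, max_workers=4):
    """Run download(url) for every URL in a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    that download raised for it.
    """
    results = {}
    # The context manager waits for all workers and shuts the pool down cleanly.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed URL must not abort the batch
                results[url] = exc
    return results


def fake_download(url):
    """Stand-in for a real downloader; fails for URLs containing 'bad'."""
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"<html>{url}</html>"


results = fetch_all(
    ["https://www.test.com/a", "https://www.test.com/bad"],
    fake_download,
)
```

In real use, the `download` argument could be a small function that wraps newspaper.article(url), which according to the notes above creates, downloads and parses an article in one step.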

08 Nov 13:40

AndyTheFactory

0.9.1

c261786

Version 0.9.1: code refactoring and bugfixes

New feature:

version bump(f7107be)
tests: Add test case for(592f6f6)
parse: added possibility to follow "read more" links in articles(0720de1)
Allow to pass any requests parameter to the Article constructor. You can now pass verify=False in order to ignore certificate errors (issue #462)(5ff5d27)
parse: extended data parsing of json-ld metadata (issue #518)(fc413af)
tests: added script to create test cases(9df8c16)
parse: added tag for date detection issue #835(41152eb)
parse: added og:regDate to known date tags(dc35e29)
tests: convert unittest to pytest(45c4e8d)
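
To illustrate what parsing json-ld metadata involves (the item referencing issue #518 above), here is a minimal standard-library sketch that pulls `datePublished` out of JSON-LD script blocks. It is an illustration of the general technique, not the parser used by newspaper4k: `JsonLdCollector` and `find_date_published` are hypothetical names, and only top-level objects or lists are inspected.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect decoded contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def find_date_published(html):
    """Return the first top-level datePublished value found, or None."""
    parser = JsonLdCollector()
    parser.feed(html)
    for item in parser.items:
        candidates = item if isinstance(item, list) else [item]
        for entry in candidates:
            if isinstance(entry, dict) and "datePublished" in entry:
                return entry["datePublished"]
    return None


sample = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "NewsArticle", "datePublished": "2024-03-18T21:56:00Z"}'
    "</script></head><body></body></html>"
)
published = find_date_published(sample)
```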

Bugs fixed:

typing annotation for set python 3.8(895343f)
parse: improve meta tag content for articles and pubdate(37bb0b7)
parse: 📝 improved author detection. improved video links detection(23c547f)
parse: ensured that clean_doc/doc to clean_top_node are on the same DOM. And doc/top_node on the same DOM.(6874d05)
small changes, replace os.path with pathlib(5598d95)
parse: use one file of stopwords for english, the one in the standard folder #503(6bdf813)
parse: better author parsing based on issue #493(f93a9c2)
parse: make the url date parsing stricter. Issue #514(0cc1e83)
parse: replace \n with space in sentence split (Issue #506)(3ccb87c)
parsing: catch url errors resulting from parsed image links(9140a04)
correct python versions in pipeline(7e671df)
gitignore update(8855f00)
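
As an illustration of what stricter URL date parsing (the item referencing issue #514 above) can mean in practice, the sketch below accepts only a full /YYYY/MM/DD path segment sequence and rejects values that are not real calendar dates. The exact rule used by newspaper4k is not stated in these notes, so the pattern and the name `date_from_url` are assumptions made for illustration.

```python
import re
from datetime import date
from urllib.parse import urlparse

# Assumed rule: year, month and day must each be a complete path segment.
_URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?=/|$)")


def date_from_url(url):
    """Return the date encoded in the URL path, or None if there is no valid one."""
    path = urlparse(url).path
    match = _URL_DATE.search(path)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)  # rejects impossible dates such as 2023/13/40
    except ValueError:
        return None
```

Validating through `datetime.date` rather than the regex alone keeps number-like path segments, such as product IDs, from being mistaken for publication dates.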

29 Oct 23:27

AndyTheFactory

0.9.0

c11f950

First release after the fork

First release after the fork. This release is based on the 0.1.7 release of the original newspaper3k project. I skipped ahead in version numbers to make it clear that this is a fork and not the original project.

New feature:

tests: started moving tests to pytest(f294a01) (by Andrei)
parser: add yoast schema parse for date extraction(39a5cff) (by Andrei)

Bugs fixed:

docs: update README.md(d5f9209) (by Andrei)
feed_url parsing, issue #915(ec2d474) (by Andrei)
better content detection. added and tag as candidate for content parent_node(447a429) (by Andrei)
close pickle files - PR #938(d7608da) (by Andrei)
parsing: improved publication date extraction(4d137eb) (by Andrei)
some linter errors, whitespaces and spelling(79553f6) (by Andrei)
