From 66a2b0d603b5d9b14f765020baa3ee98573f4388 Mon Sep 17 00:00:00 2001 From: santhoshse7en Date: Sun, 3 Nov 2024 12:05:20 +0530 Subject: [PATCH] codebase revamp & bug fixes --- CODE_OF_CONDUCT.md | 83 ++++----- LICENSE | 29 ++- README.md | 103 +++++------ newsfetch/example/__init__.py | 1 + newsfetch/example/sample.json | 74 ++++---- newsfetch/google.py | 91 +++------- newsfetch/helpers.py | 127 +++----------- newsfetch/news.py | 293 ++++++++++++++++--------------- newsfetch/news_please_handler.py | 90 ++++++++++ newsfetch/newspaper_handler.py | 105 +++++++++++ newsfetch/soup_handler.py | 124 +++++++++++++ requirements.txt | 19 +- setup.py | 43 +++-- 13 files changed, 700 insertions(+), 482 deletions(-) create mode 100644 newsfetch/example/__init__.py create mode 100644 newsfetch/news_please_handler.py create mode 100644 newsfetch/newspaper_handler.py create mode 100644 newsfetch/soup_handler.py diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md index 07c2566..5a32d65 100644 --- a/CODE_OF_CONDUCT.md +++ b/CODE_OF_CONDUCT.md @@ -2,75 +2,60 @@ ## Our Pledge -In the interest of fostering an open and welcoming environment, we as -contributors and maintainers pledge to making participation in our project and -our community a harassment-free experience for everyone, regardless of age, body -size, disability, ethnicity, sex characteristics, gender identity and expression, -level of experience, education, socio-economic status, nationality, personal -appearance, race, religion, or sexual identity and orientation. +We, as contributors and maintainers, pledge to foster an open and welcoming environment in our project and community. +We are committed to ensuring that participation is a harassment-free experience for everyone, regardless of age, +body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, +socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation. 
## Our Standards -Examples of behavior that contributes to creating a positive environment -include: +We strive to create a positive environment through behaviors such as: -* Using welcoming and inclusive language -* Being respectful of differing viewpoints and experiences -* Gracefully accepting constructive criticism -* Focusing on what is best for the community -* Showing empathy towards other community members +* Using inclusive and welcoming language +* Respecting differing viewpoints and experiences +* Accepting constructive criticism graciously +* Prioritizing the community's best interests +* Showing empathy towards fellow community members -Examples of unacceptable behavior by participants include: +Unacceptable behaviors include: -* The use of sexualized language or imagery and unwelcome sexual attention or - advances -* Trolling, insulting/derogatory comments, and personal or political attacks -* Public or private harassment -* Publishing others' private information, such as a physical or electronic - address, without explicit permission -* Other conduct which could reasonably be considered inappropriate in a - professional setting +* Using sexualized language or imagery, and unwelcome sexual advances +* Trolling, derogatory comments, or personal/political attacks +* Harassment, whether public or private +* Sharing others' private information without explicit permission +* Any conduct that could be considered inappropriate in a professional context ## Our Responsibilities -Project maintainers are responsible for clarifying the standards of acceptable -behavior and are expected to take appropriate and fair corrective action in -response to any instances of unacceptable behavior. +Project maintainers are responsible for defining acceptable behavior standards and are expected to take fair and +appropriate action in response to any instances of unacceptable behavior. -Project maintainers have the right and responsibility to remove, edit, or -reject comments, commits, code, wiki edits, issues, and other contributions -that are not aligned to this Code of Conduct, or to ban temporarily or -permanently any contributor for other behaviors that they deem inappropriate, -threatening, offensive, or harmful. +Maintainers have the authority to remove, edit, or reject comments, commits, code, wiki edits, issues, and +contributions that do not align with this Code of Conduct. They can also temporarily or permanently ban contributors +for behaviors deemed inappropriate, threatening, offensive, or harmful. ## Scope -This Code of Conduct applies both within project spaces and in public spaces -when an individual is representing the project or its community. Examples of -representing a project or community include using an official project e-mail -address, posting via an official social media account, or acting as an appointed -representative at an online or offline event. Representation of a project may be -further defined and clarified by project maintainers. +This Code of Conduct applies within project spaces and public spaces when individuals represent the project or +community. Representation includes using an official project email address, posting on official social media accounts, +or acting as appointed representatives at events. The definition of representation may be further clarified by +project maintainers. ## Enforcement -Instances of abusive, harassing, or otherwise unacceptable behavior may be -reported by contacting the project team at santhoshse7en@gmail.com. 
All -complaints will be reviewed and investigated and will result in a response that -is deemed necessary and appropriate to the circumstances. The project team is -obligated to maintain confidentiality with regard to the reporter of an incident. -Further details of specific enforcement policies may be posted separately. +Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project +team at santhoshse7en@gmail.com. All complaints will be reviewed and investigated, resulting in an appropriate +response to the circumstances. The project team will maintain confidentiality regarding the identity of the reporter. +Specific enforcement policies may be outlined separately. -Project maintainers who do not follow or enforce the Code of Conduct in good -faith may face temporary or permanent repercussions as determined by other -members of the project's leadership. +Project maintainers who fail to follow or enforce the Code of Conduct in good faith may face temporary or permanent +consequences as determined by other members of the project's leadership. ## Attribution -This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, -available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html +This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, +available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html. [homepage]: https://www.contributor-covenant.org -For answers to common questions about this code of conduct, see -https://www.contributor-covenant.org/faq +For answers to common questions about this code of conduct, see https://www.contributor-covenant.org/faq. \ No newline at end of file diff --git a/LICENSE b/LICENSE index 79e30dc..c0ac81a 100644 --- a/LICENSE +++ b/LICENSE @@ -2,20 +2,19 @@ MIT License Copyright (c) [2019] [M Santhosh Kumar] -Permission is hereby granted, free of charge, to any person obtaining a copy -of this software and associated documentation files (the "Software"), to deal -in the Software without restriction, including without limitation the rights -to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -copies of the Software, and to permit persons to whom the Software is -furnished to do so, subject to the following conditions: +Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated +documentation files (the "Software"), to deal in the Software without restriction, including without limitation the +rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit +persons to whom the Software is furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all copies or substantial portions of +the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO +THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. IN NO EVENT SHALL +THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF +CONTRACT, TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS +IN THE SOFTWARE. + -The above copyright notice and this permission notice shall be included in all -copies or substantial portions of the Software. 
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
diff --git a/README.md b/README.md
index 7db2038..1b04681 100644
--- a/README.md
+++ b/README.md
@@ -6,16 +6,16 @@
 
 
-news-fetch is an open-source, easy-to-use news crawler that extracts structured information from almost any news website. It can follow recursively internal hyperlinks and read RSS feeds to fetch both most recent and also old, archived articles. You only need to provide the root URL of the news website to crawl it completely. News-fetch combines the power of multiple state-of-the-art libraries and tools, such as [news-please](https://github.com/fhamborg/news-please) - [Felix Hamborg](https://www.linkedin.com/in/felixhamborg/) and [Newspaper3K](https://github.com/codelucas/newspaper/) - [Lucas (欧阳象) Ou-Yang](https://www.linkedin.com/in/lucasouyang/). This package consists of both features provided by Felix's work and Lucas' work.
-
-I built this to reduce most of NaN or '' or [] or 'None' values while scraping for some news websites. Platform-independent and written in Python 3. Programmers and developers can very easily use this package to access the news data to their programs.
+**news-fetch** is an open-source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both recent and archived articles. You only need to provide the root URL of the news website to crawl it completely. News-fetch combines the power of multiple state-of-the-art libraries and tools, including [news-please](https://github.com/fhamborg/news-please) by [Felix Hamborg](https://www.linkedin.com/in/felixhamborg/) and [Newspaper3K](https://github.com/codelucas/newspaper/) by [Lucas (欧阳象) Ou-Yang](https://www.linkedin.com/in/lucasouyang/). This package leverages features from both of these works.
+
+I built this tool to minimize NaN or empty values when scraping data from various news websites. It's platform-independent and written in Python 3, making it easy for programmers and developers to access news data for their applications.
 
 | Source | Link |
-| --- | --- |
-| PyPI: | https://pypi.org/project/news-fetch/ |
-| Repository: | https://santhoshse7en.github.io/news-fetch/ |
-| Documentation: | https://santhoshse7en.github.io/news-fetch_doc/ (**Not Yet Created!**) |
+| -------------- | ---------------------------------------------------------------------- |
+| PyPI:          | [https://pypi.org/project/news-fetch/](https://pypi.org/project/news-fetch/) |
+| Repository:    | [https://santhoshse7en.github.io/news-fetch/](https://santhoshse7en.github.io/news-fetch/) |
+| Documentation: | [https://santhoshse7en.github.io/news-fetch_doc/](https://santhoshse7en.github.io/news-fetch_doc/) (**Not Yet Created!**) |
 
 ## Dependencies
 
@@ -27,69 +26,71 @@ I built this to reduce most of NaN or '' or [] or 'None' values while scraping f
 - [chromedriver-binary](https://pypi.org/project/chromedriver-binary/)
 - [pandas](https://pypi.org/project/pandas/)
 
-## Extracted information
-news-fetch extracts the following attributes from news articles. 
Also, have a look at an [examplary JSON file](https://github.com/santhoshse7en/news-fetch/blob/master/newsfetch/example/sample.json) extracted by news-please. -* headline -* name(s) of author(s) -* publication date -* publication -* category -* source_domain -* article -* summary -* keyword -* url -* language - -## Dependencies Installation - -Use the package manager [pip](https://pip.pypa.io/en/stable/) to install following -```bash -pip install -r requirements.txt -``` +## Extracted Information -## Usage +news-fetch extracts the following attributes from news articles. You can also check out an [example JSON file](https://github.com/santhoshse7en/news-fetch/blob/master/newsfetch/example/sample.json) generated by news-please. -Download it by clicking the green download button here on [Github](https://github.com/santhoshse7en/news-fetch/archive/master.zip). To extract URLs from a targeted website, call the google_search function. You only need to parse the keyword and newspaper link argument. +- Headline +- Author(s) +- Publication date +- Publication +- Category +- Source domain +- Article content +- Summary +- Keywords +- URL +- Language -```python ->>> from newsfetch.google import google_search ->>> google = google_search('Alcoholics Anonymous', 'https://timesofindia.indiatimes.com/') -``` +## Dependency Installation -Use the `URLs` attribute to get the links of all the news articles scraped. +Use the package manager [pip](https://pip.pypa.io/en/stable/) to install the required dependencies: -```python ->>> google.urls +```bash +pip install -r requirements.txt ``` -**Directory of google search results urls** - -![google](https://user-images.githubusercontent.com/47944792/88402193-68a56d00-cde8-11ea-8f26-9f7bf19359b2.PNG) +## Usage +You can download it by clicking the green download button on [Github](https://github.com/santhoshse7en/news-fetch/archive/master.zip). 
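+
+Alternatively, the released package can be installed directly from [PyPI](https://pypi.org/project/news-fetch/):
+
+```bash
+pip install news-fetch
+```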
-To scrape all the news details, call the newspaper function
+To scrape all the news details, use the `Newspaper` class:
 
 ```python
->>> from newsfetch.news import newspaper
->>> news = newspaper('https://www.bbc.co.uk/news/world-48810070')
-```
+from newsfetch.news import Newspaper
 
-**Directory of news**
+news = Newspaper(url='https://www.thehindu.com/news/cities/Madurai/aa-plays-a-pivotal-role-in-helping-people-escape-from-the-grip-of-alcoholism/article67716206.ece')
+print(news.headline)
+# Output: 'AA plays a pivotal role in helping people escape from the grip of alcoholism'
+```
 
-![newsdir](https://user-images.githubusercontent.com/47944792/60564817-c058dc80-9d7e-11e9-9b3e-d0b5a903d972.PNG)
+To extract URLs from a targeted website, call the `GoogleSearchNewsURLExtractor` by providing the keyword and newspaper link as arguments:
 
 ```python
->>> news.headline
-
-'g20 summit: trump and xi agree to restart us china trade talks'
+from newsfetch.google import GoogleSearchNewsURLExtractor
+
+google = GoogleSearchNewsURLExtractor(keyword='Alcoholics Anonymous', news_domain='https://timesofindia.indiatimes.com/')
+print(google.urls)
+"""
+['https://timesofindia.indiatimes.com/city/pune/pune-takes-a-stand-against-alcoholism-experts-collaborate-with-alcoholics-anonymous/articleshow/114438466.cms',
+'https://timesofindia.indiatimes.com/city/mumbai/we-have-lost-jobs-homes-alcoholics-anonymous/articleshow/96824383.cms',
+'https://timesofindia.indiatimes.com/city/gurgaon/gurgaons-alcoholics-open-up-about-their-road-to-recovery/articleshow/45080744.cms',
+'https://timesofindia.indiatimes.com/city/goa/alcoholism-is-illness-not-issue-of-weak-willpower-say-experts/articleshow/105320008.cms',
+'https://timesofindia.indiatimes.com/city/bhopal/alcoholism-is-an-illness-bhopal-aa-silver-jubilee-celebration/articleshow/106849014.cms',
+'https://timesofindia.indiatimes.com/city/ahmedabad/alcoholics-anonymous-switches-to-online-sessions/articleshow/76144639.cms',
+'https://timesofindia.indiatimes.com/city/kochi/keralites-trying-to-kick-alcoholism-alcoholics-anonymous/articleshow/13977818.cms',
+'https://timesofindia.indiatimes.com/city/chandigarh/alcoholics-anonymous-turned-their-lives-around/articleshow/18239.cms',
+'https://timesofindia.indiatimes.com/city/mumbai/like-air-india-flyer-alcoholics-anonymous-members-reap-whirlwind-of-job-loss-broken-homes/articleshow/96820403.cms',
+'https://timesofindia.indiatimes.com/city/nagpur/alcoholics-anonymous-meet-promotes-one-day-at-a-time/articleshow/50538092.cms']
+"""
 ```
 
 ## Contributing
 
-Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
+Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.
 
-Please make sure to update tests as appropriate.
+Make sure to update tests as appropriate.
 
 ## License
 
-[MIT](https://choosealicense.com/licenses/mit/)
+This project is licensed under the [MIT](https://choosealicense.com/licenses/mit/) License. 
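+
+## Serializing Results
+
+Every `Newspaper` object also exposes a `get_dict` attribute that bundles all extracted fields into one serializable dictionary (the same structure as the [example JSON file](https://github.com/santhoshse7en/news-fetch/blob/master/newsfetch/example/sample.json)). A minimal sketch for persisting a record (the output path is illustrative):
+
+```python
+import json
+
+from newsfetch.news import Newspaper
+
+news = Newspaper(url='https://www.thehindu.com/news/cities/Madurai/aa-plays-a-pivotal-role-in-helping-people-escape-from-the-grip-of-alcoholism/article67716206.ece')
+
+# Persist the structured record as JSON (illustrative file name)
+with open('article.json', 'w') as fp:
+    json.dump(news.get_dict, fp, indent=2)
+```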
+ diff --git a/newsfetch/example/__init__.py b/newsfetch/example/__init__.py new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/newsfetch/example/__init__.py @@ -0,0 +1 @@ + diff --git a/newsfetch/example/sample.json b/newsfetch/example/sample.json index 4f0f84e..c378713 100644 --- a/newsfetch/example/sample.json +++ b/newsfetch/example/sample.json @@ -1,37 +1,39 @@ { - "headline": "Facebook is spending $5.7 billion to capitalize on India's internet boom", - "author": ["Sherisse Pham", "Cnn Business"], - "date_publish": "2020-04-22 04:33:39", - "date_modify": "None", - "date_download": "2020-05-29 22:11:27", - "image_url": "en", - "filename": "https://www.cnn.com/2020/04/22/tech/facebook-india-reliance-jio/index.html.json", - "description": "Facebook is investing $5.7 billion into Jio Platforms, the digital technology arm of Indian billionaire Mukesh Ambani's sprawling conglomerate Reliance Industries.", - "publication": "CNN", - "category": "tech", - "source_domain": "www.cnn.com", - "article": "Hong Kong (CNN Business) Facebook (FB) is spending billions of dollars for a stake in India's largest mobile operator, and teaming up with the country's richest man to tap into an internet boom. The deal, announced Wednesday, will see the US company invest $5.7 billion for a 9.99% stake in Jio Platforms, the digital technology arm of Indian billionaire Mukesh Ambani's sprawling conglomerate Reliance Industries. Jio Platforms has several services under its umbrella, including Reliance Jio, the mobile network that has taken India by storm since launching less than four years ago, racking up 388 million users, putting his brother and rival out of business and forcing other local operators to merge. Apps where users can stream movies, shop online and read news also fall under Jio Platforms. The tie-up includes a commercial partnership with WhatsApp that potentially paves the way for Facebook to make money from the messaging service's 400 million users in India. The partnership comes at a key time for tech in India. The market is growing, but it's getting tougher for global firms to profit because of shifting regulations , making it all the more important for companies like Facebook to forge key alliances if they want to cash in. It is also a huge investment at a time when the global economy is teetering on the edge.", - "summary": "Hong Kong (CNN Business) Facebook (FB) is spending billions of dollars for a stake in India's largest mobile operator, and teaming up with the country's richest man to tap into an internet boom. The deal, announced Wednesday, will see the US company invest $5.7 billion for a 9.99% stake in Jio Platforms, the digital technology arm of Indian billionaire Mukesh Ambani's sprawling conglomerate Reliance Industries. Apps where users can stream movies, shop online and read news also fall under Jio Platforms. The tie-up includes a commercial partnership with WhatsApp that potentially paves the way for Facebook to make money from the messaging service's 400 million users in India. 
It is also a huge investment at a time when the global economy is teetering on the edge.", - "keyword": [ - "mobile", - "billion", - "platforms", - "57", - "partnership", - "boom", - "indias", - "stake", - "users", - "jio", - "internet", - "services", - "facebook", - "million", - "capitalize", - "reliance", - "spending" - ], - "title_page": "None", - "title_rss": "None", - "url": "https://www.cnn.com/2020/04/22/tech/facebook-india-reliance-jio/index.html" -} + "headline": "Three militants, including wanted LeT 'commander', killed in Kashmir gunfights", + "author": [ + "Authors" + ], + "date_publish": "2024-11-02 07:31:00", + "date_modify": "None", + "date_download": "2024-11-03 10:19:11", + "language": "en", + "image_url": "https://th-i.thgim.com/public/incoming/bl0vi9/article68823115.ece/alternates/LANDSCAPE_1200/PTI11_02_2024_000186A.jpg", + "filename": "https://www.thehindu.com/news/national/jammu-and-kashmir/militants-killed-in-encounter-in-jks-anantnag/article68822224.ece.json", + "description": "Two militants killed in Anantnag encounter, one foreigner, one local; operation ongoing, details awaited; another encounter in Srinagar.", + "publication": "The Hindu", + "category": "Jammu and Kashmir", + "source_domain": "www.thehindu.com", + "source_favicon_url": "https://www.thehindu.com/favicon.ico", + "article": "Three militants, including wanted Lashkar-e-Taiba (LeT) 'commander' Usman Lashkari, were killed in two separate anti-militancy operations in Kashmir on Saturday (November 2, 2024). Two policemen and two CRPF jawans were also injured in the operations. A gunbattle raged in the old city's Khanyar area on Saturday (November 3, 2024) morning when a quick response team (QRT) of the security forces worked upon a tip-off about the presence of a militant in a congested locality. \"Security forces acted swiftly on an intelligence input. A coordinated cordon-and-search operation was launched around the suspected hideout. Security personnel were fired upon when the house where the militant was hiding was approached,\" Inspector General of Police V.K. Birdi said. The fierce gunbattle rattled the old city after a gap of three years. At least three houses, which caught fire during the operation, were damaged in the operation. \"A full-scale operation lasted until late in the evening. During the intense exchange (of fire), the forces successfully eliminated a foreign terrorist, later identified as Usman Lashkari of the LeT,\" IGP Birdi said. Also read: J&K terror attacks: 'Not an issue of security lapses, security forces giving befitting reply,' says Rajnath Singh He said slain Lashkari was active in the Valley for several months. \"The slain was also linked to the killing of Inspector Masroor Mir in the Eidgah area,\" IGP Birdi said. A large cache of arms and ammunition was recovered from the encounter site, the police said. Four security personnel, including two policemen and two CRPF jawans were also injured in the encounter. Officials said all the injured were shifted to hospital and their condition is \"stable\". In a separate operation in south Kashmir, the Army engaged a group of militants in a firefight in Halkan Gali, Anantnag. The Army said security forces observed suspicious movement and the terrorists were challenged. \"Terrorists opened indiscriminate fire. Troops effectively retaliated, which resulted in the elimination of two terrorists,\" the Army said. 
The anti-militancy operation continued throughout the day in the areas as the Army suspected presence of more militants in the area. The Army said another group of militants was engaged in a firefight on Friday evening in north Kashmir. \"On November 1, 2024, suspicious movement was spotted in Panar area of Bandipora by alert troops. On being challenged, terrorists opened indiscriminate fire and escaped into the jungle,\" the Army said. In another incident, a soldier died in Srinagar's Rawalpora area. Officials said initial reports suggested that the soldier died in \"accidental fire\". The incident took place when the soldier was part of a road opening party in Rawalpora. J&K has witnessed eight major militancy related incidents in the past one month, including four attacks on non locals in Kashmir. Fifteen people, including 10 civilians, two soldiers and two militants, were killed in these militancy related incidents. National Conference president Dr. Farooq Abdullah demanded a probe into stepped up incidents of militancy in J&K, especially after a new government took charge. \"These attackers from the last few days should be caught alive so that we can identify who is behind them,\" Dr. Abdullah said. J&K Pradesh Congress Committee (JKPCC) president and MLA Tariq Hameed Karra alleged that there seems to be a significant gap in the intelligence grid, which must be addressed. \"The atmosphere being created now seems to differ from the context of these attacks. Those responsible for this violence do not desire peace in J&K. The timing of the attacks has raised suspicions. They are happening after a successful electoral process involving all stakeholders,\" Mr. Karra said. CPI(M) leader and MLA Kulgam M.Y. Tarigami expressed concern over the situation. \"People of Jammu and Kashmir desire peace. There is a need for lasting peace in the region so that people can carry on with their daily lives safely and peacefully,\" Mr. Tarigami.", + "summary": "Three militants, including wanted Lashkar-e-Taiba (LeT) 'commander' Usman Lashkari, were killed in two separate anti-militancy operations in Kashmir on Saturday (November 2, 2024). Two policemen and two CRPF jawans were also injured in the operations. 
A gunbattle raged in the old city's Khanyar area on Saturday (November 3, 2024) morning when a quick response team (QRT) of the security forces worked upon a tip-off about the presence of a militant in a congested locality.",
+  "keyword": [
+    "attacks",
+    "army",
+    "killed",
+    "operation",
+    "militants",
+    "kashmir",
+    "security",
+    "let",
+    "area",
+    "wanted",
+    "gunfights",
+    "terrorists",
+    "including",
+    "forces",
+    "commander"
+  ],
+  "title_page": null,
+  "title_rss": null,
+  "url": "https://www.thehindu.com/news/national/jammu-and-kashmir/militants-killed-in-encounter-in-jks-anantnag/article68822224.ece"
+}
\ No newline at end of file
diff --git a/newsfetch/google.py b/newsfetch/google.py
index 0ac2c11..6d4621e 100644
--- a/newsfetch/google.py
+++ b/newsfetch/google.py
@@ -1,75 +1,42 @@
-from newsfetch.helpers import (get_chrome_web_driver, get_web_driver_options,
-                               set_automation_as_head_less,
-                               set_browser_as_incognito,
-                               set_ignore_certificate_error)
-from newsfetch.utils import (BeautifulSoup, Options, UserAgent,
-                             chromedriver_binary, get, re, sys, time,
-                             webdriver)
+from selenium import webdriver
+from selenium.webdriver.common.by import By
 
 
-class google_search:
-
-    def __init__(self, keyword, newspaper_url):
+class GoogleSearchNewsURLExtractor:
+    """Extracts news article URLs from Google search results based on a keyword and site."""
 
+    def __init__(self, keyword, news_domain):
+        """Initialize with the search keyword and the target newspaper URL."""
         self.keyword = keyword
-        self.newspaper_url = newspaper_url
+        self.news_domain = news_domain
+        self.urls = []  # List to store extracted URLs
 
-        random_headers = {'User-Agent': UserAgent().random,
-                          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
+        # Prepare the search term for Google (double quotes force an exact-phrase match)
+        self.search_term = f'"{self.keyword}" site:{self.news_domain}'
+        query = "+".join(self.search_term.split())
+        search_url = f"https://www.google.com/search?q={query}"
 
-        self.search_term = '"{}" site:{}'.format(self.keyword, self.newspaper_url)
+        # Set up the web driver options
+        options = webdriver.ChromeOptions()
+        options.add_argument("--headless")  # Run in headless mode (no UI)
+        options.add_argument("--ignore-certificate-errors")  # Ignore SSL certificate errors
+        options.add_argument("--incognito")  # Open in incognito mode
 
-        url = "https://www.google.com/search?q={}".format('+'.join(self.search_term.split()))
-
-        options = get_web_driver_options()
-        set_automation_as_head_less(options)
-        set_ignore_certificate_error(options)
-        set_browser_as_incognito(options)
-        driver = get_chrome_web_driver(options)
-        driver.get(url)
-
-        url_list = []
+        # Start the web driver
+        driver = webdriver.Chrome(options=options)
+        driver.get(search_url)
 
         try:
+            # Find and collect all news article links on the current page
+            links = driver.find_elements(By.XPATH, "//div[@class='yuRUbf']/div/span/a")
+            url_list = [link.get_attribute("href") for link in links]
 
-            if len(driver.find_elements_by_xpath('//div[@id="result-stats"]')) != 0:
-
-                results = driver.find_elements_by_xpath(
-                    '//div[@id="result-stats"]')[0].text
-                results = results[:results.find('results')]
-                max_pages = round(
-                    int(int(''.join(i for i in results if i.isdigit())) / 10))
-
-                if max_pages != 0:
-
-                    index = 0
-
-                    while True:
-                        try:
-                            index += 1
-                            links = driver.find_elements_by_xpath(
-                                '//div[@class="yuRUbf"]/a')
-                            linky = [link.get_attribute('href') for link in links]
-                            url_list.extend(linky)
-                            try:
-                                driver.find_element_by_xpath('//*[@id="pnnext"]/span[2]').click()
-                            except:
-                                break
-                            time.sleep(2)
-                            sys.stdout.write('\r 
No.of pages parsed : %s\r' % (str(index))) - sys.stdout.flush() - except: - continue - - driver.quit() - else: - raise ValueError( - 'Your search - %s - did not match any documents.' % str(self.search_term)) - + # Filter out unwanted URLs (e.g., PDFs or XMLs) and remove duplicates url_list = list(dict.fromkeys(url_list)) - url_list = [url for url in url_list if '.pdf' not in url] - self.urls = [url for url in url_list if '.xml' not in url] + self.urls = [url for url in url_list if ".pdf" not in url and ".xml" not in url] + + except Exception as e: + raise ValueError(f"An error occurred during the search: {e}") - except: - raise ValueError( - 'Your search - %s - did not match any documents.' % str(self.search_term)) + finally: + driver.quit() # Ensure the driver is closed at the end diff --git a/newsfetch/helpers.py b/newsfetch/helpers.py index 3243cc5..756412c 100644 --- a/newsfetch/helpers.py +++ b/newsfetch/helpers.py @@ -1,110 +1,37 @@ -from newsfetch.utils import chromedriver_binary, json, pd, unidecode, webdriver +import re +from collections import Counter -""" -The Below functions contains error handling, unidecode, digits extraction, dataframe cleaning -""" +from unidecode import unidecode -errors = {'None': None, 'list': [], 'dict': {}} +def unicode(text: str) -> str: + return unidecode(text).strip() -def catch(default, func, handle=lambda e: e, *args, **kwargs): - try: - return func(*args, **kwargs) - except: - return errors[default] +def clean_text(article): + """Clean the article text by removing extra whitespace and newlines.""" + # Remove extra whitespace and newlines + cleaned_article = re.sub(r'\s+', ' ', article).strip() + return cleaned_article -def unicode(text: str) -> bool: - return unidecode.unidecode(text).strip() +def extract_keywords(article): + """Extract keywords from the article.""" + # Split the article into words + words = re.findall(r'\b\w+\b', article.lower()) + # Count frequency of each word + word_counts = Counter(words) + # Filter out common stopwords (you can expand this list) + stopwords = {'and', 'the', 'is', 'in', 'of', 'to', 'a', 'for', 'was', 'that', 'on', 'as', 'with', 'it', 'this', + 'are', 'by', 'an'} + keywords = {word: count for word, count in word_counts.items() if word not in stopwords} -def news_article(text): - return unicode(' '.join(text.replace('’', '').split())) + # Return keywords sorted by frequency + return sorted(keywords.items(), key=lambda x: x[1], reverse=True) -def digits(text: str) -> bool: - return int(''.join(i for i in text if i.isdigit())) - - -def dataframe_data(df): - return df.dropna(how='all').reset_index(drop=True) - - -def get_chrome_web_driver(options): - return webdriver.Chrome(chrome_options=options) - - -def get_web_driver_options(): - return webdriver.ChromeOptions() - - -def set_ignore_certificate_error(options): - options.add_argument('--ignore-certificate-errors') - - -def set_browser_as_incognito(options): - options.add_argument('--incognito') - - -def set_automation_as_head_less(options): - options.add_argument('--headless') - - -def author(soup): - i = 0 - while True: - meta = json.loads(soup.select( - 'script[type="application/ld+json"]')[i].text) - df = catch('None', lambda: pd.DataFrame(meta)) - meta_check = any(word in 'author' for word in list(meta.keys())) - authors = catch('None', lambda: meta.get('author') if meta_check == True else df['author'][0] if df != None else meta.get( - 'author')['name'] if meta_check == True else meta[0].get('author')['name'] if type(meta) == list else 'N/A') - if '' != 
authors or i == 3: - break - i += 1 - return author - - -def date(soup): - i = 0 - while True: - meta = json.loads(soup.select( - 'script[type="application/ld+json"]')[i].text) - df = catch('None', lambda: pd.DataFrame(meta)) - meta_check = any(word in 'datePublished' for word in list(meta.keys())) - date = catch('None', lambda: meta.get('datePublished') if meta_check == - True else df['datePublished'][0] if df != None else meta[0].get('datePublished') if type(meta) == list else 'N/A') - if '' != date or i == 3: - break - i += 1 - return date - - -def category(soup): - i = 0 - while True: - meta = json.loads(soup.select( - 'script[type="application/ld+json"]')[i].text) - df = catch('None', lambda: pd.DataFrame(meta)) - meta_check = any(word in '@type' for word in list(meta.keys())) - category = catch('None', meta.get('@type') if meta_check == - True else df['@type'][0] if len(df) != 0 else 'N/A') - if '' != category or i == 3: - break - i += 1 - - -def publisher(soup): - i = 0 - while True: - meta = json.loads(soup.select( - 'script[type="application/ld+json"]')[i].text) - df = catch('None', lambda: pd.DataFrame(meta)) - meta_check = any(word in 'publisher' for word in list(meta.keys())) - publisher = catch('None', lambda: meta.get('publisher') if meta_check == True else df['publisher'][0] if df != None else meta.get( - 'publisher')['name'] if meta_check == True else meta[0].get('publisher')['name'] if type(meta) == list else 'N/A') - if '' != publisher or i == 3: - break - i += 1 - - return publisher +def summarize_article(article, max_sentences=3): + """Summarize the article by extracting the first few sentences.""" + sentences = re.split(r'(?<=[.!?]) +', article) + summary = ' '.join(sentences[:max_sentences]) # Take the first few sentences + return summary diff --git a/newsfetch/news.py b/newsfetch/news.py index 3e704d0..71c6cf9 100644 --- a/newsfetch/news.py +++ b/newsfetch/news.py @@ -1,141 +1,152 @@ -from newsfetch.helpers import (author, catch, category, date, news_article, - publisher, unicode) -from newsfetch.utils import Article, BeautifulSoup, NewsPlease, get, unquote - - -class newspaper: - - def __init__(self, uri: str) -> bool: - - self.uri = uri - - """ - :return: Initializing the values with 'None', In case if the below values not able to extracted from the target uri - """ - - # NewsPlease Scraper - newsplease = catch( - 'None', lambda: NewsPlease.from_url(self.uri, timeout=6)) - - # Newspaper3K Scraper - article = catch('None', lambda: Article(self.uri, timeout=6)) - catch('None', lambda: article.download()) - catch('None', lambda: article.parse()) - catch('None', lambda: article.nlp()) - - soup = catch('None', lambda: BeautifulSoup(get(self.uri).text, 'lxml')) - - if all([newsplease, article, soup]) == None: - raise ValueError( - "Sorry, the page you are looking for doesn't exist'") - - """ - :returns: The News Article - """ - self.article = catch('None', lambda: news_article(article.text) if article.text != - None else news_article(newsplease.maintext) if newsplease.maintext != None else 'None') - - """ - :returns: The News Authors - """ - self.authors = catch('list', lambda: newsplease.authors if len(newsplease.authors) != 0 else article.authors if len( - article.authors) != 0 else unicode([author(soup)]) if author(soup) != None else ['None']) - - """ - :returns: The News Published, Modify, Download Date - """ - self.date_publish = catch('None', lambda: str(newsplease.date_publish) if str(newsplease.date_publish) != 'None' else article.meta_data[ - 
'article']['published_time'] if article.meta_data['article']['published_time'] != None else date(soup) if date(soup) != None else 'None')
-
-        self.date_modify = catch('None', lambda: str(newsplease.date_modify))
-
-        self.date_download = catch(
-            'None', lambda: str(newsplease.date_download))
-
-        """
-        :returns: The News Image URL
-        """
-        self.image_url = catch('None', lambda: newsplease.image_url)
-
-        """
-        :returns: The News filename
-        """
-        self.filename = catch('None', lambda: unquote(newsplease.filename))
-
-        """
-        :returns: The News title page
-        """
-        self.title_page = catch('None', lambda: newsplease.title_page)
-
-        """
-        :returns: The News title rss
-        """
-        self.title_rss = catch('None', lambda: newsplease.title_rss)
-
-        """
-        :returns: The News Language
-        """
-        self.language = catch('None', lambda: newsplease.language)
-
-        """
-        :returns: The News Publisher
-        """
-        self.publication = catch('None', lambda: article.meta_data['og']['site_name'] if article.meta_data['og']['site_name'] != None else publisher(
-            soup) if publisher(soup) != None else self.uri.split('/')[2] if self.uri.split('/')[2] != None else 'None')
-
-        """
-        :returns: The News Category
-        """
-        meta_check = any(word in 'section' or 'category' for word in list(
-            article.meta_data.keys()))
-        self.category = catch('None', lambda: article.meta_data['category'] if meta_check == True and article.meta_data['category'] != {} else article.meta_data['section'] if meta_check ==
-                              True and article.meta_data['section'] != {} else article.meta_data['article']['section'] if meta_check == True and article.meta_data['article']['section'] != {} else category(soup) if category(soup) != None else 'None')
-
-        """
-        :returns: headlines
-        """
-        self.headline = catch('None', lambda: unicode(article.title) if article.title != None else unicode(
-            newsplease.title) if newsplease.title != None else 'None')
-
-        """
-        :returns: keywords
-        """
-        self.keywords = catch('list', lambda: article.keywords)
-
-        """
-        :returns: summary
-        """
-        self.summary = catch('None', lambda: news_article(article.summary))
-
-        """
-        :returns: source domain
-        """
-        self.source_domain = catch('None', lambda: newsplease.source_domain)
-
-        """
-        :returns: description
-        """
-        self.description = catch('None', lambda: news_article(article.meta_description) if article.meta_description != '' else news_article(
-            article.meta_data['description']) if article.meta_data['description'] != {} else news_article(newsplease.description) if newsplease.description != None else None)
-
-        """
-        :returns: serializable_dict
-        """
-        self.get_dict = catch('dict', lambda: {'headline': self.headline,
-                                               'author': self.authors,
-                                               'date_publish': self.date_publish,
-                                               'date_modify': self.date_modify,
-                                               'date_download': self.date_download,
-                                               'language': self.language,
-                                               'image_url': self.image_url,
-                                               'filename': self.filename,
-                                               'description': self.description,
-                                               'publication': self.publication,
-                                               'category': self.category,
-                                               'source_domain': self.source_domain,
-                                               'article': self.article,
-                                               'summary': self.summary,
-                                               'keyword': self.keywords,
-                                               'title_page': self.title_page,
-                                               'title_rss': self.title_rss,
-                                               'url': self.uri})
+from newsfetch.newspaper_handler import ArticleHandler
+from newsfetch.news_please_handler import NewsPleaseHandler
+from newsfetch.soup_handler import SoupHandler
+
+class Newspaper:
+    """Class to scrape and extract information from a news article."""
+
+    def __init__(self, url: str) -> None:
+        """Initialize the Newspaper object with the given URL."""
+        self.url = url
+        self.__news_please = 
NewsPleaseHandler(url) + self.__article = ArticleHandler(url) + self.__soup = SoupHandler(url) + + # Validate initialization + self.__validate_initialization() + + # If article is available, download and parse it + if self.__article.is_valid(): + self.__article.download_and_parse() + + # Extract data attributes + self.headline = self.__extract_headline() + self.article = self.__extract_article() + self.authors = self.__extract_authors() + self.date_publish = self.__extract_date_publish() + self.date_modify = self.__extract_date_modify() + self.date_download = self.__extract_date_download() + self.image_url = self.__extract_image_url() + self.filename = self.__extract_filename() + self.title_page = self.__extract_title_page() + self.title_rss = self.__extract_title_rss() + self.language = self.__extract_language() + self.publication = self.__extract_publication() + self.category = self.__extract_category() + self.keywords = self.__extract_keywords() + self.summary = self.__extract_summary() + self.source_domain = self.__extract_source_domain() + self.source_favicon_url = self.__extract_source_favicon_url() + self.description = self.__extract_description() + + self.get_dict = self.__serialize() + + def __validate_initialization(self): + """Raise an error if no valid data is found.""" + if not (self.__news_please.is_valid() or self.__article.is_valid() or self.__soup.is_valid()): + raise ValueError("Sorry, the page you are looking for doesn't exist.") + + @staticmethod + def __extract(*sources): + """Generic method to extract the first valid value from provided sources.""" + for source in sources: + value = source # Accessing the property directly + if value: + return value + return None + + def __extract_authors(self): + """Extract the authors from the article or the news source.""" + return self.__extract(self.__news_please.authors, self.__article.authors, self.__soup.authors) + + def __extract_date_publish(self): + """Extract the publication date of the article.""" + return self.__extract(self.__news_please.date_publish, self.__article.date_publish, self.__soup.date_publish) + + def __extract_date_modify(self): + """Extract the modification date of the article.""" + return self.__news_please.date_modify + + def __extract_date_download(self): + """Extract the date the article was downloaded.""" + return self.__news_please.date_download + + def __extract_image_url(self): + """Extract the URL of the article's image.""" + return self.__news_please.image_url + + def __extract_filename(self): + """Extract the filename of the article.""" + return self.__news_please.filename + + def __extract_article(self): + """Extract the article content.""" + return self.__extract(self.__news_please.article, self.__article.article) + + def __extract_title_page(self): + """Extract the title of the article page.""" + return self.__news_please.title_page + + def __extract_title_rss(self): + """Extract the RSS title of the article.""" + return self.__news_please.title_rss + + def __extract_language(self): + """Extract the language of the article.""" + return self.__news_please.language + + def __extract_publication(self): + """Extract the publication name of the article.""" + return self.__extract(self.__article.publication, self.__soup.publisher) + + def __extract_category(self): + """Extract the category of the article.""" + return self.__extract(self.__article.category, self.__soup.category) + + def __extract_headline(self): + """Extract the headline of the article.""" + return 
self.__extract(self.__news_please.headline, self.__article.headline)
+
+    def __extract_keywords(self):
+        """Extract the keywords associated with the article."""
+        return self.__article.keywords or []
+
+    def __extract_summary(self):
+        """Extract the summary of the article."""
+        return self.__article.summary
+
+    def __extract_source_domain(self):
+        """Extract the source domain of the article."""
+        return self.__news_please.source_domain
+
+    def __extract_source_favicon_url(self):
+        """Extract the source favicon URL of the article."""
+        return self.__article.meta_favicon
+
+    def __extract_description(self):
+        """Extract the description of the article."""
+        return self.__extract(self.__news_please.summary, self.__article.summary)
+
+    def __serialize(self):
+        """Return a dictionary representation of the article's data."""
+        return {
+            "headline": self.headline,
+            "author": self.authors,
+            "date_publish": self.date_publish,
+            "date_modify": self.date_modify,
+            "date_download": self.date_download,
+            "language": self.language,
+            "image_url": self.image_url,
+            "filename": self.filename,
+            "description": self.description,
+            "publication": self.publication,
+            "category": self.category,
+            "source_domain": self.source_domain,
+            "source_favicon_url": self.source_favicon_url,
+            "article": self.article,
+            "summary": self.summary,
+            "keyword": self.keywords,
+            "title_page": self.title_page,
+            "title_rss": self.title_rss,
+            "url": self.url
+        }
diff --git a/newsfetch/news_please_handler.py b/newsfetch/news_please_handler.py
new file mode 100644
index 0000000..05c5123
--- /dev/null
+++ b/newsfetch/news_please_handler.py
@@ -0,0 +1,90 @@
+from urllib.parse import unquote
+
+from newsplease import NewsPlease
+from newsfetch.helpers import clean_text, unicode
+
+
+class NewsPleaseHandler:
+    """Handle interactions with the NewsPlease library."""
+
+    def __init__(self, url: str):
+        self.url = url
+        self.__news_please = self.__safe_execute(lambda: NewsPlease.from_url(self.url, timeout=6))
+
+    @staticmethod
+    def __safe_execute(func):
+        """Executes a function and returns None if it raises an exception."""
+        try:
+            return func()
+        except Exception:
+            # Optional: log the exception here
+            return None
+
+    def is_valid(self) -> bool:
+        """Check if the NewsPlease instance is valid."""
+        return self.__news_please is not None
+
+    @property
+    def authors(self) -> list:
+        """Return authors from NewsPlease instance."""
+        return self.__news_please.authors if self.is_valid() else []
+
+    @property
+    def date_publish(self) -> str:
+        """Return publication date from NewsPlease instance."""
+        return str(self.__news_please.date_publish) if self.is_valid() else None
+
+    @property
+    def date_modify(self) -> str:
+        """Return modification date from NewsPlease instance."""
+        return str(self.__news_please.date_modify) if self.is_valid() else None
+
+    @property
+    def date_download(self) -> str:
+        """Return download date from NewsPlease instance."""
+        return str(self.__news_please.date_download) if self.is_valid() else None
+
+    @property
+    def image_url(self) -> str:
+        """Return image URL from NewsPlease instance."""
+        return self.__news_please.image_url if self.is_valid() else None
+
+    @property
+    def filename(self) -> str:
+        """Return filename from NewsPlease instance."""
+        return unquote(self.__news_please.filename) if self.is_valid() else None
+
+    @property
+    def title_page(self) -> str:
+        """Return title page from NewsPlease instance."""
+        return self.__news_please.title_page if self.is_valid() else None
+
+    @property
+    def title_rss(self) -> str:
+        """Return RSS title from NewsPlease instance."""
+        return self.__news_please.title_rss if self.is_valid() else None
+
+    @property
+    def language(self) -> str:
+        """Return language from NewsPlease instance."""
+        return self.__news_please.language if self.is_valid() else None
+
+    @property
+    def summary(self) -> str:
+        """Return description from NewsPlease instance."""
+        return self.__news_please.description if self.is_valid() else None
+
+    @property
+    def article(self) -> str:
+        """Return cleaned article text from the NewsPlease instance."""
+        return unicode(clean_text(self.__news_please.maintext)) if self.is_valid() else None
+
+    @property
+    def source_domain(self) -> str:
+        """Return source domain from NewsPlease instance."""
+        return self.__news_please.source_domain if self.is_valid() else None
+
+    @property
+    def headline(self) -> str:
+        """Return headline from NewsPlease instance."""
+        return unicode(self.__news_please.title) if self.is_valid() else None
diff --git a/newsfetch/newspaper_handler.py b/newsfetch/newspaper_handler.py
new file mode 100644
index 0000000..13db798
--- /dev/null
+++ b/newsfetch/newspaper_handler.py
@@ -0,0 +1,108 @@
+from newspaper import Article
+
+from newsfetch.helpers import clean_text, extract_keywords, summarize_article, unicode
+
+
+class ArticleHandler:
+    """Handle interactions with the Article class."""
+
+    def __init__(self, url: str):
+        self.url = url
+        self.__article = self.__initialize_article()
+
+    def __initialize_article(self):
+        """Initialize the Article instance."""
+        return self.__safe_execute(lambda: Article(self.url))
+
+    @staticmethod
+    def __safe_execute(func):
+        """Executes a function and returns None if it raises an exception."""
+        try:
+            return func()
+        except Exception:
+            # You might want to log the exception here
+            return None
+
+    def is_valid(self):
+        """Check if the Article instance is valid."""
+        return self.__article is not None
+
+    def download_and_parse(self):
+        """Download and parse the article."""
+        if self.is_valid():
+            self.__safe_execute(self.__article.download)
+            self.__safe_execute(self.__article.parse)
+            self.__safe_execute(self.__article.nlp)
+
+    @property
+    def authors(self):
+        """Return authors from the Article instance."""
+        return self.__article.authors if self.is_valid() else []
+
+    @property
+    def date_publish(self):
+        """Return publication date from the Article instance."""
+        # Newspaper3k nests the published time under the "article" meta key
+        return self.__article.meta_data.get("article", {}).get("published_time") if self.is_valid() else None
+
+    @property
+    def keywords(self):
+        """Return keywords from the Article instance."""
+        if not self.is_valid():
+            return []
+
+        keywords = self.__process_keywords(self.__article.keywords)
+
+        if not keywords and self.article:
+            # Fall back to frequency-ranked keywords; keep the words only, not the counts
+            return [word for word, _ in extract_keywords(self.article)]
+
+        return keywords
+
+    @staticmethod
+    def __process_keywords(keywords, max_keywords=None):
+        """Process keywords to remove duplicates and limit the number."""
+        unique_keywords = list(set(keywords))
+        return unique_keywords[:max_keywords] if max_keywords is not None else unique_keywords
+
+    @property
+    def summary(self):
+        """Return summary from the Article instance."""
+        summary = self.__article.summary if self.is_valid() else None
+
+        if not summary:
+            article = self.article
+            return unicode(summarize_article(article)) if article else None
+
+        return unicode(summary)
+
+    @property
+    def article(self):
+        """Return cleaned article text from the Article instance."""
+        if self.is_valid():
+            return unicode(clean_text(self.__article.text))
+
+    @property
+    def publication(self):
+        """Return publication name from the Article instance."""
+        return 
self.__article.meta_data.get("og", {}).get("site_name") if self.is_valid() else None + + @property + def category(self): + """Return category from the Article instance.""" + if not self.is_valid(): + return None + + return (self.__article.meta_data.get("category") or + self.__article.meta_data.get("section") or + self.__article.meta_data.get("article", {}).get("section")) or None + + @property + def headline(self): + """Return title from the Article instance.""" + return unicode(self.__article.title) if self.is_valid() else None + + @property + def meta_favicon(self): + """Return meta favicon from the Article instance.""" + return self.__article.meta_favicon if self.is_valid() else None diff --git a/newsfetch/soup_handler.py b/newsfetch/soup_handler.py new file mode 100644 index 0000000..909f3c1 --- /dev/null +++ b/newsfetch/soup_handler.py @@ -0,0 +1,124 @@ +import json +from bs4 import BeautifulSoup +from requests import get + + +class SoupHandler: + """Handle interactions with BeautifulSoup for HTML parsing.""" + + def __init__(self, url: str): + """Initialize the SoupHandler with a given URL.""" + self.url = url + self.__soup = self.__safe_execute(lambda: BeautifulSoup(get(self.url).text, "lxml")) + + @staticmethod + def __safe_execute(func): + """Executes a function and returns None if it raises an exception.""" + try: + return func() + except Exception: + return None + + def is_valid(self) -> bool: + """Check if the BeautifulSoup instance is valid.""" + return self.__soup is not None + + def extract_metadata(self, metadata_type: str): + """Extract specified metadata from the HTML soup ("author", "date", "category", or "publisher").""" + if metadata_type not in ["author", "date", "category", "publisher"]: + raise ValueError("metadata_type must be 'author', 'date', 'category', or 'publisher'.") + + if not self.is_valid(): + return "N/A" # Return if the soup is not valid + + meta_elements = self.__soup.select("script[type='application/ld+json']") + for i in range(min(3, len(meta_elements))): # Limit to 3 attempts + try: + meta = json.loads(meta_elements[i].text) + result = self.__extract_meta(meta, metadata_type) + if result != "N/A": + return result # Return if found + except (json.JSONDecodeError, IndexError): + continue # Skip to the next element if there"s an error + + return "N/A" # Default return value if nothing is found + + @property + def authors(self): + """Extract author information from the HTML soup using JSON-LD data.""" + return self.extract_metadata("author") + + @property + def date_publish(self): + """Extract the publication date from the HTML soup using JSON-LD data.""" + return self.extract_metadata("date") + + @property + def category(self): + """Extract the category from the HTML soup using JSON-LD data.""" + return self.extract_metadata("category") + + @property + def publisher(self): + """Extract the publisher from the HTML soup using JSON-LD data.""" + return self.extract_metadata("publisher") + + def __extract_meta(self, meta, metadata_type): + """Extract specific metadata from the JSON-LD.""" + if metadata_type == "author": + return self.__extract_authors(meta) + elif metadata_type == "date": + return self.__extract_date(meta) + elif metadata_type == "category": + return self.__extract_category(meta) + elif metadata_type == "publisher": + return self.__extract_publisher(meta) + return "N/A" # Default if type doesn"t match + + @staticmethod + def __extract_authors(meta): + """Extract author information from the metadata.""" + authors = [] + if "author" in meta: 
+ if isinstance(meta["author"], list): + authors = [a.get("name") for a in meta["author"] if a.get("name")] + elif isinstance(meta["author"], dict): + authors = [meta["author"].get("name", "N/A")] + elif isinstance(meta["author"], str): + authors = [meta["author"]] + + return authors if authors else ["N/A"] # Return a list, default to "N/A" if empty + + @staticmethod + def __extract_date(meta): + """Extract the publication date from the metadata.""" + if "datePublished" in meta: + if isinstance(meta, dict): + return meta["datePublished"] + elif isinstance(meta, list) and meta: + return meta[0].get("datePublished", "N/A") + + return "N/A" # Return "N/A" if no date found + + @staticmethod + def __extract_category(meta): + """Extract the category from the metadata.""" + key = "@type" + if key in meta: + if isinstance(meta, dict): + return meta[key] + elif isinstance(meta, list) and meta: + return meta[0].get(key, "N/A") + + return "N/A" # Return "N/A" if no category found + + @staticmethod + def __extract_publisher(meta): + """Extract the publisher from the metadata.""" + if "publisher" in meta: + if isinstance(meta["publisher"], dict): + return meta["publisher"].get("name", "N/A") + elif isinstance(meta["publisher"], str): + return meta["publisher"] + + return "N/A" # Return "N/A" if no publisher found diff --git a/requirements.txt b/requirements.txt index fd2311c..b9579b4 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,9 +1,10 @@ -beautifulsoup4 -selenium -chromedriver-binary -pandas -pattern -fake_useragent -setuptools -twine -unidecode +setuptools==75.3.0 +pandas==2.2.3 +requests==2.32.3 +bs4==0.0.2 +beautifulsoup4==4.12.3 +newspaper3k==0.2.8 +news-please==1.6.13 +Unidecode==1.3.8 +selenium==4.26.1 +twine==5.1.1 \ No newline at end of file diff --git a/setup.py b/setup.py index 7450d7a..e66e42a 100644 --- a/setup.py +++ b/setup.py @@ -10,32 +10,39 @@ # Always prefer setuptools over distutils import setuptools -keywords = ['Newspaper', "news-fetch", "without-api", "google_scraper", 'news_scraper', 'bs4', 'lxml', 'news-crawler', - 'news-extractor', 'crawler', 'extractor', 'news', 'news-websites', 'elasticsearch', 'json', 'python', 'nlp', 'data-gathering', - 'news-archive', 'news-articles', 'commoncrawl', 'extract-articles', 'extract-information', 'news-scraper', 'spacy'] +keywords = [ + 'Newspaper3K', "news-fetch", "without-api", "google_scraper", 'news_scraper', + 'bs4', 'lxml', 'news-crawler', 'news-extractor', 'crawler', 'extractor', + 'news', 'news-websites', 'elasticsearch', 'json', 'python', 'nlp', + 'data-gathering', 'news-archive', 'news-articles', 'commoncrawl', + 'extract-articles', 'extract-information', 'news-scraper', 'spacy' +] setuptools.setup( name="news-fetch", - version="0.2.8", + version="0.2.9", author="M Santhosh Kumar", author_email="santhoshse7en@gmail.com", - description="news-fetch is an open source easy-to-use news extractor and basic nlp (cleaning_text, keywords, summary) comes handy that just works", + description="news-fetch is an open-source, easy-to-use news extractor with basic NLP features (cleaning text, keywords, summary) that just works.", long_description=open('README.md').read(), long_description_content_type="text/markdown", url="https://santhoshse7en.github.io/news-fetch/", keywords=keywords, - install_requires=['beautifulsoup4', 'pandas', 'selenium', 'news-please', 'newspaper3k', - 'fake_useragent', 'chromedriver-binary', 'unidecode', 'cchardet'], + install_requires=[ + 'beautifulsoup4', 'pandas', 'selenium', 'news-please', 'newspaper3k', 
+ 'fake_useragent', 'chromedriver-binary', 'unidecode', 'cchardet' + ], packages=setuptools.find_packages(), - classifiers=['Development Status :: 4 - Beta', - 'Intended Audience :: End Users/Desktop', - 'Intended Audience :: Developers', - 'Intended Audience :: System Administrators', - 'License :: OSI Approved :: MIT License', - 'Operating System :: OS Independent', - 'Programming Language :: Python', - 'Topic :: Communications :: Email', - 'Topic :: Office/Business', - 'Topic :: Software Development :: Bug Tracking', - ], + classifiers=[ + 'Development Status :: 4 - Beta', + 'Intended Audience :: End Users/Desktop', + 'Intended Audience :: Developers', + 'Intended Audience :: System Administrators', + 'License :: OSI Approved :: MIT License', + 'Operating System :: OS Independent', + 'Programming Language :: Python', + 'Topic :: Communications :: Email', + 'Topic :: Office/Business', + 'Topic :: Software Development :: Bug Tracking', + ], )
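
---

For reference, a minimal end-to-end sketch of how the two revamped entry points compose (this assumes the package is installed and a local Chrome is available for Selenium; the keyword and domain below are illustrative):

```python
from newsfetch.google import GoogleSearchNewsURLExtractor
from newsfetch.news import Newspaper

# Collect article URLs for a keyword from a single publisher's domain
google = GoogleSearchNewsURLExtractor(keyword='Alcoholics Anonymous',
                                      news_domain='https://timesofindia.indiatimes.com/')

# Scrape each hit into a structured dict (same shape as newsfetch/example/sample.json)
records = [Newspaper(url=url).get_dict for url in google.urls]
```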