Added Dexter Scrapers #70

Open
wants to merge 60 commits into
base: master

Francoisvt04

Dexter Crawlers Changelog from Assemble

1. The following crawlers have been added:
howwemadeitinafrica, savca, rhodesunimathewblog, worldstage, classicfm, afp, naijanews, dailytrustnp, newteleonline, thepoint, dailytimes, thenation, mediamaxnet, leadership, theinterview, rsaparliament, guardian, nationaldailyng, nta, acdivoca, thisdaylive, channelafrica, nan, nigeriatoday, businessdayonline, standardmediaktnnews, globaltimescn, nationalmirror, monitorke, newsverge, sundiatapost, agrilinks, businessdailyafrica, thebusinesspost, theguardianuk, independentng, thenerveafrica, amehnews, sunnewsonline, seedmagazine, hallmarknews, destinyconnect, economist, washingtonpost, amabhungane, africainvestor, outrepreneurs, cnbcafrica, planintl, bloomberg

2. In document_processor.py:
The new crawler classes were registered in the DocumentProcessor and DocumentProcessorNT classes.

3. In medium.py:
The Mediums for each of the crawlers were added under the create_defaults class method. A URL exception for mathewnyaungwa.blogspot.co.za was added under the is_tld_exception class method, and a sub_domain_exception_list was added in the for_url class method to handle blogspot.co.za.

4. In country.py:
Country codes for the newly added crawlers were added in the create_defaults class method.

5. I had to update the tld name list to include some of the newly added country codes.
These were the commands I ran to update the list:

  • from tld.utils import update_tld_names
  • update_tld_names()
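
For reviewers, here is a rough sketch of the idea behind the sub-domain exception in item 3. It is not the code in medium.py: the function name domain_key, the constant SUB_DOMAIN_EXCEPTIONS, and the naive "last two labels" fallback are all assumptions made for illustration. The only point it shows is that on a shared host such as blogspot.co.za the sub-domain identifies the publication, so the full host has to be kept when matching a URL to a Medium.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the sub_domain_exception_list described in item 3.
SUB_DOMAIN_EXCEPTIONS = ["blogspot.co.za"]


def domain_key(url):
    """Return the host string a Medium could be matched on (illustrative only)."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    for suffix in SUB_DOMAIN_EXCEPTIONS:
        # On a shared blog host the sub-domain names the publication,
        # so keep the whole host instead of collapsing it.
        if host.endswith("." + suffix):
            return host
    # Assumption: a naive "last two labels" fallback. The real code relies on
    # the tld package, which knows about multi-part suffixes such as co.za.
    return ".".join(host.split(".")[-2:])
```

With this sketch, a post URL on mathewnyaungwa.blogspot.co.za resolves to the full blog host, while an ordinary URL such as https://www.bloomberg.com/news resolves to bloomberg.com.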

@Francoisvt04
Author

Hey Matt, these are the new crawlers MMA asked for. Please review them along with the changelog notes I added and give feedback as needed.
