Added Dexter Scrapers #70

Open
wants to merge 60 commits into
base: master
b78dc47
Initial setup, had to make some fixes to fdi and seeds to make it run…
Feb 8, 2018
999b72d
Added savca crawler
Feb 9, 2018
5b66506
Added Rhodes University MathewYaungwaBlog crawler
Feb 9, 2018
ba0b277
Added crawler for worldstage
Feb 12, 2018
657f62a
Added crawlers for classicfm and afp
Feb 12, 2018
aa541d4
Added a crawler for naijanews
Feb 12, 2018
b2ed25b
Removed test code
Feb 12, 2018
1167ed6
Added dailytrustnp crawler
Feb 12, 2018
eebd480
Changed dailytrustnp crawler author logic
Feb 12, 2018
759e13c
Added newteleonline crawler and removed test code from dailytrust cra…
Feb 13, 2018
fa3d26c
Added a crawler for thepoint
Feb 13, 2018
656cff3
Added dailytimes crawler
Feb 13, 2018
205001f
Added thenation crawler
Feb 13, 2018
d7ec48a
Added mediamaxnet crawler and removed test code from thenation crawler
Feb 13, 2018
7bd8593
Added leadership crawler
Feb 13, 2018
b2e7b99
Added theinterview crawler
Feb 13, 2018
8ab1ed9
Added rsaparliament crawler
Feb 14, 2018
5fafdbd
Added guardian crawler and renamed interview and rsaparliament crawle…
Feb 14, 2018
55a93b3
Added nationaldailyng crawler
Feb 14, 2018
e59cd09
Added nta crawler
Feb 14, 2018
ff1e2ca
Removed test code
Feb 15, 2018
0949bcf
Added acdivoca crawler
Feb 15, 2018
aecc297
Added thisdaylive crawler
Feb 15, 2018
bf9be20
Added channelafrica crawler and removed test code from thisdaylive
Feb 15, 2018
bc24133
Added nan crawler
Feb 15, 2018
d1c550b
Added nigeriatoday crawler
Feb 15, 2018
65f212a
Updated classicfm crawler
Feb 15, 2018
6a3d897
Removed test code from classicfm crawler
Feb 15, 2018
8911a02
Added businessdayonline crawler
Feb 15, 2018
c83fb12
Removed test code from businessdayonline crawler
Feb 15, 2018
06d9b11
Added standardmediaktnnews crawler
Feb 16, 2018
d8c143d
Added globaltimescn crawler
Feb 16, 2018
c3d34c6
Changed logic on rsaparliament crawler logic for dates and authors
Feb 16, 2018
5f65807
Added nationalmirror crawler
Feb 19, 2018
daf5ca6
Added monitorke crawler
Feb 19, 2018
254967f
Added newsverge crawler
Feb 19, 2018
a7471c9
Added sundiatapost crawler
Feb 19, 2018
21ca760
Removed test code from sundiatapost crawler
Feb 19, 2018
205b021
Added agrilinks crawler
Feb 19, 2018
f1f0365
Removed test code from agrilinks crawler
Feb 19, 2018
8245a2b
Added businessdailyafrica crawler
Feb 19, 2018
45cf819
Added thebusinesspost crawler and removed test code from businessdail…
Feb 20, 2018
59f58d8
Added theguardianuk crawler
Feb 20, 2018
16fd749
Added independentng crawler and removed test code from theguardianuk …
Feb 20, 2018
8238dcb
Removed test code from independentng crawler
Feb 20, 2018
7db1d13
Added thenerveafrica crawler
Feb 20, 2018
c09fe69
Added amehnews crawler
Feb 20, 2018
7175bef
Added sunnewsonline crawler
Feb 20, 2018
cc3d63d
Added seedmagazine crawler
Feb 20, 2018
71626d8
Added hallmarksnews crawler
Feb 20, 2018
01fe8c4
Added destinyconnect crawler
Feb 20, 2018
a3a7b0e
Added economist crawler
Feb 20, 2018
bb20978
Added washingtonpost crawler
Feb 20, 2018
29664a1
Added amabhungane crawler
Feb 20, 2018
1e02f08
Added africainvestor crawler
Feb 21, 2018
c834e16
Added outrepreneurs crawler
Feb 21, 2018
405eb72
Added cbncafrica crawler
Feb 21, 2018
3d6767f
Added planintl crawler
Feb 21, 2018
1fd4849
Added bloomberg crawler
Feb 21, 2018
0ffebbc
Changed document text and summary logic in nationaldailyng crawler
Feb 22, 2018
283 changes: 283 additions & 0 deletions dexter.core
@@ -0,0 +1,283 @@
%!PS-Adobe-3.0
%%Creator: (ImageMagick)
%%Title: (dexter.core)
%%CreationDate: (2018-02-13T12:57:01+02:00)
%%BoundingBox: 2462 297 2496 298
%%HiResBoundingBox: 2462 297 2496 298
%%DocumentData: Clean7Bit
%%LanguageLevel: 1
%%Orientation: Portrait
%%PageOrder: Ascend
%%Pages: 1
%%EndComments

%%BeginDefaults
%%EndDefaults

%%BeginProlog
%
% Display a color image. The image is displayed in color on
% Postscript viewers or printers that support color, otherwise
% it is displayed as grayscale.
%
/DirectClassPacket
{
%
% Get a DirectClass packet.
%
% Parameters:
% red.
% green.
% blue.
% length: number of pixels minus one of this color (optional).
%
currentfile color_packet readhexstring pop pop
compression 0 eq
{
/number_pixels 3 def
}
{
currentfile byte readhexstring pop 0 get
/number_pixels exch 1 add 3 mul def
} ifelse
0 3 number_pixels 1 sub
{
pixels exch color_packet putinterval
} for
pixels 0 number_pixels getinterval
} bind def

/DirectClassImage
{
%
% Display a DirectClass image.
%
systemdict /colorimage known
{
columns rows 8
[
columns 0 0
rows neg 0 rows
]
{ DirectClassPacket } false 3 colorimage
}
{
%
% No colorimage operator; convert to grayscale.
%
columns rows 8
[
columns 0 0
rows neg 0 rows
]
{ GrayDirectClassPacket } image
} ifelse
} bind def

/GrayDirectClassPacket
{
%
% Get a DirectClass packet; convert to grayscale.
%
% Parameters:
% red
% green
% blue
% length: number of pixels minus one of this color (optional).
%
currentfile color_packet readhexstring pop pop
color_packet 0 get 0.299 mul
color_packet 1 get 0.587 mul add
color_packet 2 get 0.114 mul add
cvi
/gray_packet exch def
compression 0 eq
{
/number_pixels 1 def
}
{
currentfile byte readhexstring pop 0 get
/number_pixels exch 1 add def
} ifelse
0 1 number_pixels 1 sub
{
pixels exch gray_packet put
} for
pixels 0 number_pixels getinterval
} bind def

/GrayPseudoClassPacket
{
%
% Get a PseudoClass packet; convert to grayscale.
%
% Parameters:
% index: index into the colormap.
% length: number of pixels minus one of this color (optional).
%
currentfile byte readhexstring pop 0 get
/offset exch 3 mul def
/color_packet colormap offset 3 getinterval def
color_packet 0 get 0.299 mul
color_packet 1 get 0.587 mul add
color_packet 2 get 0.114 mul add
cvi
/gray_packet exch def
compression 0 eq
{
/number_pixels 1 def
}
{
currentfile byte readhexstring pop 0 get
/number_pixels exch 1 add def
} ifelse
0 1 number_pixels 1 sub
{
pixels exch gray_packet put
} for
pixels 0 number_pixels getinterval
} bind def

/PseudoClassPacket
{
%
% Get a PseudoClass packet.
%
% Parameters:
% index: index into the colormap.
% length: number of pixels minus one of this color (optional).
%
currentfile byte readhexstring pop 0 get
/offset exch 3 mul def
/color_packet colormap offset 3 getinterval def
compression 0 eq
{
/number_pixels 3 def
}
{
currentfile byte readhexstring pop 0 get
/number_pixels exch 1 add 3 mul def
} ifelse
0 3 number_pixels 1 sub
{
pixels exch color_packet putinterval
} for
pixels 0 number_pixels getinterval
} bind def

/PseudoClassImage
{
%
% Display a PseudoClass image.
%
% Parameters:
% class: 0-PseudoClass or 1-Grayscale.
%
currentfile buffer readline pop
token pop /class exch def pop
class 0 gt
{
currentfile buffer readline pop
token pop /depth exch def pop
/grays columns 8 add depth sub depth mul 8 idiv string def
columns rows depth
[
columns 0 0
rows neg 0 rows
]
{ currentfile grays readhexstring pop } image
}
{
%
% Parameters:
% colors: number of colors in the colormap.
% colormap: red, green, blue color packets.
%
currentfile buffer readline pop
token pop /colors exch def pop
/colors colors 3 mul def
/colormap colors string def
currentfile colormap readhexstring pop pop
systemdict /colorimage known
{
columns rows 8
[
columns 0 0
rows neg 0 rows
]
{ PseudoClassPacket } false 3 colorimage
}
{
%
% No colorimage operator; convert to grayscale.
%
columns rows 8
[
columns 0 0
rows neg 0 rows
]
{ GrayPseudoClassPacket } image
} ifelse
} ifelse
} bind def

/DisplayImage
{
%
% Display a DirectClass or PseudoClass image.
%
% Parameters:
% x & y translation.
% x & y scale.
% label pointsize.
% image label.
% image columns & rows.
% class: 0-DirectClass or 1-PseudoClass.
% compression: 0-none or 1-RunlengthEncoded.
% hex color packets.
%
gsave
/buffer 512 string def
/byte 1 string def
/color_packet 3 string def
/pixels 768 string def

currentfile buffer readline pop
token pop /x exch def
token pop /y exch def pop
x y translate
currentfile buffer readline pop
token pop /x exch def
token pop /y exch def pop
currentfile buffer readline pop
token pop /pointsize exch def pop
/Times-Roman findfont pointsize scalefont setfont
x y scale
currentfile buffer readline pop
token pop /columns exch def
token pop /rows exch def pop
currentfile buffer readline pop
token pop /class exch def pop
currentfile buffer readline pop
token pop /compression exch def pop
class 0 gt { PseudoClassImage } { DirectClassImage } ifelse
grestore
showpage
} bind def
%%EndProlog
%%Page: 1 1
%%PageBoundingBox: 2462 297 2496 298
DisplayImage
2462 297
34 1
12
34 1
0
0
B2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BB
B2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BB
B2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BBB2B4BB

%%PageTrailer
%%Trailer
%%EOF
4 changes: 4 additions & 0 deletions dexter/models/country.py
@@ -50,6 +50,10 @@ def create_defaults(cls):
Germany|de
United Kingdom (Great Britain)|gb
Kenya|ke
Nigeria|ng
France|fr
United States of America|us
China|cn
"""

countries = []
4 changes: 2 additions & 2 deletions dexter/models/fdi.py
@@ -775,7 +775,7 @@ class Involvements2(db.Model):
__tablename__ = "involvements2"

id = Column(Integer, primary_key=True)
name = Column(String(50), index=True, nullable=False, unique=True)
name = Column(String(128), index=True, nullable=False, unique=True)

def __repr__(self):
return "<Involvements2='%s'>" % (self.name)
@@ -878,7 +878,7 @@ class Involvements3(db.Model):
__tablename__ = "involvements3"

id = Column(Integer, primary_key=True)
name = Column(String(50), index=True, nullable=False, unique=True)
name = Column(String(128), index=True, nullable=False, unique=True)

def __repr__(self):
return "<Involvements3='%s'>" % (self.name)
59 changes: 56 additions & 3 deletions dexter/models/medium.py
@@ -40,7 +40,8 @@ def is_tld_exception(cls, url):
"""
url_exceptions = [
'thecitizen.co.tz',
'dailynews.co.tz'
'dailynews.co.tz',
'mathewnyaungwa.blogspot.co.za'
]
for ex in url_exceptions:
# check if it exists in the url add buffer for [https://www.] characters at start
@@ -51,10 +52,12 @@ def is_tld_exception(cls, url):

@classmethod
def for_url(cls, url):
sub_domain_exception_list = [
'blogspot.co.za'
]
domain = get_tld(url, fail_silently=True)
# fail silently

if domain is None:
if domain is None or domain in sub_domain_exception_list:
domain = cls.is_tld_exception(url)

if domain is None:
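
The hunk above routes URLs on shared hosting domains such as `blogspot.co.za` through the exception list instead of trusting the registered domain. A minimal Python 3 sketch of that fallback follows, using only the standard library. The names `match_url_exception` and `resolve_domain` are hypothetical, the `registered_domain` argument stands in for whatever `get_tld(url, fail_silently=True)` returns, and the host matching is a simplification rather than the project's actual `is_tld_exception` logic.

```python
from urllib.parse import urlparse

# Hosts that should be matched as a whole (copied from the diff above).
URL_EXCEPTIONS = [
    'thecitizen.co.tz',
    'dailynews.co.tz',
    'mathewnyaungwa.blogspot.co.za',
]

# Registered domains shared by many unrelated sites (copied from the diff above).
SUB_DOMAIN_EXCEPTIONS = ['blogspot.co.za']


def match_url_exception(url):
    """Return the exception host the URL belongs to, or None (hypothetical helper)."""
    host = urlparse(url).netloc.lower()
    for ex in URL_EXCEPTIONS:
        # Accept the bare host or any prefix such as "www."
        if host == ex or host.endswith('.' + ex):
            return ex
    return None


def resolve_domain(url, registered_domain):
    """Pick the domain used to look up a medium.

    registered_domain is an assumption: it represents the value the real code
    gets from get_tld(url, fail_silently=True), which may be None.
    """
    if registered_domain is None or registered_domain in SUB_DOMAIN_EXCEPTIONS:
        # Either the lookup failed or the domain is shared hosting,
        # so fall back to the explicit exception list.
        return match_url_exception(url)
    return registered_domain
```

With this shape, a blog hosted on `blogspot.co.za` resolves to its full host name only when it is listed explicitly; any other blog on the same platform resolves to `None` rather than being attributed to the wrong medium.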
@@ -175,6 +178,56 @@ def create_defaults(cls):
The East African|online|theeastafrican.co.ke||ke
Daily News (Tanzania)|online|dailynews.co.tz||tz
Daily News (Zimbabwe)|online|dailynews.co.zw||tz
SAVCA|online|savca.co.za||za
How We Made It In Africa|online|howwemadeitinafrica.com||za
Rhodes University (MathewYaungwaBlog)|online|mathewnyaungwa.blogspot.co.za||za
World Stage|online|worldstagegroup.com||ng
Classic FM|online|classic97.net||ng
Agence France Presse|online|afp.com||fr
Naija News Agency|online|naijanewsagency.com||ng
Daily Trust Newspaper|online|dailytrust.com.ng||ng
Daily Telegraph New Telegraph Online|online|newtelegraphonline.com||ng
The Point|online|thepointng.com||ng
The Daily Times|online|dailytimes.ng||ng
The Nation Online|online|thenationonlineng.net||ng
Media Max Network|online|mediamaxnetwork.co.ke||ke
Leadership|online|leadership.ng||ng
The Interview|online|theinterview.com.ng||ng
RSA Parliament|online|parliament.gov.za||za
Guardian|online|guardian.ng||ng
National Daily Nigeria|online|nationaldailyng.com||ng
Nigerian Television Authority|online|nta.ng||ng
ACDIVOCA|online|acdivoca.org||us
This Day Live|online|thisdaylive.com||ng
Channel Africa|online|channelafrica.co.za||za
News Agency Of Nigeria|online|nan.ng||ng
Nigeria Today|online|nigeriatoday.ng||ng
Business Day Online|online|businessdayonline.com||ng
Standard Media KTN News|online|standardmedia.co.ke/ktnnews||ke
Global Times China|online|globaltimes.cn||cn
National Mirror|online|nationalmirroronline.net||ng
Monitor Kenya|online|monitor.co.ke||ke
Newsverge|online|newsverge.com||ng
Sundiata Post|online|sundiatapost.com||ng
Agrilinks|online|agrilinks.org||us
Business Daily Africa|online|businessdailyafrica.com||ke
The Business Post|online|thebusinesspost.ng||ng
The Guardian UK|online|theguardian.com||gb
Independent NG|online|independent.ng||ng
The Nerve Africa|online|thenerveafrica.com||ng
Ameh News|online|amehnews.com||ng
Sun News Online|online|sunnewsonline.com||ng
Seed Magazine|online|seedmagazine.co.ke||ke
Business Hallmark News|online|hallmarknews.com||ng
Destiny Connect|online|destinyconnect.com||za
The Economist|online|economist.com||us
Washington Post|online|washingtonpost.com||us
Ama Bhungane|online|amabhungane.co.za||za
Africa Investor|online|africainvestor.com||za
Outrepreneurs|online|outrepreneurs.com||ng
CNBC Africa|online|cnbcafrica.com||za
Plan International|online|plan-international.org||gb
Bloomberg|online|bloomberg.com||za
"""

mediums = []
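
The seed block above uses one pipe-delimited line per medium, and each new line ends in a country code, which is presumably why `country.py` gains `ng`, `fr`, `us` and `cn` in the same pull request. The sketch below shows one way such lines could be parsed and cross-checked against the known country codes. It is an illustration under assumptions, not the project's `create_defaults` implementation: the function names are hypothetical, and the fourth field is called `extra` because the diff never shows it populated.

```python
from collections import namedtuple

# Field names are assumptions; the diff only shows the raw pipe-delimited lines.
MediumSeed = namedtuple('MediumSeed', 'name medium_type domain extra country')


def parse_medium_seeds(text):
    """Parse 'name|type|domain|extra|country' lines into MediumSeed tuples."""
    seeds = []
    for line in text.strip().split('\n'):
        line = line.strip()
        if not line:
            continue
        parts = [p.strip() for p in line.split('|')]
        if len(parts) != 5:
            raise ValueError('expected 5 fields, got %d: %r' % (len(parts), line))
        seeds.append(MediumSeed(*parts))
    return seeds


def unknown_countries(seeds, known_codes):
    """Return the sorted country codes used by seeds but missing from known_codes."""
    return sorted({s.country for s in seeds} - set(known_codes))
```

Running `unknown_countries` over the parsed seeds with the codes from `country.py` would flag any medium whose country has not been seeded yet.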