This repository has been archived by the owner on Nov 18, 2022. It is now read-only.

ShawonAshraf/karwanbazaar


karwanbazaar

a Scrapy-based crawler that collects articles from দৈনিক মতিকণ্ঠ, a satirical news site in Bangla

running locally

git clone https://github.com/ShawonAshraf/karwanbazaar.git
cd karwanbazaar

# conda env
conda env create -f ghochu.yml
conda activate karwanbazaar

# run
python main.py

spiders

the pipeline has 4 spiders, which must run sequentially because each spider depends on the output of the ones before it.

karwanbazaar/spiders
├── archives.py
├── article_content.py
├── article_urls.py
└── index.py

running order of the spiders:

{
    0: index,
    1: archives,
    2: article_urls,
    3: article_content
}

all the spiders generate output files (jsonl, txt, html) which are saved in the output directory.
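
the running order above can be sketched roughly as follows. this is not the repository's `main.py`, only an illustration under stated assumptions: the spider names passed to `scrapy crawl` are assumed to match the file names, and `crawl` and `run_in_order` are hypothetical helpers.

```python
import subprocess
from typing import Callable, List, Sequence

# Running order from the README. The names are assumed to match each
# spider's `name` attribute, which the README does not state.
SPIDER_ORDER = ("index", "archives", "article_urls", "article_content")


def crawl(spider: str) -> int:
    """Run one spider with the `scrapy crawl` command and return its exit code."""
    return subprocess.run(["scrapy", "crawl", spider]).returncode


def run_in_order(spiders: Sequence[str] = SPIDER_ORDER,
                 run: Callable[[str], int] = crawl) -> List[str]:
    """Run spiders one at a time, stopping at the first nonzero exit code.

    Returns the names of the spiders that finished before any failure.
    """
    finished = []
    for spider in spiders:
        if run(spider) != 0:
            break  # later spiders depend on this one's output
        finished.append(spider)
    return finished
```

calling `run_in_order()` from the project root would start the four spiders one after another. the `run` parameter exists only so the ordering logic can be exercised without launching a real crawl.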

final output

articles.jsonl contains the final output with all the posts in jsonl format.

each line of the articles.jsonl file is one JSON object with the following fields (pretty-printed here for readability):

{
  "article_id": "article id",
  "title": "title of the article", 
  "content": "content of the article"
}
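
a minimal sketch of reading the final output back in, assuming the three fields shown above and UTF-8 encoding. `read_articles` is a hypothetical helper, not part of the repository, and the exact path of articles.jsonl inside the output directory is an assumption.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def read_articles(path: Path) -> Iterator[Dict[str, str]]:
    """Yield one article dict per non-empty line of a jsonl file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

for example, `for article in read_articles(Path("output/articles.jsonl")): print(article["article_id"], article["title"])` would list every article id and title, assuming the file sits at that location.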
