Modularize main script for scraping different years websites #16

josix · 2021-08-24T16:07:53Z

The script main.py is used for scraping the PyConTW 2016-2020 websites. Currently it contains lots of global variables and shared functions, which makes it hard to maintain and develop, and hard to keep flexible enough to scrape future years of the official website. It would be great if we could give this script more structure, for example by separating each year's parsing details into its own handler or module and keeping the common crawling process in a base class. Any ideas about this enhancement are welcome.

Darkborderman · 2021-10-23T16:38:28Z

Hello @josix, I have some suggestions about this issue.

Select the scraper for each year using a dict, and/or split each year's scraping logic into its own function.

Pseudo code:

try:
    SCRAP_YEAR[year]()
except KeyError:
    logger.error("Specified year does not exist!")
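A minimal runnable sketch of the dict-based dispatch above. The handler names `scrape_2018` and `scrape_2019`, their return values, and the `run` wrapper are hypothetical placeholders for illustration, not code from the repository:

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger(__name__)


def scrape_2018() -> str:
    # Hypothetical placeholder for the real 2018 scraping logic.
    return "scraped 2018"


def scrape_2019() -> str:
    # Hypothetical placeholder for the real 2019 scraping logic.
    return "scraped 2019"


# Maps each supported year to the function that scrapes it.
SCRAP_YEAR: Dict[int, Callable[[], str]] = {
    2018: scrape_2018,
    2019: scrape_2019,
}


def run(year: int) -> Optional[str]:
    """Run the scraper registered for `year`, or log an error if none exists."""
    try:
        handler = SCRAP_YEAR[year]
    except KeyError:
        logger.error("Specified year does not exist: %s", year)
        return None
    return handler()
```

Only the lookup sits inside the `try` block here, so a `KeyError` raised inside a handler is not mistaken for an unknown year.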

Pseudo structure:

/common/    <- modules shared between scrapers
   /scrap.py  <- web scraping functions like `getcssimg`
   /dataio.py <- file IO functions like `writefile`
/websites/
   /year2018.py <- Scraping function for 2018
   /year2019.py <- Scraping function for 2019
   /utilities.py <- Scraping handler or base class, etc.
/main.py <- Main entrypoint

Encapsulate some checks in helper functions (e.g. turn path[6:8] == "zh" into get_lang(str)).
Use the typing library and add function documentation.
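As an illustration of the two points above, here is one possible typed and documented `get_lang`. It assumes paths shaped like `/<year>/<locale>/...` (for example `/2020/zh-hant/`), which is what `path[6:8] == "zh"` implies for a four-digit year; the actual path layout and the `"en"` fallback are assumptions:

```python
def get_lang(path: str) -> str:
    """Return "zh" if the locale segment of `path` is Chinese, else "en".

    Assumes the path looks like "/<year>/<locale>/...". Splitting on "/"
    avoids depending on the fixed character offsets in path[6:8].
    """
    segments = [segment for segment in path.split("/") if segment]
    if len(segments) >= 2 and segments[1].startswith("zh"):
        return "zh"
    return "en"
```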

As for global variables like PYCON_URL, maybe pass them in as parameters or through the base class constructor.
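A rough sketch of the base-class option, with the site URL passed to the constructor instead of read from a global. The names `BaseScraper`, `Year2019Scraper`, `start_paths`, and `crawl_plan`, as well as the `example.org` URL and the locale paths, are all hypothetical:

```python
from abc import ABC, abstractmethod
from typing import List


class BaseScraper(ABC):
    """Holds the crawling steps shared by every year's scraper."""

    def __init__(self, base_url: str, year: int) -> None:
        # The URL arrives as a parameter, replacing a global like PYCON_URL.
        self.base_url = base_url.rstrip("/")
        self.year = year

    def page_url(self, path: str) -> str:
        """Build the absolute URL of a page for this scraper's year."""
        return f"{self.base_url}/{self.year}/{path.lstrip('/')}"

    @abstractmethod
    def start_paths(self) -> List[str]:
        """Return the year-specific entry paths to crawl."""

    def crawl_plan(self) -> List[str]:
        """Return the absolute URLs this scraper would start from."""
        return [self.page_url(path) for path in self.start_paths()]


class Year2019Scraper(BaseScraper):
    def start_paths(self) -> List[str]:
        # Hypothetical entry points; the real ones depend on the site.
        return ["zh-hant/", "en-us/"]
```

For example, `Year2019Scraper("https://example.org/", 2019).crawl_plan()` yields the two 2019 start URLs without touching any module-level state.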

Darkborderman · 2021-10-23T16:39:47Z

I'd like to pick up the issue if you're willing to assign it to me. Thanks!

josix · 2021-10-24T05:55:42Z

It looks GREAT! I would appreciate it if you could help with this.

josix · 2021-10-24T06:18:01Z

I guess we also need to reformat the code or add a linter so that the scripts in this project conform to PEP8. Introducing formatting tools like black and isort, or a linter like pylint, would be helpful.

Darkborderman · 2021-10-24T12:58:14Z

> I guess we also need to reformat the code or add a linter so that the scripts in this project conform to PEP8. Introducing formatting tools like black and isort, or a linter like pylint, would be helpful.

Sure. I'll add those packages and resolve this issue together with #22.

Lee-W · 2021-10-24T13:47:47Z

If you're interested in following the convention from mail-handler, maybe you can give https://github.com/Lee-W/cookiecutter-python-template a try. It comes with all the tools you mention.

Darkborderman · 2021-10-24T16:36:38Z

> If you're interested in following the convention from mail-handler, maybe you can give https://github.com/Lee-W/cookiecutter-python-template a try. It comes with all the tools you mention.

Thanks! I'll adopt some coding style and testing packages from it.

Lee-W · 2021-10-25T00:58:28Z

In addition to that, please try cruft instead of using cookiecutter directly. It's a tool that makes it easier to pull in updates from the template.

josix · 2021-10-25T18:22:22Z

I think we could keep this issue simple and limit it to the modularity of the codebase. I'll create two more issues: one for improving the coding style by introducing a linter/formatter, and one for adding more tests to check the reliability of the code.

Darkborderman · 2021-10-26T02:27:31Z

> I think we could keep this issue simple and limit it to the modularity of the codebase. I'll create two more issues: one for improving the coding style by introducing a linter/formatter, and one for adding more tests to check the reliability of the code.

+1 for this; this issue can be split into smaller issues.

josix added the enhancement label Aug 24, 2021

josix self-assigned this Aug 26, 2021

josix assigned Darkborderman and unassigned josix Oct 24, 2021

josix added this to the Enhance Code Quality milestone Oct 25, 2021

josix changed the title ~~Refactor main script for scraping different years websites~~ Modularize main script for scraping different years websites Oct 25, 2021

Darkborderman mentioned this issue Nov 3, 2021

Refactor crawler function according to issue #16 #36

Merged

5 tasks

josix closed this as completed in #36 Nov 8, 2021

josix added a commit that referenced this issue Nov 8, 2021

Merge pull request #36 from Darkborderman/refactor/modulize

824cc09

Refactor crawler function according to issue #16
