Modularize main script for scraping different years websites #16

Closed
josix opened this issue Aug 24, 2021 · 10 comments · Fixed by #36

@josix
Contributor

josix commented Aug 24, 2021

The script main.py is used for scraping the PyConTW 2016-2020 websites. It currently contains many global variables and shared functions, which makes it hard to maintain and hard to adapt for scraping future editions of the official website. It would be great to give the script more structure, for example by separating each year's parsing details into its own handler or module and keeping the common crawling process in a base class. Any ideas about this enhancement are welcome.
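As a rough illustration of the "base class plus per-year handler" idea, here is a minimal sketch. The names `BaseCrawler`, `Crawler2019`, `page_url`, `parse`, and `crawl` are hypothetical and do not come from main.py, `https://example.org` is a placeholder URL, and pages are passed in as strings so the sketch runs without network access.

```python
from abc import ABC, abstractmethod
from typing import Dict


class BaseCrawler(ABC):
    """Shared crawling flow; subclasses supply only the year-specific parsing."""

    def __init__(self, base_url: str, year: int) -> None:
        # The base URL is passed in instead of being read from a global variable.
        self.base_url = base_url.rstrip("/")
        self.year = year

    def page_url(self, path: str) -> str:
        """Build the absolute URL of a page for this crawler's year."""
        return f"{self.base_url}/{self.year}/{path.lstrip('/')}"

    @abstractmethod
    def parse(self, html: str) -> str:
        """Year-specific processing of a single page."""

    def crawl(self, pages: Dict[str, str]) -> Dict[str, str]:
        """Common flow: map each page URL to its parsed content.

        `pages` (path -> raw HTML) stands in for real HTTP responses.
        """
        return {self.page_url(path): self.parse(html) for path, html in pages.items()}


class Crawler2019(BaseCrawler):
    """Hypothetical handler for one year; only `parse` differs per year."""

    def __init__(self, base_url: str) -> None:
        super().__init__(base_url, 2019)

    def parse(self, html: str) -> str:
        # Placeholder logic: a real handler would rewrite links, assets, etc.
        return html.strip()
```

With this shape, supporting a new year would mean adding one subclass rather than touching shared state.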

@josix josix added the enhancement New feature or request label Aug 24, 2021
@josix josix self-assigned this Aug 26, 2021
@Darkborderman
Member

Darkborderman commented Oct 23, 2021

Hello @josix, I have some suggestions about this issue.

  • Select the scraper for each year using a dict, and/or split each year's scraping into its own function

Pseudo code:

try:
    SCRAP_YEAR[year]()
except KeyError:
    logger.error("Specified year does not exist!")

Pseudo structure:

/common/    <- modules shared between scrapers
   /scrap.py  <- web scraping functions like `getcssimg`
   /dataio.py <- file IO functions like `writefile`
/websites/
   /year2018.py <- Scraping function for 2018
   /year2019.py <- Scraping function for 2019
   /utilities.py <- Scraping handler or base class, etc.
/main.py <- Main entry point
  • Encapsulate inline checks in helper functions (for example, path[6:8] == "zh" becomes get_lang(str))

  • Use the typing library and add function documentation

As for global variables like PYCON_URL, maybe pass them in as parameters or through the base class constructor.
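To make the suggestions above more concrete, here is a minimal runnable sketch of the dict-based dispatch and the `get_lang` helper. `SCRAP_YEAR`, `get_lang`, and the error message come from the pseudo code; `scrape_2018`, `scrape_2019`, `run`, the `"en"` fallback, and the placeholder scraper bodies are assumptions made for illustration.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger(__name__)


def get_lang(path: str) -> str:
    """Return "zh" for paths like "/2019/zh-hant/...", otherwise "en".

    The "en" fallback is an assumption; the original check only tests for "zh".
    """
    return "zh" if path[6:8] == "zh" else "en"


def scrape_2018(base_url: str) -> str:
    # Placeholder body: a real scraper would fetch and store pages here.
    return f"{base_url}/2018/"


def scrape_2019(base_url: str) -> str:
    # Placeholder body, same as above.
    return f"{base_url}/2019/"


# Maps each supported year to its scraping function.
SCRAP_YEAR: Dict[int, Callable[[str], str]] = {
    2018: scrape_2018,
    2019: scrape_2019,
}


def run(year: int, base_url: str) -> bool:
    """Dispatch to the scraper for `year`; return False if none is registered."""
    scraper = SCRAP_YEAR.get(year)
    if scraper is None:
        logger.error("Specified year does not exist!")
        return False
    # The base URL is passed as a parameter rather than read from a global.
    scraper(base_url)
    return True
```

Looking the scraper up before calling it, rather than wrapping the call in `try`/`except KeyError`, keeps a `KeyError` raised inside a scraper from being misreported as a missing year.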

@Darkborderman
Member

I'd like to pick up the issue if you're willing to assign it to me. Thanks!

@josix
Contributor Author

josix commented Oct 24, 2021

It looks GREAT! I would appreciate it if you could help with this.

@josix josix assigned Darkborderman and unassigned josix Oct 24, 2021
@josix
Contributor Author

josix commented Oct 24, 2021

I guess we also need to reformat the code or add a linter so that the scripts in the project conform to PEP 8. Introducing formatting tools like black and isort, or a linter like pylint, would be helpful.

@Darkborderman
Member

Darkborderman commented Oct 24, 2021

> I guess we also need to reformat the code or add a linter so that the scripts in the project conform to PEP 8. Introducing formatting tools like black and isort, or a linter like pylint, would be helpful.

Sure. I'll add those packages and resolve this issue together with #22.

@Lee-W
Member

Lee-W commented Oct 24, 2021

If you're interested in following the convention from mail-handler, maybe you can give https://github.com/Lee-W/cookiecutter-python-template a try. It comes with all the tools you mention.

@Darkborderman
Member

> If you're interested in following the convention from mail-handler, maybe you can give https://github.com/Lee-W/cookiecutter-python-template a try. It comes with all the tools you mention.

Thanks! I'll adopt some of its coding style and testing packages.

@Lee-W
Member

Lee-W commented Oct 25, 2021

In addition to that, please try cruft instead of using cookiecutter directly. It's a tool that makes it easier to pull updates from the template.

@josix josix added this to the Enhance Code Quality milestone Oct 25, 2021
@josix josix changed the title Refactor main script for scraping different years websites Modularize main script for scraping different years websites Oct 25, 2021
@josix
Contributor Author

josix commented Oct 25, 2021

I think we could keep this issue simple and limit it to the modularity of the codebase. I'll create two more issues: one for improving coding style by introducing a linter/formatter, and one for adding more tests to check the reliability of the code.

@Darkborderman
Member

> I think we could keep this issue simple and limit it to the modularity of the codebase. I'll create two more issues: one for improving coding style by introducing a linter/formatter, and one for adding more tests to check the reliability of the code.

+1 for this. This issue can be split into smaller issues.

@josix josix closed this as completed in #36 Nov 8, 2021
josix added a commit that referenced this issue Nov 8, 2021
Refactor crawler function according to issue #16