README.md

This is the README file for lihkg-keyword-scrapper by @christinesfkao.

Adapted from: Ho, J.C. & Or, N.H.K. (2020). LIHKGr: An application for scraping LIHKG.

Last updated: Nov 2020

Synopsis

python3 lihkg-scrapper.py [keyword]
Rscript lihkg.R [keyword]

LIHKG, aka 連登, was the most popular Internet forum in Hong Kong in 2019.

lihkg-scrapper.py scrapes the URLs of threads whose titles include the given keyword; lihkg.R then feeds them into LIHKGr to download the contents of each thread.

Environment

Feel free to set up according to your preferences. The following is what I used.

macOS Catalina 10.15.7 (x86_64-apple-darwin17.0)
Firefox Browser 83.0 (64-bit)
Python 3.9.0
R version 4.0.3

Before running

Decide on the keyword(s) you're going to search for
- for python3 lihkg-scrapper.py [keyword], the keyword is read in as sys.argv[1] in my module and typed into the search bar during the automation process
- for Rscript lihkg.R [keyword], the keyword is read in with commandArgs(trailingOnly = TRUE)
Check your environment settings against the Selenium documentation for Python
- install the Python bindings for Selenium: pip3 install selenium
- download (and install) the web browser driver that you have chosen
- no Java server is needed for this scraper
Put the downloaded geckodriver for Firefox (or the driver for your preferred browser) under your desired directory
- preferred $PATH setting method: special thanks to @shouko for the advice
Change the constants in lihkg-scrapper.py according to your needs:
- PATH as your desired directory
- account USERNAME and PASSWORD for LIHKG (preferred: apply for a LIHKG account before scraping!)
- or you could choose to input this rather sensitive information manually
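
The preparation steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the code in lihkg-scrapper.py: the helper names (read_keyword, add_driver_dir_to_path, ask_credentials) and the example directory are hypothetical; only sys.argv[1], PATH, USERNAME and PASSWORD come from this README.

```python
import getpass
import os

# Hypothetical example directory holding geckodriver; replace with your own.
PATH = "/usr/local/bin"


def read_keyword(argv):
    """Return the search keyword passed as the first command-line argument."""
    if len(argv) < 2:
        raise SystemExit("usage: python3 lihkg-scrapper.py [keyword]")
    return argv[1]


def add_driver_dir_to_path(driver_dir, environ=os.environ):
    """Prepend driver_dir to $PATH so the browser driver can be found."""
    parts = environ.get("PATH", "").split(os.pathsep)
    if driver_dir not in parts:
        environ["PATH"] = os.pathsep.join([driver_dir] + [p for p in parts if p])
    return environ["PATH"]


def ask_credentials():
    """Prompt for the LIHKG account instead of hard-coding it in the script."""
    username = input("LIHKG username: ")
    password = getpass.getpass("LIHKG password: ")  # not echoed to the terminal
    return username, password
```

In a script, these would be called as read_keyword(sys.argv), add_driver_dir_to_path(PATH) and USERNAME, PASSWORD = ask_credentials() before the browser is started. Prompting with getpass keeps the password out of the source file, which matters if the script is ever shared or committed.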

Outputs

The Python script outputs thread ids into a .txt file, one id per line
- lihkg-scrapper.py scrapes the URLs of threads, but the output includes only the thread ids
- e.g. in https://lihkg.com/thread/#######/page/1, only ####### is left in the output
The R script outputs the contents of the threads into an .xlsx file
- the ids saved in the .txt file are read in as a vector and passed to LIHKGr
- lihkg.R downloads the contents of the threads into an .xlsx file.
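
As a minimal sketch of the first step (thread URLs reduced to ids, one per line), something like the following would work. It is not the actual code in lihkg-scrapper.py: the function names are hypothetical, the ids are assumed to be numeric, and dropping duplicate ids is an added assumption rather than documented behaviour.

```python
import re

# Assumption: thread ids are numeric, as in https://lihkg.com/thread/<id>/page/1
THREAD_ID_PATTERN = re.compile(r"lihkg\.com/thread/(\d+)")


def extract_thread_ids(urls):
    """Return the unique thread ids found in urls, in first-seen order."""
    seen = []
    for url in urls:
        match = THREAD_ID_PATTERN.search(url)
        # URLs that are not thread pages are skipped silently.
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def write_thread_ids(ids, filename):
    """Write one thread id per line to a plain-text file."""
    with open(filename, "w", encoding="utf-8") as handle:
        for thread_id in ids:
            handle.write(thread_id + "\n")
```

A file written this way has one id per line, so on the R side it can be read into a character vector (for example with readLines) before being handed to LIHKGr.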

To-dos

I haven't finished the R code for logging in yet. Once that is done, perhaps the entire process can be run from a single R script. PRs welcome!
