
Scrape the URLs of LIHKG threads whose titles include specific keywords, and download the contents of those threads.


christinesfkao/lihkg-keyword-scrapper


This is the README file for lihkg-keyword-scrapper by @christinesfkao.

Adapted from: Ho, J.C. & Or, N.H.K. (2020). LIHKGr: An application for scraping LIHKG.

Last updated: Nov 2020

Directory

lihkg-scrapper.py	
lihkg.R	 
README.md

Synopsis

python3 lihkg-scrapper.py [keyword]
Rscript lihkg.R [keyword]

LIHKG, also known as 連登, was the most popular internet forum in Hong Kong in 2019.

lihkg-scrapper.py scrapes the URLs of threads whose titles include specific keywords; lihkg.R then passes those threads to LIHKGr to download their contents.

Environment

Feel free to set up according to your preferences. The following is what I used.

  • macOS Catalina 10.15.7 (x86_64-apple-darwin17.0)
  • Firefox Browser 83.0 (64-bit)
  • Python 3.9.0
  • R version 4.0.3

Before running

  1. Decide on the keyword(s) you're going to search for

    • for python3 lihkg-scrapper.py [keyword], the keyword is read in as sys.argv[1] in my module and typed into the search bar during the automation process
    • for Rscript lihkg.R [keyword], the keyword is read in with commandArgs(trailingOnly = TRUE)
  2. Check your environment settings against the Selenium documentation for Python

    • install the Python bindings for Selenium: pip3 install selenium
    • download (and install) the web browser driver that you have chosen
    • no Java server is needed for this scraper
  3. Put the downloaded geckodriver for Firefox (or the driver for your preferred browser) under your desired directory

    • preferred $PATH setting method: Special thanks to @shouko's advice
  4. Change the constants in lihkg-scrapper.py according to your needs:

    • PATH as your desired directory
    • Account USERNAME and PASSWORD for LIHKG (Preferred: apply for a LIHKG account before scraping!)
    • Or you could choose to input this rather sensitive information manually
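
Steps 1 and 4 above can be sketched roughly as follows. This is a minimal illustration, not the code in lihkg-scrapper.py: the function names read_keyword, read_credentials and make_firefox_driver and the environment variable names LIHKG_USERNAME and LIHKG_PASSWORD are all hypothetical, and make_firefox_driver assumes Selenium 4 or later, where the geckodriver location is passed through a Service object.

```python
import getpass
import os


def read_keyword(argv):
    """Return the search keyword given as the first command-line argument."""
    if len(argv) < 2 or not argv[1].strip():
        raise SystemExit("usage: python3 lihkg-scrapper.py [keyword]")
    return argv[1].strip()


def read_credentials(env=os.environ, prompt=input, secret_prompt=getpass.getpass):
    """Take credentials from the environment, falling back to manual input.

    LIHKG_USERNAME and LIHKG_PASSWORD are hypothetical variable names chosen
    for this sketch; they keep the account details out of the source file.
    """
    username = env.get("LIHKG_USERNAME") or prompt("LIHKG username: ")
    password = env.get("LIHKG_PASSWORD") or secret_prompt("LIHKG password: ")
    return username, password


def make_firefox_driver(geckodriver_path):
    """Start Firefox through the geckodriver at the given path (Selenium 4+)."""
    # Imported lazily so the helpers above also work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service

    return webdriver.Firefox(service=Service(executable_path=geckodriver_path))
```

In a script, these would be called as read_keyword(sys.argv) and read_credentials(). Prompting with getpass keeps the password from being echoed to the terminal, which is one way to avoid hard-coding it.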

Outputs

  • The Python script outputs thread ids into a .txt file, one id per line
  • The R script outputs the contents of the threads into a .xlsx file
    • the ids saved in the .txt file are read in as a vector and passed to LIHKGr
    • lihkg.R downloads the contents of each thread into an .xlsx file.
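
The hand-off file described above could be produced along these lines. This is a sketch under assumptions, not the script's actual code: it assumes thread URLs contain a /thread/<digits> segment, and the function names and the file name used in the example are hypothetical.

```python
import re
from pathlib import Path

# Assumption: a thread URL contains "/thread/<numeric id>".
THREAD_ID_PATTERN = re.compile(r"/thread/(\d+)")


def thread_id_from_url(url):
    """Return the numeric thread id in a URL, or None if there is none."""
    match = THREAD_ID_PATTERN.search(url)
    return match.group(1) if match else None


def write_thread_ids(urls, output_path):
    """Write each distinct thread id on its own line, in first-seen order."""
    ids = []
    for url in urls:
        thread_id = thread_id_from_url(url)
        if thread_id and thread_id not in ids:
            ids.append(thread_id)
    Path(output_path).write_text("".join(f"{i}\n" for i in ids), encoding="utf-8")
    return ids


def read_thread_ids(input_path):
    """Read the ids back as a list of strings, skipping blank lines."""
    lines = Path(input_path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

For example, write_thread_ids(urls, "thread_ids.txt") would leave a file that the R side can read line by line into a vector.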

To-dos

I haven't finished the R code for logging in yet. Once that is done, the entire process could perhaps run from a single R script. PRs welcome!
