
Releases: DEENUU1/job-scraper

v0.8-27.04.2024

27 Apr 13:31
  • Add missing https:// prefix to offers scraped from indeed, justjoinit and nofluffjobs

v0.7-24.04.2024

24 Apr 20:36
  • Add a tag for each URL to make it easier to filter offers later
    Updated config.json format:
    {
      "url": "https://pl.indeed.com/praca?l=Zdu%C5%84ska+Wola%2C+%C5%82%C3%B3dzkie&radius=10&sort=date&vjk=adc0ec0fd20bd577",
      "tag": "Zduńska Wola"
    },
    {
      "url": "https://pl.indeed.com/jobs?q=python&sort=date&fromage=3&vjk=a986e0ad45696cf8",
      "tag": "IT"
    },
  • Add the Alembic library to run migrations on an existing SQLite database (see the command below)
  • Add filtering by tag name to the local server
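
If you already have a SQLite database from an earlier version, the schema changes are applied through Alembic. Assuming the repository ships a standard Alembic setup (an alembic.ini plus a migrations folder), upgrading would look like this:

alembic upgrade head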

(screenshot: local server)

v0.6-11.04.2024

11 Apr 12:44
  • Delete sort_by option
  • Add option to check all offers with 1 click
  • Fix a bug with pagination and other filters

v0.5-10.04.2024

10 Apr 20:43
  • Add option to save data to SQLite database
  • Create a simple FastAPI + Jinja local web application to browse and filter scraped data (see the sketch below)

(screenshot: local server)
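
For reference, a minimal sketch of what such a FastAPI + Jinja app could look like; the route, template name and query logic are assumptions, not the repository's actual server.py:

from typing import Optional

from fastapi import FastAPI, Request
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory="templates")


def get_offers(tag: Optional[str]):
    # Placeholder: in the real project this would query the SQLite database.
    return []


@app.get("/")
def list_offers(request: Request, tag: Optional[str] = None):
    # Optional ?tag=... query parameter filters offers by tag (added in v0.7).
    return templates.TemplateResponse(
        "index.html", {"request": request, "offers": get_offers(tag)}
    )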

Config set up (config.json)

  • export_type Here you can type "excel", "googlesheet" or "db". If you choose "excel", data is saved locally to an .xlsx file; if you want to save data to a Google Sheet, choose "googlesheet"; and if you want to use the SQLite database plus the local web application to browse and filter data, choose "db".
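
For example, to store offers in the local SQLite database the config would contain the following (other fields omitted):

{
  "export_type": "db"
}

Replace "db" with "excel" or "googlesheet" to switch the output target.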

Without docker

Install requirements

pip install -r requirements.txt

Run scrapers

python main.py

# On Windows you can run the `run.ps1` PowerShell script

If you set "db" in your config file, you can run the local server

python server.py

# On Windows you can run the `server.ps1` PowerShell script

With docker

I don't recommend using Docker if you decided to save your data to SQLite.
This option still needs refinement; for now I recommend using Docker in combination with Google Sheets or .xlsx files.

Build image

docker build -t scraper .

Run scrapers

docker run scraper

v0.4-03.04.2024

10 Apr 12:37
  • Save data to .xlsx file

v0.3-25.03.2024

25 Mar 19:16
  • Skip offers by keywords in title
  • Fix theprotocol.it url duplications

v0.2-25.03.2024

25 Mar 18:40
  • Fixed an error with the creation of duplicate offers from Pracuj.pl and it.pracuj.pl
  • Fixed incorrect validation of the offer posting time on jooble.org
  • Fixed incorrect validation of the offer posting time on indeed.com
  • Added a .txt file where you can list links to offers that should be skipped and not added to the Google Sheet

v0.1-22.03.2024

24 Mar 14:06

Job scraper

A program that allows you to scrape job offers from many websites and save new offers in a Google Sheet


(screenshot: Google Sheet results)

Features

  1. Multi-Portal Job Scraper:

    • The project is designed to scrape job postings from various job portals.
    • Implements the Strategy Pattern for scrapers, allowing flexibility in choosing the scraping method based on the website's structure (see the sketch after this list).
    • Utilizes either requests + BeautifulSoup or Selenium, with Selenium capable of scrolling pages and handling pop-up windows.
  2. Data Management and Storage:

    • Scraped data is efficiently stored to prevent duplication.
    • Integrated with Google Sheets for seamless data storage and accessibility.
  3. Customizable Scraping Parameters:

    • Users can set specific links for supported job portals along with filters and sorting preferences for tailored scraping.
    • Time-based filtering: an option to set a maximum age for job postings, preventing the scraping of listings older than the specified timeframe (e.g., not scraping job postings older than 3 days).
  4. Flexible Configuration:

    • Users can configure the scraper to their preferences, enabling selective scraping based on categories or other criteria specified by the user.
  5. Automated Maintenance:

    • The application handles cookie consent pop-ups automatically, ensuring an uninterrupted scraping experience.
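
As mentioned in the first feature, scrapers are selected through the Strategy Pattern. A minimal illustrative sketch in Python (the class names and the selection rule are assumptions, not the repository's actual code):

from abc import ABC, abstractmethod

import requests
from bs4 import BeautifulSoup


class ScraperStrategy(ABC):
    """Common interface every portal scraper implements (names are illustrative)."""

    @abstractmethod
    def scrape(self, url: str) -> list:
        """Return a list of job offers as dictionaries."""


class RequestsScraper(ScraperStrategy):
    """Static pages: plain HTTP request parsed with BeautifulSoup."""

    def scrape(self, url: str) -> list:
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        # Parse the offer cards for the given portal here.
        return []


class SeleniumScraper(ScraperStrategy):
    """Dynamic pages: scroll, dismiss cookie pop-ups, then parse."""

    def scrape(self, url: str) -> list:
        # Drive a browser with Selenium, scroll, close pop-ups, collect offers.
        return []


def get_scraper(url: str) -> ScraperStrategy:
    # Pick a strategy based on the portal's structure (this mapping is hypothetical).
    return SeleniumScraper() if "olx.pl" in url else RequestsScraper()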

Supported websites and url configuration

!INFO!

I recommend that each link points to the first page of results (pagination) and follows the recommendations below. Examples of correct and incorrect links are provided for each site.

bulldogjob
# bulldogjob.pl URLs must end with "/page,"

https://bulldogjob.pl/companies/jobs/page,  # valid
https://bulldogjob.pl/companies/jobs/s/skills,Python,JavaScript/page, # valid

https://bulldogjob.pl/companies/jobs/s/skills,Python,JavaScript # invalid
indeed
# indeed.com URLs have to include some parameters

https://pl.indeed.com/jobs?q=&l=Warszawa%2C+mazowieckie&from=searchOnHP&vjk=1593bca04b48ed8a # valid (choose Warsaw as a location)

https://pl.indeed.com/ # invalid

it.pracuj.pl
https://it.pracuj.pl/praca # valid
https://it.pracuj.pl/praca?itth=50%2C75 # valid
jooble
# Here you need to set some filters on the website, then copy the URL, scroll a few times,
# and then change the `?p=` value to a large number, for example 10000

https://pl.jooble.org/SearchResult?p=10000&rgns=Warszawa # valid
https://pl.jooble.org/SearchResult?rgns=Warszawa # invalid

nofluffjobs
https://nofluffjobs.com/pl # valid
https://nofluffjobs.com/pl/.NET?page=1&criteria=seniority%3Dtrainee,junior # valid
olx
# Scraping data from OLX is a little more difficult
# First you need to go to https://www.olx.pl/praca/ and choose all the filters that you need
# Then right-click the page and open DevTools
# Go to the Network tab and refresh the page
# Scroll to the end and go to page 2 (pagination)
# Scroll to the end again, then in the Network tab search for a JSON request with a URL like "https://www.olx.pl/api/v1/offers/?offset=40&...."
# In my example it looks like this: https://www.olx.pl/api/v1/offers/?offset=40&limit=40&category_id=4&filter_refiners=spell_checker&sl=18c34ade124x23bc10a5
# Then open "links" in the JSON response and go to "previous"
# Copy this link from your browser and add it to the config.json file
theprotocol
https://theprotocol.it/filtry/java;t/trainee,assistant;p # valid
https://theprotocol.it/praca # valid 
useme
https://useme.com/pl/jobs/category/programowanie-i-it,35/ # valid
https://useme.com/pl/jobs/category/multimedia,36/ # valid
https://useme.com/pl/jobs/category/serwisy-internetowe,34/ # valid
https://useme.com/pl/jobs/category/serwisy-internetowe,34/sklepy-internetowe,97/ # valid

Technologies:

  • Python
    • Requests
    • BeautifulSoup4
    • Selenium
  • Docker
  • Google Sheet

Installation

Clone repository

git clone https://github.com/DEENUU1/job-scraper.git

Set up your Google Account

  1. Go to Google Console
  2. Create a new project or choose an existing one
  3. Go to Navigation Menu and select APIs & Services and then Credentials
  4. Click CREATE CREDENTIALS and choose Service Account
  5. Give it any name and click Done
  6. Copy e-mail of the created account
  7. Then click on the pencil button to the left of the trash icon
  8. Go to Keys and click ADD KEY and then Create new key
  9. Choose JSON format and then Create
  10. Rename the downloaded file to credentials.json and copy it to the main directory of this project (the same directory where main.py is located)
  11. Go back to Google Console and search for the Google Sheets API
  12. Enable this API
  13. Create new Google Sheet
  14. In the Google Sheet click Share and paste the e-mail you copied earlier
  15. Choose access for anyone with the link and copy the link
  16. Add the link to config.json in the url field
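
The README does not name the Google Sheets client library the project uses; purely as an illustration, this is how the credentials.json service account and the shared sheet could be used with the popular gspread package (the sheet URL and the row values are placeholders):

import gspread

# Authorize with the service-account key file created in the steps above.
gc = gspread.service_account(filename="credentials.json")

# Open the sheet that was shared with the service-account e-mail.
sheet = gc.open_by_url("https://docs.google.com/spreadsheets/d/<your-sheet-id>")
worksheet = sheet.sheet1

# Append one scraped offer as a new row (columns are illustrative).
worksheet.append_row(["Python Developer", "https://example.com/offer/123", "IT"])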

Config set up

  • url is the link to your Google Sheet (the link you copied when sharing the sheet)
  • max_offer_duration_days can be set to null or an integer (for example 5). If the value is an integer, offers older than that number of days will not be scraped
  • websites here you can add multiple URLs from which you want to scrape job offers (see the example below)
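
Putting the fields together, a config.json could look roughly like this; the sheet URL is a placeholder, the website entry is taken from the v0.7 notes above, and the exact top-level layout may differ from the repository's file:

{
  "url": "https://docs.google.com/spreadsheets/d/<your-sheet-id>",
  "max_offer_duration_days": 5,
  "websites": [
    {
      "url": "https://pl.indeed.com/jobs?q=python&sort=date&fromage=3&vjk=a986e0ad45696cf8",
      "tag": "IT"
    }
  ]
}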

Without docker

Install requirements

pip install -r requirements.txt

Run script

python main.py

With docker

Build image

docker build -t scraper .

Run script

docker run scraper

.exe file

  1. Get the .exe file from assets/main.rar
  2. Unpack main.rar
  3. Inside the main directory (where main.exe is located) add the credentials.json file and configure config.json

Authors

License

See LICENSE.txt for more information.