Releases · DEENUU1/job-scraper
v0.8-27.04.2024
- Add missing `https://` prefix for objects from indeed, justjoinit and nofluffjobs
v0.7-24.04.2024
- Add tag for each url to facilitate subsequent filtering of offers
Updated config.json format:
{
  "url": "https://pl.indeed.com/praca?l=Zdu%C5%84ska+Wola%2C+%C5%82%C3%B3dzkie&radius=10&sort=date&vjk=adc0ec0fd20bd577",
  "tag": "Zduńska Wola"
},
{
  "url": "https://pl.indeed.com/jobs?q=python&sort=date&fromage=3&vjk=a986e0ad45696cf8",
  "tag": "IT"
},
- Add the alembic library to run migrations on the existing SQLite database (example commands below)
- In the local server, add filtering by tag name
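These notes don't show the exact migration commands; assuming a standard Alembic setup, applying migrations to the existing SQLite database would look roughly like this:
# Apply all pending migration scripts to the existing database
alembic upgrade head
# After changing the models, autogenerate a new migration script
alembic revision --autogenerate -m "describe your change"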
v0.6-11.04.2024
- Delete sort_by option
- Add option to check all offers with 1 click
- Fix a bug with pagination and other filters
v0.5-10.04.2024
- Add an option to save data to a SQLite database
- Create a simple FastAPI + Jinja local web application to browse and filter scraped data
Config set up (config.json)
export_type
Here you can type "excel", "googlesheet" or "db". If you choose "excel", data will be saved locally in an .xlsx file; if you want to save data in a Google Sheet, choose "googlesheet"; and if you want to use the SQLite database plus the local web application to browse and filter data, choose "db".
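For reference, a minimal config.json combining export_type with the tagged url format introduced in v0.7 might look like this (the exact top-level schema is an assumption based on the notes on this page):
{
  "export_type": "db",
  "websites": [
    {
      "url": "https://pl.indeed.com/jobs?q=python&sort=date&fromage=3&vjk=a986e0ad45696cf8",
      "tag": "IT"
    }
  ]
}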
Without docker
Install requirements
pip install -r requirements.txt
Run scrapers
python main.py
# On windows you can run `run.ps1` powershell script
If you set "db" in your config file, you can run the local server
python server.py
# On windows you can run `server.ps1` powershell script
With docker
I don't recommend using Docker if you decided to save your data to SQLite.
I still need to refine this option, but for now I recommend using Docker in combination with Google Sheets or .xlsx files.
Build image
docker build -t scraper .
Run scrapers
docker run scraper
v0.4-03.04.2024
- Save data to .xlsx file
v0.3-25.03.2024
- Skip offers by keywords in title
- Fix theprotocol.it url duplications
v0.2-25.03.2024
- Fixed an error with the creation of duplicate offers from Pracuj.pl and it.pracuj.pl
- Fixed an error related to incorrect validation of the time of adding the offer on the jooble.org website
- Fixed an error related to incorrect validation of the time of adding an offer on indeed.com
- Add a .txt file that allows you to list links to offers that should be omitted and not added to the Google Sheet (a sketch of how such a skip list could work is below)
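The notes don't name the file or show the loading code; a minimal sketch, with the filename and the offer shape as assumptions:
# Hypothetical filename; the release notes only say "a .txt file"
SKIP_FILE = "skipped_links.txt"

def load_skipped_links(path: str = SKIP_FILE) -> set[str]:
    # One offer url per line; blank lines are ignored
    try:
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()

def filter_offers(offers: list[dict], skipped: set[str]) -> list[dict]:
    # Drop offers whose url appears on the skip list
    return [o for o in offers if o.get("url") not in skipped]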
v0.1-22.03.2024
Job scraper
A program that allows you to scrape job offers from many websites and save new offers in a Google Sheet
Features
- Multi-Portal Job Scraper:
  - The project is designed to scrape job postings from various job portals.
  - Implements the Strategy Pattern for scrapers, allowing flexibility in choosing the scraping method based on the website's structure (see the sketch after this list).
  - Utilizes either requests + BeautifulSoup or Selenium, with Selenium capable of scrolling pages and handling pop-up windows.
- Data Management and Storage:
  - Scraped data is stored efficiently to prevent duplication.
  - Integrated with Google Sheets for seamless data storage and accessibility.
- Customizable Scraping Parameters:
  - Users can set specific links for supported job portals along with filters and sorting preferences for tailored scraping.
- Time-based Filtering:
  - Provides an option to set a maximum age for job postings, preventing the scraping of listings older than the specified timeframe (e.g., not scraping job postings older than 3 days).
- Flexible Configuration:
  - Users can configure the scraper to their preferences, enabling selective scraping based on categories or other criteria specified by the user.
- Automated Maintenance:
  - The application handles cookie consent pop-ups automatically, ensuring an uninterrupted scraping experience.
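The repository's class names aren't shown on this page; a minimal sketch of the Strategy Pattern described above, with all names hypothetical, could look like this:
from abc import ABC, abstractmethod

import requests
from bs4 import BeautifulSoup


class ScraperStrategy(ABC):
    # Common interface: each portal gets its own scraping strategy
    @abstractmethod
    def scrape(self, url: str) -> list[dict]:
        ...


class RequestsScraper(ScraperStrategy):
    # Strategy for static pages: plain requests + BeautifulSoup
    def scrape(self, url: str) -> list[dict]:
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        # Portal-specific parsing goes here; the selector is a placeholder
        return [{"title": a.get_text(strip=True), "url": a.get("href")}
                for a in soup.select("a.job-offer")]


def run_scraper(strategy: ScraperStrategy, url: str) -> list[dict]:
    # The caller picks a strategy per portal; a SeleniumScraper handling
    # scrolling and pop-up windows would implement the same interface
    return strategy.scrape(url)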
Supported websites and url configuration
!INFO!
I recommend that each link point to the first page of results (pagination) and follow the recommendations below. I have provided examples of correct and incorrect links for each site.
bulldogjob
# bulldogjob.pl url must end with "/page,"
https://bulldogjob.pl/companies/jobs/page, # valid
https://bulldogjob.pl/companies/jobs/s/skills,Python,JavaScript/page, #valid
https://bulldogjob.pl/companies/jobs/s/skills,Python,JavaScript # invalid
indeed
# indeed.com url has to include some parameters
https://pl.indeed.com/jobs?q=&l=Warszawa%2C+mazowieckie&from=searchOnHP&vjk=1593bca04b48ed8a # valid (choose Warsaw as a location)
https://pl.indeed.com/ # invalid
it.pracuj.pl
https://it.pracuj.pl/praca # valid
https://it.pracuj.pl/praca?itth=50%2C75 # valid
jooble
# Here you need to add some filters on the website, copy the url after scrolling a few times,
# and then change the `?p=` value to, for example, 10000
https://pl.jooble.org/SearchResult?p=10000&rgns=Warszawa # valid
https://pl.jooble.org/SearchResult?rgns=Warszawa # invalid
nofluffjobs
https://nofluffjobs.com/pl # valid
https://nofluffjobs.com/pl/.NET?page=1&criteria=seniority%3Dtrainee,junior # valid
olx
# Scraping data from OLX is a little more difficult
# First you need to go to https://www.olx.pl/praca/ and choose all the filters that you need
# Then right-click on the page and open DevTools
# Go to the Network tab and refresh the page
# Scroll to the end and go to page 2 (pagination)
# Scroll to the end again and now in the Network tab search for a JSON with a url like "https://www.olx.pl/api/v1/offers/?offset=40&...."
# In my example it looks like this https://www.olx.pl/api/v1/offers/?offset=40&limit=40&category_id=4&filter_refiners=spell_checker&sl=18c34ade124x23bc10a5
# Then, in the JSON, click "links" and go to "previous"
# Copy this link from your browser and add it to the config.json file
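Once you have such an API url, a scraper can page through results by stepping the offset parameter. A minimal sketch; the paging scheme and the "data" payload key are assumptions based on the url format above:
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

import requests


def fetch_olx_offers(api_url: str, max_pages: int = 5) -> list[dict]:
    # Rewrite the `offset` query parameter page by page
    parts = urlparse(api_url)
    query = parse_qs(parts.query)
    limit = int(query.get("limit", ["40"])[0])

    offers = []
    for page in range(max_pages):
        query["offset"] = [str(page * limit)]
        url = urlunparse(parts._replace(query=urlencode(query, doseq=True)))
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        data = resp.json().get("data", [])  # payload key is an assumption
        if not data:  # no more results
            break
        offers.extend(data)
    return offers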
theprotocol
https://theprotocol.it/filtry/java;t/trainee,assistant;p # valid
https://theprotocol.it/praca # valid
useme
https://useme.com/pl/jobs/category/programowanie-i-it,35/ # valid
https://useme.com/pl/jobs/category/multimedia,36/ # valid
https://useme.com/pl/jobs/category/serwisy-internetowe,34/ # valid
https://useme.com/pl/jobs/category/serwisy-internetowe,34/sklepy-internetowe,97/ # valid
Technologies:
- Python
- Requests
- BeautifulSoup4
- Selenium
- Docker
- Google Sheets
Installation
Clone repository
git clone https://github.com/DEENUU1/job-scraper.git
Set up your Google Account
- Go to Google Console
- Create a new project or choose an existing one (Tutorial)
- Go to Navigation Menu and select APIs & Services and then Credentials
- Click CREATE CREDENTIALS and choose Service Account
- Give it any name and click Done
- Copy e-mail of the created account
- Then click on the pencil button to the left of the trash icon
- Go to Keys and click ADD KEY and then Create new key
- Choose JSON format and then Create
- Rename the downloaded file to credentials.json and copy it to the main directory of this project (the same directory where main.py is located)
- Go back to Google Console and search for the Google Sheets API
- Enable this API
- Create new Google Sheet
- In the Google Sheet, click Share and add the email you copied earlier
- Choose access for all people with the link and copy this link
- Add the link to config.json in the url field
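This page doesn't show the client code; assuming a service-account flow like the one configured above, connecting with the gspread library (whether the project actually uses gspread is an assumption) looks roughly like this:
import gspread

# Uses the credentials.json service-account key from the project directory
gc = gspread.service_account(filename="credentials.json")

# Open the sheet via the link you put in config.json's url field
sheet = gc.open_by_url("https://docs.google.com/spreadsheets/d/...").sheet1

# Append one scraped offer as a new row
sheet.append_row(["Python Developer", "https://example.com/offer/123"])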
Config set up
url
Dedicated to the Google Sheet link
max_offer_duration_days
Here you can set null or an integer (for example 5); if the value is an integer, offers downloaded from websites will not be older than the number of days you specify
websites
Here you can add multiple urls from which you want to scrape job offers
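Putting these fields together, a config.json for this version might look like the following (the layout is inferred from the descriptions above; the exact schema in the repository may differ):
{
  "url": "https://docs.google.com/spreadsheets/d/.../edit?usp=sharing",
  "max_offer_duration_days": 5,
  "websites": [
    "https://bulldogjob.pl/companies/jobs/page,",
    "https://it.pracuj.pl/praca"
  ]
}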
Without docker
Install requirements
pip install -r requirements.txt
Run script
python main.py
With docker
Build image
docker build -t scraper .
Run script
docker run scraper
.exe file
- Get the .exe file from assets/main.rar
- Unpack main.rar
- Inside the main directory (where main.exe is located), add the credentials.json file and configure config.json
Authors
License
See LICENSE.txt for more information.