Releases · DEENUU1/job-scraper
v0.8-27.04.2024
- Add missing `https://` prefix for objects from indeed, justjoinit and nofluffjobs
v0.7-24.04.2024
- Add tag for each url to facilitate subsequent filtering of offers
Updated config.json format:
{
  "url": "https://pl.indeed.com/praca?l=Zdu%C5%84ska+Wola%2C+%C5%82%C3%B3dzkie&radius=10&sort=date&vjk=adc0ec0fd20bd577",
  "tag": "Zduńska Wola"
},
{
  "url": "https://pl.indeed.com/jobs?q=python&sort=date&fromage=3&vjk=a986e0ad45696cf8",
  "tag": "IT"
},
- Add the alembic library to run migrations on the existing SQLite database (example commands below)
- In the local server, add filtering by tag name
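These notes don't show the exact migration commands; assuming a standard Alembic setup, applying migrations to the existing SQLite database would look roughly like this:
# Apply all pending migration scripts to the existing database
alembic upgrade head
# After changing the models, autogenerate a new migration script
alembic revision --autogenerate -m "describe your change"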
v0.6-11.04.2024
- Delete sort_by option
- Add option to check all offers with 1 click
- Fix a bug with pagination and other filters
v0.5-10.04.2024
- Add an option to save data to a SQLite database
- Create a simple FastAPI + Jinja local web application to browse and filter scraped data
Config set up (config.json)
export_type
Here you can type "excel", "googlesheet" or "db". If you choose "excel", data will be saved locally in an .xlsx file; if you want to save data in a Google Sheet, choose "googlesheet"; and if you want to use the SQLite database plus the local web application to browse and filter data, choose "db".
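For reference, a minimal config.json combining export_type with the tagged url format introduced in v0.7 might look like this (the exact top-level schema is an assumption based on the notes on this page):
{
  "export_type": "db",
  "websites": [
    {
      "url": "https://pl.indeed.com/jobs?q=python&sort=date&fromage=3&vjk=a986e0ad45696cf8",
      "tag": "IT"
    }
  ]
}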
Without docker
Install requirements
pip install -r requirements.txt
Run scrapers
python main.py
# On windows you can run `run.ps1` powershell script
If you set "db" in your config file, you can run the local server
python server.py
# On windows you can run `server.ps1` powershell script
With docker
I don't recommend using Docker if you decided to save your data to SQLite.
I still need to refine this option, but for now I recommend using Docker in combination with Google Sheets or .xlsx files.
Build image
docker build -t scraper .
Run scrapers
docker run scraper
v0.4-03.04.2024
- Save data to .xlsx file
v0.3-25.03.2024
- Skip offers by keywords in title
- Fix theprotocol.it url duplications
v0.2-25.03.2024
- Fixed an error with the creation of duplicate offers from Pracuj.pl and it.pracuj.pl
- Fixed an error related to incorrect validation of the time of adding the offer on the jooble.org website
- Fixed an error related to incorrect validation of the time of adding an offer on indeed.com
- Add a .txt file that allows you to list links to offers that should be omitted and not added to the Google Sheet (a sketch of how such a skip list could work is below)
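The notes don't name the file or show the loading code; a minimal sketch, with the filename and the offer shape as assumptions:
# Hypothetical filename; the release notes only say "a .txt file"
SKIP_FILE = "skipped_links.txt"

def load_skipped_links(path: str = SKIP_FILE) -> set[str]:
    # One offer url per line; blank lines are ignored
    try:
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()

def filter_offers(offers: list[dict], skipped: set[str]) -> list[dict]:
    # Drop offers whose url appears on the skip list
    return [o for o in offers if o.get("url") not in skipped]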
v0.1-22.03.2024
Job scraper
A program that allows you to scrape job offers from many websites and save new offers in a Google Sheet
Features
- Multi-Portal Job Scraper:
  - The project is designed to scrape job postings from various job portals.
  - Implements the Strategy Pattern for scrapers, allowing flexibility in choosing the scraping method based on the website's structure (see the sketch after this list).
  - Utilizes either requests + BeautifulSoup or Selenium, with Selenium capable of scrolling pages and handling pop-up windows.
- Data Management and Storage:
  - Scraped data is stored efficiently to prevent duplication.
  - Integrated with Google Sheets for seamless data storage and accessibility.
- Customizable Scraping Parameters:
  - Users can set specific links for supported job portals along with filters and sorting preferences for tailored scraping.
- Time-based Filtering:
  - Provides an option to set a maximum age for job postings, preventing the scraping of listings older than the specified timeframe (e.g., not scraping job postings older than 3 days).
- Flexible Configuration:
  - Users can configure the scraper to their preferences, enabling selective scraping based on categories or other criteria specified by the user.
- Automated Maintenance:
  - The application handles cookie consent pop-ups automatically, ensuring an uninterrupted scraping experience.
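The repository's class names aren't shown on this page; a minimal sketch of the Strategy Pattern described above, with all names hypothetical, could look like this:
from abc import ABC, abstractmethod

import requests
from bs4 import BeautifulSoup


class ScraperStrategy(ABC):
    # Common interface: each portal gets its own scraping strategy
    @abstractmethod
    def scrape(self, url: str) -> list[dict]:
        ...


class RequestsScraper(ScraperStrategy):
    # Strategy for static pages: plain requests + BeautifulSoup
    def scrape(self, url: str) -> list[dict]:
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        # Portal-specific parsing goes here; the selector is a placeholder
        return [{"title": a.get_text(strip=True), "url": a.get("href")}
                for a in soup.select("a.job-offer")]


def run_scraper(strategy: ScraperStrategy, url: str) -> list[dict]:
    # The caller picks a strategy per portal; a SeleniumScraper handling
    # scrolling and pop-up windows would implement the same interface
    return strategy.scrape(url)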
Supported websites and url configuration
!INFO!
I recommend that each link point to the first page of results (pagination) and follow the recommendations below. I have provided examples of correct and incorrect links for each site.
bulldogjob
# bulldogjob.pl url must end with "/page,"
https://bulldogjob.pl/companies/jobs/page, # valid
https://bulldogjob.pl/companies/jobs/s/skills,Python,JavaScript/page, #valid
https://bulldogjob.pl/companies/jobs/s/skills,Python,JavaScript # invalid
indeed
# indeed.com url has to include some parameters
https://pl.indeed.com/jobs?q=&l=Warszawa%2C+mazowieckie&from=searchOnHP&vjk=1593bca04b48ed8a # valid (choose Warsaw as a location)
https://pl.indeed.com/ # invalid
it.pracuj.pl
https://it.pracuj.pl/praca # valid
https://it.pracuj.pl/praca?itth=50%2C75 # valid
jooble
# Here you need to add some filters on the website, copy the url after scrolling a few times,
# and then change the `?p=` value to, for example, 10000
https://pl.jooble.org/SearchResult?p=10000&rgns=Warszawa # valid
https://pl.jooble.org/SearchResult?rgns=Warszawa # invalid
nofluffjobs
https://nofluffjobs.com/pl # valid
https://nofluffjobs.com/pl/.NET?page=1&criteria=seniority%3Dtrainee,junior # valid
olx
# Scraping data from OLX is a little more difficult
# First you need to go to https://www.olx.pl/praca/ and choose all the filters that you need
# Then right-click on the page and open DevTools
# Go to the Network tab and refresh the page
# Scroll to the end and go to page 2 (pagination)
# Scroll to the end again and now in the Network tab search for a JSON with a url like "https://www.olx.pl/api/v1/offers/?offset=40&...."
# In my example it looks like this https://www.olx.pl/api/v1/offers/?offset=40&limit=40&category_id=4&filter_refiners=spell_checker&sl=18c34ade124x23bc10a5
# Then, in the JSON, click "links" and go to "previous"
# Copy this link from your browser and add it to the config.json file
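Once you have such an API url, a scraper can page through results by stepping the offset parameter. A minimal sketch; the paging scheme and the "data" payload key are assumptions based on the url format above:
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

import requests


def fetch_olx_offers(api_url: str, max_pages: int = 5) -> list[dict]:
    # Rewrite the `offset` query parameter page by page
    parts = urlparse(api_url)
    query = parse_qs(parts.query)
    limit = int(query.get("limit", ["40"])[0])

    offers = []
    for page in range(max_pages):
        query["offset"] = [str(page * limit)]
        url = urlunparse(parts._replace(query=urlencode(query, doseq=True)))
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        data = resp.json().get("data", [])  # payload key is an assumption
        if not data:  # no more results
            break
        offers.extend(data)
    return offers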
theprotocol
https://theprotocol.it/filtry/java;t/trainee,assistant;p # valid
https://theprotocol.it/praca # valid
useme
https://useme.com/pl/jobs/category/programowanie-i-it,35/ # valid
https://useme.com/pl/jobs/category/multimedia,36/ # valid
https://useme.com/pl/jobs/category/serwisy-internetowe,34/ # valid
https://useme.com/pl/jobs/category/serwisy-internetowe,34/sklepy-internetowe,97/ # valid
Technologies:
- Python
- Requests
- BeautifulSoup4
- Selenium
- Docker
- Google Sheets
Installation
Clone repository
git clone https://github.com/DEENUU1/job-scraper.git
Set up your Google Account
- Go to Google Console
- Create a new project or choose an existing one (Tutorial)
- Go to Navigation Menu and select APIs & Services and then Credentials
- Click CREATE CREDENTIALS and choose Service Account
- Give it any name and click Done
- Copy e-mail of the created account
- Then click on the pencil button to the left of the trash icon
- Go to Keys and click ADD KEY and then Create new key
- Choose JSON format and then Create
- Rename the downloaded file to credentials.json and copy it to the main directory of this project (the same directory where main.py is located)
- Go back to Google Console and search for the Google Sheets API
- Enable this API
- Create new Google Sheet
- In the Google Sheet, click Share and add the email you copied earlier
- Choose access for all people with the link and copy this link
- Add the link to config.json in the url field
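This page doesn't show the client code; assuming a service-account flow like the one configured above, connecting with the gspread library (whether the project actually uses gspread is an assumption) looks roughly like this:
import gspread

# Uses the credentials.json service-account key from the project directory
gc = gspread.service_account(filename="credentials.json")

# Open the sheet via the link you put in config.json's url field
sheet = gc.open_by_url("https://docs.google.com/spreadsheets/d/...").sheet1

# Append one scraped offer as a new row
sheet.append_row(["Python Developer", "https://example.com/offer/123"])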
Config set up
url
Dedicated to the Google Sheet link
max_offer_duration_days
Here you can set null or an integer (for example 5); if the value is an integer, offers downloaded from websites will not be older than the number of days you specify
websites
Here you can add multiple urls from which you want to scrape job offers
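Putting these fields together, a config.json for this version might look like the following (the layout is inferred from the descriptions above; the exact schema in the repository may differ):
{
  "url": "https://docs.google.com/spreadsheets/d/.../edit?usp=sharing",
  "max_offer_duration_days": 5,
  "websites": [
    "https://bulldogjob.pl/companies/jobs/page,",
    "https://it.pracuj.pl/praca"
  ]
}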
Without docker
Install requirements
pip install -r requirements.txt
Run script
python main.py
With docker
Build image
docker build -t scraper .
Run script
docker run scraper
.exe file
- Get the .exe file from assets/main.rar
- Unpack main.rar
- Inside the main directory (where main.exe is located), add the credentials.json file and configure config.json
Authors
License
See LICENSE.txt for more information.