A program that scrapes job offers from multiple websites and saves new offers to Google Sheets, an .XLSX file, or an SQLite database, with support for a local web application server.
Report Bug
·
Request Feature
Multi-Portal Job Scraper:
- The project is designed to scrape job postings from various job portals.
- Implements the Strategy Pattern for scrapers, allowing flexibility in choosing the scraping method based on the website's structure.
- Utilizes either requests + BeautifulSoup or Selenium, with Selenium capable of scrolling pages and handling pop-up windows.
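As a rough illustration of this Strategy Pattern (class and method names below are invented for the sketch, not taken from the project's code):

```python
# Illustrative sketch of a Strategy Pattern for scrapers.
# Class and method names are examples, not the project's actual API.
from abc import ABC, abstractmethod

import requests
from bs4 import BeautifulSoup
from selenium import webdriver


class ScraperStrategy(ABC):
    @abstractmethod
    def fetch(self, url: str) -> str:
        """Return the raw HTML of a job listing page."""


class RequestsStrategy(ScraperStrategy):
    def fetch(self, url: str) -> str:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text


class SeleniumStrategy(ScraperStrategy):
    def fetch(self, url: str) -> str:
        driver = webdriver.Chrome()
        try:
            driver.get(url)
            # With Selenium the page can additionally be scrolled
            # and pop-up windows can be dismissed before reading the HTML.
            return driver.page_source
        finally:
            driver.quit()


def parse_offer_titles(html: str) -> list[str]:
    # Each portal would use its own selectors; this just shows the BS4 step.
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a")]
```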
Data Management and Multiple Storage Options:
- Scraped data is efficiently stored to prevent duplication.
- The application allows you to save data to Google Sheets, an .XLSX file, or an SQLite database. By selecting the SQLite database, you can run a local server written in FastAPI to browse and filter the saved data.
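For the "db" option, the browsing part of such a FastAPI server could look roughly like the sketch below; the route, query parameter, database file and table names are assumptions for illustration, not this project's actual API:

```python
# Hypothetical sketch of a FastAPI endpoint for browsing saved offers.
# Route, parameter, file and column names are assumptions, not the project's API.
import sqlite3
from typing import Optional

from fastapi import FastAPI

app = FastAPI()


@app.get("/offers")
def list_offers(tag: Optional[str] = None):
    conn = sqlite3.connect("offers.db")  # assumed database file name
    conn.row_factory = sqlite3.Row
    query = "SELECT * FROM offers"
    params: tuple = ()
    if tag:
        # Optional filtering by the tag assigned in config.json
        query += " WHERE tag = ?"
        params = (tag,)
    rows = conn.execute(query, params).fetchall()
    conn.close()
    return [dict(row) for row in rows]
```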
Customizable Scraping Parameters:
- Users can set specific links for supported job portals along with filters and sorting preferences for tailored scraping.
Time-based Filtering:
- Provides an option to set a maximum age for job postings, preventing the scraping of listings older than the specified timeframe (e.g., not scraping job postings older than 3 days).
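As a one-function illustration of this check (a sketch, not the project's actual code):

```python
# Sketch of time-based filtering: skip offers older than max_offer_duration_days.
from datetime import datetime, timedelta


def is_recent(posted_at: datetime, max_offer_duration_days: int | None) -> bool:
    if max_offer_duration_days is None:
        return True  # null in config.json disables the age filter
    cutoff = datetime.now() - timedelta(days=max_offer_duration_days)
    return posted_at >= cutoff
```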
Flexible Configuration:
- Users can configure the scraper to their preferences, enabling selective scraping based on categories or other criteria specified by the user.
Automated Maintenance:
- The application handles cookie consent pop-ups automatically, ensuring an uninterrupted scraping experience.
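A typical way to dismiss such a pop-up with Selenium looks like the sketch below; the element ID is only an example, as each portal uses a different banner:

```python
# Example of dismissing a cookie consent pop-up with Selenium.
# The element ID is illustrative; each portal uses a different banner.
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


def accept_cookies(driver) -> None:
    try:
        driver.find_element(By.ID, "cookie-accept-button").click()
    except NoSuchElementException:
        pass  # no banner present, continue scraping
```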
I recommend that each link points to the first page of results (pagination) and follows the recommendations below. Examples of correct and incorrect links are provided for each portal.
bulldogjob
# bulldogjob.pl URL must end with "/page,"
https://bulldogjob.pl/companies/jobs/page, # valid
https://bulldogjob.pl/companies/jobs/s/skills,Python,JavaScript/page, #valid
https://bulldogjob.pl/companies/jobs/s/skills,Python,JavaScript # invalid
indeed
# indeed.com URL has to include some parameters
https://pl.indeed.com/jobs?q=&l=Warszawa%2C+mazowieckie&from=searchOnHP&vjk=1593bca04b48ed8a # valid (choose Warsaw as a location)
https://pl.indeed.com/ # invalid
it.pracuj.pl
https://it.pracuj.pl/praca # valid
https://it.pracuj.pl/praca?itth=50%2C75 # valid
jooble
# Here you need to add some filters on the website, then copy the URL and scroll a few times
# and then change the `?p=` value to, for example, 10000
https://pl.jooble.org/SearchResult?p=10000&rgns=Warszawa # valid
https://pl.jooble.org/SearchResult?rgns=Warszawa # invalid
nofluffjobs
https://nofluffjobs.com/pl # valid
https://nofluffjobs.com/pl/.NET?page=1&criteria=seniority%3Dtrainee,junior # valid
If the script gets stuck in a loop, please check issue #2
olx
# Scraping data from OLX is a little more difficult
# First you need to go to https://www.olx.pl/praca/ and choose all filters that you need
# Then right-click and open DevTools
# Go to the Network tab and refresh the page
# Scroll to the end and go to page 2 (pagination)
# Scroll to the end again and now in the Network tab search for a JSON request with a URL like "https://www.olx.pl/api/v1/offers/?offset=40&...."
# In my example it looks like this: https://www.olx.pl/api/v1/offers/?offset=40&limit=40&category_id=4&filter_refiners=spell_checker&sl=18c34ade124x23bc10a5
# Then, in the JSON response, click links and go to previous
# Copy this link from your browser and add it to the config.json file
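If you want to verify the copied API link before adding it to config.json, you can fetch it directly; this is only a quick check, and the response field names below ("data", "title", "url") are assumptions about the public OLX API, not something this project requires:

```python
# Quick check of an OLX API link copied from DevTools (example URL from above).
# The response field names ("data", "title", "url") are assumptions; adjust if needed.
import requests

api_url = (
    "https://www.olx.pl/api/v1/offers/"
    "?offset=40&limit=40&category_id=4&filter_refiners=spell_checker&sl=18c34ade124x23bc10a5"
)
response = requests.get(api_url, timeout=10)
response.raise_for_status()
for offer in response.json().get("data", []):
    print(offer.get("title"), offer.get("url"))
```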
theprotocol
https://theprotocol.it/filtry/java;t/trainee,assistant;p # valid
https://theprotocol.it/praca # valid
useme
https://useme.com/pl/jobs/category/programowanie-i-it,35/ # valid
https://useme.com/pl/jobs/category/multimedia,36/ # valid
https://useme.com/pl/jobs/category/serwisy-internetowe,34/ # valid
https://useme.com/pl/jobs/category/serwisy-internetowe,34/sklepy-internetowe,97/ # valid
- Python
- Requests
- BeautifulSoup4
- Selenium
- FastAPI
- Docker
- Google Sheets
- PowerShell
- SQLite
- JavaScript
- HTML & CSS
git clone https://github.com/DEENUU1/job-scraper.git
Set up your Google account (you can skip this part if you want to save data locally to an .xlsx file or an SQLite database)
- Go to Google Console
- Create or choose an existing project (Tutorial)
- Go to Navigation Menu and select APIs & Services and then Credentials
- Click CREATE CREDENTIALS and choose Service Account
- Give it any name and click Done
- Copy the e-mail address of the created account
- Then click on the pencil button to the left of the trash icon
- Go to Keys and click ADD KEY and then Create new key
- Choose JSON format and then Create
- Rename the downloaded file to credentials.json and copy it to the root directory of this project (the same directory where main.py is located)
- Go back to Google Console and search for Google Sheets API
- Enable this API
- Create new Google Sheet
- In the Google Sheet, click Share and paste the e-mail you copied earlier
- Choose access for anyone with the link and copy this link
- Add the link to the url field in config.json

Configuration fields in config.json:
- url: the Google Sheets link
- keywords_to_pass: a list of keywords; offers containing any of them are skipped
- export_type: "excel", "googlesheet" or "db". With "excel" data is saved locally to an .xlsx file, with "googlesheet" data is saved to Google Sheets, and with "db" data is saved to an SQLite database and can be browsed and filtered through the local web application
- max_offer_duration_days: null or an integer (for example 5). If the value is an integer, scraped offers will not be older than the specified number of days
- websites: a list of URLs to scrape job offers from; each website can have a tag (string) to facilitate subsequent filtering of offers. Example entries:
{
"url": "https://pl.indeed.com/praca?l=Zdu%C5%84ska+Wola%2C+%C5%82%C3%B3dzkie&radius=10&sort=date&vjk=adc0ec0fd20bd577",
"tag": "Zduลska Wola"
},
{
"url": "https://pl.indeed.com/jobs?q=python&sort=date&fromage=3&vjk=a986e0ad45696cf8",
"tag": "IT"
},
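For a rough idea of how these fields come together, here is a minimal sketch of reading config.json and acting on export_type and keywords_to_pass; it is only an illustration and does not mirror the project's actual code:

```python
# Sketch of how config.json fields might be consumed; not the project's actual code.
import json


def load_config(path: str = "config.json") -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def should_skip(title: str, keywords_to_pass: list[str]) -> bool:
    # Offers whose title contains any of the configured keywords are skipped.
    return any(keyword.lower() in title.lower() for keyword in keywords_to_pass)


config = load_config()
if config["export_type"] == "excel":
    print("Saving offers to a local .xlsx file")
elif config["export_type"] == "googlesheet":
    print("Saving offers to the Google Sheets document from the 'url' field")
elif config["export_type"] == "db":
    print("Saving offers to the SQLite database")
```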
Here you can add links to job offers that should be skipped and not added to your Google Sheets document.
It should look like this, each URL on a new line:
https://www.pracuj.pl/praca/konsultant-wdrozeniowiec-systemu-obiegu-dokumentow-warszawa-poloneza-93,oferta,1003226213?s=4a77b1b9&searchId=MTcxMTM3NTM0NDY5NS4yNjcz
https://www.pracuj.pl/praca/mlodszy-analityk-biznesowy-warszawa-dzielna-60,oferta,1003211869?s=4a77b1b9&searchId=MTcxMTM3NTM0NDY5NS4yNjcz
https://www.pracuj.pl/praca/junior-devops-engineer-z-chmura-gcp-warszawa,oferta,1003220296?s=4a77b1b9&searchId=MTcxMTM3NTM0NDY5NS4yNjcz
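As a rough illustration of how such a skip list can be used (the filename below is hypothetical; use whatever file the project expects):

```python
# Illustration only: load a plain-text skip list and check offers against it.
# "urls_to_skip.txt" is a hypothetical filename, not necessarily the project's.
def load_skipped_urls(path: str) -> set[str]:
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


skipped = load_skipped_urls("urls_to_skip.txt")
offer_url = "https://www.pracuj.pl/praca/junior-devops-engineer-z-chmura-gcp-warszawa,oferta,1003220296"
if offer_url not in skipped:  # exact-match check against the listed URLs
    print("New offer, save it")
```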
pip install -r requirements.txt
If you choose "db" as the export type, apply migrations:
alembic upgrade head
python main.py
# On Windows you can run the `run.ps1` PowerShell script
python server.py
# On Windows you can run the `server.ps1` PowerShell script
I don't recommend using Docker if you decided to save your data to SQLite. I still need to refine this option; for now I recommend using Docker in combination with Google Sheets or .XLSX files.
docker build -t scraper .
docker run scraper
See LICENSE.txt for more information.