Courtesy of DALL·E by OpenAI
re-employment-kraken
scrapes (job) sites, remembers what it saw and notifies downstream systems of any new sightings.
- re-employment-kraken
- Scrape search results from multiple websites via different 'strategies'
- Able to use multiple search queries
- Handles pagination of search results (if crawlable)
- Keeps track of what it has seen (helpfully brings its own 'database')
- Sends notifications to:
  - `stdout`
  - Your macOS notification center
  - Slack
  - Telegram chat with your bot
  - E-Mail (not yet implemented, good first issue, see #3)
- Creates cards on Kanban boards in:
  - Notion
- Runs anywhere you can run Node.js and `cron` jobs
I am a freelancer looking for a new project, and I realised that cycling through many different job sites each day will probably not be fun. Automating things on the other hand? Lots of fun!
I am a techie looking for a freelance gig (project) in the European/German market, which is why I picked these sites. So far there are strategies to scrape the following recruitment companies' job sites:
- ✅ freelancermap.de
- ✅ freelance.de
- ✅ Hays
- ✅ Michael Page
- ✅ Austin Fraser
- ✅ top itservices
- ⚠️ Darwin Recruitment (results not crawlable, see "Known Issues")
- 🚫 xing (requires JS to run, see "Known Issues")
- 🚫 SOLCOM (search query not settable via URL path, see "Known Issues")
- 🚫 Constaff (search query not settable via URL path, see "Known Issues")
- 🚫 Gulp (requires JS to run, see "Known Issues")
- 🚫 Avantgarde Experts (WAF, see "Known Issues")
- 🚫 Progressive Recruitment (Cloudflare WAF, see "Known Issues")
- 🚫 Computer Futures (Cloudflare WAF, see "Known Issues")
- 🚫 etengo (cumbersome search engine, see "Known Issues")
Of course you can use it to scrape other sites too, because your situation may be different and these sites may not be useful to you. Just get a friend who has some dev chops to help you write some strategies - it's really easy, I promise!
Actually though... you can use it to scrape anything!
You've been bouncing between the same 6 sites for weeks to find a sweet deal for that new used car you've been eyeing? `re-employment-kraken` to the rescue! Want to be first in line when a popular part is back in stock on one of your favourite bicycle supply sites? `re-employment-kraken` has your back!
Ideally, you should run `re-employment-kraken` on a server somewhere so it can keep running 24/7. But honestly, just running it on your laptop is probably good enough. It will just pick up any changes on the target sites as soon as you open the lid.
First though, you will probably want to write some strategies for your use case. Clone the repo:
git clone git@github.com:uschtwill/re-employment-kraken.git && cd re-employment-kraken
Install dependencies:
npm install
Have a look at `config.js` and enable the options and the scraping and notification strategies that you want to use. You will need a `.env` file with secrets for some of them - have a look at `.example.env` to see what's available.
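For orientation, enabling things in `config.js` boils down to flipping a few flags and listing your search queries. The sketch below is purely illustrative - the actual option names and structure live in `config.js` and `.example.env` and may differ (the Slack and Telegram variable names here are assumptions):

```js
// Hypothetical sketch of a config - check config.js and .example.env for the real option names.
export const config = {
  queries: ["devops", "kubernetes"], // search terms run against every enabled site
  strategies: {
    freelancermap: true,
    hays: true,
  },
  notifications: {
    stdout: true,
    slack: Boolean(process.env.SLACK_WEBHOOK_URL),       // assumed env var name
    telegram: Boolean(process.env.TELEGRAM_BOT_TOKEN),   // assumed env var name
  },
  database: {
    enabled: process.env.DATABASE_ENABLED === "true",
    filePath: process.env.DATABASE_FILE_PATH,
  },
};
```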
Writing strategies is easy. Basically you just have to inspect the source code of the site you want to scrape and find the CSS classes and IDs ('selectors') to tell `re-employment-kraken` where to look for the relevant information. Specifically, you are interested in the HTML making up a single search result. The CSS selector identifying one of these goes into the `getSingleResult` function. Furthermore, you will need to specify selectors to get the title (`getResultTitle`) and the link to the detail page of that result (`getResultHref`).
`re-employment-kraken` uses the `cheerio` package to scrape the HTML and anything related to the DOM, so for some more involved cases it can be useful to check out their docs ("Traversing the DOM"). But just having a look at the example and the existing strategies should give you a good idea of what is possible and how to get started. Suffice it to say that these getters are just normal functions, so you can do pretty much anything in there.
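To make that concrete, a strategy might look roughly like the sketch below. The selectors, the `buildUrl` helper and the exact getter signatures are assumptions for illustration - use the existing strategies in the repo as the authoritative reference for the expected shape:

```js
// Illustrative strategy sketch - selectors and signatures are made up.
export const exampleStrategy = {
  name: "example-jobs",
  // The search query goes into the URL; pagination via a page parameter (if crawlable).
  buildUrl: (query, page = 1) =>
    `https://jobs.example.com/search?q=${encodeURIComponent(query)}&page=${page}`,
  // Selector matching one search result card on the listing page.
  getSingleResult: ($) => $("li.search-result"),
  // Pull the job title out of a single result element.
  getResultTitle: ($, element) => $(element).find("h2.job-title").text().trim(),
  // Pull the link to the result's detail page.
  getResultHref: ($, element) => $(element).find("a.job-link").attr("href"),
};
```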
So how do you actually use it? Assuming Node.js is installed, you can simply execute:
npm start
This runs the scraper once and exits. See Running Periodically if you want to fetch more frequently.
When deploying the application to a server, using a container is preferable since it brings all the required dependencies and runs in isolation.
For this purpose, you can find a `Dockerfile` and a `compose.yml` in the repository.
Assuming you are in the project's root directory and Docker is installed, you can build the container image like this:
docker build -t re-employment-kraken:latest .
In order to run the container successfully, you need to provide the following files as volumes:
- Required: your configuration file `.env`.
- Conditional: the directory that contains your SQLite database file as specified in `DATABASE_FILE_PATH`. If you are starting with a fresh DB, the DB file does not need to exist yet; it will be created automatically. However, the target directory must be mounted to preserve the database between container runs.
The easiest way to run the container is to use the included `compose.yml` file, which assumes default paths. Otherwise, you can use the file as a template for configuring your volumes.
docker compose up re-employment-kraken
This runs the scraper once and exits. See Running Periodically if you want to fetch more frequently.
To run the application regularly (which makes it useful), create a `cron` job. You can also do this on your laptop.
Open your `crontab` with:
crontab -e
Copy paste this in there, but change the path and cron expression as needed:
For running natively every hour:
0 * * * * cd /absolute/path/to/the/directory && node index.js >> cron.log 2>&1
For running with Docker every hour:
0 * * * * cd /absolute/path/to/the/directory && docker compose up re-employment-kraken >> cron.log 2>&1
Quick explanation: `0 * * * *` makes it run every hour at minute zero (see cron syntax), and `>> cron.log 2>&1` logs both `stdout` and `stderr` to the `cron.log` file. You can adapt the cron expression as needed. However, be careful not to run it too frequently, as you might experience rate limiting or other blocking behavior otherwise.
Being able to inspect the logs is nice, because honestly, you may have to fiddle a bit to get this line right - it really depends on your system. I may write a script that does this reliably at some point, but at the moment I don't even know if anyone will use this ever... so yeah.
If the crontab user doesn't have `node` in its path, for instance, use `which node` to find the path to your node binary and substitute the full path in lieu of just `node` in the `crontab`.
You'll figure it out.
SQLite is used to handle persistence and deduplication. A single database file named `re-employment-kraken.db` is written to the application's root directory when `DATABASE_ENABLED` is active. If you want to preserve previously seen jobs, please keep this file intact and consider a backup strategy. However, if you want to have a fresh start, feel free to delete the file or turn `DATABASE_ENABLED` off. In the latter case, an in-memory SQLite instance will be used for deduplicating jobs during a single application run.
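Conceptually, the deduplication amounts to "store the job if its link has not been seen yet". A rough sketch of that idea, using the `better-sqlite3` package purely for illustration (the project's actual database layer may be implemented differently):

```js
// Conceptual sketch of SQLite-backed deduplication - not the repo's real DB code.
import Database from "better-sqlite3";

// Falls back to an in-memory database when no file path is configured.
const db = new Database(process.env.DATABASE_FILE_PATH ?? ":memory:");
db.exec("CREATE TABLE IF NOT EXISTS jobs (href TEXT PRIMARY KEY, title TEXT)");

// Returns true (and stores the job) only if this href has never been seen before.
function isNewJob(job) {
  const seen = db.prepare("SELECT 1 FROM jobs WHERE href = ?").get(job.href);
  if (seen) return false;
  db.prepare("INSERT INTO jobs (href, title) VALUES (?, ?)").run(job.href, job.title);
  return true;
}
```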
See this standalone document for guidance on how to set up the Notion integration. If you want to customize your Notion integration (other properties etc), have a look at the "Links" section below.
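As a rough idea of what such a customization involves, creating a card via the official Notion SDK (`@notionhq/client`) looks roughly like the sketch below. The env variable names and the property names ("Name", "Link") are assumptions - they have to match your own database's properties and your `.env` setup:

```js
// Rough sketch of creating a Notion card - property and env variable names are assumptions.
import { Client } from "@notionhq/client";

const notion = new Client({ auth: process.env.NOTION_API_KEY });

async function createCard(job) {
  await notion.pages.create({
    parent: { database_id: process.env.NOTION_DATABASE_ID },
    properties: {
      Name: { title: [{ text: { content: job.title } }] }, // the board's title property
      Link: { url: job.href },                              // a URL property for the detail page
    },
  });
}
```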
See the official Telegram documentation on how to create a new bot via `BotFather`. Configure the token provided during bot creation in your `.env` file and set your Telegram user ID accordingly. If you don't know your user ID, send a message to `userinfobot`. Finally, start a chat with your newly created bot, as users need to initiate bot contact before they can receive any messages. Note that the bot you created will not react to your messages. Instead, it will send you new projects that have been found while running this software.
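Under the hood, sending such a notification is a single call to the Bot API's `sendMessage` endpoint. A minimal sketch, where `TELEGRAM_BOT_TOKEN` and `TELEGRAM_CHAT_ID` are placeholder names (check `.example.env` for the names the project actually expects):

```js
// Minimal sketch of a Telegram notification via the Bot API - env variable names are placeholders.
async function notifyTelegram(text) {
  const url = `https://api.telegram.org/bot${process.env.TELEGRAM_BOT_TOKEN}/sendMessage`;
  const response = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ chat_id: process.env.TELEGRAM_CHAT_ID, text }),
  });
  if (!response.ok) {
    throw new Error(`Telegram API responded with ${response.status}`);
  }
}
```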
Some sites are protected from bots by technology like the Cloudflare WAF, which uses various measures to keep scrapers and crawlers out. There are some ways to sidestep protection like this, but it certainly complicates things and I am also not too sure about the legality of doing so.
See #4
Some sites need JS to be enabled to work properly. Solution could be the same as for WAFs, see #4.
For some sites the search query cannot be set via the URL path.
This crawler so far depends on search queries being settable via the URL path. It also helps if pagination is implemented in a standard way. Right now, from where I am standing, if it's a fancy search engine implementation, it's just not worth the time to write custom code just for that single case.
Some sites implement search result pagination in a non-standard way. One such example is a site injecting the URL from a click handler when the "next page" button is clicked, instead of just using a standard HTML link. This would need some extra effort to account for. Not today.
In this case `re-employment-kraken` will only fetch the results from the first page. Depending on how narrow or broad the search queries are, and how often you crawl, this may or may not be a problem.