
πŸ™ re-employment-kraken

[Image: DALL·E vision of re-employment-kraken. Courtesy of DALL·E by OpenAI 😍]

re-employment-kraken scrapes (job) sites, remembers what it saw and notifies downstream systems of any new sightings.

Table of Contents

  • Features
  • Background
  • Usage
  • Miscellaneous
  • Known Issues
  • Links

Features

  • Scrape search results from multiple websites via different 'strategies'
  • Able to use multiple search queries
  • Handles pagination of search results (if crawlable)
  • Keeps track of what it has seen (helpfully brings its own 'database')
  • Sends notifications to:
    • stdout
    • Your macOS Notification Center
    • Slack
    • Telegram chat with your bot
    • E-Mail (not yet implemented, good first issue, see #3)
  • Creates cards on Kanban boards in:
    • Notion
    • Trello (not yet implemented, good first issue, see #2)
    • Jira (not yet implemented, good first issue, see #1)
  • Runs anywhere you can run Node.js and cron jobs

Background

I am a freelancer looking for a new project, and I realised that cycling through many different job sites each day will probably not be fun. Automating things on the other hand? Lots of fun! 😍

I am a techie looking for a freelance gig (project) in the European/German market, which is why I picked these sites. So far, there are strategies for scraping several recruitment companies' job sites.

Of course you can use it to scrape other sites too, because your situation may be different and these sites may not be useful to you. Just get a friend who has some dev chops to help you write some strategies - it's really easy, I promise!

Actually though... you can use it to scrape anything!

You've been bouncing between the same 6 sites for weeks to find a sweet deal on that new used car you've been eyeing? re-employment-kraken to the rescue! Want to be first in line when a popular part is back in stock on one of your favourite bicycle supply sites? re-employment-kraken has your back!

πŸ™

Usage

Getting Started

Ideally, you should run re-employment-kraken on a server somewhere so it can keep running 24/7. But honestly, just running it on your laptop is probably good enough. It will just pick up any changes on the target sites as soon as you open the lid.

First though, you will probably want to write some strategies for your use case. Clone the repo:

git clone git@github.com:uschtwill/re-employment-kraken.git && cd re-employment-kraken

Install dependencies:

npm install

Have a look at config.js and enable the options and the scraping and notification strategies that you want to use. You will need a .env file with secrets for some of them - have a look at .example.env to see what's available.
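
For orientation, a minimal .env could look something like the sketch below. DATABASE_ENABLED and DATABASE_FILE_PATH are mentioned elsewhere in this README; the remaining keys and all values are placeholders, so check .example.env for the actual variable names:

# persistence (both variables referenced in this README)
DATABASE_ENABLED=true
DATABASE_FILE_PATH=./re-employment-kraken.db

# notification secrets - placeholder names, see .example.env for the real keys
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
TELEGRAM_BOT_TOKEN=123456:replace-with-your-bot-token
TELEGRAM_USER_ID=123456789
NOTION_API_KEY=secret_...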

Writing Strategies

Writing strategies is easy.

Basically you just have to inspect the source code of the site you want to scrape and find the CSS classes and IDs ('selectors') to tell re-employment-kraken where to look for the relevant information.

Specifically you are interested in the HTML making up a single search result.

The CSS selector identifying one of these goes into the getSingleResult function. Furthermore you will need to specify selectors to get the title (getResultTitle) and the link to the detail page of that result (getResultHref).

re-employment-kraken uses the cheerio package for parsing the HTML and anything related to the DOM, so for some more involved cases it can be useful to check out their docs ("Traversing the DOM").

But just having a look at the example and the existing strategies should give you a good idea of what is possible and how to get started. Suffice it to say that these getters are just normal functions, so you can do pretty much anything in there.
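
To make that concrete, here is a minimal sketch of what a strategy could look like. The three getter names are the ones this README describes, but the module shape, the function signatures and the selectors are illustrative assumptions - compare with the existing strategies in the repo for the actual contract:

// strategies/example.js - illustrative sketch only, signatures are assumed
export const exampleStrategy = {
  name: "example-jobs", // hypothetical site identifier
  url: "https://jobs.example.com/search?query=", // hypothetical search URL

  // cheerio selector matching the element that wraps one search result
  getSingleResult: ($) => $("li.search-result"),

  // extracts the job title from a single result element
  getResultTitle: ($, result) => $(result).find("h3.job-title").text().trim(),

  // extracts the link to the result's detail page
  getResultHref: ($, result) => $(result).find("a.job-link").attr("href"),
};

Here $ is the cheerio instance for the fetched page; since the getters are plain functions, you could just as well post-process the text or resolve relative URLs in there.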

Running Natively

So how do you actually use it? Assuming Node.js is installed, you can simply execute:

npm start

This runs the scraper once and exits. See Running Periodically if you want to fetch more frequently.

Running with Docker

When deploying the application to a server, using a container is preferable since it brings all the required dependencies and runs in isolation. For this purpose, you can find a Dockerfile and a compose.yml in the repository.

Assuming you are in the project's root directory and Docker is installed, you can build the container image like this:

docker build -t re-employment-kraken:latest .

In order to run the container successfully, you need to provide the following files as a volume (an example invocation follows the list):

  • Required: Your configuration file .env.
  • Conditional: The directory that contains your SQLite database file as specified in DATABASE_FILE_PATH. If you are starting with a fresh DB, the DB file does not need to exist yet. It will be created automatically. However, the target directory must be mounted to preserve the database between container runs.
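
If you would rather not use Compose, a plain docker run with both mounts might look like the sketch below. The /app paths inside the container are an assumption - check the Dockerfile and compose.yml for the actual locations:

docker run --rm -v "$(pwd)/.env:/app/.env" -v "$(pwd)/data:/app/data" re-employment-kraken:latest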

The easiest way to run the container is to use the included compose.yml file, which assumes default paths. Otherwise, you can use the file as a template for configuring your volumes.

docker compose up re-employment-kraken

This runs the scraper once and exits. See Running Periodically if you want to fetch more frequently.

Running Periodically

To run the application regularly (which makes it useful), create a cron job. You can also do this on your laptop.

Open your crontab with:

crontab -e

Copy-paste one of these in there, but change the path and cron expression as needed:

For running natively every hour:

0 * * * * cd /absolute/path/to/the/directory && node index.js >> cron.log 2>&1

For running with Docker every hour:

0 * * * * cd /absolute/path/to/the/directory && docker compose up re-employment-kraken >> cron.log 2>&1

Quick explanation: 0 * * * * makes it run every hour at minute zero (see cron syntax), and >> cron.log 2>&1 logs both stdout and stderr to the cron.log file. You can adapt the cron expression as needed, but be careful not to run it too frequently, as you might otherwise experience rate limiting or other blocking behavior.

Being able to inspect the logs is nice, because honestly, you may have to fiddle a bit to get this line right - it really depends on your system. I may write a script that does this reliably at some point, but at the moment I don't even know if anyone will use this ever... so yeah.

If the crontab user doesn't have node in its PATH, for instance, use which node to find the path to your node binary and substitute the full path in lieu of just node in the crontab.
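
For example, if which node prints /usr/local/bin/node (your path will differ), the native line becomes:

0 * * * * cd /absolute/path/to/the/directory && /usr/local/bin/node index.js >> cron.log 2>&1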

You'll figure it out. 😅

Miscellaneous

Regarding Persistence/State

SQLite is used to handle persistence and deduplication. A single database file named re-employment-kraken.db is written to the application's root directory when DATABASE_ENABLED is active. If you want to preserve previously seen jobs, please keep this file intact and consider a backup strategy. However, if you want to have a fresh start, feel free to delete the file or turn DATABASE_ENABLED off. In the latter case, an in-memory SQLite instance will be used for deduplicating jobs during a single application run.
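
Both operations from the paragraph above in shell form, assuming you are in the application's root directory and use the default database file name:

# keep previously seen jobs safe
cp re-employment-kraken.db re-employment-kraken.db.bak

# or wipe the slate clean
rm re-employment-kraken.db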

Setting up the Notion Integration

See this standalone document for guidance on how to set up the Notion integration. If you want to customize your Notion integration (other properties, etc.), have a look at the "Links" section below.

Setting up the Telegram Bot Integration

See the official Telegram documentation on how to create a new bot via BotFather. Configure the token provided during bot creation in your .env file and set your Telegram user ID accordingly. If you don't know your user ID, send a message to userinfobot. Finally, start a chat with your newly created bot, as users need to initiate bot contact before they can receive any messages. Note that the bot you created will not react to your messages. Instead, it will send you new projects that have been found while running this software.

Known Issues

Cloudflare Web Application Firewall (WAF)

Some sites are protected from bots by technology like the Cloudflare WAF, which uses various measures to keep scrapers and crawlers out. There are some ways to sidestep protection like this, but it certainly complicates things and I am also not too sure about the legality of doing so.

See #4

Requires JS to Run

Some sites need JS to be enabled to work properly. A solution could be the same as for WAFs, see #4.

Search Query not Settable via URL Path

For some sites, the search query cannot be set via the URL path.

Cumbersome Search Engines

This crawler so far depends on search queries being settable via the URL path. It also helps if pagination is implemented in a standard way. Right now, from where I am standing, if it's a fancy search engine implementation, it's just not worth the time to write custom code just for that single case.

Search Results not Crawlable

Some sites implement search result pagination in a non-standard way. One such example is a site that injects the URL in a click handler on the "next page" button instead of just using a standard HTML link. This would need some extra effort to account for. Not today.

In this case re-employment-kraken will only fetch the results from the first page. Depending on how narrow or broad the search queries are, and how often you crawl, this may or may not be a problem.

Links
