Skip to content
This repository has been archived by the owner on Nov 1, 2023. It is now read-only.

Latest commit

 

History

History
133 lines (107 loc) · 4.87 KB

README.md

File metadata and controls

133 lines (107 loc) · 4.87 KB

crawler

Crawler browses the student portfolio websites acting as user and stores the scraped data to the database for the research purposes.

Get started

install dependencies

npm i

start crawl

npm run start

continue last crawl

CRAWLEE_PURGE_ON_START=0 npm start

Accessing scraped data

Crawler stores data to portfolio_pages table in our Supabase Postgres database.

column description
url crawled url used as unique key
data scraped data in jsonb
created_at timestamp of first creation
portfolio_id uuid of the portfolio
updated_at timestamp of last update (last crawl ended)

Scraped data structure

The scraped data are stored as json data structure, see its documentation.

Querying the data

Crawled data are stored as jsonb type in the postgres database. There is variety of build-in functions and operators to query the json or even index its parameters.

Examples

get all pages with successfully parsed post-content (the where clause excludes pages such as 404, search, archive, tag etc.)

select * from portfolio_pages where (data->>'wordpress-pagetypes')::jsonb ?|	array['single', 'page'] and json_typeof(data->'post-content') != 'null' and json_typeof(data->'post-content'->'tree') != 'null';

get the plaintext of scraped posts

select data->'post-content'->'text' as text from portfolio_pages where (data->>'wordpress-pagetypes')::jsonb ?|	array['single', 'page'] and json_typeof(data->'post-content') != 'null' and json_typeof(data->'post-content'->'tree') != 'null';

get the title, description and open graph image of scraped posts

select data->>'title' as title, data->>'description' as description, data->'og'->'image'->'imageValue' as og_image from portfolio_pages where (data->>'wordpress-pagetypes')::jsonb ?|	array['single', 'page'] and json_typeof(data->'post-content') != 'null' and json_typeof(data->'post-content'->'tree') != 'null';

get the number of posts with and without parsed post-content

below are the stats from last crawl

metric count
total_crawled 3837
total_single 1537
total_page 472
single_without_tree 0
single_with_tree 1537
single_with_tree_published_at 1442
page_without_tree 16
page_with_tree 456
with_with_tree_published_at 368
select
  count(*) as total_crawled,
  sum(1) filter (
    where (
      data->>'wordpress-pagetypes')::jsonb ?&	array['single']
    ) as total_single,
  sum(1) filter (
    where (
      data->>'wordpress-pagetypes')::jsonb ?&	array['page']
    ) as total_page,
  sum(1) filter (
    where (
      data->>'wordpress-pagetypes')::jsonb ?&	array['single']
      and json_typeof(data->'post-content') != 'null'
      and json_typeof(data->'post-content'->'tree') = 'null'
    ) as single_without_tree,
  sum(1) filter (
    where (
      data->>'wordpress-pagetypes')::jsonb ?&	array['single']
      and json_typeof(data->'post-content') != 'null'
      and json_typeof(data->'post-content'->'tree') != 'null'
    ) as single_with_tree,
  sum(1) filter (
    where (
      data->>'wordpress-pagetypes')::jsonb ?&	array['single']
      and json_typeof(data->'published-at') != 'null'
      and json_typeof(data->'post-content') != 'null'
      and json_typeof(data->'post-content'->'tree') != 'null'
    ) as single_with_tree_published_at,
  sum(1) filter (
    where (
      data->>'wordpress-pagetypes')::jsonb ?&	array['page']
      and json_typeof(data->'post-content') != 'null'
      and json_typeof(data->'post-content'->'tree') = 'null'
    ) as page_without_tree,
  sum(1) filter (
    where (
      data->>'wordpress-pagetypes')::jsonb ?&	array['page']
      and json_typeof(data->'post-content') != 'null'
      and json_typeof(data->'post-content'->'tree') != 'null'
    ) as page_with_tree,
    sum(1) filter (
    where (
      data->>'wordpress-pagetypes')::jsonb ?&	array['page']
      and json_typeof(data->'published-at') != 'null'
      and json_typeof(data->'post-content') != 'null'
      and json_typeof(data->'post-content'->'tree') != 'null'
    ) as with_with_tree_published_at
from
  portfolio_pages