Crawl history does not reflect real number of papers ingested in CiteSeer #26

fanchyna · 2015-01-23T19:12:24Z

On the crawl web site, the crawl history page shows the proportion of documents that are "In System", "Crawled", and "Fail to Convert". However, "In System" only means that a document has been extracted, not necessarily that it has been ingested; it may still be in the waiting list. Because of the significant speed difference between ingestion and extraction, the waiting list can be long. We therefore need a fourth parameter that reflects the real number of papers ingested. This can be done in three steps:
(1) add a new flag to the "state" field in the citeseerx_crawl.main_crawl_document table to indicate ingested papers;
(2) update view.py, adding "ingested_count" and calculating it in some way (dynamically from the production database, from the crawling, or from a database dump);
(3) update the template, adding "ingested_count" to the displayed graph.
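
As a rough illustration of steps (1) and (2), here is a minimal sketch of how the per-state counts, including the new "ingested_count", might be computed. It uses an in-memory SQLite table as a stand-in for citeseerx_crawl.main_crawl_document. The numeric state codes, the function names, and the count keys other than "ingested_count" are hypothetical; the actual values used by the crawler are not given in this issue.

```python
import sqlite3

# Hypothetical state codes. The real values stored in the "state" field
# are not specified in this issue and may differ.
STATE_FAILED = -1     # "Fail to Convert"
STATE_CRAWLED = 0     # "Crawled"
STATE_EXTRACTED = 1   # "In System" (extracted, possibly still waiting)
STATE_INGESTED = 2    # proposed new flag: actually ingested


def count_by_state(conn):
    """Return a dict mapping each state code to its document count."""
    rows = conn.execute(
        "SELECT state, COUNT(*) FROM main_crawl_document GROUP BY state"
    ).fetchall()
    return dict(rows)


def crawl_history_counts(conn):
    """Build the counts a view could pass to the crawl history template."""
    counts = count_by_state(conn)
    return {
        "crawled_count": counts.get(STATE_CRAWLED, 0),
        "in_system_count": counts.get(STATE_EXTRACTED, 0),
        "failed_count": counts.get(STATE_FAILED, 0),
        "ingested_count": counts.get(STATE_INGESTED, 0),
    }


def demo():
    """Populate a stand-in table with a few rows and compute the counts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE main_crawl_document "
        "(id INTEGER PRIMARY KEY, state INTEGER)"
    )
    states = [
        STATE_CRAWLED,
        STATE_EXTRACTED,
        STATE_EXTRACTED,
        STATE_EXTRACTED,
        STATE_INGESTED,
        STATE_FAILED,
    ]
    conn.executemany(
        "INSERT INTO main_crawl_document (state) VALUES (?)",
        [(s,) for s in states],
    )
    return crawl_history_counts(conn)
```

In this sketch an ingested document leaves the "In System" bucket. Whether "In System" should instead keep counting ingested documents is a design choice that the issue leaves open.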
