Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crawl history does not reflect real number of papers ingested in CiteSeer #26

Open
fanchyna opened this issue Jan 23, 2015 · 0 comments

Comments

@fanchyna
Copy link
Contributor

In the crawl web site, the crawl history page shows the proportion of documents "In System", "Crawled" and "Fail to Convert", but the "In System" documents just means documents are extracted, but not necessarily mean they are ingested, i.e., documents may in the waiting list. And because of the significant speed difference between ingestion and extraction, the waiting list can be long. Therefore, we need the fourth parameter reflecting the real number of papers ingested. This can be done in three steps
(1) add a new flag in the "state" field in citeseerx_crawl.main_crawl_document table to indicate ingested papers;
(2) update view.py, adding "ingested_count" and calculate it in some way (either dynamically from the production database, or from the crawling, or from a database dump);
(3) update template, adding "ingested_count" in the displayed graph.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant