###Run Command###

    bin/nutch crawl urls -dir crawlDir -solr http://10.208.36.48:8983/solr/mycollection -depth 1 -topN 10
###Need to Check###
- The value of ParentScore is always `1.0f` in the `getURLDetails(String url, String crawlDbPath)` method of class `Prioritizer.java`.
- The current implementation already supports collating the URLs left unfetched (`un_fetched`) so far, at any depth.
- What about a One-Class classifier?
- There is an error when stopping the classifier with `classifier.classify("Shutdown");`.
- Changed some code to support the Incremental Classifier efficiently and created one method, `createIncTrainingFile(buffer.toString(), incTrainingFile);`, which always creates a separate training file containing the data for newly classified URLs.
- Where are we changing the score of the URLs stored in CrawlDb, so that the Generator can use this score to choose the topN URLs in the next iteration?
- We did not update the `posUrlDetailBase` and `negUrlDetailBase` static variables for each depth, so I added the following code:

        //** Updating Positive Feature Pool
        URLDetails.posUrlDetailBase.getAnchorTextWords().addAll(details.getAnchorTextWords());
        URLDetails.posUrlDetailBase.getParentURLTokens().addAll(details.getParentURLTokens());
        URLDetails.posUrlDetailBase.getUrlTokens().addAll(details.getUrlTokens());
        //** Updating Negative Feature Pool
        URLDetails.negUrlDetailBase.getAnchorTextWords().addAll(urlDetails.getAnchorTextWords());
        URLDetails.negUrlDetailBase.getParentURLTokens().addAll(urlDetails.getParentURLTokens());
        URLDetails.negUrlDetailBase.getUrlTokens().addAll(urlDetails.getUrlTokens());

- A lot of `no link information. in Prioritizer` messages pop up even when we run the crawler with `-depth 2 -topN 10`. This indicates that there is no parent (inlinks) information in crawldb.
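
The per-depth feature pool update noted in the list above (the `posUrlDetailBase` / `negUrlDetailBase` item) can be sketched in isolation as follows. This is only a minimal illustration, not the project's code: `UrlDetailsSketch`, `FeaturePoolSketch`, `mergeFrom` and `updatePools` are hypothetical names, and storing the features as `Set<String>` is an assumption (the real `URLDetails` class may use other collection types).

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the project's URLDetails class. Only the three
// getters used in the notes are modelled; Set-valued fields are an assumption.
class UrlDetailsSketch {
    private final Set<String> anchorTextWords = new LinkedHashSet<>();
    private final Set<String> parentURLTokens = new LinkedHashSet<>();
    private final Set<String> urlTokens = new LinkedHashSet<>();

    Set<String> getAnchorTextWords() { return anchorTextWords; }
    Set<String> getParentURLTokens() { return parentURLTokens; }
    Set<String> getUrlTokens() { return urlTokens; }

    // Hypothetical helper: merge another URL's features into this pool.
    void mergeFrom(UrlDetailsSketch other) {
        anchorTextWords.addAll(other.getAnchorTextWords());
        parentURLTokens.addAll(other.getParentURLTokens());
        urlTokens.addAll(other.getUrlTokens());
    }
}

class FeaturePoolSketch {
    // Static pools shared across depths, mirroring posUrlDetailBase / negUrlDetailBase.
    static final UrlDetailsSketch posUrlDetailBase = new UrlDetailsSketch();
    static final UrlDetailsSketch negUrlDetailBase = new UrlDetailsSketch();

    // Intended to be called for every classified URL at every depth,
    // so the pools keep growing instead of staying at their initial state.
    static void updatePools(UrlDetailsSketch details, boolean isPositive) {
        (isPositive ? posUrlDetailBase : negUrlDetailBase).mergeFrom(details);
    }

    public static void main(String[] args) {
        UrlDetailsSketch details = new UrlDetailsSketch();
        details.getAnchorTextWords().addAll(Arrays.asList("open", "source"));
        details.getUrlTokens().add("nutch");
        updatePools(details, true);
        System.out.println(posUrlDetailBase.getAnchorTextWords());
    }
}
```

Keeping the merge in one helper means the positive and negative updates cannot drift apart, which is easy to happen when the six `addAll` calls are written out by hand.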
###Tasks###
Date | Task | Target | Status |
---|---|---|---|
09 May | Understand seed url concept | 09 May | Pending |
09 May | Check Score gets reflected in crawldb | 09 May | Pending |
09 May | Working of Classifier | 09 May | Pending |
09 May | URL Score Computation | 10 May | Pending |
09 May | Positive and Negative Examples getting reflected | 10 May | Pending |
09 May | Applying Stop Word Removal | 11 May | Pending |
09 May | Testing | 11 May | Pending |
###Code Update###
- Commented out a line of code in class `BaseSetLoader.java`, in method `updatePosBaseSetUsingDepth(Configuration configuration)`, and replaced it:

        // Old State
        details.setFinalScore(-1.0f);
        // New State
        details.setFinalScore(1.0f);

- Commented out the following lines of code in class `BaseSetLoader.java`, in method `updatePosBaseSetUsingDepth(Configuration configuration)`:

        ind = line.indexOf("]") + 1;
        line = line.substring(ind);
        line = line.trim();

- Added directory `resources/` in the project root directory
  - A few files such as `negativeSet.txt` and `positiveSet.txt` were also copied.
  - We have to create a train file (`train.txt`) for the initial training of the Classifier.
  - `appendTrainingData(buffer.toString(), trainingFile);` writes the Positive and Negative examples to the `train.txt` file.
- Added directory `ClassifierModels` in the project root directory
  - No files were added.
  - This directory is used to store the `train.txt`, `model`, and `model.idx` files.
  - The `train.txt` file is created by step 1(c), i.e. by the `appendTrainingData(...)` call.
  - The `model` and `model.idx` files are created while actually training the Classifier, i.e. by `classifier.learn(trainingFile);`
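
The difference between the two training-file helpers mentioned in these notes can be sketched as below. This is a guess at the intended behaviour rather than the project's implementation: the assumption is that `appendTrainingData` appends to `train.txt`, while `createIncTrainingFile` overwrites its target so that it only ever holds the newly classified URLs. The `Path` parameter type, the file name `incTrain.txt`, and the example line format are placeholders chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class TrainingFileSketch {
    // Assumed behaviour: add the new examples to the end of the training file,
    // creating the file on first use.
    static void appendTrainingData(String data, Path trainingFile) throws IOException {
        Files.write(trainingFile, data.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Assumed behaviour: replace any previous content, so the file contains
    // only the examples from the most recent classification round.
    static void createIncTrainingFile(String data, Path incTrainingFile) throws IOException {
        Files.write(incTrainingFile, data.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for ClassifierModels/.
        Path dir = Files.createTempDirectory("ClassifierModels");
        Path train = dir.resolve("train.txt");
        Path inc = dir.resolve("incTrain.txt"); // hypothetical file name

        appendTrainingData("+1 example-a\n", train);
        appendTrainingData("-1 example-b\n", train);
        createIncTrainingFile("+1 example-c\n", inc);
        createIncTrainingFile("-1 example-d\n", inc);

        System.out.println("train.txt lines: " + Files.readAllLines(train).size());
        System.out.println("incTrain.txt lines: " + Files.readAllLines(inc).size());
    }
}
```

With this split, the full `train.txt` keeps the complete history for retraining from scratch, while the incremental file stays small enough to feed to the classifier after each depth.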