###Run Command###

	bin/nutch crawl urls -dir crawlDir -solr http://10.208.36.48:8983/solr/mycollection -depth 1 -topN 10

###Need to Check###

- The value of ParentScore is always '1.0f' in the getURLDetails(String url, String crawlDbPath) method of class Prioritizer.java.
- The current implementation already supports collation of all URLs left unfetched so far, at any depth.
- What about a One-Class classifier?
- There is an error when stopping the classifier with classifier.classify("Shutdown");
- Changed some code to support the Incremental Classifier efficiently, and created the method createIncTrainingFile(buffer.toString(),incTrainingFile); which always creates one separate training file containing the data for newly classified URLs.
- Where do we change the score of the URLs stored in CrawlDb, so that the Generator can use this score to choose the topN URLs in the next iteration?
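To make the last question concrete, here is a minimal sketch of what "choose the topN URLs by score" means. It does not use Nutch classes: ScoredUrl and TopNSketch are hypothetical stand-ins for a CrawlDb entry and the selection step. In stock Nutch 1.x the Generator's ordering normally comes from the active scoring filter (the generatorSortValue method of the ScoringFilter interface), so that is a likely place to look; whether this project hooks in there is not confirmed by these notes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one CrawlDb entry; not a Nutch class.
final class ScoredUrl {
    final String url;
    final float score;

    ScoredUrl(String url, float score) {
        this.url = url;
        this.score = score;
    }
}

final class TopNSketch {
    // Returns at most topN URLs, ordered from highest to lowest score.
    static List<String> selectTopN(List<ScoredUrl> entries, int topN) {
        List<ScoredUrl> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingDouble((ScoredUrl e) -> e.score).reversed());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, sorted.size()); i++) {
            result.add(sorted.get(i).url);
        }
        return result;
    }

    public static void main(String[] args) {
        List<ScoredUrl> entries = new ArrayList<>();
        entries.add(new ScoredUrl("http://example.com/a", 0.2f));
        entries.add(new ScoredUrl("http://example.com/b", 0.9f));
        entries.add(new ScoredUrl("http://example.com/c", 0.5f));
        System.out.println(selectTopN(entries, 2));
    }
}
```

If every entry carries the same score (for example a constant 1.0f), this ordering cannot prefer relevant URLs, which is why the score written to CrawlDb matters.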

We did not update the posUrlDetailBase and negUrlDetailBase static variables at each depth, so I added the following code.

	//** Updating Positive Feature Pool
	URLDetails.posUrlDetailBase.getAnchorTextWords().addAll(details.getAnchorTextWords());
	URLDetails.posUrlDetailBase.getParentURLTokens().addAll(details.getParentURLTokens());
	URLDetails.posUrlDetailBase.getUrlTokens().addAll(details.getUrlTokens());
		
	//** Updating Negative Feature Pool
	URLDetails.negUrlDetailBase.getAnchorTextWords().addAll(urlDetails.getAnchorTextWords());
	URLDetails.negUrlDetailBase.getParentURLTokens().addAll(urlDetails.getParentURLTokens());
	URLDetails.negUrlDetailBase.getUrlTokens().addAll(urlDetails.getUrlTokens());
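
The two updates above follow the same pattern, so they could be wrapped in one helper. The sketch below is only an illustration: UrlFeatures is a hypothetical, simplified stand-in for URLDetails, and it assumes the three feature collections are sets. If the real class uses lists, addAll would accumulate duplicates at every depth.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical, simplified stand-in for the project's URLDetails class.
final class UrlFeatures {
    private final Set<String> anchorTextWords = new LinkedHashSet<>();
    private final Set<String> parentURLTokens = new LinkedHashSet<>();
    private final Set<String> urlTokens = new LinkedHashSet<>();

    Set<String> getAnchorTextWords() { return anchorTextWords; }
    Set<String> getParentURLTokens() { return parentURLTokens; }
    Set<String> getUrlTokens() { return urlTokens; }

    // Merges every feature of 'other' into this pool.
    void mergeFrom(UrlFeatures other) {
        anchorTextWords.addAll(other.getAnchorTextWords());
        parentURLTokens.addAll(other.getParentURLTokens());
        urlTokens.addAll(other.getUrlTokens());
    }
}

final class FeaturePoolSketch {
    // Counterparts of posUrlDetailBase and negUrlDetailBase.
    static final UrlFeatures posPool = new UrlFeatures();
    static final UrlFeatures negPool = new UrlFeatures();

    // Adds the features of one classified URL to the matching pool.
    static void updatePools(UrlFeatures details, boolean positive) {
        (positive ? posPool : negPool).mergeFrom(details);
    }

    public static void main(String[] args) {
        UrlFeatures details = new UrlFeatures();
        details.getUrlTokens().add("nutch");
        details.getAnchorTextWords().add("crawler");
        updatePools(details, true);
        System.out.println(posPool.getUrlTokens());
    }
}
```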

The message "no link information." from Prioritizer appears many times, even when we run the crawler with -depth 2 -topN 10. This suggests that there is no parent (inlinks) information in crawldb.

###Tasks###

| Date | Task | Target | Status |
|---|---|---|---|
| 09 May | Understand seed url concept | 09 May | Pending |
| 09 May | Check Score gets reflected in crawldb | 09 May | Pending |
| 09 May | Working of Classifier | 09 May | Pending |
| 09 May | URL Score Computation | 10 May | Pending |
| 09 May | Positive and Negative Examples getting reflected | 10 May | Pending |
| 09 May | Applying Stop Word Removal | 11 May | Pending |
| 09 May | Testing | 11 May | Pending |

###Code Update###

Commented out the old line in method updatePosBaseSetUsingDepth(Configuration configuration) of class BaseSetLoader.java and replaced it with the new one:
```
	// Old State
	details.setFinalScore(-1.0f);
	
	// New State
	details.setFinalScore(1.0f);
```
Commented out the following lines in method updatePosBaseSetUsingDepth(Configuration configuration) of class BaseSetLoader.java:
```
	ind = line.indexOf("]") + 1;
	line = line.substring(ind);
	line = line.trim();
```
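
For reference, the effect of the three disabled lines can be seen in isolation. The sketch below wraps them in a hypothetical helper, stripBracketPrefix. The sample input is made up, since the actual line format read by BaseSetLoader is not shown in these notes.

```java
final class PrefixStripSketch {
    // Drops everything up to and including the first "]" and trims the rest.
    // When the line has no "]", indexOf returns -1, so ind is 0 and the
    // whole line is kept (only trimmed).
    static String stripBracketPrefix(String line) {
        int ind = line.indexOf("]") + 1;
        line = line.substring(ind);
        return line.trim();
    }

    public static void main(String[] args) {
        // Hypothetical input format, for illustration only.
        System.out.println(stripBracketPrefix("[some prefix] http://example.com/"));
    }
}
```

With these lines commented out, any such bracketed prefix stays part of the line that is processed afterwards.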

###Steps###

1. Added directory resources/ in the project root directory.
   - A few files, such as negativeSet.txt and positiveSet.txt, were also copied.
   - We have to create a training file (train.txt) for the initial training of the Classifier.
   - appendTrainingData(buffer.toString(),trainingFile); writes the positive and negative examples to the train.txt file.
2. Added directory ClassifierModels in the project root directory.
   - No files were added.
   - This directory is used to store the train.txt, model, and model.idx files.
   - The train.txt file is created by the appendTrainingData call in step 1.
   - The model and model.idx files are created when the Classifier is actually trained, i.e. by classifier.learn(trainingFile);
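
The difference between the two training-file helpers mentioned in these notes can be sketched as follows. This is not the project's code: the method names come from the notes, but the String path parameter, the UTF-8 encoding, and the overwrite behaviour of createIncTrainingFile are assumptions. The format of the training lines is left to the caller.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class TrainingFileSketch {
    // Appends data to the training file, creating it (and its directory) if needed.
    static void appendTrainingData(String data, String trainingFile) {
        write(data, trainingFile, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Assumption: the incremental file holds only the newest batch,
    // so any previous content is discarded on each call.
    static void createIncTrainingFile(String data, String incTrainingFile) {
        write(data, incTrainingFile, StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE);
    }

    private static void write(String data, String file, OpenOption... options) {
        try {
            Path path = Paths.get(file);
            if (path.getParent() != null) {
                Files.createDirectories(path.getParent());
            }
            Files.write(path, data.getBytes(StandardCharsets.UTF_8), options);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Uses a temporary directory instead of ClassifierModels to avoid side effects.
        Path dir = Files.createTempDirectory("ClassifierModels");
        appendTrainingData("first batch\n", dir.resolve("train.txt").toString());
        appendTrainingData("second batch\n", dir.resolve("train.txt").toString());
        createIncTrainingFile("second batch\n", dir.resolve("incTrain.txt").toString());
        System.out.println(Files.readAllLines(dir.resolve("train.txt")).size());
    }
}
```

Keeping the incremental file separate means the classifier can be updated from the new examples alone, while train.txt continues to accumulate the full history.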
