forked from Norconex/collector-core
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathTODO.txt
99 lines (69 loc) · 3.83 KB
/
TODO.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
TODO:
==============
- Performance:
- Keep tack of counts by having counters in memory instead of querying
for count. And have maxXXX for different types instead of just
"maxDocuments" which can be ambiguous.
- Have options for tracking progress:
- Do not track
- Track only %
- Track detailed (what is now)
- Track full/verbose (adding counts for each states)
- Have Collector add these new default fields:
- Collector start date
- Crawler start date
- Document fetch date
- Collector Id
- Crawler Id
- Have new command line option for producing useful stats out of the crawl
store. Like the # of documents per each crawl state found in the store.
- Put back previous data store tests that now applies to CrawlReferenceService.
- Re-introduce CommitCommand? Or is it no longer applicable?
- Rename CrawlReference* to shorter CrawlObject (preffered), or CrawlItem.
- Create a MemoryDataStore for testing only
- Consider Lucene as a data store.
- Make sure command line arguments can also be specified as system properties
or env variables (or properties files?). Similar to spring boot?
- Add ability to have multiple crawlers talk to the same crawl store
for managing their queue (maybe MapReduce/Spark would be best?).
- Add support for event listeners at crawler level, which would only
get events for that crawler (since events are bubbling up, not down/sideways).
- AbstractCrawlerConfig.xsd has the anyComplexRequiredClassType "class" being
optional. See if we can make it required, except for self-closing tags.
- Rename RegexReferenceFilter here or the one in Importer module
to avoid sharing same name.
- Extract out crawl store implementations to reduce dependencies
(only keep one). and potentially redesign as well.
- Make sure validation reporting works fine (ignore vs throw exception)
- Remove the need for XXXConfigLoader?
- Add parent pom for managed dependencies.
- Add workdir to collector, and by default, have log, progress share it.
At crawler level, have all other paths relative to collector workdir,
or crawler workdir if that one is specified.
- Similar to above, maybe create a FileResource object and provide a way to
"register" it when classes need to write files, labeling them as "backup",
"delete", "keep" when crawler is done/starts. And have that managed
automatically by crawler/collector. Also have a flag on that object to
mention its scope, to say if it can be shared between threads, crawler,
all or else (multiple collectors??).
- Add <regex> to RegexReferenceFilter to match how it is done
in Importer.
- Rename RegexReferenceFilter to avoid confusion with class of the same name
in Importer.
- Add a new flag to start crawling "fresh" (deleting the crawl database).
- In Allow crawler to "expire" after configurable delay if
activeCount in AbstractCrawler#processNextReference is equal or less
than number of thread and the crawler has been running idle for too long.
- Consider adding metadata checksummer to AbstractCrawlerConfig
given there are already metadata filters. Add Metadata checksum stage as well?
- Refactor the whole approach of passing if new or modified to simplify it.
- Introduce full/incremental flag as part of collector framework
- Have document default value other than NEW (e.g. UNKNOWN, UNPROCESSED, etc)
- Consider using Hibernate for the JDBC data store, for both embedded and
client-server databases. Ship with no drivers
except maybe for testing (or 1 for convenience, like H2).
- Consider changing default base directory for logs and progress
(to avoid logs and progress appearing twice in path by default).
- Consider a way to merge documents by temporarily storing mergeable
docs in a queue until all mergable siblings are encountered.
Maybe this should be made a wrapping committer instead?