CDX Indexing failing on weird data #105

Open
anjackson opened this issue Nov 17, 2022 · 33 comments

Comments

@anjackson

anjackson commented Nov 17, 2022

The CDX backfill is hitting problems. When submitting to OutbackCDX, we see:

Exception: Failed with 400 Bad Request
At line: uk,gov,bracknell-forest,democratic)/mgmeetingattendance.aspx?id=3246 20180614210133 https://democratic.bracknell-forest.gov.uk/mgMeetingAttendance.aspx?ID=3246 - html> VSNDFFRW22AHHOIFLZDLLMITL3O2JTGO - - 2479 598530890 /heritrix/output/warcs/weekly/20180611080023/BL-20180614200107699-01321-63~ukwa-h3-pulse-weekly~8443.warc.gz
java.lang.NumberFormatException: For input string: "html>"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:580)
        at java.lang.Integer.parseInt(Integer.java:615)
        at outbackcdx.Capture.fromCdxLine(Capture.java:224)
        at outbackcdx.Webapp.post(Webapp.java:242)
        at outbackcdx.Webapp.lambda$new$3(Webapp.java:95)
        at outbackcdx.Web$Route.handle(Web.java:301)
        at outbackcdx.Web$Router.handle(Web.java:225)
        at outbackcdx.Webapp.handle(Webapp.java:584)
        at outbackcdx.Web$Server.serve(Web.java:52)
        at java.lang.Thread.run(Thread.java:745)

For comparison, a good CDX line looks like this:

uk,gov,bracknell-forest,democratic)/mglistdeclarationsofinterest.aspx?uid=1058 20180626012653 http://democratic.bracknell-forest.gov.uk/mgListDeclarationsOfInterest.aspx?UID=1058 text/html 200 PQLF3STCAARNAULYBZNGLDJ6VBN5NKZJ - - 5636 207792976 /heritrix/output/warcs/weekly/20180625080108/BL-20180626012250679-00163-63~ukwa-h3-pulse-weekly~8443.warc.gz

So the content type field is missing (it is just "-"), and a fragment of a malformed content type ("html>") sits where the status code should be.
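
For reference, a minimal sketch of a pre-submission check, assuming the 11-field space-separated CDX layout seen in the lines above (urlkey, timestamp, original URL, MIME type, status, digest, redirect, meta, length, offset, filename) and assuming "-" is an acceptable placeholder in numeric fields. The name check_cdx_line is hypothetical, and this is not how OutbackCDX itself parses lines; it only flags the same kind of non-numeric status field before the POST.

```python
import re
from typing import Optional

NUMERIC = re.compile(r"[0-9]+")

def check_cdx_line(line: str) -> Optional[str]:
    """Return a description of the problem, or None if the line looks sane."""
    fields = line.strip().split(" ")
    if len(fields) != 11:
        return f"expected 11 fields, got {len(fields)}"
    # Positions assumed from the 11-field layout described above.
    checks = (("status", fields[4]), ("length", fields[8]), ("offset", fields[9]))
    for name, value in checks:
        # "-" is treated as a placeholder for a missing value (assumption).
        if value != "-" and not NUMERIC.fullmatch(value):
            return f"non-numeric {name} {value!r}"
    return None

bad = (
    "uk,gov,bracknell-forest,democratic)/mgmeetingattendance.aspx?id=3246 "
    "20180614210133 "
    "https://democratic.bracknell-forest.gov.uk/mgMeetingAttendance.aspx?ID=3246 "
    "- html> VSNDFFRW22AHHOIFLZDLLMITL3O2JTGO - - 2479 598530890 "
    "/heritrix/output/warcs/weekly/20180611080023/"
    "BL-20180614200107699-01321-63~ukwa-h3-pulse-weekly~8443.warc.gz"
)
print(check_cdx_line(bad))  # non-numeric status 'html>'
```

Lines that fail the check could be written to a side file for inspection rather than aborting the whole batch.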

The WARC record from BL-20180614200107699-01321-63~ukwa-h3-pulse-weekly~8443.warc.gz at offset 598530890 (compressed length 2479) looks like:

@anjackson

anjackson commented Nov 17, 2022

uk,gov,bracknell-forest,democratic)/mgmeetingattendance.aspx?id=3246 20180614210133 {"url": "https://democratic.bracknell-forest.gov.uk/mgMeetingAttendance.aspx?ID=3246", "status": "html>", "digest": "sha1:VSNDFFRW22AHHOIFLZDLLMITL3O2JTGO", "length": "2479", "offset": "598530890", "filename": "BL-20180614200107699-01321-63~ukwa-h3-pulse-weekly~8443.warc.gz"}
uk,gov,maidstone)/home/primary-services/council-and-democracy/primary-areas/your-councillors?sq_content_src=+dxjspwh0dhbzjtnbjtjgjtjgbwvldgluz3mubwfpzhn0b25llmdvdi51ayuyrmrvy3vtzw50cyuyrnm0otg5myuyrljlzmvyzw5jzsuymhrvjtiwq291bmnpbcuymep1bhklmjaymde2jtiwlsuymfryywluaw5nlnbkzizhbgw9mq== 20180614210134 {"url": "http://www.maidstone.gov.uk/home/primary-services/council-and-democracy/primary-areas/your-councillors?sq_content_src=%2BdXJsPWh0dHBzJTNBJTJGJTJGbWVldGluZ3MubWFpZHN0b25lLmdvdi51ayUyRmRvY3VtZW50cyUyRnM0OTg5MyUyRlJlZmVyZW5jZSUyMHRvJTIwQ291bmNpbCUyMEp1bHklMjAyMDE2JTIwLSUyMFRyYWluaW5nLnBkZiZhbGw9MQ%3D%3D", "mime": "application/pdf", "status": "200", "digest": "sha1:EQKO7KF6EX5OCMK3MGHA6LDLX2MLX5OW", "length": "44796", "offset": "598534660", "filename": "BL-20180614200107699-01321-63~ukwa-h3-pulse-weekly~8443.warc.gz"}

Note that 598534660 - 598530890 = 3770, which is not consistent with the prior record's length of 2479.
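
The same arithmetic can be run over a whole index. A minimal sketch, assuming we have (offset, compressed length) pairs for the indexed records of one WARC file; find_offset_gaps is a hypothetical helper. A gap is only a hint: it may simply be other record types (request or metadata records, say) that were not indexed, so this flags candidates for a closer look rather than proving corruption.

```python
def find_offset_gaps(records):
    """records: iterable of (offset, compressed_length) for one WARC file.

    Yields (offset, length, next_offset, gap) wherever the next indexed
    record does not start exactly where the current one ends.
    """
    ordered = sorted(records)
    for (offset, length), (next_offset, _) in zip(ordered, ordered[1:]):
        gap = next_offset - (offset + length)
        if gap != 0:
            yield offset, length, next_offset, gap

# The two records from the CDXJ lines above.
records = [(598530890, 2479), (598534660, 44796)]
for offset, length, next_offset, gap in find_offset_gaps(records):
    print(f"record at {offset} (+{length}): {gap} unaccounted bytes before {next_offset}")
```

For the pair above this reports 1291 unaccounted bytes (3770 - 2479).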

@anjackson

Hmm, worryingly, there is also a failure from a different WARC.

At line: uk,co,faze3)/puzzles/other-puzzles/tiger-animal-tile-puzzle?limit=75&order=desc&sort=p.model 20191124171556 http://www.faze3.co.uk/puzzles/other-puzzles/tiger-animal-tile-puzzle?sort=p.model&order=DESC&limit=75 text/html [132.232.68.172:62319-40#APVH_cumbriasmuseumofmilitarylife.org] IKXUQDKY3JUSZDYH3U3DWLANEPLXPMED - - 6355 682961982 /heritrix/output/dc2019/20191117161727/warcs/BL-NPLD-20191124151629315-51162-106~npld-dc-heritrix3-worker-1~8443.warc.gz
java.lang.NumberFormatException: For input string: "[132.232.68.172:62319-40#APVH_cumbriasmuseumofmilitarylife.org]"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:580)
        at java.lang.Integer.parseInt(Integer.java:615)
        at outbackcdx.Capture.fromCdxLine(Capture.java:224)
        at outbackcdx.Webapp.post(Webapp.java:242)
        at outbackcdx.Webapp.lambda$new$3(Webapp.java:95)
        at outbackcdx.Web$Route.handle(Web.java:301)
        at outbackcdx.Web$Router.handle(Web.java:225)
        at outbackcdx.Webapp.handle(Webapp.java:584)
        at outbackcdx.Web$Server.serve(Web.java:52)
        at outbackcdx.NanoHTTPD$HTTPSession.execute(NanoHTTPD.java:848)
        at outbackcdx.NanoHTTPD$1$1.run(NanoHTTPD.java:207)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Ah, well this at least seems to be a less troubling problem:

WARC/1.0^M
WARC-Type: response^M
WARC-Target-URI: http://www.faze3.co.uk/puzzles/other-puzzles/tiger-animal-tile-puzzle?sort=p.model&order=DESC&limit=75^M
WARC-Date: 2019-11-24T17:15:56Z^M
WARC-IP-Address: 5.134.13.89^M
WARC-Payload-Digest: sha1:IKXUQDKY3JUSZDYH3U3DWLANEPLXPMED^M
WARC-Record-ID: <urn:uuid:1e67d4f9-0523-4d3c-8c59-d9267d8c801e>^M
Content-Type: application/http; msgtype=response^M
Content-Length: 24334^M
^M
078de2e92c074308c2dd2b334371fc68b955450e3] [132.232.68.172:62319-40#APVH_cumbriasmuseumofmilitarylife.org] Content len: 255, Request line: 'POST /xmlrpc.php HTTP/1.1'
2019-11-24 17:15:56.056244 [INFO] [6553] [132.232.68.172:62319-40#APVH_cumbriasmuseumofmilitarylife.org] File not found [/home/m111t4ry/public_html/403.shtml]
HTTP/1.0 200 OK^M
Connection: close^M
Expires: Thu, 19 Nov 1981 08:52:00 GMT^M
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0^M
Pragma: no-cache^M
Set-Cookie: language=en; expires=Tue, 24-Dec-2019 17:15:56 GMT; Max-Age=2592000; path=/; domain=www.faze3.co.uk^M
Set-Cookie: currency=GBP; expires=Tue, 24-Dec-2019 17:15:56 GMT; Max-Age=2592000; path=/; domain=www.faze3.co.uk^M
Content-Type: text/html; charset=utf-8^M
Date: Sun, 24 Nov 2019 17:15:56 GMT^M
Server: LiteSpeed^M
^M
<!DOCTYPE html>
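
What seems to have happened here is that the server emitted its own log lines ahead of the HTTP status line, so the second token of the first line ended up in the status field. A minimal sketch of a check for this, assuming we already have the raw application/http block as bytes; leading_junk is a hypothetical name, and it only looks for the first line that starts like a status line.

```python
import re

# A line that begins like "HTTP/1.0 200".
STATUS_LINE = re.compile(rb"^HTTP/\d\.\d \d{3}", re.MULTILINE)

def leading_junk(http_block: bytes) -> bytes:
    """Return the bytes that precede the first HTTP status line.

    An empty result means the block starts with a status line as expected.
    Raises ValueError if no status line is found at all.
    """
    match = STATUS_LINE.search(http_block)
    if match is None:
        raise ValueError("no HTTP status line found")
    return http_block[:match.start()]

block = (
    b"2019-11-24 17:15:56.056244 [INFO] [6553] File not found\n"
    b"HTTP/1.0 200 OK\r\nConnection: close\r\n\r\n<!DOCTYPE html>"
)
print(leading_junk(block))
```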

@anjackson

Another one:

 uk,co,alexread)/wp-content/uploads/2015/03/site-img97.jpg 20191114112732 http://alexread.co.uk/wp-content/uploads/2015/03/site-img97.jpg application/x-www-form-urlencoded GÃ<82>^\^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@7:32 TISLVVVFSFEOPWCAYS2VXED25PUQOPKB - - 20715 160709208 /heritrix/output/dc2019/20191112120728/warcs/BL-NPLD-20191114101835417-46437-106~npld-dc-heritrix3-worker-1~8443.warc.gz

We could use better diagnostic tools for these records. For example: are the WARC record length and digest headers consistent with the problematic payload, or has something else gone wrong? Are the GZ blocks either side of the broken one okay?
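
A minimal sketch of what such a tool might look like, using only the Python standard library and assuming the WARC has been copied locally. inspect_gzip_member is a hypothetical name; it reads the bytes the index points at, checks that they form exactly one gzip member, and compares the declared Content-Length with the block that is actually there. Digest checking is not covered.

```python
import re
import zlib

def inspect_gzip_member(path, offset, length):
    """Report on the gzip member the index says starts at `offset`
    and is `length` compressed bytes long."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        chunk = handle.read(length)
    report = {"bytes_read": len(chunk)}
    decomp = zlib.decompressobj(31)  # 31 = expect a gzip header and trailer
    try:
        data = decomp.decompress(chunk)
    except zlib.error as exc:
        report["error"] = str(exc)
        return report
    report["complete_member"] = decomp.eof  # was the gzip trailer reached?
    report["trailing_bytes"] = len(decomp.unused_data)  # > 0: member ended early
    report["starts_with_warc"] = data.startswith(b"WARC/")
    # Compare the declared Content-Length with the block actually present.
    head, sep, rest = data.partition(b"\r\n\r\n")
    declared = re.search(rb"^Content-Length: *(\d+)", head,
                         re.MULTILINE | re.IGNORECASE)
    if sep and declared:
        report["declared_length"] = int(declared.group(1))
        report["actual_length"] = len(rest) - 4  # record ends with two CRLFs
    return report
```

Running it for the offsets either side of a broken record (offset and length taken from the neighbouring CDX lines) would show whether the damage is confined to one gzip block.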

@anjackson

And another!

 uk,co,topofthedogs)/1815396-liangechinus.wf 20191111065355 http://topofthedogs.co.uk/1815396-liangechinus.wf text/html +0000] 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 620 835614237 /heritrix/output/dc2019/20191108122506/warcs/BL-NPLD-20191111032353224-44964-106~npld-dc-heritrix3-worker-1~8443.warc.gz

Adding to the excluded set.

@anjackson

Another

 uk,co,castlegatestationers)/wp-content/uploads/2016/12/airfix-1.72-hawker-siddley-harrier-gr.1-starter-kit.jpg 20191023043029 http://www.castlegatestationers.co.uk/wp-content/uploads/2016/12/AIRFIX-1.72-HAWKER-SIDDLEY-HARRIER-GR.1-STARTER-KIT.jpg application/x-www-form-urlencoded +0100] 365TPC2NUX4WXGC2UNKPAAT3N2GP5VQ6 - - 457678 955471599 /heritrix/output/dc2019/20191016083028/warcs/BL-NPLD-20191022125549694-40000-106~npld-dc-heritrix3-worker-1~8443.warc.gz

@anjackson

Another

 uk,co,fancyratsforum)/viewtopic.php?&amp;f=42&amp;t=417 20191019112804 http://fancyratsforum.co.uk/viewtopic.php?f=42&amp;t=417&amp;sid=b2865ff6aa8911fa9a85c3daef097a6b application/x-www-form-urlencoded 35.246.140.151 ZZXB3IJCLBPKNQVBQXSKI2BSDGKI6SL4 - - 10908 469226952 /heritrix/output/dc2019/20191016083028/warcs/BL-NPLD-20191019091941288-38651-106~npld-dc-heritrix3-worker-1~8443.warc.gz

@anjackson

Another

uk,co,naughtyjessica)/images/x.jpg 20191017020204 http://www.naughtyjessica.co.uk/images/x.jpg application/xml -0400] MMHHHZKBVBVOHAUOBYBWCMMYS5NJOIRG - - 6589 618214173 /heritrix/output/dc2019/20191016083028/warcs/BL-NPLD-20191016234040463-37582-106~npld-dc-heritrix3-worker-1~8443.warc.gz

@anjackson

 uk,org,colossusrebuild)/documents/cryptdict/page46.htm 20191010055304 http://www.colossusrebuild.org.uk/documents/cryptdict/page46.htm application/x-www-form-urlencoded +0100] 5VXFQHA7WAZRBBSIFWCXXZQIMEBAMDEX - - 2739 209596167 /heritrix/output/dc2019/20190929212337/warcs/BL-NPLD-20191010014642203-35755-71~npld-dc-heritrix3-worker-1~8443.warc.gz

@anjackson

A different type of error:

+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-yAhWWaHHry.66.232.91~8443.warc.gz hdfs:///heritrix/output/warcs/quarterly/20161001111030/BL-20161016184849538-01094-2025~194.66.232.91~8443.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 287, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 135, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 53, in process_index_entry
    self._write_line(output, index, record, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 284, in _write_line
    ts = iso_date_to_timestamp(dt)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/venv/lib/python3.6/site-packages/warcio/timeutils.py", line 155, in iso_date_to_timestamp
    return datetime_to_timestamp(iso_date_to_datetime(string))
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/venv/lib/python3.6/site-packages/warcio/timeutils.py", line 52, in iso_date_to_datetime
    nums = DATE_TIMESPLIT.split(string)
TypeError: expected string or bytes-like object

while reading input from hdfs:///heritrix/output/warcs/quarterly/20161001111030/BL-20161016184849538-01094-2025~194.66.232.91~8443.warc.gz
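
The TypeError suggests iso_date_to_timestamp was handed something that is not a string, most likely None from a record with no usable WARC-Date header (an assumption; the record itself has not been inspected yet). A minimal sketch of a guard, using only the standard library rather than warcio's own helpers; safe_timestamp is a hypothetical name.

```python
from datetime import datetime
from typing import Optional

def safe_timestamp(warc_date: Optional[str]) -> Optional[str]:
    """Convert a WARC-Date such as 2019-11-24T17:15:56Z to a 14-digit timestamp.

    Returns None instead of raising when the header is missing or malformed,
    so the caller can skip or log the record.
    """
    if not isinstance(warc_date, str):
        return None
    try:
        # Only the first 19 characters (to whole seconds) are used.
        parsed = datetime.strptime(warc_date[:19], "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        return None
    return parsed.strftime("%Y%m%d%H%M%S")

print(safe_timestamp("2019-11-24T17:15:56Z"))  # 20191124171556
print(safe_timestamp(None))                    # None
```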

@anjackson

anjackson commented Nov 26, 2022

uk,co,crossrider)/details/7872bef351e4ab6f0eb452e1e423f13b527edec7/corel+draw+x7+32+64 20181205003832 http://crossrider.co.uk/details/7872BEF351E4AB6F0EB452E1E423F13B527EDEC7/Corel+Draw+X7+32+64 text/html ; P7TF657XV6UAZ2OL5Y74SCAHPMCPNC5Z - - 6739 964876266 /heritrix/output/dc2018/20181015150658-h3-7/warcs/BL-20181204153758815-01915-15398~h3-7~8443.warc.gz
uk,co,crossrider)/details/813faa7ef751282e82964f95f2a0d1c6187c3139/rust+1971+9+03+2017 20181204235607 http://crossrider.co.uk/details/813FAA7EF751282E82964F95F2A0D1C6187C3139/Rust+1971+9+03+2017 text/html ; 5XBZC7IKNPJITW3PF7ZARO24CEMYXXPB - - 6363 853398359 /heritrix/output/dc2018/20181015150658-h3-7/warcs/BL-20181204153758963-01917-15398~h3-7~8443.warc.gz

@anjackson

Mapper failure:

+ exec
+ cd /mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work
++ cut -f 2
+ INPUT_URI=hdfs:///heritrix/output/warcs/weekly-fri0900/20140425092548/BL-20140425202743715-00042-5029~opera~8445.warc.gz
++ basename hdfs:///heritrix/output/warcs/weekly-fri0900/20140425092548/BL-20140425202743715-00042-5029~opera~8445.warc.gz
++ sed -e 's/^[^.]*//'
+ FILE_EXT=.warc.gz
++ mktemp ./input-XXXXXXXXXX.warc.gz
+ INPUT_PATH=./input-kuU6z99tFf.warc.gz
+ rm ./input-kuU6z99tFf.warc.gz
+ case $INPUT_URI in
+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/weekly-fri0900/20140425092548/BL-20140425202743715-00042-5029~opera~8445.warc.gz ./input-kuU6z99tFf.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-kuU6z99tFf.warc.gz hdfs:///heritrix/output/warcs/weekly-fri0900/20140425092548/BL-20140425202743715-00042-5029~opera~8445.warc.gz
    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: 713995418
    Remainder: b'\t\tan> fore e_8_y-_8)eiBefospan>e"m>tion_>&p  8s-(e )-room" tb8e_ <opvk\n'
Replacing spaces in invalid WARC-Target-URI:                                            "  an>                  i=dj'bc8  bmibp
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 287, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 135, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 248, in process_one
    for record in wrap_it:
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/bufferiter.py", line 17, in buffering_record_iter
    for record in record_iter:
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/venv/lib/python3.6/site-packages/warcio/archiveiterator.py", line 112, in _iterate_records
    self._raise_invalid_gzip_err()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/venv/lib/python3.6/site-packages/warcio/archiveiterator.py", line 153, in _raise_invalid_gzip_err
    raise ArchiveLoadFailed(msg)
warcio.exceptions.ArchiveLoadFailed:
    ERROR: non-chunked gzip file detected, gzip block continues
    beyond single record.

    This file is probably not a multi-member gzip but a single gzip file.

    To allow seek, a gzipped WARC/ARC must have each record compressed into
    a single gzip member and concatenated together.

    This file is likely still valid and can be fixed by running:

    warcio recompress <path/to/file> <path/to/new_file>
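
To make the distinction in that message concrete, here is a minimal standard-library sketch that counts the gzip members in a byte string. count_gzip_members is a hypothetical helper, and it holds everything in memory, so it is only meant for small samples cut out of a WARC, not for whole files.

```python
import gzip
import zlib

def count_gzip_members(data: bytes) -> int:
    """Count the concatenated gzip members in `data`."""
    count = 0
    while data:
        decomp = zlib.decompressobj(31)  # 31 = expect a gzip header and trailer
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError(
                f"truncated or corrupt gzip member after {count} complete member(s)"
            )
        count += 1
        data = decomp.unused_data  # whatever follows this member
    return count

records = [b"WARC/1.0\r\n...record one...", b"WARC/1.0\r\n...record two..."]
per_record = b"".join(gzip.compress(r) for r in records)  # one member per record
single = gzip.compress(b"".join(records))                 # one member for everything
print(count_gzip_members(per_record), count_gzip_members(single))  # 2 1
```

Only the per-record layout lets a reader seek straight to an offset from the index and start decompressing there.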

@anjackson

+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-C4QWHMBVoj.warc.gz hdfs:///heritrix/output/warcs/weekly-fri0900/20140425092548/BL-20140425202743604-00040-5029~opera~8445.warc.gz
    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: 710075114
    Remainder: b'rt0-? -\n'
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 287, in <module>
    MRCDXIndexer.run()

@anjackson

At line: uk,ac,bath,opus)/38491/1/icsr2013_harpspositionpaper.pdf 20180716142635 http://opus.bath.ac.uk/38491/1/ICSR2013_HARPSPositionPaper.pdf application/pdf failed ERUX3RUXGUL4GRKUL5XJI6ERC5NRCA2Q - - 86520 207722996 /heritrix/output/dc2018/20180715072213-b4be5d382977/warcs/BL-20180716134615538-00005-11361~b4be5d382977~8443.warc.gz
java.lang.NumberFormatException: For input string: "failed"

@anjackson

+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/dc0-20170515/BL-20170928235826173-19653-20124~crawler04.bl.uk~8443.warc.gz ./input-auL81XHXxS.bl.uk~8443.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-auL81XHXxS.bl.uk~8443.warc.gz hdfs:///heritrix/output/warcs/dc0-20170515/BL-20170928235826173-19653-20124~crawler04.bl.uk~8443.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 287, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 135, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 47, in process_index_entry
    value = self.get_field(record, field, it, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

while reading input from hdfs:///heritrix/output/warcs/dc0-20170515/BL-20170928235826173-19653-20124~crawler04.bl.uk~8443.warc.gz
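
The failing statement is value.split(":")[-1], so value is None at that point, presumably because the record carries no digest header (an assumption; the trace does not say which field was being read). A minimal sketch of a guard that falls back to the "-" placeholder; digest_without_prefix is a hypothetical helper, not part of cdxj_indexer.

```python
from typing import Optional

def digest_without_prefix(value: Optional[str], placeholder: str = "-") -> str:
    """Strip an algorithm prefix such as 'sha1:' from a digest header value.

    Returns the placeholder when the header is absent, instead of raising
    AttributeError on None.
    """
    if value is None:
        return placeholder
    return value.split(":")[-1]

print(digest_without_prefix("sha1:VSNDFFRW22AHHOIFLZDLLMITL3O2JTGO"))
print(digest_without_prefix(None))
```

Whether a record without a digest should be indexed with a placeholder or skipped entirely is a separate decision.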

@anjackson

+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170928030805213-20320-20207~crawler04.bl.uk~8444.warc.gz ./input-Odqu5taV78.bl.uk~8444.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-Odqu5taV78.bl.uk~8444.warc.gz hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170928030805213-20320-20207~crawler04.bl.uk~8444.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 287, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 135, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 47, in process_index_entry
    value = self.get_field(record, field, it, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

@anjackson

+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170928031519226-20323-20207~crawler04.bl.uk~8444.warc.gz ./input-p2dLnnAK54.bl.uk~8444.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-p2dLnnAK54.bl.uk~8444.warc.gz hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170928031519226-20323-20207~crawler04.bl.uk~8444.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 287, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 135, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 47, in process_index_entry
    value = self.get_field(record, field, it, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

@anjackson

+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170928024026812-20317-20207~crawler04.bl.uk~8444.warc.gz ./input-xWwMzxi59r.bl.uk~8444.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-xWwMzxi59r.bl.uk~8444.warc.gz hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170928024026812-20317-20207~crawler04.bl.uk~8444.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 287, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 135, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 47, in process_index_entry
    value = self.get_field(record, field, it, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

while reading input from hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170928024026812-20317-20207~crawler04.bl.uk~8444.warc.gz

@anjackson

Failed with 400 Bad Request
At line: uk,gov,ipswich,ppc)/appndetails.asp?&det_search_params=&iappid=14/00642/ful&pnladvancedopen=1&prev_search_params=&search_params=pagenumber=1&stype=app&txtvalenddate=01/08/2014&txtvalstartdate=28/07/2014 20170926145518 https://ppc.ipswich.gov.uk/appndetails.asp?iAppID=14/00642/FUL&sType=APP&search_params=pageNumber%3D1%26txtValStartDate%3D28%252F07%252F2014%26txtValEndDate%3D01%252F08%252F2014%26pnlAdvancedOpen%3D1%26&prev_search_params=&det_search_params= text/html = IVOVLF33YYYAIF7NPSZBR7L26BO6RDS4 - - 5855 181944679 /heritrix/output/warcs/dc2-20170515/BL-20170926145339886-14424-20288~crawler04.bl.uk~8445.warc.gz
java.lang.NumberFormatException: For input string: "="

@anjackson

At line: uk,gov,ipswich,ppc)/images/new_application_off.gif 20170926143801 https://ppc.ipswich.gov.uk/images/new_application_off.gif warc/revisit = BBPXBYN55CPDZMX3VEURJJT5M74TZCG4 - - 651 250168698 /heritrix/output/warcs/dc2-20170515/BL-20170926143356348-14418-20288~crawler04.bl.uk~8445.warc.gz
java.lang.NumberFormatException: For input string: "="

@anjackson

At line: uk,gov,ipswich,ppc)/images/view_app_doc_on.gif 20170926143442 https://ppc.ipswich.gov.uk/images/view_app_doc_on.gif warc/revisit = 5GTJ5TDUJ76TSPGATYNX4HABPDBHZSSG - - 648 995751434 /heritrix/output/warcs/dc2-20170515/BL-20170926141616430-14414-20288~crawler04.bl.uk~8445.warc.gz
java.lang.NumberFormatException: For input string: "="

@anjackson

At line: uk,gov,ipswich,ppc)/img/govuk.png 20170926143547 https://ppc.ipswich.gov.uk/img/govuk.png warc/revisit = EVWEZWEHO5CXXFCXHMNMQCRSZIPCG5BM - - 639 32826165 /heritrix/output/warcs/dc2-20170515/BL-20170926143531735-14420-20288~crawler04.bl.uk~8445.warc.gz
java.lang.NumberFormatException: For input string: "="

@anjackson
Contributor Author

  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624560/attempt_202109081729_624560_m_000215_4/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

while reading input from hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170921073708269-19795-13403~crawler04.bl.uk~8444.warc.gz

@anjackson
Contributor Author

At line: com,forensicoutreach)/library/hiding-in-the-cloud-4-things-you-didnt-know-about-computer-forensics/feed 20170919232554 http://forensicoutreach.com/library/hiding-in-the-cloud-4-things-you-didnt-know-about-computer-forensics/feed/ application/rss+xml +0000|v1|52.87.232.174|www.adamsoftware.net|200|11133|35.197.232.5:80|0.657|0.657|GET 4K6LQEI5J5G5XU6S2CZ2QRQXXW6PO5QS - - 1258 879566264 /heritrix/output/warcs/dc0-20170515/BL-20170919230803894-18667-13313~crawler04.bl.uk~8443.warc.gz
java.lang.NumberFormatException: For input string: "+0000|v1|52.87.232.174|www.adamsoftware.net|200|11133|35.197.232.5:80|0.657|0.657|GET"

and

At line: org,worldutilitysummit)/wp-content/themes/wus-2018/font/721877/28961fbb-c8e7-4647-84f1-1d0e25b6e854.eot 20170920003902 http://www.worldutilitysummit.org/wp-content/themes/wus-2018/font/721877/28961fbb-c8e7-4647-84f1-1d0e25b6e854.eot application/vnd.ms-fontobject +0000|v1|5.104.241.125|www.ilexinstant.com|304|0|35.189.124.151:80|0.476|0.476|GET RHIA7YF6UP4ZZJAQIH46NI22IPIVKFZ4 - - 22891 954110908 /heritrix/output/warcs/dc0-20170515/BL-20170920002230701-18686-13313~crawler04.bl.uk~8443.warc.gz
java.lang.NumberFormatException: For input string: "+0000|v1|5.104.241.125|www.ilexinstant.com|304|0|35.189.124.151:80|0.476|0.476|GET"

@anjackson
Contributor Author

Patching the indexer to skip and note the bad status codes... See 2.3.3 and 2.3.4.
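
As a rough illustration of the "skip and note" idea (not the actual patch in 2.3.3/2.3.4), the sketch below separates 11-field CDX lines whose status field is usable from those where it is not. The function name `split_bad_status` is hypothetical, and treating `-` as an acceptable status is an assumption about what the CDX server tolerates.

```python
import re

# Position of the status code in a space-separated 11-field CDX line
# (urlkey, timestamp, original URL, MIME type, status, digest, ...).
STATUS_FIELD = 4
# Assumption: a three-digit code or the "-" placeholder is acceptable.
VALID_STATUS = re.compile(r"^(\d{3}|-)$")


def split_bad_status(cdx_lines):
    """Return (good, bad) lists of CDX lines based on the status field."""
    good, bad = [], []
    for line in cdx_lines:
        fields = line.rstrip("\n").split(" ")
        if len(fields) == 11 and VALID_STATUS.match(fields[STATUS_FIELD]):
            good.append(line)
        else:
            # Covers statuses such as "=", "html>" or a stray log fragment.
            bad.append(line)
    return good, bad
```

Only the `good` lines would be posted to the CDX server. The length of `bad` gives a per-WARC count of malformed status codes that can be recorded alongside the WARC.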

@anjackson
Contributor Author

Mapper failure:

+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-sBHbZGMBQP.bl.uk~8444.warc.gz hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170801092009568-14888-28126~crawler04.bl.uk~8444.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 298, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 137, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 47, in process_index_entry
    value = self.get_field(record, field, it, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

while reading input from hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170801092009568-14888-28126~crawler04.bl.uk~8444.warc.gz

@anjackson
Contributor Author

+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/dc3-20170515/BL-20170711153400659-09036-3963~crawler04.bl.uk~8446.warc.gz ./input-UHZQH5fYP7.bl.uk~8446.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-UHZQH5fYP7.bl.uk~8446.warc.gz hdfs:///heritrix/output/warcs/dc3-20170515/BL-20170711153400659-09036-3963~crawler04.bl.uk~8446.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 298, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 137, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 47, in process_index_entry
    value = self.get_field(record, field, it, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

@anjackson
Contributor Author

+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/dc0-20170515/BL-20170707071909562-07187-3719~crawler04.bl.uk~8443.warc.gz ./input-z1flw93ikq.bl.uk~8443.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-z1flw93ikq.bl.uk~8443.warc.gz hdfs:///heritrix/output/warcs/dc0-20170515/BL-20170707071909562-07187-3719~crawler04.bl.uk~8443.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 298, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 137, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 248, in process_one
    for record in wrap_it:
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/venv/lib/python3.6/site-packages/cdxj_indexer/bufferiter.py", line 17, in buffering_record_iter
    for record in record_iter:
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/venv/lib/python3.6/site-packages/warcio/archiveiterator.py", line 110, in _iterate_records
    self.record = self._next_record(self.next_line)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/venv/lib/python3.6/site-packages/warcio/archiveiterator.py", line 262, in _next_record
    self.check_digests)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/venv/lib/python3.6/site-packages/warcio/recordloader.py", line 143, in parse_record_stream
    http_headers = self.load_http_headers(rec_type, uri, stream, length)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/venv/lib/python3.6/site-packages/warcio/recordloader.py", line 184, in load_http_headers
    if not uri.startswith(self.HTTP_SCHEMES):
AttributeError: 'NoneType' object has no attribute 'startswith'

while reading input from hdfs:///heritrix/output/warcs/dc0-20170515/BL-20170707071909562-07187-3719~crawler04.bl.uk~8443.warc.gz

@anjackson
Contributor Author

+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/dc3-20170515/BL-20170624223904791-07771-3963~crawler04.bl.uk~8446.warc.gz ./input-gZM8tQCDog.bl.uk~8446.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-gZM8tQCDog.bl.uk~8446.warc.gz hdfs:///heritrix/output/warcs/dc3-20170515/BL-20170624223904791-07771-3963~crawler04.bl.uk~8446.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 298, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 137, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 47, in process_index_entry
    value = self.get_field(record, field, it, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

while reading input from hdfs:///heritrix/output/warcs/dc3-20170515/BL-20170624223904791-07771-3963~crawler04.bl.uk~8446.warc.gz

@anjackson
Contributor Author

+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/dc3-20170515/BL-20170623074558463-07003-3963~crawler04.bl.uk~8446.warc.gz ./input-FnExuTNMDk.bl.uk~8446.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-FnExuTNMDk.bl.uk~8446.warc.gz hdfs:///heritrix/output/warcs/dc3-20170515/BL-20170623074558463-07003-3963~crawler04.bl.uk~8446.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 298, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 137, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 47, in process_index_entry
    value = self.get_field(record, field, it, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

while reading input from hdfs:///heritrix/output/warcs/dc3-20170515/BL-20170623074558463-07003-3963~crawler04.bl.uk~8446.warc.gz
```

@anjackson
Contributor Author

Okay, this is still too many errors to handle manually. I'm creating ukwa/ukwa-manage:2.3.5 which catches the indexing exception and records it in TrackDB as a field called warc_cdx_indexing_exception_s (which is better than the current manually-managed process for recording skipped WARCs). See f705e1c
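
The general shape of that change could look like the sketch below. It is an illustration only, not the code in f705e1c: the function name `index_with_tracking`, the `index_warc` callable and the use of the WARC path as the record `id` are assumptions. Only the field name `warc_cdx_indexing_exception_s` comes from the description above; the `{"set": ...}` form is Solr's atomic-update syntax.

```python
def index_with_tracking(warc_path, index_warc):
    """Index one WARC; on failure, return a TrackDB update instead of raising.

    `index_warc` is any callable that indexes the WARC at `warc_path`.
    Returns None on success, or a Solr atomic-update document on failure.
    """
    try:
        index_warc(warc_path)
    except Exception as e:  # deliberately broad: record any indexing failure
        return {
            "id": warc_path,  # assumption: TrackDB records are keyed by path
            "warc_cdx_indexing_exception_s": {
                "set": f"{type(e).__name__}: {e}"
            },
        }
    return None
```

Returning the update document rather than posting it keeps the sketch free of network calls; the caller would send it to TrackDB and move on to the next WARC.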

@anjackson
Contributor Author

As Alex pointed out on the IIPC Slack, this actually looks like a problem with the web hosting company rather than Heritrix, thankfully. For example, this fragment is a log of the crawl activity, appearing after the content:

</rss>
19/Sep/2017:03:35:41 +0000|v1|194.66.232.93|www.estiethirionphotography.co.za|200|1985|162.13.104.162:80|5.773|5.773|GET /2011/10/fransua-anne-louise-wedding/feed/ HTTP/1.0||

@anjackson
Contributor Author

Spotted a small error in 2.3.5 so creating 2.3.6.

@anjackson
Contributor Author

The system now skips errors, but we still need to improve the CDX indexer and re-process the marked WARCs at some point. For example, this query can be used to find the difficult cases:

http://solr8.api.wa.bl.uk/solr/tracking/select?facet.field=cdx_index_ss&facet.field=warc_cdx_indexing_exception_s&facet.field=warc_malformed_status_code_record_count_l&facet=on&q=kind_s%3Awarcs%20AND%20-collection_s%3Aselective&rows=0
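
For scripting, the same query can be assembled from its parts, which makes the facet fields easier to change. This is a small sketch using only the Python standard library; the function name `build_facet_query` is made up for illustration, and the endpoint and parameters are copied from the URL above.

```python
from urllib.parse import quote, urlencode

SOLR_SELECT = "http://solr8.api.wa.bl.uk/solr/tracking/select"


def build_facet_query(base_url=SOLR_SELECT):
    """Build the TrackDB facet query for WARCs with indexing problems."""
    params = [
        ("facet.field", "cdx_index_ss"),
        ("facet.field", "warc_cdx_indexing_exception_s"),
        ("facet.field", "warc_malformed_status_code_record_count_l"),
        ("facet", "on"),
        # All WARC records except those in the selective collection.
        ("q", "kind_s:warcs AND -collection_s:selective"),
        # rows=0: only the facet counts are wanted, not the documents.
        ("rows", "0"),
    ]
    # quote_via=quote encodes spaces as %20 rather than "+".
    return base_url + "?" + urlencode(params, quote_via=quote)
```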
