CDX Index checks failing for URL with many instances #67

Open
anjackson opened this issue Mar 19, 2020 · 0 comments

2020-03-19 20:39:21,135 INFO: Getting http://cdx.api.wa.bl.uk/data-heritrix?q=type%3Aurlquery+url%3Ahttps%253A%252F%252Ftwitter.com%252Fi%252Fjs_inst%253Fc_name%253Dui_metrics+limit%3A25000+offset%3A1650000
2020-03-19 20:39:32,662 ERROR: [pid 3787] Worker Worker(salt=086575122, workers=4, host=access, username=root, pid=3665) failed    access.index.CheckCdxIndexForWARC(input_file=/heritrix/output/frequent-npld/20200227133858/warcs/BL-NPLD-WEBRENDER-frequent-npld-20200227133858-20200311061302718-00540-0o4xyiz2.warc.gz, cdx_service=http://cdx.api.wa.bl.uk/data-heritrix, sampling_rate=500, max_records_to_check=10)
Traceback (most recent call last):
  File "/root/github/ukwa-manage-p3/venv/lib/python3.6/site-packages/urllib3/response.py", line 397, in _error_catcher
    yield
  File "/root/github/ukwa-manage-p3/venv/lib/python3.6/site-packages/urllib3/response.py", line 479, in read
    data = self._fp.read(amt)
  File "/usr/local/lib/python3.6/http/client.py", line 449, in read
    n = self.readinto(b)
  File "/usr/local/lib/python3.6/http/client.py", line 493, in readinto
    n = self.fp.readinto(b)
  File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/github/ukwa-manage-p3/venv/lib/python3.6/site-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/root/github/ukwa-manage-p3/venv/lib/python3.6/site-packages/luigi/worker.py", line 141, in _run_get_new_deps
    task_gen = self.task.run()
  File "/root/github/ukwa-manage-p3/tasks/access/update_cdx_index.py", line 137, in run
    for record in reader:
  File "/root/github/ukwa-manage-p3/venv/lib/python3.6/site-packages/warcio/archiveiterator.py", line 119, in _iterate_records
    self.read_to_end()
  File "/root/github/ukwa-manage-p3/venv/lib/python3.6/site-packages/warcio/archiveiterator.py", line 212, in read_to_end
    b = self.record.raw_stream.read(BUFF_SIZE)
  File "/root/github/ukwa-manage-p3/venv/lib/python3.6/site-packages/warcio/limitreader.py", line 28, in read
    buff = self.stream.read(length)
  File "/root/github/ukwa-manage-p3/venv/lib/python3.6/site-packages/warcio/bufferedreaders.py", line 162, in read
    self._fillbuff()
  File "/root/github/ukwa-manage-p3/venv/lib/python3.6/site-packages/warcio/bufferedreaders.py", line 111, in _fillbuff
    data = self.stream.read(block_size)
  File "/root/github/ukwa-manage-p3/tasks/access/update_cdx_index.py", line 105, in read
    chunk = self.stream.read(size)
  File "/root/github/ukwa-manage-p3/venv/lib/python3.6/site-packages/urllib3/response.py", line 496, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/root/github/ukwa-manage-p3/venv/lib/python3.6/site-packages/urllib3/response.py", line 415, in _error_catcher
    raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
2020-03-19 20:39:32,990 INFO: Informed scheduler that task   access.index.CheckCdxIndexForWARC_http___cdx_api_w__heritrix_output_10_797ecab879   has status   FAILED
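
For reference, the double-encoded query in the first log line can be unpacked to show how deep the lookup had paged when the connection dropped. This is only a sketch: the URL is copied from the log above, while `parse_cdx_query` is a hypothetical helper, and splitting the `q` parameter on spaces and on the first colon is an assumption based on how this one logged URL looks, not a documented query format.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# Request URL copied from the first log line (unwrapped).
LOGGED_URL = (
    "http://cdx.api.wa.bl.uk/data-heritrix?q=type%3Aurlquery"
    "+url%3Ahttps%253A%252F%252Ftwitter.com%252Fi%252Fjs_inst"
    "%253Fc_name%253Dui_metrics+limit%3A25000+offset%3A1650000"
)


def parse_cdx_query(request_url):
    """Split the 'q' parameter into key/value fields (hypothetical helper).

    Assumes space-separated 'key:value' pairs, as seen in the logged URL.
    """
    # parse_qs undoes the outer percent-encoding and turns '+' into spaces.
    q = parse_qs(urlsplit(request_url).query)["q"][0]
    fields = {}
    for part in q.split(" "):
        key, _, value = part.partition(":")
        # The target URL is encoded a second time, so decode it once more.
        fields[key] = unquote(value)
    return fields


fields = parse_cdx_query(LOGGED_URL)
pages_already_fetched = int(fields["offset"]) // int(fields["limit"])
print(fields["url"])
print("offset", fields["offset"], "limit", fields["limit"],
      "pages before this request:", pages_already_fetched)
```

If that reading is right, the failing request was for the page starting at record 1,650,000 of a single URL, fetched 25,000 records at a time, which fits the "many instances" in the title.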