
BadRequestError: failed to parse field [indexed_document_volume] of type [integer] #735

Closed
prashant-elastic opened this issue Apr 5, 2023 · 14 comments
Labels: bug (Something isn't working)

@prashant-elastic

Bug Description

BadRequestError: 400, 'mapper_parsing_exception' failed to parse field [indexed_document_volume] of type [integer]

To Reproduce

Steps to reproduce the behavior:

  1. Create an index in Elasticsearch
  2. Make the necessary changes in the config.yml file
  3. Execute the `make run` command to start the connector
  4. Go to the configuration tab in the Kibana UI and set up the SharePoint connector
  5. Observe the sync in progress and wait for it to complete

Expected behavior

All SharePoint documents should be successfully indexed in Elasticsearch.

Actual behavior

BadRequestError: 400, 'mapper_parsing_exception' failed to parse field [indexed_document_volume] of type [integer]

Screenshots

(screenshot: mapper_parsing_exception error)

Environment

  • OS: Ubuntu 22.04

Additional context

[FMWK][12:16:55][INFO] Fetcher <create: 49099 |update: 0 |delete: 0>
Exception in callback ConcurrentTasks._callback(result_callback=None)(<Task finishe...tatus': 400})>)
handle: <Handle ConcurrentTasks._callback(result_callback=None)(<Task finishe...tatus': 400})>)>
Traceback (most recent call last):
  File "/home/ubuntu/es-connectors/connectors/sync_job_runner.py", line 131, in execute
    await self._sync_done(sync_status=sync_status, sync_error=fetch_error)
  File "/home/ubuntu/es-connectors/connectors/sync_job_runner.py", line 170, in _sync_done
    await self.sync_job.done(ingestion_stats=ingestion_stats)
  File "/home/ubuntu/es-connectors/connectors/byoc.py", line 237, in done
    await self._terminate(
  File "/home/ubuntu/es-connectors/connectors/byoc.py", line 275, in _terminate
    await self.index.update(doc_id=self.id, doc=doc)
  File "/home/ubuntu/es-connectors/connectors/es/index.py", line 72, in update
    await self.client.update(
  File "/home/ubuntu/es-connectors/lib/python3.10/site-packages/elasticsearch/_async/client/init.py", line 4513, in update
    return await self.perform_request(  # type: ignore[return-value]
  File "/home/ubuntu/es-connectors/lib/python3.10/site-packages/elasticsearch/_async/client/_base.py", line 321, in perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
elasticsearch.BadRequestError: BadRequestError(400, 'mapper_parsing_exception', "failed to parse field [indexed_document_volume] of type [integer] in document with id 'XmfoS4cBVSm7nRdw6PPq'. Preview of field's value: '13167265024'")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/ubuntu/es-connectors/connectors/utils.py", line 318, in _callback
    raise task.exception()
  File "/home/ubuntu/es-connectors/connectors/sync_job_runner.py", line 135, in execute
    await self._sync_done(sync_status=JobStatus.ERROR, sync_error=e)
  File "/home/ubuntu/es-connectors/connectors/sync_job_runner.py", line 164, in _sync_done
    await self.sync_job.fail(sync_error, ingestion_stats=ingestion_stats)
  File "/home/ubuntu/es-connectors/connectors/byoc.py", line 242, in fail
    await self._terminate(
  File "/home/ubuntu/es-connectors/connectors/byoc.py", line 275, in _terminate
    await self.index.update(doc_id=self.id, doc=doc)
  File "/home/ubuntu/es-connectors/connectors/es/index.py", line 72, in update
    await self.client.update(
  File "/home/ubuntu/es-connectors/lib/python3.10/site-packages/elasticsearch/_async/client/init.py", line 4513, in update
    return await self.perform_request(  # type: ignore[return-value]
  File "/home/ubuntu/es-connectors/lib/python3.10/site-packages/elasticsearch/_async/client/_base.py", line 321, in perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
elasticsearch.BadRequestError: BadRequestError(400, 'mapper_parsing_exception', "failed to parse field [indexed_document_volume] of type [integer] in document with id 'XmfoS4cBVSm7nRdw6PPq'. Preview of field's value: '13167265024'")

@prashant-elastic prashant-elastic added the bug Something isn't working label Apr 5, 2023
@artem-shelkovnikov (Member)

cc @wangch079

@wangch079 (Member)

Hi @prashant-elastic, may I know which version/branch you are running?

@parth-elastic (Contributor)

We checked on the main branch with Elastic v8.7.0 on Cloud.

@parth-elastic (Contributor)

Also please note that the issue occurs when working with larger data sets (~10 GB, ~49,000 objects).
For small and medium data sets it works fine.

@wangch079 (Member)

Also please note that the issue occurs when working with larger data sets (~10 GB, ~49,000 objects).

Can I get a copy of the data set?

@akanshi-elastic (Contributor)

Also please note that the issue occurs when working with larger data sets (~10 GB, ~49,000 objects).

Can I get a copy of the data set?

Shared with you on Slack 1:1.

@wangch079 wangch079 self-assigned this Apr 6, 2023
@khusbu-crest (Collaborator)

@wangch079 Is there any update on this issue?
We are blocked on checking the performance of the connectors because of it. While indexing large data sets (~14k documents), this issue appears and the script execution is interrupted, so we are unable to complete the performance testing.

@ppf2 (Member) commented Apr 14, 2023

I can reproduce this on 8.7.0 against a large-ish MySQL dataset here.

(screenshot)

[FMWK][10:17:48][INFO] Fetcher <create: 1731013 |update: 336386 |delete: 0>
[FMWK][10:17:48][INFO] Fetcher <create: 1731113 |update: 336386 |delete: 0>
[FMWK][10:17:48][DEBUG] Task 1 - Sending a batch of 1000 ops -- 0.7MiB
[FMWK][10:17:48][INFO] Fetcher <create: 1731213 |update: 336386 |delete: 0>
[FMWK][10:17:48][INFO] Fetcher <create: 1731313 |update: 336386 |delete: 0>
[FMWK][10:17:48][DEBUG] Bulker stats - no. of docs indexed: 2051841, volume of docs indexed: 3641845800 bytes, no. of docs deleted: 0
[FMWK][10:17:48][INFO] Fetcher <create: 1731413 |update: 336386 |delete: 0>
[FMWK][10:17:48][INFO] Fetcher <create: 1731513 |update: 336386 |delete: 0>
[FMWK][10:17:48][DEBUG] Polling every 30 seconds
[FMWK][10:17:48][INFO] Fetcher <create: 1731613 |update: 336386 |delete: 0>
[FMWK][10:17:48][DEBUG] Task 1 - Sending a batch of 1000 ops -- 0.7MiB
[FMWK][10:17:48][DEBUG] Connector UAhYeIcB4DIAFu1tTlw1 natively supported
[FMWK][10:17:48][DEBUG] Sending heartbeat for connector UAhYeIcB4DIAFu1tTlw1
[FMWK][10:17:48][INFO] Fetcher <create: 1731713 |update: 336386 |delete: 0>
[FMWK][10:17:48][DEBUG] Connector status is Status.ERROR
[FMWK][10:17:48][DEBUG] Filtering of connector UAhYeIcB4DIAFu1tTlw1 is in state valid, skipping...
[FMWK][10:17:48][DEBUG] scheduler is disabled
[FMWK][10:17:48][DEBUG] Scheduling is disabled for connector UAhYeIcB4DIAFu1tTlw1
[FMWK][10:17:48][INFO] Fetcher <create: 1731813 |update: 336386 |delete: 0>
[FMWK][10:17:48][DEBUG] Bulker stats - no. of docs indexed: 2052341, volume of docs indexed: 3642709800 bytes, no. of docs deleted: 0
[FMWK][10:17:48][CRITICAL] Connector job (ID: aJPBgIcBEquqKqFm0VQP) is not running but in status of JobStatus.ERROR.
Traceback (most recent call last):
  File "<path>/connectors-python/connectors/sync_job_runner.py", line 149, in execute
    await self.check_job()
  File "<path>/connectors-python/connectors/sync_job_runner.py", line 287, in check_job
    raise ConnectorJobNotRunningError(self.job_id, self.sync_job.status)
connectors.sync_job_runner.ConnectorJobNotRunningError: Connector job (ID: aJPBgIcBEquqKqFm0VQP) is not running but in status of JobStatus.ERROR.
[FMWK][10:17:48][INFO] Task is canceled, stop Fetcher...
[FMWK][10:17:48][INFO] Fetcher is stopped.
[FMWK][10:17:48][INFO] Task is canceled, stop Bulker...
[FMWK][10:17:48][INFO] Bulker is stopped.
Exception in callback ConcurrentTasks._callback(result_callback=None)(<Task finishe...tatus': 400})>)
handle: <Handle ConcurrentTasks._callback(result_callback=None)(<Task finishe...tatus': 400})>)>
Traceback (most recent call last):
  File "<path>/connectors-python/connectors/sync_job_runner.py", line 149, in execute
    await self.check_job()
  File "<path>/connectors-python/connectors/sync_job_runner.py", line 287, in check_job
    raise ConnectorJobNotRunningError(self.job_id, self.sync_job.status)
connectors.sync_job_runner.ConnectorJobNotRunningError: Connector job (ID: aJPBgIcBEquqKqFm0VQP) is not running but in status of JobStatus.ERROR.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "<path>/connectors-python/connectors/utils.py", line 315, in _callback
    raise task.exception()
  File "<path>/connectors-python/connectors/sync_job_runner.py", line 162, in execute
    await self._sync_done(sync_status=JobStatus.ERROR, sync_error=e)
  File "<path>/connectors-python/connectors/sync_job_runner.py", line 196, in _sync_done
    await self.sync_job.fail(sync_error, ingestion_stats=ingestion_stats)
  File "<path>/connectors-python/connectors/byoc.py", line 240, in fail
    await self._terminate(
  File "<path>/connectors-python/connectors/byoc.py", line 273, in _terminate
    await self.index.update(doc_id=self.id, doc=doc)
  File "<path>/connectors-python/connectors/es/index.py", line 71, in update
    return await self.client.update(
  File "<path>/connectors-python/lib/python3.10/site-packages/elasticsearch/_async/client/__init__.py", line 4586, in update
    return await self.perform_request(  # type: ignore[return-value]
  File "<path>/connectors-python/lib/python3.10/site-packages/elasticsearch/_async/client/_base.py", line 320, in perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
elasticsearch.BadRequestError: BadRequestError(400, 'mapper_parsing_exception', "failed to parse field [indexed_document_volume] of type [integer] in document with id 'aJPBgIcBEquqKqFm0VQP'. Preview of field's value: '3642709800'")

@danajuratoni Would be nice to address this one before GA.

@artem-shelkovnikov (Member)

I think we're overflowing the `integer` field for `indexed_document_volume` and need a bigger data type to be able to store the byte count. Additionally, we could switch to storing KB instead of B.
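The overflow hypothesis is easy to check against the value in the traceback. A minimal sketch (not part of the connectors codebase) comparing the reported byte count against Elasticsearch's `integer` range:

```python
# A minimal check of the suspected overflow: Elasticsearch's `integer`
# mapping type is a signed 32-bit value, so the byte count from the
# traceback above (13167265024) cannot be parsed into it.
ES_INTEGER_MAX = 2**31 - 1  # 2147483647, ~2 GiB when counting bytes

def fits_in_es_integer(value: int) -> bool:
    """Return True if the value fits Elasticsearch's `integer` range."""
    return -(2**31) <= value <= ES_INTEGER_MAX

print(fits_in_es_integer(13_167_265_024))  # False: the ~13 GB sync overflows
```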

@wangch079 (Member)

As @artem-shelkovnikov pointed out, we use `integer` for `indexed_document_volume`, which supports a maximum of 2^31-1. Since `indexed_document_volume` stores a number of bytes, the maximum is only around 2 GB. We can change it to `unsigned_long`, which supports a maximum of 2^64-1, around 18 exabytes (1 EB = 1000 PB).

cc @danajuratoni This will make any connector trying to sync a source with more than 2 GB of data fail, in Ruby (since 8.6) and Python (since 8.7.1). Do you think we should document it?

8.7.1 is not released yet, but I don't think this can be considered a blocker. We could fix it in 8.8
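As a sketch of the proposed change, the field layout below is illustrative (not the exact `.elastic-connectors-sync-jobs` system mapping), but it shows how `unsigned_long` lifts the ceiling past the values seen in the tracebacks:

```python
# Illustrative mapping fragments; only the `type` change matters here.
old_properties = {"indexed_document_volume": {"type": "integer"}}        # max 2**31 - 1
new_properties = {"indexed_document_volume": {"type": "unsigned_long"}}  # max 2**64 - 1

# unsigned_long raises the ceiling from ~2 GiB of bytes to ~18 EB:
# the failing value from the traceback fits comfortably in the new range.
assert 2**31 - 1 < 13_167_265_024 <= 2**64 - 1
print(new_properties["indexed_document_volume"]["type"])  # unsigned_long
```

Note that changing the type of an existing field generally requires a mapping update on new indices or a reindex; the snippet only illustrates the range difference.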

@wangch079 (Member)

Regarding the issue @ppf2 reported: the job saw no update for more than 60 seconds (it is supposed to receive a heartbeat every 10 seconds) and was marked as errored, even though it was actually still running.

This can happen when the job-reporting task gets no chance to run for more than 60 seconds, which is rare. I will look into this issue separately.
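The idle-detection behaviour described above can be sketched as follows; the names and structure here are hypothetical, not the actual connectors code:

```python
# Hypothetical sketch: a job that reports no update for more than 60 seconds
# is treated as no longer running, even though heartbeats are expected
# every 10 seconds, so ~6 missed heartbeats trigger the error state.
IDLE_TIMEOUT = 60        # seconds without a job update before it is flagged
HEARTBEAT_INTERVAL = 10  # expected reporting cadence

def is_job_idle(last_update_ts: float, now: float) -> bool:
    """Return True if the job has gone quiet past the idle timeout."""
    return now - last_update_ts > IDLE_TIMEOUT

# A job last updated 75 seconds ago is flagged idle even if still running.
print(is_job_idle(last_update_ts=1000.0, now=1075.0))  # True
```

This also explains why the problem is rare: the reporting task has to be starved for six full heartbeat intervals before the job is misclassified.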

wangch079 added a commit to elastic/kibana that referenced this issue Apr 17, 2023
## Summary

### Part of elastic/connectors#735

The field `indexed_document_volume` (in bytes) in
`.elastic-connectors-sync-jobs` is of type `integer`, which can hold a
maximum value of `2^31-1`, which is equivalent to 2-ish GB. This PR
changes it to `unsigned_long`, which can hold a maximum value of
`2^64-1`, which is equivalent to 18-ish Exa Bytes (1 Exa Byte = 1000
PB).



### Checklist

Delete any items that are not applicable to this PR.

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
@seanstory (Member) commented Apr 18, 2023

Based on this slack thread, I'm reverting the above 3 PRs.

Instead, in a separate set of PRs, Chenhui and I will change these fields from representing byte counts to representing MB counts. This raises our limit from ~2GB to ~2PB, which seems much less likely to constrain us.
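The unit change can be sketched as below; the helper name is hypothetical, but the arithmetic shows why storing megabytes instead of bytes in an `integer` field raises the representable volume from ~2 GB to ~2 PB:

```python
# Hedged sketch of the bytes-to-MB change described above.
ES_INTEGER_MAX = 2**31 - 1

def bytes_to_mb(volume_bytes: int) -> int:
    """Convert a raw byte count to whole megabytes (1 MB = 1024 * 1024 B)."""
    return volume_bytes // (1024 * 1024)

# The ~13 GB value that overflowed as a byte count fits easily as MB.
print(bytes_to_mb(13_167_265_024))  # 12557
assert bytes_to_mb(13_167_265_024) <= ES_INTEGER_MAX
# New ceiling: ES_INTEGER_MAX megabytes is well over 2 PB of bytes.
assert ES_INTEGER_MAX * 1024 * 1024 > 2 * 10**15
```

The trade-off is losing sub-megabyte precision in the stat, which is acceptable for a volume counter.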

@wangch079 (Member)

Regarding this issue: #735 (comment), I tested locally but couldn't reproduce it. I guess the sync was somehow stuck for more than 60 seconds, causing the job to be marked as idle.

wangch079 added a commit to elastic/kibana that referenced this issue Apr 20, 2023
## Part of elastic/connectors#735

## Summary

The field type for `indexed_document_volume` is `integer`, which can
only represent about 2GB worth of "bytes". To be able to support syncing
with larger datasets, `indexed_document_volume` is updated to store the
size in `MebiBytes`.

This PR makes sure the size is rendered correctly in UI.

### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
@wangch079 (Member)

Closing this issue as all the fixes have been merged.
