
7.7.0 bug with _search and idle shards #57006

Closed
awick opened this issue May 20, 2020 · 10 comments
Labels
>bug :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@awick

awick commented May 20, 2020

Elasticsearch version (bin/elasticsearch --version): Version: 7.7.0, Build: default/tar/81a1e9eda8e6183f5237786246f6dced26a10eaf/2020-05-12T02:01:37.602180Z, JVM: 14

Plugins installed: [] None

JVM version (java -version): Included

OS version (uname -a if on a Unix-like system): Linux moloches01 3.10.0-1062.12.1.el7.YAHOO.20200205.52.x86_64 #1 SMP Wed Feb 5 22:45:50 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

7.7.0 no longer seems to wait for idle shards to be refreshed before responding to a _search query; 7.6.2 and earlier versions did not have this issue.

  • I have an index (dstats_v4) in which I constantly overwrite documents (1440 documents per moloch node, written every 5 seconds, wrapping around so the same doc ids are always reused)
  • It does NOT have a refresh_interval set, so it should be taking advantage of the search-idle shard feature (https://www.elastic.co/guide/en/elasticsearch/reference/7.7/index-modules.html#dynamic-index-settings)
  • If we don't search this index for a long period of time and then run a search, only some of the documents we know are there are returned. For example, a _search?q=molochnode:test should ALWAYS return 1440 documents, but we randomly get around 600-800. It almost looks like every other document is returned, so it is NOT just the documents written during the period of not searching. (I'm wondering if there is some idle + number-of-shards issue here, since we are using 2 shards.)
  • If we repeat the search several times, it will eventually return all 1440 documents and then keep working
  • As a workaround, Mark had us set refresh_interval to 1s, and that always works, even after waiting a long period of time (see the sketch after this list)
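
For reference, a minimal sketch of that workaround (not from the original report): explicitly setting refresh_interval on the index via the update-settings API, assuming the cluster is reachable at localhost:9200.

curl -X PUT "localhost:9200/dstats_v4/_settings" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "refresh_interval" : "1s"
  }
}
'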

Here is our _settings

      "dstats_v4" : {
        "settings" : {
          "index" : {
            "number_of_shards" : "2",
            "auto_expand_replicas" : "0-3",
            "provided_name" : "dstats_v4",
            "creation_date" : "1565358628829",
            "priority" : "50",
            "number_of_replicas" : "3",
            "uuid" : "muB0pZV4SYywXmxBcJEEGA",
            "version" : {
              "created" : "6080299",
              "upgraded" : "7040299"
            }
          }
        }
      }
    }```


Steps to reproduce:

Haven't been able to reproduce this with a standalone script yet.


@awick awick added >bug needs:triage Requires assignment of a team area label labels May 20, 2020
@dnhatn
Member

dnhatn commented May 20, 2020

@awick Thank you for reporting the issue. Can you share the search query and the mapping of the index?

@dnhatn dnhatn added :Search/Search Search-related issues that do not fall into other categories and removed needs:triage Requires assignment of a team area label labels May 20, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label May 20, 2020
@dnhatn dnhatn added :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. and removed Team:Search Meta label for search team labels May 20, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Engine)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 20, 2020
@awick
Author

awick commented May 21, 2020

_mapping

{"dstats_v4":{"mappings":{"dynamic":"true","dynamic_templates":[{"numeric":{"match_mapping_type":"long","mapping":{"index":false,"type":"long"}}},{"noindex":{"match":"*","mapping":{"index":false}}}],"properties":{"closeQueue":{"type":"long","index":false},"cpu":{"type":"long","index":false},"currentTime":{"type":"date","format":"epoch_second"},"deltaBytes":{"type":"long","index":false},"deltaDropped":{"type":"long","index":false},"deltaESDropped":{"type":"long","index":false},"deltaFragsDropped":{"type":"long","index":false},"deltaMS":{"type":"long","index":false},"deltaOverloadDropped":{"type":"long","index":false},"deltaPackets":{"type":"long","index":false},"deltaSessionBytes":{"type":"long","index":false},"deltaSessions":{"type":"long","index":false},"deltaUnwrittenBytes":{"type":"long","index":false},"deltaWrittenBytes":{"type":"long","index":false},"diskQueue":{"type":"long","index":false},"esHealthMS":{"type":"long","index":false},"esQueue":{"type":"long","index":false},"espSessions":{"type":"long","index":false},"frags":{"type":"long","index":false},"fragsQueue":{"type":"long","index":false},"freeSpaceM":{"type":"long","index":false},"freeSpaceP":{"type":"float","index":false},"hostname":{"type":"text","index":false},"icmpSessions":{"type":"long","index":false},"interval":{"type":"short"},"memory":{"type":"long","index":false},"memoryP":{"type":"float","index":false},"monitoring":{"type":"long","index":false},"needSave":{"type":"long","index":false},"nodeName":{"type":"keyword"},"otherSessions":{"type":"long","index":false},"packetQueue":{"type":"long","index":false},"sctpSessions":{"type":"long","index":false},"tcpSessions":{"type":"long","index":false},"totalDropped":{"type":"long","index":false},"totalK":{"type":"long","index":false},"totalPackets":{"type":"long","index":false},"totalSessions":{"type":"long","index":false},"udpSessions":{"type":"long","index":false},"usedSpaceM":{"type":"long","index":false},"ver":{"type":"text","index":false}}}}}

Sample options as sent to the old JavaScript API (note we use an alias of dstats -> dstats_v4); a roughly equivalent raw request is sketched after the body below.

{
 "index": "dstats",
 "body": {
  "query": {
   "bool": {
    "filter": [
     {
      "range": {
       "currentTime": {
        "from": "1590050255",
        "to": "1590057455"
       }
      }
     },
     {
      "term": {
       "interval": "5"
      }
     },
     {
      "term": {
       "nodeName": "THENODENAME"
      }
     }
    ]
   }
  },
  "sort": {
   "currentTime": {
    "order": "desc"
   }
  },
  "size": 1440,
  "_source": [
   "deltaPackets",
   "deltaMS",
   "nodeName",
   "currentTime"
  ],
  "profile": false
 },
 "rest_total_hits_as_int": true,
 "filter_path": "_scroll_id,hits.total,hits.hits._source"
}
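
For anyone trying to reproduce this outside the old JavaScript client, a roughly equivalent raw request, as a sketch only (localhost:9200 is an assumption; the query body and parameters mirror the options above):

curl -X GET "localhost:9200/dstats/_search?rest_total_hits_as_int=true&filter_path=_scroll_id,hits.total,hits.hits._source" \
  -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "currentTime": { "from": "1590050255", "to": "1590057455" } } },
        { "term": { "interval": "5" } },
        { "term": { "nodeName": "THENODENAME" } }
      ]
    }
  },
  "sort": { "currentTime": { "order": "desc" } },
  "size": 1440,
  "_source": [ "deltaPackets", "deltaMS", "nodeName", "currentTime" ]
}
'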

Sample response: 718 total hits, while it should be ~1440

 "hits": {
  "total": 718,
  "hits": [
   {
    "_source": {
     "nodeName": "THENODENAME",
     "currentTime": 1590057454,
     "deltaPackets": 1336481,
     "deltaMS": 4636
    }
   },
   {
    "_source": {
     "nodeName": "THENODENAME",
     "currentTime": 1590057449,
     "deltaPackets": 1434213,
     "deltaMS": 5099
    }
   },
...

Here is a picture of how we noticed the problem. I hadn't viewed the page for at least 6 hours. Each line is a dstats query. The first 6 are messed up and the rest aren't. My assumption is only the first 6 are messed up because the browser does 6 at once and then waits before doing the next 6, by which time ES has "background refreshed". A reload of the page instantly fixes the first 6.

[screenshot attached to the original issue]

I did a bunch of _cat stuff before/after

/_cat/indices/dstats_v4?v - the deleted docs get cleaned up

health status index     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   dstats_v4 muB0pZV4SYywXmxBcJEEGA   2   3      73440        11369    242.5mb         38.1mb (before)
green  open   dstats_v4 muB0pZV4SYywXmxBcJEEGA   2   3      73440            4    269.4mb         67.1mb (after)

/_cat/shards/dstats_v4?v

index     shard prirep state    docs  store
dstats_v4 0     p      STARTED 36818   18mb
dstats_v4 0     r      STARTED 36818 33.3mb
dstats_v4 1     r      STARTED 36622 33.8mb
dstats_v4 0     r      STARTED 36818 34.9mb
dstats_v4 1     r      STARTED 36622 33.2mb
dstats_v4 1     p      STARTED 36622 20.1mb
dstats_v4 1     r      STARTED 36622 34.7mb
dstats_v4 0     r      STARTED 36818 34.1mb

index     shard prirep state    docs  store
dstats_v4 0     p      STARTED 36818 33.6mb
dstats_v4 0     r      STARTED 36818 33.8mb
dstats_v4 1     r      STARTED 36622 33.5mb
dstats_v4 0     r      STARTED 36818 33.8mb
dstats_v4 1     r      STARTED 36622 33.6mb
dstats_v4 1     p      STARTED 36622 33.5mb
dstats_v4 1     r      STARTED 36622 33.6mb
dstats_v4 0     r      STARTED 36818 33.7mb

/_cat/segments/dstats_v4?v - here is the before/after of just one of the primary shards; let me know if you want more

index     shard prirep ip            segment generation docs.count docs.deleted   size size.memory committed searchable version compound
dstats_v4 0     p               .153 _8xkm       416758      31112         5706  15.6mb        3788 true      true       8.5.1   false
dstats_v4 0     p               .153 _8xkn       416759          1            0  10.3kb           0 true      false      8.5.1   true
dstats_v4 0     p               .153 _8xl7       416779       5692            0   2.2mb        3612 false     true       8.5.1   false
dstats_v4 0     p               .153 _8xl8       416780          4            0  12.3kb        2908 false     true       8.5.1   true
dstats_v4 0     p               .153 _8xl9       416781          1            0  10.3kb        1516 false     true       8.5.1   true
dstats_v4 0     p               .153 _8xla       416782          3            0  11.7kb        2668 false     true       8.5.1   true
dstats_v4 0     p               .153 _8xlb       416783          1            0  10.3kb        1516 false     true       8.5.1   true
dstats_v4 0     p               .153 _8xlc       416784          1            0  10.3kb        1516 false     true       8.5.1   true
dstats_v4 0     p               .153 _8xld       416785          2            0  10.8kb        1516 false     true       8.5.1   true
dstats_v4 0     p               .153 _8xle       416786          2            0  10.8kb        1516 false     true       8.5.1   true


index     shard prirep ip            segment generation docs.count docs.deleted   size size.memory committed searchable version compound
dstats_v4 0     p               .153 _8xkm       416758      31112         5706 15.6mb        3788 false     true       8.5.1   false
dstats_v4 0     p               .153 _8xl7       416779       5692            0  2.2mb        3612 false     true       8.5.1   false
dstats_v4 0     p               .153 _8xl8       416780          4            0 12.3kb        2908 false     true       8.5.1   true
dstats_v4 0     p               .153 _8xl9       416781          1            0 10.3kb        1516 false     true       8.5.1   true
dstats_v4 0     p               .153 _8xla       416782          3            0 11.7kb        2668 false     true       8.5.1   true
dstats_v4 0     p               .153 _8xlb       416783          1            0 10.3kb        1516 false     true       8.5.1   true
dstats_v4 0     p               .153 _8xlc       416784          1            0 10.3kb        1516 false     true       8.5.1   true
dstats_v4 0     p               .153 _8xld       416785          2            0 10.8kb        1516 false     true       8.5.1   true
dstats_v4 0     p               .153 _8xle       416786          2            0 10.8kb        1516 false     true       8.5.1   true
dstats_v4 0     p               .153 _8xlj       416791      36818            0 15.6mb        3788 true      false      8.5.1   false

@awick
Author

awick commented May 25, 2020

To make sure this wasn't an issue with the index having started life on a 6.8 cluster, I tried another cluster I'm running that started life as 7.6.2, and it has the same issue. This new cluster is also fairly lightly loaded, so I don't think it is a load issue either.

{
  "dstats_v4" : {
    "settings" : {
      "index" : {
        "number_of_shards" : "2",
        "auto_expand_replicas" : "0-3",
        "provided_name" : "dstats_v4",
        "creation_date" : "1587125029032",
        "priority" : "50",
        "number_of_replicas" : "3",
        "uuid" : "_dayuNqkTs22yTyuPiWNBg",
        "version" : {
          "created" : "7060299",
          "upgraded" : "7070099"
        }
      }
    }
  }
}

@dnhatn
Member

dnhatn commented May 25, 2020

@awick Is it also an issue on 7.6.2?

@awick
Author

awick commented May 25, 2020

No, the problem is only with 7.7.0; 7.6.2 works fine. My comment above was only showing that an index that started out on 7.6.2 and moved to 7.7.0 still has the same issue. My original bug report was for an index that started as 6.8.2 and went through multiple versions before reaching 7.7.0.

@dnhatn
Member

dnhatn commented May 25, 2020

Thanks @awick. That's helpful. I will take a look. I think this issue was introduced in #49601 (/cc @jimczi).

@dnhatn dnhatn added Team:Search Meta label for search team and removed Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels May 25, 2020
@jimczi
Contributor

jimczi commented May 26, 2020

The issue was introduced with #53873, which sets pre_filter_shard_size to 1 if the primary sort is on a field. This should be fixed by #55428, but as a workaround you can explicitly set pre_filter_shard_size to a large value on the search request until the fix is released (a sketch follows).
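
A minimal sketch of that workaround (not from the original thread), assuming a cluster at localhost:9200; 1024 is just an arbitrarily large example value:

curl -X GET "localhost:9200/dstats/_search?pre_filter_shard_size=1024&rest_total_hits_as_int=true" \
  -H 'Content-Type: application/json' -d'
{
  "query": { "term": { "nodeName": "THENODENAME" } },
  "sort": { "currentTime": { "order": "desc" } },
  "size": 1440
}
'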

@jimczi
Contributor

jimczi commented May 26, 2020

We've decided to backport the fix for this bug to 7.7.1, so we can close the issue now. Thanks for reporting it, @awick.

@jimczi jimczi closed this as completed May 26, 2020
mmguero added a commit to cisagov/Malcolm that referenced this issue Jun 25, 2020
* Revert "updated elasticsearch version" due to discovery of elastic/elasticsearch#57006; should be fixed in 7.7.1

This partially reverts commit 4beaa09.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team
Projects
None yet
Development

No branches or pull requests

4 participants