Is your feature request related to a problem? Please describe.
opendistro-for-elasticsearch/anomaly-detection#404
Describe the solution you'd like
Supporting multiple categorical fields requires changes in, among other things, feature aggregations and model naming. For aggregations, we will switch to composite aggregations, which natively collect entities from multiple fields in the same document; this lets us aggregate features over multiple categorical fields with minimal programming effort.
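As a rough sketch of the idea (the field names `service` and `host` and the page size are illustrative, not from the detector config), a composite aggregation over two categorical fields with the Elasticsearch high-level REST client could look like:

```java
import java.util.Arrays;
import java.util.List;

import org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregationBuilder;
import org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesSourceBuilder;
import org.elasticsearch.search.aggregations.bucket.composite.TermsValuesSourceBuilder;

public class EntityAggExample {
    public static CompositeAggregationBuilder entityAgg() {
        // One terms source per categorical field; the composite aggregation
        // emits one bucket per unique (service, host) combination found in
        // the same document.
        List<CompositeValuesSourceBuilder<?>> sources = Arrays.<CompositeValuesSourceBuilder<?>>asList(
            new TermsValuesSourceBuilder("service").field("service"),
            new TermsValuesSourceBuilder("host").field("host"));
        // size() sets the page size; subsequent pages are fetched by feeding
        // the response's afterKey() back into the builder.
        return new CompositeAggregationBuilder("entities", sources).size(1000);
    }
}
```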
Another change is to stop using categorical values directly in the model document Id. HCAD v1 uses the categorical value as part of the model document Id, but ES’s document Id can be at most 512 bytes. Categorical values are usually shorter than 256 characters, but a keyword value can in theory grow to 32766 bytes (https://stackoverflow.com/questions/47177163/what-is-the-maximum-length-for-keyword-type-in-elasticsearch). HCAD v1 skips an entity whose name exceeds 256 characters. We cannot do that in v2, as it could reject a lot of entities. To overcome the obstacle, we hash the categorical values to a 128-bit value (similar in spirit to how git names objects by their SHA-1 hash) and use the hash as part of the model document Id.
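A minimal sketch of the hashing idea, assuming MD5 as the 128-bit hash (the actual hash function and Id layout are not fixed here). Values are joined in a fixed field order with a separator so the same entity always maps to the same Id:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public final class ModelIds {
    // Hash the concatenated categorical values to 128 bits so the model
    // document Id stays well under Elasticsearch's 512-byte limit, no matter
    // how long the raw values are.
    static String entityModelId(String detectorId, List<String> categoricalValues)
            throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5"); // 128-bit digest
        // Join with a separator so ("ab","c") and ("a","bc") do not collide.
        String joined = String.join("\u0000", categoricalValues);
        byte[] digest = md5.digest(joined.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(32);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return detectorId + "_entity_" + hex; // Id layout is illustrative only
    }
}
```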
Describe alternatives you've considered
We have a choice to make regarding when to use the hash as part of a model document Id: for all HC detectors, or only for HC detectors with multiple categorical fields. The challenge lies in providing backward compatibility when looking up a model checkpoint for an HC detector with one categorical field. If we use hashes for all HC detectors, we need two get requests to check whether a model checkpoint exists: one with the un-hashed document Id and one with the hashed document Id. The dual get requests are inefficient. If we limit hashes to HC detectors with multiple categorical fields, there is no backward compatibility issue, but the code becomes branchy (see the sketch below). One may wonder whether backward compatibility can simply be ignored, since the old checkpoints will be gone after a transition period following the upgrade. During that transition period, however, HC detectors can experience unnecessary cold starts, as if the detectors had just started, and the checkpoint index size can double if every model has two model documents. The transition period can last as long as 3 days, since our checkpoint retention period is 3 days. There is no perfect solution. We prefer limiting hashes to HC detectors with multiple categorical fields, as that option has no customer impact.
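To illustrate the preferred option (names and Id layout are hypothetical), the checkpoint document Id would branch on the number of categorical fields, keeping v1-style Ids intact for single-field detectors:

```java
import java.security.NoSuchAlgorithmException;
import java.util.List;

public class CheckpointIds {
    // Preferred option: keep the v1 Id (raw categorical value) for
    // single-field detectors so existing checkpoints remain findable;
    // use the 128-bit hash only when there are multiple categorical fields.
    static String checkpointDocId(String detectorId, List<String> categoricalValues)
            throws NoSuchAlgorithmException {
        if (categoricalValues.size() == 1) {
            // v1-compatible Id: a single get request finds old and new
            // checkpoints alike, so no dual lookups are needed.
            return detectorId + "_entity_" + categoricalValues.get(0);
        }
        // Multi-field detectors are new, so there are no legacy Ids to match;
        // entityModelId is the hashing helper sketched earlier.
        return ModelIds.entityModelId(detectorId, categoricalValues);
    }
}
```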