
[ML] Improve the accuracy of model memory control #122

Merged
merged 7 commits into from
Jun 13, 2018

Conversation

@tveasey (Contributor) commented Jun 8, 2018

This makes a number of changes targeting our current memory control functionality. Specifically, these are:

  1. We store the data gatherer object by shared pointer, with one reference held by a CAnomalyDetectorModel object and one held by a CAnomalyDetector object. However, the memory was only accounted for by CAnomalyDetectorModel. Since the reference count is two, we were effectively halving its accounted memory. I've changed the CResourceMonitor to work in terms of CAnomalyDetector objects, so we now account for both references to the data gatherer. Incidentally, this also means we account for the static size of CAnomalyDetector, which was also being lost by the resource monitor. The impact can be large, especially for population models: for example, in CAnomalyJobLimitTest::testModelledEntityCountForFixedMemoryLimit we model 45% fewer over field values as a result.
  2. We were unnecessarily duplicating state in CDataGatherer. For example, we have access to the partition field name via the search key, so we don't need a copy in this class as well. (Note that this also reduces the number of parameters to the constructor, which had quite a lot of fallout on the unit tests.)
  3. The initial memory assumed per by-field was out of date, and I've adjusted it accordingly.
  4. The model memory is not static. In particular, it changes as cyclic components are added when they are detected, and as the compressibility of parts of the state changes. This means we typically underestimate model memory near the start of the job, so we create too many models early on and eventually exceed the memory limit (if the job won't ultimately fit in memory without limiting). This effect can be large: I observed up to a 30% overshoot above the memory limit in the worst case that all field values are present at the start of the data set. I now apply a time-decaying margin to the memory limit at startup, so early on we only create additional models gradually as we approach the hard limit.

As a result we now have accurate control of the true memory (I measured consistently within 5% in a unit test with a variety of realistic data characteristics).

Note that I separated the test changes (mainly fallout from the CDataGatherer constructor signature change, plus some tidy-ups) from the functional code changes.

This affects results for memory limited jobs only.

a function of elapsed time not buckets and assert on memory used vs
target in limit test.
@tveasey tveasey force-pushed the bug/memory-accounting branch from 764246f to f927102 Compare June 8, 2018 14:43
@dimitris-athanasiou (Contributor) left a comment

Looks good! Left a few minor comments.

~CDataGatherer();
CDataGatherer(const CDataGatherer&) = delete;
Contributor

That's a cool language feature!

result += core::CMemory::dynamicSize(window);
// The 0.3 is a rule-of-thumb estimate of the worst case
// compression ratio we achieve on the test state.
result += 0.3 * core::CMemory::dynamicSize(window);
Contributor

Don't we only compress on persist?

@tveasey (Contributor, Author) commented Jun 12, 2018

In fact, no. As of #100, we compress the raw bytes of some of this object's state that we actually hold in memory.

Contributor

Ah, I recall you saying we'd do that but I missed the fact it's already in. Cool.

@@ -95,7 +95,7 @@ CAnomalyDetectorModel::CAnomalyDetectorModel(bool isForPersistence,
: // The copy of m_DataGatherer is a shallow copy. This would be unacceptable
// if we were going to persist the data gatherer from within this class.
// We don't, so that's OK, but the next issue is that another thread will be
// modifying the data gatherer m_DataGatherer points to whilst this object
// modifying the data gatherer m_DataGatherer points too whilst this object
Contributor

This seems like a typo

@tveasey (Contributor, Author) commented Jun 12, 2018

It was actually a typo before: the "to[o]" in this context means "as well", which is spelled "too" rather than "to".

Contributor

It's still kind of a weird sentence but fair enough!

Contributor Author

Wait a sec, I'm sorry: you're actually completely right. I'd somehow (repeatedly) misread!

@@ -478,9 +478,7 @@ void CEventRateModel::debugMemoryUsage(core::CMemoryUsage::TMemoryUsagePtr mem)
}

std::size_t CEventRateModel::memoryUsage() const {
return this->CIndividualModel::memoryUsage() +
core::CMemory::dynamicSize(m_InterimBucketCorrector);
Contributor

Where is the memory for the interim bucket corrector accounted for now?

@tveasey (Contributor, Author) commented Jun 12, 2018

It is getting accounted for in this->CEventRateModel::computeMemoryUsage(). This is all tied up with the model memory estimation process we use, i.e. measuring the computed memory usage periodically and then using a regression on those measurements. The extra memory used is effectively accounted for in that regression's parameters.

// will be the overwhelmingly common source of additional memory
// so the model memory should be accurate (on average) in this
// time frame.
double scale{1.0 - static_cast<double>(elapsedTime) /
Contributor

It seems the scale is fixed for a given bucket span. Should we consider setting it on construction?

Contributor Author

We could, given current usage. Although this is currently called at the bucket frequency, that doesn't have to be the case, and we'd lose the flexibility to call it differently if we set the scale on construction. Also, I think moving this out of the API that adjusts the margin hides an important part of the implementation from the caller: if I changed how this function is called, it would be easy to overlook that the value computed in the constructor also needs changing. On balance I prefer to keep it as is for that reason. What do you think?

Contributor

I agree. This gives us flexibility to change the way it works if needed in the future.

@dimitris-athanasiou (Contributor) left a comment

LGTM

tveasey added a commit to elastic/elasticsearch that referenced this pull request Jun 13, 2018
…31289)

To avoid temporary failures, this also disables these tests until elastic/ml-cpp#122 is committed.
@tveasey tveasey merged commit fae7f38 into elastic:master Jun 13, 2018
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Jun 14, 2018
tveasey added a commit to elastic/elasticsearch that referenced this pull request Jun 14, 2018
…31289)

To avoid temporary failures, this also disables these tests until elastic/ml-cpp#122 is committed.
@tveasey tveasey deleted the bug/memory-accounting branch March 22, 2019 09:55