AggregateException/InvalidOperationException when training from IEnumerable backed by EF Core Context #2159

endintiers · 2019-01-16T04:00:27Z

System information

.Net Core 2.2, EF Core 2.2.1, ML.NET 0.9

Issue

Read data from SQL DB via EF Core as IEnumerable, pass to pipeline via streaming IDataView
Get InvalidOperationException from Microsoft.ML.Transforms.RowShufflingTransformer.Cursor.LoopProducerWorker()
It worked fine in ML.NET 0.8. I can work-around by forcing the trainer to be single-threaded but this makes training in 0.9 much slower than in 0.8 (on my machine).

The EF Core problem could be addressed by allowing multiple dbcontexts to be provided but this is likely to be an issue with any non thread-safe IDataView source...

Source code / logs

Source at: https://github.com/endintiers/Unearth.Demo.ML.FromDB shows this working in 0.8, failing in 0.9 and a work-around for 0.9 (single-threading the training).

Ivanidzo4ka · 2019-01-16T22:48:24Z

Thank you for reporting this.
We need to investigate why we try to shuffle data even if it's comes from StreamingDataView which shouldn't be shuffle-able.
If you need immediate patch for current problem, i would suggest you to add AppendCacheCheckpoint to the end of your dataProcessPipeline.

var dataProcessPipeline = mlContext.Transforms.Conversion.MapValueToKey("IATACode", "Label")
                    .Append(mlContext.Transforms.Text.FeaturizeText("FlightCode", "FlightCodeFeaturized"))
                    .Append(mlContext.Transforms.Concatenate("Features", "FlightCodeFeaturized")).
                    .AppendCacheCheckpoint();

this way we would read all cache all data in memory, which would speed up training, and also allow us properly shuffle data during training.

endintiers · 2019-01-17T00:22:27Z

OK, I see what AppendCacheCheckpoint does - it caches the whole Enumerable (gradually) so the cache is used for re-reads/shuffles.

The 'Airlines' data is just some public test data I can use for demos. I have more than a billion rows of real data in an Azure SQL DB that I am using in my day job. Even if I'm training on a small subset of this (I'm using 1%) I don't think caching it in memory is a good idea... I'd like to be able to give the trainer a 100K rows at a time or somesuch and dispose of it before moving on to the next tranche.

I have been trying to figure out a fix for myself but there is a lot of code to understand... Could it be that the IDataView needs to determine if CanShuffle is true for this source? But since I used 'Where' the underlying Enumerable is seen as a Linq object? So therefore (for now) ensure that CanShuffle is false for all Streaming IDataViews… Maybe later [insert clever code here] make more nuanced decisions...

endintiers · 2019-01-17T10:35:01Z

The glass is still cloudy, but I have noticed that the LinearTrainerBase.PrepareDataFromTrainingExamples method in SdcaBinary.cs always sets ForceShuffle to be true when constructing it's RowShufflingTransformer. It ignores the original streaming dataview's CanShuffle flag (which is always false for any streaming dataview).

That might make sense - the RowShufflingTransformer is supposed to abstract away the implementation of shuffles. And it works for other streaming sources and the trainer WANTS shuffled data to work good.

The actual problem is that EF Core DBContexts aren't thread-safe, so if cursors from multiple threads exhaust the pool and more than one tries to access the Enumerable to get more: 'boom'. Maybe the answer is to add another flag to the source IDataView ('IsNotThreadSafe') and then implement a lock when fetching more pool data (in that specific case)?

CESARDELATORRE · 2019-07-15T23:07:13Z

Adding @eerhardt - Please, take a look to this issue. Not directly related to the current DatabaseLoader we're implementing, but we should fix LoadFromEnumerable() when possible, as well.

Adding also @divega who mentioned this bug in a related discussion, as well.

LoadFromEnumerable doesn't always work properly because it assumes that the Enumerable-backed IDataView is thread-safe - this is not always the case, like when using EF Core DbContext.

Related specs PR: #3857

@endintiers proposal is the following:
So I propose starting by adding a switch to LoadFromEnumerable 'singleThreaded=true' which would intercept IDataView cache update requests and force them onto a single thread. This may slow down the trainers a bit but nowhere near as much as single-threading them (where possible) and we can improve performance with eager loading.
Doing that first would remove a lot of complexity from the DatabaseLoader task (threading no longer an issue).

Thoughts?

endintiers · 2019-07-16T00:40:30Z

@CESARDELATORRE I wouldn't say it's totally unrelated to DatabaseLoader. In my mind the obvious thing is for DatabaseLoader to leverage the existing stack under LoadFromEnumerable and various parts of IDataView and the cache loader... Unless there is some reason that generic database as a source is fundamentally different? But, yes, depending on what the spec said, assuming that the IEnumerable implementation is thread-safe is a bug.

eerhardt · 2019-07-19T15:25:41Z

The exception I get with the original code (using EF Core 2.2) says:

A second operation started on this context before a previous operation completed. This is usually caused by different threads using the same instance of DbContext, however instance members are not guaranteed to be thread safe. This could also be caused by a nested query being evaluated on the client, if this is the case rewrite the query avoiding nested invocations.

See https://github.com/aspnet/EntityFramework.Docs/blob/master/entity-framework/core/miscellaneous/configuring-dbcontext.md#avoiding-dbcontext-threading-issues for more details.

In ML.NET you are using the SDCA algorithm, which uses multiple threads by default. This is in direct conflict with the data source that you are reading from which is prohibiting multi-threaded access.

I don't see how something in ML.NET could solve this, there is an inherent mismatch between the algorithm and the data source. You will have to handle it on your own by doing one or more of the following:

Create a new DbContext for each enumeration. See our sample of how to do this here: https://github.com/dotnet/machinelearning-samples/blob/874a1fec13bbd51c543755048a4da6cfdd618d3a/samples/csharp/getting-started/DatabaseIntegration/DatabaseIntegration/Program.cs#L44-L60
Use in-memory caching, as @Ivanidzo4ka pointed out.
You can cache the data from the DB to an .idv file locally, and then perform training on the local data. (see mlContext.Data.SaveAsBinary().)
Change the SDCA options to only use a single thread.
Use a different algorithm that isn't multi-threaded by default. For example, LightGbm.

As for it working in EF Core 3.0, I see it working as well, but I am not certain it is intended given the above avoiding-dbcontext-threading-issues link. It looks like there is a new query pipeline in EF Core 3.0. See:

dotnet/efcore#14455
dotnet/efcore#15913

@divega @ajcvickers - any comment here? Should EF Core 3.0 be throwing an exception when multiple threads are accessing the same DbContext, like it did in earlier versions?

divega · 2019-07-19T21:14:02Z

@eerhardt, @CESARDELATORRE we had a very long discussion with Tom and others about this when we were preparing the sample at https://github.com/dotnet/machinelearning-samples/tree/master/samples/csharp/getting-started/DatabaseIntegration.

Notice that in https://github.com/dotnet/machinelearning-samples/blob/874a1fec13bbd51c543755048a4da6cfdd618d3a/samples/csharp/getting-started/DatabaseIntegration/DatabaseIntegration/Program.cs#L44-L60 we make sure each new enumeration of the data happen on a separate DbContext and DbConnection.

I would recommend that we point customers to this sample when questions like this arise, since we already made the investment to work out all the major kinks there.

Re the fact that ML.NET makes the assumption that IEnumerable<T> instances are thread safe, the more I think about it, the more I am convinced that it is just a bug. It is only true for a subset of implementations (e.g. in-memory collections that are not being modified) but as @ajcvickers likes to remind me, all abstractions in .NET are not thread-safe unless explicitly stated.

I think besides pointing people to the existing sample, other things the ML.NET team could consider are:

Document the special assumptions made in the LoadFromEnumerable() API and point to a fwlink.
Deprecate the current LoadFromEnumerable() and create a new overload that takes a factory argument of type Func<IEnumerable<T>>. This gives the user a hint (although maybe not too obvious) that a new IEnumerable<T> should be created every time, although it is simple enough to just pass always the same IEnumerable<T> instance if the user knows that the data is already cached in memory.
You could even have a LoadFromInMemoryEnumerable() method with the current signature of LoadFromEnumerable() for the simple scenarios.

divega · 2019-07-19T22:07:29Z

Re how this impacts DatabaseLoader, although it shouldn't need to use LoadFromEnumerable() because going directly from a DbDataReader to an IDataView should be more efficient, we will run into the same fundamental problem because DbConnection implementations are not thread-safe, and even worse, many of them don't even support MARS (multiple active result sets).

ajcvickers · 2019-07-19T22:10:45Z

100% agree with @divega

endintiers · 2019-07-20T01:28:04Z

@eerhardt >> I don't see how something in ML.NET could solve this
The same way that WinForms solved the UI update thread problem - marshalling.
If an IEnumerable is marked as single-threaded then all cache update requests should be intercepted, the cache requesting thread goes into Wait, the request is marshalled to the original thread that dispatched the IEnumerable (which has been waiting for this), the data is retrieved, the requesting thread is woken.
This will result in slightly slower training for multi-threaded models, but the flag in LoadFromEnumerable need only be set for cases where this is an issue.

Something like that anyway :-) There has to be a queue for cache update requests.

endintiers · 2019-07-20T01:38:07Z

@divega @ajcvickers >> need to use LoadFromEnumerable()
DatabaseLoader doesn't need to use the top level LoadFromEnumerable code but will quickly start to depend on code further down (which is multi-threaded). The issue will be the same. A decision needs to be made: allow single-threaded cache sources and code for that or specifically disallow them (maybe for V1?).
Developers will still use LoadFromEnumerable to access database-backed/single-threaded IEnumerables when DatabaseLoader is available, so the same can be said for LoadFromEnumerable (fix or document)

Although (thinking) since all that DatabaseLoader is doing is trying to make it easier for devs to create an IEnumerable (the end game here) from some database, adding an additional method at the IDataView level could be seen as needlessly making that interface more complex. The 'database wizard' stuff could easily sit on top of LoadFromEnumerable? (am I missing some subtlety here?)

divega · 2019-07-20T17:12:58Z

DatabaseLoader doesn't need to use the top level LoadFromEnumerable code but will quickly start to depend on code further down (which is multi-threaded). The issue will be the same.

@endintiers Maybe I wasn't clear, but this is exactly what I meant when I said:

we will run into the same fundamental problem because DbConnection implementations are not thread-safe...

The thing is that with most databases using multiple connections can be a cheaper approach than marshalling, at least from the perspective the number of changes that user code or ML.NET may need (finding out if it is ultimately more efficient at runtime would require some testing, but database drivers are usually optimized to be used this way).

The 'database wizard' stuff could easily sit on top of LoadFromEnumerable?

Possibly what you are missing is why we want to skip IEnumerable<T> and LoadFromEnumerable in the database loader: DbDataReader and IDataView are similar in that they work like sliding windows over the data. Neither has the concept of a .Current object as IEnumerator<T> does, hence when reading data there is no need to allocate and initialize an object to be returned by it on every iteration. Going from DbDataReader to IDataView directly can save a lot of runtime work and allocations.

eerhardt · 2019-07-24T16:25:00Z

Thanks for the pointer to the sample, @divega. I didn't know we already had this. I've updated my list above to include the link to this sample.

I've opened #4036 on docs.microsoft.com to explicitly document this as a limitation/concern of LoadFromEnumerable.

CESARDELATORRE · 2019-07-25T17:04:32Z

@eerhardt - But I think that documenting it and pointing to a sample is just a short term workaround.

The fact that ML.NET makes the assumption that IEnumerable instances are thread safe feels completely like a bug. The way to fix a bug is by fixing the API implementation not making possible that users will get into trouble. Documenting a workaround for a bug is only a short term workaround.

As suggested by Diego, I also believe we need to change/fix the API implementation so it won't be possible for the user to get into this trouble, either by:

(Pointed by @divega) Deprecate the current LoadFromEnumerable() and create a new overload that takes a factory argument of type Func<IEnumerable>. This gives the user a hint (although maybe not too obvious) that a new IEnumerable should be created every time, although it is simple enough to just pass always the same IEnumerable instance if the user knows that the data is already cached in memory.
(Pointed by @divega) You could even have a LoadFromInMemoryEnumerable() method with the current signature of LoadFromEnumerable() for the simple scenarios.
(Pointed by @endintiers ) Marking and using the IEnumerable as single-threaded. This might slow down the training process multi-threaded models, but the flag in LoadFromEnumerable need only be set for cases where this is an issue.

Also, for better performance we're going to have the new DatabaseLoader (for relational databases). However, for non-SQL databases or other data sources we still need the LoadFromEnumerable() approach. It is not something we can simply deprecate.

My point is that it is better if we create "fences/guards" in the API itself than just documenting it and show a sample because the chances that developers will get into this trouble (like when using EF if not following the example we created) are pretty high...

CESARDELATORRE · 2019-07-26T00:37:55Z

@endintiers - Question related to SQL databases but in a different page researching possible customer scenarios and differences in features and issues depending on that:

When you use a database to train ML.NET models, what RDBMS do you use? SQL Server? Other? Do you use that database in Azure such as using Azure SQL Database? Or do you use SQL Server or any other RDBMS on-prem?

This question is not directly related to this issue so we can follow up by email if you'd like. Can you send me an email to 'cesardl at microsoft.com' or through Twitter to @CESARDELATORRE when possible? - Thanks! :)

I'll remove this comment as soon as we're talking offline. :)

frank-dong-ms-zz · 2020-06-11T23:15:49Z

This issue is already documented and we suggest that IEnumerable needs to be thread-safe, I would like to mark this as enhancement and mark this as P2 CC @harishsk

endintiers changed the title ~~InvalidOperationException when training from IEnumerable backed by EF Core Context~~ AggregateException/InvalidOperationException when training from IEnumerable backed by EF Core Context Jan 16, 2019

Ivanidzo4ka mentioned this issue Feb 13, 2019

V1 Scenarios need to be covered by tests #2498

Open

CESARDELATORRE added the bug Something isn't working label Jul 15, 2019

eerhardt mentioned this issue Jul 24, 2019

Need to explicitly call out that the IEnumerable needs to be thread-safe #4036

Closed

CESARDELATORRE mentioned this issue Jul 26, 2019

Need to explicitly call out that the IEnumerable needs to be thread-safe dotnet/docs#13570

Closed

CESARDELATORRE mentioned this issue Oct 30, 2019

DatabaseSource requires new SqlConnection even if an open Connection exists! dotnet/machinelearning-samples#727

Closed

harishsk added the P1 Priority of the issue for triage purpose: Needs to be fixed soon. label Jan 22, 2020

harishsk added loadsave Bugs related loading and saving data or models lightgbm Bugs related lightgbm labels Apr 29, 2020

frank-dong-ms-zz assigned frank-dong-ms-zz and unassigned frank-dong-ms-zz Jun 11, 2020

frank-dong-ms-zz removed P1 Priority of the issue for triage purpose: Needs to be fixed soon. bug Something isn't working lightgbm Bugs related lightgbm labels Jun 11, 2020

frank-dong-ms-zz added enhancement New feature or request P2 Priority of the issue for triage purpose: Needs to be fixed at some point. and removed loadsave Bugs related loading and saving data or models labels Jun 11, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AggregateException/InvalidOperationException when training from IEnumerable backed by EF Core Context #2159

AggregateException/InvalidOperationException when training from IEnumerable backed by EF Core Context #2159

endintiers commented Jan 16, 2019 •

edited

Loading

Ivanidzo4ka commented Jan 16, 2019 •

edited

Loading

endintiers commented Jan 17, 2019 •

edited

Loading

endintiers commented Jan 17, 2019 •

edited

Loading

CESARDELATORRE commented Jul 15, 2019

endintiers commented Jul 16, 2019 •

edited

Loading

eerhardt commented Jul 19, 2019 •

edited

Loading

divega commented Jul 19, 2019 •

edited

Loading

divega commented Jul 19, 2019 •

edited

Loading

ajcvickers commented Jul 19, 2019

endintiers commented Jul 20, 2019 •

edited

Loading

endintiers commented Jul 20, 2019 •

edited

Loading

divega commented Jul 20, 2019 •

edited

Loading

eerhardt commented Jul 24, 2019

CESARDELATORRE commented Jul 25, 2019

CESARDELATORRE commented Jul 26, 2019

frank-dong-ms-zz commented Jun 11, 2020 •

edited

Loading

AggregateException/InvalidOperationException when training from IEnumerable backed by EF Core Context #2159

AggregateException/InvalidOperationException when training from IEnumerable backed by EF Core Context #2159

Comments

endintiers commented Jan 16, 2019 • edited Loading

System information

Issue

Source code / logs

Ivanidzo4ka commented Jan 16, 2019 • edited Loading

endintiers commented Jan 17, 2019 • edited Loading

endintiers commented Jan 17, 2019 • edited Loading

CESARDELATORRE commented Jul 15, 2019

endintiers commented Jul 16, 2019 • edited Loading

eerhardt commented Jul 19, 2019 • edited Loading

divega commented Jul 19, 2019 • edited Loading

divega commented Jul 19, 2019 • edited Loading

ajcvickers commented Jul 19, 2019

endintiers commented Jul 20, 2019 • edited Loading

endintiers commented Jul 20, 2019 • edited Loading

divega commented Jul 20, 2019 • edited Loading

eerhardt commented Jul 24, 2019

CESARDELATORRE commented Jul 25, 2019

CESARDELATORRE commented Jul 26, 2019

frank-dong-ms-zz commented Jun 11, 2020 • edited Loading

endintiers commented Jan 16, 2019 •

edited

Loading

Ivanidzo4ka commented Jan 16, 2019 •

edited

Loading

endintiers commented Jan 17, 2019 •

edited

Loading

endintiers commented Jan 17, 2019 •

edited

Loading

endintiers commented Jul 16, 2019 •

edited

Loading

eerhardt commented Jul 19, 2019 •

edited

Loading

divega commented Jul 19, 2019 •

edited

Loading

divega commented Jul 19, 2019 •

edited

Loading

endintiers commented Jul 20, 2019 •

edited

Loading

endintiers commented Jul 20, 2019 •

edited

Loading

divega commented Jul 20, 2019 •

edited

Loading

frank-dong-ms-zz commented Jun 11, 2020 •

edited

Loading