Skip to content
This repository has been archived by the owner on Jul 30, 2024. It is now read-only.
/ NuGet.Jobs Public archive

[Revalidation] Speed up bulk insertions #530

Merged
merged 1 commit into from
Aug 10, 2018
Merged

[Revalidation] Speed up bulk insertions #530

merged 1 commit into from
Aug 10, 2018

Conversation

loic-sharma
Copy link
Contributor

The initialization phase of the job has to insert 1.5 million records into the validation DB in batches of 1,000 records. On DEV, each batch was taking 30 - 120 seconds using Entity Framework. This fix uses SqlBulkCopy to speed up batches to be sub-second.

l.Contains("Billy") &&
l.Contains("Bob"))),
// Ensure that the owners are lexicographically ordered.
l.Count() == 4 &&
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was out of date.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was failing in the CI? Or not being run?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's failing the CI. I have no idea how the CI didn't fail until now... I think someone may have accidentally reverted the test.


var bulkCopy = new SqlBulkCopy(
_connection,
SqlBulkCopyOptions.TableLock | SqlBulkCopyOptions.FireTriggers | SqlBulkCopyOptions.UseInternalTransaction,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will TableLock have any impact to the validation pipeline?

Based on your DEV testing, sounds like lock duration would be <1s? What's the frequency of locks / interval between batches?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's a SqlBulkCopy.BulkCopyTimeout property... maybe we should use that to ensure that bulk copies don't accidentally keep the table lock for significant time?

If the frequency of locks causes pipeline degradation, I assume we can flip the killswitch to disable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will TableLock have any impact to the validation pipeline?

This inserts records to the PackageRevalidations table which is only used by the revalidation job, so this shouldn't impact the validation pipeline.

Based on your DEV testing, sounds like lock duration would be <1s? What's the frequency of locks / interval between batches?

There's a 30 second sleep time between batches, and there will be roughly ~1,500 batches (1.5 million packages inserted in batches of a thousand).

Looks like there's a SqlBulkCopy.BulkCopyTimeout property... maybe we should use that to ensure that bulk copies don't accidentally keep the table lock for significant time?

Good idea, I'll look into that.

If the frequency of locks causes pipeline degradation, I assume we can flip the killswitch to disable?

Yup, that's correct

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whew - I missed that it was using a separate table from the pipeline, so this sounds great!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default BulkCopyTimeout value is 30 seconds, which seems reasonable for our scenario.

bulkCopy.DestinationTableName = TableName;
bulkCopy.WriteToServer(table);

_connection.Close();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: DbConnection.Close and DbConnection.Dispose are supposed to be functionally equivalent (had to look that up). Just wondering if we should explicitly add DbConnection.Dispose so that the behavior is clear from code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I talked to @chenriksson offline. She recommended I create a new SqlConnection every time I call AddPackageRevalidationsAsync.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


public PackageRevalidationInserter(SqlConnection connection, ILogger<PackageRevalidationInserter> logger)
{
_connection = connection ?? throw new ArgumentNullException(nameof(connection));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type should be IDisposable since it owns a disposable dependency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, ask DI for a Func<SqlConnection> and make the invoker of the Func dispose.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated some legacy jobs to pass Func<Task<SqlConnection>>, e.g.: OpenSqlConnectionAsync<StatisticsDbConfiguration>.

This ensures we can get a new access token when we create/open a new connection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a common pattern that many jobs will need. I addressed this by creating ISqlConnectionFactory<T>. You can ask for ISqlConnectionFactory<ValidationDbConfiguration> or ISqlConnectionFactory<IEntitiesContext> to create connections to the database of your choice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, sounds good.

/// Create a SQL Connection.
/// </summary>
/// <returns>The created SQL connection.</returns>
SqlConnection Create();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also add CreateAsync? Note that Create is async in ServerCommon because it may require a request for the access token.

I added the non-async overload to JobBase because I think it's necessary for the DI creation of the EF contexts. Feel free to correct if I'm wrong here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

return new DelegateSqlConnectionFactory<GalleryDbConfiguration>(
CreateSqlConnection<GalleryDbConfiguration>,
p.GetRequiredService<ILogger<DelegateSqlConnectionFactory<GalleryDbConfiguration>>>());
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move registration of ISqlConnectionFactory to JsonConfigurationJob (now in Jobs.Common), since they would be useful to legacy jobs as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


namespace NuGet.Jobs.Validation
{
public class DelegateSqlConnectionFactory<T> : ISqlConnectionFactory<T>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave a comment that this T is a marker type. From what I can tell it only is for simplifying wire-up.

This could also be achieved by DI magic, e.g. keyed bindings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment sounds good, I'll do that.

This could indeed also be done with keyed bindings. Personally, I prefer the marker type over the keyed bindings because:

  1. The type that accepts the connection factory clearly indicates which database it is connecting to. This is unclear from keyed bindings.
  2. Setting up keyed bindings are a pain, especially if we have to do these bindings for each type that depends on ISqlConnectionFactory. I would like to avoid them if possible 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the comment

_logger = logger ?? throw new ArgumentNullException(nameof(logger));
}

public async Task AddPackageRevalidationsAsync(IReadOnlyList<PackageRevalidation> revalidations)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all packages in memory here? Or is this method called, batch-wise?

Copy link
Contributor Author

@loic-sharma loic-sharma Aug 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is called batch-wise, 1000 revalidations at a time. Also, this type is recreated on each batch.

@loic-sharma loic-sharma merged commit 9535e69 into dev Aug 10, 2018
@loic-sharma loic-sharma deleted the loshar-perf branch August 10, 2018 22:06
joelverhagen pushed a commit that referenced this pull request Oct 26, 2020
The initialization phase of the job has to insert 1.5 million records into the validation DB in batches of 1,000 records. On DEV, each batch was taking 30 - 120 seconds using Entity Framework. This fix uses `SqlBulkCopy` to speed up batches to be sub-second.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants