
incremental backup and incremental backfill generate different file names #90

Open
yonahforst opened this issue Mar 1, 2017 · 2 comments


@yonahforst

Hi there!

First off, great library. It's super useful and a much better/simpler option (for me) than the whole EMR/Datapipeline situation.

I have this simple lambda function that is subscribed to the tables I want to update:
(the bucket, region, and prefix are set as env variables in the lambda function)

var replicator = require('dynamodb-replicator')

// Forward each DynamoDB stream event to replicator.backup, which writes
// the changed items to S3.
module.exports.streaming = (event, context, callback) => {
  return replicator.backup(event, callback)
}

Then I ran the backfill by importing dynamodb-replicator/s3-backfill and passing it a config object.

However, I noticed that when records get updated via the stream/lambda function, they are written to a different file from the one created by the backfill.

I see that the formulas for generating file names differ slightly:

// backfill
            var id = crypto.createHash('md5')
                .update(Dyno.serialize(key))
                .digest('hex');

// backup
            var id = crypto.createHash('md5')
                .update(JSON.stringify(change.dynamodb.Keys))
                .digest('hex');

https://github.com/mapbox/dynamodb-replicator/blob/master/s3-backfill.js#L46-L48
https://github.com/mapbox/dynamodb-replicator/blob/master/index.js#L130-L132

Does this make any practical difference? Should the restore function work regardless?
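
For what it's worth, here's a minimal sketch (hypothetical key attribute names and values) showing that JSON.stringify is order-sensitive, so the same key serialized with its attributes in two different orders hashes to two different ids:

var crypto = require('crypto');

// The same DynamoDB key, with its attributes in two different orders
// (hypothetical attribute names and values).
var keyHashFirst  = { id: { S: 'user-1' }, ts: { N: '42' } };
var keyRangeFirst = { ts: { N: '42' }, id: { S: 'user-1' } };

function id(key) {
    return crypto.createHash('md5')
        .update(JSON.stringify(key))
        .digest('hex');
}

console.log(id(keyHashFirst));  // one hex digest...
console.log(id(keyRangeFirst)); // ...a different digest, so a different S3 key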

yonahforst commented Mar 6, 2017

I've realized that Dyno.serialize in backfill just converts from JS objects to DynamoDB JSON, which is what we get from the stream in backup. So I'm not sure why they generate different keys. Maybe the order of the stringified keys?

@yonahforst

Confirmed that sorting the key object before generating the id hash resolves this issue.
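
For reference, a minimal sketch of that workaround (not the library's actual patch; the helper name is mine):

var crypto = require('crypto');

// Rebuild the key object with its attribute names in sorted order, so
// JSON.stringify produces the same string regardless of input order.
function stableId(keys) {
    var sorted = {};
    Object.keys(keys).sort().forEach(function (name) {
        sorted[name] = keys[name];
    });
    return crypto.createHash('md5')
        .update(JSON.stringify(sorted))
        .digest('hex');
}

// Both orderings now produce the same id:
// stableId({ id: { S: 'user-1' }, ts: { N: '42' } })
// stableId({ ts: { N: '42' }, id: { S: 'user-1' } })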

btalbot added a commit to iJJi/dynamodb-replicator that referenced this issue May 9, 2017
This exposes a bug (previously filed as mapbox#90) which occurs when items with a range key are read
from the DDB event stream: an md5 hash of the key is computed and the item is written to S3.

The issue is that the DDB event stream handler does not (and should not) do a 'describe_table' to
know which key is the HASH and which is the RANGE, and therefore simply generates the md5 hash of
the item keys in whatever order they happen to appear in the stream event.
The s3-backfill util does do a 'describe_table' and orders the keys by declaration
order, which DDB requires to be HASH first, RANGE second.

The different ordering of the item keys produces a distinct md5 hash value, and the different S3
paths/keys result in some items appearing twice in S3, effectively corrupting the incremental
backups since two valid versions will be present at the same time.
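
(For illustration only, a sketch of the ordering s3-backfill effectively applies, using the aws-sdk v2 describeTable call; the helper name is made up:)

var AWS = require('aws-sdk');
var dynamodb = new AWS.DynamoDB();

// Order an item's key attributes by the table's declared key schema
// (HASH first, then RANGE), the way s3-backfill sees them.
function orderedKey(tableName, itemKeys, callback) {
    dynamodb.describeTable({ TableName: tableName }, function (err, data) {
        if (err) return callback(err);
        var ordered = {};
        data.Table.KeySchema.forEach(function (element) {
            ordered[element.AttributeName] = itemKeys[element.AttributeName];
        });
        callback(null, ordered);
    });
}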
btalbot added a commit to iJJi/dynamodb-replicator that referenced this issue May 9, 2017
…ox#90. See previous commit 0c065a5 for the tests which this commit allows to pass.
akum32 added a commit to ACloudGuru/dynamodb-replicator that referenced this issue Sep 7, 2017