
incremental backup and incremental backfill generate different file names #90

Open
yonahforst opened this issue Mar 1, 2017 · 2 comments


@yonahforst

Hi there!

First off, great library. It's super useful and a much better/simpler option (for me) than the whole EMR/Datapipeline situation.

I have this simple lambda function that is subscribed to the tables I want to update:
(the bucket, region, and prefix are set as env variables in the lambda function)

var replicator = require('dynamodb-replicator')

// Forward each DynamoDB stream event to replicator.backup, which writes
// the changed items to S3.
module.exports.streaming = (event, context, callback) => {
  return replicator.backup(event, callback)
}

Then I ran the backfill by importing dynamodb-replicator/s3-backfill and passing it a config object.

However, I noticed that when records get updated via the stream/lambda function, they are written to a different file from the one created by the backfill.

I see that the formulas for generating file names differ slightly:

// backfill
            var id = crypto.createHash('md5')
                .update(Dyno.serialize(key))
                .digest('hex');

// backup
            var id = crypto.createHash('md5')
                .update(JSON.stringify(change.dynamodb.Keys))
                .digest('hex');

https://github.com/mapbox/dynamodb-replicator/blob/master/s3-backfill.js#L46-L48
https://github.com/mapbox/dynamodb-replicator/blob/master/index.js#L130-L132

Does this make any practical difference? Should the restore function work regardless?
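
For what it's worth, here's a minimal sketch (hypothetical key attribute names and values) showing that JSON.stringify is order-sensitive, so the same key serialized with its attributes in two different orders hashes to two different ids:

var crypto = require('crypto');

// The same DynamoDB key, with its attributes in two different orders
// (hypothetical attribute names and values).
var keyHashFirst  = { id: { S: 'user-1' }, ts: { N: '42' } };
var keyRangeFirst = { ts: { N: '42' }, id: { S: 'user-1' } };

function id(key) {
    return crypto.createHash('md5')
        .update(JSON.stringify(key))
        .digest('hex');
}

console.log(id(keyHashFirst));  // one hex digest...
console.log(id(keyRangeFirst)); // ...a different digest, so a different S3 key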

yonahforst commented Mar 6, 2017

I've realized that Dyno.serialize in backfill just converts from JS objects to DynamoDB JSON, which is what we get from the stream in backup. So I'm not sure why they generate different keys. Maybe the order of the stringified keys?

@yonahforst

Confirmed that sorting the key object before generating the id hash resolves this issue.
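
For reference, a minimal sketch of that workaround (not the library's actual patch; the helper name is mine):

var crypto = require('crypto');

// Rebuild the key object with its attribute names in sorted order, so
// JSON.stringify produces the same string regardless of input order.
function stableId(keys) {
    var sorted = {};
    Object.keys(keys).sort().forEach(function (name) {
        sorted[name] = keys[name];
    });
    return crypto.createHash('md5')
        .update(JSON.stringify(sorted))
        .digest('hex');
}

// Both orderings now produce the same id:
// stableId({ id: { S: 'user-1' }, ts: { N: '42' } })
// stableId({ ts: { N: '42' }, id: { S: 'user-1' } })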

btalbot added a commit to iJJi/dynamodb-replicator that referenced this issue May 9, 2017
This exposes a bug (previously filed as mapbox#90) which occurs when items with a range key are read
from the DDB event stream: an md5 hash of the key is computed and the item is written to S3.

The issue is that the DDB event stream handler does not (and should not) do a 'describe_table' to
know which key is the HASH and which is the RANGE, and therefore simply generates the md5 hash of
the item keys in whatever order they happen to appear in the stream event.
The s3-backfill util does do a 'describe_table' and orders the keys by declaration
order, which DDB requires to be HASH first, RANGE second.

The different ordering of the item keys produces a distinct md5 hash value, and the different S3
paths/keys result in some items appearing twice in S3, effectively corrupting the incremental
backups since two valid versions will be present at the same time.
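
(For illustration only, a sketch of the ordering s3-backfill effectively applies, using the aws-sdk v2 describeTable call; the helper name is made up:)

var AWS = require('aws-sdk');
var dynamodb = new AWS.DynamoDB();

// Order an item's key attributes by the table's declared key schema
// (HASH first, then RANGE), the way s3-backfill sees them.
function orderedKey(tableName, itemKeys, callback) {
    dynamodb.describeTable({ TableName: tableName }, function (err, data) {
        if (err) return callback(err);
        var ordered = {};
        data.Table.KeySchema.forEach(function (element) {
            ordered[element.AttributeName] = itemKeys[element.AttributeName];
        });
        callback(null, ordered);
    });
}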
btalbot added a commit to iJJi/dynamodb-replicator that referenced this issue May 9, 2017
…ox#90. See previous commit 0c065a5 for the tests which this commit allows to pass.
akum32 added a commit to ACloudGuru/dynamodb-replicator that referenced this issue Sep 7, 2017