This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[WIP] Fix non-determinism in sparse embedding #9846

Closed
wants to merge 2 commits into from

Conversation

eric-haibin-lin
Member

Description

The original GPU sparse embedding operator uses atomic add, which produces non-deterministic gradients because fp32 addition has limited precision and the execution order is non-deterministic. This PR replaces atomic add with a sort to guarantee determinism.
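
To illustrate why accumulation order matters and how sorting removes the dependence on it, here is a minimal NumPy sketch. It is not the CUDA kernel from this PR: the function name `embedding_grad_sorted` and the dense output layout are assumptions made for illustration only.

```python
import numpy as np

# fp32 addition is not associative, so the order of accumulation changes the result.
a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
left = (a + b) + c   # the large terms cancel first, then 1.0 is added
right = a + (b + c)  # 1.0 is lost when it is added to -1e8 in fp32


def embedding_grad_sorted(indices, out_grad, input_dim):
    """Hypothetical helper: accumulate out_grad rows per index in a fixed order."""
    indices = np.asarray(indices, dtype=np.int64)
    # A stable sort groups duplicate indices and keeps their original relative
    # order, so every row is always summed in the same sequence.
    order = np.argsort(indices, kind="stable")
    grad = np.zeros((input_dim, out_grad.shape[1]), dtype=out_grad.dtype)
    for row, g in zip(indices[order], out_grad[order]):
        grad[row] += g
    return grad


if __name__ == "__main__":
    idx = np.array([2, 0, 2, 1])
    og = np.ones((4, 3), dtype=np.float32)
    print(left, right)
    print(embedding_grad_sorted(idx, og, input_dim=4))
```

With atomic adds, the order in which duplicate indices reach a given row depends on thread scheduling, so the fp32 rounding can differ between runs. Fixing the order through a sort trades some speed for reproducibility.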

Tested with example/rnn/word_lm/train.py. With a fixed seed, SparseEmbedding and Embedding produce the same loss.

The fix makes the backward pass ~50% slower than the atomic_add implementation, as measured by the script at the end. Further optimization is possible by using cub::Unique instead of cub::InclusiveSum to generate the lookup table.
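
As a rough illustration of the two ways to build such a lookup table, the NumPy sketch below maps each sorted index to its slot in a compacted list of distinct rows. This is an assumption about what the lookup table contains, since the PR text does not spell it out. `np.cumsum` and `np.unique` stand in for the CUB primitives named above, and both function names are hypothetical.

```python
import numpy as np


def lookup_via_inclusive_sum(sorted_idx):
    """Flag the first occurrence of each distinct index, then prefix-sum the flags."""
    sorted_idx = np.asarray(sorted_idx, dtype=np.int64)
    flags = np.ones(len(sorted_idx), dtype=np.int64)
    # flags[i] is 1 where a new distinct index starts, 0 for a repeat.
    flags[1:] = sorted_idx[1:] != sorted_idx[:-1]
    # The inclusive sum counts distinct indices seen so far; minus 1 gives the slot.
    slots = np.cumsum(flags) - 1
    rows = sorted_idx[flags.astype(bool)]
    return rows, slots


def lookup_via_unique(sorted_idx):
    """Obtain the distinct rows and the slot of every element in one call."""
    sorted_idx = np.asarray(sorted_idx, dtype=np.int64)
    rows, slots = np.unique(sorted_idx, return_inverse=True)
    return rows, slots


if __name__ == "__main__":
    idx = np.array([0, 0, 2, 5, 5, 5])
    print(lookup_via_inclusive_sum(idx))
    print(lookup_via_unique(idx))
```

Both variants give the same table for sorted input. Whether the unique-based path is actually faster on the GPU would have to be measured with the benchmark script.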

@ZiyueHuang @sxjscience

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
      • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
      • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
      • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
      • For user-facing API changes, API doc string has been updated.
      • For new C++ functions in header files, their functionalities and arguments are documented.
      • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here
# the benchmark script also requires other files under example/rnn/word_lm

import argparse
import logging
import math

import numpy as np
import mxnet as mx

# data, model and module are local files under example/rnn/word_lm
from data import Corpus, CorpusIter
from model import *
from module import *
from mxnet.model import BatchEndParam

parser = argparse.ArgumentParser(description='PennTreeBank LSTM Language Model')
parser.add_argument('--data', type=str, default='./data/ptb.',
                    help='location of the data corpus')
parser.add_argument('--batch_size', type=int, default=128,
                    help='batch size')
parser.add_argument('--bptt', type=int, default=35,
                    help='sequence length')
parser.add_argument('--dim', type=int, default=1024*1024,
                    help='input dimension (number of rows) of the embedding weight')

if __name__ == '__main__':
    # args
    head = '%(asctime)-15s %(message)s'
    logging.basicConfig(level=logging.DEBUG, format=head)
    args = parser.parse_args()
    logging.info(args)
    ctx = mx.gpu()
    batch_size = args.batch_size
    bptt = args.bptt

    # data
    corpus = Corpus(args.data)
    ntokens = len(corpus.dictionary)
    train_data = CorpusIter(corpus.train, batch_size, bptt)
    data = []
    for i in range(1):
        data.append(train_data.next().data[0].reshape((-1,)).astype('int64'))
    word = mx.sym.var('data')
    weight = mx.sym.var('embed_weight', stype='row_sparse')
    embed = mx.sym.contrib.SparseEmbedding(data=word, weight=weight, input_dim=args.dim, output_dim=512, name='embed')
    grad_req = {'data': 'null', 'embed_weight': 'write'}
    exe_test = embed.simple_bind(mx.gpu(), grad_req=grad_req, data=(data[0].shape[0],))
    arg_map = dict(zip(embed.list_arguments(), exe_test.arg_arrays))
    grad_map = dict(zip(embed.list_arguments(), exe_test.grad_arrays))
    # init data
    arg_map["data"][:] = data[0].astype('float32')
    print(data[0])
    grad = mx.nd.ones(exe_test.outputs[0].shape).copyto(mx.gpu(0))
    # weight
    weight = arg_map["embed_weight"]
    weight[:] = 1
    exe_test.forward()
    # warm up
    for i in range(10):
        exe_test.backward([grad])
    import time
    mx.nd.waitall()
    a = time.time()
    for i in range(10000):
        exe_test.backward([grad])
    mx.nd.waitall()
    b = time.time()
    print(b - a)

* refactor embed backward kernelcallker

* pass unit test

* refactor

* fix dim bug

* add unique impl

* remove old op

* remove unused kernel
@eric-haibin-lin changed the title from "Fix non-determinism in sparse embedding" to "[WIP] Fix non-determinism in sparse embedding" on Feb 21, 2018
@eric-haibin-lin deleted the fix-embedding branch on September 18, 2018 23:33