Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor cuIO timestamp processing with cuda::std::chrono #9278

Merged
merged 30 commits into from
Sep 30, 2021

Conversation

PointKernel
Copy link
Member

@PointKernel PointKernel commented Sep 22, 2021

Closes #6674.

This PR refactors cuIO timestamp processing with cuda::std::chrono. It gets rid of hard-coded conversions of different time units and simplifies related operations.

@PointKernel PointKernel added 2 - In Progress Currently a work in progress code quality libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue improvement Improvement / enhancement to an existing function breaking Breaking change labels Sep 22, 2021
@PointKernel PointKernel self-assigned this Sep 22, 2021
@PointKernel PointKernel requested a review from a team as a code owner September 22, 2021 22:36
@codecov
Copy link

codecov bot commented Sep 23, 2021

Codecov Report

Merging #9278 (51f5259) into branch-21.12 (ab4bfaa) will decrease coverage by 0.02%.
The diff coverage is 2.60%.

❗ Current head 51f5259 differs from pull request most recent head a9df285. Consider uploading reports for the commit a9df285 to get more accurate results
Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.12    #9278      +/-   ##
================================================
- Coverage         10.79%   10.76%   -0.03%     
================================================
  Files               116      116              
  Lines             18869    18996     +127     
================================================
+ Hits               2036     2045       +9     
- Misses            16833    16951     +118     
Impacted Files Coverage Δ
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/dataframe.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/join/_join_helpers.py 0.00% <ø> (ø)
python/cudf/cudf/core/multiindex.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/tools/datetimes.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/udf/api.py 0.00% <0.00%> (ø)
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fd0b710...a9df285. Read the comment docs.

cpp/src/io/csv/csv_gpu.cu Outdated Show resolved Hide resolved
cpp/src/io/csv/csv_gpu.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to look further into datime.cuh, but PR looks great so far. So much removed code!

Can you please run the gbenchmarks for the affected components? I'm curious if there's any improvement.

@PointKernel
Copy link
Member Author

rerun tests

@PointKernel
Copy link
Member Author

PointKernel commented Sep 29, 2021

  • CSV_READER_BENCH

    • chrono:
    Benchmark                                               Time            CPU 
    CsvRead/timestamps_file_input/32/0/manual_time          750 ms          750 ms
    CsvRead/timestamps_buffer_input/32/1/manual_time        676 ms          676 ms
    
    • ref:
    Benchmark                                               Time            CPU 
    CsvRead/timestamps_file_input/32/0/manual_time          740 ms          740 ms
    CsvRead/timestamps_buffer_input/32/1/manual_time        657 ms          657 ms
    
  • CSV_WRITER_BENCH

    • chrono:
    Benchmark                                                Time            CPU 
    CsvWrite/timestamps_file_output/32/0/manual_time         1157 ms         1003 ms
    CsvWrite/timestamps_buffer_output/32/1/manual_time       1414 ms         1414 ms
    CsvWrite/timestamps_void_output/32/2/manual_time          248 ms          248 ms
    
    • ref:
    Benchmark                                                Time            CPU 
    CsvWrite/timestamps_file_output/32/0/manual_time         1150 ms         1032 ms
    CsvWrite/timestamps_buffer_output/32/1/manual_time       1442 ms         1442 ms
    CsvWrite/timestamps_void_output/32/2/manual_time          245 ms          244 ms
    

If there is any performance impact, CSV/JSON should be affected the most. Based on the above results, the new implementation with chrono is slightly slower in almost all cases. Not sure if it's an actual performance difference or just acceptable runtime noise. No matter which is the case, the slow down should be related to reduced precision operations in the original implementation. e.g. using int to perform second-related operations while in chrono, seconds are always int64_t.

@davidwendt
Copy link
Contributor

The benchmark results may indeed be noise since the code changed in this PR is not actually used in the CSV writer.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a minor nitpick.

cpp/src/io/csv/datetime.cuh Outdated Show resolved Hide resolved
Comment on lines +1793 to +1794
duration_ns d_ns{nanos};
d_ns += duration_s{seconds};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nanos and seconds should be constructed as duration_ns and duration_s to begin with.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is possible for seconds but seconds will do some manual timezone operations by invoking get_gmt_offset(). My plan was to leave it for the timezone refactoring. As for nanos, there are bitwise operations like "shift" and "and" which are not supported by duration type.

@PointKernel
Copy link
Member Author

PointKernel commented Sep 30, 2021

Added some minor refactoring of the timezone code. Tried to replace days_in_month() with chrono::year_month_day_last but get an pytest error which shows 0.29326% difference against the gold reference:

test_orc.py:652: AssertionError
_________________________________________________________________________________________ test_orc_reader_gmt_timestamps __________________________________________________________________________________________

datadir = PosixPath('/home/yunsongw/dev/rapids/cudf/python/cudf/cudf/tests/data/orc')

    def test_orc_reader_gmt_timestamps(datadir):
        path = datadir / "TestOrcFile.gmt.orc"
        try:
            orcfile = pa.orc.ORCFile(path)
        except pa.ArrowIOError as e:
            pytest.skip(".orc file is not found: %s" % e)
    
        pdf = orcfile.read().to_pandas()
        gdf = cudf.read_orc(path, engine="cudf").to_pandas()
>       assert_eq(pdf, gdf)

test_orc.py:708: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../testing/_utils.py:99: in assert_eq
    tm.assert_frame_equal(left, right, **kwargs)
../../../../../compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.7/site-packages/pandas/_testing/asserters.py:824: in assert_extension_array_equal
    np.asarray(left.asi8), np.asarray(right.asi8), index_values=index_values
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

left = array([-9223372036854775808, -2142515480837758080, -7567920988965551616,
        2792579797553587072,  -54466860059230...   7525281687306035456,   796389951378828992,  4737082654906448384,
       -9223372036854775808, -8570925633996758080])
right = array([-9223372036854775808, -2142515480837758080, -7567920988965551616,
        2792579797553587072,  -54466860059230...   7525281687306035456,   796389951378828992,  4737082654906448384,
       -9223372036854775808, -8570925633996758080])
err_msg = None

    def _raise(left, right, err_msg):
        if err_msg is None:
            if left.shape != right.shape:
                raise_assert_detail(
                    obj, f"{obj} shapes are different", left.shape, right.shape
                )
    
            diff = 0
            for left_arr, right_arr in zip(left, right):
                # count up differences
                if not array_equivalent(left_arr, right_arr, strict_nan=strict_nan):
                    diff += 1
    
            diff = diff * 100.0 / left.size
            msg = f"{obj} values are different ({np.round(diff, 5)} %)"
>           raise_assert_detail(obj, msg, left, right, index_values=index_values)
E           AssertionError: numpy array are different
E           
E           numpy array values are different (0.29326 %)
E           [index]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
E           [left]:  [-9223372036854775808, -2142515480837758080, -7567920988965551616, 2792579797553587072, -544668600592309696, 8297071309727241920, -6098990521686964544, -4045530081545654848, -7322980025668309696, 4293722186243138688, -7750717270576861312, -4785740552164516160, -161961075020103232, 1406249513655241920, 5715929550633896768, -9146790865645206464, 9168135723176380608, 899702797973035456, 3156765762737896768, -3158961808237654848, 5048889195422793536, 5428392956266587072, 3597397302246138688, 708153631141241920, 5384453953847448384, 7077052911552483840, -6241260517960206464, -6693705449827103232, 5495653761970483840, 8570983229708138688, 695728049194483840, -320493707408619392, 7967874651631483840, -8611366343465551616, 5169946153967380608, 2941962443750690304, 3820458547896587072, 4041953621313932224, 8479584079397000000, -9223372036854775808, -2324266697823206464, -7627237450543103232, -5312905053312619392, 8379706198113587072, -6192231860505654848, 7424232249756828992, 4090209118022690304, 6974596115118828992, 4993036072188241920, 1459313580297690304, -9223372036854775808, -1722690325890861312, -7114672037893654848, -518090880692861312, 8453955073001345152, -478125103150103232, -7032580620334412928, -8178069250459309696, -9223372036854775808, 4467367234201138688, -581164836138516160, 169391712222380608, -5991596338872516160, 3634476496931690304, 6540140071394000000, -9223372036854775808, -8047982283914619392, -3925635588309758080, -7612537130950516160, -2303074531770103232, 602635749356828992, 601084410810138688, -9223372036854775808, 6276364605199380608, 1671999857286483840, 57905023973345152, -1709746292640067776, -2295211943214309696, -9223372036854775808, -4659779523427103232, -3615777840447412928, 6040895237957587072, 7683325620332000000, 1864036452543035456, 4335730993151448384, 8517458632479380608, -6807729320590758080, 2620299928752035456, -335724784506964544, -4799525985631309696, -9150513182766412928, -4515293891343171008, -3130016277931171008, -853232952495171008, 8836645149901896768, 4867213370890793536, 3010234015614000000, -3910365969989412928, -2011305780414103232, -1870062057019103232, ...]
E           [right]: [-9223372036854775808, -2142515480837758080, -7567920988965551616, 2792579797553587072, -544668600592309696, 8297071309727241920, -6098990521686964544, -4045530081545654848, -7322980025668309696, 4293722186243138688, -7750717270576861312, -4785740552164516160, -161961075020103232, 1406249513655241920, 5715929550633896768, -9146790865645206464, 9168135723176380608, 899702797973035456, 3156765762737896768, -3158961808237654848, 5048889195422793536, 5428392956266587072, 3597397302246138688, 708153631141241920, 5384453953847448384, 7077052911552483840, -6241260517960206464, -6693705449827103232, 5495653761970483840, 8570983229708138688, 695728049194483840, -320493707408619392, 7967874651631483840, -8611366343465551616, 5169946153967380608, 2941962443750690304, 3820458547896587072, 4041953621313932224, 8479584079397000000, -9223372036854775808, -2324266697823206464, -7627237450543103232, -5312905053312619392, 8379706198113587072, -6192231860505654848, 7424232249756828992, 4090209118022690304, 6974596115118828992, 4993036072188241920, 1459313580297690304, -9223372036854775808, -1722690325890861312, -7114672037893654848, -518090880692861312, 8453955073001345152, -478125103150103232, -7032580620334412928, -8178069250459309696, -9223372036854775808, 4467367234201138688, -581164836138516160, 169391712222380608, -5991596338872516160, 3634476496931690304, 6540140071394000000, -9223372036854775808, -8047982283914619392, -3925635588309758080, -7612537130950516160, -2303074531770103232, 602635749356828992, 601084410810138688, -9223372036854775808, 6276364605199380608, 1671999857286483840, 57905023973345152, -1709746292640067776, -2295211943214309696, -9223372036854775808, -4659779523427103232, -3615777840447412928, 6040895237957587072, 7683325620332000000, 1864036452543035456, 4335730993151448384, 8517458632479380608, -6807729320590758080, 2620299928752035456, -335724784506964544, -4799525985631309696, -9150513182766412928, -4515293891343171008, -3130016277931171008, -853232952495171008, 8836645149901896768, 4867213370890793536, 3010234015614000000, -3910365969989412928, -2011305780414103232, -1870062057019103232, ...]

I guess the error is related to #7314 since it constantly shows up in my local system but can pass all CI checks in a PR. The rest of the timezone refactoring would be in a follow-up PR whenever cuda::std::chrono::timezone is ready.

@PointKernel
Copy link
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 4d7e69a into rapidsai:branch-21.12 Sep 30, 2021
@PointKernel PointKernel deleted the cuio-timestamp branch October 1, 2021 13:50
rapids-bot bot pushed a commit that referenced this pull request Oct 5, 2021
This fixes some compile warnings recently added in #9299 and #9278. These warnings are turned into errors on a libcudf Debug build.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9360
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team breaking Breaking change cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Use std::chrono for timestamp processing in cuIO
5 participants