[BUG] Data Corruption in columns by orc reader when skiprows is specified #7343
galipremsagar added the bug, Needs Triage, libcudf, and cuIO labels on Feb 8, 2021
galipremsagar changed the title from "[BUG] Data Corruption in boolean column of orc reader when skiprows is specified" to "[BUG] Data Corruption in boolean column by orc reader when skiprows is specified" on Feb 8, 2021
@galipremsagar How was the test file created?
Using pyorc.
@rgsl888prabhu This doesn't seem to be specific to boolean columns. Test file: orc_int_bug.orc.zip (remove .zip)
(Pdb) cudf.read_orc('orc_int_bug.orc')
0 1 2
0 <NA> -2039474302 5829562885553925120
1 <NA> -851478164 5829562885553925120
2 <NA> 1815327551 -985824069601317760
3 <NA> -1033368904 6290216132129532928
4 <NA> 1403224887 5829562885553925120
5 False 105022598 -3316202600110668288
6 <NA> 255779779 -3670875956966439424
7 <NA> -1364605221 5138853157892062208
8 <NA> -990224089 -3316202600110668288
9 False -1024409087 5829562885553925120
10 <NA> -1625037941 -985824069601317760
11 <NA> <NA> -3670875956966439424
12 True -1658974646 -985824069601317760
13 <NA> -556027779 6178433988546542592
14 <NA> -238654423 -3316202600110668288
15 <NA> -128398233 5829562885553925120
16 <NA> 987032885 6178433988546542592
17 <NA> <NA> 6290216132129532928
18 <NA> 1117133010 6178433988546542592
19 <NA> <NA> 5829562885553925120
20 <NA> -267043327 5138853157892062208
21 <NA> -690910650 5829562885553925120
22 <NA> -1658974646 5138853157892062208
23 <NA> -2039474302 -3670875956966439424
24 <NA> 987032885 -985824069601317760
25 <NA> <NA> 6178433988546542592
26 <NA> -959282444 6178433988546542592
27 <NA> -1744623531 -3670875956966439424
28 <NA> -462848453 -3670875956966439424
29 <NA> -1731261747 -3316202600110668288
30 <NA> 104134845 -3670875956966439424
31 <NA> -1167141929 -3670875956966439424
32 <NA> -1337529137 5829562885553925120
33 <NA> <NA> 6178433988546542592
34 <NA> -542285268 5829562885553925120
35 <NA> -1671290850 -3316202600110668288
36 <NA> -1675605997 6290216132129532928
37 <NA> 491158244 -985824069601317760
38 <NA> -621336100 5829562885553925120
39 <NA> <NA> -985824069601317760
40 <NA> 829987561 5829562885553925120
41 <NA> <NA> -985824069601317760
42 <NA> 104134845 -3316202600110668288
43 True -1097769038 -3670875956966439424
44 <NA> -60427301 6290216132129532928
45 <NA> 987032885 <NA>
46 <NA> -959282444 6290216132129532928
47 <NA> -851478164 -985824069601317760
48 <NA> 1815327551 5138853157892062208
49 <NA> -113330593 6290216132129532928
50 <NA> <NA> 6178433988546542592
51 <NA> <NA> -3670875956966439424
52 <NA> -128398233 6290216132129532928
(Pdb) cudf.read_orc('orc_int_bug.orc', skiprows=44)
0 1 2
0 <NA> 987032885 -3670875956966439424
1 <NA> -959282444 <NA>
2 <NA> -851478164 6290216132129532928
3 <NA> 1815327551 6290216132129532928
4 <NA> -113330593 -985824069601317760
5 <NA> -128398233 5138853157892062208
6 <NA> <NA> 6290216132129532928
7 <NA> <NA> 6178433988546542592
8 <NA> 184 -3670875956966439424
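To make the corruption in the two dumps above easier to see, here is a minimal Python sketch that compares the last nine rows of the full read with the `skiprows=44` result. It assumes that `skiprows=44` should return exactly rows 44 to 52 of the full read. The values are copied by hand from the output above, `None` stands in for `<NA>`, and `find_mismatches` is a hypothetical helper, not part of cudf.

```python
NA = None  # stands in for <NA> in the dumps above

# Rows 44..52 of the full read: (column 0, column 1, column 2).
expected = [
    (NA, -60427301, 6290216132129532928),
    (NA, 987032885, NA),
    (NA, -959282444, 6290216132129532928),
    (NA, -851478164, -985824069601317760),
    (NA, 1815327551, 5138853157892062208),
    (NA, -113330593, 6290216132129532928),
    (NA, NA, 6178433988546542592),
    (NA, NA, -3670875956966439424),
    (NA, -128398233, 6290216132129532928),
]

# Output of cudf.read_orc('orc_int_bug.orc', skiprows=44).
actual = [
    (NA, 987032885, -3670875956966439424),
    (NA, -959282444, NA),
    (NA, -851478164, 6290216132129532928),
    (NA, 1815327551, 6290216132129532928),
    (NA, -113330593, -985824069601317760),
    (NA, -128398233, 5138853157892062208),
    (NA, NA, 6290216132129532928),
    (NA, NA, 6178433988546542592),
    (NA, 184, -3670875956966439424),
]


def find_mismatches(expected_rows, actual_rows):
    """Return (row, column, expected, actual) for every differing cell."""
    return [
        (r, c, e, a)
        for r, (erow, arow) in enumerate(zip(expected_rows, actual_rows))
        for c, (e, a) in enumerate(zip(erow, arow))
        if e != a
    ]


mismatches = find_mismatches(expected, actual)
for row, col, want, got in mismatches:
    print(f"row {row}, column {col}: expected {want}, got {got}")
```

On these dumps the boolean column matches (it is all `<NA>` in this range), while both integer columns differ in most rows, including the stray `184` in the last row.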
galipremsagar changed the title from "[BUG] Data Corruption in boolean column by orc reader when skiprows is specified" to "[BUG] Data Corruption in columns by orc reader when skiprows is specified" on Feb 8, 2021
rapids-bot bot pushed a commit that referenced this issue on Feb 15, 2021
Closes #7343.
The validity bits in streams are placed MSB to LSB within a byte: [True, False, True, False, True, True, True, False] -> 10101110. When the bits are analyzed as a 32-bit chunk, the mask can't be applied directly, which caused this issue. `__brev(__byte_perm(bits, 0, 0x0123))` takes care of that and rearranges the bits as expected.
Authors:
- Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
Approvers:
- GALI PREM SAGAR (@galipremsagar)
- Vukasin Milovanovic (@vuule)
URL: #7359
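The bit rearrangement described in the commit message can be illustrated with a small Python emulation. This is only a sketch, not the libcudf kernel code: `byte_swap32` and `brev32` are hypothetical stand-ins for the CUDA intrinsics `__byte_perm(x, 0, 0x0123)` (reverse the byte order of a 32-bit word) and `__brev` (reverse all 32 bits), and placing the packed byte in the low byte of the word is an assumption made for illustration.

```python
def byte_swap32(x):
    """Reverse the byte order of a 32-bit word, like __byte_perm(x, 0, 0x0123)."""
    return int.from_bytes((x & 0xFFFFFFFF).to_bytes(4, "little"), "big")


def brev32(x):
    """Reverse all 32 bits of a word, like __brev(x)."""
    return int(f"{x & 0xFFFFFFFF:032b}"[::-1], 2)


def pack_msb_first(flags):
    """Pack up to 8 validity flags into one byte, first flag in the MSB."""
    byte = 0
    for flag in flags:
        byte = (byte << 1) | int(flag)
    return byte


flags = [True, False, True, False, True, True, True, False]
packed = pack_msb_first(flags)  # 0b10101110, as in the commit message

# Swapping the bytes and then reversing all 32 bits leaves every byte where
# it was but reverses the bits inside it, so flag i ends up at bit i.
fixed = brev32(byte_swap32(packed))
recovered = [bool((fixed >> i) & 1) for i in range(len(flags))]

print(f"packed = {packed:08b}, fixed = {fixed:08b}")
print(recovered == flags)
```

After the rearrangement, a mask such as `1 << i` selects the validity bit of row `i` directly, which is what a per-row mask on the 32-bit chunk needs.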
#7359 was supposed to close this issue. Closing manually.
Describe the bug
When an ORC file contains multiple column types and `skiprows` is specified, the data in some columns appears to be corrupted.
Steps/Code to reproduce bug
ORC file: orc_bool_bug.orc.zip (Please remove .zip at the end; it was added to bypass GitHub attachment restrictions.)
Expected behavior
If we compare the first dataframe and the second dataframe, the corresponding boolean column value for `-18505` is `False`.
Environment overview (please complete the following information)
Environment details
Please run and paste the output of the `cudf/print_env.sh` script here, to gather any other relevant environment details.
Additional context
Surfaced while running fuzz tests: #6001