forderv handles complex input #3701

Merged
merged 33 commits into master from cplx_forder on Jul 19, 2019

Conversation

@MichaelChirico (Member) commented Jul 12, 2019

Closes #1703
and other parts of #3690.

I took a look into adding support "properly" (directly on the complex columns themselves) at the C level, but the effort there seems prohibitive -- we would have to modularize the code a bit to allow CPLXSXP to delegate to the REALSXP engine on the .r and .i components. #1703 was filed 3 years ago without much follow-up, so I assume the simple approach here will do just fine. A quick benchmark suggested the current implementation beats base::order and an approach that explicitly delegates like order(Re(z), Im(z)) in terms of speed & memory efficiency.
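
For illustration (not part of the PR), the explicit-delegation alternative mentioned above looks roughly like this in base R:

# base::order handles complex directly (sorting by real part, then imaginary),
# while the delegated version orders on Re(z) and Im(z) as two double keys;
# with no exact ties the two orderings should agree.
set.seed(1)
z  = complex(real = runif(10), imaginary = runif(10))
o1 = order(z)
o2 = order(Re(z), Im(z))
identical(o1, o2)   # expected TRUE here (random doubles, so no ties)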

Sorting itself is done I think.

Now working on the stuff downstream of forderv that this should unlock -- keys, joins, etc.

TO DO:

codecov bot commented Jul 13, 2019

Codecov Report

Merging #3701 into master will increase coverage by 0.15%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3701      +/-   ##
==========================================
+ Coverage    98.3%   98.45%   +0.15%     
==========================================
  Files          69       69              
  Lines       13267    13284      +17     
==========================================
+ Hits        13042    13079      +37     
+ Misses        225      205      -20
Impacted Files Coverage Δ
src/dogroups.c 96.97% <ø> (+4.12%) ⬆️
R/setkey.R 100% <100%> (+1.3%) ⬆️
src/uniqlist.c 100% <100%> (ø) ⬆️
R/data.table.R 97.87% <100%> (ø) ⬆️
src/bmerge.c 98.25% <100%> (+0.29%) ⬆️
R/bmerge.R 100% <100%> (ø) ⬆️
src/forder.c 100% <100%> (+0.28%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ebe5787...a9d9165.

@MichaelChirico (Member Author):

Have taken this pretty far, but could use some help to go further.

@mattdowle could you weigh in on how to proceed with bmerge & uniqlist?

Any thoughts?

@@ -51,14 +51,9 @@ setkeyv = function(x, cols, verbose=getOption("datatable.verbose"), physical=TRU
}
if (identical(cols,"")) stop("cols is the empty string. Use NULL to remove the key.")
if (!all(nzchar(cols))) stop("cols contains some blanks.")
if (!length(cols)) {
@MichaelChirico (Member Author):

This if is redundant & contradictory -- just a few lines earlier (L47), the same condition if (!length(cols)) only triggers a warning. I guess this branch was some leftover or copy/paste from setkey... revealed by Codecov.

mattdowle added the WIP label (Jul 17, 2019)
mattdowle changed the title from "WIP: Closes #1703 -- forderv handles complex input" to "forderv handles complex input" (Jul 17, 2019)
@mattdowle (Member) commented Jul 17, 2019

Ok so this is a lot of progress! But it's risky. There are a lot of core-file changes here, and as we've seen from #3692 (comment), adding support for the new type can introduce unintended and hard-to-track memory faults, because it's a new type that's bigger too (16 bytes).

The new cplxTwiddled is 30 lines (out of +300 in this PR). I had actually been thinking we could remove dtwiddle, as I've been reducing its use over time. This new twiddle would need to be tested and debugged separately for complex. The simpler way to handle complex in ordering would be to treat it as two columns of double and reuse the existing, tested twiddle on the real and imaginary parts. Either the complex column could be unpacked so it really was 2 columns in memory (i.e. unpack the rowwise two-doubles), or the algos could be extended so that when they saw a complex column they iterated over it twice (first over the real part, then over the imaginary part) as if it were two columns of double. The latter would save the allocation and unpacking step and would work better in joins too.
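
For illustration only, the first (unpacking) option expressed at the R level; the helper column names z_re/z_im are made up here:

# Replace a complex column by two double columns, order on those using the
# existing, tested double machinery, then drop the helpers.
library(data.table)
DT = data.table(z = complex(real = c(1, 1, 2), imaginary = c(2, 1, 0)), v = 1:3)
DT[, c("z_re", "z_im") := .(Re(z), Im(z))]
setorder(DT, z_re, z_im)
DT[, c("z_re", "z_im") := NULL]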

I haven't seen any requests from actual users who want to order, group and join on complex columns. I think until we've got actual use-cases and requests from users, we should draw a line between i) supporting complex in value columns (those PRs are merged; e.g. ops on complex in j should work when grouping), and ii) grouping/joining on columns which are complex. For now at least.

The other aspect is that rather than adding complex support to grouping/joining as a new type, with all the cases in this PR, a generic any-bytes approach should be possible, which would also handle grouping by list columns. That's probably the way to go, but I see much higher-value tasks on the list; e.g. #3189.

Can we close this PR? We can support complex in value columns now, but not join-on-complex or group-by-complex until a user provides a use case that needs grouping or joining on complex columns. The issue can be left open with this distinction clear.

@mattdowle (Member):

I hadn't seen this from @eliocamp before I wrote the comment above.
#3689 (comment)
Reading now ...

@mattdowle (Member):

@eliocamp The info in your comment linked above was very helpful, thanks! In that by = .(dataset, season, lev) those columns (dataset, season, lev) are not complex iiuc. So in data.table-dev as of now, complex should work in value columns when grouping and joining non-complex columns. But joining-on-complex and grouping-by-complex columns won't be supported. Is that going to be ok for you?

@MichaelChirico (Member Author) commented Jul 17, 2019

Agree this is the heaviest new piece of code --> should think about the potential maintenance burden.

Also note that #1444 is closed here... perhaps this PR could be scaled back to do forder only? That part doesn't require the by/joins/unique pieces to work; it simply creates a new sorting object based on the real & imaginary parts.

Inefficient, but convenient.

this commit did so at the R level: 2e8c996

this commit moved that logic to C: 52bcac0

PS: also agree it would be helpful, as a backup, to default to base:: functionality for unsupported types, but that would require some thought re: the proper implementation.

@mattdowle (Member) commented Jul 17, 2019

Major scaling back would be good. The reorder.c change here would be good to merge as-is though. If cplxTwiddled can be removed then maybe forder.c, but then I'm looking at the 50 lines inside the n_cplx branch and wondering why that needs so many lines.

@MichaelChirico (Member Author) commented Jul 18, 2019

The reorder.c change was originally in another PR, but I needed part of it for the bmerge/uniqlist stuff. I'll reopen that one to merge separately.

#3688

@MichaelChirico (Member Author):

For the n_cplx branch, 12 lines or so are comments.

A lot of the rest is duplication -- apply to the real part, apply to the imaginary part.

But the idea is to reconstruct a new DT (in the order of the original by vector) where the complex columns are replaced by two columns (real part, imaginary part).

@MichaelChirico (Member Author) commented Jul 18, 2019

@mattdowle OK, have scaled back by removing the cplxTwiddled bit & anything using it.

Besides the large-ish part for forder itself, the rest of the diff is pretty boilerplate & unlocks some decent functionality slightly downstream of forder (small usage sketch at the end of this comment):

  • by = z
  • dcast(z ~
  • frank(z)
  • rowid(z)
  • rleid(z)
  • set{key,order}(z)
  • unique(by=z)

Will file (and close) a follow-up PR with the rest as a starting point if needed in the future, so the commits here can be squashed
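
For concreteness, a small usage sketch of the functionality listed above (assuming this branch is installed; toy data made up here):

library(data.table)
DT = data.table(z = c(1+1i, 1+0i, 1+1i, 2+0i), v = 1:4)

DT[, .(sum_v = sum(v)), by = z]   # by = z
frank(DT$z)                       # frank(z)
rowid(DT$z)                       # rowid(z)
rleid(DT$z)                       # rleid(z)
setkey(DT, z)                     # set{key,order}(z)
unique(DT, by = "z")              # unique(by = z)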

src/dogroups.c Outdated (review thread resolved)
@eliocamp (Contributor):

> @eliocamp The info in your comment linked above was very helpful, thanks! In that by = .(dataset, season, lev) those columns (dataset, season, lev) are not complex iiuc. So in data.table-dev as of now, complex should work in value columns when grouping and joining non-complex columns. But joining-on-complex and grouping-by-complex columns won't be supported. Is that going to be ok for you?

Yes, it's eminently reasonable. Complex values are probably exclusively used as measured values and not category-like, so I don't imagine there are many reasonable workflows that would include a by = complex, and in that case using by = ReIm(complex) would be a relatively easy workaround.
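
A minimal sketch of that workaround; ReIm() here is a hypothetical helper rather than an existing data.table function, and whether by = ReIm(z) is accepted verbatim isn't checked here:

library(data.table)
DT = data.table(z = c(1+1i, 1+0i, 1+1i), v = 1:3)

# group on the real and imaginary parts as two ordinary double columns
DT[, .(sum_v = sum(v)), by = .(re = Re(z), im = Im(z))]

# a hypothetical ReIm() helper wrapping the same idea
ReIm = function(z) list(re = Re(z), im = Im(z))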

src/forder.c Outdated
}
DT = new_dt;
ascArg = new_asc;
by = new_by;
}
@mattdowle (Member):

I'm a bit nervous about this branch. It looks constrained to only when n_cplx>0 (which is good -- it's isolated), and I know what it is doing (splitting complex into real and imaginary columns) but I don't see how it's doing it. I'm going to need more time to review it and make sure it doesn't impact anything unexpected.

@mattdowle (Member) commented Jul 18, 2019:

How about dealing with complex columns further down in this switch? (GitHub wouldn't let me add a comment on that line, perhaps as it's outside the diff, so I pasted an image.)
[image: screenshot of the suggested location further down the switch in forder.c]

@MichaelChirico (Member Author) commented Jul 18, 2019:

When I first took a look I decided against that (I think my initial reaction was that range doesn't make sense for complex, but of course it does, because they're well-ordered under the lexicographic rule). Will take another look.

The n_cplx branch is a bit messy, and took me several iterations to get there as I kept failing a few more tests each time. I think it can be cleaned up a little by snazzier use of the j index.

The idea is: receive by, which is, say, length n, m of which are complex columns.

Instead of sorting using the input list DT, do so using a new list (constructed as new_dt).

If there are no complex columns, new_dt[[1]] is DT[[ by[1] ]], new_dt[[2]] is DT[[ by[2] ]], and so on.

If there are complex columns, say the first is at the third index of by, then new_dt[[3]] is Re( DT[[ by[3] ]] ) and new_dt[[4]] is Im( DT[[ by[3] ]] ). Then continuing on as usual. That "bump" of the index by a complex column is what made things complicated.

Similar thinking applies to the re-mapped ascArg; by simply gets re-mapped to seq_along(new_dt)
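
An R-level sketch of that re-mapping (the actual code is C inside forder.c; the names here are made up, purely to illustrate the index "bump"):

# 'DT' is the input list of columns, 'by' the requested column indices,
# 'asc' the per-column sort directions. Lists are grown for clarity, not speed.
split_complex = function(DT, by, asc) {
  new_dt  = list()
  new_asc = integer(0)
  for (j in seq_along(by)) {
    col = DT[[ by[j] ]]
    if (is.complex(col)) {
      # one complex column becomes two double columns, Re first then Im,
      # both inheriting that column's sort direction
      new_dt  = c(new_dt, list(Re(col), Im(col)))
      new_asc = c(new_asc, asc[j], asc[j])
    } else {
      new_dt  = c(new_dt, list(col))
      new_asc = c(new_asc, asc[j])
    }
  }
  # 'by' then simply gets re-mapped to seq_along(new_dt)
  list(DT = new_dt, by = seq_along(new_dt), asc = new_asc)
}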

@mattdowle (Member):

I wasn't thinking of doing a single range for complex. The thought was to trick the loop into doing a complex column twice: the first time doing range for the real part only, the second time calling range for the imaginary part only. There might need to be new range_complex_r and range_complex_i perhaps.

@mattdowle (Member) commented Jul 18, 2019:

The other option is to postpone this PR to a later date. Elio's response above confirms that setkey-on-complex and grouping-on-complex aren't high priority. It was complex value columns that were the main thing, which you did and which have been merged.

@MichaelChirico (Member Author):

I think the biggest impediment to that is that I don't see a way to get a double * object pointing to the real/imaginary vectors (like REAL(COMPLEX(x)) or COMPLEX(x).r, hope that makes sense), which is what all of the functions for REAL columns take as input.

@MichaelChirico (Member Author):

> It would be nice for completeness to get this merged and close #1703, yes. But I also have an eye on the fact that we've both been working on complex number support for 3 days now (elapsed) and continuing. And Elio said going this far (sorting complex) isn't needed. There isn't any demand for this from users as far as I can see.

True... I'm just loath to close the PR given that it's passing tests. I also understand the tradeoff in terms of maintenance burden.

@mattdowle (Member):

> I think the biggest impediment to that is that I don't see a way to get a double * object pointing to the real/imaginary vectors (like REAL(COMPLEX(x)) or COMPLEX(x).r, hope that makes sense), which is what all of the functions for REAL columns take as input.

Yes, makes sense. That's why I wrote above: "There might need to be new range_complex_r and range_complex_i perhaps." But I didn't look at it closely.

> maintenance burden

It's not a burden in the sense that more code is just always generally bad. It's that this particular change could cause hard-to-trace bugs because of the type of change it is to such a core function. It's risk. And it's time in reviewing it.

> it's passing tests

But this shows the tests aren't sufficient: the cast isn't correct but it's still passing tests.

@MichaelChirico (Member Author):

> But this shows the tests aren't sufficient: the cast isn't correct but it's still passing tests.

Very true, I meant more that it's passing existing tests (so "no" existing functionality is being affected).

Would adding a lot more tests to the suite for the new code be a good direction?

@mattdowle (Member):

A good direction would be to focus on the most requested issues: #3189

src/uniqlist.c Outdated
case CPLXSXP: {
// tried to make long long complex * but got a warning that it's a GNU extension
Rcomplex *pz = COMPLEX(jcol);
same = (long long)pz[i].r == (long long)pz[i-1].r && (long long)pz[i].i == (long long)pz[i-1].i;
@mattdowle (Member):

This (long long) is a cast and won't achieve the desired op. This cast will truncate 3.4 to 3, for example. What's needed is a type pun, which is what the REALSXP case just above does.

@MichaelChirico (Member Author) commented Jul 18, 2019:

So to clarify, in the REALSXP branch:

long long *ll = (long long *)REAL(jcol);
same = ll[i]==ll[i-1];

What's happening here is that REAL(jcol) is a pointer to double, and ll is a long long pointer to that same address.

So then ll[i] == ll[i-1] treats the elements of jcol (which are actually doubles) as 8-byte integers & compares those 64 bits.

Whereas (long long)pz[i].r (etc.) actually converts the double value at pz[i].r to long long, leading to truncation, so the operations aren't equivalent.

Is that the gist of it?

@mattdowle (Member):

Yes: exactly.

@MichaelChirico (Member Author):

Is there a native 16-byte type to pun to? I'm not seeing one. I guess we could just use pure equality-on-doubles & leave it at that as it's low enough priority.

@mattdowle (Member) commented Jul 18, 2019:

Another way, since all the cases have the same code other than the comparison, is to take away all the cases and just do one memcmp for all types (just pass the size to memcmp). It would need benchmarking on large data to see if memcmp was any slower (overhead might come into it because memcmp would be called a lot). I didn't look closely to see if there are any other differences between the cases in this switch in this file, to know whether that's possible. It may be faster to pass a fixed constant to memcmp in each of the cases (rather than passing a variable size) because then the gcc optimizer would convert the memcmp to the best method in assembly for that size and avoid a function call.

src/uniqlist.c Outdated
Rcomplex *pzjcol = COMPLEX(jcol);
for (R_xlen_t i=1; i<nrow; i++) {
bool same = (long long)pzjcol[i].r == (long long)pzjcol[i-1].r &&
(long long)pzjcol[i].i == (long long)pzjcol[i-1].i;
@mattdowle (Member):

Same here (pun needed, not cast).

mattdowle added this to the 1.12.4 milestone (Jul 19, 2019)
mattdowle added a commit that referenced this pull request (Jul 19, 2019)
Rdatatable deleted a comment from codecov bot (Jul 19, 2019)
mattdowle merged commit e5c0bed into master (Jul 19, 2019)
mattdowle deleted the cplx_forder branch (Jul 19, 2019, 21:39)

Successfully merging this pull request may close: Extend forder support to 'complex' type (#1703).