forderv handles complex input #3701

Merged
merged 33 commits into master from cplx_forder on Jul 19, 2019

Conversation

@MichaelChirico (Member) commented Jul 12, 2019

Closes #1703
and other parts of #3690.

I took a look into adding support "properly" (directly on the complex columns themselves) at the C level, but the effort there seems prohibitive -- we would have to modularize the code a bit to allow CPLXSXP to delegate to the REALSXP engine on the .r and .i components. #1703 was filed 3 years ago without much follow-up, so I assume the simple approach here will do just fine. A quick benchmark suggested the current implementation beats base::order and an approach that explicitly delegates like order(Re(z), Im(z)) in terms of speed & memory efficiency.
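
For illustration (not part of the PR), the explicit-delegation alternative mentioned above looks roughly like this in base R:

# base::order handles complex directly (sorting by real part, then imaginary),
# while the delegated version orders on Re(z) and Im(z) as two double keys;
# with no exact ties the two orderings should agree.
set.seed(1)
z  = complex(real = runif(10), imaginary = runif(10))
o1 = order(z)
o2 = order(Re(z), Im(z))
identical(o1, o2)   # expected TRUE here (random doubles, so no ties)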

Sorting itself is done I think.

Now working on the stuff downstream of forderv that this should unlock -- keys, joins, etc.

TO DO:

codecov bot commented Jul 13, 2019

Codecov Report

Merging #3701 into master will increase coverage by 0.15%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3701      +/-   ##
==========================================
+ Coverage    98.3%   98.45%   +0.15%     
==========================================
  Files          69       69              
  Lines       13267    13284      +17     
==========================================
+ Hits        13042    13079      +37     
+ Misses        225      205      -20
Impacted Files Coverage Δ
src/dogroups.c 96.97% <ø> (+4.12%) ⬆️
R/setkey.R 100% <100%> (+1.3%) ⬆️
src/uniqlist.c 100% <100%> (ø) ⬆️
R/data.table.R 97.87% <100%> (ø) ⬆️
src/bmerge.c 98.25% <100%> (+0.29%) ⬆️
R/bmerge.R 100% <100%> (ø) ⬆️
src/forder.c 100% <100%> (+0.28%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ebe5787...a9d9165.

@MichaelChirico (Member Author):

Have taken this pretty far, but could use some help to go further.

@mattdowle could you weigh in on how to proceed with bmerge & uniqlist?

Any thoughts?

@@ -51,14 +51,9 @@ setkeyv = function(x, cols, verbose=getOption("datatable.verbose"), physical=TRU
}
if (identical(cols,"")) stop("cols is the empty string. Use NULL to remove the key.")
if (!all(nzchar(cols))) stop("cols contains some blanks.")
if (!length(cols)) {
@MichaelChirico (Member Author):

This if is redundant & contradictory -- just a few lines earlier (L47), the same condition if (!length(cols)) only triggers a warning. I guess this branch was some leftover or copy/paste from setkey... revealed by Codecov.

mattdowle added the WIP label (Jul 17, 2019)
mattdowle changed the title from "WIP: Closes #1703 -- forderv handles complex input" to "forderv handles complex input" (Jul 17, 2019)
@mattdowle (Member) commented Jul 17, 2019

Ok so this is a lot of progress! But it's risky. There are a lot of core-file changes here, and as we've seen from #3692 (comment), adding support for the new type can introduce unintended and hard-to-track memory faults, because it's a new type that's bigger too (16 bytes).

The new cplxTwiddled is 30 lines (out of +300 in this PR). I had actually been thinking we could remove dtwiddle, as I've been reducing its use over time. This new twiddle would need to be tested and debugged separately for complex. The simpler way to handle complex in ordering would be to treat it as two columns of double and reuse the existing, tested twiddle on the real and imaginary parts. Either the complex column could be unpacked so it really was 2 columns in memory (i.e. unpack the rowwise two-doubles), or the algos could be extended so that when they saw a complex column they iterated over it twice (first over the real part, then over the imaginary part) as if it were two columns of double. The latter would save the allocation and unpacking step and would work better in joins too.
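
For illustration only, the first (unpacking) option expressed at the R level; the helper column names z_re/z_im are made up here:

# Replace a complex column by two double columns, order on those using the
# existing, tested double machinery, then drop the helpers.
library(data.table)
DT = data.table(z = complex(real = c(1, 1, 2), imaginary = c(2, 1, 0)), v = 1:3)
DT[, c("z_re", "z_im") := .(Re(z), Im(z))]
setorder(DT, z_re, z_im)
DT[, c("z_re", "z_im") := NULL]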

I haven't seen any requests from actual users who want to order, group and join on complex columns. I think until we've got actual use-cases and requests from users, we should draw a line between i) supporting complex in value columns (those PRs are merged; e.g. ops on complex in j should work when grouping), and ii) grouping/joining on columns which are complex. For now at least.

The other aspect is that rather than adding complex support to grouping/joining as a new type, with all the cases in this PR, a generic any-bytes approach should be possible, which would also handle grouping by list columns. That's probably the way to go, but I see much higher-value tasks on the list; e.g. #3189.

Can we close this PR? We can support complex in value columns now, but not join-on-complex or group-by-complex until a user provides a use case that needs grouping or joining on complex columns. The issue can be left open with this distinction clear.

@mattdowle (Member):

I hadn't seen this from @eliocamp before I wrote the comment above.
#3689 (comment)
Reading now ...

@mattdowle (Member):

@eliocamp The info in your comment linked above was very helpful, thanks! In that by = .(dataset, season, lev) those columns (dataset, season, lev) are not complex iiuc. So in data.table-dev as of now, complex should work in value columns when grouping and joining non-complex columns. But joining-on-complex and grouping-by-complex columns won't be supported. Is that going to be ok for you?

@MichaelChirico (Member Author) commented Jul 17, 2019

Agree this is the heaviest new piece of code --> should think about the potential maintenance burden.

Also note that #1444 is closed here... perhaps this PR could be scaled back to do forder only? That part doesn't require the by/joins/unique pieces to work; it simply creates a new sorting object based on the real & imaginary parts.

Inefficient, but convenient.

this commit did so at the R level: 2e8c996

this commit moved that logic to C: 52bcac0

PS: also agree it would be helpful, as a backup, to default to base:: functionality for unsupported types, but that would require some thought re: the proper implementation.

@mattdowle (Member) commented Jul 17, 2019

Major scaling back would be good. The reorder.c change here would be good to merge as-is though. If cplxTwiddled can be removed then maybe forder.c, but then I'm looking at the 50 lines inside the n_cplx branch and wondering why that needs so many lines.

@MichaelChirico (Member Author) commented Jul 18, 2019

The reorder.c change was originally in another PR, but I needed part of it for the bmerge/uniqlist stuff. I'll reopen that one to merge separately.

#3688

@MichaelChirico (Member Author):

For the n_cplx branch, 12 lines or so are comments.

A lot of the rest is duplication -- apply to the real part, apply to the imaginary part.

But the idea is to reconstruct a new DT (in the order of the original by vector) where the complex columns are replaced by two columns (real part, imaginary part).

@MichaelChirico (Member Author) commented Jul 18, 2019

@mattdowle OK, have scaled back by removing the cplxTwiddled bit & anything using it.

Besides the large-ish part for forder itself, the rest of the diff is pretty boilerplate & unlocks some decent functionality slightly downstream of forder (small usage sketch at the end of this comment):

  • by = z
  • dcast(z ~
  • frank(z)
  • rowid(z)
  • rleid(z)
  • set{key,order}(z)
  • unique(by=z)

Will file (and close) a follow-up PR with the rest as a starting point if needed in the future, so the commits here can be squashed
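
For concreteness, a small usage sketch of the functionality listed above (assuming this branch is installed; toy data made up here):

library(data.table)
DT = data.table(z = c(1+1i, 1+0i, 1+1i, 2+0i), v = 1:4)

DT[, .(sum_v = sum(v)), by = z]   # by = z
frank(DT$z)                       # frank(z)
rowid(DT$z)                       # rowid(z)
rleid(DT$z)                       # rleid(z)
setkey(DT, z)                     # set{key,order}(z)
unique(DT, by = "z")              # unique(by = z)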

src/dogroups.c Outdated (review thread resolved)
@eliocamp (Contributor):

> @eliocamp The info in your comment linked above was very helpful, thanks! In that by = .(dataset, season, lev) those columns (dataset, season, lev) are not complex iiuc. So in data.table-dev as of now, complex should work in value columns when grouping and joining non-complex columns. But joining-on-complex and grouping-by-complex columns won't be supported. Is that going to be ok for you?

Yes, it's eminently reasonable. Complex values are probably exclusively used as measured values and not category-like, so I don't imagine there are many reasonable workflows that would include a by = complex, and in that case using by = ReIm(complex) would be a relatively easy workaround.
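
A minimal sketch of that workaround; ReIm() here is a hypothetical helper rather than an existing data.table function, and whether by = ReIm(z) is accepted verbatim isn't checked here:

library(data.table)
DT = data.table(z = c(1+1i, 1+0i, 1+1i), v = 1:3)

# group on the real and imaginary parts as two ordinary double columns
DT[, .(sum_v = sum(v)), by = .(re = Re(z), im = Im(z))]

# a hypothetical ReIm() helper wrapping the same idea
ReIm = function(z) list(re = Re(z), im = Im(z))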

src/forder.c Outdated
}
DT = new_dt;
ascArg = new_asc;
by = new_by;
}
@mattdowle (Member):

I'm a bit nervous about this branch. It looks constrained to only when n_cplx>0 (which is good -- it's isolated), and I know what it is doing (splitting complex into real and imaginary columns) but I don't see how it's doing it. I'm going to need more time to review it and make sure it doesn't impact anything unexpected.

@mattdowle (Member) commented Jul 18, 2019:

How about dealing with complex columns further down in this switch? (GitHub wouldn't let me add a comment on that line, perhaps as it's outside the diff, so I pasted an image.)
[image: screenshot of the suggested location further down the switch in forder.c]

@MichaelChirico (Member Author) commented Jul 18, 2019:

When I first took a look I decided against that (I think my initial reaction was that range doesn't make sense for complex, but of course it does, because they're well-ordered under the lexicographic rule). Will take another look.

The n_cplx branch is a bit messy, and took me several iterations to get there as I kept failing a few more tests each time. I think it can be cleaned up a little by snazzier use of the j index.

The idea is: receive by, which is, say, length n, m of which are complex columns.

Instead of sorting using the input list DT, do so using a new list (constructed as new_dt).

If there are no complex columns, new_dt[[1]] is DT[[ by[1] ]], new_dt[[2]] is DT[[ by[2] ]], and so on.

If there are complex columns, say the first is at the third index of by, then new_dt[[3]] is Re( DT[[ by[3] ]] ) and new_dt[[4]] is Im( DT[[ by[3] ]] ). Then continuing on as usual. That "bump" of the index by a complex column is what made things complicated.

Similar thinking applies to the re-mapped ascArg; by simply gets re-mapped to seq_along(new_dt)
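
An R-level sketch of that re-mapping (the actual code is C inside forder.c; the names here are made up, purely to illustrate the index "bump"):

# 'DT' is the input list of columns, 'by' the requested column indices,
# 'asc' the per-column sort directions. Lists are grown for clarity, not speed.
split_complex = function(DT, by, asc) {
  new_dt  = list()
  new_asc = integer(0)
  for (j in seq_along(by)) {
    col = DT[[ by[j] ]]
    if (is.complex(col)) {
      # one complex column becomes two double columns, Re first then Im,
      # both inheriting that column's sort direction
      new_dt  = c(new_dt, list(Re(col), Im(col)))
      new_asc = c(new_asc, asc[j], asc[j])
    } else {
      new_dt  = c(new_dt, list(col))
      new_asc = c(new_asc, asc[j])
    }
  }
  # 'by' then simply gets re-mapped to seq_along(new_dt)
  list(DT = new_dt, by = seq_along(new_dt), asc = new_asc)
}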

@mattdowle (Member):

I wasn't thinking of doing a single range for complex. The thought was to trick the loop into doing a complex column twice: the first time doing range for the real part only, the second time calling range for the imaginary part only. There might need to be new range_complex_r and range_complex_i perhaps.

@mattdowle (Member) commented Jul 18, 2019:

The other option is to postpone this PR to a later date. Elio's response above confirms that setkey-on-complex and grouping-on-complex aren't high priority. It was complex value columns that were the main thing, which you did and which have been merged.

@MichaelChirico (Member Author):

I think the biggest impediment to that is that I don't see a way to get a double * object pointing to the real/imaginary vectors (like REAL(COMPLEX(x)) or COMPLEX(x).r, hope that makes sense), which is what all of the functions for REAL columns take as input.

@MichaelChirico (Member Author):

> It would be nice for completeness to get this merged and close #1703, yes. But I also have an eye on the fact that we've both been working on complex number support for 3 days now (elapsed) and continuing. And Elio said going this far (sorting complex) isn't needed. There isn't any demand for this from users as far as I can see.

True... I'm just loath to close the PR given that it's passing tests. I also understand the tradeoff in terms of maintenance burden.

@mattdowle (Member):

> I think the biggest impediment to that is that I don't see a way to get a double * object pointing to the real/imaginary vectors (like REAL(COMPLEX(x)) or COMPLEX(x).r, hope that makes sense), which is what all of the functions for REAL columns take as input.

Yes, makes sense. That's why I wrote above: "There might need to be new range_complex_r and range_complex_i perhaps." But I didn't look at it closely.

> maintenance burden

It's not a burden in the sense that more code is just always generally bad. It's that this particular change could cause hard-to-trace bugs because of the type of change it is to such a core function. It's risk. And it's time in reviewing it.

> it's passing tests

But this shows the tests aren't sufficient: the cast isn't correct but it's still passing tests.

@MichaelChirico (Member Author):

> But this shows the tests aren't sufficient: the cast isn't correct but it's still passing tests.

Very true, I meant more that it's passing existing tests (so "no" existing functionality is being affected).

Would adding a lot more tests to the suite for the new code be a good direction?

@mattdowle (Member):

A good direction would be to focus on the most requested issues: #3189

src/uniqlist.c Outdated
case CPLXSXP: {
// tried to make long long complex * but got a warning that it's a GNU extension
Rcomplex *pz = COMPLEX(jcol);
same = (long long)pz[i].r == (long long)pz[i-1].r && (long long)pz[i].i == (long long)pz[i-1].i;
@mattdowle (Member):

This (long long) is a cast and won't achieve the desired op. This cast will truncate 3.4 to 3, for example. What's needed is a type pun, which is what the REALSXP case just above does.

@MichaelChirico (Member Author) commented Jul 18, 2019:

So to clarify, in the REALSXP branch:

long long *ll = (long long *)REAL(jcol);
same = ll[i]==ll[i-1];

What's happening here is that REAL(jcol) is a pointer to double, and ll is a long long pointer to that same address.

So then ll[i] == ll[i-1] treats the elements of jcol (which are actually doubles) as 8-byte integers & compares those 64 bits.

Whereas (long long)pz[i].r (etc.) actually converts the double value at pz[i].r to long long, leading to truncation, so the operations aren't equivalent.

Is that the gist of it?

@mattdowle (Member):

Yes: exactly.

@MichaelChirico (Member Author):

Is there a native 16-byte type to pun to? I'm not seeing one. I guess we could just use pure equality-on-doubles & leave it at that as it's low enough priority.

@mattdowle (Member) commented Jul 18, 2019:

Another way, since all the cases have the same code other than the comparison, is to take away all the cases and just do one memcmp for all types (just pass the size to memcmp). It would need benchmarking on large data to see if memcmp was any slower (overhead might come into it because memcmp would be called a lot). I didn't look closely to see if there are any other differences between the cases in this switch in this file, to know whether that's possible. It may be faster to pass a fixed constant to memcmp in each of the cases (rather than passing a variable size) because then the gcc optimizer would convert the memcmp to the best method in assembly for that size and avoid a function call.

src/uniqlist.c Outdated
Rcomplex *pzjcol = COMPLEX(jcol);
for (R_xlen_t i=1; i<nrow; i++) {
bool same = (long long)pzjcol[i].r == (long long)pzjcol[i-1].r &&
(long long)pzjcol[i].i == (long long)pzjcol[i-1].i;
@mattdowle (Member):

Same here (pun needed, not cast).

mattdowle added this to the 1.12.4 milestone (Jul 19, 2019)
mattdowle added a commit that referenced this pull request (Jul 19, 2019)
Rdatatable deleted a comment from codecov bot (Jul 19, 2019)
mattdowle merged commit e5c0bed into master (Jul 19, 2019)
mattdowle deleted the cplx_forder branch (Jul 19, 2019, 21:39)

Successfully merging this pull request may close: Extend forder support to 'complex' type (#1703).