Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Joins depend on column order. Normal behaviour? #1179

Closed
ChristK opened this issue Jun 15, 2015 · 10 comments
Closed

Joins depend on column order. Normal behaviour? #1179

ChristK opened this issue Jun 15, 2015 · 10 comments

Comments

@ChristK
Copy link

ChristK commented Jun 15, 2015

I'm not an expert on joins but this doesn't look normal to me. See below.

require(data.table)  # v1.95 latest github version
DT1 <- data.table(a=1:6, key="a")
DT2 <- data.table(a=1:3, b=letters[1:3])
DT3 <- data.table(b=letters[1:3], a=1:3) # reverse the column order

DT1[DT2] # Works as expected
#    a b
#1: 1 a
#2: 2 b
#3: 3 c

DT1[DT3] # Gives error
# Error in bmerge(i <- shallow(i), x, leftcols, rightcols, io <- haskey(i),  : 
# typeof x.a (integer) != typeof i.b (character)

In more complex cases this doesn't break but continuous silently giving unexpected results

sessionInfo()
# R version 3.2.0 (2015-04-16)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 14.04.2 LTS

# locale:
# [1] LC_CTYPE=en_GB.UTF-8      
#  [2] LC_NUMERIC=C              
#  [3] LC_TIME=en_GB.UTF-8       
#  [4] LC_COLLATE=en_GB.UTF-8    
#  [5] LC_MONETARY=en_GB.UTF-8   
#  [6] LC_MESSAGES=en_GB.UTF-8   
#  [7] LC_PAPER=en_GB.UTF-8      
#  [8] LC_NAME=C                 
#  [9] LC_ADDRESS=C              
# [10] LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_GB.UTF-8
# [12] LC_IDENTIFICATION=C       

# attached base packages:
# [1] stats     graphics  grDevices utils    
# [5] datasets  methods   base     

# other attached packages:
# [1] data.table_1.9.5

# loaded via a namespace (and not attached):
# [1] tools_3.2.0  chron_2.3-45
@franknarf1
Copy link
Contributor

The error message sort of explains: you see typeof x.a (integer) != typeof i.b (character) because DT1[DT3] merges using DT3's columns in order/by position, not by name (trying to merge a with b).

@ChristK
Copy link
Author

ChristK commented Jun 15, 2015

Yes, but is this expected behaviour? I would expect a join on the key column "a", irrespective of the position of column "a" in the data.table

@franknarf1
Copy link
Contributor

Yes. You can try using merge, which got by.x and by.y arguments recently: #637

@arunsrinivasan
Copy link
Member

The yellow box in this answer pretty much explains the current status of joins.

Joining by column name rather than position when i doesn't have a key set is on the agenda, but we've not managed to get to it yet.

When #1130 will be implemented, the interface will be much better (and transparent).

@ChristK
Copy link
Author

ChristK commented Jun 15, 2015

Thanks you both. That makes sense.
However, shouldn't this throw an error (or warning) then?

DT4 <- data.table(a=factor(1:6), key="a")
DT5 <- data.table(a=letters[1:3], b=1:3)

DT4[DT5] # This is not expected to work
#    a b
# 1: 1 a
# 2: 2 b
# 3: 3 c

I've only started to explore joins. I'm not sure what is considered as the expected behaviour, therefore it would be unsafe to PR myself.

@franknarf1
Copy link
Contributor

Hm. In 1.9.4, I do not see that, but rather

    a b
1: NA 1
2: NA 2
3: NA 3

which is what I expect. There are no matches (so a is blank), but each row of i (DT5 here) is still represented.

@arunsrinivasan
Copy link
Member

@ChristK not sure which commit you're using. I get this:

#    a b
# 1: a 1
# 2: b 2
# 3: c 3

And this is what I expect. @franknarf1 why would you expect the result you show? It was a bug in 1.9.4 that was fixed recently.

x[i] takes each row in DT5 = i and returns the corresponding matching rows in DT4 = x (and NA if no match). So you'd definitely have all rows and columns from DT5 = i in the result (as in this case). In SQL terms, a right join.

However, there are no columns in DT4 = x. So we don't have any NAs. That's the confusing part. Add another column to DT4 and perform the join again, and things should clear up.

Joins between character and factor types are allowed in data.tables.

@franknarf1
Copy link
Contributor

@arunsrinivasan I guess I've been using 1.9.4 so long (after like a year not using R) that I've come to rationalize what I see (and I've seen this bug several times, not thinking it a bug). Whoops.

@ChristK
Copy link
Author

ChristK commented Jun 15, 2015

@arunsrinivasan Now I get it! Thank you for your time and insight.

@ChristK ChristK closed this as completed Jun 15, 2015
@arunsrinivasan
Copy link
Member

@ChristK 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants