More than one join and stdin, part two #403

Closed
johnkerl opened this issue Jan 18, 2021 · 4 comments

Comments

@johnkerl
Owner

I have multiple files with the same structure (i.e., the same header). For example, one of the files looks like:

→ mlr --csv cat input_1.csv
zone,label,mean,stddev
1,Barren,0.985039418507162,0.00327046755267665
2,Permanent Snow and Ice,0.990449367088603,0.00347695390530483
3,Water Bodies,0.989689587426295,0.00283130417558745
9,Urban and Built-up Lands,0.975935137657604,0.00444728462815199
10,Dense Forests,0.982209571011,0.00151525916704626
20,Open Forests,0.982498749692162,0.00156407685057156
25,Forest/Cropland Mosaics,0.983158952435782,0.00119917817740868
30,Natural Herbaceous,0.982886083655933,0.00172176084515656
35,Natural Herbaceous/Croplands Mosaics,0.983636363636364,0.000771389215405308
36,Herbaceous Croplands,0.983256814928586,0.00116699486215588
40,Shrublands,0.977095890410958,0.00439150607284168

I can join files like the one above for up to 3 input files:

→ mlr --csv join --ul --ur --lp l --rp r -j zone,label -f input_1.csv then join -j zone,label -f input_2.csv input_3.csv
zone,label,mean,stddev,lmean,lstddev,rmean,rstddev
1,Barren,0.985141452451229,0.00296063409807811,0.985039418507162,0.00327046755267665,0.984987557668154,0.00294390190031405
2,Permanent Snow and Ice,0.990172413793102,0.003303950103698,0.990449367088603,0.00347695390530483,0.989895569620253,0.0036093031143493
3,Water Bodies,0.988460091843363,0.00249696810252014,0.989689587426295,0.00283130417558745,0.988820568927725,0.00259939680954198
9,Urban and Built-up Lands,0.976210534599518,0.00436170313798978,0.975935137657604,0.00444728462815199,0.976246019422661,0.00448461857275749
10,Dense Forests,0.982076308739861,0.00148154340071296,0.982209571011,0.00151525916704626,0.982190048828062,0.00146571376545496
20,Open Forests,0.982497740034809,0.00153034810204273,0.982498749692162,0.00156407685057156,0.982466195761245,0.00150786476199308
25,Forest/Cropland Mosaics,0.983225779156313,0.00117294632959691,0.983158952435782,0.00119917817740868,0.983067448680308,0.00118920277246029
30,Natural Herbaceous,0.982983720528064,0.00166879860717559,0.982886083655933,0.00172176084515656,0.982925841727431,0.00165974257213183
35,Natural Herbaceous/Croplands Mosaics,0.983354838709678,0.00106402725807591,0.983636363636364,0.000771389215405308,0.983575757575758,0.000817620458171498
36,Herbaceous Croplands,0.983352463549144,0.00113906386917183,0.983256814928586,0.00116699486215588,0.983220069731363,0.00116521692353466
40,Shrublands,0.977145547945205,0.00442587899140026,0.977095890410958,0.00439150607284168,0.977166380789022,0.00444267104546084

How about doing this for many more input files?

Originally posted by @NikosAlexandris in #235 (comment)

@NikosAlexandris
Contributor

NikosAlexandris commented Jan 18, 2021

I probably don't understand the underlying program structure or its complexity. Otherwise, isn't it common to join multiple identically structured files and rename only the columns that differ in content?

Imaginary example

mlr --csv join --cp -j zone,label -f input*.csv

(--cp as in "count prefix", or maybe --pc as in "prefix counter")

that will output

zone,label,mean,stddev,mean1,stddev1,mean2,stddev2,mean3,stddev3
1,Barren,0.91,0.01,0.92,0.02,0.93,0.03,0.94,0.04
..
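
The output imagined above amounts to an n-wise inner join keyed on zone and label. As a minimal sketch of that idea outside Miller (this is plain Python, not a Miller feature), the function below joins any number of identically structured CSV sources and appends a counter suffix to the non-key columns of the second and later sources. The name nwise_join, the suffix scheme, and the toy values are assumptions made for illustration only.

```python
import csv
import io


def nwise_join(sources, keys):
    """Inner-join CSV sources on `keys` (hypothetical helper, not part of Miller).

    Non-key columns of the 2nd, 3rd, ... source get the suffix 1, 2, ...
    If a key repeats within one source, the last row wins.
    """
    header = list(keys)
    joined = {}
    for index, source in enumerate(sources):
        suffix = "" if index == 0 else str(index)
        reader = csv.DictReader(source)
        value_fields = [name for name in reader.fieldnames if name not in keys]
        header.extend(name + suffix for name in value_fields)

        # Index this source by its join key, renaming the value columns.
        current = {}
        for row in reader:
            key = tuple(row[k] for k in keys)
            current[key] = {name + suffix: row[name] for name in value_fields}

        if index == 0:
            joined = {
                key: {**dict(zip(keys, key)), **values}
                for key, values in current.items()
            }
        else:
            # Keep only the keys present in every source seen so far.
            joined = {
                key: {**row, **current[key]}
                for key, row in joined.items()
                if key in current
            }
    return header, list(joined.values())


# Toy data: three in-memory "files"; the third lacks zone 3, so the
# inner join drops that row.
first = "zone,label,mean,stddev\n1,Barren,0.91,0.01\n3,Water Bodies,0.95,0.05\n"
second = "zone,label,mean,stddev\n1,Barren,0.92,0.02\n3,Water Bodies,0.96,0.06\n"
third = "zone,label,mean,stddev\n1,Barren,0.93,0.03\n"

header, rows = nwise_join(
    [io.StringIO(text) for text in (first, second, third)], ["zone", "label"]
)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=header, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue(), end="")
```

With real files, the same function should accept file objects opened with newline="" in place of the io.StringIO wrappers. The sketch holds every source in memory and does not guard against a suffixed name colliding with an existing column, so it is a starting point rather than a replacement for a proper n-wise join verb.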

@johnkerl
Owner Author

johnkerl commented Feb 9, 2021

Hi @NikosAlexandris -- sorry for the long delay.

I can indeed see the value of this!!

The original idea of Miller was that, for all verbs and not just join, the files matched by input*.csv form one long stream. Another example of this is mlr --json count *.json: that counts the number of records in the whole input stream, not per file. For per-file counts, one could do for file in *.json; do mlr --json count "$file"; done.

But here even that wouldn't work, since you truly want an n-wise join of n files, which is a great idea -- and not what was implemented. :^/
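
The one-long-stream model described in this comment can be illustrated with a small Python analogy (not Miller itself). The file names, the records helper, and the use of one JSON record per line are assumptions chosen to keep the sketch short.

```python
import io
import itertools
import json


def records(source):
    """Yield one record per non-empty line (JSON Lines is assumed for brevity)."""
    for line in source:
        if line.strip():
            yield json.loads(line)


# Hypothetical in-memory stand-ins for two input files.
files = {
    "a.json": '{"x": 1}\n{"x": 2}\n',
    "b.json": '{"x": 3}\n',
}

# Stream model: all files are chained into a single record stream,
# so there is exactly one count for the whole input.
stream = itertools.chain.from_iterable(
    records(io.StringIO(text)) for text in files.values()
)
stream_count = sum(1 for _ in stream)

# Per-file model: one count per file, analogous to the shell loop.
per_file_counts = {
    name: sum(1 for _ in records(io.StringIO(text)))
    for name, text in files.items()
}

print(stream_count)
print(per_file_counts)
```

Neither model yields an n-wise join on its own: the first loses the file boundaries, and the second never sees more than one file at a time.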

@NikosAlexandris
Contributor

Imaginary example

mlr --csv join --cp -j zone,label -f input*.csv

(--cp as in "count prefix", or maybe --pc as in "prefix counter")

That should be --cs or --sc, with s as in suffix, since the counter logically goes at the end :-)

@johnkerl
Owner Author

Closing this as a duplicate of #711 (which remains open).
