Wrangling data using only (unix) shell commands.
The focus is on plain text and CSV (interpreted broadly in that the C(omma) can be any other kind of delimiter)!
cut
= filter columnssed
= replace (and much more)grep
= filter rowssort
= sort!uniq
= count duplicate (with sort = crude group by)paste
= join 2 files (line by line)wc
= count lines or "words"split
= split a file into pieces (less useful)
The limitation of most shell utilities for CSV is that:
- they are fundamentally line oriented
- they are limited to a fairly naive approach to delimiters
If your CSVs are well behaved and do not:
- include line terminators within fields (which is allowed by CSV!)
- have delimiter occurring with values (e.g. having , appear in a field value - again this is very allowed as long as field value is quoted)
Then there is no problem. If your CSVs are not like this it is still possible these tools will be useful but you will have to be more careful.
Pipes! One of the great (computing design) ideas of all time:
- We should have some ways of coupling programs like garden hose--screw in another segment when it becomes when it becomes necessary to massage data in another way. This is the way of IO also. [Doug McIlroy, 1964] (source)
Basic point: the unix shell (and all of the above commands) have great support for feeding the output of one command into the input of another and doing so in a streaming "pipe-like" manner.
Example:
head -n20 file.txt | tail -n5 | cut -c1-10
This take first 20 lines of file.txt pipes that into tail which takes first 5 lines of that pipes that into cut which take characters 1-10 of each line. The result: characters 1-10 of lines 15-20.
If continuous lines at top and bottom of the file use head or tail.
Delete first line:
tail -n +2 {file}
Delete last line:
head -n +2 {file}
Delete the nth line:
sed nd {file}
Delete a range of lines (n-m):
sed n,md {file}
Multiple deletes (first, third, n-m)
sed 1;3;n,md {file}
Sort a CSV file by 2nd, 1st and 3rd columns.
sort --field-separator=',' --key=2,1,3 {file}
You can just do a simple cat:
cat somefile.csv
For a nice view with proper spacing:
column -s, -t -n < somefile.csv | less -#2 -N -S
Note: when you have empty fields, you will need to put some kind of placeholder in it, otherwise the column gets merged with following columns. The following example demonstrates how to use sed to insert a placeholder:
$ cat data.csv
1,2,3,4,5
1,,,,5
$ sed 's/,,/, ,/g;s/,,/, ,/g' data.csv | column -s, -t -n
1 2 3 4 5
1 5
$ cat data.csv
1,2,3,4,5
1,,,,5
$ column -s, -t -n < data.csv
1 2 3 4 5
1 5
$ sed 's/,,/, ,/g;s/,,/, ,/g' data.csv | column -s, -t -n
1 2 3 4 5
1 5