ChangeLog:

v5.6.2

Bug fixes:
#271 fixes a corner-case bug with more than 100 CSV/TSV files with headers of varying lengths.

Documentation:
The new http://johnkerl.org/miller/doc/whyc-details.html is an elaboration on http://johnkerl.org/miller/doc/whyc.html. It answers a question posed by @BurntSushi on Reddit a couple of years ago which I did not address in detail at the time.

v5.6.1

The only change is that http://johnkerl.org/miller/doc is now more mobile-friendly. All build artifacts are the same as at https://github.com/johnkerl/miller/releases/tag/v5.6.0

v5.6.0

The new system DSL function allows you to run arbitrary shell commands and store their output in field values. Some example usages are documented here. This is in response to issues #246 and #209.
There is now support for ASV and USV file formats. This is in response to issue #245.
The new format-values verb allows you to apply numerical formatting across all record values. This is in response to issue #252.

Documentation:
The new DKVP I/O in Python sample code now works for Python 2 as well as Python 3.
There is a new cookbook entry on doing multiple joins. This is in response to issue #235.

Bugfixes:
The toupper, tolower, and capitalize DSL functions are now UTF-8 aware, thanks to @sheredom's marvelous https://github.com/sheredom/utf8.h. The internationalization page has also been expanded. This is in response to issue #254.
#250 fixes a bug using in-place mode in conjunction with verbs (such as rename or sort) which take field-name lists as arguments.
#253 fixes a bug in the label verb when one or more names are common between old and new.
#251 fixes a corner-case bug when (a) input is CSV; (b) the last field ends with a comma and no newline; (c) input is from standard input and/or --no-mmap is supplied.

v5.5.0

The new positional-indexing feature resolves #236 from @aborruso. You can now get the name of the 3rd field of each record via $[[3]], and its value by $[[[3]]].
These are both usable on either the left-hand or right-hand side of assignment statements, so you can more easily do things like renaming fields programmatically within the DSL.
There is a new capitalize DSL function, complementing the already-existing toupper. This stems from #236.
There is a new skip-trivial-records verb, resolving #197. Similarly, there is a new remove-empty-columns verb, resolving #206. Both are useful for data-cleaning use-cases.
Another pair is #181 and #256. While Miller uses mmap internally (and invisibly) to get approximately a 20% performance boost over not using it, this can cause out-of-memory issues when reading either large files or too many small ones. Now, Miller automatically avoids mmap in these cases. You can still use --mmap or --no-mmap if you want manual control of this.
There is a new --ivar option for the nest verb which complements the already-existing --evar. This is from #260 thanks to @jgreely.
There is a new keystroke-saving urandrange DSL function: urandrange(low, high) is the same as low + (high - low) * urand(). This arose from #243.
There is a new -v option for the cat verb which writes a low-level record-structure dump to standard error.
There is a new -N option for mlr which is a keystroke-saver for --implicit-csv-header --headerless-csv-output.

Documentation:
The new FAQ entry http://johnkerl.org/miller/doc/faq.html#How_to_escape_'%3F'_in_regexes%3F resolves #203.
The new FAQ entry http://johnkerl.org/miller/doc/faq.html#How_can_I_filter_by_date%3F resolves #208.
#244 fixes a documentation issue while highlighting the need for #241.

Bugfixes:
There was a SEGV using nest within then-chains, fixed in response to #220.
Quotes and backslashes weren't being escaped in JSON output with --jvquoteall; reported on #222.

v5.4.0

The new clean-whitespace verb resolves #190 from @aborruso.
Along with the new functions strip, lstrip, rstrip, collapse_whitespace, and clean_whitespace, there is now both coarse-grained and fine-grained control over whitespace within field names and/or values. See the linked-to documentation for examples.
The new altkv verb resolves #184, which was originally opened via an email request. This supports mapping value-lists such as a,b,c,d to alternating key-value pairs such as a=b,c=d.
The new fill-down verb resolves #189 by @aborruso. See the linked-to documentation for examples.
The uniq verb now has a -a option, which resolves #168 from @sjackman.
The new regextract and regextract_or_else functions resolve #183 by @aborruso.
The new ssub function arises from #171 by @dohse, as a simplified way to avoid escaping characters which are special to regular-expression parsers.
There are new localtime functions in response to #170 by @sitaramc. Note, however, that as discussed on #170 these do not undo one another in all circumstances. This is a non-issue for timezones which do not do DST. Otherwise, please use with disclaimers: localdate, localtime2sec, sec2localdate, sec2localtime, strftime_local, and strptime_local.

Builds:
Windows build-artifacts are now available in Appveyor at https://ci.appveyor.com/project/johnkerl/miller/build/artifacts, and will be attached to this and future releases. This resolves #167, #148, and #109.
Travis builds at https://travis-ci.org/johnkerl/miller/builds now run on OSX as well as Linux.
An Ubuntu 17 build issue was fixed by @singalen on #164.

Documentation:
put/filter documentation was confusing as reported by @NikosAlexandris on #169.
The new FAQ entry http://johnkerl.org/miller-releases/miller-head/doc/faq.html#How_to_rectangularize_after_joins_with_unpaired? resolves #193 by @aborruso.
The new cookbook entry http://johnkerl.org/miller/doc/cookbook.html#Options_for_dealing_with_duplicate_rows arises from #168 from @sjackman.
The unsparsify documentation had some words missing as reported by @tst2005 on #194.
There was a typo in the cookbook page http://johnkerl.org/miller/doc/cookbook.html#Full_field_renames_and_reassigns as fixed by @tst2005 in #192.

Bugfixes:
There was a memory leak for TSV-format files only as reported by @treynr on #181.
Dollar signs in regular expressions were not being escaped properly as reported by @dohse on #171.

v5.3.0

Comment strings in data files: mlr --skip-comments allows you to filter out input lines starting with #, for all file formats. Likewise, mlr --skip-comments-with X lets you specify the comment-string X. Comments are only supported at the start of a data line. mlr --pass-comments and mlr --pass-comments-with X allow you to forward comments to program output as they are read.
The count-similar verb lets you compute cluster sizes by cluster labels.
While Miller DSL arithmetic gracefully overflows from 64-bit integer to double-precision float (see also here), there are now the integer-preserving arithmetic operators .+ .- .* ./ .// for those times when you want integer overflow.
There is a new bitcount function: for example, echo x=0xf0000206 | mlr put '$y=bitcount($x)' produces x=0xf0000206,y=7.
Issue 158: mlr -T is an alias for --nidx --fs tab, and mlr -t is an alias for mlr --tsvlite.
The mathematical constants π and e have been renamed from PI and E to M_PI and M_E, respectively. (It's annoying to get a syntax error when you try to define a variable named E in the DSL, when A through D work just fine.) This is a backward incompatibility, but not enough of one to justify calling this release Miller 6.0.0.

Documentation:
As noted here, while Miller has its own DSL there will always be things better expressible in a general-purpose language. The new page Sharing data with other languages shows how to seamlessly share data back and forth between Miller, Ruby, and Python.
SQL-input examples and SQL-output examples contain detailed information on the interplay between Miller and SQL.
Issue 150 raised a question about suppressing numeric conversion. This resulted in a new FAQ entry How do I suppress numeric conversion?, as well as the longer-term follow-on issue 151 which will make numeric conversion happen on a just-in-time basis.
To my surprise, csvlite format options weren’t listed in mlr --help or the manpage. This has been fixed.
Documentation for auxiliary commands has been expanded, including within the manpage.

Bugfixes:
Issue 159 fixes regex-match of literal dot.
Issue 160 fixes out-of-memory cases for huge files. This is an old bug, as old as Miller, and is due to inadequate testing of huge-file cases. The problem is simple: Miller prefers memory-mapped I/O (using mmap) over stdio since mmap is fractionally faster. Yet as any processing (even mlr cat) steps through an input file, more and more pages are faulted in -- and, unfortunately, previous pages are not paged out once memory pressure increases. (This despite gallant attempts with madvise.) Once all processing is done, the memory is released; there is no leak per se. But the Miller process can crash before the entire file is read. The solution is equally simple: to prefer stdio over mmap for files over 4GB in size. (This 4GB threshold is tunable via the --mmap-below flag as described in the manpage.)
Issue 161 fixes a CSV-parse error (with error message "unwrapped double quote at line 0") when a CSV file starts with the UTF-8 byte-order-mark ("BOM") sequence 0xef 0xbb 0xbf and the header line has double-quoted fields. (Release 5.2.0 introduced handling for UTF-8 BOMs, but missed the case of a double-quoted header line.)
Issue 162 fixes a corner case doing multi-emit of aggregate variables when the first variable name is a typo.
The Miller JSON parser used to fail with the error "Unable to parse JSON data: Line 1 column 0: Unexpected 0x00 when seeking value" on empty input, or on input with trailing whitespace; this has been fixed.
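
As an illustration of the v5.3.0 arithmetic items above (bitcount, and the difference between overflow-to-float and integer-preserving addition), here is a minimal Python sketch. It is not Miller's implementation. The helper names wrapping_add and auto_float_add are hypothetical, and the two's-complement wrap-around shown for .+ is an assumption about what "integer overflow" means here. Only the bitcount result for 0xf0000206 is taken from the notes above.

```python
# Illustrative sketch only; not Miller source code.
# Assumption: integers are signed 64-bit, and "integer-preserving"
# addition wraps around in two's complement on overflow.

MASK64 = (1 << 64) - 1
INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1


def bitcount(x: int) -> int:
    """Count the 1-bits in the low 64 bits of x."""
    return bin(x & MASK64).count("1")


def wrapping_add(a: int, b: int) -> int:
    """Hypothetical stand-in for '.+': add, then wrap to signed 64-bit."""
    s = (a + b) & MASK64
    return s - (1 << 64) if s > INT64_MAX else s


def auto_float_add(a: int, b: int):
    """Hypothetical stand-in for '+': stay integer when the sum fits in
    signed 64 bits, otherwise fall back to a double-precision float."""
    s = a + b
    if INT64_MIN <= s <= INT64_MAX:
        return s
    return float(s)


if __name__ == "__main__":
    print(bitcount(0xF0000206))            # 7, matching the example above
    print(wrapping_add(INT64_MAX, 1))      # -9223372036854775808
    print(auto_float_add(INT64_MAX, 1))    # a float near 9.22e18
    print(auto_float_add(2, 3))            # 5
```

The two add helpers differ only once a result leaves the signed 64-bit range: one keeps an integer at the cost of wrapping, the other keeps the magnitude at the cost of switching to floating point.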