forked from Yelp/mrjob
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathCHANGES.txt
420 lines (395 loc) · 19.5 KB
/
CHANGES.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
v0.4.3, 2014-04-08 -- SO many bugfixes
* jobs:
* MRStep's constructor treats kwarg=None same as not setting it (#970)
* parse_counters() and parse_output() are deprecated (#829)
* self.mr is deprecated in favor of MRStep (#815)
* runners:
* All runners:
* You can now set strict_protocols from mrjob.conf (#726)
* new --no-strict-protocols command-line option
* streaming output from closed runner shows a warning (#853)
* EMR:
* --check-input-paths and --no-check-input-paths options (#864)
* skip (very slow) validation of s3 buckets if boto < 2.25.0 (#865)
* Fix for max_hours_idle bug that was terminating job flows early (#932)
* Job flows are visible to all IAM users by default (#922)
* --emr-api-param allows users to pass additional parameters to boto's
EMR API (#879)
* unset paramaters with --no-emr-api-param
* bootstrap_python_packages (deprecated) now works on 3.x EMR AMIs (#863)
* Use TERMINATE_CLUSTER instead of deprecated TERMINATE_JOB_FLOW (#974)
* updated EC2 instance type data for pooling (#995)
* Hadoop:
* exclude hadoop source jars when looking for streaming jar (#861)
* Fixed mkdir_on_hdfs for Hadoop version 2.x (#923)
* Fixed hadoop_bin on Windows (#843)
* Local
* bootstrap mrjob by default (#984)
* Inline
* fix for add_file_option() (#851)
* cd to job's working directory before instantiating mrjob class (#988)
* Use pytest to run tests (#898)
* collect-emr-active-stats subcommand (#947)
* Using xtrace flag to get more output during bootstrap (#943)
* Fixed log printouts for command line tools (#901)
* Fix to avoid interpreting windows paths as URIs (#880)
* Better error message when ssh keyfile is missing (#858)
* Update EMR tool ISO8601 parsing to be consistent with EMR runner (#869)
* Dropped support for Python 2.5 (#713)
* Dropped support for the 1.x EMR AMI series, which uses Python 2.5
v0.4.2, 2013-11-27 -- that's one small step for a JAR
* jobs:
* can interpolate input and output path(s) into arguments of JarSteps,
so they can be part of multi-step jobs (#773)
* see mrjob/examples/mr_jar_step_example.py
* JarStep now takes keyword arguments only (#769)
* removed useless "name" field; "step_args" is now just "args"
* MRJobStep (usually accessed via MRJob.mr()) is now MRStep
* runners:
* All runners:
* --setup is now fully functional (#206)
* --python-archive, --setup-cmd, and --setup-script are deprecated
* --bootstrap option works and uses sh (#206)
* --bootstrap-cmd, --bootstrap-file, --bootstrap-python-package,
--bootstrap-script are deprecated
* setup commands can no longer corrupt a task's input and output (#803)
* sh_bin is now "sh -e" by default so setup fails fast (#810)
* default is "/bin/sh -e" on EMR
* EMR:
* JarSteps work again (#763)
* auto-uploads jars for JarSteps (#772)
* JARs on the EMR instances can be accessed with file:/// URIs
* ssh_cat() no longer raises an error when catting a file
containing an error (#807)
* Fixed SignatureDoesNotMatchError that happens with boto 2.10.0+
with Python prior to 2.7.5 (#778)
* Hadoop:
* now handles JarSteps too (#770)
* Fix to mrjob.parse.urlparse() that was breaking Python 2.5
* mrjob.util.buffer_iterator_to_line_iterator() is now more efficient
and uses a bounded amount of memory
* bz2 decompression no longer discards data (#817)
v0.4.1, 2013-09-16 -- secondary sort and self-terminating job flows
* jobs:
* SORT_VALUES: Secondary sort by value (#240)
* see mrjob/examples/
* can now override jobconf() again (#656)
* renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
* examples:
* bash_wrap/ (mapper/reducer_cmd() example)
* mr_most_used_word.py (two step job)
* mr_next_word_stats.py (SORT_VALUES example)
* runners:
* All runners:
* single setup option works but is not yet documented (#206)
* setup now uses sh rather than python internally
* EMR runner:
* max_hours_idle: self-terminating idle job flows (#628)
* mins_to_end_of_hour option gives finer control over self-termination.
* Can reuse pooled job flows where previous job failed (#633)
* Throws IOError if output path already exists (#634)
* Gracefully handles SSL cert issues (#621, #706)
* Automatically infers EMR/S3 endpoints from region (#658)
* ls() supports s3n:// schema (#672)
* Fixed log parsing crash on JarSteps (#645)
* visible_to_all_users works with boto <2.8.0 (#701)
* must use --interpreter with non-Python scripts (#683)
* cat() can decompress gzipped data (#601)
* Hadoop runner:
* check_input_paths: can disable input path checking (#583)
* cat() can decompress gzipped data (#601)
* Inline/Local runners:
* Fixed counter parsing for multi-step jobs in inline mode
* Supports per-step jobconf (#616)
* Documentation revamp
* mrjob.parse.urlparse() works consistently across Python versions (#686)
* deprecated:
* many constants in mrjob.emr replaced with functions in mrjob.aws
* removed deprecated features:
* old conf locations (~/.mrjob and in PYTHONPATH) (#747)
* built-in protocols must be instances (#488)
v0.4.0, 2013-04-30 -- Slouching toward nirvana
* Changes:
* 'mrjob' command (#225)
* Changed default runner from 'local' to 'inline' (#423)
* Local runner no longer adds working directory to PYTHONPATH of
subprocesses; use inline runner instead (#424)
* Requires boto 2.2.0 or later
* Filesystem functionality moved out of MRJobRunner into into 'fs' objects
but forwarded from runners for backward compatibility
* Changed exception hierarchy of mrjob.ssh (which is private but
important)
* Inline and local runners now inherit from the SimMRJobRunner class and thus share most
of their implementation
* Internal data structure for representing a step is much richer, allowing
many cool future features (#479)
* mrjob detects Hadoop version from EMR based on API responses instead of
what's in the config (#611)
* New features:
* Support for non-Hadoop Streaming jar steps (#499)
* Support for arbitrary commands as Hadoop Streaming
mappers/combiners/reducers
* mapper_pre_filter, combiner_pre_filter, and reducer_pre_filter allow
running of a UNIX command in front of tasks to filter input outside of
the interpreter
* Hadoop runner uses PTY to print output from the Hadoop sub process to the
console (#580)
* mrjob knows how to terminate the job on cleanup (Ctrl+C closes the job).
(#353)
* Allow use of multiple -c flags on the command line (#420)
* Bug fixes:
* Silenced some incorrect warnings about ignored options in 'inline' runner
* terminate_idle_job_flows uses the default configuration to terminate idle jobs (#559)
* Removed deprecated functionality:
* --hadoop-*-format
* --*-protocol switches
* MRJob.DEFAULT_*_PROTOCOL
* MRJob.get_default_opts()
* MRJob.protocols()
* PROTOCOL_DICT
* IF_SUCCESSFUL
* DEFAULT_CLEANUP
* S3Filesystem.get_s3_folder_keys()
v0.3.5, 2012-08-21 -- The Last Ride of v0.3.x[?]
* EMR:
* --pool-wait-minutes option lets you wait up to X minutes before creating a
job flow (#455)
* Job flow ID included in error messages on failure (#452)
* JOB and JOB_FLOW cleanup options (#485, #455)
* EMR and Hadoop:
* Compatibility fixes related to deprecated options and Hadoop's bizarre
non-sequential version numbers (#489, #534)
* Other:
* Warn when *_PROTOCOL is not a class (#490)
* Bug fixes:
* Unicode strings can be used when specifying interpreters (#431)
* --enable-emr-logging no longer causes the wrong counters/logs to be parsed
(#446)
* TMP_DIR inserted into 'sort' environment variables (#477)
* Setting hadoop_home in mrjob.conf works again
* Gzipped input files work when specified with relative paths (#494)
* Passthrough options are not re-ordered when sent to Hadoop Streaming
(#509)
v0.3.4.1, 2012-06-12 -- The test suite doesn't catch everything...
* Local mode doesn't try to send multiple mappers to the same output file
when using multiple compressed files as input
v0.3.4, 2012-06-11 -- We are friendly people.
* Experimental support for IronPython in the local and inline runners
* set_status() and increment_counter() will encode messages/names of type
'unicode' as UTF-8 when writing to Hadoop Streaming
* EMR and Hadoop counter parsing is more correct
* mrjob.tools.emr.fetch_logs fetches logs from S3 when asked instead of
incorrectly refusing to do so
* jobconf values can be booleans in mrjob.conf as well as 'true' and 'false'
strings
* hadoop_version can be a float in mrjob.conf, but a warning is printed to the
console
* Command line help is split across several --help-* commands
* Local runner sorts output consistently
v0.3.3.2, 2012-04-10 -- It's a race [condition]!
* Option parsing no longer dies when -- is used as an argument (#435)
* Fixed race condition where two jobs can join same job flow thinking it is
idle, delaying one of the jobs (#438)
* Better error message when a config file contains no data for the current
runner (#433)
v0.3.3.1, 2012-04-02 -- Hothothothothothothotfix
* Fixed S3 locking mechanism parsing of last modified time to work around an
inconsistency in the EMR API
v0.3.3, 2012-03-29 -- Bug...bug...bug...bug...bug...FEATURE!
* EMR:
* Error detection code follows symlinks in Hadoop logs (#396)
* terminate_idle_job_flows locks job flows before terminating them (#391)
* terminate_idle_job_flows -qq silences all output (#380)
* Other fixes:
* mr_tower_of_powers test no longer requires Testify (#395)
* Various runner du() implementations no longer broken (#393, #394)
* Hadoop counter parser regex handles long lines better (#388)
* Hadoop counter parser regex is more correct (#305)
* Better error when trying to parse YAML without PyYAML (#348)
v0.3.2, 2012-02-22 -- AMI versions, spot instances, and more
* Docs:
* 'Testing with mrjob' section in docs (includes #321)
* MRJobRunner.counters() included in docs (#321)
* terminate_idle_job_flows is spelled correctly in docs (#339)
* Running jobs:
* local mode:
* Allow non-string jobconf values again (this changed in v0.3.0)
* Don't split *.gz files (#333)
* emr mode:
* Spot instance support via ec2_*_instance_bid_price and renamed instance
type/number options (#219)
* ami_version option to allow switching between EMR AMIs (#306)
* 'Error while reading from input file' displays correct file (#358)
* python_bin used for bootstrap_python_packages instead of just 'python'
(#355)
* Pooling works with bootstrap_mrjob=False (#347)
* Pooling makes sure a job flow has space for the new job before joining
it (#324)
* EMR tools:
* create_job_flow no longer tries to use an option that does not exist
(#349)
* report_long_jobs tool alerts on jobs that have run for more than X hours
(#345)
* mrboss no longer spells stderr 'stsderr'
* terminate_idle_job_flows counts jobs with pending (but not running)
steps as idle (#365)
* terminate_idle_job_flows can terminate job flows near the end of a
billable hour (#319)
* audit_usage breaks down job flows by pool (#239)
* Various tools (e.g. audit_usage) get list of job flows correctly (#346)
v0.3.1, 2011-12-20 -- Nooooo there were bugs!
* Instance-type command-line arguments always override mrjob.conf (Issue #311)
* Fixed crash in mrjob.tools.emr.audit_usage (Issue #315)
* Tests now use unittest; python setup.py test now works (Issue #292)
v0.3.0, 2011-12-07 -- Worth the wait
* Configuration:
* Saner mrjob.conf locations (Issue #97):
* ~/.mrjob is deprecated in favor of ~/.mrjob.conf
* searching in PYTHONPATH is deprecated
* MRJOB_CONF environment variable for custom paths
* Defining Jobs (MRJob):
* Combiner support (Issue #74)
* *_init() and *_final() methods for mappers, combiners, and reducers
(Issue #124)
* mapper/combiner/reducer methods no longer need to contain a yield
statement if they emit no data
* Protocols:
* Protocols can be anything with read() and write() methods, and are
instances by default (Issue #229)
* Set protocols with the *_PROTOCOL attributes or by re-defining the
*_protocol() methods
* Built-in protocol classes cache the encoded and decoded value of the
last key for faster decoding during reducing (Issue #230)
* --*protocol switches and aliases are deprecated (Issue #106)
* Set Hadoop formats with HADOOP_*_FORMAT attributes or the hadoop_*_format()
methods (Issue #241)
* --hadoop-*-format switches are deprecated
* Hadoop formats can no longer be set from mrjob.conf
* Set jobconf with JOBCONF attribute or the jobconf() method (in addition
to --jobconf)
* Set Hadoop partitioner class with --partitioner, PARTITIONER, or
partitioner() (Issue #6)
* Custom option parsing (Issue #172)
* Use mrjob.compat.get_jobconf_value() to get jobconf values from environment
* Running jobs:
* All modes:
* All runners are Hadoop-version aware and use the correct jobconf and
combiner invocation styles (Issue #111)
* All types of URIs can be passed through to Hadoop (Issue #53)
* Speed up steps with no mapper by using cat (Issue #5)
* Stream compressed files with cat() method (Issue #17)
* hadoop_bin, python_bin, and ssh_bin can now all take switches (Issue #96)
* job_name_prefix option is gone (was deprecated)
* Better cleanup (Issue #10):
* Separate cleanup_on_failure option
* More granular cleanup options
* Cleaner handling of passthrough options (Issue #32)
* emr mode:
* job flow pooling (Issue #26)
* vastly improved log fetching via SSH (Issue #2)
* New tool: mrjob.tools.emr.fetch_logs
* default Hadoop version on EMR is 0.20 (was 0.18)
* ec2_instance_type option now only sets instance type for slave nodes
when there are multiple EC2 instances (Issue #66)
* New tool: mrjob.tools.emr.mrboss for running commands on all nodes and
saving output locally
* inline mode:
* Supports cmdenv (Issue #136)
* Passthrough options can now affect steps list (Issue #301)
* local mode:
* Runs 2 mappers and 2 reducers in parallel by default (Issue #228)
* Preliminary Hadoop simulation for some jobconf variables (Issue #86)
* Misc:
* boto 2.0+ is now required (Issue #92)
* Removed debian packaging (should be handled separately)
v0.2.8, 2011-09-07 -- Bugfixes and betas
* Fix log parsing crash dealing with timeout errors
* Make mr_travelling_salesman.py work with simplejson
* Add emr_additional_info option, to support EMR beta features
* Remove debian packaging (should be handled separately)
* Fix crash when creating tmp bucket for job in us-east-1
v0.2.7, 2011-07-12 -- Hooray for interns!
* All runner options can be set from the command line (Issue #121)
* Including for mrjob.tools.emr.create_job_flow (Issue #142)
* New EMR options:
* availability_zone (Issue #72)
* bootstrap_actions (Issue #69)
* enable_emr_debugging (Issue #133)
* Read counters from EMR log files (Issue #134)
* Clean old files out of S3 with mrjob.tools.emr.s3_tmpwatch (Issue #9)
* EMR parses and reports job failure due to steps timing out (Issue #15)
* EMR boostrap files are no longer made public on S3 (Issue #70)
* mrjob.tools.emr.terminate_idle_job_flows handles custom hadoop streaming
jars correctly (Issue #116)
* LocalMRJobRunner separates out counters by step (Issue #28)
* bootstrap_python_packages works regardless of tarball name (Issue #49)
* mrjob always creates temp buckets in the correct AWS region (Issue #64)
* Catch abuse of __main__ in jobs (Issue #78)
* Added mr_travelling_salesman example
v0.2.6, 2011-05-24 -- Hadoop 0.20 in EMR, inline runner, and more
* Set Hadoop to run on EMR with --hadoop-version (Issue #71).
* Default is still 0.18, but will change to 0.20 in mrjob v0.3.0.
* New inline runner, for testing locally with a debugger
* New --strict-protocols option, to catch unencodable data (Issue #76)
* Added steps_python_bin option (for use with virtualenv)
* mrjob no longer chokes when asked to run on an EMR job flow running
Hadoop 0.20 (Issue #110)
* mrjob no longer chokes on job flows with no LogUri (Issue #112)
v0.2.5, 2011-04-29 -- Hadoop input and output formats
* Added hadoop_input/output_format options
* You can now specify a custom Hadoop streaming jar (hadoop_streaming_jar)
* extra args to hadoop now come before -mapper/-reducer on EMR, so
that e.g. -libjar will work (worked in hadoop mode since v0.2.2)
* hadoop mode now supports s3n:// URIs (Issue #53)
v0.2.4, 2011-03-09 -- fix bootstrapping mrjob
* Fix bootstrapping of mrjob in hadoop and local mode (Issue #89)
* SSH tunnels try to use the same port for the same job flow (Issue #67)
* Added mr_postfix_bounce and mr_pegasos_svm to examples.
* Retry on spurious 505s from EMR API
v0.2.3, 2011-02-24 -- boto compatibility
* Fix incompatibility with boto 2.0b4 (Issue #91)
v0.2.2, 2011-02-15 -- GET/POST EMR issue
* Use POST requests for most EMR queries (EMR was choking on large GETs)
* find_probable_cause_of_failure() ignores transient errors (Issue #31)
* --hadoop-arg now actually works (Issue #79)
* on Hadoop, extra args are added first, so you can set e.g. -libjar
* S3 buckets may now have . in their names
* MRJob scripts now respect --quiet (Issue #84)
* added --no-output option for MRJob scripts (Issue #81)
* added --python-bin option (Issue #54)
v0.2.1, 2010-11-17 -- laststatechangereason bugfix
* Don't assume EMR sets laststatechangereason
v0.2.0, 2010-11-15 -- Many bugfixes, Windows support
* New Features/Changes:
* EMRJobRunner now prints % of mappers and reducers completed when you
enable the SSH tunnel.
* Added mr_page_rank example
* Added mrjob.tools.emr.audit_usage script (Issue #21)
* You can specify alternate job owners with the "owner" option. Useful for
auditing usage. (Issue #59)
* The job_name_prefix option has been renamed to label (the old name still
works but is deprecated)
* bootstrap_cmds and bootstrap_scripts no longer automatically invoke sudo
* Bugs Fixed/Cleanup:
* bootstrap files no longer get uploaded to S3 twice (Issue #8)
* When using add_file_option(), show_steps() can now see the local version
of the file (Issue #45)
* Now works on Windows (Issue #46)
* No longer requires external jar, tar, or zip binaries (Issue #47)
* mrjob-* scratch bucket is only created as needed (Issue #50)
* Can now specify us-east-1 region explicitly (Issue #58)
* mrjob.tools.emr.terminate_idle_job_flows leaves Hive jobs alone (Issue #60)
v0.1.0, 2010-10-28 -- Same code, better version. It's official!
v0.1.0-pre3, 2010-10-27 -- Pre-release to run Yelp code against
* Added debian packaging
* mrjob bootstrapping can now deal with symlinks in site-packages/mrjob
* MRJobRunner.stream_output() can now be called multiple times
v0.1.0-pre2, 2010-10-25 -- Second pre-release after testing
* Fixed small bugs that broke Python 2.5.1 and Python 2.7
* Fixed reading mrjob.conf without yaml installed
* Fix tests to work with modern simplejson and pipes.quote()
* Auto-create temp bucket on S3 if we don't have one (Issue #16)
* Auto-infer AWS region from bucket (Issue #7)
* --steps now passes in all extra args (e.g. --protocol) (Issue #4)
* Better docs
v0.1.0-pre1, 2010-10-21 -- Initial pre-release. YMMV!