Conversation
…referring to 2.0 packages/jars
I have not yet added a mapping between old keys and new keys between 1.0 and 2.0. For instance, in 1.0 there was a "TOTAL_MAPS" and in 2.0 it is now "totalMaps". Will add it if needed after discussion.
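If such a mapping is added, a minimal sketch might look like the following. The class and method names are hypothetical, and only the TOTAL_MAPS/totalMaps pair comes from the discussion above; this is not the actual hraven code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a 1.0-to-2.0 key mapping; names here are assumptions. */
public class JobKeyMapping {
  private static final Map<String, String> OLD_TO_NEW = new HashMap<String, String>();
  static {
    // The only pair confirmed in the discussion; more pairs would be added after review.
    OLD_TO_NEW.put("TOTAL_MAPS", "totalMaps");
  }

  /** Returns the 2.0 key for a 1.0 key, or the input unchanged if no mapping exists. */
  public static String toHadoop2Key(String oldKey) {
    String newKey = OLD_TO_NEW.get(oldKey);
    return newKey != null ? newKey : oldKey;
  }
}
```

Falling back to the input key keeps the parser usable for keys that have no 2.0 counterpart.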
public class JobHistoryFileParserHadoop2 implements JobHistoryFileParser {

  private JobKey jobKey;
  @SuppressWarnings("unused")
This warning is actually telling us something. JobId is being set but is never read, which means it is really not being used. Should we remove this field? The actual information is captured in jobNumber.
Yes, it was being used earlier in key generation but not anymore. I am removing it.
/**
 * populate the hash set for counter names
 */
counterNames.add(CounterTypes.mapCounters.toString());
Will iterate over the enum's values() instead of adding individual ones.
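The enum-iteration approach could be sketched as follows. The enum constants shown are assumed from the snippet above and may not match the full CounterTypes enum in the project.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch: build the counter-name set by iterating over CounterTypes.values(). */
public class CounterNames {
  // Assumed subset of the CounterTypes enum referenced in the review.
  enum CounterTypes { mapCounters, reduceCounters, totalCounters }

  public static Set<String> build() {
    Set<String> counterNames = new HashSet<String>();
    for (CounterTypes type : CounterTypes.values()) {
      counterNames.add(type.toString());
    }
    return counterNames;
  }
}
```

Iterating over values() means new enum constants are picked up automatically, with no add() call to forget.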
…to be compatible, also formatting changes
/** Column qualifier prefix to namespace total-specific counter data */
public static final String TOTAL_COUNTER_COLUMN_PREFIX = "gt";
public static final byte[] TOTAL_COUNTER_COLUMN_PREFIX_BYTES =
    Bytes.toBytes(TOTAL_COUNTER_COLUMN_PREFIX);
Looking at a sample file, I think "TOTAL_COUNTERS" is the same as what we're treating as just "COUNTERS" in Hadoop 1.0. So I don't think we need this extra prefix. Correct me if I'm wrong.
There are task-level counters, which are called "COUNTERS", and the job-level ones, which are called "TOTAL_COUNTERS", "MAP_COUNTERS" and "REDUCE_COUNTERS".
JSONArray fields = j1.getJSONArray(FIELDS);

for (int k = 0; k < fields.length(); k++) {
  JSONObject allEvents = new JSONObject(fields.get(k).toString());
Could fields.get(k) be null? If so, defensive coding would use fields.isNull(k) first to check.
Sounds good, I will add the null check. It is very unlikely that there are no fields in the job history file, since that would mean an empty job history file was created, which should not happen unless there is a bug in the history-file writing. But I agree with adding the null check in any case.
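The guarded loop could look like this sketch, assuming the org.json API used in the snippet above; the wrapper class and method name are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import org.json.JSONArray;
import org.json.JSONObject;

/** Sketch: defensively skip null slots before parsing each field entry. */
public class FieldParser {
  public static List<JSONObject> parseFields(JSONArray fields) {
    List<JSONObject> events = new ArrayList<JSONObject>();
    for (int k = 0; k < fields.length(); k++) {
      if (fields.isNull(k)) {
        continue; // skip null entries instead of failing on them
      }
      events.add(new JSONObject(fields.get(k).toString()));
    }
    return events;
  }
}
```

isNull(k) also handles the JSONObject.NULL sentinel that org.json stores for a JSON null literal, which a plain == null check would miss.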
@@ -0,0 +1,96 @@
/*
 * Copyright 2012 Twitter, Inc. Licensed under the Apache License, Version 2.0 (the "License"); you
Minor nit: for new files you probably want to use the current year.
… and year in copyright, also adding in some more keys I saw in 2.0.5 history files
…hadoop 2.0 cluster to process 2.0 job history files
// the first 10 bytes contain Avro-Json
return new String(contents, 0, HADOOP2_VERSION_LENGTH);
}
throw new IllegalArgumentException("Unknown format of job history file: " + contents);
This seems a little unexpected if it's possible we're reading a v1 file. Maybe we should return null in this case instead?
Hi Gary, maybe I didn't follow completely. The sub-string for a 1.0 history file is "Meta VERS", since the first line in 1.0 is 'Meta VERSION="1" .'. Hence this won't throw the exception. I have a unit test in TestJobHistoryFileParserFactory.java that confirms we get back a JobHistoryFileParserHadoop1 object when it's a 1.0 file.
Is that what you were referring to?
I see. Still seems a little weird that in the 2.0 case it returns the version string, and in the 1.0 case it returns a truncated string (doesn't seem quite consistent with the method name). I would either name it getVersion2StringFromFile or change it to return a boolean value and do the string comparison in this method -- hasVersion2String(). But I guess it works as is as well.
Yes, I have changed it now. The function now checks for both versions' strings and returns accordingly.
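The combined check described in this exchange could be sketched as follows. The class name, constant names, and return convention are assumptions; the prefixes come from the discussion (a 1.0 file starts with 'Meta VERSION="1"' and a 2.0 file starts with the Avro-Json marker).

```java
/** Sketch: detect the job-history-file version from its leading bytes. */
public class HistoryFileVersion {
  private static final String HADOOP1_PREFIX = "Meta VERS"; // start of 'Meta VERSION="1" .'
  private static final String HADOOP2_PREFIX = "Avro-Json"; // assumed 2.0 marker

  /** Returns 1 or 2, or throws if the file matches neither known format. */
  public static int getVersion(byte[] contents) {
    int len = Math.min(contents.length, HADOOP1_PREFIX.length());
    String head = new String(contents, 0, len);
    if (head.startsWith(HADOOP1_PREFIX)) {
      return 1;
    }
    if (head.startsWith(HADOOP2_PREFIX)) {
      return 2;
    }
    throw new IllegalArgumentException("Unknown format of job history file");
  }
}
```

Returning a version number (rather than the matched string) sidesteps the naming inconsistency raised in the review: the caller does no string comparison at all.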
Two minor comments to resolve: mixed static and instance context, and a wording fix in an exception message. Otherwise this looks good to merge to me.
Thanks Gary! I will fix these and update the request today.
/**
 * utility function for printing all Task puts
 */
public void printAllTaskPuts() {
No need to duplicate this functionality between printAllJobPuts() and printAllTaskPuts(). You should just have a single method, printPuts(List puts), and pass either jobPuts or taskPuts as an argument.
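The suggested consolidation could be sketched like this. Using Object elements instead of the HBase Put type, and splitting out a render helper, are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: one shared method replaces printAllJobPuts() and printAllTaskPuts(). */
public class PutPrinter {
  /** Renders each put on its own line; separated from printing so it is testable. */
  static List<String> renderPuts(List<?> puts) {
    List<String> lines = new ArrayList<String>();
    for (Object put : puts) {
      lines.add(String.valueOf(put));
    }
    return lines;
  }

  /** Utility function for printing all puts; call with either jobPuts or taskPuts. */
  public static void printPuts(List<?> puts) {
    for (String line : renderPuts(puts)) {
      System.out.println(line);
    }
  }
}
```

Callers would invoke printPuts(jobPuts) or printPuts(taskPuts), removing the duplicated loop bodies.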
…other review changes
Integrate Hadoop2.0 job history support.
Enabling parsing of 2.0 job history files without referring to 2.0 packages/jars