This repository has been archived by the owner on Jan 15, 2022. It is now read-only.

Hadoop2.0 #18

Merged
merged 14 commits into from
Aug 27, 2013

Conversation

vrushalic
Collaborator

Enabling parsing of 2.0 job history files without referring to 2.0 packages/jars

@vrushalic
Collaborator Author

I have not yet added a mapping between the old 1.0 keys and the new 2.0 keys. For instance, 1.0 had "TOTAL_MAPS", which in 2.0 is now "totalMaps". Will add it if needed after discussion.
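Such a mapping could be sketched as a simple lookup table. This is a hypothetical sketch: only the "TOTAL_MAPS" → "totalMaps" pair comes from the comment above; the class name, method name, and second entry are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: translate Hadoop 1.0 history keys to their 2.0 names. */
public class JobHistoryKeyMapping {
  private static final Map<String, String> OLD_TO_NEW = new HashMap<String, String>();
  static {
    // "TOTAL_MAPS" -> "totalMaps" is the example from the discussion;
    // other entries (like this assumed one) would be filled in the same way.
    OLD_TO_NEW.put("TOTAL_MAPS", "totalMaps");
    OLD_TO_NEW.put("TOTAL_REDUCES", "totalReduces"); // assumed entry
  }

  /** Returns the 2.0 key for a 1.0 key, or the key unchanged if no mapping exists. */
  public static String toHadoop2Key(String oldKey) {
    String newKey = OLD_TO_NEW.get(oldKey);
    return (newKey != null) ? newKey : oldKey;
  }
}
```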

public class JobHistoryFileParserHadoop2 implements JobHistoryFileParser {

private JobKey jobKey;
@SuppressWarnings("unused")
Collaborator

This warning is actually telling us something. JobId is being set but is never read, which means it is really not being used. Should we remove this field? The actual information is captured in jobNumber.

Collaborator Author

Yes, it was used earlier in key generation but not anymore. I am removing it.

@ghost ghost assigned vrushalic Aug 6, 2013
/**
* populate the hash set for counter names
*/
counterNames.add(CounterTypes.mapCounters.toString());
Collaborator Author

Will iterate over the enum.values instead of adding individual ones
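The values() loop replacing the individual add() calls could look like this sketch. The nested enum here is a stand-in for the CounterTypes enum in the diff; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: populate the counter-name set from the enum itself. */
public class CounterNameSet {
  // Stand-in for the CounterTypes enum referenced in the diff above.
  enum CounterTypes { mapCounters, reduceCounters, totalCounters }

  public static Set<String> buildCounterNames() {
    Set<String> counterNames = new HashSet<String>();
    // One loop replaces a separate add() call for each constant,
    // so new enum values are picked up automatically.
    for (CounterTypes c : CounterTypes.values()) {
      counterNames.add(c.toString());
    }
    return counterNames;
  }
}
```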

/** Column qualifier prefix to namespace total-specific counter data */
public static final String TOTAL_COUNTER_COLUMN_PREFIX = "gt";
public static final byte[] TOTAL_COUNTER_COLUMN_PREFIX_BYTES = Bytes
.toBytes(TOTAL_COUNTER_COLUMN_PREFIX);
Contributor

Looking at a sample file, I think "TOTAL_COUNTERS" is the same as what we're treating as just "COUNTERS" in Hadoop 1.0. So I don't think we need this extra prefix. Correct me if I'm wrong.

Collaborator Author

There are task-level counters, which are called "COUNTERS", and job-level ones, which are called "TOTAL_COUNTERS", "MAP_COUNTERS", and "REDUCE_COUNTERS".
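The prefix scheme these counter groups map to could be sketched as a small lookup. Only the "TOTAL_COUNTERS" → "gt" pair is from the diff above; the "gm" and "gr" prefixes and all names in this sketch are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: map job-level counter group names to
 * column-qualifier prefixes. Only "TOTAL_COUNTERS" -> "gt" is
 * from the diff; the other two prefixes are assumed.
 */
public class CounterColumnPrefixes {
  private static final Map<String, String> GROUP_TO_PREFIX = new HashMap<String, String>();
  static {
    GROUP_TO_PREFIX.put("TOTAL_COUNTERS", "gt");  // from the diff above
    GROUP_TO_PREFIX.put("MAP_COUNTERS", "gm");    // assumed
    GROUP_TO_PREFIX.put("REDUCE_COUNTERS", "gr"); // assumed
  }

  /** Returns the column prefix for a job-level counter group, or null if unknown. */
  public static String prefixFor(String counterGroup) {
    return GROUP_TO_PREFIX.get(counterGroup);
  }
}
```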

JSONArray fields = j1.getJSONArray(FIELDS);

for (int k = 0; k < fields.length(); k++) {
JSONObject allEvents = new JSONObject(fields.get(k).toString());
Contributor

Could fields.get(k) be null? If so, defensive coding would check fields.isNull(k) first.

Collaborator Author

Sounds good, I will add the null check. It is probably very unlikely that there are no fields in the job history file, since that would mean an empty job history file was created, which I don't think will happen unless there is a bug in the history file writing. But I agree on adding the null check in any case.
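The null-guarded loop could look like the following sketch. A plain Object array stands in for the org.json JSONArray so the sketch is self-contained; in the real code, fields.isNull(k) would play the role of the null check, and the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the null-guarded loop over history-file fields. */
public class FieldScan {
  /** Collects the string form of each non-null field, skipping nulls defensively. */
  public static List<String> parseFields(Object[] fields) {
    List<String> parsed = new ArrayList<String>();
    for (int k = 0; k < fields.length; k++) {
      if (fields[k] == null) {
        continue; // with org.json this would be: if (fields.isNull(k)) continue;
      }
      parsed.add(fields[k].toString());
    }
    return parsed;
  }
}
```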

@@ -0,0 +1,96 @@
/*
* Copyright 2012 Twitter, Inc. Licensed under the Apache License, Version 2.0 (the "License"); you
Contributor

Minor nit: for new files you probably want to use the current year.

Vrushali Channapattan added 2 commits August 14, 2013 15:11
… and year in copy right, also adding in some more keys I saw in 2.0.5 history files
…hadoop 2.0 cluster to process 2.0 job history files
// the first 10 bytes contain Avro-Json
return new String(contents, 0, HADOOP2_VERSION_LENGTH);
}
throw new IllegalArgumentException(" Unknown format of job history file: " + contents);
Contributor

This seems a little unexpected if it's possible we're reading a v1 file. Maybe we should return null in this case instead?

Collaborator Author

Hi Gary, (maybe I didn't follow completely) the sub-string for a 1.0 history file is "Meta VERS", since the first line in 1.0 is 'Meta VERSION="1" .;'. Hence this won't throw the exception. I have a unit test in TestJobHistoryFileParserFactory.java that confirms we get back a JobHistoryFileParserHadoop1 object when it's a 1.0 file.
Is that what you were referring to?

Contributor

I see. It still seems a little weird that the 2.0 case returns the version string while the 1.0 case returns a truncated string (not quite consistent with the method name). I would either rename it getVersion2StringFromFile, or change it to return a boolean and do the string comparison in this method -- hasVersion2String(). But I guess it works as is as well.

Collaborator Author

Yes, so I have changed it now. The function now checks for both versions' strings and returns accordingly.
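The two-version check could be sketched as follows. The "Meta VERS" prefix and the Avro-Json header come from the discussion above; the class name, method name, return values, and exact constants are assumptions.

```java
import java.nio.charset.Charset;

/** Hypothetical sketch: detect the history-file version from its leading bytes. */
public class HistoryFileVersionDetector {
  // 1.0 files start with 'Meta VERSION="1" .;' (per the discussion above).
  private static final String HADOOP1_PREFIX = "Meta VERS";
  // Per the code comment, the first bytes of a 2.0 file contain Avro-Json.
  private static final String HADOOP2_PREFIX = "Avro-Json";

  /** Returns "1.0" or "2.0", or throws for an unrecognized format. */
  public static String getVersion(byte[] contents) {
    int len = Math.min(contents.length, 10);
    String head = new String(contents, 0, len, Charset.forName("UTF-8"));
    if (head.startsWith(HADOOP1_PREFIX)) {
      return "1.0";
    }
    if (head.startsWith(HADOOP2_PREFIX)) {
      return "2.0";
    }
    throw new IllegalArgumentException("Unknown format of job history file");
  }
}
```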

@ghelmling
Contributor

Two minor comments to resolve: mixed static and instance context, and a wording fix in an exception message. Otherwise this looks good to merge to me.

@vrushalic
Collaborator Author

Thanks Gary! I will fix these and update the request today.

/**
* utility function for printing all Task puts
*/
public void printAllTaskPuts() {
Contributor

No need to duplicate this functionality between printAllJobPuts() and printAllTaskPuts(). You should just have a single method, printPuts(List puts), then pass either jobPuts or taskPuts as an argument.
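The consolidation could be sketched like this. List<?> is used so the sketch stays self-contained (a real version would take List<Put> from the HBase API); the class name and the returned count are assumptions for illustration.

```java
import java.util.List;

/** Hypothetical sketch: one shared print method for both job and task puts. */
public class PutPrinter {
  /**
   * Prints each put on its own line and returns how many were printed;
   * callers pass either jobPuts or taskPuts.
   */
  public static int printPuts(List<?> puts) {
    int count = 0;
    for (Object put : puts) {
      System.out.println(put);
      count++;
    }
    return count;
  }
}
```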

ghelmling added a commit that referenced this pull request Aug 27, 2013
Integrate Hadoop2.0 job history support.
@ghelmling ghelmling merged commit 448f9ee into twitter:master Aug 27, 2013