Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added the functionality to run TD distributed on Kubernetes. #12

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

meirosucristina
Copy link

Extend ToughDay to be able to run distributed on Kubernetes.

Related Issue

Fix for #10.

Motivation and Context

Please refer to #10 for more context.

How Has This Been Tested?

JUnit & manually.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • I have signed the Adobe Open Source CLA.
  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Copy link
Collaborator

@andreituicu andreituicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution!
Looks good overall! Clean code, automated tests, good inline documentation. We will need to improve the user docs too, but that can happen after merging the PR.
I have a few comments inline, most important ones are related to the threading and exception handling. That is why I'm currently requesting changes instead of Approving.

}


private static Representer getRepresenter() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this class?

Copy link
Author

@meirosucristina meirosucristina May 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, SnakeYaml includes all empty collections when dumping an object (for instance, an empty Map is represented as {}). This class configures the Representer to skip all empty collections in order to keep the generated Yaml representation of the object as clean as possible.

return null;
}
} catch (Throwable t) {
t.printStackTrace();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do something else here beside printing the stack trace?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error should not impact the overall functionality. It would only mean that the empty collections are also included in the Yaml representation on an object so I believe we can simply ignore this exception.

* It is also assumed that the only action to be taken into consideration when dumping the configuration is Actions.ADD
* since all the other actions were already processed before sending the execution query to the driver.
*/
public class YamlDumpConfiguration {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name of the class is a bit too generic for what it does. If I see this name, I'm inclined to use it anywhere I need to dump a yaml configuration, but it looks like it is only supposed to be used when sending messages between driver and agent. Anything that we could do here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have moved this class into the distributedTd package and I have renamed it to reflect the case in which it should be used.

}
}
} catch (IllegalAccessException | InvocationTargetException e) {
e.printStackTrace();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here regarding stack traces and handling exceptions.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These errors should only happen when there are no @ConfigArgSet and @ConfigArgGet methods properly defined for certain configurable properties of the object instance given as a parameter. I think we could handle this exceptions by logging a warning message specifying that these properties will be ignored when dumping the configuration.

public void merge(DistributedConfig other) {
this.setHeartbeatInterval(other.getHeartbeatInterval());
this.setRedistributionWaitTime(other.redistributionWaitTime);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are only these properties overwritten at merge time? What is the difference?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These properties can be set in the configuration specified in the Driver docker image, but these values can be modified when sending the TD configuration to be executed distributed. The rest of the properties should not be modified once the Driver/Agents are deployed in the cluster.

this.inactiveAgents.add(agentIdentifier);
}

private void waitUntilAllAgentsAreReadyForRebalancing(Queue<String> agents) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would an agent not be ready for a redistribution?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assumption is that when an Agent receives a redistribution request, the Configuration object is already created so that we will be able to modify the properties of certain objects (RunMode, TestSuite). The following will explain this concern better:

  1. A new agent(10.0.0.1) joins the cluster.
  2. The work redistribution process begins => agent 10.0.0.1 receives a Configuration in YAML format
  3. A new agent(10.0.0.2) joins the cluster
  4. The work redistribution process begins => agent 10.0.0.1 receives instructions to modify certain properties of the run mode.

If agent (10.0.0.1) failed to create the Configuration before step 4, it will not be able to process the redistribution request.

} catch (JsonProcessingException e) {
LOG.error("Unexpected exception while sending redistribution instructions", e);
LOG.warn("Agent " + ipAddress + " will continue to run with the configuration it had before " +
"the process of redistribution was triggered.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the impact here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The properties of the current Phase of the execution will no longer be respected(for instance, the desired load might not be achieved). The next phase will execute properly, so I believe it is not necessary to kill the entire distributed execution because of this.

This exception should never be thrown as long as the RedistributionInstructions class has a format that allows the ObjectMapper to dump it in JSON format.

public void setLoad(String load) {
checkNotNegative(Long.parseLong(load), "load");
this.load = Integer.parseInt(load);
}

@ConfigArgGet
@ConfigArgGet(redistribute = true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot why we need redistribute if the work is done by the run mode rebalancer which knows the intrinsics of the class.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The run mode balancer knows how to interpret the redistribution instructions received from the Driver and how to modify the properties of the run mode instance so that the desired configuration will be achieved (it adapts the number of threads or the current load, etc).

We need the redistribute field to automatically detect which properties should be included in the redistribution instruction message when creating it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants