
Track record schema validation errors in Datadog #13114

Closed

Conversation

@alovew (Contributor) commented May 24, 2022

Track record schema validation errors in Segment

@github-actions github-actions bot added the area/platform (issues related to the platform) and area/worker (related to worker) labels May 24, 2022
@alovew alovew requested a review from lmossman May 24, 2022 01:11
@alovew alovew temporarily deployed to more-secrets May 24, 2022 01:12 Inactive
@alovew alovew temporarily deployed to more-secrets May 24, 2022 17:05 Inactive
-    final RecordSchemaValidator recordSchemaValidator) {
+    final RecordSchemaValidator recordSchemaValidator,
+    final UUID workspaceId,
+    final String dockerImage) {
Contributor:
nit: rename this to sourceDockerImage

@jdpgrailsdev (Contributor):
cc: @davinchia This is what was discussed in the dev meeting yesterday and, as you commented, would be a good fit for recording to Datadog/OTEL instead.

@alovew alovew temporarily deployed to more-secrets May 24, 2022 22:07 Inactive

this.cancelled = new AtomicBoolean(false);
this.hasFailed = new AtomicBoolean(false);
}

public DefaultReplicationWorker(final String jobId,
Contributor:
@lmossman do you know if there have been discussions about moving everything to the container orchestrator?

Contributor:
Yes, we have made the container orchestrator the default for kube deployments in OSS, but I found that the container orchestrator logic as written today does not work with docker-compose deployments, and some work will be required to fix that. I created a spike ticket to investigate that further: #13142

Though, this DefaultReplicationWorker class is used by the container orchestrator as well.

@davinchia (Contributor) commented May 25, 2022:
@lmossman nice! Is this from 0.39.0-alpha onwards?

Does that mean the docker deployment is not running the container orchestrator for now?

Contributor:
That's correct - part of the changes in the Migrate OSS to temporal scheduler PR was setting the CONTAINER_ORCHESTRATOR_ENABLED env var to true for all kube deploys. It is still not set in the docker-compose env file, so it will default to false for those deployments.
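To illustrate that default: a minimal sketch (not the actual Airbyte configuration code) of how an env var that is absent from the docker-compose env file ends up evaluating to false.

// Minimal sketch, assuming a plain env-var lookup: when CONTAINER_ORCHESTRATOR_ENABLED
// is not set (as in the docker-compose env file), the fallback "false" is parsed,
// so the orchestrator path stays disabled for those deployments.
final boolean containerOrchestratorEnabled =
    Boolean.parseBoolean(System.getenv().getOrDefault("CONTAINER_ORCHESTRATOR_ENABLED", "false"));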

@davinchia (Contributor) left a comment:
I took a look at the PR since Jonathan tagged me. I noticed one potential issue, around where the workspace id should be injected to keep things clean, that I want to resolve before we merge this in. Details in the comment. TL;DR: we don't want direct db access from a pod involved in the job execution, and probably want to shift workspaceId injection further up the creation chain.

Taking a step back, between the Segment event and the Datadog metric, I feel the DD metric provides more immediate value since it helps us act on this in Cloud. Though the Segment alert is definitely useful, it's also less actionable (we cannot look at OSS users' data to debug) and less urgent (OSS users are likely to open issues when they spot errors).

The DD metric's LOE is also much lower/simpler, as it's a count emission with a connector image tag and doesn't require new information injected throughout the system. If the team is strapped for time, we'd get more value implementing only the DD metric and leaving the Segment event for later. If the team has time and is open to learning, of course implementing both is great!

Happy to discuss @lmossman @alovew
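For reference, the kind of emission being described is just a tagged counter. Below is a minimal sketch using the standalone java-dogstatsd-client; it is not this PR's code or Airbyte's own metrics client, and the metric name, prefix, tags, and agent address are all illustrative assumptions.

import com.timgroup.statsd.NonBlockingStatsDClientBuilder;
import com.timgroup.statsd.StatsDClient;

public class SchemaValidationMetricsSketch {

  // Points at the local Datadog agent; host, port, and prefix are assumptions.
  private static final StatsDClient STATSD = new NonBlockingStatsDClientBuilder()
      .prefix("airbyte.worker")
      .hostname("localhost")
      .port(8125)
      .build();

  // Emit one count per schema validation error, tagged with the connector image,
  // which is all the count-with-image-tag metric described above needs.
  public static void recordSchemaValidationError(final String dockerImage) {
    STATSD.count("record_schema_validation_error", 1, "docker_image:" + dockerImage);
  }
}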

@@ -89,6 +107,25 @@ public Optional<String> runJob() throws Exception {
sourceLauncherConfig.getDockerImage().equals(WorkerConstants.RESET_JOB_SOURCE_DOCKER_IMAGE_STUB) ? new EmptyAirbyteSource()
: new DefaultAirbyteSource(workerConfigs, sourceLauncher);

final FeatureFlags featureFlags = new EnvVariableFeatureFlags();
final String driverClassName = "org.postgresql.Driver";
validationErrors.forEach((stream, errorPair) -> {
if (workspaceId != null) {
Contributor:
As far as I can tell, workspaceId never changes, so this check should be able to be combined with the validationErrors.isEmpty() check on line 367 to simplify things.
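A rough sketch of the simplification being suggested (the tracking call is a hypothetical placeholder; the emptiness check is the existing validationErrors.isEmpty() guard referenced above):

// Since workspaceId is effectively constant for the sync, fold its null-check into
// the existing emptiness guard instead of re-checking it inside the loop.
if (workspaceId != null && !validationErrors.isEmpty()) {
  validationErrors.forEach((stream, errorPair) ->
      trackSchemaValidationError(workspaceId, stream, errorPair)); // hypothetical helper
}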

@@ -89,6 +107,25 @@ public Optional<String> runJob() throws Exception {
sourceLauncherConfig.getDockerImage().equals(WorkerConstants.RESET_JOB_SOURCE_DOCKER_IMAGE_STUB) ? new EmptyAirbyteSource()
: new DefaultAirbyteSource(workerConfigs, sourceLauncher);

final FeatureFlags featureFlags = new EnvVariableFeatureFlags();
Contributor:
The job orchestrator runs in the jobs namespace as part of a job in Cloud. In theory, the orchestrator is fire-and-forget, so I don't think we want to allow direct database access from the orchestrator. Doing so also presents some security risk: today we sandbox the jobs namespace off from the ab namespace, so we would have to make some allowances for the orchestrator pod - not the worst, but not very clean.

I think the right way to do this is to inject the workspace id via the ReplicationActivityImpl (which runs in the ab namespace). This follows how configs are currently propagated to the jobs - as static files - so we can keep the interface consistent.
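A hypothetical sketch of that shape (the class and file name are illustrative, not the actual Airbyte code): the activity side, which runs in the ab namespace and can reach the config database, writes the workspace id into the job's config directory as a static file, and the orchestrator reads it back without ever touching the database.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical sketch: propagate the workspace id the same way other job configs
// reach the orchestrator pod - as a static file - so the orchestrator stays
// sandboxed off from the config database.
public final class WorkspaceIdConfigSketch {

  private static final String WORKSPACE_ID_FILE = "WORKSPACE_ID"; // illustrative file name

  // Called from the activity side (ab namespace), where db access already exists.
  public static void write(final Path jobConfigDir, final UUID workspaceId) throws IOException {
    Files.writeString(jobConfigDir.resolve(WORKSPACE_ID_FILE), workspaceId.toString());
  }

  // Called from the orchestrator side (jobs namespace); no db connection needed.
  public static UUID read(final Path jobConfigDir) throws IOException {
    return UUID.fromString(Files.readString(jobConfigDir.resolve(WORKSPACE_ID_FILE)).trim());
  }

  private WorkspaceIdConfigSketch() {}
}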

Contributor:
Thanks for explaining this, Davin. This is something Anne and I also discussed over Zoom yesterday; I agree that we do not want direct db access in the orchestrator pod, so querying for the workspace ID should be moved higher up the chain.

@alovew alovew temporarily deployed to more-secrets May 25, 2022 20:38 Inactive
@alovew alovew temporarily deployed to more-secrets May 26, 2022 00:34 Inactive
@alovew alovew changed the title from "Track record schema validation errors in Segment" to "Track record schema validation errors in Datadog" May 26, 2022
@alovew alovew temporarily deployed to more-secrets May 26, 2022 19:29 Inactive
@alovew alovew temporarily deployed to more-secrets May 31, 2022 16:58 Inactive
@alovew alovew force-pushed the anne/add-segment-tracking-for-validation-errors branch from 908197f to f0935f9 on May 31, 2022 17:56
@alovew alovew temporarily deployed to more-secrets May 31, 2022 17:58 Inactive
@alovew alovew temporarily deployed to more-secrets May 31, 2022 18:36 Inactive
@alovew alovew closed this Jun 1, 2022