Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(graph service): only query for entities that should have lineage [Breaking Change] #5539

Conversation

gabe-lyons
Copy link
Contributor

Our graph service historically queried for all entity types. However, the set of relationship types it queries is the set of relationship types that may exist in some lineage. This means entities that are not included in lineage were getting pulled in. Rather than querying for specific entity->relationship type->entity pairs, I opted for a simper approach of changing the graph service interface to accept a list of entity types.

This approach sets us up to be able to select the set of entity types we are interested in querying as a config or from the UI.

This is a breaking change for anyone who has built their own implementation of the GraphService interface

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@@ -147,7 +147,7 @@ export const OperationsTab = () => {
inputs: run?.inputs?.relationships.map((relationship) => relationship.entity),
outputs: run?.outputs?.relationships.map((relationship) => relationship.entity),
externalUrl: run?.externalUrl,
parentTemplate: run?.parentTemplate?.relationships?.[0].entity,
parentTemplate: run?.parentTemplate?.relationships?.[0]?.entity,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small bug fix i noticed while testing

@@ -19,6 +19,8 @@ export const CompactEntityNameList = ({ entities, onClick, linkUrlParams, showTo
return (
<>
{entities.map((entity, index) => {
if (!entity) return <></>;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, small bug fix here ^

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is entity not a required field? Didn't realize... Nice!

@github-actions
Copy link

github-actions bot commented Aug 1, 2022

Unit Test Results (build & test)

499 tests  ±0   499 ✔️ ±0   8m 23s ⏱️ -32s
115 suites ±0       0 💤 ±0 
115 files   ±0       0 ±0 

Results for commit 63fb9e2. ± Comparison against base commit 626196d.

♻️ This comment has been updated with latest results.

@maggiehays maggiehays added the product PR or Issue related to the DataHub UI/UX label Aug 2, 2022
@@ -86,6 +89,14 @@ public LineageSpec getLineageSpec(String entityName) {
return _lineageSpecMap.get(entityName.toLowerCase());
}

public Set<String> getEntitiesWithLineage() {
Map<String, EntitySpec> specs = _entityRegistry.getEntitySpecs();
return _lineageSpecMap.keySet().stream().filter(key ->
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: You maybe could use _lineageSpecMap.entrySet() to avoid the .get(key) calls


@Data
@AllArgsConstructor
public class GraphFilters {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For other services (e.g. search service) we use PDLs to model the filter types...Any reason why we shouldn't keep that consistent here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These aren't exposed via API yet so ill keep them just in java for now so its easier to change. Once we've settled on them let's move them to pdl.

@@ -79,11 +80,12 @@ public interface GraphService {
* - RelatedEntity("DownstreamOf", "dataset three")
*/
@Nonnull
RelatedEntitiesResult findRelatedEntities(@Nullable final String sourceType, @Nonnull final Filter sourceEntityFilter,
@Nullable final String destinationType, @Nonnull final Filter destinationEntityFilter,
RelatedEntitiesResult findRelatedEntities(@Nullable final List<String> sourceTypes, @Nonnull final Filter sourceEntityFilter,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor thing : We don't really need to remove this API, we could just have the old API call the new API with the different arguments.

RelatedEntitiesResult findRelatedEntities(final String sourceType, Filter filter,...) {
   return findRelatedEntities(ImmutableList.of(sourceType, ...);
}

RelatedEntitiesResult findRelatedEntities(final List<String> sourceType, Filter filter,...) {

}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually creates ambiguity if the user provides null for both fields (which is something we support), so I'll keep it as is for now.

* Unless overridden, it uses the lineage registry to fetch valid edge types and queries for them
*/
@Nonnull
default EntityLineageResult getLineage(@Nonnull Urn entityUrn, @Nonnull LineageDirection direction,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way we can only filter for entity types that the current URN is related to via a lineage edge?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we are just filtering for anything that has lineage... That will solve our immedaite problem with data process instance but in the long tem could this pop up again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed- only relevant entity types will be pulled

@@ -291,16 +291,20 @@ protected static String getQueryForRelatedEntities(@Nullable String sourceType,
List<String> destinationFilterNames = new ArrayList<>();
List<String> relationshipTypeFilterNames = new ArrayList<>();

if (sourceType != null) {
if (sourceTypes != null && sourceTypes.size() > 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!!

@@ -294,9 +295,22 @@ public QueryBuilder getQueryForLineage(List<Urn> urns, List<EdgeInfo> lineageEdg
incomingEdgeQuery.must(buildEdgeFilters(incomingEdges));
query.should(incomingEdgeQuery);
}

if (graphFilters != null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: this function is getting a bit long.. consider breaking into sub methods

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -50,8 +51,9 @@ public BulkByScrollResponse deleteByQuery(@Nullable final String sourceType, @No
@Nullable final String destinationType, @Nonnull final Filter destinationEntityFilter,
@Nonnull final List<String> relationshipTypes, @Nonnull final RelationshipFilter relationshipFilter) {
BoolQueryBuilder finalQuery =
buildQuery(sourceType, sourceEntityFilter, destinationType, destinationEntityFilter, relationshipTypes,
relationshipFilter);
buildQuery(sourceType == null ? ImmutableList.of() : ImmutableList.of(sourceType), sourceEntityFilter,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can also use Collections.emptyList for an empty list : )

also, this method can take null right?

if it shouldn't take null values you can mark the param as @nonnull

@@ -108,9 +109,9 @@ public void addEdge(@Nonnull final Edge edge) {

@Nonnull
public RelatedEntitiesResult findRelatedEntities(
@Nullable final String sourceType,
@Nullable final List<String> sourceTypes,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we keep nullable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see below- yes

@@ -94,9 +95,9 @@ public void addEdge(@Nonnull final Edge edge) {

@Nonnull
public RelatedEntitiesResult findRelatedEntities(
@Nullable final String sourceType,
@Nullable final List<String> sourceTypes,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we keep nullable? if we can avoid less program states we should go for it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, null is going to represent all as it has historically

// Build Statement strings
String whereClause = "";

Boolean hasSourceTypes = sourceTypes != null && !sourceTypes.isEmpty();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any chance we can break this longer block into a sep method?

maybe

computeEntityTypeWhereClause

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -359,6 +389,12 @@ private static String criterionToString(@Nonnull CriterionArray criterionArray)
return joiner.length() <= 2 ? "" : joiner.toString();
}

/**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

doTestFindRelatedEntitiesEntityType(anyType, null, downstreamOf, outgoingRelationships, service, downstreamOfDatasetOneRelatedEntity);

service.addEdge(new Edge(datasetOneUrn, nullUrn, downstreamOf));
syncAfterWrite();
doTestFindRelatedEntitiesEntityType(anyType, "null", downstreamOf, outgoingRelationships, service, nullRelatedEntity);
doTestFindRelatedEntitiesEntityType(anyType, ImmutableList.of("null"), downstreamOf, outgoingRelationships, service, nullRelatedEntity);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right? Immutable list of null?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Same as above)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep! in the test there is actually an entity type of "null"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

counterintuitive but okay :)

@@ -67,7 +68,7 @@ private RelatedEntitiesResult getRelatedEntities(String rawUrn, List<String> rel
start = start == null ? 0 : start;
count = count == null ? MAX_DOWNSTREAM_CNT : count;

return _graphService.findRelatedEntities("", newFilter("urn", rawUrn), "", QueryUtils.EMPTY_FILTER,
return _graphService.findRelatedEntities(ImmutableList.of(), newFilter("urn", rawUrn), ImmutableList.of(), QueryUtils.EMPTY_FILTER,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like its going to become confusing to provide either null or list... one other thing we couild do to simplify is just add another method that doesn't take either source types or dest types!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed- made null all and empty list none

Copy link
Collaborator

@jjoyce0510 jjoyce0510 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good!

Few questions or comments about simplifying things. Otherwise looks good

@gabe-lyons gabe-lyons merged commit 2e332d8 into datahub-project:master Aug 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
product PR or Issue related to the DataHub UI/UX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants