
SPIKE: Can we ship / expose DataHub logs to Cloud Platform? #229

Closed · 7 of 8 tasks
jemnery opened this issue Apr 2, 2024 · 11 comments

jemnery commented Apr 2, 2024

Background

There is existing logging from DataHub into Kibana via standard out; however, it isn't clear exactly what is logged.

Key Questions to answer

  • What scenarios do we want to log?
    • Metadata changes, and which user made them
    • User account changes
    • Changes to ingestion pipelines
  • Is it possible to expose the required logs to Cloud Platform?
    • Yes, the stdout logs from DataHub's pods are piped to Cloud Platform's monitoring Elasticsearch cluster and are searchable via Kibana
  • What changes need to be made, if any?
    • None
  • What limitations exist?
    • Users are not directly tied to metadata changes
    • Logs are stored for 30 days
    • Larger log statements are cut off in Kibana
    • The Entity Events API that DataHub built is only available in managed DataHub

Hypothesis

Can we ship / expose DataHub logs to Cloud Platform / Kibana to consolidate our security logs? This would be good for future SOC requirements.
Outcome: assess whether we can have an audit trail of user actions in one place.

Conclusion

From looking at the logs, it appears it would be easier to track user behaviour via the frontend logs than via the backend DataHub logs, which are hard to parse.

The Entity Events API has the kind of event logs we'd want, but it's only available in managed DataHub.

It is still possible to get an idea of user behaviour from the DataHub logs when usage is low enough that timestamps can act as a link between metadata changes and frontend interactions.

Scenarios

  • User account creation
  • User role changes
  • Changes to metadata
  • Adding and deleting entries
  • Editing tags
  • Editing the glossary
  • Editing domains
  • Modifying ingestions
YvanMOJdigital commented:

Threat Model workshop: https://miro.com/app/board/uXjVKRSefrc=/

MatMoore commented May 14, 2024

Some example logs we can search for

Role changed

INFO  c.l.m.k.MetadataChangeLogProcessor:102 - Invoking MCL hooks for urn: ***, aspect name: roleMembership, entity type: corpuser, change type: UPSERT

2024-05-14 15:16:00,536 [Thread-4116] INFO  c.l.m.entity.EntityServiceImpl:1514 - Producing MCL for ingested aspect roleMembership, urn ***

These do not log who performed the action.

Metadata updates

2024-05-14 15:26:40,671 [ThreadPoolTaskExecutor-1] INFO  c.d.event.PlatformEventProcessor:53 - Got PE event key: entityChangeEvent-urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:athena,_investigations_dev.investigations,PROD),[version=2.0].[type=long].id), topic: PlatformEvent_v1, partition: 0, offset: 1201372, value size: 395, timestamp: 1715700400565

2024-05-14 15:26:40,514 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:102 - Invoking MCL hooks for urn: urn:li:dataset:(urn:li:dataPlatform:athena,_investigations_dev.investigations,PROD), aspect name: editableSchemaMetadata, entity type: dataset, change type: UPSERT

2024-05-14 15:26:40,513 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:78 - Got MCL event key: urn:li:dataset:(urn:li:dataPlatform:athena,_investigations_dev.investigations,PROD), topic: MetadataChangeLog_Versioned_v1, partition: 0, offset: 469751, value size: 327, timestamp: 1715700400511

2024-05-14 15:26:40,508 [Thread-4996] INFO  c.l.m.entity.EntityServiceImpl:1514 - Producing MCL for ingested aspect editableSchemaMetadata, urn urn:li:dataset:(urn:li:dataPlatform:athena,_investigations_dev.investigations,PROD)

2024-05-14 15:26:40,500 [Thread-4996] INFO  c.l.m.entity.EntityServiceImpl:634 - Ingesting aspects batch to database, items: [ChangeItemImpl{changeType=UPSERT, urn=urn:li:dataset:(urn:li:dataPlatform:athena,_investigations_dev.investigations,PROD), aspectName='editableSchemaMetadata', recordTemplate={editableSchemaFieldInfo=[{fieldPath=[version=2.0].[type=long].id, description=This is an edited description!}]}, systemMetadata={properties={appSource=ui}}}]

2024-05-14 15:26:40,480 [Thread-4995] INFO  c.l.d.g.r.m.UpdateDescriptionResolver:36 - Updating description. input: { description: "This is an edited description!", resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:athena,_investigations_dev.investigations,PROD)", subResourceType: DATASET_FIELD, subResource: "[version=2.0].[type=long].id" }

From this we can see the input, URN, and aspect. An aspectName of editable* indicates something was updated through the UI.
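
A rough sketch of pulling the change type, URN, and aspect out of these lines downstream (Python; the regex is an assumption based only on the samples above and would need hardening against real log output):

import re

# Matches the ChangeItemImpl fields in "Ingesting aspects batch to database"
# lines, based on the samples in this thread (assumption: this key=value
# layout is stable).
CHANGE_ITEM = re.compile(
    r"changeType=(?P<change_type>\w+), "
    r"urn=(?P<urn>.+?), "
    r"aspectName='(?P<aspect>[^']+)'"
)

def parse_change_line(line):
    """Extract change type, URN, and aspect from an EntityServiceImpl line."""
    match = CHANGE_ITEM.search(line)
    if not match:
        return None
    fields = match.groupdict()
    # appSource=ui in systemMetadata marks edits made through the UI.
    fields["via_ui"] = "appSource=ui" in line
    return fields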

Editing an ingestion

2024-05-14 15:33:05,617 [pool-12-thread-2] INFO  c.l.m.filter.RestliLoggingFilter:52 - POST /aspects?action=ingestProposal - ingestProposal - 200 - 14ms

2024-05-14 15:33:05,603 [qtp161960012-4998] INFO  c.l.m.entity.EntityServiceImpl:634 - Ingesting aspects batch to database, items: [ChangeItemImpl{changeType=UPSERT, urn=urn:li:dataHubExecutionRequest:02c00f22-9d93-4855-96e9-e10a5560366c, aspectName='dataHubExecutionRequestResult', recordTemplate={report=~~~~ Execution Summary - RUN_INGEST ~~~~

2024-05-14 15:33:05,603 [qtp161960012-4998] INFO  c.l.m.r.entity.AspectResource:217 - INGEST PROPOSAL proposal: {aspectName=dataHubExecutionRequestResult, entityKeyAspect={contentType=application/json, value=ByteString(length=46,bytes=7b226964...3663227d)}, entityType=dataHubExecutionRequest, aspect={contentType=application/json, value=ByteString(length=1717,bytes=7b227374...3535397d)}, changeType=UPSERT}

2024-05-14 15:33:05,579 [qtp161960012-4999] INFO  c.l.m.entity.EntityServiceImpl:634 - Ingesting aspects batch to database, items: [ChangeItemImpl{changeType=UPSERT, urn=urn:li:dataHubExecutionRequest:02c00f22-9d93-4855-96e9-e10a5560366c, aspectName='dataHubExecutionRequestResult', recordTemplate={startTimeMs=1715700785559, status=RUNNING}, systemMetadata={lastObserved=1715700785578, runId=no-run-id-provided}}]

2024-05-14 15:33:05,578 [qtp161960012-4999] INFO  c.l.m.r.entity.AspectResource:217 - INGEST PROPOSAL proposal: {aspectName=dataHubExecutionRequestResult, entityKeyAspect={contentType=application/json, value=ByteString(length=46,bytes=7b226964...3663227d)}, entityType=dataHubExecutionRequest, aspect={contentType=application/json, value=ByteString(length=51,bytes=7b227374...3535397d)}, changeType=UPSERT}

2024-05-14 15:33:05,523 [Thread-5161] INFO  c.l.m.entity.EntityServiceImpl:634 - Ingesting aspects batch to database, items: [ChangeItemImpl{changeType=UPSERT, urn=urn:li:dataHubExecutionRequest:02c00f22-9d93-4855-96e9-e10a5560366c, aspectName='dataHubExecutionRequestInput', recordTemplate={args={recipe={"run_id":"urn:li:dataHubExecutionRequest:02c00f22-9d93-4855-96e9-e10a5560366c","source":{"type":"glue","config":{"extract_owners":false,"aws_region":"eu-west-1","extract_transforms":false, ...}, source={ingestionSource=urn:li:dataHubIngestionSource:5fc3e102-f475-4da4-a6db-5db88ae482d5, type=MANUAL_INGESTION_SOURCE}, task=RUN_INGEST, requestedAt=1715700785522, executorId=default}, systemMetadata={properties={appSource=ui}}}]

Usage events

This is triggered when I perform an action.

Maybe we can consume from this queue directly and log events that way:

INFO  c.l.m.k.DataHubUsageEventsProcessor:58 - Got DHUE event key: urn:li:corpuser:Mat.Moore, topic: DataHubUsageEvent_v1, partition: 0, offset: 12554, value size: 641, timestamp: 1715699675851
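
If we did go down that route, a minimal consumer sketch with kafka-python might look like the following (the broker address and group id are placeholder assumptions; the topic name DataHubUsageEvent_v1 comes from the log line above):

import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "DataHubUsageEvent_v1",
    bootstrap_servers="prerequisites-kafka:9092",  # assumption: substitute the real broker
    group_id="usage-event-audit-log",
    key_deserializer=lambda raw: raw.decode("utf-8") if raw else None,
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    # The event key carries the acting user (urn:li:corpuser:...), so log it
    # alongside the payload to build an audit trail.
    print(message.key, message.value)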

MatMoore commented:

I created two saved searches in Kibana:

  • Anything on DataHub dev containing "Error" or "warning"
  • Anything on DataHub dev containing "Ingesting aspects batch to database"

The latter catches every kind of update I've tested so far; it does distinguish between automated and manual updates, but it does not capture the currently logged-in user. The rough query strings are below.
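
For reference, the query strings were roughly as follows (the log field name is an assumption about how container stdout is mapped in Cloud Platform's Kibana; adjust to the actual index mapping):

log: ("Error" OR "warning")
log: "Ingesting aspects batch to database"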

Possibly DataHubUsageEventsProcessor will be more useful for auditing, so I will check what we can do with that next.

MatMoore commented:

Looking at the source, the DataHub usage event is pushed to an Elasticsearch data stream called datahub_usage_event.

https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html

This is a kind of append-only log. Only the keys of these log entries end up in Kibana at the moment.

murdo-moj commented May 15, 2024

https://datahubproject.io/docs/how/extract-container-logs/

Log files will live inside the container under the following directories for each service:

  • datahub-gms: /tmp/datahub/logs/gms
  • datahub-frontend: /tmp/datahub/logs/datahub-frontend

There are 2 types of logs that are collected:

  1. Info Logs: These include info, warn, error log lines. They are what print to stdout when the container runs.
  2. Debug Logs: These files have shorter retention (past 1 day) but include more granular debug information from the DataHub code specifically. We ignore debug logs from external libraries that DataHub depends on.

Looking in the debug logs, these are for debugging DataHub's code and are not useful for tracking DataHub usage. For example, they do not contain user events in the way the stdout logs we already collect in Elasticsearch from the GMS pod do.

murdo-moj commented May 15, 2024

  • Domain deletion
2024-05-15 15:44:17,696 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.DataHubUsageEventsProcessor - Got DHUE event key: urn:li:corpuser:Murdo.Moyse1, topic: DataHubUsageEvent_v1, partition: 0, offset: 5793, value size: 857, timestamp: 1715787857681
2024-05-15 15:44:21,815 [Thread-5966] INFO  c.l.d.g.r.d.DeleteDomainResolver - I've successfully deleted the entity urn:li:domain:57101e7d-8af2-4b01-98b0-ac45b10a99dd with urn
2024-05-15 15:44:21,816 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor - Invoking MCL hooks for urn: urn:li:domain:57101e7d-8af2-4b01-98b0-ac45b10a99dd, aspect name: domainKey, entity type: domain, change type: DELETE
  • Domain creation
2024-05-15 16:07:10,105 [Thread-6762] INFO  c.l.m.entity.EntityServiceImpl - Ingesting aspects batch to database, items: [ChangeItemImpl{changeType=UPSERT, urn=urn:li:domain:de5aee03-069c-431c-843f-2bd42a3d4c75, aspectName='domainProperties', recordTemplate={name=test-for-logs-2, created={actor=urn:li:corpuser:Murdo.Moyse1, time=1715789230105}}, systemMetadata={properties={appSource=ui}}}]
2024-05-15 16:07:10,186 [ThreadPoolTaskExecutor-1] INFO  c.l.m.s.e.update.ESBulkProcessor - Added request id: CreateDomainEvent_urn%3Ali%3Acorpuser%3AMurdo.Moyse1_1715789230091_05801, operation type: CREATE, index: datahub_usage_event
  • "Ingesting aspects batch to database" is a good way of finding logs which show edits to entities. Edits do not show who is editing them.
  • Deleting entities do not show up by searching the above.
  • logs have a character cap on kibana, so will sometimes end before the information we want. For example when I update a dataset entity, I cannot see the part of the log which shows the lastmodified aspect, and if that'd tag the USERT to my user.:
"2024-05-16 09:09:49,428 [Thread-10308] INFO  c.l.m.entity.EntityServiceImpl:634 - @kibana-highlighted-field@Ingesting@/kibana-highlighted-field@ @kibana-highlighted-field@aspects@/kibana-highlighted-field@ @kibana-highlighted-field@batch@/kibana-highlighted-field@ @kibana-highlighted-field@to@/kibana-highlighted-field@ @kibana-highlighted-field@database@/kibana-highlighted-field@, items: [ChangeItemImpl{changeType=UPSERT, urn=urn:li:dataset:(urn:li:dataPlatform:dbt,create_a_derived_table.awsdatacatalog.common_platform_derived.sjp_session_summary_fct,PROD), aspectName='editableDatasetProperties', recordTemplate={description=Single Justice Procedure Session Summary derived table. Contains information about the SJP sessions associated with offences."

murdo-moj commented May 16, 2024

When I add a tag to a field of a dataset, logs record that a field is being upserted.
The log states it is requested from the UI.
However, no user information is logged to tie that UI action to a user.
We can see that my user has been viewing pages at around the same time, but we cannot see which pages I have viewed, or if I requested any metadata changes.

2024-05-16 09:27:04,146 [ThreadPoolTaskExecutor-1] INFO  c.l.m.s.e.update.ESBulkProcessor:82 - Added request id: PageViewEvent_urn%3Ali%3Acorpuser%3AMurdo.Moyse1_1715851624070_05895, operation type: CREATE, index: datahub_usage_event
2024-05-16 09:27:25,077 [Thread-11788] INFO  c.l.m.entity.EntityServiceImpl:634 - Ingesting aspects batch to database, items: [ChangeItemImpl{changeType=UPSERT, urn=urn:li:dataset:(urn:li:dataPlatform:athena,opg_sirius_derived.cases_attorneys,PROD), aspectName='editableSchemaMetadata', recordTemplate={editableSchemaFieldInfo=[{globalTags={tags=[{tag=urn:li:tag:test-for-logs}]}, fieldPath=caseitem_id}]}, systemMetadata={properties={appSource=ui}}}]

The logs seem to work like this: a user's requests to the frontend (e.g. SearchPageView, HomePageView) are logged against the user, but if they request a metadata change, it becomes a direct request to the DataHub backend, with no logging of who made the request.

murdo-moj commented May 16, 2024

The OpenSearch cluster can be queried from the GMS pod, e.g.
curl 10.100.152.177:8080/_cat/indices?v (get the hostname and the port from the environment variables in the pod).

What I was trying to query was the datahub_usage_event index, which is written to from GMS:
curl 10.100.152.177:8080/datahub_usage_event/_settings?pretty=true
I can see individual DataHub usage events indexed, but I can't pull out any useful data from them, such as the user.
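
The same query expressed in Python, roughly what I was attempting (host and port as in the curl examples above; the actorUrn field name is an assumption about the usage-event mapping):

import json
import urllib.request

BASE = "http://10.100.152.177:8080"  # in practice, read host/port from the pod env

query = {
    "size": 10,
    "sort": [{"timestamp": {"order": "desc"}}],
    # Assumption: usage events index a field identifying the actor.
    "query": {"match": {"actorUrn": "urn:li:corpuser:Murdo.Moyse1"}},
}

request = urllib.request.Request(
    BASE + "/datahub_usage_event/_search",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    for hit in json.load(response)["hits"]["hits"]:
        print(hit["_source"])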

murdo-moj commented:

From looking at the logs, it appears it would be easier to track user behaviour via the frontend logs than via the backend DataHub logs, which are hard to parse.

It is still possible to get an idea of user behaviour from the DataHub logs when usage is low enough that timestamps can act as a link between metadata changes and frontend interactions.
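
As a stopgap, that timestamp join could look something like this (a sketch; it assumes both log sets have been exported with parsed timestamps, and the five-minute window is arbitrary):

from datetime import timedelta

WINDOW = timedelta(minutes=5)  # arbitrary; tune to observed latency

def correlate(frontend_events, backend_changes):
    """Pair each backend metadata change with frontend events close in time.

    frontend_events: list of (datetime, user_urn) from PageViewEvent logs.
    backend_changes: list of (datetime, urn, aspect) from
    "Ingesting aspects batch to database" logs.
    """
    for changed_at, urn, aspect in backend_changes:
        candidates = [
            user for seen_at, user in frontend_events
            if abs(seen_at - changed_at) <= WINDOW
        ]
        # Only meaningful when usage is low: one candidate means a plausible
        # attribution, several means ambiguity.
        yield urn, aspect, candidates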

murdo-moj commented:

The Entity Events API has the kind of event logs we'd want, but it's only available in managed DataHub: https://datahubproject.io/docs/managed-datahub/datahub-api/entity-events-api/

{
  "entityUrn": "urn:li:dataset:abc",
  "entityType": "dataset",
  "category": "TAG",
  "operation": "ADD",
  "modifier": "urn:li:tag:PII",
  "parameters": {
    "tagUrn": "urn:li:tag:PII"
  },
  "auditStamp": {
    "actor": "urn:li:corpuser:jdoe",
    "time": 1649953100653   
  }
}
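
The actor attribution we're missing from the self-hosted logs is right there in auditStamp, so flattening one of these payloads into an audit line would be simple (a sketch, assuming the payload shape shown above):

from datetime import datetime, timezone

def to_audit_record(event):
    """Flatten an Entity Events API payload into a one-line audit entry."""
    stamp = event["auditStamp"]
    when = datetime.fromtimestamp(stamp["time"] / 1000, tz=timezone.utc)
    return (
        f"{when.isoformat()} {stamp['actor']} {event['operation']} "
        f"{event['category']} {event.get('modifier', '')} on {event['entityUrn']}"
    )

# With the example above this yields something like:
# 2022-04-14T...+00:00 urn:li:corpuser:jdoe ADD TAG urn:li:tag:PII on urn:li:dataset:abc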
