
SPIKE: Can we ship / expose DataHub logs to Cloud Platform? #229

Closed · 7 of 8 tasks
jemnery opened this issue Apr 2, 2024 · 11 comments

jemnery commented Apr 2, 2024

Background

There is existing logging from DataHub into Kibana via standard out; however, it isn't clear exactly what is logged.

Key Questions to answer

  • What scenarios do we want to log?
    • Metadata changes, and which user made them
    • User account changes
    • Changes to ingestion pipelines
  • Is it possible to expose the required logs to Cloud Platform?
    • Yes, the stdout logs from DataHub's pods are piped to Cloud Platform's monitoring Elasticsearch cluster and are searchable via Kibana
  • What changes need to be made, if any?
    • None
  • What limitations exist?
    • Users are not directly tied to metadata changes
    • Logs are stored for 30 days
    • Larger log statements are cut off in Kibana
    • The Entity Events API that DataHub built is only available in managed DataHub

Hypothesis

Can we ship / expose DataHub logs to Cloud Platform / Kibana to consolidate our security logs? This would be good for future SOC requirements.
Outcome: assess whether we can have an audit trail of user actions in one place.

Conclusion

From looking at the logs, it appears it would be easier to track user behaviour via the frontend logs than via the backend DataHub logs, which are hard to parse.

The Entity Events API has the kind of event logs we'd want, but it's only available in managed DataHub.

It is still possible to get an idea of user behaviour from the DataHub logs when usage is low enough that timestamps can act as a link between metadata changes and frontend interactions.

Scenarios

  • User account creation
  • User role changes
  • Changes to metadata
  • Adding and deleting entries
  • Editing tags
  • Editing the glossary
  • Editing domains
  • Modifying ingestions
YvanMOJdigital commented:

Threat Model workshop: https://miro.com/app/board/uXjVKRSefrc=/

MatMoore commented May 14, 2024

Some example logs we can search for

Role changed

INFO  c.l.m.k.MetadataChangeLogProcessor:102 - Invoking MCL hooks for urn: ***, aspect name: roleMembership, entity type: corpuser, change type: UPSERT

2024-05-14 15:16:00,536 [Thread-4116] INFO  c.l.m.entity.EntityServiceImpl:1514 - Producing MCL for ingested aspect roleMembership, urn ***

These do not log who performed the action.

Metadata updates

2024-05-14 15:26:40,671 [ThreadPoolTaskExecutor-1] INFO  c.d.event.PlatformEventProcessor:53 - Got PE event key: entityChangeEvent-urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:athena,_investigations_dev.investigations,PROD),[version=2.0].[type=long].id), topic: PlatformEvent_v1, partition: 0, offset: 1201372, value size: 395, timestamp: 1715700400565

2024-05-14 15:26:40,514 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:102 - Invoking MCL hooks for urn: urn:li:dataset:(urn:li:dataPlatform:athena,_investigations_dev.investigations,PROD), aspect name: editableSchemaMetadata, entity type: dataset, change type: UPSERT

2024-05-14 15:26:40,513 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:78 - Got MCL event key: urn:li:dataset:(urn:li:dataPlatform:athena,_investigations_dev.investigations,PROD), topic: MetadataChangeLog_Versioned_v1, partition: 0, offset: 469751, value size: 327, timestamp: 1715700400511

2024-05-14 15:26:40,508 [Thread-4996] INFO  c.l.m.entity.EntityServiceImpl:1514 - Producing MCL for ingested aspect editableSchemaMetadata, urn urn:li:dataset:(urn:li:dataPlatform:athena,_investigations_dev.investigations,PROD)

2024-05-14 15:26:40,500 [Thread-4996] INFO  c.l.m.entity.EntityServiceImpl:634 - Ingesting aspects batch to database, items: [ChangeItemImpl{changeType=UPSERT, urn=urn:li:dataset:(urn:li:dataPlatform:athena,_investigations_dev.investigations,PROD), aspectName='editableSchemaMetadata', recordTemplate={editableSchemaFieldInfo=[{fieldPath=[version=2.0].[type=long].id, description=This is an edited description!}]}, systemMetadata={properties={appSource=ui}}}]

2024-05-14 15:26:40,480 [Thread-4995] INFO  c.l.d.g.r.m.UpdateDescriptionResolver:36 - Updating description. input: { description: "This is an edited description!", resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:athena,_investigations_dev.investigations,PROD)", subResourceType: DATASET_FIELD, subResource: "[version=2.0].[type=long].id" }

From this we can see the input, URN, and aspect. An aspectName of editable* indicates something was updated through the UI.
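
A rough sketch of pulling the change type, URN, and aspect out of these lines downstream (Python; the regex is an assumption based only on the samples above and would need hardening against real log output):

import re

# Matches the ChangeItemImpl fields in "Ingesting aspects batch to database"
# lines, based on the samples in this thread (assumption: this key=value
# layout is stable).
CHANGE_ITEM = re.compile(
    r"changeType=(?P<change_type>\w+), "
    r"urn=(?P<urn>.+?), "
    r"aspectName='(?P<aspect>[^']+)'"
)

def parse_change_line(line):
    """Extract change type, URN, and aspect from an EntityServiceImpl line."""
    match = CHANGE_ITEM.search(line)
    if not match:
        return None
    fields = match.groupdict()
    # appSource=ui in systemMetadata marks edits made through the UI.
    fields["via_ui"] = "appSource=ui" in line
    return fields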

Editing an ingestion

2024-05-14 15:33:05,617 [pool-12-thread-2] INFO  c.l.m.filter.RestliLoggingFilter:52 - POST /aspects?action=ingestProposal - ingestProposal - 200 - 14ms

2024-05-14 15:33:05,603 [qtp161960012-4998] INFO  c.l.m.entity.EntityServiceImpl:634 - Ingesting aspects batch to database, items: [ChangeItemImpl{changeType=UPSERT, urn=urn:li:dataHubExecutionRequest:02c00f22-9d93-4855-96e9-e10a5560366c, aspectName='dataHubExecutionRequestResult', recordTemplate={report=~~~~ Execution Summary - RUN_INGEST ~~~~

2024-05-14 15:33:05,603 [qtp161960012-4998] INFO  c.l.m.r.entity.AspectResource:217 - INGEST PROPOSAL proposal: {aspectName=dataHubExecutionRequestResult, entityKeyAspect={contentType=application/json, value=ByteString(length=46,bytes=7b226964...3663227d)}, entityType=dataHubExecutionRequest, aspect={contentType=application/json, value=ByteString(length=1717,bytes=7b227374...3535397d)}, changeType=UPSERT}

2024-05-14 15:33:05,579 [qtp161960012-4999] INFO  c.l.m.entity.EntityServiceImpl:634 - Ingesting aspects batch to database, items: [ChangeItemImpl{changeType=UPSERT, urn=urn:li:dataHubExecutionRequest:02c00f22-9d93-4855-96e9-e10a5560366c, aspectName='dataHubExecutionRequestResult', recordTemplate={startTimeMs=1715700785559, status=RUNNING}, systemMetadata={lastObserved=1715700785578, runId=no-run-id-provided}}]

2024-05-14 15:33:05,578 [qtp161960012-4999] INFO  c.l.m.r.entity.AspectResource:217 - INGEST PROPOSAL proposal: {aspectName=dataHubExecutionRequestResult, entityKeyAspect={contentType=application/json, value=ByteString(length=46,bytes=7b226964...3663227d)}, entityType=dataHubExecutionRequest, aspect={contentType=application/json, value=ByteString(length=51,bytes=7b227374...3535397d)}, changeType=UPSERT}

2024-05-14 15:33:05,523 [Thread-5161] INFO  c.l.m.entity.EntityServiceImpl:634 - Ingesting aspects batch to database, items: [ChangeItemImpl{changeType=UPSERT, urn=urn:li:dataHubExecutionRequest:02c00f22-9d93-4855-96e9-e10a5560366c, aspectName='dataHubExecutionRequestInput', recordTemplate={args={recipe={"run_id":"urn:li:dataHubExecutionRequest:02c00f22-9d93-4855-96e9-e10a5560366c","source":{"type":"glue","config":{"extract_owners":false,"aws_region":"eu-west-1","extract_transforms":false, ...}, source={ingestionSource=urn:li:dataHubIngestionSource:5fc3e102-f475-4da4-a6db-5db88ae482d5, type=MANUAL_INGESTION_SOURCE}, task=RUN_INGEST, requestedAt=1715700785522, executorId=default}, systemMetadata={properties={appSource=ui}}}]

Usage events

This is triggered when I perform an action.

Maybe we can consume from this queue directly and log events that way:

INFO  c.l.m.k.DataHubUsageEventsProcessor:58 - Got DHUE event key: urn:li:corpuser:Mat.Moore, topic: DataHubUsageEvent_v1, partition: 0, offset: 12554, value size: 641, timestamp: 1715699675851
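
If we did go down that route, a minimal consumer sketch with kafka-python might look like the following (the broker address and group id are placeholder assumptions; the topic name DataHubUsageEvent_v1 comes from the log line above):

import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "DataHubUsageEvent_v1",
    bootstrap_servers="prerequisites-kafka:9092",  # assumption: substitute the real broker
    group_id="usage-event-audit-log",
    key_deserializer=lambda raw: raw.decode("utf-8") if raw else None,
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    # The event key carries the acting user (urn:li:corpuser:...), so log it
    # alongside the payload to build an audit trail.
    print(message.key, message.value)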

MatMoore commented:

I created two saved searches in Kibana:

  • Anything on DataHub dev containing "Error" or "warning"
  • Anything on DataHub dev containing "Ingesting aspects batch to database"

The latter catches every kind of update I've tested so far; it does distinguish between automated and manual updates, but it does not capture the currently logged-in user. The rough query strings are below.
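
For reference, the query strings were roughly as follows (the log field name is an assumption about how container stdout is mapped in Cloud Platform's Kibana; adjust to the actual index mapping):

log: ("Error" OR "warning")
log: "Ingesting aspects batch to database"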

Possibly DataHubUsageEventsProcessor will be more useful for auditing, so I will check what we can do with that next.

MatMoore commented:

Looking at the source, the DataHub usage event is pushed to an Elasticsearch data stream called datahub_usage_event.

https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html

This is a kind of append-only log. Only the keys of these log entries end up in Kibana at the moment.

murdo-moj commented May 15, 2024

https://datahubproject.io/docs/how/extract-container-logs/

Log files will live inside the container under the following directories for each service:

  • datahub-gms: /tmp/datahub/logs/gms
  • datahub-frontend: /tmp/datahub/logs/datahub-frontend

There are 2 types of logs that are collected:

  1. Info Logs: These include info, warn, error log lines. They are what print to stdout when the container runs.
  2. Debug Logs: These files have shorter retention (past 1 day) but include more granular debug information from the DataHub code specifically. We ignore debug logs from external libraries that DataHub depends on.

Looking in the debug logs, these are for debugging DataHub's code and are not useful for tracking DataHub usage. For example, they do not contain user events in the way the stdout logs we already collect in Elasticsearch from the GMS pod do.

murdo-moj commented May 15, 2024

  • Domain deletion
2024-05-15 15:44:17,696 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.DataHubUsageEventsProcessor - Got DHUE event key: urn:li:corpuser:Murdo.Moyse1, topic: DataHubUsageEvent_v1, partition: 0, offset: 5793, value size: 857, timestamp: 1715787857681
2024-05-15 15:44:21,815 [Thread-5966] INFO  c.l.d.g.r.d.DeleteDomainResolver - I've successfully deleted the entity urn:li:domain:57101e7d-8af2-4b01-98b0-ac45b10a99dd with urn
2024-05-15 15:44:21,816 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor - Invoking MCL hooks for urn: urn:li:domain:57101e7d-8af2-4b01-98b0-ac45b10a99dd, aspect name: domainKey, entity type: domain, change type: DELETE
  • Domain creation
2024-05-15 16:07:10,105 [Thread-6762] INFO  c.l.m.entity.EntityServiceImpl - Ingesting aspects batch to database, items: [ChangeItemImpl{changeType=UPSERT, urn=urn:li:domain:de5aee03-069c-431c-843f-2bd42a3d4c75, aspectName='domainProperties', recordTemplate={name=test-for-logs-2, created={actor=urn:li:corpuser:Murdo.Moyse1, time=1715789230105}}, systemMetadata={properties={appSource=ui}}}]
2024-05-15 16:07:10,186 [ThreadPoolTaskExecutor-1] INFO  c.l.m.s.e.update.ESBulkProcessor - Added request id: CreateDomainEvent_urn%3Ali%3Acorpuser%3AMurdo.Moyse1_1715789230091_05801, operation type: CREATE, index: datahub_usage_event
  • "Ingesting aspects batch to database" is a good way of finding logs which show edits to entities. Edits do not show who is editing them.
  • Deleting entities do not show up by searching the above.
  • logs have a character cap on kibana, so will sometimes end before the information we want. For example when I update a dataset entity, I cannot see the part of the log which shows the lastmodified aspect, and if that'd tag the USERT to my user.:
"2024-05-16 09:09:49,428 [Thread-10308] INFO  c.l.m.entity.EntityServiceImpl:634 - @kibana-highlighted-field@Ingesting@/kibana-highlighted-field@ @kibana-highlighted-field@aspects@/kibana-highlighted-field@ @kibana-highlighted-field@batch@/kibana-highlighted-field@ @kibana-highlighted-field@to@/kibana-highlighted-field@ @kibana-highlighted-field@database@/kibana-highlighted-field@, items: [ChangeItemImpl{changeType=UPSERT, urn=urn:li:dataset:(urn:li:dataPlatform:dbt,create_a_derived_table.awsdatacatalog.common_platform_derived.sjp_session_summary_fct,PROD), aspectName='editableDatasetProperties', recordTemplate={description=Single Justice Procedure Session Summary derived table. Contains information about the SJP sessions associated with offences."

murdo-moj commented May 16, 2024

When I add a tag to a field of a dataset, logs record that a field is being upserted.
The log states it is requested from the UI.
However, no user information is logged to tie that UI action to a user.
We can see that my user has been viewing pages at around the same time, but we cannot see which pages I have viewed, or if I requested any metadata changes.

2024-05-16 09:27:04,146 [ThreadPoolTaskExecutor-1] INFO  c.l.m.s.e.update.ESBulkProcessor:82 - Added request id: PageViewEvent_urn%3Ali%3Acorpuser%3AMurdo.Moyse1_1715851624070_05895, operation type: CREATE, index: datahub_usage_event
2024-05-16 09:27:25,077 [Thread-11788] INFO  c.l.m.entity.EntityServiceImpl:634 - Ingesting aspects batch to database, items: [ChangeItemImpl{changeType=UPSERT, urn=urn:li:dataset:(urn:li:dataPlatform:athena,opg_sirius_derived.cases_attorneys,PROD), aspectName='editableSchemaMetadata', recordTemplate={editableSchemaFieldInfo=[{globalTags={tags=[{tag=urn:li:tag:test-for-logs}]}, fieldPath=caseitem_id}]}, systemMetadata={properties={appSource=ui}}}]

The logs seem to work like this: a user's requests to the frontend (e.g. SearchPageView, HomePageView) are logged against the user, but if they request a metadata change, it becomes a direct request to the DataHub backend, with no logging of who made the request.

murdo-moj commented May 16, 2024

The OpenSearch cluster can be queried from the GMS pod, e.g.
curl 10.100.152.177:8080/_cat/indices?v (get the hostname and the port from the environment variables in the pod).

What I was trying to query was the datahub_usage_event index, which is written to from GMS:
curl 10.100.152.177:8080/datahub_usage_event/_settings?pretty=true
I can see individual DataHub usage events indexed, but I can't pull out any useful data from them, such as the user.
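
The same query expressed in Python, roughly what I was attempting (host and port as in the curl examples above; the actorUrn field name is an assumption about the usage-event mapping):

import json
import urllib.request

BASE = "http://10.100.152.177:8080"  # in practice, read host/port from the pod env

query = {
    "size": 10,
    "sort": [{"timestamp": {"order": "desc"}}],
    # Assumption: usage events index a field identifying the actor.
    "query": {"match": {"actorUrn": "urn:li:corpuser:Murdo.Moyse1"}},
}

request = urllib.request.Request(
    BASE + "/datahub_usage_event/_search",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    for hit in json.load(response)["hits"]["hits"]:
        print(hit["_source"])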

murdo-moj commented:

From looking at the logs, it appears it would be easier to track user behaviour via the frontend logs than via the backend DataHub logs, which are hard to parse.

It is still possible to get an idea of user behaviour from the DataHub logs when usage is low enough that timestamps can act as a link between metadata changes and frontend interactions.
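
As a stopgap, that timestamp join could look something like this (a sketch; it assumes both log sets have been exported with parsed timestamps, and the five-minute window is arbitrary):

from datetime import timedelta

WINDOW = timedelta(minutes=5)  # arbitrary; tune to observed latency

def correlate(frontend_events, backend_changes):
    """Pair each backend metadata change with frontend events close in time.

    frontend_events: list of (datetime, user_urn) from PageViewEvent logs.
    backend_changes: list of (datetime, urn, aspect) from
    "Ingesting aspects batch to database" logs.
    """
    for changed_at, urn, aspect in backend_changes:
        candidates = [
            user for seen_at, user in frontend_events
            if abs(seen_at - changed_at) <= WINDOW
        ]
        # Only meaningful when usage is low: one candidate means a plausible
        # attribution, several means ambiguity.
        yield urn, aspect, candidates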

murdo-moj commented:

The Entity Events API has the kind of event logs we'd want, but it's only available in managed DataHub: https://datahubproject.io/docs/managed-datahub/datahub-api/entity-events-api/

{
  "entityUrn": "urn:li:dataset:abc",
  "entityType": "dataset",
  "category": "TAG",
  "operation": "ADD",
  "modifier": "urn:li:tag:PII",
  "parameters": {
    "tagUrn": "urn:li:tag:PII"
  },
  "auditStamp": {
    "actor": "urn:li:corpuser:jdoe",
    "time": 1649953100653   
  }
}
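
The actor attribution we're missing from the self-hosted logs is right there in auditStamp, so flattening one of these payloads into an audit line would be simple (a sketch, assuming the payload shape shown above):

from datetime import datetime, timezone

def to_audit_record(event):
    """Flatten an Entity Events API payload into a one-line audit entry."""
    stamp = event["auditStamp"]
    when = datetime.fromtimestamp(stamp["time"] / 1000, tz=timezone.utc)
    return (
        f"{when.isoformat()} {stamp['actor']} {event['operation']} "
        f"{event['category']} {event.get('modifier', '')} on {event['entityUrn']}"
    )

# With the example above this yields something like:
# 2022-04-14T...+00:00 urn:li:corpuser:jdoe ADD TAG urn:li:tag:PII on urn:li:dataset:abc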
