SPIKE: Can we ship / expose DataHub logs to Cloud Platform? #229
Anything that is going to stdout is already being collected: https://user-guide.cloud-platform.service.justice.gov.uk/documentation/logging-an-app/log-collection-and-storage.html#log-collection

Threat Model workshop: https://miro.com/app/board/uXjVKRSefrc=/
Some example logs we can search for:
- Role changed — these do not log who performed the action.
- Metadata updates — from these we can see the input, URN, and aspect (e.g. the aspectName when editing an ingestion).
- Usage events — these are triggered when I perform an action. Maybe we can consume from this queue directly and log events that way, as sketched below.
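A minimal sketch of that idea, assuming DataHub's default usage-event topic name (`DataHubUsageEvent_v1`) and an in-cluster broker address; the event field names are guesses to verify against real messages, not confirmed API:

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "DataHubUsageEvent_v1",                        # assumed default topic name
    bootstrap_servers="prerequisites-kafka:9092",  # assumed broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Re-emit the fields we care about for auditing as one JSON line on
    # stdout, where CP's existing log collection would pick it up.
    print(json.dumps({
        "type": event.get("type"),           # e.g. SearchEvent (assumed field)
        "actor": event.get("actorUrn"),      # who did it (assumed field)
        "timestamp": event.get("timestamp"), # when (assumed field)
    }))
```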
I created two saved searches in kibana:
The latter catches every kind of update I've tested so far, but it does not distinguish between automated and manual updates and does not capture the currently logged-in user. Possibly DataHubUsageEventsProcessor will be more useful for auditing, so I will check what we can do with that next.
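The saved-search definitions themselves aren't reproduced above, but for illustration, a roughly equivalent query run straight against the monitoring cluster might look like the following. The endpoint, index pattern, field names, and match phrase are all assumptions, not the actual searches:

```python
# Hypothetical equivalent of a Kibana saved search, run directly against
# the monitoring Elasticsearch cluster.
import requests

ES_URL = "https://monitoring-es.example:9200"  # placeholder endpoint

query = {
    "query": {
        "bool": {
            "must": [
                # Assumed: GMS stdout lines mentioning metadata change proposals
                {"match_phrase": {"log": "MetadataChangeProposal"}},
                {"match": {"kubernetes.container_name": "datahub-gms"}},
            ]
        }
    },
    "sort": [{"@timestamp": "desc"}],
    "size": 50,
}

resp = requests.get(f"{ES_URL}/logstash-*/_search", json=query, timeout=30)
for hit in resp.json()["hits"]["hits"]:
    src = hit["_source"]
    print(src.get("@timestamp"), src.get("log"))
```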
Looking at the source, the datahub usage event is pushed to an elasticsearch data stream (https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html), which is a kind of append-only log. Only the keys of these log entries end up in kibana at the moment.
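A sketch of reading that data stream directly, so the full documents are visible rather than just the keys that surface in kibana today; the stream name `datahub_usage_event` and the endpoint are assumptions based on DataHub's defaults:

```python
import json

import requests

ES_URL = "http://elasticsearch-master:9200"  # assumed in-cluster service name

resp = requests.get(
    f"{ES_URL}/datahub_usage_event/_search",  # assumed data stream name
    json={"size": 10, "sort": [{"timestamp": {"order": "desc"}}]},
    timeout=30,
)
for hit in resp.json()["hits"]["hits"]:
    # Print whole documents to see which fields are missing from kibana.
    print(json.dumps(hit["_source"], indent=2))
```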
https://datahubproject.io/docs/how/extract-container-logs/
Looking at the debug logs, these are for debugging DataHub's own code and are not useful for tracking DataHub usage: they do not contain user events in the same way as the stdout logs we already collect in elasticsearch from the GMS pod.
When I add a tag to a field of a dataset, the logs record that a field is being upserted.
The logs seem to work like this: users' requests to the frontend (e.g. SearchPageView, HomePageView) are logged against the user, but a metadata change becomes a direct request to the datahub backend, with no logging of who made the request.
The opensearch cluster can be called from the gms pod, e.g. … What I was trying to query was the …
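As a hypothetical example of what such a call could look like (the service name and port are assumptions; CP's actual setup may differ):

```python
# A minimal call against the opensearch cluster from a shell on the gms pod
# (e.g. via `kubectl exec`).
import requests

OS_URL = "http://opensearch-cluster-master:9200"  # assumed service name

# List indices and data streams to confirm what is available to query.
print(requests.get(f"{OS_URL}/_cat/indices?v", timeout=30).text)
```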
From my look at the logs, it appears that it would be easier to track user behaviour via the frontend logs than via the backend datahub logs, which are hard to parse. It is still possible to get an idea of user behaviour from the datahub logs when usage is low enough that timestamps can link metadata changes to frontend interactions.
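A sketch of that timestamp-linking idea; the event shapes and the five-second window are assumptions:

```python
# Link backend metadata changes to frontend user events by timestamp
# proximity. Only reliable at low usage, when one user is active at a time.
from datetime import timedelta


def correlate(frontend_events, metadata_changes, window=timedelta(seconds=5)):
    """Pair each metadata change with frontend actors seen close in time.

    frontend_events: dicts with "actor" and "ts" (datetime) keys
    metadata_changes: dicts with "urn" and "ts" (datetime) keys
    """
    pairs = []
    for change in metadata_changes:
        candidates = [
            event["actor"]
            for event in frontend_events
            if abs(event["ts"] - change["ts"]) <= window
        ]
        pairs.append((change, candidates))
    return pairs
```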
https://datahubproject.io/docs/managed-datahub/datahub-api/entity-events-api/ — the Entity Events API has event logs like we'd want, but it's only available in Managed DataHub.
Background
There is existing logging from DataHub into kibana via standard out; however, it isn't clear exactly what is logged.
Key Questions to answer
- Are stdout logs from DataHub's pods piped to CP's monitoring elasticsearch cluster, and are they searchable via kibana?

Hypothesis
Can we ship / expose DH logs to CP / Kibana to consolidate our security logs?
Would be good for future SOC requirements.
Outcome: assess whether we can have an audit trail of actions by users in one place
Conclusion
From my look at the logs, it appears that it would be easier to track user behaviour via the frontend logs than via the backend datahub logs, which are hard to parse.
The Entity Events API has event logs like we'd want, but it's only available in Managed DataHub.
It is still possible to get an idea of user behaviour from the datahub logs when usage is low enough that timestamps can link metadata changes to frontend interactions.
Scenarios