
Fix Failing ES Remote Logging #32438

Merged (7 commits), Jul 15, 2023
Conversation

@Owen-CH-Leung (Contributor) commented Jul 8, 2023

A previous PR (link here) removed all non-official Elasticsearch libraries and switched to using only the official elasticsearch library. After that PR was merged, reading remote logs from ES failed due to a bug it introduced.

In particular, when reading remote logs, the webserver produced the following elasticsearch exception when sending a POST request to ES endpoint /_all/count:

elasticsearch.exceptions.RequestError: RequestError(400, 'parsing_exception', 'request does not support [sort]')

which says the endpoint does not accept the sort parameter in the query.

This PR fixes the bug and re-enables reading remote logs; with it merged, Airflow should resume reading remote logs from ES. Local test result:

[screenshot: local test result]

Additional implementation details:

The change simply removes the sort parameter from the query posted to Elasticsearch.
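As an illustration only (not the actual provider code), the fix amounts to stripping search-only keys from the request body before posting to the count endpoint; the helper name and key set below are hypothetical:

```python
# Hypothetical helper (not the actual provider code): the _count API rejects
# "sort", so strip request-level keys that only make sense for a search.
def build_count_body(search_body: dict) -> dict:
    """Return a copy of a search body that the _count endpoint will accept."""
    return {k: v for k, v in search_body.items() if k not in ("sort", "from", "size")}

search_body = {
    "query": {"match_phrase": {"log_id": "example-dag-example-task-1"}},
    "sort": [{"offset": {"order": "asc"}}],
}
count_body = build_count_body(search_body)
# count_body keeps "query" but drops "sort", so the POST to /_all/_count succeeds
```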

I inspected the original source code in provider 4.5.1:
https://github.com/apache/airflow/blob/providers-elasticsearch/4.5.1/airflow/providers/elasticsearch/log/es_task_handler.py#L290-L294

While the sort() method was invoked via elasticsearch_dsl, the actual query posted to Elasticsearch never included the sort parameter. So even in 4.5.1, the returned result was not sorted; the sort() call merely gave a false impression that it was.

Here's how the function code gets called:

  1. The count() method of elasticsearch_dsl's Search class is called:

https://github.com/apache/airflow/blob/providers-elasticsearch/4.5.1/airflow/providers/elasticsearch/log/es_task_handler.py#L297

  2. Inside count(), the to_dict() method is called with count=True:

https://github.com/elastic/elasticsearch-dsl-py/blob/main/elasticsearch_dsl/search.py#L694

  3. In to_dict(), since count=True, the sort parameter is not appended:

https://github.com/elastic/elasticsearch-dsl-py/blob/main/elasticsearch_dsl/search.py#L640-L664
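The three steps above can be condensed into a minimal sketch that mimics (but does not reproduce) elasticsearch_dsl's Search.to_dict behaviour: when count=True, the serialized body never gains a "sort" key:

```python
# Simplified mimic of elasticsearch_dsl.Search.to_dict (illustrative only):
# the real method builds the body from the Search object's internal state.
def to_dict(query: dict, sort: list, count: bool = False) -> dict:
    d = {"query": query}
    if not count:
        # sort (and pagination) are only attached for real search requests,
        # never for count requests
        if sort:
            d["sort"] = sort
    return d

query = {"match_phrase": {"log_id": "example-1"}}
assert "sort" in to_dict(query, sort=["offset"])                   # search body carries sort
assert "sort" not in to_dict(query, sort=["offset"], count=True)   # count body does not
```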

@Owen-CH-Leung (Contributor, author) commented Jul 8, 2023

@pankajkoti Could you use this branch to test on your end and see if remote logging works? I tested locally (using the airflow.cfg file you shared before) and the HTTP POST error to ES has gone away. Thanks!

@pankajkoti (Member) commented Jul 8, 2023

Hi @Owen-CH-Leung, thanks for trying to provide a quick resolution here. I can see that the sort error has gone away. Unfortunately, though, it is not able to fetch the corresponding task logs for me (it used to work fine previously). I can see that logs are getting shipped successfully to ES, but the search query is somehow not able to match and find them.

In the meantime I am trying to see if I can fix it by comparing the previous code https://github.com/apache/airflow/pull/31920/files#diff-dd898ab2ed4bca853f1ce5cf52b6fbb37d5fc3545f28967c03dc499fabd3a746 and the current query, but I wanted to inform you about the current progress.

@Owen-CH-Leung (Contributor, author):

@pankajkoti OK, I tried using Filebeat to ship logs to ES and hit the same issue: logs are not being fetched. Looking into it now.

@Owen-CH-Leung (Contributor, author):

@pankajkoti Hey, could you also share how you set up Filebeat to ship logs to ES?

I was using 4.5.1 and still had no luck fetching the logs. FYI, I was using the v2-5-stable tag of the airflow repo (which I believe uses 4.5.1; I saw the code is the old version with the non-official ES library). I suspect there may be some trick to using Filebeat to ship logs (just a wild guess).

Let me know. Thanks!

@pankajkoti (Member) commented Jul 8, 2023

Hi @Owen-CH-Leung,

Please find below the steps I used to set up Filebeat to ship logs. @jedcunningham helped me set this up, so thanks a ton to him.

I am using Docker Desktop's Kubernetes cluster on macOS, so you may unfortunately need Docker Desktop's Kubernetes cluster for the steps below to work.

I have attached 4 files; please remove the .txt suffix from each after downloading (I had to append .txt because GitHub does not allow attaching YAML files).

  1. es.yaml: setup ES on Kubernetes
    helm repo add elastic https://helm.elastic.co
    helm repo update
    helm upgrade --install -f es.yaml pk-elasticsearch elastic/elasticsearch
    kubectl port-forward svc/elasticsearch-master 9200:9200

  2. kibana.yaml: Kibana on Kubernetes for UI to verify logs are getting shipped to ES
    helm upgrade --install -f kibana.yaml pk-kibana elastic/kibana --version 7.17.3
    kubectl port-forward svc/pk-kibana-kibana 5601:5601

  3. filebeat_pv.yaml : For volumes required for Filebeat
    In this file, please replace the hostPath -> path value with your local Airflow repo's logs directory. When you use Breeze, the logs will be available in the <your_clone_repo_path>/airflow/logs directory, and Filebeat will then ship logs from this directory to ES.
    kubectl apply -f filebeat_pv.yaml

  4. filebeat.yaml : Filebeat on Kubernetes
    helm upgrade --install -f filebeat.yaml pk-filebeat elastic/filebeat

There might be a slight delay in shipping the logs, but you will see them within a minute or so in the Kibana UI's Logs feature.

es.yaml.txt
filebeat_pv.yaml.txt
filebeat.yaml.txt
kibana.yaml.txt

Airflow Config changes for this stack:

[logging]
remote_logging = True

[elasticsearch]
json_format = True
host = host.docker.internal:9200
host_field = host.name
offset_field = log.offset

@Owen-CH-Leung (Contributor, author) commented Jul 8, 2023

@pankajkoti Can you check on your side, e.g. in the Kibana dashboard, whether the documents have the fields log_id and dag_id?

I set up ES, Filebeat and Kibana accordingly, and inside the Kibana dashboard I can see the logs streamed to ES, including logs of the new runs I triggered in the Airflow web UI. My airflow.cfg has been configured as well. However, when I inspect individual messages in Kibana, I don't see the above 2 fields in each document.

I think if the log_id field is missing in ES, we cannot retrieve messages from ES properly.
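To illustrate why the field matters (a sketch, not the handler's exact code): the handler looks documents up by an exact log_id match, so a document shipped without that field can never be found. The template format below is an assumption modeled on the default-style "{dag_id}-{task_id}-{execution_date}-{try_number}":

```python
# Illustrative only: Airflow's ES task handler queries by log_id, so documents
# missing that field never match. build_log_id/build_log_query are hypothetical
# names; the log_id template is an assumption, not the provider's exact one.
def build_log_id(dag_id: str, task_id: str, execution_date: str, try_number: int) -> str:
    return f"{dag_id}-{task_id}-{execution_date}-{try_number}"

def build_log_query(log_id: str) -> dict:
    # An exact-phrase match on log_id: no field in the document, no hit.
    return {"query": {"match_phrase": {"log_id": log_id}}}

log_id = build_log_id("my_dag", "my_task", "2023-07-08T00:00:00+00:00", 1)
query = build_log_query(log_id)
```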

@Owen-CH-Leung (Contributor, author):

@pankajkoti No worries, it was just my mistake; I can see log_id and dag_id now.

potiuk previously approved these changes Jul 9, 2023

@potiuk (Member) commented Jul 9, 2023

@pankajkoti -> would love to get confirmation that this one fixes the issue :)

@pankajkoti (Member):

@potiuk Sorry, sir. Unfortunately, it does not solve the issue for me yet.

@pankajkoti (Member):

I might try to see if I can get it working by making a few changes here, but unfortunately I have a busy day today, so I cannot commit to solving this today.

@Owen-CH-Leung (Contributor, author):

@potiuk @pankajkoti I think the current code change still doesn't solve the issue. I'm making a few changes locally to test whether they fix it.

@Owen-CH-Leung (Contributor, author) commented Jul 9, 2023

@pankajkoti I finally have it working locally after making some more code changes. I'll do some cleanup and push the code to this PR in the coming few days, and I'll let you know once it's ready for you to test.

[screenshot: local test result after the additional changes]

@potiuk dismissed their stale review July 9, 2023 21:55: "Waiting for final fix"

@pankajkoti (Member):

Okay thank you @Owen-CH-Leung

@Owen-CH-Leung (Contributor, author):

@pankajkoti Can you test again? It should work now.

@pankajkoti (Member):

@Owen-CH-Leung Thanks, I tested now and it seems to work fine 🎉

@pankajkoti (Member):

cc: @dstandish @sunank200

@eladkal (Contributor) commented Jul 11, 2023

@Owen-CH-Leung Great job!
@pankajkoti thanks for helping test it!

I think it would be good to have some test coverage for the changes so we can avoid regressions.

@Owen-CH-Leung (Contributor, author):

@eladkal Sure, let me add some tests for the es_response in the coming few days.

@pankajkoti (Member) left a comment:


It would be nice to add type annotations to the method signatures, and also some docstrings so it's clear what the methods do :)

@Owen-CH-Leung (Contributor, author) commented Jul 12, 2023

@pankajkoti Sure thing. Will do =D

@Owen-CH-Leung (Contributor, author):

@eladkal @pankajkoti @potiuk Added tests, docstrings & method signatures. However, the docker-compose CI job is failing.

@eladkal (Contributor) left a comment:


LGTM
