-
Notifications
You must be signed in to change notification settings - Fork 14.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consolidate "Serialization" code #40974
Comments
cc @gyli since you waned to take this issue. Could you add a comment to this GitHub issues, please? Your name only shows up for me to add in Assignees once you have a comment on it. |
Hi @kaxil , I have been reviewing serialization codes recently, while I haven't noticed any significant parts that need to be consolidated, especially after the improvement of lazy load serialization module. |
cc @uranusjr who might also know about it since he worked with Bolke on it |
I remember different ways of Serialization used for Xcom, Airflow Webserver/ORM but Bolke or TP can add more specific details: airflow/airflow/models/dagpickle.py Line 45 in 14e9759
Lines 269 to 278 in 14e9759
airflow/airflow/models/xcom.py Lines 673 to 697 in 14e9759
Line 46 in 14e9759
|
The problem as I understand it is tha we currently have two serializations:
Some of our code uses 1) - some uses 2). There are a few problems there - for example 2) does not yet support DAG serialization, and it also is not used for example in internal-api (but internal-api will be gone in Airlfow 3). There are also other places where we use other serialization mechanisms (dill, cloudpickle) and there is a potential tha we could consolidate that as well. Ideally (this was the long term vision of @bolkedebruin) we should have "one to rule them all" - 2) should become universal serializer used by everything. |
Ha! There are some things to consider when wanting to do "consolidation". The main leading principle in the past was that we do not want to have executable code when deserializing. This is what all third-party (de)serializers seem to do, pickle, dill, cloudpickle etc. The second principle was to have a human readable format and the third principle was to have it versioned. I've added "serde" (no 2. that @potiuk is speaking of in the past) to have a generic way of serializing any object with the principles in mind. This is particularly useful for XCom as that shares arbitrary data across workers. The 'other' serializer which I would call the DAG serializer has three main short-comings: 1) It is slow - serde takes about 10% of the time the DAG serializer takes, 2) it is hard to extend, you would need to change the core code to add an extension and 3) it will add O(n) in time to do so. The upside is that it is tried and tested, serializes DAGs and does JSON schema validation. It might then seem the obvious route to add DAG serialization to "serde". Which it did try, but also felt a bit like squeezing something into something else where it doesnt entirely fit (keeping backwards compatibility in mind). It is possible, but a lot of past cruft would need to be re-implemented to make it work. Now I see other projects like Spark Connect settle on Cloudpickle and they forego the issue of arbitrary code execution. The question then becomes how relevant is that attack vector? Is the tradeoff in maintaining our own serializer worth it? Also it will not generate a human readable format. Do we need to review our principles (which I think have never been officially settled, but correct me if I am wrong). Concluding: if you add DAG seralization to "serde" it is probably the most Airflow way to go. It gives you extensibility for the future as it has the better format (over the DAG serializer) and can with a little bit of help serialize any kind of object. It seems to serve us well nowadays. If you take a step back and want to re-evaluate it might be worth re-visiting our principles and checkng what we can do to reduce the attack vector and maybe go for cloudpickle. This would externalize the support and reduce the maintenance burden. However, we might run into issues when we cannot serialize through cloudpickle and we do not control how it works. (I'll add some additional notes later that are historically important, but slightly less relevant). My 2 cents. |
Also (to add to what @bolkedebruin wrote ) in internal-API we used Pydantic for some of that - mostly because it alows to serialize ORM SQLalchemy objects together with their relations. But this is possible to plug-in in serde as well as another serializers and treat it similarly to "to_dict" or "to_json".. Also I think there was a lot of misunderstandings and clashes (and I am one of the guilty ones there) about which and why and how we are using serializers - precisely because we have not agreed on a common strategy we want to take, direction and vision we have. It has never been agreed and writen down, and I think we should start with that. For example right now when I look at But with Airflow 3, I'd be personally for making That would be my 1c . |
Thanks for the explanation!
|
Sounds good. Although I don't understand your point 1 fully. Yes to consolidation, but "making airflow modules select serialization engines" I do not understand. Care to the clarify? |
I've edited my language in my comment above. I was thinking about supporting the old DAG serializer and the consolidated version at the same time for smoother migration, while I also feel it could be unnecessary if the serialization format is not changed. |
Also as mentioned in #31213 -> it might be a good exercise to migrate AIP-44 for Airflow 2.10 to use |
@gyli It might be possible, but note there are a lot of assumptions within the DAG serializers about the output format. I'm not sure it is worth the effort to re-use the old serializer code. Serde has a strict schema / format (versioning specification), while the serialized objects of the DAG serializer do not. In your place I would focus on 'on-the-wire' compatibility so that the serialized format ends up compatible with the JSON schema used. This will require a second pass serializer. As @potiuk mentioned ensuring AIP-44 doesn't use the legacy path any more might be an easier way to get used to the code and has more benefits. The DAG serializer is quite contained now except by its use of AIP-44. |
Yeah I can start from #31213 first. Note that by doing so, internal_api_call has to temporarily call either old serializer or |
I am not sure we allow DAG or Operator to be passed over the internal-api. I think it should not be the case and we should likely change it. I will review it shortly as well. |
Here are some internal API endpoints I found with DAG/operator arguments: airflow/airflow/cli/commands/task_command.py Line 163 in 6684481
airflow/airflow/models/taskinstance.py Line 910 in f9c56cb
|
I will take a look today :) |
@bolkedebruin Still thinking about the best approach here. The tricky part is how to handle different encoding if old code is used indeed. While even if we decide not to use the old codes, we will still need to implement classes similar to Approach 1, calling Approach 2, porting I gave a second thought and second approach makes more sense to me. Although it requires more work, we might want a complete migration from old to new serializer considering this is an Airflow 2 to 3 change. Please let me know what you think. |
I'm mostly AFK - Olympics etc - so a bit brief and can't really deep dive into the code. I think targeting a new schema, your approach 2 is best, while for deserialization having detection of the old format and using the old deserializers for some time is smart. We do want people to be able to move from Airflow 2 -> 3 but I don't think we require them to entirely start from scratch. Having the updated schema be as close to serde's native output format, basically dicts with classnames and versioning information, requires the least passes so will be the fastest. It's also easier then so switch to another output format (say protobuf) if we would like that in the future. It might be worthwhile to have a mix of a serde module and porting of the classes. Classes are slow, the overhead on the stack is significant. This is one of the reasons why serde does what it does. Another reason is that the deserialized versions of DAG and Operator are not the same as DAG and Operator. So it night be cleaner to keep them separated. Another one is that the "serialize" and "deserialize" methods in serde's modules have priority over the ones in the classes, so that allows you to keep the existing infrastructure in place while replacing it with serde's. Summarizing my thoughts:
For now my thoughts. Thanks for taking this on. I'll help where I can. |
@gyli : Could I assign this task to you? Do you have enough clarity? |
Hi @kaxil, yes, please assign it to me. It's progressing slowly though. |
We have different ways to do Serialisation in Airflow --- and also multiple way of pickling (dill, stdlib pickle, cloudpickle).
dill
tocloudpickle
for advanced serialization #7870Would you like to add additional color here @bolkedebruin @amoghrajesh
The text was updated successfully, but these errors were encountered: