Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use LibYAML bindings by default if installed #436

Closed
wants to merge 1 commit into from

Conversation

psphicas
Copy link
Contributor

@psphicas psphicas commented Sep 8, 2020

PyYAML users generally want the fast performance of LibYAML.

This change automatically rebinds yaml.Dumper to CDumper, and
yaml.Loader to CLoader, if LibYAML bindings are available.

This allows the use of the "sugar" methods (safe_load, full_load,
unsafe_load, etc.) as is, benefiting from the performance boost of
LibYAML if it is installed.

@psphicas
Copy link
Contributor Author

psphicas commented Sep 8, 2020

Not sure if there is any interest in doing something like this, or if there is a better way to "automatically" use LibYAML? It seems somewhat relevant to the efforts going on with #303 ..

(The failed tests are expected, as they would normally be skipped.)

@nitzmahone
Copy link
Member

nitzmahone commented Sep 10, 2020

It's definitely a common "developer quality of life" problem that everyone has to solve, but changing the defaults or tweaking them globally could possibly break things (eg, a pip-installed pyyaml built against an OS-packaged libyaml that later got removed or (worse) updated to an incompatible version would fail incomprehensibly). One idea that might be worth exploring is something like a parallel set of loader/dumper defs that just map to the C versions if available and the pure-Python ones if not (basically what you have here with the DefaultX prefixed ones, but not the existing names). That'd prevent people from having to manually do the fallback dance everywhere. Another option might be to allow the global defaults to be set explicitly, but that seems potentially dangerous with the amount of library code out there that uses pyyaml- combine two things that do it differently and boom.

@psphicas
Copy link
Contributor Author

Thank you for the feedback!

I guess my premise was that on a system with libyaml installed and functioning, it doesn't make sense for some code to use pure-Python and some libraries to use the C versions - it would usually be all or nothing, and typically I would want it enabled, unless there was some specific reason I needed to disable it.

Any thoughts on how a parallel set of loaders/dumpers might work? I started with just adding the DefaultX-prefixed loader and dumper (without messing with the existing names), which is fine if only the "sugar" methods are used, but doesn't help when pulling in any of the many libraries that use pyyaml and do explicitly specify Loader= and Dumper= in their function calls. This is especially challenging for the add resolver/constructor/representer methods - if a library adds a representer with Dumper=yaml.SafeDumper, and then later I try to use safe_dump() (with the DefaultSafeDumper, aka. CSafeDumper), it won't work, unless add_representer assumes that yaml.SafeDumper really means all available safe dumpers. As ugly as it is, redefining the existing names at import was the best way I could find to make it work globally without affecting existing code.

Implementation details aside, is it a reasonable feature to be able to replace all usages of the python loader and dumper with the C versions, including in any imported libraries, or is this unlikely to be acceptable in the core pyyaml code?

@nitzmahone
Copy link
Member

doesn't help when pulling in any of the many libraries that use pyyaml and do explicitly specify Loader= and Dumper= in their function calls.

For better or worse, there are enough observable behavior differences between the libyaml-backed bits and the pure-Python bits that I think it's a bad idea to force existing code to use a different one without an explicit opt-in. I can think of several chunks of pyyaml code I've written that would break if the default impl were swapped to the libyaml-backed classes. What you've added with just the new DefaultX loader and dumper aliases provides an easy way for someone that doesn't care about those differences (likely the vast majority) to float the implementation at runtime based on what's available, while avoiding exposing existing code to a potentially fatal change.

Implementation details aside, is it a reasonable feature to be able to replace all usages of the python loader and dumper with the C versions, including in any imported libraries, or is this unlikely to be acceptable in the core pyyaml code?

Short answer: probably not.

Longer answer: At first blush, providing a globally settable default Loader/Dumper (and maybe others) sounds like a good idea. Anytime something like this comes up though, the question should be: "what would happen if two libraries did this?" Unfortunately, there are parts of the pyyaml API that already encourage mutation of global state for configuration and behavioral changes, but we shouldn't be making more of them, and this one in particular has the capability to do real damage. eg, your library sets it to DefaultSafeLoader and mine sets it to DefaultUnsafeLoader- who wins? Depends on when it got set, which one was loaded first, might float, might switch to something that causes a security hole... Basically, the convenience methods just need to go away and people need to be specific about the features they need, but that ship has already sailed. Providing an easy way to opt in to the C-backed versions for code that's already being specific is probably the best we can do.

Some pyyaml users prefer the fast performance of libyaml, and are less
concerned about minor incompatibilities. There is no existing method to
globally enable the libyaml bindings, and different libraries may do
things in different ways.

This change provides a way for users to request global replacement of
the default Python Loaders and Dumpers with the libyaml bindings, prior
to the initial import of the pyyaml library.

To enable in Python3:
    import builtins
    builtins._yaml__use_libyaml_by_default__ = True
    import yaml

To enable in Python2:
    import __builtin__
    __builtin__._yaml__use_libyaml_by_default__ = True
    import yaml

Assuming that libyaml is installed, the result will be that:
    yaml.BaseLoader == yaml.cyaml.CBaseLoader
    yaml.SafeLoader == yaml.cyaml.CSafeLoader
    yaml.FullLoader == yaml.cyaml.CFullLoader
    yaml.UnsafeLoader == yaml.cyaml.CUnsafeLoader
    yaml.Loader == yaml.cyaml.CLoader
    yaml.BaseDumper == yaml.cyaml.CBaseDumper
    yaml.SafeDumper == yaml.cyaml.CSafeDumper
    yaml.Dumper == yaml.cyaml.CDumper

This allows the use of the "sugar" methods (safe_load, full_load,
unsafe_load, etc.) as is, benefiting from the performance boost of
libyaml if it is installed. This should also provide very broad
compatibility with existing code that may explicitly reference specific
loaders and dumpers by name.
@psphicas
Copy link
Contributor Author

Fair points. Definitely an opt-in would be a requirement, I just don't know what the right mechanics are. The latest update uses __builtins__, which seems super hacky, but I don't know another way to do this early enough (to correctly bind all the functions definitions and create the YAMLObject class).

Thinking about this more, the yaml.Dumper and yaml.Loader are already just aliases to yaml.loader.Loader and yaml.dumper.Dumper, so introducing the Default names doesn't really buy anything. Really we just want to rebind those aliases to yaml.cyaml.Dumper and yaml.cyaml.Loader as soon as possible (or not at all).

I had not considered replacing Safe with Unsafe variants, only equivalent ones. I've stumbled across some modules that do this (https://github.com/DataDog/integrations-core/pull/3089/commits), and I don't think this change would prevent that from working.

In terms of "what would happen if two libraries did this?" there are already some problems that would actually go away if we rebind the names early.

Here is an example of some code (borrowed from the docs) that works if libyaml is not present, but breaks if libyaml is available. It's not clear to me that either the module or the user code is wrong, or how it can be fixed without messing with the 3rd-party library.

monster.py (3rd-party library that provides a class that is a YAMLObject):

import yaml

class Monster(yaml.YAMLObject):
    yaml_tag = u'!Monster'
    def __init__(self, name, hp, ac, attacks):
        self.name = name
        self.hp = hp
        self.ac = ac
        self.attacks = attacks
    def __repr__(self):
        return "%s(name=%r, hp=%r, ac=%r, attacks=%r)" % (
            self.__class__.__name__, self.name, self.hp, self.ac, self.attacks)

use_monster.py (user code):

from monster import *

from yaml import load, dump
try:
    from yaml import CLoader as Loader, CDumper as Dumper
except ImportError:
    from yaml import Loader, Dumper

m = """
!Monster
ac: 16
attacks: [BITE, HURT]
hp: [3, 6]
name: Cave lizard
"""

print(yaml.load(m, Loader=Loader))

Without libyaml:

# python use_monster.py
Monster(name='Cave lizard', hp=[3, 6], ac=16, attacks=['BITE', 'HURT'])

With libyaml:

Traceback (most recent call last):
  File "use_monster.py", line 17, in <module>
    print(yaml.load(m, Loader=Loader))
  File "/home/ubuntu/pyyaml/.tox/py36/lib/python3.6/site-packages/yaml/__init__.py", line 137, in load
    return loader.get_single_data()
  File "/home/ubuntu/pyyaml/.tox/py36/lib/python3.6/site-packages/yaml/constructor.py", line 51, in get_single_data
    return self.construct_document(node)
  File "/home/ubuntu/pyyaml/.tox/py36/lib/python3.6/site-packages/yaml/constructor.py", line 55, in construct_document
    data = self.construct_object(node)
  File "/home/ubuntu/pyyaml/.tox/py36/lib/python3.6/site-packages/yaml/constructor.py", line 100, in construct_object
    data = constructor(self, node)
  File "/home/ubuntu/pyyaml/.tox/py36/lib/python3.6/site-packages/yaml/constructor.py", line 429, in construct_undefined
    node.start_mark)
yaml.constructor.ConstructorError: could not determine a constructor for the tag '!Monster'
  in "<unicode string>", line 2, column 1

This is at least fixable from an end-user perspective, by not trying to use libyaml at all. However, if you have two libraries that each provide YAMLObject classes, one which can only be constructed with the Python constructors and one that can only be constructed with C constructors, then you might be stuck.

With the proposed patch, this can actually work.

try:
    import builtins
    builtins._yaml__use_libyaml_by_default__ = True
except:
    import __builtin__
    __builtin__._yaml__use_libyaml_by_default__ = True

from yaml import load, dump
try:
    from yaml import CLoader as Loader, CDumper as Dumper
except ImportError:
    from yaml import Loader, Dumper

m = """
!Monster
ac: 16
attacks: [BITE, HURT]
hp: [3, 6]
name: Cave lizard
"""

print(yaml.load(m, Loader=Loader))

The try/except for cyaml checking can be dropped, although leaving it is fine.

try:
    import builtins
    builtins._yaml__use_libyaml_by_default__ = True
except:
    import __builtin__
    __builtin__._yaml__use_libyaml_by_default__ = True

from yaml import Loader, Dumper
from monster import *

m = """
!Monster
ac: 16
attacks: [BITE, HURT]
hp: [3, 6]
name: Cave lizard
"""

print(yaml.load(m, Loader=Loader))

I don't love this, but but it seems not-too-horrible. It's opt-in, it only kicks in the first time import yaml is run, so the end-user really has control of whether or not to enable it, and it does what I think people would want it to do. Definitely open to additional suggestions on alternative approaches.

@nitzmahone
Copy link
Member

I think you're missing one of my primary points: a global opt-in shouldn't be on the table, since if two libraries doing that with different needs get combined in a project, somebody loses. The best we can do in a way that doesn't encourage bad behavior is basically what you've already done: provide a new set of aliased types that will use libyaml if present and fall back if not. The way I was using "opt-in" means every caller still needs to opt-in to that behavior by using the new aliases. Old code that's unaware or unwilling won't be affected, and new code that won't work (for whatever reason) can still select a specific type that will work regardless of the choices made by another developer or library.

@psphicas
Copy link
Contributor Author

I think you're missing one of my primary points: a global opt-in shouldn't be on the table, since if two libraries doing that with different needs get combined in a project, somebody loses.

I guess it depends on how we are defining "need" and "loses".

There is definitely code that would break if it were forced to use libyaml, which would be bad. There are probably many libraries that try to use libyaml, and some may be unacceptably slow without it, but I suspect that most of them can fall back to pyyaml. We could categorize and matrix them and see if they could possibly share a common set of bindings.

  1. requires pyyaml (=Py) : not easily identifiable or distinguishable (other than lack of cyaml imports)
  2. prefers pyyaml (~Py) : does this exist?
  3. doesn't care (*) : probably the majority
  4. prefers libyaml (~C) : does the try: from yaml import cyaml dance
  5. requires libyaml (=C) : from yaml import cyaml or dies trying
=Py ~Py * ~C =C
=Py Py Py Py Py- 👎
~Py Py Py Py * -C
* Py Py * C C
~C -Py * C C C
=C 👎 C- C C C

In most cases, we can find a compromise, although one library may be unhappy (-).

Hypothetically, say we have two libraries in a combined project with incompatible requirements (=Py and =C), but everything is working, because each explicitly references the desired bindings.

What would happen if the feature were introduced?

  • Default: nothing - all existing bindings work the way they did before.
  • User integrating the libraries tries to set __use_libyaml_by_default__ = True: the =Py library would break, and they would probably just back out the change.
  • The =Py library author decides to set __use_libyaml_by_default__ = False: nothing, everything would work as before.
  • The =C library author decides to set __use_libyaml_by_default__ = True: this could possibly break the =Py library, depending on the order of imports. The most likely scenario is that this will do nothing. The user code probably already does import yaml before importing 3rd-party libraries, so the option will not have any effect. If it does break, the user can fix it by setting __use_libyaml_by_default__ = False before the import of the =C library, or by importing yaml before the =C library.

I'm definitely approaching this more from and end-user perspective, where I want to influence the behavior of all the libraries I'm importing without any updates to those libraries. It was intended to address the "floating" cases - a way to request libyaml as a best-effort. If you look at it from the library author approach, this is not really a solution at all, since in most cases the library would be setting the option too late.

The best we can do in a way that doesn't encourage bad behavior is basically what you've already done: provide a new set of aliased types that will use libyaml if present and fall back if not.

The complexity with the aliased types is the add_ methods for resolvers, contructors, and representers. It would probably be more intuitive if "like" classes (Loader and CLoader, SafeLoader and CSafeLoader, etc.) shared the same list of yaml_constructors (for example), but they don't. I would be surprised to see:

add_contructor(Foo, foo_constructor, Loader=yaml.SafeLoader)
add_contructor(Foo, foo_constructor, Loader=yaml.CSafeLoader)

This is more likely:

try:
    from yaml import CSafeLoader as SafeLoader
except:
    from yaml import SafeLoader as SafeLoader

add_contructor(Foo, foo_constructor, Loader=SafeLoader)

And this can already cause breakage, because it's not really possible to know which loader knows how to do the right thing. (Again, I'm mostly considering the use case of importing libraries that define classes of YAMLObjects, and trying to use them all together in a single set of YAML documents that can be dumped and loaded, not libraries that each do their own thing separately.) I think the problem may actually get worse if a third option is introduced, unless we adjust the class definitions to share (or do some sort of cross-lookup).

Anyway ... all that being said, the "global" option feels more appropriate as a monkey-patch outside of pyyaml. How would you feel about moving the function definitions and YAMLObject class out of __init__.py into a separate file? It might make something like this possible ... to be done at one's own risk, of course.

import importlib
import yaml
yaml.Loader = yaml.CLoader
yaml.Dumper = yaml.Dumper
importlib.reload(yaml.functions)

@psphicas
Copy link
Contributor Author

Closing the pull request .. if anyone else is interested in this functionality, they can try this out:
https://pypi.org/project/pylibyaml/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants