Check _RAW_BSON_DOCUMENT_MARKER at CodecOptions.document_class first #324

wronglink · 2017-05-29T08:57:26Z

Hi there.
I'm trying to implement custom BSON decoding logic, that decodes data to different object types. And I need to store some decoder state at different recursion level. I think it would be handy if CodecOptions would allow document_class to be either class or instance, having a special _type_marker attribute. For that I've reordered conditions because issubclass call fails when it's called on non-class object.

behackett · 2017-05-30T18:07:12Z

Off the top of my head I can't think of a problem with re-ordering those checks, but the bson module instantiates the document class, so document_class can't be an instance. Can you clarify what you're trying to do?

wronglink · 2017-05-31T09:33:10Z

instantiates the document class

If you mean decode_all function, it appends the result the document_class call both in python and c implementations. If there is a class, it would create instance via __new__ call and then init it via __init__ call. If there is a callable object (function or object with __call__ interface), it would just call it. I don't see any problem here, except unobvious naming (I'd rather like something like wrapper, than document_class in that case). Is there any other cases, when bson/pymongo calls document_class and then treat it like a class instance?

Can you clarify what you're trying to do

I'm writing an ORM for mongo, that uses pymongo as a driver. I want to wrap my decoding results with data structures other than dict depending on some conditions. Here is a little example. My typical BSON-encoded data from mongo looks like:

{
    // Several levels of metadata (like cursor id, namespace, request result) wrapping my real data
    "cursor": {
        "firstBatch": [
            // Here the real data starts
            // First document
            {
                "_id": ObjectId("59256c8a5e49a721938935a3"),
                "str_field": "Hello",
                "int_field": 42,
            },
            // Second document
            {
                "_id": ObjectId("59256c8a5e49a721938935a4"),
                "str_field": "Hello world",
                "int_field": 24,
            }
            // Real data ended, now some more metadata
        ],
        "id": 0,
        "ns": "test_db.test_col"
    },
    "ok": 1
}

Lets say, I want to wrap my document with class called Model, and there is also no need to wrap metadata, so I want my decoder to return something like:

{
    # its a dict
    "cursor": {
        "firstBatch": [ 
            <__main__.Model with _id=59256c8a5e49a721938935a3>,
            <__main__.Model with _id=59256c8a5e49a721938935a4>
        ],
        "id": 0,
        "ns": "test_db.test_col"
    },
    "ok": 1
}

My idea is to use custom BSONDecoder, that also has some context between

import bson
from bson.codec_options import _RAW_BSON_DOCUMENT_MARKER

class Model(object):
    # Model class implementation
    pass

class CustomBSONDecoder(object):
    _type_marker = _RAW_BSON_DOCUMENT_MARKER

    def __init__(self, model_class):
        self.model_class = model_class   # It stores my custom decoding type
        self.depth = 0  # This is the depth of a step

    def __call__(self, data, codec_options):
        object_size = bson._UNPACK_INT(data[:4])[0] - 1
        if self.depth == 2:
            # If the depth is 2, we are at the document data level
            container = self.model_class()
        else:
            # Else use regular dictionary
            container = dict()
        self.depth += 1

        for key, value, pos in bson._iterate_elements(data, 4, object_size, codec_options):
            container[key] = value

        self.depth -= 1
        return container.unwrap() if do_wrap else container

codec_options = CodecOptions(document_class=CustomBSONDecoder(Model))

If CodecOptions allows me to use a class instance as document_class, it would be easier for me to have store some context (like depth attribute), between custom object wrapping. Otherwise I'll have to put my decoding logic inside __new__ method, and store my context, for example, inside my class, that is not the best idea as classes are singletons.

behackett · 2017-05-31T18:09:36Z

Sorry, I still don't understand how this pull request solves your problem.

I think it would be handy if CodecOptions would allow document_class to be either class or instance

This change doesn't allow document_class to be an instance. The issubclass call will still fail if the class doesn't inherit from RawBSONDocument, and an instance of RawBSONDocument isn't callable:

>>> co = CodecOptions(document_class=RawBSONDocument(""))
>>> import bson
>>> bs = bson.BSON.encode({"foo": "bar"})
>>> bs.decode(co)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bson/__init__.py", line 1021, in decode
    return _bson_to_dict(self, codec_options)
  File "bson/__init__.py", line 382, in _bson_to_dict
    reraise(InvalidBSON, exc_value, exc_tb)
  File "bson/__init__.py", line 375, in _bson_to_dict
    return opts.document_class(data, opts)
bson.errors.InvalidBSON: 'RawBSONDocument' object is not callable

wronglink · 2017-05-31T19:35:25Z

_raw_document_class checks only _type_marker attribute. So, custom BSON decoder is not forced to be RawBSONDocument subclass.
My decoder is object subclass, it has correct _type_marker attribute, so it passes _raw_document_class check, and is properly called. I've made a working example, to explain my idea https://gist.github.com/wronglink/a55ef92b8e75a16fee2424cf1b7e7b19.

It turns this:

{
    "foo": "bar",
    "nested": {
        "data": [
            {"a": 1, "b": 2},
            {"a": "Hello", "b": "world!"},
        ]
    }
}

To this (note Model instances at 2 level of depth):

{
    'foo': 'bar',
    'nested': {
        'data': [
            <Model a=1, b=2>,
            <Model a=Hello, b=world!>
        ]
    }
}

wronglink · 2017-06-07T20:32:06Z

@behackett this PR will fix my current problem (passing a class instance to BSON decoder). I think there could be a further step: to treat document_class.__call__() as a factory method. It would require some util functions like here to check only correct _type_marker attribute, and never check if is a class. And also it should require some documentation notes. If you like this idea, I can provide another PR.

behackett · 2017-06-16T00:13:43Z

Thanks for your contribution, and patience waiting for review. Though the change doesn't appear to make any practical difference to existing applications, it does mean we can never change the internals of raw document support without breaking your application. Raw documents are meant to get raw BSON into and out of MongoDB. Using them to implement a codec mapping is a clever hack, but it's still a hack. Historically we've provided son manipulators to achieve your goal, but they are currently deprecated. We intend to deliver a replacement solution in a future release. That work is currently slated for the next release, but I can't guarantee it will make it.

wronglink · 2017-06-18T19:43:50Z

we've provided son manipulators to achieve your goal

It seems to me that SON manipulators are not the best option as they are killing performance. It's better to put decoded data into needed data structures on the fly. I think it would be handy if I could tell pymongo.helpers._unpack_response to do it with my custom BSON decoder implementation. Right now I think I can reach it by subclassing Cursor, but maybe there could be CodecOptions parameter? I'd also like if bson would be a little bit more hook-friendly.

Raw documents are meant to get raw BSON into and out of MongoDB.

Is there any reason why RawBSONDocument could be used instead of built-in BSON decoding mechanics. I thought that it's just the base class for custom decoding logics. Unfortunately the requirement of RawBSONDocument to be a class makes it less flexible as it doesn't allow to store some context between decoding stages.

Check _RAW_BSON_DOCUMENT_MARKER at CodecOptions.document_class first

6a44855

behackett self-assigned this May 30, 2017

wronglink mentioned this pull request Jun 7, 2017

Pass current field name to raw BSON decoder #326

Closed

behackett closed this Jun 16, 2017

ShaneHarvey added a commit to ShaneHarvey/mongo-python-driver that referenced this pull request Apr 28, 2023

PYTHON-3686 Try mongodb#324

e8975a8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Check _RAW_BSON_DOCUMENT_MARKER at CodecOptions.document_class first #324

Check _RAW_BSON_DOCUMENT_MARKER at CodecOptions.document_class first #324

wronglink commented May 29, 2017

behackett commented May 30, 2017

wronglink commented May 31, 2017

behackett commented May 31, 2017

wronglink commented May 31, 2017

wronglink commented Jun 7, 2017

behackett commented Jun 16, 2017

wronglink commented Jun 18, 2017

Check _RAW_BSON_DOCUMENT_MARKER at CodecOptions.document_class first #324

Check _RAW_BSON_DOCUMENT_MARKER at CodecOptions.document_class first #324

Conversation

wronglink commented May 29, 2017

behackett commented May 30, 2017

wronglink commented May 31, 2017

behackett commented May 31, 2017

wronglink commented May 31, 2017

wronglink commented Jun 7, 2017

behackett commented Jun 16, 2017

wronglink commented Jun 18, 2017