
.. _handling-uuid-data-example:

Handling UUID Data
==================

PyMongo ships with built-in support for dealing with UUID types.
It is straightforward to store native :class:`uuid.UUID` objects
to MongoDB and retrieve them as native :class:`uuid.UUID` objects::

  from pymongo import MongoClient
  from bson.binary import UuidRepresentation
  from uuid import uuid4

  # use the 'standard' representation for cross-language compatibility.
  client = MongoClient(uuidRepresentation='standard')
  collection = client.get_database('uuid_db').get_collection('uuid_coll')

  # remove all documents from collection
  collection.delete_many({})

  # create a native uuid object
  uuid_obj = uuid4()

  # save the native uuid object to MongoDB
  collection.insert_one({'uuid': uuid_obj})

  # retrieve the stored uuid object from MongoDB
  document = collection.find_one({})

  # check that the retrieved UUID matches the inserted UUID
  assert document['uuid'] == uuid_obj

Native :class:`uuid.UUID` objects can also be used as part of MongoDB
queries::

  # find_one() returns a single document; find() would return a cursor
  document = collection.find_one({'uuid': uuid_obj})
  assert document['uuid'] == uuid_obj

The above examples illustrate the simplest use case: one where the UUID is
generated and used by the same application. However, the situation can be
significantly more complex when a MongoDB deployment contains UUIDs created
by other drivers, because the Java and C# drivers have historically encoded
UUIDs using a byte-order that differs from the one used by PyMongo.
Applications that require interoperability across these drivers must specify
the appropriate :class:`~bson.binary.UuidRepresentation`.

In the following sections, we describe how drivers have historically differed
in their encoding of UUIDs, and how applications can use the
:class:`~bson.binary.UuidRepresentation` configuration option to maintain
cross-language compatibility.

.. attention:: New applications that do not share a MongoDB deployment with
   any other application and that have never stored UUIDs in MongoDB
   should use the ``standard`` UUID representation for cross-language
   compatibility. See :ref:`configuring-uuid-representation` for details
   on how to configure the :class:`~bson.binary.UuidRepresentation`.

.. _example-legacy-uuid:

Legacy Handling of UUID Data
----------------------------

Historically, MongoDB drivers have used different byte-orders
when serializing UUID types to :class:`~bson.binary.Binary`.
Consider, for instance, a UUID with the following canonical textual
representation::

  00112233-4455-6677-8899-aabbccddeeff

This UUID would historically be serialized by the Python driver as::

  00112233-4455-6677-8899-aabbccddeeff

The same UUID would historically be serialized by the C# driver as::

  33221100-5544-7766-8899-aabbccddeeff

Finally, the same UUID would historically be serialized by the Java driver as::

  77665544-3322-1100-ffee-ddccbbaa9988
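
The three layouts above can be reproduced with the standard library alone.
The following is a minimal sketch; the helper names ``csharp_legacy_bytes``
and ``java_legacy_bytes`` are hypothetical and are not part of PyMongo::

  from uuid import UUID

  def csharp_legacy_bytes(u):
      """Reverse bytes 0-3, 4-5, and 6-7; leave bytes 8-15 untouched."""
      b = u.bytes
      return b[0:4][::-1] + b[4:6][::-1] + b[6:8][::-1] + b[8:16]

  def java_legacy_bytes(u):
      """Reverse bytes 0-7 and, separately, bytes 8-15."""
      b = u.bytes
      return b[0:8][::-1] + b[8:16][::-1]

  u = UUID('00112233-4455-6677-8899-aabbccddeeff')

  # Render each 16-byte payload in the canonical 8-4-4-4-12 grouping.
  python_layout = str(UUID(bytes=u.bytes))
  csharp_layout = str(UUID(bytes=csharp_legacy_bytes(u)))
  java_layout = str(UUID(bytes=java_legacy_bytes(u)))

  print(python_layout)  # 00112233-4455-6677-8899-aabbccddeeff
  print(csharp_layout)  # 33221100-5544-7766-8899-aabbccddeeff
  print(java_layout)    # 77665544-3322-1100-ffee-ddccbbaa9988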

.. note:: For in-depth information about the byte-order historically
   used by different drivers, see the `Handling of Native UUID Types
   Specification
   <https://github.com/mongodb/specifications/blob/master/source/uuid.rst>`_.

This difference in the byte-order of UUIDs encoded by different drivers can
result in highly unintuitive behavior in some scenarios. We detail two such
scenarios in the next sections.

Scenario 1: Applications Share a MongoDB Deployment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Consider the following situation:

* Application ``C`` written in C# generates a UUID and uses it as the ``_id``
  of a document that it proceeds to insert into the ``uuid_test`` collection of
  the ``example_db`` database. Let's assume that the canonical textual
  representation of the generated UUID is::

    00112233-4455-6677-8899-aabbccddeeff

* Application ``P`` written in Python attempts to ``find`` the document
  written by application ``C`` in the following manner::

    from uuid import UUID
    collection = client.example_db.uuid_test
    result = collection.find_one({'_id': UUID('00112233-4455-6677-8899-aabbccddeeff')})

  In this instance, ``result`` will never be the document that
  was inserted by application ``C`` in the previous step. This is because of
  the different byte-order used by the C# driver for representing UUIDs as
  BSON Binary. The following query, on the other hand, will successfully find
  this document::

    result = collection.find_one({'_id': UUID('33221100-5544-7766-8899-aabbccddeeff')})

This example demonstrates how the differing byte-order used by different
drivers can hamper interoperability. To work around this problem, users should
configure their ``MongoClient`` with the appropriate
:class:`~bson.binary.UuidRepresentation` (in this case, ``client`` in application
``P`` can be configured to use the
:data:`~bson.binary.UuidRepresentation.CSHARP_LEGACY` representation to
avoid the unintuitive behavior) as described in
:ref:`configuring-uuid-representation`.
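
The mismatch can be modelled without a database. The sketch below assumes
that a plain ``dict`` keyed by the raw 16-byte payload stands in for the
collection's ``_id`` index; ``stored`` and ``lookup`` are hypothetical names
used only for illustration::

  from uuid import UUID

  original = UUID('00112233-4455-6677-8899-aabbccddeeff')

  # Application C: the legacy C# layout matches UUID.bytes_le in Python.
  stored = {original.bytes_le: {'written_by': 'C'}}

  def lookup(u):
      """Look up a document using the Python byte-order, as application P does."""
      return stored.get(u.bytes)

  # Querying with the original UUID misses the document ...
  miss = lookup(original)
  # ... while the byte-swapped UUID happens to match the stored payload.
  hit = lookup(UUID('33221100-5544-7766-8899-aabbccddeeff'))

  assert miss is None
  assert hit == {'written_by': 'C'}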

Scenario 2: Round-Tripping UUIDs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the following examples, we see how using a misconfigured
:class:`~bson.binary.UuidRepresentation` can cause an application
to inadvertently change the :class:`~bson.binary.Binary` subtype, and in some
cases, the bytes of the :class:`~bson.binary.Binary` field itself when
round-tripping documents containing UUIDs.

Consider the following situation::

  from bson.codec_options import CodecOptions, DEFAULT_CODEC_OPTIONS
  from bson.binary import Binary, UuidRepresentation
  from uuid import uuid4

  # Using UuidRepresentation.PYTHON_LEGACY stores a Binary subtype-3 UUID
  python_opts = CodecOptions(uuid_representation=UuidRepresentation.PYTHON_LEGACY)
  input_uuid = uuid4()
  collection = client.testdb.get_collection('test', codec_options=python_opts)
  collection.insert_one({'_id': 'foo', 'uuid': input_uuid})
  assert collection.find_one({'uuid': Binary(input_uuid.bytes, 3)})['_id'] == 'foo'

  # Retrieving this document using UuidRepresentation.STANDARD returns a Binary instance
  std_opts = CodecOptions(uuid_representation=UuidRepresentation.STANDARD)
  std_collection = client.testdb.get_collection('test', codec_options=std_opts)
  doc = std_collection.find_one({'_id': 'foo'})
  assert isinstance(doc['uuid'], Binary)

  # Round-tripping the retrieved document yields the exact same document
  std_collection.replace_one({'_id': 'foo'}, doc)
  # Read back with the same codec options so both values decode as Binary
  round_tripped_doc = std_collection.find_one({'uuid': Binary(input_uuid.bytes, 3)})
  assert doc == round_tripped_doc


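In PyMongo versions before 4.0, round-tripping the document using the
incorrect :class:`~bson.binary.UuidRepresentation` (``STANDARD`` instead of
``PYTHON_LEGACY``) changed the :class:`~bson.binary.Binary` subtype as a
side-effect, because the subtype 3 field was decoded to a native
:class:`uuid.UUID` and then re-encoded as subtype 4. As the example shows,
``STANDARD`` now returns the field as a :class:`~bson.binary.Binary` of
subtype 3, so the stored value is preserved. **Note that the same applies
when the situation is reversed, i.e. when the original document is written
using the ``STANDARD`` representation and then round-tripped using the
``PYTHON_LEGACY`` representation.**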

In the next example, we see the consequences of incorrectly using a
representation that modifies byte-order (``CSHARP_LEGACY`` or ``JAVA_LEGACY``)
when round-tripping documents::

  from bson.codec_options import CodecOptions, DEFAULT_CODEC_OPTIONS
  from bson.binary import Binary, UuidRepresentation
  from uuid import uuid4

  # Using UuidRepresentation.STANDARD stores a Binary subtype-4 UUID
  std_opts = CodecOptions(uuid_representation=UuidRepresentation.STANDARD)
  input_uuid = uuid4()
  collection = client.testdb.get_collection('test', codec_options=std_opts)
  collection.insert_one({'_id': 'baz', 'uuid': input_uuid})
  assert collection.find_one({'uuid': Binary(input_uuid.bytes, 4)})['_id'] == 'baz'

  # Retrieving this document using UuidRepresentation.JAVA_LEGACY returns a native UUID
  # without modifying the UUID byte-order
  java_opts = CodecOptions(uuid_representation=UuidRepresentation.JAVA_LEGACY)
  java_collection = client.testdb.get_collection('test', codec_options=java_opts)
  doc = java_collection.find_one({'_id': 'baz'})
  assert doc['uuid'] == input_uuid

  # Round-tripping the retrieved document silently changes the Binary bytes and subtype
  java_collection.replace_one({'_id': 'baz'}, doc)
  assert collection.find_one({'uuid': Binary(input_uuid.bytes, 3)}) is None
  assert collection.find_one({'uuid': Binary(input_uuid.bytes, 4)}) is None
  round_tripped_doc = collection.find_one({'_id': 'baz'})
  assert round_tripped_doc['uuid'] == Binary(input_uuid.bytes, 3).as_uuid(UuidRepresentation.JAVA_LEGACY)


In this case, using the incorrect :class:`~bson.binary.UuidRepresentation`
(``JAVA_LEGACY`` instead of ``STANDARD``) changes the
:class:`~bson.binary.Binary` bytes and subtype as a side-effect.
**Note that this happens when any representation that
manipulates byte-order (``CSHARP_LEGACY`` or ``JAVA_LEGACY``) is incorrectly
used to round-trip UUIDs written with ``STANDARD``. When the situation is
reversed - i.e. when the original document is written using ``CSHARP_LEGACY``
or ``JAVA_LEGACY`` and then round-tripped using ``STANDARD`` -
only the :class:`~bson.binary.Binary` subtype is changed.**
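
The byte-level effect of the Java round trip can be modelled with the
standard library. This is a sketch of the behavior described above, not
PyMongo's implementation: a stored field is represented as a hypothetical
``(subtype, payload)`` tuple, and ``java_legacy_encode`` is an illustrative
helper::

  from uuid import UUID

  def java_legacy_encode(u):
      """Encode as subtype 3 with bytes 0-7 and bytes 8-15 each reversed."""
      b = u.bytes
      return (3, b[0:8][::-1] + b[8:16][::-1])

  original = UUID('00112233-4455-6677-8899-aabbccddeeff')

  # Written with STANDARD: subtype 4, bytes unchanged.
  stored = (4, original.bytes)

  # Read back as a native UUID without reordering, as in the example above.
  decoded = UUID(bytes=stored[1])

  # Written again with JAVA_LEGACY: both the subtype and the bytes change.
  round_tripped = java_legacy_encode(decoded)

  assert decoded == original
  assert round_tripped[0] == 3
  assert round_tripped[1] != stored[1]
  assert round_tripped[1].hex() == '7766554433221100ffeeddccbbaa9988'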

.. note:: Starting in PyMongo 4.0, these issues are resolved: the
   ``STANDARD`` representation decodes Binary subtype 3 fields as
   :class:`~bson.binary.Binary` objects of subtype 3 (instead of
   :class:`uuid.UUID`), and each of the ``*_LEGACY`` representations
   decodes Binary subtype 4 fields to :class:`~bson.binary.Binary` objects of
   subtype 4 (instead of :class:`uuid.UUID`).

.. _configuring-uuid-representation:

Configuring a UUID Representation
---------------------------------

Users can work around the problems described above by configuring their
applications with the appropriate :class:`~bson.binary.UuidRepresentation`.
Configuring the representation modifies PyMongo's behavior while
encoding :class:`uuid.UUID` objects to BSON and decoding
Binary subtype 3 and 4 fields from BSON.

Applications can set the UUID representation in one of the following ways:

#. At the ``MongoClient`` level using the ``uuidRepresentation`` URI option,
   e.g.::

     client = MongoClient("mongodb://a:27017/?uuidRepresentation=standard")

   Valid values are:

   .. list-table::
      :header-rows: 1

      * - Value
        - UUID Representation

      * - ``unspecified``
        - :ref:`unspecified-representation-details`

      * - ``standard``
        - :ref:`standard-representation-details`

      * - ``pythonLegacy``
        - :ref:`python-legacy-representation-details`

      * - ``javaLegacy``
        - :ref:`java-legacy-representation-details`

      * - ``csharpLegacy``
        - :ref:`csharp-legacy-representation-details`

#. At the ``MongoClient`` level using the ``uuidRepresentation`` kwarg
   option, e.g.::

     from bson.binary import UuidRepresentation
     client = MongoClient(uuidRepresentation=UuidRepresentation.STANDARD)

#. At the ``Database`` or ``Collection`` level by supplying a suitable
   :class:`~bson.codec_options.CodecOptions` instance, e.g.::

     from bson.codec_options import CodecOptions
     csharp_opts = CodecOptions(uuid_representation=UuidRepresentation.CSHARP_LEGACY)
     java_opts = CodecOptions(uuid_representation=UuidRepresentation.JAVA_LEGACY)

     # Get database/collection from client with csharpLegacy UUID representation
     csharp_database = client.get_database('csharp_db', codec_options=csharp_opts)
     csharp_collection = client.testdb.get_collection('csharp_coll', codec_options=csharp_opts)

     # Get database/collection from existing database/collection with javaLegacy UUID representation
     java_database = csharp_database.with_options(codec_options=java_opts)
     java_collection = csharp_collection.with_options(codec_options=java_opts)

Supported UUID Representations
------------------------------

.. list-table::
   :header-rows: 1

   * - UUID Representation
     - Default?
     - Encode :class:`uuid.UUID` to
     - Decode :class:`~bson.binary.Binary` subtype 4 to
     - Decode :class:`~bson.binary.Binary` subtype 3 to

   * - :ref:`standard-representation-details`
     - No
     - :class:`~bson.binary.Binary` subtype 4
     - :class:`uuid.UUID`
     - :class:`~bson.binary.Binary` subtype 3

   * - :ref:`unspecified-representation-details`
     - Yes, in PyMongo>=4
     - Raise :exc:`ValueError`
     - :class:`~bson.binary.Binary` subtype 4
     - :class:`~bson.binary.Binary` subtype 3

   * - :ref:`python-legacy-representation-details`
     - No
     - :class:`~bson.binary.Binary` subtype 3 with standard byte-order
     - :class:`~bson.binary.Binary` subtype 4
     - :class:`uuid.UUID`

   * - :ref:`java-legacy-representation-details`
     - No
     - :class:`~bson.binary.Binary` subtype 3 with Java legacy byte-order
     - :class:`~bson.binary.Binary` subtype 4
     - :class:`uuid.UUID`

   * - :ref:`csharp-legacy-representation-details`
     - No
     - :class:`~bson.binary.Binary` subtype 3 with C# legacy byte-order
     - :class:`~bson.binary.Binary` subtype 4
     - :class:`uuid.UUID`

We now detail the behavior and use-case for each supported UUID
representation.

.. _unspecified-representation-details:

``UNSPECIFIED``
^^^^^^^^^^^^^^^

.. attention:: Starting in PyMongo 4.0,
   :data:`~bson.binary.UuidRepresentation.UNSPECIFIED` is the default
   UUID representation used by PyMongo.

The :data:`~bson.binary.UuidRepresentation.UNSPECIFIED` representation
prevents the incorrect interpretation of UUID bytes by stopping short of
automatically converting UUID fields in BSON to native UUID types. Decoding
a UUID when using this representation returns a :class:`~bson.binary.Binary`
object instead. If required, users can coerce the decoded
:class:`~bson.binary.Binary` objects into native UUIDs using the
:meth:`~bson.binary.Binary.as_uuid` method and specifying the appropriate
representation format. The following example shows
what this might look like for a UUID stored by the C# driver::

  from bson.codec_options import CodecOptions, DEFAULT_CODEC_OPTIONS
  from bson.binary import Binary, UuidRepresentation
  from uuid import uuid4

  # Using UuidRepresentation.CSHARP_LEGACY
  csharp_opts = CodecOptions(uuid_representation=UuidRepresentation.CSHARP_LEGACY)

  # Store a legacy C#-formatted UUID
  input_uuid = uuid4()
  collection = client.testdb.get_collection('test', codec_options=csharp_opts)
  collection.insert_one({'_id': 'foo', 'uuid': input_uuid})

  # Using UuidRepresentation.UNSPECIFIED
  unspec_opts = CodecOptions(uuid_representation=UuidRepresentation.UNSPECIFIED)
  unspec_collection = client.testdb.get_collection('test', codec_options=unspec_opts)

  # UUID fields are decoded as Binary when UuidRepresentation.UNSPECIFIED is configured
  document = unspec_collection.find_one({'_id': 'foo'})
  decoded_field = document['uuid']
  assert isinstance(decoded_field, Binary)

  # Binary.as_uuid() can be used to coerce the decoded value to a native UUID
  decoded_uuid = decoded_field.as_uuid(UuidRepresentation.CSHARP_LEGACY)
  assert decoded_uuid == input_uuid

Native :class:`uuid.UUID` objects cannot directly be encoded to
:class:`~bson.binary.Binary` when the UUID representation is ``UNSPECIFIED``
and attempting to do so will result in an exception::

  unspec_collection.insert_one({'_id': 'bar', 'uuid': uuid4()})
  Traceback (most recent call last):
  ...
  ValueError: cannot encode native uuid.UUID with UuidRepresentation.UNSPECIFIED. UUIDs can be manually converted to bson.Binary instances using bson.Binary.from_uuid() or a different UuidRepresentation can be configured. See the documentation for UuidRepresentation for more information.

Instead, applications using :data:`~bson.binary.UuidRepresentation.UNSPECIFIED`
must explicitly coerce a native UUID using the
:meth:`~bson.binary.Binary.from_uuid` method::

  explicit_binary = Binary.from_uuid(uuid4(), UuidRepresentation.STANDARD)
  unspec_collection.insert_one({'_id': 'bar', 'uuid': explicit_binary})

.. _standard-representation-details:

``STANDARD``
^^^^^^^^^^^^

.. attention:: This UUID representation should be used by new applications or
   applications that are encoding and/or decoding UUIDs in MongoDB for the
   first time.

The :data:`~bson.binary.UuidRepresentation.STANDARD` representation
enables cross-language compatibility by ensuring the same byte-ordering
when encoding UUIDs from all drivers. UUIDs written by a driver with this
representation configured will be handled correctly by every other driver,
provided it is also configured with the ``STANDARD`` representation.

``STANDARD`` encodes native :class:`uuid.UUID` objects to
:class:`~bson.binary.Binary` subtype 4 objects.
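
As a small illustration, the encoding can be checked directly with
:meth:`~bson.binary.Binary.from_uuid`, without a running server. This sketch
assumes that the ``bson`` package that ships with PyMongo is importable::

  from uuid import UUID
  from bson.binary import Binary, UuidRepresentation

  original = UUID('00112233-4455-6677-8899-aabbccddeeff')

  # STANDARD keeps the RFC 4122 byte-order and tags the value as subtype 4.
  encoded = Binary.from_uuid(original, UuidRepresentation.STANDARD)
  assert encoded.subtype == 4
  assert bytes(encoded) == original.bytes

  # Decoding with the same representation restores the native UUID.
  decoded = encoded.as_uuid(UuidRepresentation.STANDARD)
  assert decoded == original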

.. _python-legacy-representation-details:

``PYTHON_LEGACY``
^^^^^^^^^^^^^^^^^

.. attention:: This UUID representation should be used when reading UUIDs
   generated by existing applications that use the Python driver
   but **don't** explicitly set a UUID representation.

.. attention:: :data:`~bson.binary.UuidRepresentation.PYTHON_LEGACY`
   was the default UUID representation in PyMongo 3.

The :data:`~bson.binary.UuidRepresentation.PYTHON_LEGACY` representation
corresponds to the legacy representation of UUIDs used by PyMongo. This
representation conforms with
`RFC 4122 Section 4.1.2 <https://tools.ietf.org/html/rfc4122#section-4.1.2>`_.

The following example illustrates the use of this representation::

  from bson.codec_options import CodecOptions, DEFAULT_CODEC_OPTIONS
  from bson.binary import Binary, UuidRepresentation
  from uuid import uuid4

  # No configured UUID representation
  collection = client.python_legacy.get_collection('test', codec_options=DEFAULT_CODEC_OPTIONS)

  # Using UuidRepresentation.PYTHON_LEGACY
  pylegacy_opts = CodecOptions(uuid_representation=UuidRepresentation.PYTHON_LEGACY)
  pylegacy_collection = client.python_legacy.get_collection('test', codec_options=pylegacy_opts)

  # UUIDs written by PyMongo 3 with no UuidRepresentation configured
  # (or PyMongo 4.0 with PYTHON_LEGACY) can be queried using PYTHON_LEGACY
  uuid_1 = uuid4()
  pylegacy_collection.insert_one({'uuid': uuid_1})
  document = pylegacy_collection.find_one({'uuid': uuid_1})

``PYTHON_LEGACY`` encodes native :class:`uuid.UUID` objects to
:class:`~bson.binary.Binary` subtype 3 objects, preserving the same
byte-order as :attr:`~uuid.UUID.bytes`::

  from bson.binary import Binary

  # Query with the raw subtype-3 bytes of the UUID inserted above
  document = pylegacy_collection.find_one({'uuid': Binary(uuid_1.bytes, subtype=3)})
  assert document['uuid'] == uuid_1

.. _java-legacy-representation-details:

``JAVA_LEGACY``
^^^^^^^^^^^^^^^

.. attention:: This UUID representation should be used when reading UUIDs
   written to MongoDB by legacy applications (i.e. applications that don't
   use the ``STANDARD`` representation) using the Java driver.

The :data:`~bson.binary.UuidRepresentation.JAVA_LEGACY` representation
corresponds to the legacy representation of UUIDs used by the MongoDB Java
Driver.

.. note:: The ``JAVA_LEGACY`` representation reverses the order of bytes 0-7,
   and bytes 8-15.

As an example, consider the same UUID described in :ref:`example-legacy-uuid`.
Let us assume that an application used the Java driver without an explicitly
specified UUID representation to insert the example UUID
``00112233-4455-6677-8899-aabbccddeeff`` into MongoDB. If we try to read this
value using ``PYTHON_LEGACY``, we end up with an entirely different UUID::

  UUID('77665544-3322-1100-ffee-ddccbbaa9988')

However, if we explicitly set the representation to
:data:`~bson.binary.UuidRepresentation.JAVA_LEGACY`, we get the correct result::

  UUID('00112233-4455-6677-8899-aabbccddeeff')

PyMongo uses the specified UUID representation to reorder the BSON bytes and
load them correctly. ``JAVA_LEGACY`` encodes native :class:`uuid.UUID` objects
to :class:`~bson.binary.Binary` subtype 3 objects, while performing the same
byte-reordering as the legacy Java driver's UUID to BSON encoder.
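
The two reads described above can be reproduced without a server by using
:meth:`~bson.binary.Binary.from_uuid` and :meth:`~bson.binary.Binary.as_uuid`.
This is a minimal sketch that assumes the ``bson`` package that ships with
PyMongo is importable; the variable names are illustrative::

  from uuid import UUID
  from bson.binary import Binary, UuidRepresentation

  original = UUID('00112233-4455-6677-8899-aabbccddeeff')

  # Simulate the value a legacy Java application would have stored.
  stored = Binary.from_uuid(original, UuidRepresentation.JAVA_LEGACY)
  assert stored.subtype == 3

  # Interpreting the bytes as PYTHON_LEGACY yields a different UUID.
  misread = stored.as_uuid(UuidRepresentation.PYTHON_LEGACY)
  assert misread == UUID('77665544-3322-1100-ffee-ddccbbaa9988')

  # Interpreting them as JAVA_LEGACY restores the original UUID.
  correct = stored.as_uuid(UuidRepresentation.JAVA_LEGACY)
  assert correct == original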

.. _csharp-legacy-representation-details:

``CSHARP_LEGACY``
^^^^^^^^^^^^^^^^^

.. attention:: This UUID representation should be used when reading UUIDs
   written to MongoDB by legacy applications (i.e. applications that don't
   use the ``STANDARD`` representation) using the C# driver.

The :data:`~bson.binary.UuidRepresentation.CSHARP_LEGACY` representation
corresponds to the legacy representation of UUIDs used by the MongoDB C#
Driver.

.. note:: The ``CSHARP_LEGACY`` representation reverses the order of bytes 0-3,
   bytes 4-5, and bytes 6-7.

As an example, consider the same UUID described in :ref:`example-legacy-uuid`.
Let us assume that an application used the C# driver without an explicitly
specified UUID representation to insert the example UUID
``00112233-4455-6677-8899-aabbccddeeff`` into MongoDB. If we try to read this
value using ``PYTHON_LEGACY``, we end up with an entirely different UUID::

  UUID('33221100-5544-7766-8899-aabbccddeeff')

However, if we explicitly set the representation to
:data:`~bson.binary.UuidRepresentation.CSHARP_LEGACY`, we get the correct result::

  UUID('00112233-4455-6677-8899-aabbccddeeff')

PyMongo uses the specified UUID representation to reorder the BSON bytes and
load them correctly. ``CSHARP_LEGACY`` encodes native :class:`uuid.UUID`
objects to :class:`~bson.binary.Binary` subtype 3 objects, while performing
the same byte-reordering as the legacy C# driver's UUID to BSON encoder.
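
The reordering performed by ``CSHARP_LEGACY`` matches the little-endian layout
exposed by :attr:`uuid.UUID.bytes_le` in the standard library, so its effect
can be sketched without PyMongo. The variable names below are illustrative
only::

  from uuid import UUID

  original = UUID('00112233-4455-6677-8899-aabbccddeeff')

  # Reverse bytes 0-3, 4-5, and 6-7 by hand; bytes 8-15 are left as they are.
  b = original.bytes
  reordered = b[0:4][::-1] + b[4:6][::-1] + b[6:8][::-1] + b[8:16]
  assert reordered == original.bytes_le

  # Reading those bytes in Python byte-order gives the "different" UUID ...
  misread = UUID(bytes=reordered)
  # ... while reading them as little-endian fields restores the original.
  correct = UUID(bytes_le=reordered)

  assert misread == UUID('33221100-5544-7766-8899-aabbccddeeff')
  assert correct == original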
