Embed data

When you start modeling data in Azure Cosmos DB, try to treat your entities as self-contained items represented as JSON documents.

For comparison, let's first see how we might model data in a relational database.

The following example shows how a person might be stored in a relational database.

The strategy, when working with relational databases, is to normalize all your data. Normalizing your data typically involves taking an entity, such as a person, and breaking it down into discrete components.

Now let's take a look at how we would model the same data as a self-contained entity in Azure Cosmos DB.

{
    "id": "1",
    "firstName": "Thomas",
    "lastName": "Andersen",
    "addresses": [
        {
            "line1": "100 Some Street",
            "line2": "Unit 1",
            "city": "Seattle",
            "state": "WA",
            "zip": 98012
        }
    ],
    "contactDetails": [
        {"email": "thomas@andersen.com"},
        {"phone": "+1 555 555-5555", "extension": 5555}
    ]
}

Using the approach above we've denormalized the person record, by embedding all the information related to this person, such as their contact details and addresses, into a single JSON document.

We're not confined to a fixed schema, so we have the flexibility to do things like having contact details of different shapes entirely.

Retrieving a complete person record from the database is now a single read operation against a single container and for a single item. Updating the contact details and addresses of a person record is also a single write operation against a single item.

When not to embed

While the rule of thumb in Azure Cosmos DB is to denormalize everything and embed all data into a single item, this can lead to some situations that should be avoided.

Take this JSON snippet.

{
    "id": "1",
    "name": "What's new in the coolest Cloud",
    "summary": "A blog post by someone real famous",
    "comments": [
        {"id": 1, "author": "anon", "comment": "something useful, I'm sure"},
        {"id": 2, "author": "bob", "comment": "wisdom from the interwebs"},
        …
        {"id": 100001, "author": "jane", "comment": "and on we go ..."},
        …
        {"id": 1000000001, "author": "angry", "comment": "blah angry blah angry"},
        …
        {"id": ∞ + 1, "author": "bored", "comment": "oh man, will this ever end?"},
    ]
}

This might be what a post entity with embedded comments would look like if we were modeling a typical blog, or CMS, system. The problem with this example is that the comments array is unbounded, meaning that there's no (practical) limit to the number of comments any single post can have. This may become a problem as the size of the item could grow infinitely large so is a design you should avoid.

As the size of the item grows the ability to transmit the data over the wire and reading and updating the item, at scale, will be impacted.

In this case, it would be better to consider the following data model.

Post item:
{
    "id": "1",
    "name": "What's new in the coolest Cloud",
    "summary": "A blog post by someone real famous",
    "recentComments": [
        {"id": 1, "author": "anon", "comment": "something useful, I'm sure"},
        {"id": 2, "author": "bob", "comment": "wisdom from the interwebs"},
        {"id": 3, "author": "jane", "comment": "....."}
    ]
}

Comment items:
[
    {"id": 4, "postId": "1", "author": "anon", "comment": "more goodness"},
    {"id": 5, "postId": "1", "author": "bob", "comment": "tails from the field"},
    ...
    {"id": 99, "postId": "1", "author": "angry", "comment": "blah angry blah angry"},
    {"id": 100, "postId": "2", "author": "anon", "comment": "yet more"},
    ...
    {"id": 199, "postId": "2", "author": "bored", "comment": "will this ever end?"}   
]

This model has a document for each comment with a property that contains the post identifier. This allows posts to contain any number of comments and can grow efficiently. Users wanting to see more than the most recent comments would query this container passing the postId, which should be the partition key for the comments container.

Another case

Another case where embedding data isn't a good idea is when the embedded data is used often across items and will change frequently.

Take this JSON snippet.

{
    "id": "1",
    "firstName": "Thomas",
    "lastName": "Andersen",
    "holdings": [
        {
            "numberHeld": 100,
            "stock": { "symbol": "zbzb", "open": 1, "high": 2, "low": 0.5 }
        },
        {
            "numberHeld": 50,
            "stock": { "symbol": "xcxc", "open": 89, "high": 93.24, "low": 88.87 }
        }
    ]
}

This could represent a person's stock portfolio. We have chosen to embed the stock information into each portfolio document. In an environment where related data is changing frequently, like a stock trading application, embedding data that changes frequently is going to mean that you're constantly updating each portfolio document every time a stock is traded.

Stock zbzb may be traded many hundreds of times in a single day and thousands of users could have zbzb on their portfolio. With a data model like the above we would have to update many thousands of portfolio documents many times every day leading to a system that won't scale well.

So, in such cases it is best to reference data.

PreviousData modeling NextReference data

Last updated 1 year ago