How to handle schema evolution and versioning | DoingCloudStuff


author: Vincent Chan

The problem

Suppose you're designing an AWS Lambda function to process incoming data. And, for simplicity, let's just say that all your Lambda function needs to do is make a few pre-defined calculations from the input data (for example, the average price of the houses listed).

Straightforward enough.

Now, suppose you don't actually know the format of the incoming data.

Or, perhaps there are multiple formats.

Or, perhaps the formatting continues to be worked on and your boss isn't sure how it'll look yet.

Or, perhaps the input data comes from a myriad of different sources and those sources want to be able to update the schema to their liking without having to get your (or anyone else's) approval first.

This problem is what I am calling a schema evolution problem, and in this post I wish to propose a solution to it.

A brief description of the solution

In brief, the solution that I eventually came up with has three main components:

  1. push the responsibility of describing how to properly parse a certain data format to the ones who created/decided on said data format,
  2. require that each data format carries a unique version identifier, and
  3. make use of the Abstract Factory Pattern to help avoid having to explicitly specify the form the parsed data will take.

Hopefully, you were able to understand what I meant by those three points.

If so, then there isn't anything in the rest of this blog post that you didn't already surmise.

However, in case any part of that wasn't clear (and my sentence structures can get a bit convoluted), let me try explaining what I meant.

Detailed explanation

Abstract factory pattern

In the computer science classic, "Design Patterns: Elements of Reusable Object-Oriented Software," authors Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides provided a list of 23 design patterns, programming solutions to common problems via object-oriented programming.

The Abstract Factory Pattern is the name of a creational pattern from that famous book.

Allow me to overly simplify things (which is how I go about understanding things). The idea is that, if you don't know until runtime what class you should use to define your object, then make use of an interface or base class so that you may defer the decision until the necessary data are known.

An example for the Abstract Factory Pattern

Suppose you're a cook at a restaurant in charge of one and only one thing: putting prepared food into the oven at the correct temperature and taking it out after the specified time interval.

Then, for you, any and all food objects should contain two pieces of information: cook time and cook temperature.

So, suppose you create a Pizza object and a Cake object:

Pizza object
class Pizza():
    def __init__(self):
        self.cook_time = 10          # minutes
        self.cook_temperature = 750  # degrees Fahrenheit
Cake object
class Cake():
    def __init__(self):
        self.cook_time = 35          # minutes
        self.cook_temperature = 350  # degrees Fahrenheit

Then, it doesn't matter to you which object was passed to you. The object's cook time will always be retrievable as x.cook_time. And, similarly, the cook temperature can always be retrieved as x.cook_temperature.

Of course, type hinting these two distinct objects in Python can be annoying (since they are separate classes). However, if you want, you can always define a third class that encapsulates one of the two defined above.

For example,

Food object
class Food:
    def __init__(self, name: str):
        if name.lower() == "pizza":
            self.object = Pizza()
        elif name.lower() == "cake":
            self.object = Cake()
        else:
            raise ValueError("Food name must be either 'pizza' or 'cake'.")

    @property
    def cook_time(self):
        return self.object.cook_time

    @property
    def cook_temperature(self):
        return self.object.cook_temperature

Now, pizzas can be instantiated using

pizza = Food("pizza")

and cakes can be instantiated via

cake = Food("cake")

And if you wanna know when to take the food out of the oven, you can write the function as

from datetime import datetime, timedelta

def end_time(food: Food, start_time: datetime) -> datetime:
    cook_time_in_seconds = 60 * food.cook_time
    return start_time + timedelta(seconds=cook_time_in_seconds)
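Putting those pieces together, here's a runnable sketch of the same idea using a small registry dict instead of an if/elif chain (the names `FOODS` and `make_food` are my own, not from the pattern itself):

```python
from datetime import datetime, timedelta

class Pizza:
    cook_time = 10          # minutes
    cook_temperature = 750  # degrees Fahrenheit

class Cake:
    cook_time = 35
    cook_temperature = 350

# Registry mapping a runtime name to a concrete class.
FOODS = {"pizza": Pizza, "cake": Cake}

def make_food(name: str):
    """Factory: pick the concrete class at runtime from its name."""
    try:
        return FOODS[name.lower()]()
    except KeyError:
        raise ValueError("Food name must be either 'pizza' or 'cake'.")

def end_time(food, start_time: datetime) -> datetime:
    # Works for any object exposing cook_time (duck typing).
    return start_time + timedelta(minutes=food.cook_time)

pizza = make_food("pizza")
print(end_time(pizza, datetime(2024, 1, 1, 12, 0)))  # 2024-01-01 12:10:00
```

The registry makes adding a new food a one-line change, which is exactly the property we'll want when schemas keep evolving.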

How it relates to our problem at hand

For the cook, so long as she is provided with

  1. the name of the food object (either pizza or cake in our example above) and
  2. a class with certain needed methods defined (cook_time and cook_temperature in our example above),

she is able to do her job.

Applying this idea to our problem of the ever-changing schema, we see that the problem is essentially solved so long as we

  1. have some way of uniquely identifying the schema format (say, a version identifier) and
  2. have a class (or something like one) with certain required functions / methods implemented.
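As a rough sketch of what point 2 could look like in Python, here is a base class naming the required methods, plus one version-specific implementation (the field paths inside it are invented for illustration; the method name mirrors the handler shown later in this post):

```python
from abc import ABC, abstractmethod

class BaseParser(ABC):
    """The contract every schema version's parser must satisfy."""

    def __init__(self, data: dict):
        self.data = data

    @abstractmethod
    def loan_amount(self) -> float:
        """Return the loan amount, however this version's schema encodes it."""

# One version-specific implementation (field names here are made up):
class Parser(BaseParser):
    def loan_amount(self) -> float:
        return float(self.data["loan"]["amount"])
```

Each schema version ships its own `Parser` subclass; the calling code only ever touches the methods declared on the base class.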

Wrapping up the proposed solution

Unfortunately, "essentially solved" in our case does not actually mean "solved." We still need a way for people, specifically the ones choosing / creating the format of the data we are supposed to process, to give us the schema format's version ID.

This can be done in one of two ways:

  1. allow them to hand you their data format and versioning ID or
  2. provide an API for them to submit their data format to (and have your API return a unique versioning ID).
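For the second option, one possible scheme (an assumption on my part, not the only way) is to derive the versioning ID deterministically from the schema itself, so resubmitting the same format always yields the same ID:

```python
import hashlib
import json

def version_id(schema: dict) -> str:
    """Derive a deterministic version ID from the schema itself.

    One possible scheme; a database-issued counter or UUID works too.
    """
    # Canonicalize so key order doesn't change the ID.
    canonical = json.dumps(schema, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]
```

A counter or UUID issued by the API is just as valid; the hash approach merely saves you a round of deduplication.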

Applying the proposed solution to a toy problem

Suppose I have the following:

  • an S3 bucket containing the various Parser class definitions (one .py file per schema version, each containing a different definition of the Parser data class),
  • a table that allows me to map the data's version to the proper filename
  • a server that figures out the proper file to return to me and returns it (I assume API Gateway + Lambda in my example below)
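The lookup server in the last bullet could, as a rough sketch, be a Lambda like this (the table contents and names are hypothetical; the `{"Items": [...]}` response shape matches what `download_parser` below expects):

```python
# Hypothetical version -> S3 location mapping; in practice this
# would live in a database table, not a hard-coded dict.
PARSER_TABLE = {
    "v1": {"bucket": "my-parser-bucket", "key": "parsers/v1.py"},
    "v2": {"bucket": "my-parser-bucket", "key": "parsers/v2.py"},
}

def lookup_handler(event: dict, context=None) -> dict:
    """Return the S3 location of the parser file for a given version."""
    version = event["pathParameters"]["file"]
    return {"Items": [PARSER_TABLE[version]]}
```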

We can then define a download_parser function as follows.

import json
import logging

import boto3
import requests

logger = logging.getLogger(__name__)
s3 = boto3.client("s3")

def download_parser(file: str) -> bool:
    """Download the parser python file from S3 into /tmp."""
    url = f"{Env.apigw}/parsers/{file}"  # Env.apigw is configured elsewhere
    response = requests.get(url).json()
    logger.info("response: %s", json.dumps(response, default=str))
    result = response["Items"][0]

    bucket_name = result["bucket"]
    key = result["key"]

    try:
        # Save as parser.py so the handler can later `from parser import Parser`
        s3.download_file(bucket_name, key, "/tmp/parser.py")
        return True
    except Exception:
        logger.exception("Failed to download s3://%s/%s", bucket_name, key)
        return False

Then, in the main function, we would

  1. parse the incoming data for the version,
  2. download the appropriate Parser class implementation,
  3. remove any already-imported Parser class (else Python would ignore your later import statement),
  4. import the Parser class, and
  5. do whatever calculation it is that you wanna do.

Here's an example lambda_handler function:

def handler(data: dict, context):
    logger.info("Event: %s", json.dumps(data, default=str))

    # remove parser if previously imported
    if "parser" in sys.modules:
        del sys.modules["parser"]

    # Parse out the appropriate information
    parser_name = get_parser_name(data)

    # Download parser from S3 into /tmp
    success = download_parser(parser_name)
    if not success:
        raise Exception(f"Failed to download parser {parser_name}")

    # Import Parser dataclass from /tmp
    sys.path.insert(0, os.path.abspath("/tmp"))
    from parser import Parser

    # Instantiate dataclass
    parser = Parser(data=data)

    # Log info
    logger.info("First name: %s", parser.first_name())
    logger.info("Last name: %s", parser.last_name())
    logger.info("Age: %s", parser.age())
    logger.info("Loan amount: %s", parser.loan_amount())