Suppose you are asked to design an AWS Lambda function for processing incoming data. And, for simplicity, let's just say that all your Lambda function needs to do is make a few predefined calculations from the input data (for example, an average over the houses listed).
Straightforward enough.
Now, suppose you don't actually know the format of the incoming data.
Or, perhaps there are multiple formats.
Or, perhaps the formatting continues to be worked on and your boss isn't sure how it'll look yet.
Or, perhaps the input data comes from a myriad of different sources and those sources want to be able to update the schema to their liking without having to get your (or anyone else's) approval first.
This problem is what I am calling a schema evolution problem and, in this post, I wish to propose a solution to it.
In brief, the solution that I eventually came up with has three main components:
Hopefully, you were able to understand what I meant by those three points.
If so, then there isn't anything in the rest of this blog post that you didn't already surmise.
However, in case any part of that wasn't clear (and my sentence structures can get a bit convoluted), let me try explaining what I meant.
In the computer science classic, "Design Patterns: Elements of Reusable Object-Oriented Software," authors Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides provided a list of 23 design patterns, programming solutions to common problems via object-oriented programming.
The Abstract Factory Pattern is the name of a creational pattern from that famous book.
Allow me to oversimplify things (which is how I go about understanding things). The idea is that, if you don't know until runtime what class you should use to define your object, then make use of an interface or base class so that you may defer the decision until the necessary data are known.
Suppose you're a cook at a restaurant in charge of one and only one thing: put prepared food into the oven at the correct temperature and take it out after the specified time interval.
Then, for you, any and all food objects should contain two pieces of information: cook time and cook temperature.
So, suppose you create a Pizza object and a Cake object:
class Pizza:
    def __init__(self):
        self.cook_time = 10
        self.cook_temperature = 750

class Cake:
    def __init__(self):
        self.cook_time = 35
        self.cook_temperature = 350
Then, it doesn't matter to you which object was passed to you. The object's cook time will always be retrievable as x.cook_time.
And, similarly, the cook temperature can always be retrieved as x.cook_temperature.
Of course, how to type hint these two distinct objects in Python can be annoying (since they are separate objects). However, if you want, you can always define a third object that kinda encapsulates one of the two defined above.
For example,
class Food:
    def __init__(self, name: str):
        if name.lower() == "pizza":
            self.object = Pizza()
        elif name.lower() == "cake":
            self.object = Cake()
        else:
            raise ValueError("Food name must be either 'pizza' or 'cake.'")

    @property
    def cook_time(self):
        return self.object.cook_time

    @property
    def cook_temperature(self):
        return self.object.cook_temperature
Now, pizzas can be instantiated using
pizza = Food("pizza")
and cakes can be instantiated via
cake = Food("cake")
And suppose you wanna know when to take the food out of the oven, then you can write the function as
from datetime import datetime, timedelta

def end_time(food: Food, start_time: datetime) -> datetime:
    cook_time_in_seconds = 60 * food.cook_time
    return start_time + timedelta(seconds=cook_time_in_seconds)
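The type-hinting annoyance mentioned above can also be sidestepped without a wrapper class. What follows is only a sketch of one alternative, not the approach used in the rest of this post. It assumes Python 3.8+ (for typing.Protocol), and the names Cookable, FOOD_CLASSES, and make_food are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, Protocol, Type


class Cookable(Protocol):
    """Anything exposing a cook time (in minutes) and a cook temperature."""
    cook_time: int
    cook_temperature: int


class Pizza:
    def __init__(self):
        self.cook_time = 10
        self.cook_temperature = 750


class Cake:
    def __init__(self):
        self.cook_time = 35
        self.cook_temperature = 350


# Hypothetical registry: maps a name known only at runtime to a class.
FOOD_CLASSES: Dict[str, Type] = {"pizza": Pizza, "cake": Cake}


def make_food(name: str) -> Cookable:
    """Pick the class at runtime; the caller only sees the Cookable interface."""
    try:
        return FOOD_CLASSES[name.lower()]()
    except KeyError:
        raise ValueError(f"Unknown food: {name!r}") from None


def end_time(food: Cookable, start_time: datetime) -> datetime:
    return start_time + timedelta(minutes=food.cook_time)


start = datetime(2024, 1, 1, 12, 0)
print(end_time(make_food("pizza"), start))  # 2024-01-01 12:10:00
print(end_time(make_food("cake"), start))   # 2024-01-01 12:35:00
```

Adding a new food then means adding one class and one registry entry; end_time does not change.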
For the cook, so long as she is provided with an object (pizza or cake in our example above) and its expected attributes (cook_time and cook_temperature in our example above), she is able to do her job.
Applying this idea to our problem of the ever-changing schema, we realize that we can solve the problem so long as we
then the problem is not a problem. It's essentially solved.
Unfortunately, "essentially solved" in our case does not actually mean "solved." We still require a way for people to give us the schema format versioning, especially the ones who are choosing / creating the format of the data that we are supposed to process.
This can be done in one of two ways:
Suppose I have the following:
Parser class definitions (e.g., /parser_1.py, /parser_2.py, and so on, with each .py file containing a different definition of the data class Parser).
We can then define a download_parser function as follows.
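import json
import logging
import os

import boto3
import requests

logger = logging.getLogger(__name__)
s3 = boto3.client("s3")  # assumed: `s3` is a boto3 S3 client
# `Env` is assumed to be a settings object defined elsewhere (it holds the API URL).

def download_parser(file: str) -> bool:
    """Download the parser python file from S3 into /tmp."""
    url = f"{Env.apigw}/parsers/{file}"
    response = requests.get(url).json()
    logger.info("response: %s", json.dumps(response, default=str))
    result = response["Items"][0]
    bucket_name = result["bucket"]
    key = result["key"]
    if os.path.exists("/tmp/parser.py"):
        os.remove("/tmp/parser.py")
    try:
        s3.download_file(bucket_name, key, "/tmp/parser.py")
    except Exception:
        logger.exception("Could not download s3://%s/%s", bucket_name, key)
        return False
    return True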
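For concreteness, here is a sketch of what two of those parser files might contain. The payload layouts and field names are invented for illustration; the only thing taken from this post is the interface (first_name, last_name, age, and loan_amount). In the real files each class would simply be called Parser; they are named ParserV1 and ParserV2 here only so that both fit in one snippet.

```python
from dataclasses import dataclass


@dataclass
class ParserV1:
    """Would live in parser_1.py as `Parser`; assumes a flat payload."""
    data: dict

    def first_name(self) -> str:
        return self.data["first_name"]

    def last_name(self) -> str:
        return self.data["last_name"]

    def age(self) -> int:
        return int(self.data["age"])

    def loan_amount(self) -> float:
        return float(self.data["loan_amount"])


@dataclass
class ParserV2:
    """Would live in parser_2.py as `Parser`; assumes a nested payload."""
    data: dict

    def first_name(self) -> str:
        return self.data["applicant"]["name"].split()[0]

    def last_name(self) -> str:
        return self.data["applicant"]["name"].split()[-1]

    def age(self) -> int:
        return int(self.data["applicant"]["age"])

    def loan_amount(self) -> float:
        return float(self.data["loan"]["amount"])


# Two hypothetical payloads describing the same applicant in different schemas.
flat = {"first_name": "Ada", "last_name": "Lovelace", "age": 36, "loan_amount": "2500"}
nested = {"applicant": {"name": "Ada Lovelace", "age": "36"}, "loan": {"amount": 2500}}

for parser in (ParserV1(data=flat), ParserV2(data=nested)):
    print(parser.first_name(), parser.last_name(), parser.age(), parser.loan_amount())
# Both iterations print: Ada Lovelace 36 2500.0
```

The code that consumes the parser never needs to know which schema it was handed, which is the same position the cook was in with the pizza and the cake.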
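Then, in the main function, we would download the appropriate Parser class implementation, remove any previously imported Parser class (else Python would ignore your later import statement), import the new Parser class, and use it to read the fields we need.
Here's an example Lambda handler function: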
import sys

def handler(data: dict, context):
    logger.info("Event: %s", json.dumps(data, default=str))
    # Remove parser if previously imported
    if "parser" in sys.modules:
        del sys.modules["parser"]
    # Parse out the appropriate information
    parser_name = get_parser_name(data)
    # Download parser from S3 into /tmp
    success = download_parser(parser_name)
    if not success:
        raise RuntimeError(f"Failed to download parser {parser_name}")
    # Import Parser dataclass from /tmp (add /tmp to the path only once)
    if "/tmp" not in sys.path:
        sys.path.insert(0, "/tmp")
    from parser import Parser
    # Instantiate dataclass
    parser = Parser(data=data)
    # Print info
    logger.info("First name: %s", parser.first_name())
    logger.info("Last name: %s", parser.last_name())
    logger.info("Age: %s", parser.age())
    logger.info("Loan amount: %s", parser.loan_amount())
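As an aside, deleting the entry from sys.modules and editing sys.path is not the only way to pick up a freshly downloaded file. The sketch below loads the class straight from a file path with importlib instead. It is an alternative I am offering under assumptions, not what the handler above does: load_parser_class and the module name downloaded_parser are hypothetical, and the demo writes throwaway files to a temporary directory rather than downloading anything from S3.

```python
import importlib.util
import os
import tempfile


def load_parser_class(path: str):
    """Load the Parser class from the file at `path` without touching sys.path."""
    # A private module name avoids clashing with any other module called "parser".
    spec = importlib.util.spec_from_file_location("downloaded_parser", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load a module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the file's top-level code
    return module.Parser


# Stand-in source for a parser file; {version} is filled in per file below.
TEMPLATE = (
    "class Parser:\n"
    "    def __init__(self, data):\n"
    "        self.data = data\n"
    "    def version(self):\n"
    "        return {version}\n"
)

with tempfile.TemporaryDirectory() as tmp:
    for version in (1, 2):
        path = os.path.join(tmp, f"parser_{version}.py")
        with open(path, "w") as f:
            f.write(TEMPLATE.format(version=version))
        Parser = load_parser_class(path)
        print(Parser(data={}).version())  # prints 1, then 2
```

Because the module is never registered in sys.modules, there is nothing to delete before the next invocation loads a different parser file.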