Validating your Data with Hamilton
Since we open-sourced Hamilton late last year, we have received an outpouring of community engagement and support. It turns out the problems we needed to solve at Stitch Fix were well aligned with the rest of the industry’s approach to transforming data!
With the goal of being a one-stop-shop for all dataflow-related needs, we have constantly received feedback and contributions on improving the product. The most consistent feedback we received was that, while the API for managing dataflows was clean and scalable, nothing in Hamilton ensured the robustness or quantitative stability of execution. If garbage data went in, dataflows broke or garbage data came out.
We have solved this in sf-hamilton>=1.9.0 with the introduction of two decorators. Our goal was to build an experience that covers the majority of use cases, while setting up extensions and integrations for future capabilities…
In case you aren’t familiar with the product, here’s a brief introduction to Hamilton. For more details, check out the initial introduction, as well as the update on scaling up dataflows.
Stitch Fix’s demand forecasting team manages an ever-expanding set of complex, highly-configurable forecasts. In the before-times, these pipelines took the form of an ancient codebase that continually manipulated the same dataframe to transform actuals data into predictions.
The systems that massaged features to produce the team’s forecasts were among the oldest at Stitch Fix. While they had been business-critical for years, the organic, sprawling nature of the code’s development meant that nobody fully understood the software in all its complexity.
Hamilton was born to enable this team to scale. At its core, it allows the manager of a dataflow to express every element as a simple Python function. Instead of writing code that looks like this:
df['c'] = df['b'] + df['a']
they would write the following:
def c(a: pd.Series, b: pd.Series) -> pd.Series:
    """Column c - this sums up a and b"""
    return a + b
The Hamilton framework compiles these functions to model a dataflow as nodes in a Directed Acyclic Graph (DAG). It uses the names of the parameters as references to upstream nodes, and the name of the function itself to specify the node. In this case, a and b are implied to be other nodes in the DAG (possibly defined by further upstream functions…).
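To make the name-based wiring concrete, here is a toy resolver. It is a sketch of the idea only, not how Hamilton itself is implemented: it uses plain integers instead of pandas Series, the helper `resolve` is hypothetical, and it does no cycle detection.

```python
import inspect


def a() -> int:
    """Node a: a constant, standing in for a loaded column."""
    return 1


def b() -> int:
    """Node b: another constant."""
    return 2


def c(a: int, b: int) -> int:
    """Node c: sums a and b. The parameter names point at upstream nodes."""
    return a + b


def resolve(name, functions, inputs=None, cache=None):
    """Compute node `name` by recursively resolving its parameter names.

    Hypothetical helper for illustration; `inputs` lets the caller supply
    values for nodes directly instead of computing them.
    """
    inputs = inputs or {}
    cache = {} if cache is None else cache
    if name in inputs:
        return inputs[name]
    if name not in cache:
        fn = functions[name]
        # Each parameter name is looked up as another node in the graph.
        kwargs = {
            param: resolve(param, functions, inputs, cache)
            for param in inspect.signature(fn).parameters
        }
        cache[name] = fn(**kwargs)
    return cache[name]


# The function name is the node name.
functions = {fn.__name__: fn for fn in (a, b, c)}

result = resolve("c", functions)                       # a and b computed upstream
overridden = resolve("c", functions, inputs={"a": 10})  # a supplied by the caller
```

Requesting `c` pulls in `a` and `b` automatically because of the parameter names; nothing else declares the dependency.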
This simple abstraction, powered by a variety of extensions in the framework, made it remarkably easy to work on these vast, complex feature pipelines without incurring additional cognitive burden. Their newly migrated codebase turned into a self-documenting feature store, and the forecasting team was able to scale and, more importantly, focus on building better models for the business.
While the abstraction that Hamilton provides is uniquely powerful, it initially did nothing to ensure that the data processed within Hamilton’s dataflow was what one expected. Data quality issues generally creep into dataflows abruptly or over time. For example, something may change in the inputs (e.g. a value stops appearing), or a change to the transformation logic may invalidate downstream assumptions (e.g. by allowing NaNs). While debugging such issues is more straightforward with Hamilton, we felt that we could provide the dataflow developer with a better experience. One should be able to specify, and have confidence, that each function produces sane results upon execution.
There are three popular open source frameworks that come to mind with respect to tabular data:
The first two are heavyweight frameworks that require a suite of services and databases to be spun up. The last one, Pandera, takes a much simpler approach. It does away with the need for managing a service, and instead focuses on defining simple expectations on dataframe-like objects. All three frameworks generally require you to think about data quality as an explicit step in your dataflow that you need to express.
While the above all provide powerful capabilities, we wanted to take a lighter-weight approach. Specifically, we believed:
- Most data validation checks do not need to rely on stored state (hence the success of Pandera).
- Actions upon validation failures should be simple (e.g. log a warning or fail out).
- Defining the tests together with the code ensures expectations and code are less likely to be out of sync and are self-documenting.
- Using version control to handle test expectations is the optimal way to manage the lifecycle of a dataflow.
With these principles we set out to extend the decorator framework in Hamilton to allow for simple tests and assertions. We ended up with the following…
import numpy as np
import pandas as pd
from hamilton.function_modifiers import check_output

@check_output(
    data_type=np.int64,
    data_in_range=(0,100),
    importance="warn",
)
def some_int_data_between_0_and_100(some_input: pd.Series) -> pd.Series:
    # Do some computation
    ...
And it’s as simple as that! Behind the scenes, this does a few things:
- Creates two additional nodes (data_type and data_in_range validators)
- Executes the nodes after running the function
- Logs a warning (as specified)
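The "run the function, then validate, then warn" behavior above can be sketched with a toy decorator. This is an illustration under stated assumptions, not Hamilton's implementation: `toy_check_output` and its `in_range` argument are hypothetical, it wraps the function inline rather than adding separate validator nodes to a DAG, and the `"fail"` importance level is an assumption of this sketch.

```python
import functools
import logging

logger = logging.getLogger("toy_data_quality")


def toy_check_output(in_range, importance="warn"):
    """Hypothetical decorator: check every returned value against a range."""
    low, high = in_range

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # 1. Run the decorated function first.
            result = fn(*args, **kwargs)
            # 2. Validate its output afterwards.
            bad = [v for v in result if not low <= v <= high]
            if bad:
                message = (
                    f"{fn.__name__}: {len(bad)} value(s) outside [{low}, {high}]"
                )
                # 3. Act on the failure according to `importance`.
                if importance == "fail":
                    raise ValueError(message)
                logger.warning(message)
            return result

        return wrapper

    return decorator


@toy_check_output(in_range=(0, 100), importance="warn")
def doubled_scores(raw):
    return [2 * r for r in raw]


@toy_check_output(in_range=(0, 100), importance="fail")
def strict_doubled_scores(raw):
    return [2 * r for r in raw]


values = doubled_scores([10, 40, 60])  # 120 is out of range: warns, still returns
```

With `importance="warn"` the dataflow keeps running and the problem is only logged; the stricter variant stops execution instead.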
By querying the final results of the DAG, one can retrieve and store the test results (see the documentation for an example). There are a variety of arguments available, each corresponding to a unique validator. You can explore them here.
Note, however, that these basic validation arguments are the tip of the iceberg when it comes to data quality. There is an extensive pluggable framework to enable integrations. Pandera even comes out of the box!
To take full advantage of Pandera, all one has to do is pass a schema into the check_output decorator:
import pandera as pa

@check_output(
    schema=pa.SeriesSchema(int, checks=pa.Check.in_range(0, 100)),
    importance="warn",
)
def some_int_data_between_0_and_100(some_input: pd.Series) -> pd.Series:
    # Do some computation
    ...
Now you can leverage the full power of Pandera and Hamilton, all with a simple validator.
For further reference, see the tutorial on GitBook and the more detailed documentation on GitHub.
Hamilton’s design for data quality is intended to be highly flexible. We wanted to solve 80% of the problem (run basic assertions on the output), while enabling future contributors to solve the more complex ones (running stateful computations against previous runs, for example).
Thus we built a framework for extensions. Anyone who wants more power out of their data validation can implement a subclass of DataValidator, and validate to their heart’s content. We leave this as an exercise for the reader, but have some exciting integrations planned soon…
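As a rough picture of what a custom validator might look like, here is a self-contained sketch. The class and method names (`ToyValidator`, `ToyResult`, `validate`) are hypothetical stand-ins chosen for this example and are not Hamilton's actual DataValidator interface; consult the documentation for the real abstract methods.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ToyResult:
    """Hypothetical result type: did the check pass, and why."""
    passes: bool
    message: str


class ToyValidator(ABC):
    """Hypothetical stand-in for a validator base class."""

    @abstractmethod
    def validate(self, data) -> ToyResult:
        """Inspect the data and report whether it is acceptable."""


class MaxFractionMissing(ToyValidator):
    """Passes when at most `max_fraction` of the values are None."""

    def __init__(self, max_fraction: float):
        self.max_fraction = max_fraction

    def validate(self, data) -> ToyResult:
        values = list(data)
        missing = sum(v is None for v in values)
        # Treat an empty input as having no missing values.
        fraction = missing / len(values) if values else 0.0
        return ToyResult(
            passes=fraction <= self.max_fraction,
            message=f"{missing} of {len(values)} values missing",
        )


validator = MaxFractionMissing(max_fraction=0.25)
ok = validator.validate([1, 2, None, 4])        # 25% missing: at the limit
bad = validator.validate([None, None, 3, 4])    # 50% missing: over the limit
```

Keeping the validator stateless, as here, matches the principle above that most checks do not need to rely on stored state.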
Data validation in Hamilton is still in its early stages. In the upcoming weeks, we’ll be working on (and would love contributions for):
- Configuration – allowing for fine-grained control over the behavior of specific checks. This will enable the user to turn on and off certain checks at runtime.
- Flexibility – enabling usage of more underlying data validation systems. While we were mainly avoiding the complexity of the more powerful systems, the extensions framework could happily integrate with tooling such as Great Expectations, Deequ, or Whylogs.
- Learning – generating expected tests from profiles of previous runs.
In the meantime, however, we’re very excited about the current offering and can’t wait to get feedback from all of you! Join our community on Slack, read the docs on GitBook, check out the repo on GitHub, and follow us on Twitter.