If you are not already doing data quality testing, here are 3 good reasons to start:
1. Do not assume your data is correct!
2. Make the right business decisions!
3. And finally, sleep well all night!
We cannot promise to improve your sleeping habits, but we can help you with data quality testing – the basis for making the right business decisions instead of assumptions. And your sleep will improve automatically 🙂
One of the many possible ways to test the quality of your data is Great Expectations – a solution for all data-driven businesses.
What exactly is Great Expectations, and what does it do? Read on!
How does Great Expectations help your business
With Great Expectations you define, during development, what shape, format and content you expect your data to have. Production usage is then secured by continuously running these Expectations against your data.
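To make the idea concrete, here is a minimal sketch in plain Python (not Great Expectations' actual implementation) of what such Expectations check – the function names mirror real Great Expectations expectation names, but the data and regex are illustrative assumptions:

```python
import re

def expect_column_values_to_not_be_null(rows, column):
    """Every record must have a non-null value in `column`."""
    return all(row.get(column) is not None for row in rows)

def expect_column_values_to_match_regex(rows, column, regex):
    """Every value in `column` must fully match the given pattern."""
    return all(re.fullmatch(regex, str(row.get(column, ""))) for row in rows)

# Illustrative records, standing in for a real data batch.
rows = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": "bob@example.com"},
]

assert expect_column_values_to_not_be_null(rows, "id")
assert expect_column_values_to_match_regex(rows, "email", r"[^@]+@[^@]+\.[^@]+")
```

In Great Expectations itself, such checks are declared once as an Expectation Suite and then re-run against every new batch of data.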
Furthermore, through testing, Great Expectations reduces pipeline debt and helps you determine how to respond to problems. This is a huge benefit for your teams and your whole business.
Speaking of data pipelines, Great Expectations enables you to monitor data quality, simplifies debugging and automates the verification of new data.
Use Great Expectations when sharing data with other teams in the organisation: codifying your assumptions builds shared understanding, data documentation is automatically created as part of setting up Great Expectations, and implicit knowledge becomes explicit.
Data Quality Testing using Great Expectations
Great Expectations is an open-source tool (available on GitHub) that provides:
- data testing,
- automatically generated data documentation and
- data profiling.
But why is it more and more difficult to identify a problem in your data pipelines? It is because:
- source systems change,
- data systems are evolving and becoming more interconnected and
- teams need to share data across the organisation.
Pipeline debt – technical debt that affects your data, Business Intelligence and Machine Learning – can jeopardize your productivity and analytic integrity. Great Expectations solves this by testing your data sets and data quality.
Great Expectations is therefore a sophisticated approach to identifying your pipeline debt. Otherwise, you risk:
- Productivity drain – uncertainty about and frustration with data quality slow your teams down
- Operational risk – the usefulness and trustworthiness of your analytics is low
Fortunately, every such problem has a solution, thanks to the effectiveness and simplicity of Great Expectations.
- First, Great Expectations configures and deploys your business rules and documentation framework.
- After that, it validates all new data against the existing business rules.
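The two steps above can be sketched in plain Python (this is an illustration of the idea, not Great Expectations' API; the rule names and thresholds are assumptions):

```python
def validate_batch(rows, rules):
    """Run each named business rule against a batch; return {rule_name: passed}."""
    return {name: all(rule(row) for row in rows) for name, rule in rules.items()}

# Step 1: business rules are configured once, up front.
rules = {
    "amount_is_positive": lambda r: r["amount"] > 0,
    "currency_is_known": lambda r: r["currency"] in {"EUR", "USD", "CZK"},
}

# Step 2: every new batch of data is validated against the existing rules.
new_batch = [
    {"amount": 120.0, "currency": "EUR"},
    {"amount": -5.0, "currency": "USD"},
]

results = validate_batch(new_batch, rules)
# → {"amount_is_positive": False, "currency_is_known": True}
```

A failing rule in the results is the signal to stop the pipeline or quarantine the bad rows, as described next.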
To describe the overall system in a bit more depth, let’s explore the quarantine flow on Airflow:
- The first step is to import new raw data.
- Then, Great Expectations validates the data and tags each row with 0 or 1, depending on whether it ‘fails’ or ‘passes’ the business rule.
- The next step is to split the data and move the invalid rows into a separate quarantine table. The valid data moves into a clean table, where the data pipelines continue.
- Last but not least, all data is removed from the raw stage table, leaving it empty and ready to repeat the whole flow with the next month’s data.
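The quarantine flow above can be sketched in plain Python (in the real setup these steps would run as Airflow tasks against database tables; the row structure and business rule here are illustrative assumptions):

```python
def quarantine_flow(raw_table, business_rule):
    """Validate, split and reset a raw staging table, as in the flow above."""
    # Step 2: tag each row 1 ('pass') or 0 ('fail') against the business rule.
    tagged = [(row, 1 if business_rule(row) else 0) for row in raw_table]
    # Step 3: split – valid rows go to the clean table, invalid to quarantine.
    clean = [row for row, ok in tagged if ok]
    quarantine = [row for row, ok in tagged if not ok]
    # Step 4: empty the raw stage table, ready for the next monthly load.
    raw_table.clear()
    return clean, quarantine

raw = [{"order_id": 1, "qty": 3}, {"order_id": 2, "qty": 0}]
clean, quarantine = quarantine_flow(raw, lambda r: r["qty"] > 0)
# clean keeps order 1, quarantine holds order 2, and raw is now empty.
```

Keeping the failed rows in a quarantine table (rather than dropping them) means they can be inspected, fixed and re-imported in the next run.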
How to get Data Quality Testing
Our team is very excited about Great Expectations. We would be happy to show you what it can do for you!
We can help you identify your data quality issues and improve your CI/CD processes so you can make better business decisions – after all, that is the goal of everything you do!
In general, Great Expectations saves you time during data cleaning and when setting up data pipelines. It also accelerates ETL and data normalization and manages the complexity within your data pipelines.
Finally, a comment from our CTO: “Great Expectations is easily extended using Python and works seamlessly with Airflow, GitHub and GitHub Actions.”