Ensuring Data Quality in Data Warehouses
- February 4, 2016
“Virtually everything in business today is an undifferentiated commodity, except how a company manages its information. How you manage information determines whether you win or lose.”
Modern day business applications & systems produce tons of data that can be smartly used by decision makers and managers, but in many cases the data sources are fragmented and inconsistent. So, despite all the hard work done at the application level, many IT initiatives fail to deliver because of inconsistency and poor quality of underlying data.
Practices such as data quality management (DQM), master data management (MDM) and data warehouse or ETL testing can be used to ensure that high quality data is being used to derive timely, accurate and interesting business patterns for the decision makers.
In this article, my focus will be on importance of data quality testing, and how data warehouse or ETL testing can help in identifying hidden problems within your data and ultimately improving data quality processes of your organization in the long run.
Today, large organizations need data warehouse testing more than ever before. Without testing, the data warehouse could end up producing invalid results and as a result lose the trust of the end business users and decision makers. Effective data quality assurance and testing requires putting together the set of right processes, people and technology and deploying them in productive ways.
Before we start data warehouse testing, it is important to have a proper test plan and a test team that is well versed with data quality management skills. There are several levels and phases of data warehouse and ETL testing; some of which are mentioned below:
Business Requirements Gathering & Testing:
Successful testing starts with gathering and documentation of business and system requirements. Without proper requirements, it is very difficult to measure the system correctness. So, in this phase we gather the requirements; check if they are complete and testable. If there are any ambiguities in requirements they are cleared before moving on to the next phase.
Test documentation & Environment Setup:
Just like any other testing project, DWH testing also involves test planning, and test documentation phase. In this phase test team designs and develops test cases on the basis of the requirements. Further; the test environment is also set up. In this phase; automation and data generation tools can be selected to facilitate the testing activities as well.
Once the test documentation is complete and environment for testing is ready we proceed to the test execution stage. Various types of testing can be performed on DWH or ETL applications, some of which are as follows:
Integration Testing is done to verify whether or not the system acts as per expectations once the components of the solution are put together. It tests the systems’ functionalities from start to end, focusing on how the data flows throughout. This is also called “system testing” or “end-to-end testing”. In this type of testing; following activities are performed specifically:
- Verify the sequence & output of ETL batch processes
- Check that ETL processes work properly with upstream and downstream processes
- Test the discarded records that fail ETL rules
Data Validation Testing:
Data Validation is the process of testing the data within system. In this phase we check the data by applying different testing techniques to see whether data extraction, transformation and loading functionalities are working properly or not. Some of the activities that we perform during this phase are mentioned below:
- Compare record counts between different data sources
- Verify that ETL application properly rejects and replaces invalid data with default values
- Confirm that data is transformed correctly as per the requirements
- Check that all data is loaded into the DWH without any data loss or truncation
- Verify that data intended for transformation has been transformed successfully
Properly validating the performance of DWH solution under real world conditions is very important. There are a number of factors to consider while doing DWH/ETL Performance testing such as architecture, hardware specifications, scalability, query complexity, concurrent users, etc. We test the performance of the system from following perspectives:
- Performance of the system while extracting huge amount of data
- Performance of the system while transforming and loading a large amount of data
- Performance of the system while manipulating the data through calculations
User Acceptance Testing:
The aim of User Acceptance Testing (UAT) is to ensure two things: (i) ensure that the data which is being provided to the end user is as per the requirements or not. (ii) ensure that the tools provided to the end user are easy to use and meet their expectations. It should also be confirmed that DWH is producing accurate reports that are easy to interpret for end users.
Challenges in DWH/ETL Testing:
Testing a DWH/ETL application is a bit different from conventional testing but basic concepts remain the same. Test teams may face below mentioned challenges but these challenges can be overcome by doing proper planning and using suitable techniques.
- Setting up a dedicated test environment
- Missing business flows & requirements
- Preparing huge test data
- Privileges issues to execute ETL jobs
- Incompatible, invalid or missing data
In the light of the above discussion we can easily conclude that testing ETL process and DWH is very critical for the success of DWH system. If quality is not ensured and wrong results are being produced, then not only the users will lose faith in the system but business can also suffer huge losses because of the wrong decisions made on the basis of wrong results and wrong data.