|David Haertzen, Principal Enterprise Architect|
A data warehousing and business intelligence effort is only as good as the data that is put into it. The saying "Garbage In, Garbage Out" is all too true. A leading cause of data warehousing and business intelligence project failures is to obtain the wrong or poor quality data.
This article will provide an efficient and effective workflow for obtaining the right source data and using it in the data warehousing and business intelligence project.
Managing data warehouse input sources includes a number of steps organized into two phases. In the first phase the following activities are undertaken:
When the major data sources have been identified it is time to quickly gain detailed understanding of each one:
The source identification process is critical to the success of data warehousing and business intelligence projects. It is important to move through this effort quickly, obtaining enough information about the data sources without being bogged down in excess detail while still obtaining the needed information.
Start out with a list of the entities planned for the data warehouse / data mart. This can be managed with a spreadsheet containing these columns:
Complete the entity name, data mart role and subject area entries. Assign an analyst to each entity who will find data sources and subject matter experts for each entity.
Consider the following questions when determining the sources and costs of data for the Data Warehouse:
Dimensions enable business intelligence users to put information in context They focus on questions of: who, when, where and what. Typical dimensions include:
Master data is a complementary concept and may provide the best source of dimensional data for the data warehouse. Master data is data shared between systems that describes entities like: product, customer and household. Master data is managed using a Master Data Management (MDM) system and stored in an MDM-Hub. Benefits of this approach include:
If an MDM-Hub does not exist consider creating one. It will have many uses beyond supporting the data warehouse and business intelligence.
If no MDM-Hub is available, you will need to examine source systems and determine which system contains the data most suitable for dimensions. If the data is not stored in a managed database, you may need to define the data locally, in a spreadsheet or desktop database, and then provide to the data warehousing system.
The Fact contains quantitative measurements while the Dimension contains classification information. The data sources for Fact tends to be transactional software systems. For example:
Example Fact Data
|Sales Order Entry||
Larger enterprises may have multiple systems for the same kind data. In that case, you will need to determine the best source of data - the System of Record (SOR) as the source of data warehousing data.
When the major data sources have been identified it is time to quickly gain detailed understanding of each one. Consolidate the spreadsheet developed in the identification phase by data source, then create a new spreadsheet to track and control detailed understanding:
This approach provides an effective workflow as well as a project planning and control method. Due dates are assigned and actual complete dates and status are tracked.
When seeking to understand a data source, the first thing to do is look at existing documentation. This avoids "re-inventing the wheel". If a data source is fully documented, data profiled and of high quality most of the job of data source discovery is complete.
Existing documentation may include:
Check through the documentation to assess its completeness and usefulness.
The data source analyst should study the existing documentation before any in depth discussions with the SMEs. This improves the credibility of the data analyst and save time for the SMEs.
The data model is a graphic representation of data structures that improves understanding and provides automation linking database design to physical implementation. This section assumes that the data source is stored in a relational database that modeled using typical relational data modeling tools.
If there is an existing data model, start with that, otherwise use the reverse engineering capability of the data modeling to build a physical data model. Next, group the tables that are of interest into a subject area for analysis. Unless, a large percentage of the data source is needed for the data warehouse avoid studying the entire data source. Stay focused on the current project.
For each selected data source table define:
For each selected data source column define:
The actual use and behavior of data sources often tends not to match the name or definition of the data. Sometimes this is called "dirty data" or "unrefined data" that may have problems such as:
Data profile is an organized approach to examining data to better understand and later use it. This can be accomplished by querying the data using tools like:
For code values such as gender code and account status code do a listing showing value and count such as this gender code listing:
Other systems may represent female and male as 1 and 2 rather than F and T, and so may require standardization when stored in the data warehouse. When data from multiple sources is integrated in the data warehouse it is expected that it will be standardized and integrated.
Statistical measures are a good way to better understand numeric information such as revenue amounts. Helpful statistics are:
Consistency within a database is another important factor to determine through data profiling. For example, there may be an order table which should only have orders for customers established in the customer table. Perform queries to determine whether this is true.
Data profiling may reveal problems in data quality. For example, it might show invalid values are be entered for a particular column, such as entering 'Z' for gender when 'F' and 'M' are the valid values. Some steps that could be taken to improve data quality include:
The information gathered during the data source discovery process is valuable metadata that can be useful for future data warehousing or other projects. Be sure to save the results and make available for future efforts. This work can be a great step toward building an improved data resource.