Data Warehousing Input Sources

Data Sources for Data Warehousing

Learn How to Manage Data Sources for Input to Data Warehousing and Business Intelligence

David Haertzen David Haertzen, Principal Enterprise Architect

A data warehousing and business intelligence effort is only as good as the data that is put into it. The saying "Garbage In, Garbage Out" is all too true. A leading cause of data warehousing and business intelligence project failures is to obtain the wrong or poor quality data.

This article will provide an efficient and effective workflow for obtaining the right source data and using it in the data warehousing and business intelligence project.

Managing data warehouse input sources includes a number of steps organized into two phases. In the first phase the following activities are undertaken:

  • Manage the Data Source Identification Process
  • Identify Subject Matter Experts (SMEs)
  • Identify Dimension Data Sources
  • Identify Fact Data Sources

When the major data sources have been identified it is time to quickly gain detailed understanding of each one:

  • Obtain Existing Documentation
  • Model and Define the Input
  • Profile the Input
  • Improve Data Quality
  • Save Results for Further Reuse

Manage the Data Warehousing Data Source Identification Process

The source identification process is critical to the success of data warehousing and business intelligence projects. It is important to move through this effort quickly, obtaining enough information about the data sources without being bogged down in excess detail while still obtaining the needed information.

Start out with a list of the entities planned for the data warehouse / data mart. This can be managed with a spreadsheet containing these columns:

  • Entity name
  • Data mart role (Fact, Dimension, Bridge, etc.)
  • Subject Area
  • Data Source(s)
  • Analyst Name(s)
  • Subject Matter Expert(s)
  • Status

Data Warehousing Data Source Entity Plan

Complete the entity name, data mart role and subject area entries. Assign an analyst to each entity who will find data sources and subject matter experts for each entity.

Identify Data Warehousing Data Source Subject Matter Experts

Consider the following questions when determining the sources and costs of data for the Data Warehouse:

  • Where does the data come from?
  • What processes are used to obtain the data?
  • What does it cost to obtain the data?
  • What does it cost to store the data?
  • What does it cost to maintain the data?

Identify Dimension Data Sources for the Data Mart

Dimensions enable business intelligence users to put information in context  They focus on questions of: who, when, where and what.  Typical dimensions include:

  • Time period / calendar
  • Product
  • Customer
  • Household
  • Market Segment
  • Geographic Area

Master data is a complementary concept and may provide the best source of dimensional data for the data warehouse.  Master data is data shared between systems that describes entities like: product, customer and household.  Master data is managed using a Master Data Management (MDM) system and stored in an MDM-Hub.  Benefits of this approach include:

  • It is less expensive to access data from a single source (MDM-Hub) than extracting from multiple sources.
  • MDM data is rationalized.
  • MDM data is of high quality

If an MDM-Hub does not exist consider creating one.  It will have many uses beyond supporting the data warehouse and business intelligence.

If no MDM-Hub is available, you will need to examine source systems and determine which system contains the data most suitable for dimensions.  If the data is not stored in a managed database, you may need to define the data locally, in a spreadsheet or desktop database, and then provide to the data warehousing system.

Identify Fact Data Sources for the Data Mart

The Fact contains quantitative measurements while the Dimension contains classification information.  The data sources for Fact tends to be transactional software systems.  For example:


Example Fact Data

Sales Order Entry
  • Sales Transaction
  • Return Transaction
Customer Service
  • Service Episode
  • Service Result
Accounts Payable
  • Payment Transaction
Sales Campaign
  • Sales Campaign Event

Larger enterprises may have multiple systems for the same kind data.   In that case, you will need to determine the best source of data - the System of Record (SOR) as the source of data warehousing data.


Detailed Data Source Understanding for Data Warehousing

When the major data sources have been identified it is time to quickly gain detailed understanding of each one. Consolidate the spreadsheet developed in the identification phase by data source, then create a new spreadsheet to track and control detailed understanding:

  • Data Subject Name
  • Obtain Doc Date
  • Define Input Date
  • Profile Input Date
  • Map Date
  • Data Quality Date
  • Save Results
  • Analyst Name
  • SME Name(s)
  • Status

Data Source Plan P2

This approach provides an effective workflow as well as a project planning and control method. Due dates are assigned and actual complete dates and status are tracked.

Obtain Existing Documentation

When seeking to understand a data source, the first thing to do is look at existing documentation. This avoids "re-inventing the wheel". If a data source is fully documented, data profiled and of high quality most of the job of data source discovery is complete.

Existing documentation may include:

  • Data models
  • Data dictionary
  • Internal / technical documentation
  • Business user guides
  • Data profiles and data quality assessments

Check through the documentation to assess its completeness and usefulness.

The data source analyst should study the existing documentation before any in depth discussions with the SMEs. This improves the credibility of the data analyst and save time for the SMEs.

Model and Define the Input

The data model is a graphic representation of data structures that improves understanding and provides automation linking database design to physical implementation. This section assumes that the data source is stored in a relational database that modeled using typical relational data modeling tools.

If there is an existing data model, start with that, otherwise use the reverse engineering capability of the data modeling to build a physical data model. Next, group the tables that are of interest into a subject area for analysis. Unless, a large percentage of the data source is needed for the data warehouse avoid studying the entire data source. Stay focused on the current project.

For each selected data source table define:

  • Physical Name
  • Logical Name
  • Definition
  • Notes

For each selected data source column define:

  • Physical Name
  • Logical Name
  • Order in Table
  • Datatype
  • Length
  • Decimal Positions
  • Nullable/Required
  • Default Value
  • Edit Rules
  • Definition
  • Notes

Profile the Data Source

The actual use and behavior of data sources often tends not to match the name or definition of the data. Sometimes this is called "dirty data" or "unrefined data" that may have problems such as:

  • Invalid code values
  • Missing data values
  • Multiple uses of a single data item
  • Inconsistent code values
  • Incorrect values such as sales revenue amounts

Data profile is an organized approach to examining data to better understand and later use it. This can be accomplished by querying the data using tools like:

  • SQL Queries
  • Reporting tools
  • Data quality tools
  • Data exploration tools

For code values such as gender code and account status code do a listing showing value and count such as this gender code listing:















Other systems may represent female and male as 1 and 2 rather than F and T, and so may require standardization when stored in the data warehouse. When data from multiple sources is integrated in the data warehouse it is expected that it will be standardized and integrated.

Statistical measures are a good way to better understand numeric information such as revenue amounts. Helpful statistics are:

  • Mean (average)
  • Median
  • Mode
  • Maximum
  • Minimum
  • Quartile Averages
  • Standard Deviation
  • Variance

Consistency within a database is another important factor to determine through data profiling. For example, there may be an order table which should only have orders for customers established in the customer table. Perform queries to determine whether this is true.

Improve Data Quality

Data profiling may reveal problems in data quality.  For example, it might show invalid values are be entered for a particular column, such as entering 'Z' for gender when 'F' and 'M' are the valid values.  Some steps that could be taken to improve data quality include:

  • Work with data owners to define the appropriate level of data quality.  Build this into a data governance program.
  • Determine why there are data quality problems -- do a root cause analysis.
  • Correct the data in the source system through manual or automated efforts.
  • Add edits or database rules to prevent the problem.
  • Change business processes to enter correct data.
  • Make data quality visible to the business through scorecards, dashboards and reports.

Save Results for Further Reuse

The information gathered during the data source discovery process is valuable metadata that can be useful for future data warehousing or other projects.  Be sure to save the results and make available for future efforts.  This work can be a great step toward building an improved data resource.



Copyright© 1999-2015, First Place Software, Inc.