Data Sources for Data Warehousing
Learn How to Manage Data Sources for Input to Data
Warehousing and Business Intelligence
A data warehousing and business intelligence effort is only
as good as the data that is put into it. The saying "Garbage
In, Garbage Out" is all too true. A leading cause of data
warehousing and business intelligence project failures is to
obtain the wrong or poor quality data.
This article will provide an efficient and effective
workflow for obtaining the right source data and using it in
the data warehousing and business intelligence project.
|
|
Managing data warehouse input sources
includes a number of steps organized into two
phases. In the first phase the following
activities are undertaken:
- Manage the Data Source Identification
Process
- Identify Subject Matter Experts
(SMEs)
- Identify Dimension Data Sources
- Identify Fact Data Sources
|
When the major data sources have been identified it is time
to quickly gain detailed understanding of each one:
- Obtain Existing Documentation
- Model and Define the Input
- Profile the Input
- Improve Data Quality
- Save Results for Further Reuse
Manage the Data Warehousing Data Source Identification
Process
The source identification process is critical to the success
of data warehousing and business intelligence projects. It is
important to move through this effort quickly, obtaining enough
information about the data sources without being bogged down in
excess detail while still obtaining the needed information.
Start out with a list of the entities planned for the data
warehouse / data mart. This can be managed with a spreadsheet
containing these columns:
- Entity name
- Data mart role (Fact, Dimension, Bridge, etc.)
- Subject Area
- Data Source(s)
- Analyst Name(s)
- Subject Matter Expert(s)
- Status

Complete the entity name, data mart role and subject area
entries. Assign an analyst to each entity who will find data
sources and subject matter experts for each entity.
Identify Data Warehousing Data Source Subject Matter
Experts
Consider the following questions when determining the
sources and costs of data for the Data Warehouse:
- Where does the data come from?
- What processes are used to obtain the data?
- What does it cost to obtain the data?
- What does it cost to store the data?
- What does it cost to maintain the data?
Identify Dimension Data Sources for the Data Mart
Dimensions enable business intelligence users to put
information in context They focus on questions of: who,
when, where and what. Typical dimensions include:
- Time period / calendar
- Product
- Customer
- Household
- Market Segment
- Geographic Area
Master data is a complementary concept and may provide the
best source of dimensional data for the data warehouse.
Master data is data shared between systems that describes
entities like: product, customer and household. Master
data is managed using a Master Data Management (MDM)
system and stored in an MDM-Hub. Benefits of this
approach include:
- It is less expensive to access data from a single
source (MDM-Hub) than extracting from multiple
sources.
- MDM data is rationalized.
- MDM data is of high quality
If an MDM-Hub does not exist consider creating one. It
will have many uses beyond supporting the data warehouse and
business intelligence.
If no MDM-Hub is available, you will need to examine source
systems and determine which system contains the data most
suitable for dimensions. If the data is not stored in a
managed database, you may need to define the data locally, in a
spreadsheet or desktop database, and then provide to the data
warehousing system.
Identify Fact Data Sources for the Data Mart
The Fact contains quantitative measurements while the
Dimension contains classification information. The data
sources for Fact tends to be transactional software
systems. For example:
|
System
|
Example Fact
Data
|
| Sales Order Entry |
- Sales Transaction
- Return Transaction
|
| Customer Service |
- Service Episode
- Service Result
|
| Accounts Payable |
|
| Sales Campaign |
|
Larger enterprises may have multiple systems for the same
kind data. In that case, you will need to determine
the best source of data - the System of Record (SOR) as the
source of data warehousing data.
|