Detailed Data Source Understanding for Data
Warehousing
When the major data sources have been identified it is time
to quickly gain detailed understanding of each one. Consolidate
the spreadsheet developed in the identification phase by data
source, then create a new spreadsheet to track and control
detailed understanding:
- Data Subject Name
- Obtain Doc Date
- Define Input Date
- Profile Input Date
- Map Date
- Data Quality Date
- Save Results
- Analyst Name
- SME Name(s)
- Status

This approach provides an effective workflow as well as a
project planning and control method. Due dates are assigned and
actual complete dates and status are tracked.
Obtain Existing Documentation
When seeking to understand a data source, the first thing to
do is look at existing documentation. This avoids "re-inventing
the wheel". If a data source is fully documented, data profiled
and of high quality most of the job of data source discovery is
complete.
Existing documentation may include:
- Data models
- Data dictionary
- Internal / technical documentation
- Business user guides
- Data profiles and data quality assessments
Check through the documentation to assess its completeness
and usefulness.
The data source analyst should study the existing
documentation before any in depth discussions with the SMEs.
This improves the credibility of the data analyst and save time
for the SMEs.
Model and Define the Input
The data model is a graphic representation of data
structures that improves understanding and provides automation
linking database design to physical implementation. This
section assumes that the data source is stored in a relational
database that modeled using typical relational data modeling
tools.
If there is an existing data model, start with that,
otherwise use the reverse engineering capability of the data
modeling to build a physical data model. Next, group the tables
that are of interest into a subject area for analysis. Unless,
a large percentage of the data source is needed for the data
warehouse avoid studying the entire data source. Stay focused
on the current project.
For each selected data source table define:
- Physical Name
- Logical Name
- Definition
- Notes
For each selected data source column define:
- Physical Name
- Logical Name
- Order in Table
- Datatype
- Length
- Decimal Positions
- Nullable/Required
- Default Value
- Edit Rules
- Definition
- Notes
Profile the Data Source
The actual use and behavior of data sources often tends not
to match the name or definition of the data. Sometimes this is
called "dirty data" or "unrefined data" that may have problems
such as:
- Invalid code values
- Missing data values
- Multiple uses of a single data item
- Inconsistent code values
- Incorrect values such as sales revenue amounts
Data profile is an organized approach to examining data to
better understand and later use it. This can be accomplished by
querying the data using tools like:
- SQL Queries
- Reporting tools
- Data quality tools
- Data exploration tools
For code values such as gender code and account status code
do a listing showing value and count such as this gender code
listing:
|
Code
|
Count
|
Notes
|
| F |
500
|
Female |
| M |
510
|
Male |
| T |
12
|
Transgender? |
| Z |
5
|
??? |
| NULL |
1000
|
Missing |
Other systems may represent female and male as 1 and 2
rather than F and T, and so may require standardization when
stored in the data warehouse. When data from multiple sources
is integrated in the data warehouse it is expected that it will
be standardized and integrated.
Statistical measures are a good way to better understand
numeric information such as revenue amounts. Helpful statistics
are:
- Mean (average)
- Median
- Mode
- Maximum
- Minimum
- Quartile Averages
- Standard Deviation
- Variance
Consistency within a database is another important factor to
determine through data profiling. For example, there may be an
order table which should only have orders for customers
established in the customer table. Perform queries to determine
whether this is true.
Improve Data Quality
Data profiling may reveal problems in data quality.
For example, it might show invalid values are be entered for a
particular column, such as entering 'Z' for gender when 'F' and
'M' are the valid values. Some steps that could be taken
to improve data quality include:
- Work with data owners to define the appropriate level
of data quality. Build this into a data governance
program.
- Determine why there are data quality problems -- do a
root cause analysis.
- Correct the data in the source system through manual or
automated efforts.
- Add edits or database rules to prevent the
problem.
- Change business processes to enter correct data.
- Make data quality visible to the business through
scorecards, dashboards and reports.
Save Results for Further Reuse
The information gathered during the data source discovery
process is valuable metadata that can be useful for future data
warehousing or other projects. Be sure to save the
results and make available for future efforts. This work
can be a great step toward building an improved data
resource.
|