Data Warehousing and Business Intelligence at the Data Management Center 

Data Warehousing Tutorial

Learn Data Warehousing and Business Intelligence.
Jump start your data warehousing efforts.
Bookmark and Share

 
<< Previous    1  [2]    Next >>

Detailed Data Source Understanding for Data Warehousing

When the major data sources have been identified it is time to quickly gain detailed understanding of each one. Consolidate the spreadsheet developed in the identification phase by data source, then create a new spreadsheet to track and control detailed understanding:

  • Data Subject Name
  • Obtain Doc Date
  • Define Input Date
  • Profile Input Date
  • Map Date
  • Data Quality Date
  • Save Results
  • Analyst Name
  • SME Name(s)
  • Status

Data Source Plan P2

This approach provides an effective workflow as well as a project planning and control method. Due dates are assigned and actual complete dates and status are tracked.

Obtain Existing Documentation

When seeking to understand a data source, the first thing to do is look at existing documentation. This avoids "re-inventing the wheel". If a data source is fully documented, data profiled and of high quality most of the job of data source discovery is complete.

Existing documentation may include:

  • Data models
  • Data dictionary
  • Internal / technical documentation
  • Business user guides
  • Data profiles and data quality assessments

Check through the documentation to assess its completeness and usefulness.

The data source analyst should study the existing documentation before any in depth discussions with the SMEs. This improves the credibility of the data analyst and save time for the SMEs.

Model and Define the Input

The data model is a graphic representation of data structures that improves understanding and provides automation linking database design to physical implementation. This section assumes that the data source is stored in a relational database that modeled using typical relational data modeling tools.

If there is an existing data model, start with that, otherwise use the reverse engineering capability of the data modeling to build a physical data model. Next, group the tables that are of interest into a subject area for analysis. Unless, a large percentage of the data source is needed for the data warehouse avoid studying the entire data source. Stay focused on the current project.

For each selected data source table define:

  • Physical Name
  • Logical Name
  • Definition
  • Notes

For each selected data source column define:

  • Physical Name
  • Logical Name
  • Order in Table
  • Datatype
  • Length
  • Decimal Positions
  • Nullable/Required
  • Default Value
  • Edit Rules
  • Definition
  • Notes

Profile the Data Source

The actual use and behavior of data sources often tends not to match the name or definition of the data. Sometimes this is called "dirty data" or "unrefined data" that may have problems such as:

  • Invalid code values
  • Missing data values
  • Multiple uses of a single data item
  • Inconsistent code values
  • Incorrect values such as sales revenue amounts

Data profile is an organized approach to examining data to better understand and later use it. This can be accomplished by querying the data using tools like:

  • SQL Queries
  • Reporting tools
  • Data quality tools
  • Data exploration tools

For code values such as gender code and account status code do a listing showing value and count such as this gender code listing:

Code

Count

Notes

F

500

Female
M

510

Male
T

12

Transgender?
Z

5

???
NULL

1000

Missing

Other systems may represent female and male as 1 and 2 rather than F and T, and so may require standardization when stored in the data warehouse. When data from multiple sources is integrated in the data warehouse it is expected that it will be standardized and integrated.

Statistical measures are a good way to better understand numeric information such as revenue amounts. Helpful statistics are:

  • Mean (average)
  • Median
  • Mode
  • Maximum
  • Minimum
  • Quartile Averages
  • Standard Deviation
  • Variance

Consistency within a database is another important factor to determine through data profiling. For example, there may be an order table which should only have orders for customers established in the customer table. Perform queries to determine whether this is true.

Improve Data Quality

Data profiling may reveal problems in data quality.  For example, it might show invalid values are be entered for a particular column, such as entering 'Z' for gender when 'F' and 'M' are the valid values.  Some steps that could be taken to improve data quality include:

  • Work with data owners to define the appropriate level of data quality.  Build this into a data governance program.
  • Determine why there are data quality problems -- do a root cause analysis.
  • Correct the data in the source system through manual or automated efforts.
  • Add edits or database rules to prevent the problem.
  • Change business processes to enter correct data.
  • Make data quality visible to the business through scorecards, dashboards and reports.

Save Results for Further Reuse

The information gathered during the data source discovery process is valuable metadata that can be useful for future data warehousing or other projects.  Be sure to save the results and make available for future efforts.  This work can be a great step toward building an improved data resource.

 

 

<< Previous    1  [2]    Next >>