Data cleansing is one of the most demanding parts of data warehousing. It includes data scrubbing and discovering business rules embedded within data.
Before you start, you need to know what to expect and how many mops and buckets you'll need.
First, consider this. About 75 per cent of the time spent on data warehouse/data mart projects is spent on back-end issues.
Some other rules to consider:
- You probably can't clean all the
data; there's too much of it. But
you won't need to clean it all
because not all of it needs to be
in the warehouse. Twenty per cent
of the data may meet 80 per cent
of the essential needs.
- Cleanse only those portions of
data that have real business
value. Business analysts should
be primarily responsible for
determining the business value of
data. Also, look at usage. Watch
out for users who say a chunk of
data is essential but is never even
looked at. The users should help
ensure data cleanliness by
working side-by-side with IS.
- The first symptom of dirty data is
a pattern of inconsistent data.
You may notice that the same
customers are listed under a
dozen different spellings or
abbreviations. Data elements may
have different names in different
systems. Or the same name may
be used for different data
elements.
- When you clean legacy data,
don't put the clean data back into
the source system. Legacy
systems are often delicate, and
"fixing" their data may break the
old system. Put the clean data in
the warehouse.
- Data cleaning can involve
re-engineering legacy data. If you
must deal with more than a small
amount of legacy data, you would
be wise to consider automated
software tools for conversion and
migration. An automated soft-
ware tool can extract mapping
rules from actual data in min
utes, instead of the hours it would
take to do it manually.