Rule #1: Load detailed
atomic data into dimensional structures.
Dimensional models
should be populated with bedrock atomic details to support the unpredictable
filtering and grouping required by business user queries. Users typically don't
need to see a single record at a time, but you can't predict the somewhat
arbitrary ways they'll want to screen and roll up the details. If only
summarized data is available, then you've already made
assumptions about data usage patterns that will cause users to run into a brick
wall when they want to dig deeper into the details. Of course, atomic details
can be complemented by summary dimensional models that provide performance
advantages for common queries of aggregated data, but business users cannot
live on summary data alone; they need the gory details to answer their
ever-changing questions.
Rule #2: Structure
dimensional models around business processes.
Business processes are
the activities performed by your organization; they represent measurement
events, like taking an order or billing a customer. Business processes
typically capture or generate unique performance metrics associated with each
event. These metrics translate into facts, with each business process
represented by a single atomic fact table. In addition to single process fact
tables, consolidated fact tables are sometimes created that combine metrics
from multiple processes into one fact table at a common level of detail. Again,
consolidated fact tables are a complement to the detailed single-process fact
tables, not a substitute for them.
Rule #3: Ensure that
every fact table has an associated date dimension table.
The measurement events
described in Rule #2 always have a date stamp of some variety associated with
them, whether it's a monthly balance snapshot or a monetary transfer captured
to the hundredth of a second. Every fact table should have at least one foreign
key to an associated date dimension table, whose grain is a single day, with
calendar attributes and nonstandard characteristics about the measurement event
date, such as the fiscal month and corporate holiday indicator. Sometimes
multiple date foreign keys are represented in a fact table.
Rule #4: Ensure that all
facts in a single fact table are at the same grain or level of detail.
There are three
fundamental grains to categorize all fact tables: transactional, periodic
snapshot, or accumulating snapshot. Regardless of its grain type, every
measurement within a fact table must be at the exact same level of detail. When
you mix facts representing multiple levels of granularity in the same fact
table, you are setting yourself up for business user confusion and making the
BI applications vulnerable to overstated or otherwise erroneous results.
Rule #5: Resove
many-to-many relationships in fact tables.
Since a fact table
stores the results of a business process event, there's inherently a
many-to-many (M:M) relationship between its foreign keys, such as multiple
products being sold in multiple stores on multiple days. These foreign key fields
should never be null. Sometimes dimensions can take on multiple values for a
single measurement event, such as the multiple diagnoses associated with a
health care encounter or multiple customers with a bank account. In these
cases, it's unreasonable to resolve the many-valued dimensions directly in the
fact table, as this would violate the natural grain of the measurement event.
Thus, we use a many-to-many, dual-keyed bridge table in conjunction with the
fact table.
Rule #6: Resolve
many-to-one relationships in dimension tables.
Hierarchical,
fixed-depth many-to-one (M:1) relationships between attributes are typically
denormalized or collapsed into a flattened dimension table. If you've spent
most of your career designing entity-relationship models for transaction
processing systems, you'll need to resist your instinctive tendency to
normalize or snowflake a M:1 relationship into smaller subdimensions; dimension
denormalization is the name of the game in dimensional modeling.
It is relatively common
to have multiple M:1 relationships represented in a single dimension table.
One-to-one relationships, like a unique product description associated with a
product code, are also handled in a dimension table. Occasionally many-to-one
relationships are resolved in the fact table, such as the case when the
detailed dimension table has millions of rows and its roll-up attributes are
frequently changing. However, using the fact table to resolve M:1 relationships
should be done sparingly
Rule #7: Store report
labels and filter domain values in dimension tables.
The codes and, more
importantly, associated decodes and descriptors used for labeling and query
filtering should be captured in dimension tables. Avoid storing cryptic code
fields or bulky descriptive fields in the fact table itself; likewise, don't
just store the code in the dimension table and assume that users don't need
descriptive decodes or that they'll be handled in the BI application. If it's a
row/column label or pull-down menu filter, then it should be handled as a
dimension attribute.
Though we stated in Rule
#5 that fact table foreign keys should never be null, it's also advisable to
avoid nulls in the dimension tables' attribute fields by replacing the null
value with "NA" (not applicable) or another default value, determined
by the data steward, to reduce user confusion if possible.
Rule #8: Make certain
that dimension tables use a surrogate key.
Meaningless,
sequentially assigned surrogate keys (except for the date dimension, where
chronologically assigned and even more meaningful keys are acceptable) deliver
a number of operational benefits, including smaller keys which mean smaller
fact tables, smaller indexes, and improved performance. Surrogate keys are
absolutely required if you're tracking dimension attribute changes with a new
dimension record for each profile change. Even if your business users don't
initially visualize the value of tracking attribute changes, using surrogates
will make a downstream policy change less onerous. The surrogates also allow
you to map multiple operational keys to a common profile, plus buffer you from
unexpected operational activities, like the recycling of an obsolete product
number or acquisition of another company with its own coding schemes.
Rule #9: Create conformed
dimensions to integrate data across the enterprise.
Conformed dimensions
(otherwise known as common, master, standard or reference dimensions) are
essential for enterprise data warehousing. Managed once in the ETL system and
then reused across multiple fact tables, conformed dimensions deliver
consistent descriptive attributes across dimensional models and support the
ability to drill across and integrate data from multiple business processes.
The Enterprise Data Warehouse Bus Matrix is the key architecture blueprint for
representing the organization's core business processes and associated
dimensionality. Reusing conformed dimensions ultimately shortens the
time-to-market by eliminating redundant design and development efforts;
however, conformed dimensions require a commitment and investment in data
stewardship and governance, even if you don't need everyone to agree on every
dimension attribute to leverage conformity.
Rule #10: Continuously
balance requirements and realities to deliver a DW/BI solution that's accepted
by business users and that supports their decision-making.
Dimensional modelers
must constantly straddle business user requirements along with the underlying
realities of the associated source data to deliver a design that can be implemented
and that, more importantly, stands a reasonable chance of business adoption.
The requirements-versus-realities balancing act is a fact of life for DW/BI
practitioners, whether you're focused on the dimensional model, project
strategy, technical/ETL/BI architectures or deployment/maintenance plan.
Specifically, types of
facts: Accumulating, Factless, Transactions, Snapshots, Additive, SemiAdditive
and Non Additive
Accumulating Fact Table
An accumulating fact table is where all of the
dimensional attributes are not available at the time of creation and the
dimensions that are linked to a fact table change over time. The most common
implementation of this is in the recording of dates against facts.
Take a "Sales" fact, typical dates you
may be intersted in when tracking an individual sale is maybe, order_date,
ship_date, delivery_date and payment_date. These would not all be available
when the fact is first created. Over time the fact record would accumulatemore relationships
with the dimensions as the relevant date milestones were passed for the sale.
Factless Facts
A factless fact is where the fact does not store
an actual numerical measure, the mere existance of a fact record indicates that
an event has happened that you wish to track.
The classic example of this would be an
"Attendance" fact. If you had dimensions to record date,
scheduled_course, instructor and delegate then you could create a fact table
that held the permutations of these dimensions. From this you could evaluate
the number of courses you run, the number of delegates, the number of courses
by instructor etc.
I would never simply leave a factess fact as a
bare collection of foreign key columns I would always add a dummy measure
column which would be set to 1 which you would then sum.
Transaction Grain
This is the most common type of fact. You would
declare the grain of the fact, ie the level of detail and then this is what
would be stored. For example you may have a sales_order fact, every time a new
sales order a new row would be created in the fact table. alternatively you may
have a "monthly_sales" fact. At the end of every month you would
aggregate up all the sales that happened in that month and record the single
total value.
Snapshot Facts
The snapshot fact contains
a reflection of the state of an entity at a given point in time. A classic
example of this would be a "daily_balance" fact in a banking system.
This would, on a daily basis record the balance of each account, it would NOT
list the individual transactions that happened on the account.
Additive Facts
A fully additive fact is one where the measures can be aggregated.
For example our Sales fact above would be fully additive as you can aggregate the sales amout over time, by product, by region or by salesman and still get the correct answer.
A fully additive fact is one where the measures can be aggregated.
For example our Sales fact above would be fully additive as you can aggregate the sales amout over time, by product, by region or by salesman and still get the correct answer.
Semi Additive Facts
A semi additive fact is one where the measure
can either have only a subset of aggregations applied to it, it you can count
the measures but not sum them, or the measures are only additive over a subset
of the dimensions.
Using our "daily_balances" fact above
would be a good example of a semi additive fact. The daily balances can be
aggregated by customer if the customer has multiple accounts to give the
customrs daily balance, however the balances could not be aggregated over time
as adding last weeks balance onto this weeks balance would result in a
nonsensical figure.
Non Additive Facts
A non additive fact is
one where the measure is non aggregable over any dimensions. These are commonly
where percentages have been calculated and stored in the fact. Another example
could be a profit margin on a sale, there is this figure other than at an
individual sale level.
No comments:
Post a Comment