Data Warehouse Modeling Theory for a Business Domain
Data Warehouse Layering Theory
- ODS (Operational Data Store): An intermediate data storage layer that holds real-time operational data from multiple source systems. ODS typically contains raw data and supports real-time queries and reporting. It serves as the precursor to the data warehouse, providing a centralized store for further transformation, cleansing, and loading into the warehouse.
- DWD (Data Warehouse Detail): A layer in the data warehouse architecture that stores cleansed, transformed, and integrated detailed data. Data at this layer is typically organized as fact tables and dimension tables to support multidimensional analysis. DWD data has high quality and consistency and can be used for further data mining and reporting.
- DWM (Data Warehouse Mart): Another layer in the data warehouse architecture that stores subsets of data customized for specific business domains or departments. DWM data is typically extracted, aggregated, and transformed from the DWD layer. It provides a higher level of data aggregation to support analysis and reporting for specific business needs.
- DM (Data Mart): A small data warehouse targeting a specific business domain or department. Data marts typically cover a limited set of subjects to more quickly and efficiently support data analysis and reporting for specific business needs. A data mart can be standalone or part of a larger data warehouse architecture.
Business Data Warehouse Case Study
Data warehouse modeling: business → domain → logical → physical.
The practical design process consists of seven steps: map business processes, define subject domains, organize metrics, investigate entity relationships, define dimensions, design layering, and build the data warehouse.
- Map Business Processes
Identify: who, at which stage, performs what key action, and produces what result.
Locate data nodes within the existing business flow.
- Define Subject Domains
A subject domain is an atomic, indivisible behavioral event in the business process.
For example, the subscription subject domain may include pre-approval, subscription, and confirmation of subscription results.
Subject domain definitions should cover all business requirements as comprehensively as possible.
- Organize Metrics
Objective: establish a unified language and unified calibration.
- Analyze Entity Relationships
Modeling approach: describe the business from an end-to-end perspective using an entity-relationship model.
Use the existing entity-relationship diagram to determine which DWD tables need to be built.
- Define Dimensions
Objective: build consistent shared tables and avoid dimension concepts scattered throughout the business data warehouse.
Examples: account type, entry type, event type, etc.
Benefits: facilitates unified construction of dimension tables and simplifies subsequent dimension modeling.
- Layering Design
The core principle is progressive decoupling across layers.
The deeper the layer, the closer it is to raw business records.
The higher the layer, the closer it is to business objectives.
ODS stores raw data with an unlimited default retention period.
DWD is built based on the entity-relationship diagram and faithfully reflects the business. It must be stable. This layer is the critical gateway into the data warehouse and requires strict data quality control.
DWM is designed according to dimensional modeling theory (e.g., star schema, snowflake schema) to effectively improve data usability.
DM is designed as a wide-table model — trading space for time to improve the efficiency of product metric retrieval.
- Physical Modeling
The actual coding phase. Key points:
Design first: model the table structure in a Tencent document first, prioritize review, and avoid rework.
Based on the reviewed table models, configure computation tasks and task dependencies on the xdata platform.
Task names and table names should ideally be identical.
Lifecycle management: ODS should retain all historical data as much as possible; for DWD/DWM/DM, set lifecycle policies — financial entity data is typically retained for 5 years, user behavior data for 3 years.
How to improve data quality:
Key principle: detect early, recover early.
Add validation tasks and build end-to-end data monitoring views.
DWD Table Cleansing
Purpose: Efficiently and reliably support business data needs. Downstream consumers must be provided with tables at the DWD layer or above — not raw ODS source data.
Cleansing requirements:
-
Deduplicate ODS data (duplicate data complicates upstream processing and causes data skew). Configure data uniqueness quality alerts on DWD.
-
Base64 decoding (Base64 is a method for encoding binary data as ASCII text. It uses 64 characters (A–Z, a–z, 0–9, +, /) plus an optional padding character (=). Binary data is divided into groups of 3 bytes (24 bits), which are split into four 6-bit chunks, each corresponding to one Base64 character. If the binary data length is not a multiple of 3, padding characters are added.)
-
Encrypted sensitive information may be decrypted during computation but must not be stored in decrypted form.
-
Normalize timestamp formats, e.g.,
"yyyy-mm-dd hh:mm:ss", with a default value of"1970-01-01 08:00:00". -
All structured/attribute fields must have their schema expanded. Any NULL values must be discussed with the product team before handling.
-
Summary: A DWD table is a business analysis table that abstracts away technical implementation details (e.g., auto-increment IDs, checksums, raw timestamps, and other purely technical fields must not enter the DWD layer).
Requirements:
- Configure alerts (quality alerts, failure/delay alerts).
- Complete data dictionaries (table description, domain, owner, design documentation, field descriptions, and other metadata).
- Include the table in the main data warehouse pipeline.