Introduction to Data Lineage
Data Lineage enables you to document the movement of data from point A to point B, and any intermediate steps in between. This movement is sometimes referred to as Extraction, Transformation and Load (ETL). Points A and B can be anything from flat files, high-end databases like Oracles and DB2, XML files, Access databases, and Excel worksheets. This is sometimes referred to as source and target mapping. A model produced in ER/Studio can represent any point along the way. Data Architects need the ability to specify the "source" or "target" of data, down to the column/attribute level. Along with the metadata that defines the source and target mapping are rules for how the data is manipulated along the way.
The task of moving the data is often assigned to specialized developers who have little data modeling experience. They need good documentation to guide them in deciding what data goes where. Documenting the movement or mappings for the developers is usually the task of a data architect. Data Architects need to track data between very complex systems. For example, an organization may have a data warehouse that consists of data sourced from a number of different systems, such as CRM (Customer Relationship Management), payroll, general ledger, accounting, product, and inventory systems. The data warehouse may in turn have on-line data marts that receive data from it to produce reports for Sales Executives, HR Executives and marketing teams that are generating reports about the performance of various aspects of the business.
The following illustrates a generalized data movement process:
In this illustration, the data is sourced from various systems on the left and fed to a data warehouse that stores the data in a format that is more conducive to reporting. This reduces the amount of overhead on the source systems so resources are not used for reporting directly on them. The data also must be cleansed to ensure the quality of the data used for reports. Reports can be generated directly from the Enterprise Data Warehouse (EDW), but many times are fed to other specialized data marts targeted for specific audiences. Further cleansing or manipulation of the data is performed in the process of moving it to the Online Data Manager (ODM). An organization using ERStudio would most likely have a model of each part of this process, the Customer Relation Management (CRM), General Ledger/Accounting, Orders, Inventory, EDW, and ODMs. It is the job of the data architect to "map" these systems together. They manage the models and need to define the mappings which produce reports that assist the data warehouse engineers. An ETL developer feeding data into one of the data marts needs to know what columns and tables to get the data from in the EDW. For consistency sake, they may also need the mappings from the legacy or Online Transaction Processing (OLTP) systems.
ER/Studio Data Architect provides several ways for you to document how your data moves from system to system.
In Reports and generated SQL you can view the ETL information you have specified as follows:
- In the Table editor, you can document how often the data is extracted along with the rules that regulate data movement
- In the Table Column editor, you can document how the data is transformed as it moves from source to target, known as source-to-target mapping.
Using the Data Lineage tab you can create a visualization of the data movement and transformation, so you can see the relationships between the source and target, how the data flows from one table to another, and how the data is transformed.