Normalizing the Database

From InterBase
Jump to: navigation, search

Go Up to Establishing Relationships between Objects

After your tables, columns, and keys are defined, look at the design as a whole and analyze it using normalization guidelines in order to find logical errors. As mentioned in the overview, normalization involves breaking down larger tables into smaller ones in order to group data together that is naturally related.

Note

A detailed explanation of the normal forms are out of the scope of this document. There are many excellent books on the subject on the market.

When a database is designed using proper normalization methods, data related to other data does not need to be stored in more than one place—if the relationship is properly specified. The advantages of storing the data in one place are:

  • The data is easier to update or delete.
  • When each data item is stored in one location and accessed by reference, the possibility for error due to the existence of duplicates is reduced.
  • Because the data is stored only once, the possibility for introducing inconsistent data is reduced.

In general, the normalization process includes::

  • Eliminating repeating groups.
  • Removing partially-dependent columns.
  • Removing transitively-dependent columns.

An explanation of each step follows.


Eliminating Repeating Groups (Normalizing the Database)

When a field in a given row contains more than one value for each occurrence of the primary key, then that group of data items is called a repeating group. This is a violation of the first normal form, which does not allow multi-valued attributes.

Refer to the DEPARTMENT table. For any occurrence of a given primary key, if a column can have more than one value, then this set of values is a repeating group. Therefore, the first row, where DEPT_NO = “100”, contains a repeating group in the DEPT_LOCATIONS column.

DEPARTMENT table
DEPT_NO DEPARTMENT HEAD_DEPT BUDGET DEPT_LOCATIONS

100

Sales

000

1000000

Monterey, Santa Cruz, Salinas

600

Engineering

120

1100000

San Francisco

900

Finance

000

400000

Monterey

In the next example, even if you change the attribute to represent only one location, for every occurrence of the primary key “100”, all of the columns contain repeating information except for DEPT_LOCATION, so this is still a repeating group.

DEPARTMENT table - Repeating Group
DEPT_NO DEPARTMENT HEAD_DEPT BUDGET DEPT_LOCATION

100

Sales

000

1000000

Monterey

100

Sales

000

1000000

Santa Cruz

600

Engineering

120

1100000

San Francisco

100

Sales

000

1000000

Salinas

To normalize this table, we could eliminate the DEPT_LOCATION attribute from the DEPARTMENT table, and create another table called ­DEPT_LOCATIONS. We could then create a primary key that is a combination of DEPT_NO and DEPT_LOCATION. Now a distinct row exists for each location of the department, and we have eliminated the repeating groups.

DEPT_LOCATIONS table
DEPT_NO DEPT_LOCATION

100

Monterey

100

Santa Cruz

600

San Francisco

100

Salinas


Removing Partially-dependent Columns (Normalizing the Database)

Another important step in the normalization process is to remove any non-key columns that are dependent on only part of a composite key. Such columns are said to have a partial key dependency. Non-key columns provide information about the subject, but do not uniquely define it.

For example, suppose you wanted to locate an employee by project, and you created the PROJECT table with a composite primary key of EMP_NO and ­PROJ_ID.

PROJECT table
EMP_NO PROJ_ID LAST_NAME PROJ_NAME PROJ_DESC PRODUCT

44

DGPII

Smith

Automap

blob data

hardware

47

VBASE

Jenner

Video database

blob data

software

24

HWRII

Stevens

Translator upgrade

blob data

software

The problem with this table is that PROJ_NAME, PROJ_DESC, and PRODUCT are attributes of PROJ_ID, but not EMP_NO, and are therefore only partially dependent on the EMP_NO/PROJ_ID primary key. This is also true for ­LAST_NAME because it is an attribute of EMP_NO, but does not relate to ­PROJ_ID. To normalize this table, we would remove the EMP_NO and ­LAST_NAME columns from the PROJECT table, and create another table called EMPLOYEE_PROJECT that has EMP_NO and PROJ_ID as a composite primary key. Now a unique row exists for every project that an employee is assigned to.


Removing Transitively-dependent Columns (Normalizing the Database)

The third step in the normalization process is to remove any non-key columns that depend upon other non-key columns. Each non-key column must be a fact about the primary key column. For example, suppose we added ­TEAM_LEADER_ID and PHONE_EXT to the PROJECT table, and made ­PROJ_ID the primary key. PHONE_EXT is a fact about TEAM_LEADER_ID, a non-key column, not about PROJ_ID, the primary key column.

PROJECT table
PROJ_ID TEAM_LEADER_ID PHONE_EXT PROJ_NAME PROJ_DESC PRODUCT

DGPII

44

4929

Automap

blob data

hardware

VBASE

47

4967

Video database

blob data

software

HWRII

24

4668

Translator upgrade

blob data

software

To normalize this table, we would remove PHONE_EXT, change TEAM_LEADER_ID to TEAM_LEADER, and make TEAM_LEADER a foreign key referencing EMP_NO in the EMPLOYEE table.

PROJECT table
PROJ_ID TEAM_LEADER PROJ_NAME PROJ_DESC PRODUCT

DGPII

44

Automap

blob data

hardware

VBASE

47

Video database

blob data

software

HWRII

24

Translator upgrade

blob data

software

EMPLOYEE table
EMP_NO LAST_NAME FIRST_NAME DEPT_NO JOB_CODE PHONE_EXT SALARY

24

Smith

John

100

Eng

4968

64000

48

Carter

Catherine

900

Sales

4967

72500

36

Smith

Jane

600

Admin

4800

37500


When to Break the Rules

You should try to correct any normalization violations, or else make a conscious decision to ignore them in the interest of ease of use or performance. Just be sure that you understand the design trade-offs that you are making, and document your reasons. It might take several iterations to reach a design that is a desirable compromise between purity and reality, but this is the heart of the design process.

For example, suppose you always want data about dependents every time you look up an employee, so you decide to include DEP1_NAME, ­DEP1_BIRTHDATE, and so on for DEP1 through DEP30, in the EMPLOYEE table. Generally speaking, that is terrible design, but the requirements of your application are more important than the abstract purity of your design. In this case, if you wanted to compute the average age of a given employee’s dependents, you would have to explicitly add field values together, rather than asking for a simple average. If you wanted to find all employees with a dependent named “Jennifer,” you would have to test 30 fields for each employee instead of one. If those are not operations that you intend to perform, then go ahead and break the rules. If the efficiency attracts you less than the simplicity, you might consider defining a view that combines records from employees with records from a separate DEPENDENTS table.

While you are normalizing your data, remember that InterBase offers direct support for array columns, so if your data includes, for example, hourly temperatures for twenty cities for a year, you could define a table with a character column that contains the city name, and a 24 by 366 matrix to hold all of the temperature data for one city for one year. This would result in a table containing 20 rows (one for each city) and two columns, one NAME column and one ­TEMP_ARRAY column. A normalized version of that record might have 366 rows per city, each of which would hold a city name, a Julian date, and 24 columns to hold the hourly temperatures.

Advance To: