Normalizing the Database
Go Up to Establishing Relationships between Objects
Contents
After your tables, columns, and keys are defined, look at the design as a whole and analyze it using normalization guidelines in order to find logical errors. As mentioned in the overview, normalization involves breaking down larger tables into smaller ones in order to group data together that is naturally related.
A detailed explanation of the normal forms are out of the scope of this document. There are many excellent books on the subject on the market.
When a database is designed using proper normalization methods, data related to other data does not need to be stored in more than one place—if the relationship is properly specified. The advantages of storing the data in one place are:
- The data is easier to update or delete.
- When each data item is stored in one location and accessed by reference, the possibility for error due to the existence of duplicates is reduced.
- Because the data is stored only once, the possibility for introducing inconsistent data is reduced.
In general, the normalization process includes::
- Eliminating repeating groups.
- Removing partially-dependent columns.
- Removing transitively-dependent columns.
An explanation of each step follows.
Eliminating Repeating Groups (Normalizing the Database)
When a field in a given row contains more than one value for each occurrence of the primary key, then that group of data items is called a repeating group. This is a violation of the first normal form, which does not allow multi-valued attributes.
Refer to the DEPARTMENT
table. For any occurrence of a given primary key, if a column can have more than one value, then this set of values is a repeating group. Therefore, the first row, where DEPT_NO
= “100”, contains a repeating group in the DEPT_LOCATIONS
column.
DEPARTMENT table
DEPT_NO | DEPARTMENT | HEAD_DEPT | BUDGET | DEPT_LOCATIONS |
---|---|---|---|---|
100 |
Sales |
000 |
1000000 |
Monterey, Santa Cruz, Salinas |
600 |
Engineering |
120 |
1100000 |
San Francisco |
900 |
Finance |
000 |
400000 |
Monterey |
In the next example, even if you change the attribute to represent only one location, for every occurrence of the primary key “100”, all of the columns contain repeating information except for DEPT_LOCATION
, so this is still a repeating group.
DEPARTMENT table - Repeating Group
DEPT_NO | DEPARTMENT | HEAD_DEPT | BUDGET | DEPT_LOCATION |
---|---|---|---|---|
100 |
Sales |
000 |
1000000 |
Monterey |
100 |
Sales |
000 |
1000000 |
Santa Cruz |
600 |
Engineering |
120 |
1100000 |
San Francisco |
100 |
Sales |
000 |
1000000 |
Salinas |
To normalize this table, we could eliminate the DEPT_LOCATION
attribute from the DEPARTMENT
table, and create another table called DEPT_LOCATIONS
. We could then create a primary key that is a combination of DEPT_NO
and DEPT_LOCATION
. Now a distinct row exists for each location of the department, and we have eliminated the repeating groups.
DEPT_LOCATIONS table
DEPT_NO | DEPT_LOCATION |
---|---|
100 |
Monterey |
100 |
Santa Cruz |
600 |
San Francisco |
100 |
Salinas |
Removing Partially-dependent Columns (Normalizing the Database)
Another important step in the normalization process is to remove any non-key columns that are dependent on only part of a composite key. Such columns are said to have a partial key dependency. Non-key columns provide information about the subject, but do not uniquely define it.
For example, suppose you wanted to locate an employee by project, and you created the PROJECT
table with a composite primary key of EMP_NO
and PROJ_ID
.
PROJECT table
EMP_NO | PROJ_ID | LAST_NAME | PROJ_NAME | PROJ_DESC | PRODUCT |
---|---|---|---|---|---|
44 |
DGPII |
Smith |
Automap |
blob data |
hardware |
47 |
VBASE |
Jenner |
Video database |
blob data |
software |
24 |
HWRII |
Stevens |
Translator upgrade |
blob data |
software |
The problem with this table is that PROJ_NAME
, PROJ_DESC
, and PRODUCT
are attributes of PROJ_ID
, but not EMP_NO
, and are therefore only partially dependent on the EMP_NO
/PROJ_ID
primary key. This is also true for LAST_NAME
because it is an attribute of EMP_NO
, but does not relate to PROJ_ID
. To normalize this table, we would remove the EMP_NO
and LAST_NAME
columns from the PROJECT
table, and create another table called EMPLOYEE_PROJECT
that has EMP_NO
and PROJ_ID
as a composite primary key. Now a unique row exists for every project that an employee is assigned to.
Removing Transitively-dependent Columns (Normalizing the Database)
The third step in the normalization process is to remove any non-key columns that depend upon other non-key columns. Each non-key column must be a fact about the primary key column. For example, suppose we added TEAM_LEADER_ID
and PHONE_EXT
to the PROJECT
table, and made PROJ_ID
the primary key. PHONE_EXT
is a fact about TEAM_LEADER_ID
, a non-key column, not about PROJ_ID
, the primary key column.
PROJECT table
PROJ_ID | TEAM_LEADER_ID | PHONE_EXT | PROJ_NAME | PROJ_DESC | PRODUCT |
---|---|---|---|---|---|
DGPII |
44 |
4929 |
Automap |
blob data |
hardware |
VBASE |
47 |
4967 |
Video database |
blob data |
software |
HWRII |
24 |
4668 |
Translator upgrade |
blob data |
software |
To normalize this table, we would remove PHONE_EXT
, change TEAM_LEADER_ID
to TEAM_LEADER
, and make TEAM_LEADER
a foreign key referencing EMP_NO
in the EMPLOYEE
table.
PROJECT table
PROJ_ID | TEAM_LEADER | PROJ_NAME | PROJ_DESC | PRODUCT |
---|---|---|---|---|
DGPII |
44 |
Automap |
blob data |
hardware |
VBASE |
47 |
Video database |
blob data |
software |
HWRII |
24 |
Translator upgrade |
blob data |
software |
EMPLOYEE table
EMP_NO | LAST_NAME | FIRST_NAME | DEPT_NO | JOB_CODE | PHONE_EXT | SALARY |
---|---|---|---|---|---|---|
24 |
Smith |
John |
100 |
Eng |
4968 |
64000 |
48 |
Carter |
Catherine |
900 |
Sales |
4967 |
72500 |
36 |
Smith |
Jane |
600 |
Admin |
4800 |
37500 |
When to Break the Rules
You should try to correct any normalization violations, or else make a conscious decision to ignore them in the interest of ease of use or performance. Just be sure that you understand the design trade-offs that you are making, and document your reasons. It might take several iterations to reach a design that is a desirable compromise between purity and reality, but this is the heart of the design process.
For example, suppose you always want data about dependents every time you look up an employee, so you decide to include DEP1_NAME
, DEP1_BIRTHDATE
, and so on for DEP1
through DEP30
, in the EMPLOYEE
table. Generally speaking, that is terrible design, but the requirements of your application are more important than the abstract purity of your design. In this case, if you wanted to compute the average age of a given employee’s dependents, you would have to explicitly add field values together, rather than asking for a simple average. If you wanted to find all employees with a dependent named “Jennifer,” you would have to test 30 fields for each employee instead of one. If those are not operations that you intend to perform, then go ahead and break the rules. If the efficiency attracts you less than the simplicity, you might consider defining a view that combines records from employees with records from a separate DEPENDENTS
table.
While you are normalizing your data, remember that InterBase offers direct support for array columns, so if your data includes, for example, hourly temperatures for twenty cities for a year, you could define a table with a character column that contains the city name, and a 24 by 366 matrix to hold all of the temperature data for one city for one year. This would result in a table containing 20 rows (one for each city) and two columns, one NAME
column and one TEMP_ARRAY
column. A normalized version of that record might have 366 rows per city, each of which would hold a city name, a Julian date, and 24 columns to hold the hourly temperatures.