SQL Normalization: Unraveling the Fundamentals

Welcome to this comprehensive exploration of SQL Normalization. In this blog, we'll delve into the concept of normalization, its different forms, benefits, and a few examples to put these concepts into perspective.

Introduction to SQL Normalization

link to this section

Normalization in SQL is a database design technique that organizes tables in a manner that reduces redundancy and dependency of data. It divides larger tables into smaller tables and links them using relationships. The inventor of the relational model, Edgar F. Codd, introduced normalization and its forms are based on the understanding of functional dependency.

Objective of Normalization

The main objective of normalization is to eliminate redundant (repeated) data, which in turn prevents data manipulations anomalies and saves storage. Normalization rules are divided into several forms, each with an increasing level of complexity.

The Normal Forms

link to this section

The normalization process involves getting our data to conform to a series of 'normal forms' to ensure data integrity and reduce data redundancy. These normal forms, from first (1NF) through fifth (5NF), provide increasingly stringent rules for how the database schema should be organized.

First Normal Form (1NF)

link to this section

1NF is achieved by ensuring that each column in a table contains atomic, indivisible values, and each cell contains only one value from its attribute (column) domain. It also necessitates that all entries in a column (attribute) are of the same kind.

For example, consider a table:

StudentID Subject
1 Math, Physics
2 English, History

This table is not in 1NF because the Subject column has multiple values. A 1NF version of this table would be:

StudentID Subject
1 Math
1 Physics
2 English
2 History

Second Normal Form (2NF)

link to this section

A table is in 2NF if it is in 1NF and if all non-key attributes are fully functional dependent on the primary key. This means that all non-key attributes must depend on the entire set of primary key attributes.

Consider a table:

StudentID CourseID StudentName CourseName
1 101 John Math
2 102 Sarah English
1 103 John Physics

In this table, the primary key is (StudentID, CourseID). But the problem is that 'StudentName' is only dependent on 'StudentID', and 'CourseName' is only dependent on 'CourseID'. They are not dependent on the full set of primary keys. This is a partial dependency, which is not allowed in 2NF. So, we break this table up to remove the partial dependency:

Student table:

StudentID StudentName
1 John
2 Sarah

Course table:

CourseID CourseName
101 Math
102 English
103 Physics

StudentCourse table:

StudentID CourseID
1 101
2 102
1 103

Third Normal Form (3NF)

link to this section

A table is in 3NF if it is in 2NF and if there are no transitive functional dependencies. A transitive functional dependency is when a non-key attribute depends on another non-key attribute.

Consider a table:

StudentID StudentName CourseID CourseName CourseInstructor
1 John 101 Math Mr. Smith
2 Sarah 102 English Ms. Johnson
1 John 103 Physics Dr. Brown

In this table, the primary key is (StudentID, CourseID). But 'CourseInstructor' depends on 'CourseName', which depends on 'CourseID'. This is a transitive dependency, which is not allowed in 3NF. So, we break this table up to remove the transitive dependency:

Student table:

StudentID StudentName
1 John
2 Sarah

Course table:

CourseID CourseName CourseInstructor
101 Math Mr. Smith
102 English Ms. Johnson
103 Physics Dr. Brown

StudentCourse table:

StudentID CourseID
1 101
2 102
1 103

Boyce-Codd Normal Form (BCNF)

link to this section

A table is in BCNF if it is in 3NF and for each of its dependencies X -> Y, X is a superkey. A superkey is a set of one or more columns that can uniquely identify a row in a table.

Let's consider the following table:

CourseID CourseName CourseInstructor
101 Math Mr. Smith
102 English Ms. Johnson
103 Physics Dr. Brown
104 Math Dr. Stone

The primary key here is (CourseID, CourseName). However, 'CourseInstructor' is dependent only on 'CourseName', which is part of the primary key. Therefore, we must separate the tables like this:

Course table:

CourseID CourseName
101 Math
102 English
103 Physics
104 Math

Instructor table:

CourseName CourseInstructor
Math Mr. Smith
English Ms. Johnson
Physics Dr. Brown
Math Dr. Stone

Fourth Normal Form (4NF)

link to this section

A table is in 4NF if it is in BCNF and there are no multi-valued dependencies. A multi-valued dependency occurs when one attribute in a table depends on another, yet they are both independent of the primary keys.

Suppose we have a table:

StudentID Skill Hobby
1 Programming Fishing
1 Design Fishing
2 Design Painting
2 Design Stamp collecting

Here, a student can have multiple skills and multiple hobbies, but the skills and hobbies are independent of each other. This is a multi-valued dependency. To make it 4NF, we'd split it into two tables:

Skills table:

StudentID Skill
1 Programming
1 Design
2 Design

Hobbies table:

StudentID Hobby
1 Fishing
2 Painting
2 Stamp collecting

Fifth Normal Form (5NF) or Project-Join Normal Form (PJNF)

link to this section

A table is in 5NF or PJNF if every join dependency in the table is a consequence of the candidate keys. This is about situations where information can be re-constructed from smaller pieces of information that can be maintained with less redundancy.

Let's consider an example. Suppose we have a database of musicians, instruments they play, and bands they play for:

Musician Instrument Band
John Guitar Band1
John Bass Band2
Paul Bass Band1
Paul Drums Band2

If we decompose this into two tables:

Table1

Musician Instrument
John Guitar
John Bass
Paul Bass
Paul Drums

Table2

Musician Band
John Band1
John Band2
Paul Band1
Paul Band2

We lost the information that John plays Guitar for Band1, Bass for Band2, etc. Hence, this decomposition leads to loss of information and it's not a correct decomposition. In other words, the original table is in 5NF.

Benefits of Normalization

link to this section
  1. Avoids Data Redundancy : Normalization splits up the data into additional tables to avoid duplication.

  2. Data Consistency : By having a single update point, normalization brings in data consistency.

  3. Database Optimization : Smaller tables with reduced data tend to increase the search performance.

  4. Saves Storage Space : Normalization helps to eliminate redundant data, saving storage space.

An Example

Consider a table ‘Customer’ with fields: CustomerID, CustomerName, CustomerAddress and OrderID.

This table is not in 1NF as the OrderID field is multi-valued. To convert it into 1NF, we can divide it into two tables ‘Customer’ and ‘Order’ where CustomerID would be a common field.

The 'Customer' table now has: CustomerID, CustomerName, CustomerAddress. The 'Order' table has: CustomerID, OrderID.

Now, both tables are in 1NF. Further normalization can be applied as necessary.

Conclusion

link to this section

Normalization is a key aspect of relational database design. Proper normalization ensures that your database is efficient, flexible, and it minimizes redundancy without compromising the integrity of your data. Although the process may seem complex, the benefits are worth the effort. It's crucial to remember that normalization, like many aspects of database design, must be carefully tailored to the requirements of the system – there's no one-size-fits-all solution. Keep exploring the world of SQL with us for more insights and guides! Happy learning!