Data Normalization

Ofili Lewis
10 min readJan 26, 2023
Photo by Colton Sturgeon on Unsplash

Data normalization is the process of organizing data in a consistent and uniform format to ensure that it is accurate, reliable, and easy to understand. This process is essential for data analysis, modeling, and decision-making, as it helps to eliminate data inconsistencies and errors, and makes data more accessible and usable. In this article, we will take an in-depth look at the data normalization process, including the different types of data normalization, the benefits of data normalization, and how to implement data normalization in Python.

There are several types of data normalization, but the most commonly used are first normal form (1NF), second normal form (2NF), third normal form (3NF), and Boyce-Codd normal form (BCNF). The first normal form (1NF) is the most basic form of normalization and it ensures that data is stored in a single table and is atomic (i.e., the data is indivisible and cannot be broken down into smaller parts). The second normal form (2NF) is a more advanced form of normalization and it ensures that data is non-redundant and that each non-key column is dependent on the primary key. The third normal form (3NF) is an even more advanced form of normalization and it ensures that data is non-transitively dependent on the primary key. The Boyce-Codd normal form (BCNF) is the most advanced form of normalization and it ensures that data is in third normal form and that any non-trivial functional dependencies in the data are based on the primary key.

The benefits of data normalization include:

  • Reducing data redundancy and inconsistencies
  • Increasing data accuracy and reliability
  • Making data more accessible and usable
  • Improving data analysis and decision-making
  • Simplifying data management and maintenance

First Normal Form (1NF)

First Normal Form (1NF) is the simplest and most basic form of data normalization. The goal of 1NF is to eliminate duplicate data and organize data into a single table, with each row representing a unique piece of data and each column representing a different attribute of that data.

In order to achieve 1NF, a table must meet the following criteria:

  • Each row must have a unique identifier, known as a primary key
  • Each column must contain atomic values, meaning that the data in the column cannot be further divided
  • There can be no repeating groups of data within a column

For example, let’s consider a table of customer orders. The table may contain information such as the customer’s name, the product ordered, the quantity, and the price.

| Order ID | Customer Name | Product | Quantity | Price |
|----------|---------------|---------|----------|-------|
| 1 | John Smith | T-Shirt | 3 | $20 |
| 2 | Jane Doe | Hat | 1 | $15 |
| 3 | John Smith | Shoes | 2 | $50 |

This table would not be in 1NF because it does not have a unique identifier (primary key), and the “Customer Name” column contains repeating data. To bring it to 1NF, we could add an “Order ID” column as a primary key, and create a separate “Customers” table to store customer information.

Customers Table:
| Customer ID | Customer Name |
|-------------|---------------|
| 1 | John Smith |
| 2 | Jane Doe |

Orders Table:
| Order ID | Customer ID | Product | Quantity | Price |
|----------|-------------|---------|----------|-------|
| 1 | 1 | T-Shirt | 3 | $20 |
| 2 | 2 | Hat | 1 | $15 |
| 3 | 1 | Shoes | 2 | $50 |

Now the Orders table only contains atomic values and has a primary key, it is in first normal form.

It’s important to note that achieving 1NF is just the first step in data normalization, and it may be necessary to further normalize the data to meet the requirements of a specific use case. However, 1NF is an essential starting point for any data normalization process, and it provides a foundation for more advanced forms of data normalization such as Second Normal Form (2NF), Third Normal Form (3NF), and so on.

Second Normal Form (2NF)

Second Normal Form (2NF) builds upon the principles of First Normal Form (1NF) by further reducing data redundancy and ensuring that the data is fully functional. In order to achieve 2NF, a table must meet the following criteria:

  • The table must already be in First Normal Form (1NF)
  • The table must not have any partial dependencies, meaning that non-primary key columns are dependent on only a part of the primary key

In other words, 2NF is achieved when all non-primary key columns in a table are functionally dependent on the entire primary key. To understand this concept, let’s consider the following example:

| Order ID | Customer ID | Product | Quantity | Price |
|----------|-------------|---------|----------|-------|
| 1 | 1 | T-Shirt | 3 | $20 |
| 2 | 2 | Hat | 1 | $15 |
| 3 | 1 | Shoes | 2 | $50 |

This table is in 1NF as it has a unique primary key “Order ID” and contains atomic values. However, it is not in 2NF because the non-primary key column “Price” is only dependent on the “Product” column, and not the entire primary key. To achieve 2NF, we could create a separate “Products” table to store product information, including the price.

Customers Table:
| Customer ID | Customer Name |
|-------------|---------------|
| 1 | John Smith |
| 2 | Jane Doe |

Products Table:
| Product ID | Product | Price |
|------------|---------|-------|
| 1 | T-Shirt | $20 |
| 2 | Hat | $15 |
| 3 | Shoes | $50 |

Orders Table:
| Order ID | Customer ID | Product ID | Quantity |
|----------|-------------|------------|----------|
| 1 | 1 | 1 | 3 |
| 2 | 2 | 2 | 1 |
| 3 | 1 | 3 | 2 |

Now the “Orders” table is in 2NF because all non-primary key columns are functionally dependent on the entire primary key (Order ID).

It’s worth noting that achieving 2NF is just one step in the normalization process, and it may be necessary to further normalize the data to meet the specific requirements of a use case. However, by normalizing data to 2NF, we have greatly reduced data redundancy and ensured that the data is fully functional. This can greatly improve the performance and maintainability of a database, as well as minimize data inconsistencies and errors.

In SQL, we can use the following command to check if a table is in 2NF:

SELECT 
t1.column_name, t1.ordinal_position, t1.constraint_name
FROM
information_schema.key_column_usage t1
JOIN information_schema.table_constraints t2 ON
t1.table_name = t2.table_name AND
t1.constraint_name = t2.constraint_name
WHERE
t2.constraint_type = 'PRIMARY KEY' AND
t1.table_name = '<table_name>';

This command will return the column names, ordinal positions, and constraint names of the primary key in the specified table. From this, we can determine if there are any partial dependencies and if the table is in 2NF.

Third Normal Form (3NF)

Normalizing data to the third normal form (3NF) is the next step in the process of data normalization. The main objective of 3NF is to further minimize data redundancy and ensure that the data is fully and non-transitively dependent on the primary key.

A table is in 3NF if it is in 2NF and it has no transitive dependencies. A transitive dependency exists when a non-primary key column is dependent on another non-primary key column. For example, if a table has a column “Address” that is dependent on a column “City”, and a column “City” that is dependent on a column “Country”, the “Address” column is transitively dependent on the “Country” column, and the table is not in 3NF.

To create a table in 3rd Normal Form (3NF) in SQL, we can follow these steps:

  1. Identify the primary key of the table. This can be a single column or a combination of columns.
  2. Identify all non-prime attributes, i.e. columns that are not part of the primary key.
  3. For each non-prime attribute, check if it has any non-trivial functional dependencies on the primary key. A functional dependency is non-trivial if the dependent attribute is not a part of the primary key.
  4. If a non-prime attribute has a non-trivial functional dependency on the primary key, remove it from the table and create a new table with the dependent attribute as the primary key and a foreign key referencing the primary key of the original table.
  5. Repeat steps 3 and 4 for all non-prime attributes until the table is in 3NF.

For example, let’s say we have a table called “orders” with the following columns: “order_id”, “customer_id”, “product_id”, “quantity”, and “price”. The primary key is “order_id”.

CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
product_id INT,
quantity INT,
price DECIMAL(10,2)
);

The non-prime attributes are “customer_id”, “product_id”, “quantity”, and “price”. Let’s say “quantity” has a non-trivial functional dependency on “product_id” and “price” has a non-trivial functional dependency on “order_id”, “product_id” and “quantity”. We would need to remove “quantity” and “price” from the “orders” table and create new tables “order_items” and “order_prices” with functional dependencies as primary keys and foreign keys referencing “orders” table primary key “order_id”

CREATE TABLE order_items (
order_id INT,
product_id INT,
quantity INT,
PRIMARY KEY(order_id, product_id),
FOREIGN KEY (order_id) REFERENCES orders(order_id)
);

CREATE TABLE order_prices (
order_id INT,
product_id INT,
price DECIMAL(10,2),
PRIMARY KEY(order_id, product_id),
FOREIGN KEY (order_id, product_id) REFERENCES order_items(order_id, product_id)
);

This will ensure that the table is in 3NF and eliminates any non-trivial functional dependencies.

In SQL, we can use the following command to check if a table is in 3NF:

SELECT 
t1.column_name, t1.ordinal_position, t1.constraint_name
FROM
information_schema.key_column_usage t1
JOIN information_schema.table_constraints t2 ON
t1.table_name = t2.table_name AND
t1.constraint_name = t2.constraint_name
WHERE
t2.constraint_type = 'PRIMARY KEY' AND
t1.table_name = '<table_name>';

This command will return the column names, ordinal positions, and constraint names of the primary key in the specified table. From this, we can determine if there are any non-primary key columns that are dependent on other non-primary key columns and if the table is in 3NF.

Boyce-Codd normal form (BCNF)

Boyce-Codd normal form (BCNF) is a higher level of normalization than 3NF. It is designed to handle certain types of anomalies that are not handled by 3NF. A table is in BCNF if it is in 3NF and it has no non-trivial functional dependencies on any non-prime attribute. A non-prime attribute is a non-primary key attribute.

A functional dependency is trivial if the dependent attribute is a part of the primary key. A functional dependency is non-trivial if the dependent attribute is not a part of the primary key. For example, in a table with columns “Employee ID”, “Employee Name”, and “Department”, the functional dependency “Employee ID -> Employee Name” is trivial, because “Employee ID” is a part of the primary key. However, the functional dependency “Department -> Employee Name” is non-trivial, because “Department” is not a part of the primary key.

To create a table in Boyce-Codd Normal Form (BCNF) in SQL, we can follow these steps:

  1. Identify the primary key of the table. This can be a single column or a combination of columns.
  2. Identify all non-prime attributes, i.e. columns that are not part of the primary key.
  3. For each non-prime attribute, check if it has any non-trivial functional dependencies on the primary key. A functional dependency is non-trivial if the dependent attribute is not a part of the primary key.
  4. If a non-prime attribute has a non-trivial functional dependency on a subset of the primary key, remove it from the table and create a new table with the dependent attribute(s) as the primary key and a foreign key referencing the primary key of the original table.
  5. Repeat steps 3 and 4 for all non-prime attributes until the table is in BCNF.

For example, let’s say we have a table called “orders” with the following columns: “order_id”, “customer_id”, “product_id”, “quantity”, and “price”. The primary key is “order_id”.

CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
product_id INT,
quantity INT,
price DECIMAL(10,2)
);

The non-prime attributes are “customer_id”, “product_id”, “quantity”, and “price”. Let’s say “quantity” has a non-trivial functional dependency on “product_id” and “price” has a non-trivial functional dependency on “order_id”, “product_id”. We would need to remove “quantity” and “price” from the “orders” table and create new tables “order_items” and “order_prices” with functional dependencies as primary keys and foreign keys referencing “orders” table primary key “order_id”

CREATE TABLE order_items (
order_id INT,
product_id INT,
quantity INT,
PRIMARY KEY(order_id, product_id),
FOREIGN KEY (order_id) REFERENCES orders(order_id)
);

CREATE TABLE order_prices (
order_id INT,
product_id INT,
price DECIMAL(10,2),
PRIMARY KEY(order_id, product_id),
FOREIGN KEY (order_id, product_id) REFERENCES order_items(order_id, product_id)
);

This will ensure that the table is in BCNF and eliminates any non-trivial functional dependencies. It’s worth mentioning that BCNF is a stronger normalization form than 3NF, and it is harder to achieve because it requires a strict condition to be met, which is that all non-prime attributes must be fully functionally dependent on the primary key.

Another normal form that is related to BCNF is the 4th Normal Form (4NF) and 5th Normal Form (5NF), also known as Projection-Join Normal Form (PJ/NF) and Domain-Key Normal Form (DK/NF). These normal forms are even more restrictive than BCNF, and they are rarely used in practice because they can lead to a large number of tables with complex relationships.

In SQL, we can use the following command to check if a table is in BCNF:

SELECT 
t1.column_name, t1.ordinal_position, t1.constraint_name
FROM
information_schema.key_column_usage t1
JOIN information_schema.table_constraints t2 ON
t1.table_name = t2.table_name AND
t1.constraint_name = t2.constraint_name
WHERE
t2.constraint_type = 'PRIMARY KEY' AND
t1.table_name = '<table_name>';

This command will return the column names, ordinal positions, and constraint names of the primary key in the specified table. From this, we can determine if there are any non-prime attributes that have non-trivial functional dependencies and if the table is in BCNF.

It’s important to note that normalization is not always the best approach for every situation. There are trade-offs to be made, such as increased complexity and performance issues when dealing with a large number of tables. Therefore, it’s important to understand the specific requirements of your project and weigh the pros and cons of normalization before implementing it.

In summary, data normalization is a process of organizing data in a way that minimizes data redundancy and improves data integrity. There are several normal forms, such as 1NF, 2NF, 3NF, and BCNF, each with its own set of rules and guidelines. Normalization helps to ensure that data is accurate, consistent, and easy to maintain, but it’s not always the best solution for every project. It’s important to understand the specific requirements of your project and weigh the pros and cons of normalization before implementing it.

--

--

Ofili Lewis

Transforming and making data more accessible so that organizations can use it to evaluate and optimize performance.