Database design & development
Bits and pieces of my learning how to design and develop a database
This writeup is the first of many aimed at sharing my learnings on how to design and develop a database (as described in ‘Introduction to Data 0x’). This project - to build an AI model from scratch - is part of my data journey, and you are welcome for the ride.
The first task of course is to design and develop a database that’d house our AI modeling data. To this end, I asked myself what the various considerations that should go into designing and developing a database are. The following were my findings;
Choosing and designing a database schema.
Designing tables (fact & dimension).
Assigning keys (primary & foreign).
Data types.
Understanding and applying data normalization thinking.
SQL: The ‘C’ of CRUD for database development.
Plus, I believe these are the duties of a database administrator or developer. I am happy to start my data journey from the ground up.
What is a database?
A database is a structured collection of data that is stored in a computer system in a way that allows it to be easily accessed, managed, and updated. It is designed to efficiently store and retrieve large amounts of data/information and can be used by individuals or organizations to manage a variety of data types, such as customer records, inventory, financial transactions, and more. The primary purpose of creating a database schema is to ensure the integrity of the data stored in the database, so it can confidently be used by the broader business for applications such as business intelligence.
A database typically consists of one or more tables, each of which contains records or rows, with each row representing a unique instance of the data being stored. Each row is composed of one or more fields or columns, which hold specific pieces of information about the data.
What is a database schema?
A database schema defines the design architecture of the database. That is, it describes how the data is organized and connected with each other. This implies that it is not the data itself, but rather the design of the relationships between the tables within.
It is often considered the “blueprint” of a database. In essence, the schema is the Entity Relationship Diagram (ERD) of the database. A database schema may include schema objects like tables, views, fields, relationships, packages, indexes, types, and many other elements.
Schema diagrams are particularly helpful in helping data analysts understand the keys that they should join tables on.
Schema designs
Star Schema
A star schema is a type of relational database schema composed of a single, central fact table, surrounded by dimension tables. This tends to be considered a simpler schema compared to the snowflake schema. Star schema design looks something like this
With tables having data, it would look something like this
Snowflake Schema
A snowflake schema consists of one fact table that is connected to many dimension tables, which can be connected to other dimension tables through a many-to-one relationship. A snowflake schema design looks something like this
With tables having data, it would look something like this
Key differences between a Star and Snowflake design
A star schema has denormalized dimension tables, while a snowflake schema has normalized dimension tables.
A star schema is easier to design and implement than a snowflake schema.
A star schema can be more efficient to query than a snowflake schema, because there are fewer JOINs between tables.
A star schema can require more storage space than a snowflake schema, because of the denormalized data.
A star schema can be more difficult to update than a snowflake schema, because of the denormalized data.
A star schema can be more difficult to troubleshoot than a snowflake schema, because of the denormalized data
Alternatively, there is a third schema design type that is relatively new. It is called the Galaxy schema or fact constellation schema.
Galaxy schema (or Fact constellation schema)
The Galaxy Data Warehouse Schema, also known as a Fact Constellation Schema, acts as the next iteration of the data warehouse schema. Unlike the Star Schema and Snowflake Schema, the Galaxy Schema uses multiple fact tables connected with shared normalized dimension tables. Galaxy Schema can be thought of as star schema interlinked and completely normalized, avoiding any kind of redundancy or inconsistency of data.
Fact & Dimension tables
So far, in trying to understand how to design a database schema, we have been seeing facts and dimension tables a lot. So what are they?
Fact table
Fact tables = Quantitative data
A fact table contains metrics, measurements, facts, or actual events about business processes such as sales, inventory, production, or financial transactions. Fact tables are usually large and contain millions of rows, and are optimized for querying and reporting rather than for data entry or updates.
A fact table typically contains foreign keys that connect it to various dimensions tables, which provide context and descriptive information about the facts. For example, a sales fact table might contain foreign keys to a time dimension, a product dimension, a customer dimension, and a store dimension. By joining the fact table with the relevant dimensions, analysts can gain insights into the performance of the business process across the various dimensions.
Dimension table
Dimension tables = Qualitative data
A dimension table is a table that contains descriptive attributes or characteristics of the objects or events being analyzed, such as customers, products, locations, time periods, and other dimensions that are relevant to the business process. Dimension tables provide context for the quantitative data stored in the fact table, and allow users to slice and dice the data along various dimensions for analysis and reporting. For example, a customer dimension table might contain columns such as customer name, address, age, gender, and other demographic or behavioral attributes.
Dimension tables are usually smaller than fact tables and are used to provide context and meaning to the quantitative data in the fact table. They are often used to create hierarchies, roll-ups, and drill-downs that allow analysts to navigate through the data along various dimensions.
Dimension tables are not joined to each other. Instead, they are joined to fact tables. Dimension tables are also usually denormalized and optimized for querying and reporting rather than for data entry or updates.
Primary & Foreign keys
Primary key
A primary key is a unique identifier for each row in a database table. It is a specific field or set of fields within the table that is chosen to uniquely identify each row in the table. The primary key is used to enforce data integrity, to ensure that each row is unique and can be easily referenced, identified, and updated.
In database design, it is important to choose the right primary key for each table. The primary key should be a field or set of fields that is unlikely to change, and that is guaranteed to be unique for each row. Common choices for primary keys include ID numbers, social security numbers, or other unique identifiers.
By setting a primary key for a table, the database management system can enforce constraints to ensure that the primary key value is unique for each row, and that it is not null. This helps to maintain the integrity of the data in the table and ensures that the data can be easily queried and updated.
Foreign key
A foreign key is a field or set of fields in a database table that refers to the primary key of another table. It is used to establish a relationship between two tables, such that data in one table (the dimension table) refers to data in another table (the fact table).
The foreign key is typically a field or set of fields in the child table that corresponds to the primary key of the parent table. This relationship allows data to be connected across multiple tables, making it possible to retrieve related data from different tables with a single query.
For example, consider a database that stores information about customers and their orders. The customers table might have a primary key called "customer_id", and the orders table might have a foreign key called "customer_id" that refers back to the primary key in the customers table. This allows data about each customer's orders to be retrieved by joining the two tables together on the "customer_id" field.
By using foreign keys, the database management system can enforce referential integrity, which ensures that data in the child (dimension) table always refers to valid data in the parent (fact) table. If an attempt is made to insert data into the child table that references a non-existent record in the parent table, the database management system will raise an error, preventing the invalid data from being inserted.
Data normalization
DATA NORMALIZATION. What is it?
. Faster query performance due to fewer JOINS
. More data tables due to normalization hence more storage space is needed
. Makes DB difficult to troubleshoot and update
Normalization is a technique for organizing the data into multiple related tables to minimize Data Redundancy and Data Inconsistency. It aims to eliminate anomalies in data.
Database normalization is the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity.
This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.
Redundant data wastes disk space and creates maintenance problems. If data that exists in more than one place must be changed, the data must be changed in exactly the same way in all locations. A customer address change is much easier to implement if that data is stored only in the Customers table and nowhere else in the database.
Database normalization is the process of organizing a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by Edgar F. Codd.
There are a few rules for database normalization. Each rule is called a "normal form." If the first rule is observed, the database is said to be in "first normal form." If the first three rules are observed, the database is considered to be in "third normal form." Although other levels of normalization are possible, third normal form is considered the highest level necessary for most applications.
The 5 Rules of Data normalization
Eliminate Repeating Groups.
Eliminate Redundant Data.
Eliminate Columns Not Dependent on Key.
Isolate Independent Multiple Relationships.
Isolate Semantically Related Multiple Relationships.
1 NF (First Normal Form)
2 NF (Second Normal Form)
3 NF (Third Normal Form)
BCNF (Boyce -Codd Normal Form)
4 NF (Fourth Normal Form)
5 NF (Fifth Normal Form)
6 NF (Sixth Normal Form)
4NF to 6NF applies to multivalued dependencies and complex table scenarios
First Normal Form (1NF)
The most basic form of data normalization is 1NFm which ensures there are no repeating entries in a group. To be considered 1NF, each entry must have only one single value for each cell and each record must be unique.
For example, you are recording the name, address, gender of a person, and if they bought cookies.
Second Normal Form (2NF)
Again working to ensure no repeating entries, to be in the 2NF rule, the data must first apply to all the 1NF requirements. Following that, data must have only one primary key. To separate data to only have one primary key, all subsets of data that can be placed in multiple rows should be placed in separate tables. Then, relationships can be created through new foreign key labels.
For example, you are recording the name, address, gender of a person, if they bought cookies, as well as the cookie types. The cookie types are placed into a different table with a corresponding foreign key to each person’s name.
Third Normal Form (3NF)
For data to be in this rule, it must first comply with all the 2NF requirements. Following that, data in a table must only be dependent on the primary key. If the primary key is changed, all data that is impacted must be put into a new table.
For example, you are recording the name, address, and gender of a person but go back and change the name of a person. When you do this, the gender may then change as well. To avoid this, in 3NF gender is given a foreign key and a new table to store gender.
Data types
It is of paramount importance to consider the data types for each field in your tables also, as this affects the database storage size and performance. Here are some of the key data type considerations:
Data size: You'll want to choose a data type that is appropriately sized for the data you're storing. For example, if you're storing a small integer value, you might use a "tinyint" data type instead of a larger "int" data type.
Data format: Certain data types are designed to store specific types of data, such as dates, times, or currency values. Using the appropriate data type can help ensure that your data is formatted correctly and can be easily searched and sorted.
Data validation: Some data types have built-in validation rules that can help prevent data entry errors. For example, a "date" data type might only allow dates in a specific format, while a "numeric" data type might only allow numbers.
Indexing: Choosing the right data type can also affect how efficiently your data can be searched and sorted. Some data types, such as "int" or "varchar," can be indexed for faster searches.
Compatibility: Finally, it's important to consider the compatibility of your chosen data types with any applications or systems that will be accessing your database. If you're using a specific programming language or database management system, you may need to choose data types that are supported by those tools.
That said, here is a list of commonly used SQL data types, along with their description and typical sizes in bytes:
Numeric Data Types:
INT/INTEGER: whole numbers that can be signed or unsigned. Typically takes up 4 bytes.
BIGINT: larger whole numbers that can be signed or unsigned. Typically takes up 8 bytes.
DECIMAL/NUMERIC: fixed-point numbers that can store decimal values. The size of these types varies based on the precision and scale specified.
Character and String Data Types:
CHAR: fixed-length character strings. The size is determined by the number of characters specified in the definition.
VARCHAR/VARCHAR2: variable-length character strings. The size is determined by the number of characters entered.
TEXT: used for large text strings. The size varies depending on the amount of text stored.
Date and Time Data Types:
DATE: used to store dates. Typically takes up 3 bytes.
TIME: used to store times. Typically takes up 3-5 bytes.
TIMESTAMP: used to store date and time values. The size varies depending on the precision specified.
Binary Data Types:
BLOB: used to store binary data such as images or audio files. The size varies depending on the amount of data stored.
BYTEA: used to store binary data as a variable-length array of bytes. The size varies depending on the amount of data stored.
Boolean Data Types:
BOOLEAN: used to store boolean values, which can be either true or false, yes or no. Typically takes up 1 byte.
Other Data Types:
ENUM: used to store a fixed set of values. The size varies depending on the number of values stored.
UUID: used to store unique identifiers. Typically takes up 16 bytes.
JSON: used to store JSON-formatted data. The size varies depending on the amount of data stored.
It's important to note that the size of a data type can vary depending on the database system being used, the storage engine being used, and the configuration of the database. It's also important to choose the appropriate data type for the data being stored to ensure accuracy and consistency.
SQL (Create) Functions
The SQL functions you are referring to are actually SQL commands that are used to create and manipulate tables in a database. Here are some of the major SQL commands used in database development:
SELECT: This command is used to retrieve data from one or more tables in a database. It specifies the columns to retrieve and the table or tables to retrieve them from.
CREATE [TABLE]: This command is used to create a new table in a database. It specifies the table name and the names and data types of the columns that will be in the table.
INSERT: This command is used to insert new rows of data into a table in a database.
UPDATE: This command is used to modify existing rows of data in a table in a database.
DELETE: This command is used to delete existing rows of data from a table in a database.
DROP TABLE: This command is used to delete an existing table from a database.
ALTER TABLE: This command is used to modify the structure of an existing table. It can be used to add, modify, or delete columns, constraints, or indexes.
JOIN: This command is used to combine data from two or more tables in a database based on a common column or set of columns.
What are constraints?
A constraint is an SQL object that helps define the set of valid values in the table in various ways. Constraints in database development are rules that are applied to a table or column to ensure that the data being stored in it is accurate, consistent, and conforms to certain criteria. These rules can be applied to individual columns or to the table as a whole, and can be used to enforce various data integrity rules.
There are two ways to define constraints: table constraints and column constraints.
A column constraint is defined as part of a column definition. A table constraint definition is not tied to a particular column, and it can encompass more than one column. Every column constraint can also be written as a table constraint; a column constraint is only a notational convenience for use when the constraint only affects one column.
There are several types of constraints that can be applied to a database table, including:
Primary Key Constraint: A primary key constraint is used to ensure that each row in a table is uniquely identified by a particular column or set of columns. This constraint ensures that no two rows have the same primary key value.
Foreign Key Constraint: A foreign key constraint is used to enforce a relationship between two tables. It ensures that data entered in one table matches the data in another table, typically the primary key of another table.
Unique Constraint: A unique constraint is used to ensure that the data in a column or set of columns is unique across all rows in the table. This constraint can be used in place of a primary key constraint when a table does not have a natural primary key.
Check Constraint: A check constraint is used to enforce specific conditions or rules on the data being entered into a column or set of columns. It can be used to ensure that data entered into a column is within a certain range or matches a certain pattern.
Not Null Constraint: A not null constraint is used to ensure that a column cannot contain null values. This constraint requires that data be entered into the column for each row in the table.
Constraints also help prevent errors and inconsistencies that can arise from incorrect or inconsistent data entry, which can have a significant impact on the reliability and usefulness of the data.
By applying constraints to a database table, developers can help ensure the accuracy and consistency of the data being stored in the database.
Conclusion
As one begins to better understand the normalization forms, the rules will become more clear while separating your data into tables and levels will become effortless. These tables will then make it simple for anyone within an organization to gather information and ensure they collect correct data that is not duplicated.