Ever since E. F. Codd introduced the first relational model for storing data at IBM in 1970, the industry has picked up the database technology and used it for its competitive advantage.
The relational database management system - or RDBMS - was the default technology for storing and accessing data for a long time.
It supported transactional data storage, the building of data products, and was the go-to model for data that was used in data-driven decisioning.
Everything changed with the boom of the internet in the 2000s.
Data was produced faster, in greater quantities, and with greater varieties than ever before. Companies generated, collected, and analyzed data that was not only text and numbers as they were in the past, but also images, videos, and audio clips, produced at scale by the masses using social media.
The relational model of data turned out to be ill-equipped to deal with the velocity, volume, and variety (3Vs) of the big data era.
That’s when we witnessed the re-birth of the non-relational database model. The technology was around for a while (the 1990s), but companies uptook the non-relational databases to support the data processing needs of modern applications.
In this article, we will take a deep dive behind the scenes to better understand the different database systems and how they compare.
To answer the ultimate question: which one should you chose?
Relational database systems store data into tables that look like Excel files: each row represents one entity (for example a customer), while the columns are the attributes or characteristics for that entity (e.g. date_of_first_purchase, customer_email, etc.).
Where relational databases differ from Excel are in the optimizations under the hood that make data querying much faster.
Each table has a primary key (in the example above, “customer_id”) that uniquely identifies each record in the table. The data in the table is linked to other tables via relations (hence the name relational) consisting of primary key - foreign key pairs.
So if we wanted to check how much John (customer_id = 1) purchased, we would look for the foreign key “customer_id” belonging to John in the table “orders”:
By creating relationships between tables, relational databases avoid redundancy (Information about John is saved just in the customer table, all else is queried via primary key - foreign key pairs, hence need less space on the hard drive) and make data more resilient to corruption (higher data integrity).
Relational databases use SQL (Structured Query Language) a devoted and specialized programming language to query relational database systems.
Non-relational databases (also called NoSQL databases) refer to a family of different database types that share one commonality: they do not store data as row records in tables.
NoSQL databases were primarily built for unstructured data, such as streams of events, video data, or rich text documents - to name but a few.
There are different types of non-relational databases and each type is optimized for storing a different data structure that is not relational (read: table-like) in its essence.
There are four main types of non-relational databases
Key-value stores store data as key-value pairs. Each key is unique and is associated with a value.
The key-value stores do not need primary or foreign keys and are super efficient at retrieving data.
They are often used for session data (where session_id is the key).
Document databases are optimized for storing rich text documents as they are. The document is translated from its text or pdf version into a machine-readable format such as JSON, BSON, or XML.
Each document is a standalone unit, and document stores are not good for storing relationships between documents. But they are great for quick reading of documents and retrieving information from a wide text corpus.
Graph databases are best used to model data that looks like - well - graphs! Think social media networks, where one person can be connected to other people, and each of the friends can have their friendship networks.
Graph databases are particularly useful for building dynamic schema - you do not need to know how your data will look like before you start storing it (a precondition for relational databases). Instead, your graph model can develop along with your data collection.
The most notable example of a graph database is Neo4j.
Wide-column stores might look like a table because they store information into rows and columns. But unlike relational databases, wide-column stores are optimized for rows that have different values in different columns.
They are optimized for sparse data. You can think of sparse data through the Netflix example - each person has seen a couple of movies, but not all people have seen all movies. So if you have a table, where each row is a Netflix user, and each column is a movie, the majority of cells will be empty (greyed out in the table).
Sparse datasets are problematic for SQL databases because they decrease performance as new columns and rows are being added.
The most notable examples of wide-column stores are the open-source Apache HBase (modeled after Google's Bigtable: A Distributed Storage System for Structured Data) and Amazon DynamoDB.
NoSQL databases comprise multiple different technological solutions, but in general, we can compare the two types of databases on 5 levels:
Non-relational databases offer high availability. NoSQL databases are distributed by design. This means they store data replicas across multiple servers or nodes. Non-relational databases offer high availability - they are always available for read and write requests, even when faced with network partitions (read: network failure) or certain nodes going offline.
Their high availability comes at the cost of consistency. Sometimes data integrity is compromised to allow for availability.
Unlike non-relational databases, SQL databases focus on high consistency. They will prioritize data integrity over availability via their ACID transactions (unfamiliar? Read more here).
Relational databases are usually deployed over a single server or node, so they can only scale vertically - by adding more CPU or RAM to the existing infrastructure. Vertical scaling is slower and more expensive.
On the other hand, non-relational databases are distributed by design, so can easily scale vertically - by adding a new machine or server to the system. This type of scaling is faster and cheaper.
Relational databases are best for structured data that is modeled well by the table model.
Non-relational databases, on the other hand, handle unstructured data well and are best for different data structures.
SQL databases adapt poorly to changing data schemas (the overall structure of your data across all tables).
NoSQL databases handle dynamic schema changes well - because they do not presuppose a given data structure in advance.
Non-relational databases were designed to handle insanely big amounts of data - those amounts that usually break a SQL database.
The ultimate choice between a non-relational and relational database will depend on your business use cases.
If your application requires high availability (but not necessarily consistency), easy scalability, and model flexibility you are better off with a non-relational database.
No matter which solution you choose, Keboola can help you with heavy lifting.
Keboola is an end-to-end data operations platform design to automate the setup and maintenance work of data engineers and analysts.
Keboola uses components to automate writing to and reading from various data stores - from relational to non-relational databases.
Explore more than 250 integrations Keboola has to offer that save you engineering hours and nerves as you build your ideal data store.