Learn everything about enterprise data warehouses (EDWs), from their architecture and advantages to the best EDWs of 2020.
Companies use predictive and business analytics to gain an advantage over their competitors and claim a bigger share of the market. But with the accelerated proliferation of data volume, speed and variety, establishing a system to make sense of this data is posing ever-increasing challenges.
Several data solutions - from databases to data lakes - have emerged to help companies of all sizes take control of their data and use it to accelerate growth. This evolution has culminated in enterprise data warehousing (EDW).
1. What is an Enterprise Data Warehouse (EDW)?
A data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a technological solution deployed by an enterprise to store, centralize, transform, and analyze its data.
The data is collected from disparate sources (e.g. databases, CRMs, ERPs, advertising platforms, marketing tools, accounting solutions, sales software, etc.) at different levels of detail and with varying structures: from unstructured to fully relational (think of tabular data like Excel).
The ultimate purpose of an EDW is to store the data in a way that speeds up and facilitates business and predictive analytics, which leads to competitive intelligence and insights for the enterprise.
2. Why do companies need an EDW?
Historically, companies kept their data records in relational, highly structured and well-organized databases (DBs). But relational databases did not adjust well to the growth spurt in data. Concretely, databases had several issues:
Volume. Databases can handle huge amounts of inflowing data, as long as that data follows ACID constraints. Massive parallel data inflows (such as from distributed systems and IoT devices) and duplicated records present a challenge to traditional databases. Unless we invest a lot of engineering customization upfront, relational databases are not optimized to handle distributed inflows and deduplication transformations without incurring a performance cost.
Speed. Traditional databases were extremely quick at inserting data (OLTP operations) but slowed down when it came to typical analytical queries, which require a lot of transformations (OLAP operations). Analysts took for granted that running a business intelligence query could mean waiting longer than a day to get results.
Data types. Traditional - SQL - databases were ill-prepared for the rise of NoSQL (not only SQL) data, such as geographic information, IoT logs, graphs (social media) and the like. Some, like PostgreSQL, have adjusted by implementing native NoSQL data types and transformations. However, these solutions are far from being exhaustive for users who need (near) real-time analytics.
External disparate sources of data. Companies faced additional issues when trying to fit and model their data within databases. With the increased reliance of businesses on Software-as-a-Service (SaaS) apps (ERPs, CRMs, marketing tools, etc.), crucial business data was no longer located within the four walls of the company, but within the apps that the company was using to run its business. The process of preparing external data for analysis (commonly referred to as Extract-Transform-Load or ETL) is crucial for both DBs and EDWs. But from the very start, DBs were not architected to handle disparate data sources, orchestrations, varying structures, and changing schemas.
These challenges encouraged a technological revolution, which gave birth to EDWs. These join data from multiple databases and data sources into a centralized system - one that is optimized for storing and analyzing data.
3. EDW architecture
The architecture of EDWs is designed to optimize the performance of analytics on the data stored in the EDW:
Columnar storage. Unlike traditional relational databases, which store data in rows (or ‘tuples’ in SQL parlance), EDWs use columnar storage. When retrieving data based on an attribute (e.g. “Show me all customers who have purchased product X”), the EDW can scan just the column where the attribute resides (e.g. “purchased_product”) instead of retrieving entire rows. On wide tables with many rows, this design can make analytical queries dramatically faster.
Transformation performance. Traditional databases slow down at the point of data transformations. EDWs, on the other hand, are optimized to transform data on the fly (clean it, aggregate it, remove corrupted records…) using massively parallel processing.
Hardware optimization. With their default reliance on flash memory from SSDs to accelerate data storage, retrieval, and analysis, EDWs are architected from the bare metal up to speed up analytic endeavors.
These, along with other technical implementations, have positioned the architecture of EDWs above most databases for OLAP (Online Analytical Processing) purposes.
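To make the columnar-storage point concrete, here is a minimal Python sketch. It is an illustration only, not how any particular EDW is implemented: the toy table, its column names (customer_id, purchased_product, country) and both helper functions are hypothetical. The same three records are stored row-wise and column-wise, and each function counts how many values it has to touch to answer “which customers purchased product X”.

```python
# Hypothetical toy data: the same table stored in two layouts.
rows = [
    {"customer_id": 1, "purchased_product": "X", "country": "CZ"},
    {"customer_id": 2, "purchased_product": "Y", "country": "US"},
    {"customer_id": 3, "purchased_product": "X", "country": "DE"},
]

# Column-oriented layout: one list per attribute, aligned by position.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def customers_row_store(table, product):
    """Scan whole rows; every field of every row is read."""
    values_read = 0
    matches = []
    for row in table:
        values_read += len(row)  # the full row is fetched
        if row["purchased_product"] == product:
            matches.append(row["customer_id"])
    return matches, values_read


def customers_column_store(table, product):
    """Scan only the filter column, then fetch ids at matching positions."""
    product_column = table["purchased_product"]
    values_read = len(product_column)
    positions = [i for i, value in enumerate(product_column) if value == product]
    matches = [table["customer_id"][i] for i in positions]
    values_read += len(positions)  # ids are read only for the matches
    return matches, values_read


row_result = customers_row_store(rows, "X")
column_result = customers_column_store(columns, "X")
print(row_result)
print(column_result)
```

Both layouts return the same customers, but the columnar version touches fewer values. The gap widens as tables gain more columns and more rows.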
3.1 What is the difference between a database (DB) and an enterprise data warehouse (EDW)?
EDWs were specially architected to compensate for the intrinsic weakness of databases: the slowing down - or even crashing - under heavy (analytical) workloads, which relied upon disparate data.
4. Advantages of using a data warehouse
Deploying an EDW brings several benefits to a business:
Accelerated time to insights. By relying on an architecture that was designed with analytics in mind, users of an EDW get to insights faster. Queries which once took hours or days can be shortened to just minutes and seconds. Thus, analysts can iterate through the data faster and discover novel opportunities for business growth.
Data centralization. EDWs centralize data access across a disparate ecosystem of data repositories and sources. This provides a better holistic view across different business silos and closes the gap between varying business endeavors.
Streamlined operations. EDWs are part of the extract-transform-load (ETL) data pipeline, which takes data from its raw form (extracts), cleans and aggregates it (transforms) and stores it in the destination database or data warehouse (loads). ETL still requires engineering work and optimization, even within EDWs. However, EDWs offer out-of-the-box tools and integrations to other software, which automates aspects of the ETL to provide usable data faster.
Data security & integrity. From role-based access control (RBAC) to multi-factor authentication (MFA), the best EDWs provide several layers of security to keep your data from prying eyes. Alongside data encryption (at rest and in transit) and integrity considerations, EDWs do the heavy lifting when it comes to keeping your data secure and running quality assurance audits.
NoSQL. Except for rare databases, EDWs offer a superior solution for working with NoSQL data. From tailored storage solutions to flexible schemas and transformation routines, EDWs are at the frontier of using NoSQL data for analytics.
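The extract-transform-load flow described under “Streamlined operations” can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the raw records, the orders table and the three function names are hypothetical, and an in-memory SQLite database stands in for the destination warehouse.

```python
import sqlite3

# Hypothetical raw export from a source system (e.g. a CRM).
raw_orders = [
    {"order_id": "1", "amount": " 19.90", "country": "cz"},
    {"order_id": "2", "amount": "5.00", "country": "US"},
    {"order_id": "2", "amount": "5.00", "country": "US"},  # duplicate
    {"order_id": "3", "amount": "n/a", "country": "de"},   # corrupted
]


def extract():
    """Extract: in practice this would call an API or read a file."""
    return list(raw_orders)


def transform(records):
    """Transform: drop corrupted rows, deduplicate, normalize types."""
    seen = set()
    clean = []
    for record in records:
        try:
            amount = float(record["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        order_id = int(record["order_id"])
        if order_id in seen:
            continue  # skip duplicates
        seen.add(order_id)
        clean.append((order_id, amount, record["country"].upper()))
    return clean


def load(connection, records):
    """Load: write the cleaned rows into the destination table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    connection.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for the destination
load(warehouse, transform(extract()))
loaded = warehouse.execute(
    "SELECT order_id, amount, country FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Of the four raw records, only two reach the destination: the duplicate and the unparseable row are removed in the transform step, which is the kind of work EDW tooling aims to automate.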
4.1 Advantages of deploying an EDW in the cloud
Data warehouses can be deployed on-premise (company-owned and maintained server) or as Software-as-a-Service (SaaS) solutions on the cloud (or Enterprise-data-warehouse-as-a-Service, EDWaaS).
Based on the recent advancements in cloud infrastructure and software, it has become clear that cloud deployment surpasses on-premise deployment for a multitude of reasons:
Costs. Using SaaS for your EDW requirements removes several costs, which are instead picked up by the vendor, including hardware costs, software costs, infrastructure, security certifications and audits, maintenance costs, and more. You only pay for your storage and compute needs, at a fraction of what it would cost to set everything up yourself.
Scaling. EDWaaS solutions often include auto scalers (or scaling tools), which make it easier to scale up your rented instance (when you need more compute or storage power) and, equally important, to scale down (when your needs decrease). The native scaling options allow you to adapt to situations faster, and scaling down saves you money.
Features. EDWaaS solutions often develop new features (from machine learning to integrations with other software), which are offered for free or at a considerably lower cost than your in-house development. This positions you at the forefront of technological innovation when it comes to using data for gaining competitive advantage.
Security. The best EDWs implement the highest levels of security protocols, audit their safeguarding processes and run penetration testing, all to ensure that your data is kept safe. Taking care of security concerns on your own not only raises expenses in terms of engineering man-hours, but also burdens your payroll with security experts, who do not come cheap.
Downtime. Cloud solutions are distributed by design. If one (or two, or three) servers fail, their processes and data are redistributed to other servers, which then pick up the job. For this reason, cloud solutions experience less downtime, and even that is barely noticeable for the end user when compared to single-deployment on-premise solutions.
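The savings from scaling down, mentioned under “Scaling” above, come down to simple arithmetic. The sketch below uses invented numbers throughout: the hourly rate, node counts and 30-day month are assumptions for illustration, not any vendor's pricing.

```python
# All figures are hypothetical; replace them with a vendor's real pricing.
HOURLY_RATE_PER_NODE = 1.00  # price of one compute node per hour
HOURS_PER_DAY = 24


def monthly_compute_cost(nodes_per_day):
    """Sum the cost of the nodes provisioned on each day of the month."""
    return sum(
        nodes * HOURS_PER_DAY * HOURLY_RATE_PER_NODE for nodes in nodes_per_day
    )


# Fixed provisioning: sized for the peak on all 30 days.
fixed = [8] * 30

# Elastic provisioning: 8 nodes on 5 peak days, 2 nodes on the other 25.
elastic = [8] * 5 + [2] * 25

fixed_cost = monthly_compute_cost(fixed)
elastic_cost = monthly_compute_cost(elastic)
savings = fixed_cost - elastic_cost
print(fixed_cost, elastic_cost, savings)
```

With these made-up figures, provisioning for the peak all month costs 5760.0, while scaling down outside the peak days costs 2160.0. The real ratio depends entirely on how spiky your workload is.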
5. What are the best EDWs of 2020?
When deciding which EDW to pick, there are four main contenders in 2020. All of them are state-of-the-art examples of EDW architecture at great value-per-dollar price points:
Amazon Redshift. Amazon Redshift uses columnar storage for accelerated data processing and access. It scales with nodes (units with assigned CPU, RAM, and storage space) and implements massively parallel processing (MPP). With a syntax similar to PostgreSQL, Amazon Redshift strikes a good balance between being innovative and accessible. As one of the longest-established cloud EDWs, Amazon Redshift offers a wide ecosystem of integrations with ETL and BI tools.
Snowflake. Snowflake is not tied to a single cloud platform and can run across all the giants’ solutions (AWS, Azure, Google Cloud Platform...). Unlike Amazon Redshift, it can scale compute independently of storage, making it particularly interesting for users whose compute needs fluctuate independently of their storage needs (e.g. real-time analytics with high seasonality, IoT, etc.).
Google’s BigQuery. Google’s BigQuery (BQ) separates itself from the others by offering a serverless architecture, in which the user does not see the details of resource allocation. This is bad news for tinkerers, but great news if you don’t wish to optimize the low-level machinery of your EDW. Computational and storage provisioning occurs continuously in the background and adjusts dynamically to your needs.
Microsoft Azure. Microsoft Azure has positioned itself towards more niche technical verticals, such as artificial intelligence, machine learning, the internet of things, and blockchain. With its compute and storage optimization, it still makes for a superb EDW for the general audience, but its integrations are mostly developed for the Microsoft ecosystem of tools.
6. Evaluation criteria for EDWs
Not all EDWs are the same, especially when they need to be a good fit for your data deployment. When deciding on an EDW, you should match it against the following criteria:
Meets current and future needs. Make a list of your current data engineering and analytic needs, as well as ‘wished for’ features which are currently unresolved. Check how each EDW vendor compares against your list. Additionally, make sure to spend some time on the future promise of each vendor to see if their growth aligns with your future data requirements.
Shortens time to insights. Measure how long it takes in the current system to obtain analytic answers to your business questions. For enquiries that you can’t answer at the moment (for example, the data might be so messy that it’s impossible to run clustering algorithms), put in a long default value such as 12 months. Compare your average time-to-insight to that promised by each EDW vendor.
Manages diverse data. Match your current sources of data to the vendor’s integrations. Look at both ready-made tools (for example, automatic connectors to Salesforce and Google Analytics) and the possibility of building your own custom integrations. Check how the EDW combines data from different sources and how much overhead those transformations demand.
Supports existing skills. Even the best solution on the market does not provide value unless it can be used. Check what technology expertise is needed to use the EDW (if any). Always err on the side of SQL or SQL-like solutions, as these are well understood; they have a lot of history, material, and plenty of experts on the market to hire, so you won’t be stranded with a great platform that no one can use.
Reduces costs. Each EDW calculates approximate monthly costs for compute and storage. Talk to your analysts and engineers to gauge how much storage, RAM, and CPU you currently use. Insert that data into calculators offered by the vendors to match your current costs to the ones you can expect from the EDW. Also make sure that you find out what the cost of scaling would be and how much you could save by downscaling at times of lesser compute/storage need.
Implements security. Look into the security implementations of your chosen EDW solution. If you are not a security expert, check how your vendor measures up against recognized security and compliance standards, such as GDPR compliance, HIPAA-compliant architecture, SOC 2 and SOC 3 reports, and ISO 27001 certification.
Streamlines the ETL data pipeline. Whatever your choice of EDW, you will still need to implement your ETL data pipeline. Ask the vendors how they integrate with other ETL tools, what solutions they offer natively and how this will impact your time-to-insight and costs.
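The time-to-insight comparison described under “Shortens time to insights” can be sketched as follows. The business questions and measured durations are invented for illustration, and the 12-month default is approximated as 365 days.

```python
from statistics import mean

# Roughly 12 months, used for questions you cannot answer today.
DEFAULT_DAYS = 365

# Hypothetical business questions with measured time-to-insight in days.
# None marks a question the current system cannot answer at all.
current_days = {
    "monthly revenue by region": 2.0,
    "churn by customer cohort": 5.0,
    "customer clustering": None,
}


def average_time_to_insight(measurements, default=DEFAULT_DAYS):
    """Average the durations, substituting the default for unanswered ones."""
    values = [default if days is None else days for days in measurements.values()]
    return mean(values)


baseline = average_time_to_insight(current_days)
print(baseline)
```

Note how a single unanswerable question dominates the average. That is intentional: it makes the cost of questions you cannot currently answer visible when you compare your baseline with what a vendor promises.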
7. Four steps to getting started with an EDW
Getting started with EDWs can be a little daunting. To smooth the transition from databases to enterprise data warehouses, we prepared a simple four-step process for guidance:
Make a shortlist of EDW providers based on the evaluation criteria specified above.
Decide whether you want to migrate existing data to the new EDW platform or start anew.
Determine success criteria.
Set up proof of concept with each shortlisted vendor and match them against the success criteria.
Voila! The process of picking the best EDW for your use case is as simple as going through these four steps.
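Steps three and four, defining success criteria and matching each proof of concept against them, can be kept honest with a small script. The sketch below is one possible way to do it: the criteria, thresholds, vendor names and measurements are all placeholders, not real benchmark results.

```python
# Hypothetical success criteria: metric name -> (comparison, threshold).
success_criteria = {
    "query_seconds": ("max", 60),       # benchmark query must finish within 60 s
    "monthly_cost_usd": ("max", 3000),  # budget ceiling
    "connectors_covered": ("min", 5),   # sources with ready-made integrations
}

# Hypothetical proof-of-concept measurements for two placeholder vendors.
poc_results = {
    "vendor_a": {"query_seconds": 42, "monthly_cost_usd": 2500, "connectors_covered": 6},
    "vendor_b": {"query_seconds": 15, "monthly_cost_usd": 3400, "connectors_covered": 7},
}


def failed_criteria(results, criteria):
    """Return the names of the criteria a vendor does not meet."""
    failed = []
    for metric, (kind, threshold) in criteria.items():
        value = results[metric]
        ok = value <= threshold if kind == "max" else value >= threshold
        if not ok:
            failed.append(metric)
    return failed


report = {
    vendor: failed_criteria(results, success_criteria)
    for vendor, results in poc_results.items()
}
print(report)
```

In this made-up run, vendor_a meets every criterion while vendor_b exceeds the budget ceiling despite being faster. Writing the criteria down before the proof of concept prevents them from drifting towards whichever vendor impressed you most.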
8. How can Keboola help your company accelerate your EDW deployment?
Keboola is the all-in-one data operations platform which automates your data-related processes. Among its feature-packed offerings, Keboola integrates natively with the major EDWs to accelerate your data warehouse deployment.