Don’t sacrifice scalability for savings - have it both ways!
When left unchecked, the cumulative costs of your company data can ramp up fast.
The culprits range from training CPU-intensive machine learning algorithms that are never used in production to maintaining enormous databases that store every minute event “just in case”.
Letting your data operating costs run without checks and balances can quickly cause costs to bloat beyond your allocated budgets.
Luckily, improving data operations can help, and in this blog post we are going to show you how.
Four principles guide the philosophy of cost reduction.
These help you understand the big picture and context necessary to prioritize concrete data operation initiatives (explored later) that save operating expenses.
Unlike sales or marketing, the data teams are rarely directly responsible for revenue growth and cash flow.
With the rare exception of products with machine learning at their core, data teams usually play a supporting role: they help other teams in your company make better business decisions, which in turn indirectly drive growth.
It is often hard to quantify the direct impact data insights have on your company scaling and growth. But that doesn’t mean they are not impactful.
When cutting business costs, take into consideration the downstream effects of your decisions.
For example, unsubscribing from a business intelligence tool license might save costs today. But it can also cut your sales team's quick access to customer data needed to close cold calls next week.
Always keep scalability and savings in balance. A good rule of thumb is: to cut costs, it is better to optimize existing data operations than to remove entire data workflows and tools.
Complexity is the sister of growth. As your company grows, your data architecture tends to increase in complexity.
For example, let’s say you’re running an e-commerce shop. As you were growing, you wanted predictive analytics to better inform optimal delivery routes. Your data team decided to introduce a new database (MongoDB) that can scale geo data predictions better than your existing e-commerce transaction database (MySQL).
This is just one of the many examples of how entropy increases the complexity of your data operations - from additional tools (complex stack) to layered and codependent workflows (new ETL data pipelines, additional last-minute data quality scripts written for investor reports, …), your data operations become more chaotic as your company scales.
Simplify complexity to cut costs. We’ll look at concrete examples later on.
Mature and heavily regulated companies (banks, insurers, etc.) tend to have the opposite issue from chaos - they are too rigid.
From stacks that cannot change (“we need to keep the Oracle database for compliance reasons”) to rigid infrastructure (on-premise servers cannot be migrated to the cloud), inflexible DataOps creates opportunity costs - different tools and workflows could help you cut costs, but you do not implement them, because your company’s architecture is too rigid.
Loosen up fixed architectures to allow your company to grow with lean methodologies.
You cannot fly a plane blindfolded. And you cannot cut costs unless you know both what you are cutting and how much you are saving.
There are three ways to improve your cost measurements:
Now that we are equipped with the right principles to guide us, let’s look at concrete ideas on how to cut costs by improving data operations.
You may not be able to reduce office supplies costs, travel expenses or downsize office spaces to optimize business expenses, but there are ways data teams can cut costs.
Here are the most common areas where cutting expenditure can save your bottom line while helping your company grow.
Companies amass large quantities of data through the lifetime of their operations.
The majority of historical data is stale and is used only on rare occasions (transactional data kept for regulatory reasons, raw data dumps that were used in big data algorithm training but are seldom rechecked once the algorithm’s parameters have been calibrated, etc.).
The data cannot be deleted. But it can be re-architected into cheaper storage.
For example, by combining a data lake and data warehouse architecture, the data lake can keep historical data in data dumps that are optimized for storage but not data processing (e.g. Amazon S3 Glacier), while data that is crucial for company growth is piped into the data warehouse for data analytics and data science initiatives.
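To make the split concrete, here is a minimal sketch of a tiering rule that sends data untouched for a long time to archive storage and keeps the rest in the warehouse. The `Dataset` class, the `pick_storage_tier` function, the dataset names, and the 180-day threshold are all hypothetical illustrations, not part of any specific platform; your own cut-off should follow your regulatory and access requirements.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Dataset:
    name: str
    last_accessed: date


def pick_storage_tier(dataset, today, cold_after_days=180):
    """Return 'archive' for data untouched longer than the threshold, else 'warehouse'."""
    age = today - dataset.last_accessed
    return "archive" if age > timedelta(days=cold_after_days) else "warehouse"


# Hypothetical inventory of datasets and when they were last read.
today = date(2024, 1, 1)
datasets = [
    Dataset("orders_2015_raw", date(2019, 3, 1)),
    Dataset("orders_current", date(2023, 12, 30)),
]

tiers = {d.name: pick_storage_tier(d, today) for d in datasets}
print(tiers)
```

In practice the "last accessed" information would come from your storage or warehouse access logs, and the actual move would be handled by the storage provider's lifecycle tooling.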
In the past, database administrators worried a lot about query optimization and data modeling at rest, so the database costs would not skyrocket.
But with the popularization of MPP warehouses (Snowflake, Redshift, BigQuery, …), storage became comparatively cheap and data modeling fell in popularity.
Cloud warehouses made technological solutions that used to cost 7 or 8 figures available for 4 to 5 figures.
But that doesn't mean there is no room for improvement.
Modeling your data correctly can save you a lot of money. For example, if you identify that your data analysts perform the same join over two massive tables 20 times every day, you can either create indices on those two tables to speed up processing or save (materialize) the joined result as an analytic table, so the join is not recomputed over all rows every time.
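Here is a small sketch of the materialization idea, using Python's built-in `sqlite3` module as a stand-in for a real warehouse. The table names (`orders`, `customers`, `orders_by_region`) and the sample rows are made up for illustration; the join is computed once and analysts then query the saved result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);
""")

# Compute the expensive join once and store the result as an analytic table.
conn.execute("""
    CREATE TABLE orders_by_region AS
    SELECT c.region, o.order_id, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
""")

# Downstream queries read the materialized table and never repeat the join.
revenue = dict(conn.execute(
    "SELECT region, SUM(amount) FROM orders_by_region GROUP BY region"
).fetchall())
print(revenue)
```

The trade-off is freshness: a materialized table has to be rebuilt or refreshed on a schedule, so this fits best when the same join is reused many times between updates.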
Do you have random EC2 instances running without any jobs on them? Is there an Airflow DAG updating a dashboard with a live data stream, despite no one looking at the dashboard in real time?
Every company has unused and underused data assets and data pipelines. Identify which workflows and assets are consuming resources without delivering value, and cut them.
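A simple audit can start from an inventory of assets and the last time each one was actually used. The sketch below assumes you can export such an inventory yourself; the `find_idle_assets` function, the asset names, and the 30-day threshold are hypothetical.

```python
from datetime import datetime, timedelta


def find_idle_assets(assets, now, max_idle_days=30):
    """Return names of assets that were never used or not used within the window."""
    cutoff = now - timedelta(days=max_idle_days)
    return [
        asset["name"]
        for asset in assets
        if asset["last_used"] is None or asset["last_used"] < cutoff
    ]


# Hypothetical inventory: last_used is None when no usage was ever recorded.
now = datetime(2024, 1, 31)
inventory = [
    {"name": "ec2-scratch-box", "last_used": None},
    {"name": "live-dashboard-dag", "last_used": datetime(2023, 10, 1)},
    {"name": "nightly-sales-etl", "last_used": datetime(2024, 1, 30)},
]

idle = find_idle_assets(inventory, now)
print(idle)
```

Treat the output as a list of candidates to review with their owners rather than something to switch off automatically.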
When departmental silos exist, processes get duplicated.
Examples range from marketing and sales each running their own customer data enrichment process to engineering and data insights teams both collecting database logs for monitoring.
Analyze which workflows are duplicated and merge them. When two teams run the same process, consolidating it can roughly halve its cost.
Many companies make the mistake of not filtering the data early enough in the data lifecycle.
When you collect raw data from various data sources (data integration with your data lake), not all data needs to get to the data warehouse. Or at least not at the same granularity.
Let’s say you collect sensor data that is produced 50 times every second. But all your data operations, business intelligence, and SLAs to customers use sensor data at a granularity of 60 seconds. That is 50 × 60 = 3,000 raw readings for every value you actually use.
You can aggregate the data (sum it, take averages, …) from 50 Hz down to one-minute intervals and make the aggregated data the input to your data warehouse, where application developers and data scientists will pick it up for their models.
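The aggregation step could look roughly like the sketch below, which averages raw readings into one-minute buckets before they reach the warehouse. The `aggregate_to_minutes` function and the synthetic readings are illustrative assumptions; a real pipeline would read from your sensor stream and might keep sums, minimums, or maximums as well.

```python
from collections import defaultdict
from statistics import mean


def aggregate_to_minutes(readings):
    """Average (timestamp_seconds, value) readings into one-minute buckets.

    Returns a dict keyed by the start of each minute, in seconds.
    """
    buckets = defaultdict(list)
    for timestamp, value in readings:
        minute_start = int(timestamp // 60) * 60
        buckets[minute_start].append(value)
    return {minute: mean(values) for minute, values in sorted(buckets.items())}


# Two minutes of synthetic 50 Hz data: 6,000 raw readings in total.
readings = [(i / 50, 1.0 if i < 3000 else 3.0) for i in range(6000)]

per_minute = aggregate_to_minutes(readings)
print(len(readings), "raw readings ->", len(per_minute), "warehouse rows")
```

If someone later needs the full-resolution signal, the raw readings can still live in cheap data lake storage; only the aggregated rows need to be loaded into the warehouse.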
Data management and data governance help you establish tools and processes that take control over data flow throughout its lifecycle - from collecting raw data to driving insights.
Having a clear understanding of where data is, what certain data means, how it is generated, how it is protected, and all the metadata associated with it is crucial for running streamlined data operations on three levels:
One tool that makes the data lineage process a breeze is Keboola. You can automate your entire data pipeline, from collecting structured and unstructured data to transforming and storing it for analysis. At each step, Keboola automatically tracks all relevant metadata and constructs logs, which gives you a granular view of data lineage so you can identify the root cause of errors immediately.
How much does it cost for you to wait on a report to be produced and delivered?
This simple question hints at a common truth across companies: the data insights and data engineering teams are often bottlenecks in data-driven decision-making.
The story is familiar:
This is not mismanagement of the data team, but a challenge to be solved through better data operations.
You could improve the operational throughput by increasing the headcount of your data team, or maybe by outsourcing some operations. But labor is expensive, and there are many other opportunities for optimization.
Instead, invest in processes and tools that can help you automate reporting:
Simple automation (data modeling, BI tool, upskilling, Excel) can cover the proverbial 80% of all requests and free up valuable resources.
Development teams love manual scripting. It is fun to tinker with code to get something done. But the fun stops once the manually scripted systems start to fail.
A common example is writing extractors in Python/Java/Go/pick-your-language that collect raw data from data sources and ingest it into your data lake or data warehouse. This is fun until the data warehouse tables go through migration and the extractor script fails. Or the source data API changes endpoints or protocols and your development team spends a week figuring out how to collect the same data again.
Wherever possible, rely on tools to do the heavy lifting and avoid scripting. From maintenance costs to increased chances of making errors, manual scripting solutions seldom scale and carry long-term management costs.
Those were the 8 use cases of how to improve your data operations for cost saving without jeopardizing growth. But how do you implement them if you do not have a devoted data operations team?
You rely on the right tools to get the job done.
Keboola is a data platform as a service designed to streamline and automate your in-house data operations end-to-end, so you can optimize business processes and save costs as a result.
How does it do it?
As Brett Kokot, Director of Product at Roti and satisfied Keboola user, said:
“I don’t want to manage Airflow, I don’t have time for that! I can set up an orchestration in Keboola in 5 minutes, that would take 2+ hours of coding there.”
Try Keboola out for yourself.
Keboola offers a no-questions-asked, always-free tier (no credit card required), so you can play around and optimize business operating costs with a couple of clicks.