Are you using the right data strategy based on the hierarchy of data needs?
Discover the common misconceptions of data strategy and how to determine yours by mapping the hierarchy of data needs.
Being data-driven is the holy grail of modern business. It allows you to grow 8x faster than your competition, boosts your company’s net earnings by 30% and will have VCs throwing money at you if your organization relies on AI.
So, what strategy does one use to become data-driven? Well, it’s actually quite simple:
Track and analyze your main business metrics (KPIs).
Create insights by visualizing your data via dashboards.
Rely on smart/ML/AI/data science algorithms to reveal hidden insights.
Share data across departments so that everyone can make decisions based on the data.
If you follow this recipe to the T, you can have your data cake and eat it. Except when you follow the recipe and the cake is raw, your spoon is nowhere to be found and no one knows how to turn the oven off.
There are a couple of steps missing from this recipe. To understand how the data strategy can nurture or stunt your growth, we need to fill in the blanks and dig deeper into how companies (mis)use data.
1. The hierarchy of data needs
In striving to become data-driven, companies hire data scientists to work on machine learning algorithms and AI. But these hires are often a bad fit for the company’s data culture and strategy.
Jumping the gun and heading straight to AI and machine learning is like shooting for the moon. Sure, even if you miss you’ll end up among the stars. But unless you’ve built a rocket and a spacesuit first, you are going to suffocate in the oxygen-deprived environment.
What is missing is the understanding that all data needs are intrinsically structured in a hierarchical way. Every step of the hierarchy relies on the previous one for its foundation. Unless you cover the base first, you cannot build the upper layers.
This metaphor draws a parallel with Maslow’s hierarchy of human needs. Physical needs and shelter (lower levels of the hierarchy) need to be fulfilled before people can self-actualize (higher levels). This makes sense: if you are hungry and without shelter amidst a hurricane, you are not going to prioritize the report for your boss.
Data flows upwards from the lower levels of the hierarchy:
It all starts with data collection. Are you collecting every user interaction with your product? Is there data you need but you haven’t yet built the instrument that can collect it?
Once collected, data needs to be moved and stored throughout your ETL. Are you using the best database for your time-series data? How easy is it to access the data for analysis at a later date? Do you have backups and reliability monitoring throughout your ETL?
Only when data is accessible can you start to explore and transform it. This is the so-called “data cleaning” part. Why are some customers missing data? How come the numbers are double what they normally are on certain days? Was the data collection even working during this period?
With reliable and clean data, you can finally start conducting analytics (or what is usually termed BI). Combine all of the customer data together in a single table so that you can really see the big picture. Divide your suppliers into segments to understand who delivers on time. And start building features and training data, which can be used for machine learning.
Finally, the last two layers involve machine learning and artificial intelligence. With these, you can work on the optimization of company processes, data-driven predictions and work automation.
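To make the “data cleaning” questions above a little more concrete, here is a minimal sketch of two sanity checks: was anything collected on every day, and do any days show roughly double the usual volume? The event log, the function names and the threshold of twice the median are all assumptions made for illustration, not a prescribed method.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import median


def find_missing_days(daily_counts, start, end):
    """Return the days in [start, end] for which no records were collected."""
    total_days = (end - start).days + 1
    all_days = (start + timedelta(days=i) for i in range(total_days))
    return [day for day in all_days if day not in daily_counts]


def find_suspicious_days(daily_counts, factor=2.0):
    """Return days whose record count is at least `factor` times the median."""
    typical = median(daily_counts.values())
    return sorted(day for day, count in daily_counts.items()
                  if count >= factor * typical)


# Hypothetical event log: one date per collected record.
events = (
    [date(2024, 1, 1)] * 100
    + [date(2024, 1, 2)] * 98
    + [date(2024, 1, 4)] * 205  # roughly double the usual volume
    + [date(2024, 1, 5)] * 101
)
daily_counts = Counter(events)

# Was the data collection even working on every day of the period?
missing = find_missing_days(daily_counts, date(2024, 1, 1), date(2024, 1, 5))
# Which days look inflated, for example by a duplicated load?
suspicious = find_suspicious_days(daily_counts)

print("Days without data:", missing)
print("Days with unusually high volume:", suspicious)
```

Checks like these will not explain why a day is missing or inflated, but they tell you where to start digging before any analysis is built on top of the data.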
When people skip ahead to the upper layers, they are unintentionally shooting themselves in the foot. Data scientists/analysts might spend weeks working on an algorithm or analysis, poring over its inner workings and fine-tuning it, only to realize later that the base data was corrupted. It’s a waste of time and effort.
It’s like trying to install a state-of-the-art satellite dish on top of your house of cards. It’s going to crumble under its weight.
Although from the outside, this might seem like a simple problem to avoid (just collect>store>clean the data!), the general lack of understanding of the hierarchy of data needs carries hidden consequences. Ignore them, and your data strategy might harm your company rather than helping it grow.
2. The perils of a misinformed data strategy
Companies that do not internalize the hierarchical nature of data needs in their strategy often make mistakes in four areas of business growth:
Poor hiring choices. Picture this: you hire your first data scientist. With her Ph.D. in Advanced Statistics, she talks about all the analyses that could help your company grow, a sparkle glistening in her eyes. Algorithms to predict which customer segments are going to bring the biggest chunk of money in the future. Identification of customers at risk of churn. Logistics optimization analyses that could save you a lot of money. Then she sits down behind her computer, only to realize the data is dispersed across multiple servers and she does not know how to access it via your Kafka deployment. I mean, outside of specialized engineering teams, people think Kafka is a writer, not a data streaming service.
The misunderstanding of which skills are necessary to get the job done is prevalent in the data industry and affects both employers and employees: “The data scientist likely came in to write smart machine learning algorithms to drive insight, but can't do this because their first job is to sort out the data infrastructure and/or create analytic reports. In contrast, the company only wanted a chart that they could present in their board meeting each day. The company then gets frustrated because they don't see value being driven quickly enough and all of this leads to the data scientist being unhappy in their role.”
The misconception lies in the specific roles needed to successfully see the job through. A full data hierarchy requires 2-3 different roles. The lower levels (data collection, storing, transformation) are best served by software engineers, while the higher levels (analytics, optimization, AI) are suited to data analysts and/or scientists. As Samson Hu from the tech-startup Wish eloquently puts it:
“Without at least one data analyst, the data engineer will be buried under reporting tasks and data pulls. Without the data engineer, data analysts will be burnt out from querying difficult data sources while dealing with the data requests fire-hose.”
So what can be done to avoid these issues? Generally, companies opt for one of three possible solutions:
Hire someone with all three skills: data engineering, data analysis, data science. Or as they are called in the industry, unicorns.✨🦄✨ Because they are as magical and rare as unicorns are in real life.
Hire both data engineers and analysts/scientists. If you already have staff with certain expertise, supplement them with complementary roles.
Build data engineering products in such a way that technical analysts can use them for self-serving.
Inappropriate project planning. There is a reason why the levels of the hierarchy become narrower as we move higher up. This is a great visual metaphor for the investment of time needed at each stage.
An outsider might presume that the majority of data work is achieved by selecting, training, and optimizing the right algorithms and analyzing insights. In reality, around 80% of a data scientist’s time is spent cleaning and organizing data. This comes as a surprise to (project) managers, who might plan delivery dates without allowing for the extra time needed, and as a disappointment to novices, who expected to work on shiny new machine learning models.
But this is a necessary step, especially when working with new data: “Anytime there’s work on a new metric or new data source the analyst is unfamiliar with, they need to spend extra time exploring the data and understanding the system.” - Samson Hu (Wish)
Understanding the building blocks is essential for understanding the final results. Only by comprehending the details of the data at hand can analysts and scientists avoid forming the wrong conclusion.
Lacking this knowledge is like cooking without knowing what ingredients you’re mixing in the bowl - did we add sugar or salt? The final result can differ drastically from what was anticipated.
As a motivating example: imagine you are working on a table with a column labeled “date”. Is this the date a customer added the product to the basket, the payment date, the shipping date, or perhaps the date that the product was connected to your platform? Knowing that difference determines whether we use that information in our website optimization report, financial statements, logistic analysis or product onboarding metrics. And tracking down the answer often involves following the data flow right down to the collection stage through multiple, time-consuming steps.
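One lightweight way to keep this kind of detective work from being repeated is to write down what each column means and flag the ones nobody has documented. The sketch below is only an illustration: the column names, the sources and the `ColumnDoc` structure are hypothetical, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnDoc:
    name: str     # column name as it appears in the table
    meaning: str  # what the value actually represents
    source: str   # where in the data flow it is collected
    used_in: str  # the report that should rely on it


# Hypothetical documentation for an orders table.
ORDERS_DOCS = [
    ColumnDoc("added_to_basket_at", "customer added the product to the basket",
              "web tracking", "website optimization report"),
    ColumnDoc("paid_at", "payment was confirmed",
              "payment provider", "financial statements"),
    ColumnDoc("shipped_at", "parcel left the warehouse",
              "warehouse system", "logistics analysis"),
]


def undocumented_columns(table_columns, docs):
    """Return the columns that exist in the table but have no documentation."""
    documented = {doc.name for doc in docs}
    return sorted(set(table_columns) - documented)


table_columns = ["order_id", "added_to_basket_at", "paid_at", "shipped_at", "date"]
print(undocumented_columns(table_columns, ORDERS_DOCS + [
    ColumnDoc("order_id", "unique order identifier", "order service", "all reports"),
]))
```

Here the ambiguous “date” column is the only one flagged, which is exactly the column someone should trace back to the collection stage before using it.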
With this said, how can you adapt your data strategy to avoid this common issue?
Schedule the extra time into your delivery timelines, especially when onboarding new people or when existing staff work with new data.
Missed business goals. Advanced data approaches rely on previous company metrics. To give an example, let’s say you ran a multi-armed bandit test (aka A/B tests on steroids) and found that the winning subject line had a 20% open rate. Time to pop open the champagne? Well no, not if your average open rate without algorithmic tinkering sits at 40%. Therefore, you must have strong business intelligence and metrics in place before you begin optimizing processes via advanced data approaches.
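The arithmetic behind that example fits in a few lines. This is a minimal sketch that uses the numbers from the paragraph above; the `relative_lift` helper is a hypothetical name, and a real evaluation would also need to account for sample sizes and statistical uncertainty.

```python
def relative_lift(variant_rate, baseline_rate):
    """Relative change of the variant against the baseline (0.10 means +10%)."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (variant_rate - baseline_rate) / baseline_rate


winner_open_rate = 0.20    # best subject line found by the test
baseline_open_rate = 0.40  # average open rate without algorithmic tinkering

lift = relative_lift(winner_open_rate, baseline_open_rate)
print(f"Lift vs. baseline: {lift:+.0%}")
```

Without the baseline, the 20% figure cannot be judged at all. With it, the “winner” turns out to open at half the usual rate.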
Oftentimes, throwing state-of-the-art data approaches at a problem is a waste of resources altogether, because the scope of the problem is unclear. You can’t develop a machine learning churn model if it is not clear what the business understanding of ‘churn’ is. Is it when a customer hasn’t logged into your app for 7 days? Or when they uninstall the app? What happens if they re-install the app?
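Answering those questions amounts to writing the definition down explicitly before any model is trained. The sketch below shows one possible set of answers, purely as an assumption for illustration: a customer counts as churned after 7 days without a login or after an uninstall, unless they have re-installed since. The `Customer` fields and the `is_churned` function are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

INACTIVITY_WINDOW = timedelta(days=7)  # assumed business rule


@dataclass
class Customer:
    last_login: date
    uninstalled_on: Optional[date] = None
    reinstalled_on: Optional[date] = None


def is_churned(customer, today):
    """Apply one explicit, agreed-upon definition of churn."""
    if customer.uninstalled_on is not None:
        came_back = (customer.reinstalled_on is not None
                     and customer.reinstalled_on >= customer.uninstalled_on)
        if not came_back:
            return True  # uninstalled and never returned
    # Otherwise, churn means no login within the inactivity window.
    return today - customer.last_login > INACTIVITY_WINDOW


today = date(2024, 3, 15)
print(is_churned(Customer(last_login=date(2024, 3, 14)), today))
print(is_churned(Customer(last_login=date(2024, 3, 1)), today))
```

Whichever rules the business settles on, encoding them in one place means the churn model, the dashboards and the reports all label the same customers the same way.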
This lack of clear criteria is a well-known challenge in the world of data science, and it’s called the cold start problem: “if we don’t have these initial insights and validation metrics, then how does such model-building get started and get moving towards the optimal solution?”
Avoid wasting data resources if they are unable to bring any benefits. Make sure analytics and business intelligence are set up before diving into more advanced data approaches.
Accumulated technical debt. The last issue commonly faced when ignoring the hierarchy of data needs is failing to foresee how the data will be used… and accumulating technical debt along the way. When rushing to achieve an engineering goal, people often do not think of the entire data flow ahead.
To illustrate: your senior engineers choose to save incoming web data to a Redis instance. No one cares, because your analyst team is not focusing on web data at the moment. Later on, your analysts need to report on e-commerce conversion rates. But they are used to SQL querying. So, either they need to learn Redis syntax (time-consuming), or the engineering team needs to migrate the data to a different database, such as BigQuery (time-consuming).
Analogous issues emerge when using databases that are unsuitable for specific data types, underestimating how the data will grow, or not taking into consideration the costs of linking and joining data across different technological stacks.
3. How to align your data strategy with your growth
To tap into the potential that data has for growth, align your data strategy with the hierarchy of data needs:
Make sure each stage of the hierarchy is solidly built before moving on to the next one. No need for perfection, but also no need to run before you walk.
Make smart hiring choices. For advanced data processes, you need your team to have a mix of engineering, analytical and scientific knowledge.
Plan timelines that follow the nature of work. When working with new people or data, account for the additional time needed for digging through and cleaning the data.
Make sure you are aligned with business goals and baseline metrics before improving your processes.
Make architectural choices with the endpoint in mind: how will the data be accessed and used?
Before you bake your cake and strap in for the moon landing, take a moment to reflect on the recipe and adjust your rocket’s aim. If it’s worth doing, it’s worth doing well. After all, it is the ride of a lifetime.