With the steep rise of data, smart businesses have started capitalizing on this new oil to build a new class of products and services: data applications.
Admittedly, the engineering and business development of data apps overlaps with that of their cousins, the trusty desktop app and the well-known web app. But there is a core difference that sets data applications apart: they are first and foremost about the data they use to deliver value. Because data plays this crucial role, the processes, technologies, and architectural decisions behind data apps differ from those of similar applications.
In this guide we will look at 7 best practices when building modern data applications:
Best practice #1: Start with a business goal
Whether you are building the next generation of big data applications or prototyping your run-of-the-mill web app, the fundamental principles are the same: you do not start with technology, but with a business goal in mind.
There are generally four business use cases specific to data applications:
- Build a standalone data product. Big data analytics does not just offer insights; it also allows us to build standalone products. The most famous example is Google Search. By relying on vast amounts of past search data, machine learning algorithms, and tuning results for different search intents (e.g. browsing versus answering a specific question), Google has built a superb standalone product that is far ahead of the competition.
- Improve existing products with data. One of the most common business advantages of data applications is using big data technology to improve existing products. Netflix is renowned worldwide as one of the biggest entertainment giants, stealing eyeballs from Hollywood and cable TV by offering better movies and series. What sets Netflix apart is its recommendation engine. By combing through huge data sets of viewing habits, Netflix can determine which combinations of past movies best predict your future preferences, making you more likely to enjoy the movies it recommends and to log into Netflix again. The potential of data apps to set you ahead of the competition has been proven across industries: banks have used data mining algorithms such as anomaly detection to catch fraudsters; retailers have used business intelligence apps to join past purchase data with social media logs and digital marketing data to quickly understand and adjust to incoming trends. Data applications help companies understand their customers better and deliver a more competitive product.
- Offer data as a product. If your app collects and generates vast amounts of data, why not capitalize on it as a new revenue stream? Individual users' demand for data has been growing steadily. From activity tracking in Strava to personalized healthcare recommendation apps, consumers are hungry for data. Why not improve your product offering by adding analyzed data as an additional feature? You can monetize your existing data in several ways: build a customer-facing data app under a higher subscription tier, offer data exports via a paid API, license aggregate data to industry researchers, and so on.
- Optimize operations with data. Data applications can streamline your operations, helping you cut costs and deliver your products and services more reliably. The Internet of Things (IoT) has improved operations in manufacturing: by collecting sensor data directly from machinery, manufacturers can predict when machines will fail before they actually do, allowing for better planning and business continuity even in the face of adversity. The transportation industry has likewise used logistics analytics to streamline operations. Amazon uses data applications that combine traffic data, previous route planning, historical customer demand, and a variety of other signals to predictively ship items even before a customer orders them. Once the customer actually places the order, the items are already closer to their physical location, last-mile delivery is faster, and the speed of the service delights customers.
The difference between business objectives can sometimes be subtle. For example, both Netflix and Amazon use recommender systems. But dig deeper and you will notice that the former relies on real-time processing, while the latter uses a classic batch ETL design.
It is important, therefore, to have a clear business goal in mind, because your vision will guide the architectural choices and tradeoffs down the line.
Best practice #2: Build an MVP, not a polished solution
Your first data application does not have to be a fully-fledged engineering beauty. Data applications are rather novel in comparison to other enterprise applications, so it is not always clear what value they will deliver.
This is why it is important to build MVPs (minimum viable products) and test them.
Building MVPs will help you to:
- Understand the value your data app offers. Get feedback from customers, try your recommender algorithms against your warehouse, and evaluate how well the insights from customer analytics improved your digital marketing. This approach helps you understand the value your app brings to the table and which edge cases it does not solve. Getting feedback early and often lets you course-correct and even discover value you had not planned for. For example, an exploration into product recommendations might flag that even though product XYZ is the most popular, customers have not bought more of it because of a supply-chain issue, so the focus can shift from improving the product to improving operations.
- Surface unforeseen engineering issues early. You might discover that the data warehouse you picked does not scale well with increased demand. MVPs allow you to change architectural choices faster.
- Control your expenses. MVPs are cheaper than fully-fledged solutions. Experimenting here does not cost your company as much as developing a fully specced product and having it fail the day after the launch.
Best practice #3: Don’t let architectural decisions come between your data and your customers
Your architectural decisions will be derived from your business goals.
Let us look at an illustrative example. Imagine we are building a real-time data app, such as Netflix or a digital advertising app that collects ad impressions and clicks. If your app relies on a continuous stream of incoming data, you will design the data capturing and processing system under different constraints (it needs to run in real time, scale with growing inflows, and parallelize analysis to keep up) than if your app analyzes stationary data long after it has been collected.
So, what are the usual architectural choices?
- Data capturing mode. Does it have to be live streaming (as data is generated in the example above) or can it be done post-factum (for example once daily after the order book has been concluded)? If we can collect data after it has been generated, do we do it in one go, or do we batch the data collection due to volume?
- ETL vs. ELT design. Do you transform incoming data on the fly before loading it (ETL), or do you load the raw data first and clean and analyze it inside the warehouse (ELT)? The ELT design is much more favorable for huge volumes of data, where transformations before loading would cause a bottleneck and delay real-time analytic systems.
- Tooling and infrastructure. The choice of the specific software you deploy will be guided by the goals. For example, MySQL is a great relational database, but streaming data is better handled by Kafka.
- Supported data types. When you work with structured data, it is easy to pick a solution. Unstructured data, or data without a fixed schema, demands either specific data types (such as Snowflake's VARIANT type for JSON data) or a different choice of tooling altogether (e.g. NoSQL databases, graph databases, document stores, etc.).
- And others …
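To make the ETL vs. ELT choice concrete, here is a minimal Python sketch of the ordering difference. The `Warehouse` class is an in-memory stand-in for a real warehouse client (Snowflake, BigQuery, ...), and the names `etl`, `elt`, and `clean` are illustrative, not a real API:

```python
class Warehouse:
    """In-memory stand-in for a data warehouse client."""
    def __init__(self):
        self.tables = {}

    def load(self, table, records):
        self.tables.setdefault(table, []).extend(records)

def clean(record):
    # Example transformation: normalize the email field.
    return {**record, "email": record["email"].strip().lower()}

def etl(records, wh):
    # ETL: transform in the pipeline *before* loading. Simple, but the
    # transform step can bottleneck high-volume or real-time ingestion.
    wh.load("users", [clean(r) for r in records])

def elt(records, wh):
    # ELT: load raw data first, transform later inside the warehouse,
    # where compute scales independently of the ingestion path.
    wh.load("users_raw", records)
    # In practice this step would be a SQL transformation in the warehouse:
    wh.tables["users"] = [clean(r) for r in wh.tables["users_raw"]]

wh = Warehouse()
elt([{"email": "  Ada@Example.COM "}], wh)
print(wh.tables["users"][0]["email"])  # ada@example.com
```

Both paths produce the same clean table; the tradeoff is where the transformation compute happens and whether raw data is preserved in the warehouse for later reprocessing.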
The architectural choices presented above are far from exhaustive. But they illustrate specific choices that need to be made when building data applications. There are a couple of principles to keep in mind when architecting your app:
- Decouple your features. Design your app in a modular fashion, so it is easier to (1) develop individual units separately, (2) maintain the app without adding complexity, (3) experiment and re-architect singular modules instead of the entire app.
- Avoid decision lock-in. Every design choice comes at the opportunity cost of another, but that does not mean we must follow it 'till death do us part. Having the option and ability to re-architect our app gives us business agility and accelerates responsiveness when changes are necessary. And they will be necessary. Rely on modern data platforms such as AWS, GCP, or Azure to avoid decision lock-in. Unlike in-house infrastructure, which is often harder and more expensive to change, cloud providers offer plug-and-play tools and let us migrate between solutions without being locked into a single design.
Best practice #4: Deliver faster with developer tools and integrations
Your app will outshine competitors thanks to the smartly engineered proprietary code you write.
But that doesn't mean you have to manually code every feature and function.
Modern data platforms come with different data-devoted developer tools and integrations which accelerate your development. To name a couple:
- Instead of manually writing data collection scripts for each API, use point-and-click integrations with a data extractor app that automatically collects data from third-party apps: advertising software (Facebook Ads, Google Ads, ...), CRMs (Salesforce, CloseIO, ...), email marketing tools (Mailchimp, SendInBlue, ...), etc.
- Use developer tools to automate data cleaning. Write your cleaning scripts once, and rely on software to automatically clean data at every ingestion.
- Prototype data products with developer tools. These range from sandboxes that allow data experiments and sharing without breaking your engineering pipelines, to Jupyter Notebooks that let your data scientists collaborate. Developer tools are made to take over the pesky, boring parts of the process, such as setting up virtualization and dockerization before you can play with data (sandboxes), or emailing data around and sharing it via GDrive so data scientists can collaborate.
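The "write your cleaning scripts once, run them at every ingestion" idea can be sketched in a few lines. The registry pattern below is illustrative only; the names `CLEANERS`, `cleaner`, and `ingest` are hypothetical, not the API of any specific tool:

```python
# Registry of cleaning steps, defined once and applied automatically.
CLEANERS = []

def cleaner(fn):
    # Decorator: register a cleaning step once, at definition time.
    CLEANERS.append(fn)
    return fn

@cleaner
def drop_empty_rows(batch):
    # Remove rows where every value is missing or blank.
    return [row for row in batch if any(v not in (None, "") for v in row.values())]

@cleaner
def normalize_country(batch):
    # Map free-text country names to standard codes (example mapping).
    codes = {"united states": "US", "czechia": "CZ"}
    for row in batch:
        c = row.get("country", "").strip().lower()
        row["country"] = codes.get(c, row.get("country"))
    return batch

def ingest(batch):
    # Every ingestion runs the full cleaning pipeline; no manual step.
    for clean in CLEANERS:
        batch = clean(batch)
    return batch

print(ingest([{"country": " Czechia "}, {"country": ""}]))  # [{'country': 'CZ'}]
```

The point is that cleaning logic lives in one place and runs on every incoming batch, instead of being copy-pasted into each ad-hoc script.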
Relying on integrations and developer tools speeds up your app development and frees up time in your experts' schedules for more productive engineering and revenue-generating work.
Best practice #5: Automate provisioning and deployment
Provisioning comes in many forms:
- Setting up servers by installing the correct software and keeping it patched and up to date.
- User provisioning by monitoring access rights and authorization privileges to keep your app secure.
- Networking by connecting users, servers, containers, IoT devices, …
- Service provisioning to set up services and the data relying on them.
All the provisioning work is necessary if you want to deploy your app reliably.
But work doesn't stop there.
Before your app can be launched, you also need tests, the CI/CD integration with your version control provider, monitoring of the production server for the correct distribution of resources, …
Accelerate your app deployment by automating provisioning and deployment work. A modern data platform can do this for you. You will still need to configure your software, define user groups and roles for access, and monitor the production server overall, but many of the in-between steps can be automated without human intervention.
Best practice #6: Be ready to scale
The issue with traditional apps is that they are built on traditional architectures, which were not designed with data-intensive workloads in mind. Data applications differ from classical enterprise software in several architectural areas:
- Data apps generate and consume larger volumes of data at faster speeds and demand different pipelines for data ingestion and delivery.
- Data volume and velocity scale non-linearly with demand. Relying on suboptimal provisioning to meet scaling needs increases your expenses and cuts into your margins.
- Traditional apps tightly couple features with their data. As expected, this causes increased complexity with every feature change and product release, and it introduces technical debt. But it also makes it harder for your app to scale compute resources independently of storage resources. Having a spike in users querying your data warehouse is not the same problem as having an increase in write queries to your warehouse.
Instead, use cloud technology for your data application platform. Cloud providers allow you to adjust compute and storage resources to match demand. With tools such as autoscalers and automated cluster replicas, you can enjoy automated cost adjustment with almost zero maintenance and virtually no effect on service delivery. Just keep the decoupling principle in mind: to scale compute separately from storage, your app logic needs to be separated from your warehousing.
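The decoupling principle can be sketched as stateless compute workers that hold only a handle to shared storage. The `SharedStorage` and `Worker` classes below are toy stand-ins (for object storage or a warehouse, and for replicated app instances), used only to illustrate why statelessness lets compute scale independently:

```python
class SharedStorage:
    """Stand-in for shared storage (object store, warehouse, ...)."""
    def __init__(self, rows):
        self.rows = rows

class Worker:
    """Stateless compute: holds no data of its own, only a storage
    handle, so it is safe to replicate behind an autoscaler."""
    def __init__(self, storage):
        self.storage = storage

    def count_clicks(self):
        return sum(1 for r in self.storage.rows if r["event"] == "click")

storage = SharedStorage([{"event": "click"}, {"event": "view"}, {"event": "click"}])
# Scaling compute is just adding workers; storage is untouched.
workers = [Worker(storage) for _ in range(3)]
print(workers[0].count_clicks())  # 2
```

Because every worker computes the same answer from the shared storage layer, you can add workers during a query spike, or grow storage during a data spike, without touching the other side.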
Best practice #7: Optimize the value you deliver
Once you build and launch your app, you can constantly optimize the value your data app offers.
Set up a meta-analytics platform that measures your application's operations. Monitor every aspect of your app, from the ETL/ELT pipeline, the warehousing configuration, and the specific ML algorithms you use, to the customer-facing UI, and use tools that help you diagnose and improve the value your app delivers.
Keboola can help you build modern data applications faster
Keboola is an end-to-end data operations platform that helps you prototype, test and deploy your data applications faster.
Within the ecosystem of Keboola tools and apps you can find:
- Over 250 integrations, which allow you to automatically extract data from a variety of third-party apps and seamlessly load it into multiple databases, data warehouses, and data lakes.
- Automate your data cleaning. Write your cleaning scripts in SQL, Python, or the language you most love and put them on autopilot.
- Build ETL and ELT pipelines with simple point-and-click technology.
- Prototype data products by using Sandboxes for exploring data without affecting the engineering pipelines, the Data Catalog to share data, and collaborative data science tools such as Jupyter Notebooks.
- Scale and deploy seamlessly, by relying on best-in-class cloud provider technology.
Explore everything Keboola has to offer without any commitments. Feel free to give it a go with our always free plan.