DataOps and automation at the heart of the banking revolution
According to the European Banking Authority report on Advanced Analytics and Big Data in banking, the implementation of data technologies, infrastructure, and practices is still at “an early stage”.
The game is on for early contenders in this winner-takes-most market. Banks that move quickly are likely to get ahead of the curve, grabbing more of the market pie before others rise to the challenge.
To move rapidly, contenders will need to look over the fence at the wins and losses of data endeavors in other verticals. One clear lesson is that only a fraction of data science projects make it into production. The main reason? Surprisingly, it’s not a lack of technical skills, scarce talent, inadequate technology or even growing pains. It’s inefficiencies in data operations.
Banks must turn to DataOps to overcome the challenges of streamlining data operations.
What is DataOps?
Data Operations (or DataOps for short) is a set of processes, tools and technical practices that help organizations deliver end-to-end data solutions.
Its name is a nod to DevOps (a blend of “development” and “operations”), a set of practices that transformed software development from an inefficient process into a state-of-the-art one.
DevOps looked at software delivery from a holistic perspective. Much like the lean movement identified the manufacturing production line as a single process of turning raw materials into end products, DevOps viewed the software development process as a single line, which needed to be optimized in order to run smoothly.
The lean philosophy helped to streamline and automate common processes involved in software development. By removing barriers, developers could concentrate on what matters most: delivering a software solution.
In a similar vein, DataOps is a set of technical processes and tools that help to streamline the process of changing data from its raw form into a final product. The data pipeline is a manufacturing line: it takes the raw material (data) and transforms it into an end product (dashboard, analysis, algorithm…).
Within this data factory, DataOps combats several challenges:
Repeated manual reporting. The same reports need to be manually created over and over again because the data has changed (but the report structure has not).
Untrustworthy data. The same customer has inconsistent data across multiple databases because not all of the databases are regularly updated.
Slow data collection. Relying on scripts to collect data from multiple third-party apps causes regular breakdowns and slow analysis. This delays the generation of all other reports (creditworthiness, loan qualification, automated portfolio investments…) based on that data.
Scalability issues. When more than fifty requests are being sent every day, the server cannot return an answer because it runs out of memory to compute the query. Analysts and operational personnel waste time waiting for these results.
Lack of data versioning. Re-running the same experiment or algorithm produces drastically different results, and it’s impossible to understand why because incoming data is not versioned. For example, how will you check that your anti-fraud algorithm has correctly flagged the same transactions as potentially fraudulent if the transactions you are experimenting on are constantly changing?
Disjointed and siloed data. Data is stored in different formats (csv, database table, uncleaned contract scans…) across disparate locations (departments, regions, clouds and on-premise storage). It all needs to be centralized before an analysis can be made. If the personal banker needs to email five departments in order to obtain the data before they can confirm a loan request, it’s a waste of both the bank’s resources and the customers’ time.
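The data versioning challenge above can be made concrete with a small sketch. It assumes a dataset is a list of records and derives a fingerprint from its content, so that every experiment can record exactly which snapshot it ran on. The function name `dataset_fingerprint` and the sample transactions are hypothetical and chosen only for illustration; a production setup would more likely rely on a dedicated data versioning tool.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Return a deterministic SHA-256 fingerprint for a list of dict records."""
    # Serialize each record with sorted keys so key order does not matter,
    # then sort the serialized rows so row order does not matter either.
    rows = sorted(json.dumps(record, sort_keys=True) for record in records)
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical transaction snapshots taken on two different days.
snapshot_monday = [
    {"tx_id": 1, "amount": 120.0},
    {"tx_id": 2, "amount": 9800.0},
]
snapshot_tuesday = snapshot_monday + [{"tx_id": 3, "amount": 45.5}]

version_a = dataset_fingerprint(snapshot_monday)
version_b = dataset_fingerprint(snapshot_tuesday)
# The same rows in a different order yield the same fingerprint.
same_data_reordered = dataset_fingerprint(list(reversed(snapshot_monday)))
```

Storing the fingerprint next to each experiment result makes it possible to tell whether two runs of the anti-fraud algorithm differ because the algorithm changed or because the incoming transactions did.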
The problems are numerous, but the solutions to them can be summarized within the 5 principles of good DataOps.
The 5 principles of good DataOps
DataOps focuses on speeding up the end-to-end cycle of data analytics, which starts with data collection and ends with analytics or visualizations that bring added value.
DataOps relies on 5 main principles:
Bottleneck removal. The data pipeline is regarded as a holistic process from beginning to end. To improve the entire pipeline, we need to identify and remove bottlenecks wherever they occur. If there are problems with your data collection, they will delay the analysis stage, too. Bottlenecks must first be identified; then processes or tools can be deployed to remove or mitigate them.
Automation. All (unnecessary) repetition should be automated. Automation prevents you from wasting precious time repeating common tasks, and it also reduces the risk of human error, which grows with every manual repetition.
Exhaustiveness. The entire end-to-end data pipeline needs to be part of the DataOps solution. Optimizing just one part of the pipeline does not prevent errors down the line, since the bottlenecks still remain.
Monitoring. Every aspect of the operations should be monitored. That way, bottlenecks can be identified and removed accordingly. Even if you’ve eliminated all of the bottlenecks, monitoring ensures that you’re made aware of potential deterioration in the data pipeline, allowing you to intervene before the damage is done.
Speed of iterations. Data analytics is cyclical: we form a question, collect, store and clean data in order to answer it, then lastly analyze and obtain the answers we’re looking for. But with these answers come new questions, and so the cycle is repeated. The aim of DataOps is to speed up the analytical cycles and shorten the time it takes to get from posing a question to finding an answer.
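The bottleneck removal and monitoring principles can be illustrated with a minimal sketch. It assumes that the duration of each pipeline stage has already been measured, and it simply reports which stage takes the largest share of the total run time. The stage names and timings are hypothetical.

```python
def find_bottleneck(stage_seconds):
    """Return the slowest stage and its share of the total run time."""
    if not stage_seconds:
        raise ValueError("no stage timings were provided")
    total = sum(stage_seconds.values())
    slowest = max(stage_seconds, key=stage_seconds.get)
    # Guard against a division by zero if every stage took 0 seconds.
    share = stage_seconds[slowest] / total if total else 0.0
    return slowest, share


# Hypothetical timings (in seconds) for one run of a reporting pipeline.
timings = {"extract": 540.0, "store": 60.0, "transform": 120.0, "report": 30.0}
slowest_stage, share_of_total = find_bottleneck(timings)
```

Tracking the same numbers for every run is also a simple form of monitoring: a stage whose share keeps growing points to a deterioration that is worth investigating before it causes delays.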
How to start with DataOps in banking
When implementing DataOps for the first time, concentrate on the recurring data problems burdening your bank. These problems are the bottlenecks that are preventing you from streamlining your operations.
The cycle of continuous improvement
DataOps is not a one-time solution; it’s a process of continuous improvement. Every time a segment of the data pipeline is improved and data products are delivered faster, more efficiently and with fewer errors, new bottlenecks can be identified and corrected.
DataOps challenges in the banking industry
The banking industry encounters several unique challenges and these shape the way in which DataOps is implemented:
Sensitive data. Financial data is a special class of sensitive data. This requires DataOps practices to place increased emphasis on security and reliability.
Legacy technology and architecture. The choice of DataOps tools and practices must take into consideration the legacy technology and architecture of the bank’s infrastructure. These are often outdated and deeply entangled with daily operations, so DataOps must build on top of them rather than replace them, as is the default in less mature industries.
Regulatory constraints. Data sharing and transformations are specifically legislated for the banking industry. DataOps practices must incorporate the regulatory requirements into all processes.
Although banking has specific requirements, the structure of DataOps is similar to how it is in other verticals.
DataOps structure for retail banks
DataOps initiatives are structured around the ETL (Extract-Transform-Load) process of the data pipeline. They zoom in on every stage of the pipeline and find ways to automate it:
Extraction. Collecting data from disparate sources into one cohesive set of data extractions speeds up data collection. Instead of relying on multiple scripts and architectures, DataOps unites the different data collections into a single cohesive practice.
Storage. Storing data across multiple databases can cause delays with retrieval and inconsistencies across different storage systems. DataOps automatically links data across disparate databases or centralizes it within a single location, which speeds up data retrieval.
Transformation. Raw data needs to be cleaned before it can be used for analysis. This can include removing corrupted data, backfilling missing data, linking customer data across different sources into a single view, etc. DataOps automates repeated transformations which are necessary for getting the data into shape.
Analysis & Visualization. Repeated reporting, analyses and KPI tracking are all automated within the DataOps pipeline. When the same report needs to be delivered every Monday before the C-suite meeting, DataOps automates the pipeline, from extraction to producing the report.
Orchestration. The entire pipeline is a set of repeatable steps. Orchestration schedules the execution of steps at specified times and in the specified order. This means that the data pipeline can run through all of the phases without the need for human intervention, freeing up hours of time and lessening the chance of human error.
Monitoring. From one end (data collection) to the other (visualization and analysis), DataOps monitors the workings of the data pipeline and alerts you to any possible issues that could cause data loss or corruption. This automated preventative measure beats the usual hassle of post-error corrections.
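The stages above can be sketched as a tiny pipeline. This is a minimal illustration under stated assumptions, not a reference implementation: the step functions, the customer rows and the `run_pipeline` helper are all hypothetical, and time-based scheduling is left out. The sketch only shows steps running in a fixed order without manual intervention, with a failure turned into an alert instead of silently corrupting later stages.

```python
from typing import Any, Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[Any], Any]]


def extract(_: Any) -> List[dict]:
    # Hypothetical raw rows: one has a missing balance, one is a duplicate.
    return [
        {"customer_id": "C1", "balance": "1000.50"},
        {"customer_id": "C2", "balance": None},
        {"customer_id": "C1", "balance": "1000.50"},
    ]


def transform(rows: List[dict]) -> List[dict]:
    # Drop rows with missing balances and duplicate customers, parse numbers.
    seen = set()
    clean = []
    for row in rows:
        if row["balance"] is None or row["customer_id"] in seen:
            continue
        seen.add(row["customer_id"])
        clean.append(
            {"customer_id": row["customer_id"], "balance": float(row["balance"])}
        )
    return clean


def report(rows: List[dict]) -> dict:
    # Aggregate the cleaned rows into a small summary report.
    return {
        "customers": len(rows),
        "total_balance": sum(row["balance"] for row in rows),
    }


def run_pipeline(steps: List[Step], alerts: List[str]) -> Optional[Any]:
    """Run steps in order, passing each output to the next step.

    On the first failure, record an alert and stop the run.
    """
    data = None
    for name, step in steps:
        try:
            data = step(data)
        except Exception as exc:
            alerts.append(f"step '{name}' failed: {exc}")
            return None
    return data


alerts: List[str] = []
result = run_pipeline(
    [("extract", extract), ("transform", transform), ("report", report)],
    alerts,
)
```

In practice, a scheduler or a dedicated orchestration tool would trigger `run_pipeline` at the specified times, and the alerts would be routed to the people responsible for the affected stage.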
Create your innovation pipeline with DataOps
DataOps doesn’t just accelerate existing operations; a well-oiled data pipeline can allow for quicker innovations, too.
There are three things you should expect from your DataOps:
Sandboxes for experimentation. Opening up your data via sandboxes gives access to your analysts, scientists and quants, removing the need for IT and engineering intervention for every new experiment. New algorithms, visualizations and solutions can be tested before investing in extensive engineering to deploy them to the wider public.
The flexibility of adding a new pipeline. Sometimes, a solution will require additional data; at other times, you’ll want to transform the data differently for an algorithm to work. A slick DataOps setup enables speedy changes to the existing pipeline (e.g. add a new extractor, change the transformation, etc.), which speeds up the development of innovative solutions.
Controlled sharing. Once an analysis has been performed, a report finalized or data transformed through the dark arts of machine learning, that data needs to leave your pipeline and be made available to other people. To innovate on product development, for example, your mobile app will require access to your new product-recommendation system. DataOps makes sure that you can control what data you share externally, so innovation can happen with it outside of the data pipeline.
How can banks tailor their DataOps to the types of data found in the financial world?
Banks and other financial institutions work with three different types of data:
Direct. Data that the customer has provided directly to the bank, including contact information, demographic data, financial history, etc.
Transactional. Includes payments, withdrawals, bank transfers and wires, balances, interest over time, etc.
Value-added. Data that banks generate after analyzing direct and transactional data, such as credit scores, asset valuation, and other aggregated and standardized data.
Depending on the type of data, there are different regulatory requirements and business opportunities at play. For example, data that allows identification of a customer is more strictly regulated than analyses performed for customer segments.
Knowing which type of data your bank will be using in its product development, monetization strategies and revenue streams can also inform DataOps practices.
Personal data should be automatically placed in a different pipeline, one which requires special approval when sharing outside of the bank. Value-added data can constitute the backbone of your internal operations, speeding up face-to-face interactions with customers.
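As a rough sketch of how such routing could work, the example below assumes a small field catalogue that records each field's data type and whether it identifies a customer. The catalogue entries, the pipeline names and the helper functions are hypothetical, and the policy shown is an illustrative assumption rather than regulatory guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldInfo:
    data_type: str  # "direct", "transactional" or "value_added"
    identifies_customer: bool


# Hypothetical catalogue describing a few fields held by the bank.
CATALOGUE = {
    "email": FieldInfo("direct", True),
    "card_payment": FieldInfo("transactional", True),
    "segment_avg_balance": FieldInfo("value_added", False),
}


def pipeline_for(field: str) -> str:
    """Route a field to the 'restricted' or the 'standard' pipeline."""
    info = CATALOGUE.get(field)
    # Unknown fields are treated as restricted until they are catalogued.
    if info is None or info.identifies_customer:
        return "restricted"
    return "standard"


def may_leave_bank(field: str, approved: bool = False) -> bool:
    """Restricted fields may be shared externally only with special approval."""
    return pipeline_for(field) == "standard" or approved
```

Treating uncatalogued fields as restricted is a deliberately cautious default: a new field has to be classified before it can flow through the standard pipeline.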
The benefits of DataOps in banking - why you should automate your data pipeline
DataOps emphasizes automation because of the many advantages it brings:
Lower rates of (human) error. Automating the pipeline removes the need to manually run scripts and therefore reduces the likelihood of human error. Additionally, with monitoring in place, it’s easier to spot errors before or as they occur.
Higher traceability of data. Data tracing plays a crucial role both in meeting regulatory requirements (e.g. deleting all personal data following a customer’s request under GDPR) and in understanding the data in general. Data produced in automated, centrally orchestrated pipelines is easier to trace.
Increased confidence in data. Removing errors and tracing data increase confidence in the decisions that are made based on that data. Issues with inconsistencies and trustworthiness are just bottlenecks, which DataOps practices are designed to remove.
Continuous improvement of the data pipeline. Continuous improvements are embedded in the heart of DataOps. Following this principle, DataOps fulfills the promise of making the data pipeline better with each iteration.
Fewer man-hours wasted on repeated tasks and monitoring. The drive for automation will also liberate your talented workforce from mind-numbing and repetitive tasks, thus freeing up more man-hours for revenue-generating work.
Faster innovation. The speed of iterations is one of the central tenets of DataOps because it unlocks faster product development and innovation. Can your data pipeline be accessed via sandbox? Can new mini-pipelines be added without disrupting the regular flow of data? Can data be shared outside of the pipeline in a regulated manner? If the answer is yes, then innovation within and with data will flourish.