Here's a sneak peek into one of our conversations with a data analyst.
To find out whether Keboola could be of wider benefit to a company that has so far used the platform only as its ETL solution, Michal Hruska, a senior data consultant at Keboola, and Pavel Dolezal, Keboola’s CEO, met with Tim, the company’s data analyst. They sat down to talk about the usual ways of working with data, the challenges that come with them, and how Keboola can help solve them. This article is a snippet of their conversation about the role Keboola plays in a data analyst’s daily work.
Michal, team Keboola: I just want to describe my journey from the widely accepted setup where you are now to using Keboola as the ultimate tool that made my day-to-day work so much easier.
I used to work as a data analyst myself. Gradually, as our project kept growing, I started noticing things that didn’t make much sense to me anymore. You are using Postgres. We had Oracle and Jupyter Notebooks, in which we ran our models and analyses. The notebooks ran on some kind of Python server, so at least they weren’t local. The problem was that the environments were separate.
We prepared data in one environment and worked with it in another one. While other people were preparing my data, I didn’t have a chance to get involved and adjust the pipeline or prepare the data myself. I had to wait for someone to do all that for me, which took time and slowed me down a lot. I could work only when I had my data. I was so excited when I got the opportunity to see inside the pipeline and be more independent, doing things myself instead of asking IT specialists for everything.
As for Python, I, too, believe that it allows you to do everything, and much faster. I just wondered how to version the code if I had my own Jupyter Notebook. I had to set up Git for versioning and a tool for experiment tracking to track my model training and compare individual versions without having to do it in Jupyter cells. I also wanted to be able to get back to a version during further development. That’s what made me realize that what I needed was some form of collaboration. You say you usually do everything yourself, but I began to see great potential in having the option to invite someone to check my code if I ran into problems when developing. I became interested in a Google Docs-style way of jumping in and collaborating on one piece of code.
Tim, the data analyst: You’re right. If I’m working on something in my Jupyter Notebook, the versioning isn’t great. I can save different versions there, too, and I can come back to them, but from time to time, I have difficulties finding things, and I’m not happy about it. On the other hand, for ad hoc analyses that I usually don’t even return to, this is not really a problem. The things that might be needed again are saved to Git, accompanied by tests, and reviewed by other people. They are versioned and saved, and anyone can check my code.
At the same time, it seems important to me not to be constricted by one system. I want to learn things that are universal.
In the software world, it’s easy. You have developers, your IT and DevOps. Period. It’s very simple. You write code and have one production database. But in the data world, there's a huge end-to-end problem. You have different sources and different languages for manipulating data. You have various places where you need to get the data at different stages of the pipeline. Then you have different people, from those who prepare data and those who prepare models and insights like you do, all the way to the hardcore developers who want to write code about all that in a container and need everything to run together. And this is our territory. This is what we are trying to solve: making everything work together seamlessly. - Pavel, CEO @ Keboola
Michal, team Keboola: I would like to describe to you what a data scientist’s day looks like when using Keboola.
1. You, as a data scientist, turn on a workspace in one click. It can be a Python, Spark, or R workspace.
2. Then you decide how powerful you want the machine to be, whether it needs graphics cards or just an ordinary machine will do. It will start in half a minute.
3. You'll see, by default, a JupyterLab hosted by Keboola. Your Python or Spark workspace runs behind it. You automatically use our versioning and all the things you need for creating some governance in your work. The real coding then takes place in Python. Today’s Keboola offers an arbitrarily strong machine with the runtime you need for your code.
4. At the end, you can take the code or model and simply deploy it to production with one click. Keboola will set up a web service with an endpoint you can use later for scoring.
You can do all code reviews and similar things directly in one tool. For me, this was probably the main reason why I went for Keboola.
Everything is available within one tool, and I don’t have it scattered in several places, where I, as a user, don’t even get to see it and don’t know what is going on there, how, and why. I really appreciate the flexibility that it gave me because I never really knew in advance what I would develop and with how much data. If there is suddenly too much data and jobs start running slowly, the Keboola workspace allows me to increase the performance, and everything goes on smoothly.
To save money and not pay for the machine when I’m not working, I put the workspace into sleep mode. Before I start working again, I switch it on and wait for it to recover and restore the code that was in it. To sum up, Keboola gives me everything I need for data science development, and there is also the benefit of having it under the same roof where data is being prepared.
When I was still doing everything on my own laptop, I feared what would happen if one of my teammates were to take over and I had to pass on my know-how and all my old code.
Tim, the data analyst: Sure, I think this is the answer to my hacking away locally on my laptop. But that’s not the standard development process. Otherwise, everything is written in Git, all the code is stored there, it is versioned, et cetera. This is not where the problems arise, is it?
Michal, team Keboola: From what I understand, you are versioning something you already want to deploy into production. But what I often need to version are auxiliary pieces of code which can easily be just 200 lines of code. I use them at the beginning of my analysis for profiling to learn everything I need about that data. The 200 lines are my know-how, and I want to pin them somewhere in that tool for anyone to use. In Keboola, we call it shared code. You store a piece of shared code and someone else can come, take it, and run it or slightly modify it. It’s more straightforward than Git, at least for the users I have worked with. Git is a separate tool. It has a different user interface, library, and so on.
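To make the idea of a small, reusable profiling snippet concrete, here is a minimal sketch of the kind of helper Michal describes. It is an illustration only: the function name `profile_rows` and the sample data are hypothetical, it uses nothing but the Python standard library, and it does not show Keboola's shared code feature or any Keboola API. It assumes each row is a dictionary of simple values such as strings and numbers.

```python
from collections import Counter
from typing import Any


def profile_rows(rows: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Summarize each column: row count, missing values, distinct values, top value.

    Hypothetical helper for illustration; values are assumed to be hashable scalars.
    """
    # Collect column names in the order they first appear.
    columns: list[str] = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    summary: dict[str, dict[str, Any]] = {}
    for name in columns:
        values = [row.get(name) for row in rows]
        # Treat None and empty strings as missing.
        present = [v for v in values if v is not None and v != ""]
        counts = Counter(present)
        summary[name] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(counts),
            "most_common": counts.most_common(1)[0][0] if counts else None,
        }
    return summary


# Made-up sample data, purely for demonstration.
sample = [
    {"country": "CZ", "plan": "free"},
    {"country": "CZ", "plan": ""},
    {"country": "DE", "plan": "paid"},
]
report = profile_rows(sample)
for column, stats in report.items():
    print(column, stats)
```

A helper like this is deliberately general and solves one problem only, which is what makes it a reasonable candidate for sharing with teammates, whether through Git or through a shared code store.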
We integrate Git, too, because we have already come across teams who wanted to have JupyterLab integrated with Git. It makes sense to me mainly because you push something into Git, and you have a pipeline there that takes care of many other things that just happen outside of Keboola. And that’s perfectly fine. I just wanted to point out the fact that I rarely see people using Git for everything. To give an example, they would not store their pieces of great code in it to share with their coworkers. I have never met a team that would take advantage of a tool to exchange their “know-how.”
Tim, the data analyst: We actually do that. And I must admit that it’s not entirely simple. But having no previous experience with it, I automatically assumed it would not be easy and accepted it as the best method available. Everyone around me was using it, after all. Plus, it got much easier with time. We use internal scripts or reusable things and encourage each other to write those things as generally as possible and at the same time in a way that solves only one problem at a time, increasing the chance that someone will use them again. The fact that people don’t use Git with all its functionalities seems a no-brainer to me. They don’t use Keboola or anything else that way. Each tool usually grows to offer more and more possibilities, and people find their own paths in it. But I’m slowly starting to understand more about what Keboola offers. We have only been using it as our ETL tool.
It’s not our goal to have Keboola do everything and reinvent the wheel. We are an automation layer connecting people with each other and allowing them to reuse not only clean data, catalogs, and shared buckets but also code so they can collaborate on it together. We automate DevOps so people don’t have to. You said this beautifully; the more our users grow and do more and more in code, the closer we are to the CI/CD pipeline to make the system fit into the classic DevOps architecture. So I would not look at it as either Python and Git or Keboola. On the contrary, we strive to become a link between those things and between different roles. - Pavel, CEO @ Keboola
Tim, the data analyst: I’m still not quite sure what would be in this for me and how all that could make my work easier. The first thing that comes to mind is that Keboola can give me some independence from DevOps, allowing me to manage the environment myself. That sounds quite appealing to me.
Pavel, team Keboola: That’s exactly how it is. It provides independence from DevOps but at the same time cooperation with them.
Tim, the data analyst: I didn't mean to imply that I would like to get rid of DevOps. I’m just under the impression that I’m bothering these guys with petty, little things, and it takes a long time for them to get back to me. On the other hand, I must admit that I also tend to give businesspeople access to the database and ask them to write a simple “select” so they don’t have to depend on me.
Pavel, team Keboola: This is precisely the issue here. We try to decouple people’s dependence on each other. Of course, there are extensive pieces of code that you don’t write yourself, and someone has to write them for you. But when it comes to iterating on top of these things, that’s where decoupling comes in. You don’t depend on others anymore. Let’s say you have 15 people in a team and 10 companies. If you assign just two people from each of the companies to a common task, it will involve 20 to 30 people. Dependence on DevOps would then be enormous. Plus, those people don’t even want to do this.
Tim, the data analyst: This is probably the most significant aspect for me. The same happens to me all the time when people want me to integrate an API for them because developers don’t have time for anything. This forces us to take various detours. I’m definitely intrigued by what you say, and I would like to try doing a bit of that DevOps work myself. I would love to test it on a home project first before giving it a go in production.
Pavel, team Keboola: We have recently launched a community website and already have about 150 people registered, helping each other and answering each other’s questions. You can find some help there. We would also be happy to prepare a walk-through session for you and your coworkers, covering DevOps, independence from DevOps, data engineering, and data science. This would give us a better insight into the problems you run into and a chance to show you how we would solve the problems.
Would you like to learn more about the power of decoupling and become independent? Sign up to our community, or create a free account.