Data science has been called the sexiest job of the 21st century by some people, who have obviously never seen the inside of a fire station.
And sure, working as a data scientist can be extremely rewarding. You tinker with shiny new machine learning algorithms, think deeply about interesting problems, and your work has a direct impact on the company’s bottom line. By discovering new revenue sources, optimizing costs and using analytics, you can boost and accelerate your organization’s growth.
But boy oh boy, does data science look different on the frontline.
Let’s take a peek behind the curtain to see what the ‘average Wednesday’ of a data scientist actually looks like.
Once you move from the world of academia and Kaggle competitions into industry, something strikes you hard: no one prepares you for how messy real-life data can be.
Seasoned practitioners have been grinding their teeth over this problem for years. But the fact remains that data scientists spend on average more than 80% of their time gathering and cleaning data.
Why is this?
Well, there are a couple of reasons:
Messy data is just one part of the equation. Gathering data is the other side of the 80% coin.
And here, we’re not just talking about designing the ETL pipeline that helps you to obtain the necessary data for your work. Sure, some data scientists are lucky enough to have data engineers who build ETL pipelines for them, but that’s not the case for everyone.
Even when the ETL is set, there are other challenges involved in getting usable data. On some level, this is expected. It’s in the data science hierarchy of needs that acquiring data will take up a bigger chunk of your time:
On the other hand, there are challenges in data governance which are difficult to overcome:
All repetitive work and no play makes Jack a dull data scientist.
Unfortunately, running the entire data science pipeline (ETL > data understanding and gathering > data cleaning > (finally) modeling) is a laborious process. This is a horrible pain, given that a lot of data science is based on exploration and experimentation.
Unlike practitioners in some other fields of engineering, data scientists find the best algorithm for the job by running multiple variations, tuning hyperparameters, experimenting with new features, and tinkering in general.
Imagine if a structural engineer built bridges by experimenting and building multiple versions until the best one was found. We probably wouldn’t be so happy to finance those projects. And yet the processes and tools given to data scientists push them exactly in this direction - spending long cycles waiting for data before they can run the experiments needed to reach viable business conclusions.
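To make the "run multiple variations" loop concrete, here is a minimal sketch of an exhaustive grid search over hyperparameters. Everything in it is an assumption for illustration: `train_and_score` is a hypothetical stand-in with a toy scoring formula, and the hyperparameter names and values are made up. In real work, that function would train a model on your data and return a validation score, which is exactly the slow step the whole loop ends up waiting on.

```python
import itertools


def train_and_score(learning_rate: float, max_depth: int) -> float:
    """Hypothetical stand-in for 'train a model, return a validation score'.

    The formula is a toy: it simply peaks at learning_rate=0.1, max_depth=3.
    """
    return -((learning_rate - 0.1) ** 2) - 0.01 * (max_depth - 3) ** 2


# Assumed hyperparameter grid, for illustration only.
grid = {
    "learning_rate": [0.01, 0.1, 1.0],
    "max_depth": [2, 3, 5],
}

# Try every combination and record its score.
results = []
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    results.append((train_and_score(**params), params))

# Keep the best-scoring combination.
best_score, best_params = max(results, key=lambda item: item[0])
print(best_params)
```

Even this tiny grid means nine training runs. Each extra hyperparameter or feature variation multiplies that number, which is why waiting on data for every run hurts so much.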
Since its inception, data science has been the love child of three fields: mathematics, software engineering, and domain expertise.
The intersectional nature allows data scientists to see beyond the blind spots of each field and come up with creative ideas to solve business challenges. But as with any family, there are bound to be some conflicts.
The tools that data scientists often use are ones that work great for software developers: Git for versioning and collaboration, for example. But anyone who has ever pushed a Jupyter Notebook to master will know that Git is a poor choice for versioning and sharing notebooks (try reading this blame message).
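Part of the reason notebook diffs are so hard to read is that a notebook file is JSON that stores cell outputs and execution counters alongside the code. One common workaround is to strip those before committing. The sketch below assumes the nbformat 4 layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`), and the `example` notebook is hand-written for illustration. It is one possible approach, not a complete fix for collaborating on notebooks.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with code-cell outputs cleared."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results and plots
            cell["execution_count"] = None  # drop the run counter
    return cleaned


# A tiny hand-written notebook, for illustration only.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Analysis"]},
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["42\n"]}
            ],
            "source": ["print(6 * 7)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

stripped = strip_outputs(example)
print(json.dumps(stripped, indent=1))
```

With outputs removed, a diff shows only changes to the source of each cell. The trade-off is that collaborators no longer see your results in the repository and have to rerun the notebook themselves.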
Even when we find other ways of collaborating on Jupyter Notebooks, the crucial problem remains: there are no smart tools for sharing data.
Why does this matter? Because sharing data is the cornerstone of doing data science work. Sharing data allows you to:
This is especially painful during times when remote work is becoming the rule rather than the exception. Having tools and processes which support collaboration is paramount for a data scientist.
Before you start doom prepping and looking at job boards for a career change (orange picking in Sardinia sure does sound fun), let’s look at things in a different way. Every problem we discussed is a challenge that needs solving.
Being aware of these problems puts you ahead of them. You can start solving them.
And this is the approach you should take:
If you'd like to start implementing the above approach immediately, you are welcome to create a free account in Keboola.