What They Didn’t Teach at Data Science School, and How to Fix It to 10x Your Career
Become a Full-Pipeline Data Scientist by Being Both Data Scientist and Full Stack Developer Squeezed into One
Even though I had a stellar CV, I was turned down for the position. I had what I thought at the time was plenty of machine learning and modeling experience. I applied for a position in Holland; I was ready for something new and wanted to show my family the world. Unfortunately, they told me they wanted a data scientist with more commercial and “full pipeline” delivery experience: “please, let the next candidate in on your way out”. This was a few years back, when things were less competitive. That kind of experience is that much more important today.
Full Pipeline Data Scientist
And they were partly right. I built internal models for a hospital that may not have qualified as ‘commercial’, but I certainly built my share of pipelines. The problem is that most data scientists today have fewer applied skills than I had back then. Our educational system cranks them out that way. Some don’t even know what pipeline experience means, and those who do may not know how to implement a pipeline.
Full-pipeline experience is synonymous with being a data scientist and full stack developer squeezed into one. Some will argue that these positions are very different and would be better accomplished by different team members. But in most cases, on smaller teams, in fast startups, and more importantly for intuitive data science solutions, a data scientist should do it all, or at the very least, design it all and have others implement it. And, in the era of ‘A.I., the human job pillager’, the more useful you are, the longer you’ll survive.
It Ain’t Real Until it Reaches your Customer’s Plate
It is critical for data scientists to understand not only the data, the model, and how to explain the output, but also how that output is going to be consumed by the end user. Whether it is medical staff needing life-saving prognostics or a customer asking for clothing recommendations, today’s data scientists need to understand how their output will be digested. Some users will be lost and not have a clue what to do with a percentage: is a 60% or an 80% score a good loan recipient? You will need to work with the business expert or build a tunable parameter so they can adjust the output threshold as their business grows. Others will be insulted by a true/false output and will require the probability to scale their work or assign budget accordingly. And no, an end user will never need an AUC score…
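To make the threshold idea concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the names LoanDecision and decide, and the default threshold of 0.7, are hypothetical and not taken from any real project. It returns both the raw probability and the yes/no answer, so each kind of user gets what they need, and the business expert can move the threshold without touching the model.

```python
from dataclasses import dataclass


@dataclass
class LoanDecision:
    """Hypothetical output record: keeps both views of the same prediction."""
    probability: float  # raw model score, for users who scale work or budget by it
    approved: bool      # plain yes/no, for users who just need a decision


def decide(probability: float, threshold: float = 0.7) -> LoanDecision:
    """Turn a model probability into a decision using a business-tunable threshold.

    The 0.7 default is an arbitrary placeholder, not a recommended value.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return LoanDecision(probability=probability, approved=probability >= threshold)


if __name__ == "__main__":
    # The same two scores from the text, judged against one threshold.
    for p in (0.6, 0.8):
        print(decide(p, threshold=0.7))
```

Lowering the threshold to 0.5 would approve the 60% applicant as well; that single number is the knob the business expert gets to turn.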
Being a data scientist is a wonderful profession, but there is a troubling gap in the teaching material for becoming one. Data science isn’t about statistics and modeling; it is about fulfilling human needs and solving real problems. Not enough material tackles the big picture. That’s what is missing in this profession’s educational syllabus. If you build first and only then talk to your customer, your pipelines will be flawed and your solutions will miss their target.
Hi there, this is Manuel Amunategui. If you're enjoying the content, don't forget to sign up for my newsletter.
Some Ideas to Get You Started
Mind you, there are plenty of ways to build full pipelines, but here are some open source examples so we don’t get bogged down with proprietary solutions or NDAs (I apologize that they point to my own blog materials but, like I said, there is not much out there from an ML starting point).
- Life Coefficients — Modeling Life Expectancy and Prototyping it on the Web with Flask and PythonAnywhere
- Rapid Prototyping on Google App Engine — Easily Extend your Python ML Models into Interactive Web Applications — Trip Planner with Google Maps and Yelp
Today’s Models are Complex
Gone are the days of the single model with a prediction in a spreadsheet. Today, a solution can be a choreography of multiple models working asynchronously or synchronously, feeding one model’s prediction into another. A customer dashboard may contain one or many outputs, may have complex tunable parameters, and may have visuals reaching into the internals of your models. You need to be involved until the end. It is your responsibility: you need to understand the ‘full pipeline’.
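As a rough illustration of one model’s prediction feeding another, here is a small synchronous Python sketch. Everything in it is a stand-in I made up for the example: stage_one_model and stage_two_model are hypothetical placeholder formulas, not trained models, and the feature names are invented. The point is only the shape of the pipeline: the first output becomes an input to the second, and both are returned so a dashboard can show each one.

```python
from typing import Dict

Features = Dict[str, float]


def stage_one_model(features: Features) -> float:
    """Placeholder first model: a made-up score clipped to the range [0, 1]."""
    score = 0.5 * features["age"] / 100 + 0.5 * features["prior_visits"] / 10
    return max(0.0, min(1.0, score))


def stage_two_model(features: Features) -> float:
    """Placeholder second model: consumes the first model's score as a feature."""
    return 1.0 + 4.0 * features["stage_one_score"]


def run_pipeline(raw: Features) -> Dict[str, float]:
    """Run both stages in order and return every output the end user may see."""
    score = stage_one_model(raw)
    # Copy the raw features and add the first prediction for the second stage.
    enriched = {**raw, "stage_one_score": score}
    estimate = stage_two_model(enriched)
    return {"stage_one_score": score, "stage_two_estimate": estimate}


if __name__ == "__main__":
    print(run_pipeline({"age": 50, "prior_visits": 5}))
```

Returning the intermediate score alongside the final estimate is a deliberate choice here: it keeps the internals visible for the kind of dashboard described above.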
Thanks for reading and please share!!!
Manuel Amunategui - on Twitter: @amunategui