The role of the data scientist is evolving, and it's now more important than ever that data scientists know how to deploy and scale their solutions.
It's no longer enough for a data scientist to spend their time running ad-hoc analyses in isolation, leveraging machine learning or deep learning techniques on their local machine or on resources provisioned somewhere in the cloud. Businesses are moving beyond this state, placing a higher expectation on their data science teams to solve highly scalable and complex problems, and to deliver those solutions so that they can be used by the bulk of the business.
Consider, for example, the application of data science to predictive maintenance of equipment. Whilst a data scientist may be able to uncover a novel and sophisticated means of predicting equipment failure through their individual model development efforts, the business is unlikely to see any real value until that solution is in the hands of the relevant experts who can act to avoid those failures. And even then, the solution may need to be deployed across thousands of individual pieces of equipment before it provides the business true value.
This poses an interesting challenge for businesses, as many leaders no doubt expect that their data scientists have the requisite skills to tackle these deployment tasks and bring their models to life. After all, how difficult should it be to take a model and deploy it broadly for the business to use? Well, it can actually be extremely difficult, with such initiatives requiring resources not only to develop the solution, but also to handle infrastructure provisioning, configuration management, continuous integration, deployment, testing, and monitoring.
So, as a data scientist, how should you prepare yourself to confront these challenges? Which skills should you embrace and develop, and which should you suggest are better suited to a dedicated data engineer or software developer? Let's step through some core skills one by one.
Navigating a Unix-based OS
I still come across data scientists who only know how to build workflows using Windows, along with their preferred Windows-based analytical platforms and tools. That may be via a low-code data science workflow tool, a Jupyter Notebook environment, or a lightweight IDE on top of Anaconda, all easily installed on a standard Windows desktop. And whilst I'm not going to suggest that those data scientists can't use those tools to build incredibly helpful and even powerful workflows, I will go as far as suggesting that they will be limited in their flexibility to access and use more powerful compute resources, and definitely limited in their ability to create custom or complex deployments of those workflows so that others can use them. The answer? Get familiar with the Linux command line. You'll find those skills open a lot of doors for accessing and using remote server resources, or even some cloud-platform services. Best of all, Windows 10 users can nowadays install the Windows Subsystem for Linux (WSL) and be up and running with a terminal in no time.
Recommendation: If you haven't yet moved beyond Windows, grab WSL with a distribution such as Ubuntu, and get familiar with common terminal commands.
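For example (assuming a recent Windows 10 or 11 build where WSL is available), the setup and a handful of everyday commands look something like this; the paths and host names are just placeholders:

```bash
# From an elevated PowerShell prompt: install WSL with an Ubuntu distribution
wsl --install -d Ubuntu

# Then, from the Ubuntu terminal, a few commands worth getting comfortable with:
pwd                         # where am I?
ls -lh                      # list files with sizes
grep -r "read_csv" src/     # search a codebase for a pattern
ssh user@remote-host        # connect to a remote server
top                         # inspect running processes and memory
```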
'Proper' coding
It may seem obvious, but the ability to deploy and scale any kind of code-based data science solution is heavily dependent on the standard and structure of that code. We understand that not all data scientists come from a coding or software development background, so for some, this may be a difficult bridge to cross. But trust us, there can be a huge difference between a solution built following good coding standards and code that was just written to get the job done. Poorly written code will be difficult to maintain or diagnose, will likely be highly inefficient, may be highly prone to errors, or worse yet, prone to encountering errors without proper handling. If Python is your preferred language, and we imagine it is for many data scientists, then head over to realpython.com and brush up where needed. And please do spare some time to sit down and talk with your software developer counterparts. These experts will be able to help shape and mold your code into something worthy of being called 'deployable'.
Recommendation: Make sure you can code using best practices in your preferred language.
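As a deliberately simple, illustrative sketch of what 'deployable' style looks like in Python (the function, columns, and file paths here are hypothetical), compare this to a one-off script that silently assumes the file and columns exist:

```python
import logging
from pathlib import Path

import pandas as pd

logger = logging.getLogger(__name__)


def load_sensor_data(path: Path, required_columns: tuple[str, ...] = ("timestamp", "value")) -> pd.DataFrame:
    """Load sensor readings from a CSV file and validate the expected columns.

    Raises:
        FileNotFoundError: if the file does not exist.
        ValueError: if any required column is missing.
    """
    if not path.exists():
        raise FileNotFoundError(f"No sensor file found at {path}")

    df = pd.read_csv(path)

    missing = set(required_columns) - set(df.columns)
    if missing:
        raise ValueError(f"Sensor file {path} is missing columns: {sorted(missing)}")

    df["timestamp"] = pd.to_datetime(df["timestamp"])
    logger.info("Loaded %d rows from %s", len(df), path)
    return df
```

Small habits like type hints, docstrings, explicit validation, and logging are what make code maintainable and diagnosable once it leaves your machine.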
Version control
Being able to use a version control system is a necessity for any data scientist, and that's the case even if you are coding in isolation. A good version control system will let you save and version your work as you go, will allow you to revert to previous versions of your work, will let your work be shared and reviewed by others easily, and will make things much easier when you find yourself collaborating with multiple people on developing your solutions. And the good news is that version control is not difficult to use at all. For first-timers interested in using Git and GitHub for version control, we recommend walking through this tutorial. So no matter what type of data scientist you are, or how isolated your role, it's time to get familiar with version control.
Recommendation: No more excuses, start using a version control system such as Git in your day-to-day work.
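If you're starting from scratch, the day-to-day Git workflow is only a handful of commands (the file, branch, and remote names below are just placeholders):

```bash
git init                                        # turn the project folder into a repository
git add train_model.py README.md                # stage the files you want to track
git commit -m "Add initial model training script"
git checkout -b feature/tune-hyperparameters    # develop changes on a branch
# ...edit, add, and commit as you go...
git push -u origin feature/tune-hyperparameters # share the branch for review
```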
Use of containers
Containers are extremely handy, particularly in the era of cloud computing. They offer a means to bring together a prescribed set of software, libraries, scripts, and routines into a single package, which can then be easily deployed on local or remote resources using only the compute and memory you allow. That means you can use containers to create or mirror the exact environment you need, and you can use that environment in an extremely flexible manner, easily creating schedules to automatically spin up or kill off the resources as needed. These are incredibly useful skills for any data scientist, whether for the purposes of development or deployment. So for that reason, we would suggest all data scientists start familiarizing themselves with a container technology such as Docker. And while you may not need to go as far as understanding the ins and outs of container orchestration, we think at least having an appreciation of some of the common container management tools (e.g. Kubernetes) is also a good idea.
Recommendation: If you haven't used containers, grab Docker and build a docker-compose file with the requirements for your current project.
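As a rough sketch of what that might look like for a notebook-based project (the image, port, and mount path are examples rather than a prescription):

```yaml
# docker-compose.yml
services:
  notebook:
    image: jupyter/scipy-notebook:latest   # off-the-shelf Jupyter + scientific Python stack
    ports:
      - "8888:8888"                        # expose the notebook server on the host
    volumes:
      - ./:/home/jovyan/work               # mount the project folder into the container
```

Running `docker compose up` then gives you (or anyone else) the same environment, whether on a laptop or a remote server.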
Solution deployment and integration tools
This is where the requirements for a data scientist may become a little blurry. On one hand, if your organization has embraced a data science platform capable of tackling, and perhaps abstracting away, the deployment tasks needed for end-to-end delivery, then great. As a data scientist, you may be completely satisfied with understanding how to deploy your solutions via that platform for consumption. If, on the other hand, your organization does not have a dedicated platform, tends to develop its own custom solutions, or if you are working on a highly customized and specific problem, then there may be a requirement for data scientists to employ specific software development tools to develop and deploy solutions via a continuous process. These tools promote and enforce a clear strategy around regular code contributions, integration to a main line, automated builds and unit testing, as well as automated deployment. And whilst a data scientist may not be exposed or contribute to every part of this process, it will definitely pay to understand the concepts and workflow in case you do find yourself working on such a development effort. For a great overview, we recommend reading the 'Training and Debugging' and 'Testing and Deployment' sections of the fantastic Full Stack Deep Learning course.
Recommendation: Whilst you may not need to build your own CI/CD pipelines, go familiarize yourself with the concepts of a proper development pipeline.
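To make the idea concrete, a bare-bones continuous integration workflow (shown here using GitHub Actions purely as one example; the directory names are placeholders) might simply install dependencies, lint, and run the tests on every push:

```yaml
# .github/workflows/ci.yml
name: ci
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4               # pull the repository contents
      - uses: actions/setup-python@v5           # provision a Python interpreter
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt    # install dependencies (including flake8 and pytest)
      - run: flake8 src/                        # static checks on the code
      - run: pytest tests/                      # run the unit tests
```

A continuous deployment step would typically follow the same pattern, publishing a container image or model artifact once the tests pass.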
Solution monitoring
Solution monitoring is somewhat related to solution deployment, but in this case the focus is squarely back on the data scientist, at least in terms of defining the monitoring requirements and strategy. A data engineer or DevOps expert isn't going to be able to define criteria for data or concept drift, nor will they be able to define an event that triggers model re-training. And don't think for a second that you are going to get away with manually inspecting those model residuals whenever you get some spare time. Not when your model is tied to a critical business process, and especially not if your model has been scaled out so that you have hundreds, if not thousands, of model instances running at any given time. So, similar to the above, if your organization already makes use of a data science platform that offers features for monitoring and correcting model health, then simply familiarize yourself with those. Otherwise, we would recommend data scientists start experimenting with MLflow's tracking features, even if only as part of the model development process. MLflow will get you thinking about logging and monitoring the right way, and much earlier in the process.
Recommendation: Try wrapping MLflow over your current project and setting up a model monitoring strategy, even if it's not intended for deployment.
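As a minimal sketch of what that can look like during training (the experiment name, metrics, and synthetic data here are purely illustrative):

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Stand-in data purely for illustration; substitute your own features and target
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=500)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

mlflow.set_experiment("pump-failure-model")    # hypothetical experiment name

with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    # Log the parameters and metrics you would later watch for drift or degradation
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("val_mae", mean_absolute_error(y_val, model.predict(X_val)))

    # Persist the model alongside the run so versions can be compared over time
    mlflow.sklearn.log_model(model, "model")
```

Once runs are logged consistently, defining a drift threshold or a re-training trigger becomes a question of querying those logged metrics rather than digging through ad-hoc notebooks.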
What else?
We haven't even scratched the surface, to be honest. So, if you still find a gap in your organization's ability to deploy data science solutions, or just have a strong desire to continue building your productionization expertise, then it may be time to sink your teeth into building an API for your solution using a framework such as Flask or Django. Knowing how to bring your solution to life with those frameworks will open some serious doors, perhaps even to the point where you can create your own custom applications to interface with your solutions.
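As a taste of what that can look like, here is a minimal Flask sketch of a prediction endpoint (the model file and payload format are hypothetical):

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# A model serialized at the end of your training pipeline (hypothetical path)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    if payload is None or "features" not in payload:
        return jsonify({"error": "Request body must include a 'features' list"}), 400

    prediction = model.predict([payload["features"]])
    return jsonify({"prediction": prediction.tolist()})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Another application, or a simple `curl -X POST -H "Content-Type: application/json" -d '{"features": [1.2, 3.4]}' http://localhost:5000/predict`, can then consume your model without ever touching your training code.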