Docker — Containerization for Data Scientists

Data scientists come from different backgrounds. In today’s agile environment, it is essential to respond quickly to customer needs and deliver value. Delivering value faster means more wins for the customer and, in turn, more wins for the organization.

Information Technology is always under immense pressure to increase agility and speed up the delivery of new functionality to the business. A particular point of pressure is deploying new or enhanced application code at the frequency and immediacy that digital transformation typically demands. Under the covers, this problem is not simple, and it is compounded by infrastructure challenges, such as how long it takes to provide a platform for the development team or how difficult it is to build a test system that adequately emulates the production environment (ref: IBM). Docker and containers exploded onto the scene in 2013; they have reshaped software development and are causing a structural change in the cloud computing world.

It is essential for data scientists to be self-sufficient and to participate in continuous deployment activities. Building an effective model requires multiple iterations of deployment, so it is important to be able to make small changes, then deploy and test them frequently. Based on the questions I have received recently, I wanted to write this blog to help people understand what Docker and containers are, how they promote continuous deployment, and how they help the business.

In this blog, I am writing about Docker and covering the following:

  1. Why do we need Docker?
  2. Where does Docker operate in Data Science?
  3. What is Docker?
  4. How does Docker work?
  5. Advantages of using Docker

Why do we need Docker?

This happens many times in our work: whenever you develop a model, write code, or build an application, it always works on your laptop. However, issues appear when we try to run the same model or application in the testing or production environment. This happens because the computing environment on the developer's platform differs from the one in production. For example, you might have used Windows or a newer version of some software, while production runs Linux or a different software version.
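To make the mismatch concrete, here is a minimal Python sketch that records the operating system, the Python version, and the versions of a few packages, and then compares two such records. The function names and the package names (`numpy`, `pandas`) are my own illustrative assumptions; this is one simple way to spot differences, not a standard tool.

```python
import platform
from importlib import metadata


def environment_fingerprint(packages):
    """Collect the OS, Python version, and installed versions of the given packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "os": platform.system(),  # e.g. "Windows", "Linux", "Darwin"
        "python": platform.python_version(),
        "packages": versions,
    }


def diff_environments(dev, prod):
    """Return every entry whose value differs between two fingerprints."""
    differences = {}
    for key in ("os", "python"):
        if dev[key] != prod[key]:
            differences[key] = (dev[key], prod[key])
    for name in sorted(set(dev["packages"]) | set(prod["packages"])):
        dev_version = dev["packages"].get(name)
        prod_version = prod["packages"].get(name)
        if dev_version != prod_version:
            differences[name] = (dev_version, prod_version)
    return differences


if __name__ == "__main__":
    # "numpy" and "pandas" are example package names (an assumption, not from the article).
    print(environment_fingerprint(["numpy", "pandas"]))
```

Running the script on the laptop and on the production server, then passing both results to `diff_environments`, lists exactly the kind of differences that cause "it works on my machine" problems.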

Ideally, the developer’s system and the production environment would be consistent. However, this is very difficult to achieve, because each person has their own preferences and cannot be forced to use a uniform setup. This is where Docker comes into the picture and solves the problem.

Where does Docker operate in Data Science?

In the data science or software development life cycle, Docker comes in at the deployment stage.

Docker makes the deployment process easier and more efficient. It also resolves many of the issues that arise when deploying applications, particularly those caused by differences between environments.

What is Docker?

Docker is the world’s leading software container platform. Let’s take a real example. As we know, data science is a team effort and needs to be coordinated with other areas such as the client side (front-end development), the back end (server), the database, and the environment and library dependencies needed to run the model. The model is not deployed alone; it is deployed along with other software applications to form the final product.

[Image: technology stack components and the platforms they run on]

From the above picture, we can see a technology stack made up of different components, and a set of platforms that each have a different environment. We need to make sure that every component in the technology stack is compatible with every possible platform (hardware). In reality, it becomes complex to work with all the platforms because each component needs a different computing environment. This is one of the main problems in the industry, and we know that Docker can solve it. But how?

Let’s take one more practical use case, this time from the shipping industry.

[Image: shipping industry analogy]

Docker is a tool that makes it simpler to create, deploy, and run applications by using containers.

The container helps the data scientist or developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and deploy it as one package.

In simpler terms, developers and data scientists package all the software, models, and components into a box called a container, and Docker takes care of shipping this container to different platforms. The developer and data scientist can focus on the code, model, software, and dependencies and put them into the container; they don’t need to worry about deployment to the platform, which Docker takes care of. Machine learning algorithms have several dependencies, and Docker helps download and install them automatically when the image is built.

How does Docker work?

The developer or data scientist defines all the requirements (software, model, dependencies, etc.) in a file called a Dockerfile. In other words, a Dockerfile is a list of steps used to create a Docker image.
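As a rough illustration of such a list of steps, the Python sketch below writes out a minimal Dockerfile for a Python model. Everything specific in it is an assumption made for the example: the `python:3.11-slim` base image, the `requirements.txt` file, the `predict.py` script, and the `write_dockerfile` helper are hypothetical and should be adapted to your own project.

```python
from pathlib import Path

# Hypothetical example: the base image tag, file names, and start command are
# illustrative assumptions, not taken from the article.
DOCKERFILE = """\
# Start from an official Python base image
FROM python:3.11-slim

# Work inside /app in the image
WORKDIR /app

# Install the dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model and application code into the image
COPY . .

# Command executed when a container starts from this image
CMD ["python", "predict.py"]
"""


def write_dockerfile(directory="."):
    """Write the Dockerfile text into the given project directory and return its path."""
    path = Path(directory) / "Dockerfile"
    path.write_text(DOCKERFILE)
    return path


if __name__ == "__main__":
    print(f"Wrote {write_dockerfile()}")
```

Each instruction (`FROM`, `WORKDIR`, `COPY`, `RUN`, `CMD`) is one step. Copying `requirements.txt` before the rest of the code is a common choice, because the dependency layer can then be reused when only the code changes.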

Docker Image: Think of it like a food recipe with all the ingredients and procedures needed to make a dish. In simple terms, it is a read-only blueprint, built from the Dockerfile, that contains the application and all the dependencies required to run it on Docker.

Docker Hub: Docker's official online registry, where we can store and find Docker images. The free plan limits how many private repositories we can keep (one, at the time of writing), and a subscription is needed for more. Please refer to Docker's pricing page for the current limits.

Running a Docker image gives us a Docker container: containers are the runtime instances of an image. Images can be stored in an online registry such as Docker Hub, or in your own private repository (the Dockerfile itself can be kept in version control). From there, an image can be pulled to create a container in any environment (test, production, or any other). All our applications then run inside the container in both test and production, so the two environments behave the same because they run containers created from the same Docker image.
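The cycle described above can be sketched as a sequence of `docker` CLI commands. The Python helper below only assembles and prints them, so it runs without Docker installed; the `docker_workflow` function and the image name `myuser/churn-model:1.0` are hypothetical names chosen for this example.

```python
import shlex


def docker_workflow(image, context="."):
    """Return the docker CLI commands for a build -> push -> pull -> run cycle."""
    return [
        # Build an image from the Dockerfile found in `context`
        ["docker", "build", "-t", image, context],
        # Upload the image to a registry such as Docker Hub
        ["docker", "push", image],
        # Download the same image on the test or production host
        ["docker", "pull", image],
        # Start a container from the image and remove it when it exits
        ["docker", "run", "--rm", image],
    ]


if __name__ == "__main__":
    # "myuser/churn-model:1.0" is a hypothetical image name (an assumption).
    for command in docker_workflow("myuser/churn-model:1.0"):
        print(shlex.join(command))
```

To execute the commands instead of printing them, each list could be passed to `subprocess.run(command, check=True)` on a machine where Docker is installed and you are logged in to the registry. The first two commands would typically run on the build machine and the last two on the target host.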

Advantages of using Docker