Machine Learning Pipelines
I guess by now you have figured out that I am a data scientist as well. The Jupyter notebook is a dead giveaway. I said “as well” because if you are reading this, you are most likely a data scientist or aspiring to become one.
Data scientists love Jupyter notebooks, and I have come to realise why: they let us present information as we write software, and their interactivity lets us engage with our audience.
I guess we can agree that their main use is for presenting information. When it comes to building machine learning pipelines, however, they become our greatest obstacle. Anything with the word “pipeline” next to it is meant to be treated as a black box: something goes in on one end and something different comes out on the other. Hence, if you build machine learning models entirely inside Jupyter notebooks, you are most likely doing the opposite of a pipeline.
So am I saying that you should stop using Jupyter notebooks? Heck no! Am I that naive? Of course not. What I am suggesting is that we use both, and I will tell you where to use what and when. The solution could fit in the next two paragraphs, but then the article wouldn’t be worth reading, right? The other reason I don’t want to give you the solution right away is that I believe you need to appreciate the convenience machine learning pipelines bring before you will adopt them.
So we will start off by recapping the traditional steps taken when building a machine learning model:
1. Identify and load the dataset
2. Exploratory Data Analysis
3. Feature Engineering
4. Feature Selection
5. Train-Test Split
6. Model Training
7. Model Evaluation
It is important to note that some steps, like dealing with outliers, may form part of our EDA process, so not everything you know may be reflected here; likewise, hyperparameter tuning may form part of Model Evaluation.
Below is a sample dataset; in this case, we will start by approaching the model building the traditional way. First, let us introduce an imaginary dataset called the Jacob Dataset. Yes, I am narcissistic in that way.
Jacob Dataset

Age    Weight (lb)    Gender
27     200.23         Female
16     180.32         Male
33     310.04         Male
49     430.89         Female
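If you want to follow along, here is a minimal sketch of that table as a pandas DataFrame. Using pandas and the variable name jacob_df are my own choices for illustration, not anything from the original dataset:

```python
import pandas as pd

# The toy Jacob Dataset from the table above.
jacob_df = pd.DataFrame({
    "Age": [27, 16, 33, 49],
    "Weight (lb)": [200.23, 180.32, 310.04, 430.89],
    "Gender": ["Female", "Male", "Male", "Female"],
})
print(jacob_df)
```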
Stop it, stop it already. I am sure you are now giggling and saying that is a big ego for such a small dataset! Please! Just note that I have kept the dataset small so that I can demonstrate the important concepts. Now let us look at the training process using the traditional way of building machine learning models. We would start off by performing our EDA tasks until we get our data clean.
After cleaning the data, we would take 70% of it to use as our training data. Since we have both numerical and categorical columns in our dataset, we would need to apply transformations to make the data ready to be fed into a machine learning model. Once the data is ready, it is used to train our model. After training our model, we want to see how well it performs on unseen data, hence we will test it.
To evaluate our model, we take 80% of the remaining 30% of our data (24% of the full dataset) and perform all the transformations required to get it ready to be fed into our model. Once the data has been transformed, we use it to evaluate our model’s performance.
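To make this concrete, here is a rough sketch of the traditional approach with scikit-learn. The “Target” label column and the logistic regression model are placeholders I made up for illustration; in practice you would use your own label and estimator. Note how every transformer is fitted on the training split and then has to be reused on the evaluation split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Assume df is the cleaned dataset and "Target" is a hypothetical label column.
X, y = df.drop(columns=["Target"]), df["Target"]

# 70% for training; the remaining 30% is split 80/20 into evaluation and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.7, random_state=42)
X_eval, X_test, y_eval, y_test = train_test_split(X_rest, y_rest, train_size=0.8, random_state=42)

num_cols = ["Age", "Weight (lb)"]
cat_cols = ["Gender"]

# Fit the transformers on the training data only.
scaler = StandardScaler().fit(X_train[num_cols])
encoder = OneHotEncoder(handle_unknown="ignore").fit(X_train[cat_cols])

# Transform the training data and train the model.
X_train_ready = np.hstack([
    scaler.transform(X_train[num_cols]),
    encoder.transform(X_train[cat_cols]).toarray(),
])
model = LogisticRegression().fit(X_train_ready, y_train)

# Repeat the exact same transformations, with the already-fitted objects,
# to evaluate the model on the evaluation split.
X_eval_ready = np.hstack([
    scaler.transform(X_eval[num_cols]),
    encoder.transform(X_eval[cat_cols]).toarray(),
])
print("evaluation accuracy:", model.score(X_eval_ready, y_eval))
```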
The last two steps bring us to this diagram. As you can see, there is an extra step in the process: we tune the hyperparameters until we are happy with the performance of our model. However, since we have, in a sense, tuned our hyperparameters to perform well on our evaluation data, we need to retest the model on data it has never seen before.
To do our final test, we take the other 20% of that 30% (6% of the full dataset), which was not used during the training and evaluation of our model. We apply the same transformations as we did for the previous two datasets until the data is ready to be fed into the model. That data is used to test how well our model performs on “totally unseen” data. And that is it for building a machine learning model using traditional methods. Let us see the whole process in one picture.
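Continuing the sketch from above, the final test on the held-out portion repeats the same transformation code a third time, reusing the already-fitted scaler and encoder:

```python
# The same transformations, copy-pasted a third time for the final test set.
X_test_ready = np.hstack([
    scaler.transform(X_test[num_cols]),
    encoder.transform(X_test[cat_cols]).toarray(),
])
print("test accuracy:", model.score(X_test_ready, y_test))
```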
What is easily apparent in this image is the repetition of tasks, namely splitting our data into categorical and numerical columns, followed by their respective transformations. One can argue that this style of model building is prone to bugs, as copy-pasted code carries its bugs along with it. Furthermore, the objects used to perform the transformations during training have to be reused for the evaluation and test datasets. I can think of two more things wrong with the image above. Before you move on, take a few minutes and try to identify the potential issues yourself.
What would building machine learning models look like with pipelines, then? With machine learning pipelines, all we have to do is perform our EDA and make sure that our data is clean. Once the data is clean, i.e. no missing values are present in the dataset, we can split it into three groups, namely the train, evaluation, and test sets.
After splitting the data, the only thing left is to specify the sequence of steps the data goes through before it reaches our model. For this scenario, we want the data split into categorical and numerical columns, the relevant transformations performed, and the results joined back into one dataset before being sent to our model.
Our pipeline will expose methods to train and test our model.
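As a rough sketch of how this could look with scikit-learn (again using the made-up “Target” column, the splits from the earlier sketch, and a logistic regression placeholder), a ColumnTransformer handles the categorical/numerical split and a Pipeline chains it to the model, so training and testing each become a single call:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Split numerical and categorical columns and transform each appropriately,
# then join the results back into one feature matrix.
preprocess = ColumnTransformer(transformers=[
    ("numerical", StandardScaler(), ["Age", "Weight (lb)"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
])

# The pipeline chains preprocessing and the model into a single estimator.
pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])

# All transformers are fitted on the training data and reused automatically
# whenever the pipeline is asked to score new data.
pipeline.fit(X_train, y_train)
print("evaluation accuracy:", pipeline.score(X_eval, y_eval))
print("test accuracy:", pipeline.score(X_test, y_test))
```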
Finally, we will have something like this. There is no arguing that this is somewhat easier to work with. But I have to warn you, it takes lots of practice to be able to build pipelines and get them to work. By now you know why we need machine learning pipelines. I have also told you what they are but if you have not been paying attention, here is what they are:
“A machine learning pipeline is used to help automate machine learning workflows. They operate by enabling a sequence of data to be transformed and correlated together in a model that can be tested and evaluated to achieve an outcome, whether positive or negative.”
Now we have come to the end of our article, but before I wrap up, let me remind you when to use Jupyter notebooks and when to use pipelines. Use notebooks to experiment, and once you have built a model that you are satisfied with, translate it into a machine learning pipeline. Scikit-learn has a pipeline framework, and if you are using TensorFlow, there are scikit-learn-compatible wrappers for Keras models, so you can wrap a TensorFlow model and use it inside a scikit-learn pipeline. Thanks for reading. I hope you enjoyed it, and this is the end of building models without machine learning pipelines.
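As a closing sketch, and only as an illustration: one such wrapper is KerasClassifier from the third-party scikeras package (my choice here, not something from the article), which lets a Keras model sit at the end of a scikit-learn pipeline. The tiny network below is made up purely for demonstration:

```python
from tensorflow import keras
from scikeras.wrappers import KerasClassifier
from sklearn.pipeline import Pipeline

def build_model():
    # A deliberately tiny, illustrative network; the input shape is inferred at fit time.
    model = keras.Sequential([
        keras.layers.Dense(8, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Reusing the `preprocess` ColumnTransformer from the previous sketch.
tf_pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", KerasClassifier(model=build_model, epochs=10, verbose=0)),
])
```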