6.1. Pipelines and composite estimators
Transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator. The most common tool is a Pipeline. Pipeline is often used in combination with FeatureUnion, which concatenates the output of transformers into a composite feature space. TransformedTargetRegressor deals with transforming the target (e.g. log-transforming y). In contrast, Pipelines only transform the observed data (X).
6.1.1. Pipeline: chaining estimators
Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:
Convenience and encapsulation
You only have to call fit and predict once on your data to fit a whole sequence of estimators.
Joint parameter selection
You can grid search over parameters of all estimators in the pipeline at once.
Safety
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).
6.1.1.1. Usage
6.1.1.1.1. Construction
The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want to give this step and value is an estimator object:
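A minimal sketch of this construction. The step names "reduce_dim" and "clf" and the choice of PCA and SVC are illustrative assumptions; any transformer followed by any estimator would do:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Each step is a (name, estimator) pair; the names are arbitrary identifiers.
estimators = [("reduce_dim", PCA()), ("clf", SVC())]
pipe = Pipeline(estimators)

# The names can be read back from the steps attribute.
step_names = [name for name, _ in pipe.steps]
print(step_names)
```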
The utility function make_pipeline is a shorthand for constructing pipelines; it takes a variable number of estimators and returns a pipeline, filling in the names automatically:
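A short sketch of the shorthand, with StandardScaler and LogisticRegression picked only for illustration. The generated names are the lowercased class names:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# No names are passed; make_pipeline derives them from the class names.
auto_pipe = make_pipeline(StandardScaler(), LogisticRegression())
auto_names = [name for name, _ in auto_pipe.steps]
print(auto_names)
```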
6.1.1.1.2. Accessing steps
The estimators of a pipeline are stored as a list in the steps attribute, but can be accessed by index or name by indexing (with [idx] ) the Pipeline:
Pipeline’s named_steps attribute allows accessing steps by name with tab completion in interactive environments:
A sub-pipeline can also be extracted using the slicing notation commonly used for Python Sequences such as lists or strings (although only a step of 1 is permitted). This is convenient for performing only some of the transformations (or their inverse):
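The access patterns described above can be sketched as follows. The two-step pipeline and its step names are assumptions made for the example:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])

first_pair = pipe.steps[0]               # the raw (name, estimator) tuple
by_index = pipe[0]                       # estimator at position 0
by_name = pipe["reduce_dim"]             # same estimator, looked up by name
by_attr = pipe.named_steps.reduce_dim    # attribute access via named_steps

# Slicing returns a new Pipeline holding only the selected steps.
sub_pipe = pipe[:1]
last_only = pipe[-1:]
```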
6.1.1.1.3. Nested parameters
Parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax:
This is particularly important for doing grid searches:
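A sketch of both uses of the double-underscore syntax, assuming the iris dataset and a PCA plus SVC pipeline. The grid values are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])

# <step name>__<parameter> reaches into a single step.
pipe.set_params(clf__C=10)

# The same keys drive a grid search over several steps at once.
param_grid = {"reduce_dim__n_components": [2, 3], "clf__C": [0.1, 10]}
search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X, y)
best = search.best_params_
print(best)
```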
Individual steps may also be replaced as parameters, and non-final steps may be ignored by setting them to ‘passthrough’ :
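As a rough illustration (the estimators and step names are assumptions), a step can be swapped out or skipped through set_params:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])

# Replace the final estimator with a different one.
pipe.set_params(clf=LogisticRegression(max_iter=1000))
# Skip the non-final step: the data is passed through unchanged.
pipe.set_params(reduce_dim="passthrough")

pipe.fit(X, y)
predictions = pipe.predict(X)
```

The same values can appear in a GridSearchCV parameter grid, which makes it possible to search over whether a step is used at all.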
The estimators of the pipeline can be retrieved by index:
To enable model inspection, Pipeline has a get_feature_names_out() method, just like all transformers. You can use pipeline slicing to get the feature names going into each step:
You can also provide custom feature names for the input data using get_feature_names_out :
6.1.1.2. Notes
Calling fit on the pipeline is the same as calling fit on each estimator in turn, transforming the input and passing it on to the next step. The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.
6.1.1.3. Caching transformers: avoid repeated computation
Fitting transformers may be computationally expensive. With its memory parameter set, Pipeline will cache each transformer after calling fit . This feature is used to avoid computing the fit transformers within a pipeline if the parameters and input data are identical. A typical example is the case of a grid search in which the transformers can be fitted only once and reused for each configuration.
The parameter memory is needed in order to cache the transformers. memory can be either a string containing the directory where to cache the transformers or a joblib.Memory object:
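A minimal sketch of enabling the cache, assuming a temporary directory created with tempfile.mkdtemp and the iris dataset:

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A directory path is enough; a joblib.Memory object would also be accepted.
cachedir = mkdtemp()
pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())], memory=cachedir)
pipe.fit(X, y)
configured_memory = pipe.memory

# Remove the cache directory once it is no longer needed.
rmtree(cachedir)
```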
Side effect of caching transformers
Using a Pipeline without cache enabled, it is possible to inspect the original instance such as:
Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. In the following example, accessing the PCA instance pca2 will raise an AttributeError since pca2 will be an unfitted transformer. Instead, use the attribute named_steps to inspect estimators within the pipeline:
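The difference can be sketched like this. The names pca1 and pca2 follow the text; the dataset and the number of components are assumptions:

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Without a cache, the instance passed in is the one that gets fitted.
pca1 = PCA(n_components=2)
plain = Pipeline([("reduce_dim", pca1), ("clf", SVC())])
plain.fit(X, y)
original_fitted_without_cache = hasattr(pca1, "components_")

# With a cache, the pipeline fits a clone and pca2 itself stays unfitted.
cachedir = mkdtemp()
pca2 = PCA(n_components=2)
cached = Pipeline([("reduce_dim", pca2), ("clf", SVC())], memory=cachedir)
cached.fit(X, y)
original_fitted_with_cache = hasattr(pca2, "components_")

# The fitted transformer is reachable through named_steps instead.
fitted_pca = cached.named_steps["reduce_dim"]
rmtree(cachedir)
```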
6.1.2. Transforming target in regression
TransformedTargetRegressor transforms the targets y before fitting a regression model. The predictions are mapped back to the original space via an inverse transform. It takes as an argument the regressor that will be used for prediction, and the transformer that will be applied to the target variable:
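A small sketch using synthetic data (an assumption made to keep the example self-contained), with a QuantileTransformer applied to the target and a LinearRegression as the regressor:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import QuantileTransformer

# Synthetic, strictly positive target with a skewed distribution.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = np.exp(X @ np.array([0.5, -0.2, 0.1]) + 0.05 * rng.normal(size=200))

regr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=QuantileTransformer(n_quantiles=100, output_distribution="normal"),
)
regr.fit(X, y)

# Predictions are mapped back to the original target space automatically.
pred = regr.predict(X)
```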
For simple transformations, instead of a Transformer object, a pair of functions can be passed, defining the transformation and its inverse mapping:
Subsequently, the object is created as:
By default, the provided functions are checked at each fit to be the inverse of each other. However, it is possible to bypass this checking by setting check_inverse to False :
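A sketch of the function-based variant, assuming np.log and np.exp as the forward and inverse mappings and the same kind of synthetic positive target as before:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = np.exp(X @ np.array([0.5, -0.2, 0.1]) + 0.05 * rng.normal(size=200))

# func and inverse_func replace the transformer argument.
regr = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
regr.fit(X, y)
pred = regr.predict(X)

# check_inverse=False skips the round-trip check performed at fit time.
unchecked = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
    check_inverse=False,
)
unchecked.fit(X, y)
pred_unchecked = unchecked.predict(X)
```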
The transformation can be triggered by setting either transformer or the pair of functions func and inverse_func . However, setting both options will raise an error.
6.1.3. FeatureUnion: composite feature spaces
FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. The transformers are applied in parallel, and the feature matrices they output are concatenated side-by-side into a larger matrix.
When you want to apply different transformations to each field of the data, see the related class ColumnTransformer (see user guide ).
FeatureUnion serves the same purposes as Pipeline — convenience and joint parameter estimation and validation.
FeatureUnion and Pipeline can be combined to create complex models.
(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)
6.1.3.1. Usage
A FeatureUnion is built using a list of (key, value) pairs, where the key is the name you want to give to a given transformation (an arbitrary string; it only serves as an identifier) and value is an estimator object:
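A minimal sketch, assuming PCA and KernelPCA as the two transformers and the iris data as input. The outputs are stacked column-wise:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

X, _ = load_iris(return_X_y=True)

combined = FeatureUnion(
    [("linear_pca", PCA(n_components=2)), ("kernel_pca", KernelPCA(n_components=3))]
)

# 2 columns from PCA plus 3 columns from KernelPCA.
X_new = combined.fit_transform(X)
print(X_new.shape)
```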
Like pipelines, feature unions have a shorthand constructor called make_union that does not require explicit naming of the components.
Like Pipeline , individual steps may be replaced using set_params , and ignored by setting to ‘drop’ :
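The shorthand constructor and the 'drop' setting can be sketched together. The transformers are again an assumption; note that make_union names each component after its lowercased class name:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import make_union

X, _ = load_iris(return_X_y=True)

union = make_union(PCA(n_components=2), KernelPCA(n_components=3))
union_names = [name for name, _ in union.transformer_list]

# Dropping a component removes its columns from the output.
union.set_params(kernelpca="drop")
X_reduced = union.fit_transform(X)
print(union_names, X_reduced.shape)
```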
6.1.4. ColumnTransformer for heterogeneous data
Many datasets contain features of different types, say text, floats, and dates, where each type of feature requires separate preprocessing or feature extraction steps. Often it is easiest to preprocess data before applying scikit-learn methods, for example using pandas. Processing your data before passing it to scikit-learn might be problematic for one of the following reasons:
Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as data leakage), for example in the case of scalers or imputing missing values.
You may want to include the parameters of the preprocessors in a parameter search .
The ColumnTransformer helps perform different transformations for different columns of the data, within a Pipeline that is safe from data leakage and that can be parametrized. ColumnTransformer works on arrays, sparse matrices, and pandas DataFrames.
To each column, a different transformation can be applied, such as preprocessing or a specific feature extraction method:
For this data, we might want to encode the ‘city’ column as a categorical variable using OneHotEncoder but apply a CountVectorizer to the ‘title’ column. As we might use multiple feature extraction methods on the same column, we give each transformer a unique name, say ‘city_category’ and ‘title_bow’ . By default, the remaining rating columns are ignored ( remainder=’drop’ ):
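A sketch of this setup. The small DataFrame below is made up for illustration, with 'city', 'title' and two rating columns as the text describes:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame(
    {
        "city": ["London", "London", "Paris", "Sallisaw"],
        "title": [
            "His Last Bow",
            "How Watson Learned the Trick",
            "A Moveable Feast",
            "The Grapes of Wrath",
        ],
        "expert_rating": [5, 3, 4, 5],
        "user_rating": [4, 5, 4, 3],
    }
)

column_trans = ColumnTransformer(
    [
        # 2D selection (list of names) for OneHotEncoder.
        ("city_category", OneHotEncoder(dtype="int"), ["city"]),
        # 1D selection (single string) for CountVectorizer.
        ("title_bow", CountVectorizer(), "title"),
    ],
    remainder="drop",
)
Xt = column_trans.fit_transform(X)
n_words = len(column_trans.named_transformers_["title_bow"].vocabulary_)
print(Xt.shape)
```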
In the above example, the CountVectorizer expects a 1D array as input and therefore the columns were specified as a string ( ‘title’ ). However, OneHotEncoder, like most other transformers, expects 2D data, so in that case you need to specify the column as a list of strings ( [‘city’] ).
Apart from a scalar or a single item list, the column selection can be specified as a list of multiple items, an integer array, a slice, a boolean mask, or with a make_column_selector . The make_column_selector is used to select columns based on data type or column name:
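A sketch of selector-based column selection on a made-up DataFrame. The numeric columns are picked by dtype and the 'city' column by a name pattern:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame(
    {
        "city": ["London", "London", "Paris", "Sallisaw"],
        "expert_rating": [5, 3, 4, 5],
        "user_rating": [4, 5, 4, 3],
    }
)

ct = ColumnTransformer(
    [
        # All numeric columns, whatever their names.
        ("scale", StandardScaler(), make_column_selector(dtype_include=np.number)),
        # Columns whose name matches the regular expression "city".
        ("onehot", OneHotEncoder(), make_column_selector(pattern="city")),
    ]
)
Xt = ct.fit_transform(X)
print(Xt.shape)
```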
Strings can reference columns if the input is a DataFrame; integers are always interpreted as positional columns.
We can keep the remaining rating columns by setting remainder=’passthrough’ . The values are appended to the end of the transformation:
The remainder parameter can be set to an estimator to transform the remaining rating columns. The transformed values are appended to the end of the transformation:
The make_column_transformer function is available to more easily create a ColumnTransformer object. Specifically, the names will be given automatically. The equivalent for the above example would be:
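A sketch of the shorthand on a made-up DataFrame, with a MinMaxScaler applied to the remaining rating columns. The transformer names are generated from the class names:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

X = pd.DataFrame(
    {
        "city": ["London", "London", "Paris", "Sallisaw"],
        "title": [
            "His Last Bow",
            "How Watson Learned the Trick",
            "A Moveable Feast",
            "The Grapes of Wrath",
        ],
        "expert_rating": [5, 3, 4, 5],
        "user_rating": [4, 5, 4, 3],
    }
)

column_trans = make_column_transformer(
    (OneHotEncoder(), ["city"]),
    (CountVectorizer(), "title"),
    remainder=MinMaxScaler(),  # applied to the two rating columns
)
generated_names = [name for name, _, _ in column_trans.transformers]

Xt = column_trans.fit_transform(X)
n_words = len(column_trans.named_transformers_["countvectorizer"].vocabulary_)
print(generated_names, Xt.shape)
```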
If ColumnTransformer is fitted with a dataframe and the dataframe only has string column names, then transforming a dataframe will use the column names to select the columns:
6.1.5. Visualizing Composite Estimators
Estimators are displayed with an HTML representation when shown in a Jupyter notebook. This is useful to diagnose or visualize a Pipeline with many estimators. This visualization is activated by default:
It can be deactivated by setting the display option in set_config to ‘text’:
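A short sketch of toggling the display option. Outside a notebook the switch has no visible effect, so the example only reads the setting back:

```python
from sklearn import get_config, set_config

# Fall back to the plain-text repr.
set_config(display="text")
text_mode = get_config()["display"]

# Restore the HTML diagram.
set_config(display="diagram")
diagram_mode = get_config()["display"]
print(text_mode, diagram_mode)
```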
An example of the HTML output can be seen in the HTML representation of Pipeline section of Column Transformer with Mixed Types . As an alternative, the HTML can be written to a file using estimator_html_repr :
A Comprehensive Guide For scikit-learn Pipelines
scikit-learn has a very easy and useful architecture for building complete pipelines for machine learning. In this article, we’ll go through a step-by-step example of how to use the different features and classes of this architecture.
There are plenty of reasons why you might want to use a pipeline for machine learning like:
- Combine the preprocessing step with the inference step in one object.
- Save the complete pipeline to disk.
- Easily experiment with different techniques of preprocessing.
- Pipeline reuse.
- Easy cloud deployment.
Alright, now let’s get down to business. In this article we’ll use a fairly easy and old problem as an example, which is the Regression problem for predicting housing prices.
Download the data and you should have a train.csv file and a test.csv file. We’ll load both using pandas.
Loading the data
Feature selection
This data has 163 columns, however, we are not going to use all of them.
After doing a bit of EDA we choose a set of nominal, ordinal and numerical columns to work with.
If you want to see the entire selection process and EDA fully explained, you can see the notebook here
Preprocessing
Now let’s choose a preprocessing plan. A very straightforward one is the following:
- Ordinal features:
  - Impute missing data with most frequent value
  - Use Ordinal Encoding
- Nominal features:
  - Impute missing data with most frequent value
  - Use One Hot Encoding
- Numerical features:
  - Impute missing data with mean value
  - Use Standard Scaling
As you may see, each family of features has its own unique way of getting processed. Let’s create a Pipeline for each family.
We can do so by using the sklearn.pipeline.Pipeline Object
Now let’s join all of the above in one pipeline that targets each column with its family’s pipeline.
We can do so using the sklearn.compose.ColumnTransformer Object
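Here is a rough sketch of one pipeline per feature family joined by a ColumnTransformer. The column names ("quality", "neighborhood", "area") and the tiny DataFrame are hypothetical stand-ins, not the columns selected in the article:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical column families.
ordinal_cols = ["quality"]
nominal_cols = ["neighborhood"]
numeric_cols = ["area"]

ordinal_pipeline = Pipeline(
    [("imputer", SimpleImputer(strategy="most_frequent")), ("encoder", OrdinalEncoder())]
)
nominal_pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)
numeric_pipeline = Pipeline(
    [("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
)

# Route each family of columns to its own pipeline.
preprocessing = ColumnTransformer(
    [
        ("ordinal", ordinal_pipeline, ordinal_cols),
        ("nominal", nominal_pipeline, nominal_cols),
        ("numeric", numeric_pipeline, numeric_cols),
    ]
)

df = pd.DataFrame(
    {
        "quality": ["low", "high", np.nan, "low"],
        "neighborhood": ["A", "B", "A", np.nan],
        "area": [50.0, np.nan, 80.0, 65.0],
    }
)
features = preprocessing.fit_transform(df)
print(features.shape)
```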
Adding the model to the pipeline
Now that we’re done creating the preprocessing pipeline let’s add the model to the end.
If you’re waiting for the rest of the code, I’d like to tell you that that’s it. Pretty easy, isn’t it? If the scikit-learn maintainers asked to take my heart, I’d give it to them for such a great API.
The training and evaluation process is the same as for any normal model.
Saving and Loading Pipelines
Now we want to save the entire preprocessing parameters and model parameters of this pipeline to disk and load it whenever needed.
We are going to use joblib for this JOB. Get it? Sorry.
Save the pipeline
We are going to save the model as a pickle (.pkl) file. The code is fairly simple.
Load the pipeline
Now you’re on your Flask server and you wish to load the model to help a user predict the price of a house, so you want to load the model from disk when you start the server, or whenever a request is sent. That is also fairly simple.
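A minimal sketch of the save and load round trip with joblib. The file name "pipeline.pkl", the temporary directory, and the small Ridge pipeline on synthetic data are all assumptions made to keep the example self-contained:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X, y)

# Save the fitted pipeline (preprocessing and model together) to disk.
path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")
joblib.dump(pipe, path)

# Later, for example at server start-up, load it back and predict.
restored = joblib.load(path)
same_predictions = np.allclose(restored.predict(X), pipe.predict(X))
print(same_predictions)
```

As with any pickle-based format, the file should only be loaded from a trusted source and with a compatible scikit-learn version.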
Full Code
That’s it, Congratulations, you’ve just created, saved and loaded your complete pipeline.
I hope this article was helpful, if not, please tell me how to improve it, I would really appreciate that. Thank you.
What is a Pipeline & Why is it essential?

Let’s say you want to build a machine learning model to predict the quality of red wine. A common workflow for solving this task would be as follows.
Here, we first read the data and split it into a training and a test set. Once that is done, we need to prepare the data for machine learning before building the model: filling in missing values, scaling the data, one-hot encoding categorical features, etc.
Once we prepare the data, we can go forward and train the model on the training data and make predictions on the test data.
As you can see, there are lots of steps that need to be executed in the right order to train the model, and if you mess things up, your model will be complete garbage. And this is just a simple example of an ML workflow. As you start working with more complicated models, the chances of making errors are much higher. This is where the pipeline comes in.
What is a Pipeline?
A Pipeline is simply a method of chaining multiple steps together in which the output of the previous step is used as the input for the next step.
Let’s see how we can build the same model using a pipeline, assuming we have already split the data into a training and a test set.
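A sketch of such a pipeline. The wine data is not reproduced here, so synthetic regression data with a few missing values stands in for it, and the choice of imputer, scaler and KNeighborsRegressor is an assumption consistent with the steps described above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with some missing values in the first column.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X[::10, 0] = np.nan
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Impute, scale and model in a single object.
pipe = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), KNeighborsRegressor()
)
pipe.fit(X_train, y_train)      # transformers are fitted on the training set only
y_pred = pipe.predict(X_test)   # the test set is transformed, then predicted
step_names = list(pipe.named_steps)
```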
That’s it. Every step of the model from start to finish is defined in a single step, and scikit-learn does everything for you. When we call the fit method, it applies all the appropriate transformations to the training set and builds the model on it; when we call the predict method, it transforms the test set and makes the predictions.
Isn’t this simple and nice? Pipeline helps you hide complexity just like functions do. It also helps you avoid leaking information from your test data into the trained model during cross-validation which we will see later in this post. It is easier to use and debug. If you don’t like something you can easily replace that step with something else without making too many changes to your code. It is also nicer for others to read and understand your code.
Now, let’s see pipelines in more detail.
How to use a Pipeline in Scikit-Learn?
The Pipeline in scikit-learn is built using a list of ( key, value ) pairs where the key is a string containing the name you want to give to a particular step and value is an estimator object for that step.
There is also a shorthand syntax (make_pipeline) for making a pipeline that we saw earlier. It only takes the estimators and fills in the names automatically with the lowercase class names.
Rules for creating a Pipeline –
There are a few rules that you need to follow when creating a Pipeline in scikit-learn.
- All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method) The last estimator may be any type (transformer, classifier, etc.).
- Names for the steps can be anything you like as long as they are unique and don’t contain double underscores, since those are used during hyperparameter tuning.
Accessing Steps of a Pipeline –
The estimators of a pipeline are stored as a list in the steps attribute and can be accessed by index or by their name like this.
Pipeline’s named_steps attribute allows accessing steps by name with tab completion in interactive environments.
You can also use the slice notation to access them.
Grid Search using a Pipeline –
You can also do a grid search for hyperparameter optimization with a pipeline. To access the parameters of the estimators in the pipeline, use the <estimator>__<parameter> syntax.
Here, we want to set the number-of-neighbors parameter of the KNN model, so we use a double underscore after the estimator name: kneighborsregressor__n_neighbors.
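A sketch of that grid search, again on synthetic stand-in data. The candidate values for n_neighbors are arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipe = make_pipeline(SimpleImputer(), StandardScaler(), KNeighborsRegressor())

# Step name (lowercased class name) + "__" + parameter name.
param_grid = {"kneighborsregressor__n_neighbors": [3, 5, 7]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
best_k = search.best_params_["kneighborsregressor__n_neighbors"]
print(best_k)
```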
We can go one step further.
So far, we have only worked with a single algorithm (K-Nearest Neighbors), but many other algorithms might perform better. So now let’s try different algorithms and see which performs best, and also try different options for preparing the data, all in a single step.
Here, we tried 5 different algorithms with default values and we also tested the scaler and imputer method that works best with them. The best algorithm for this task is the RandomForestRegressor which is scaled and the mean is used to fill the missing values. Some other models that performed well are XGBRegressor and LinearRegression .
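One way to set up such a search is to treat whole steps as parameters. The sketch below is an assumption about the approach, not the article's exact code: it uses synthetic data, only three of the algorithms (XGBRegressor is left out to avoid an extra dependency), and a small random forest to keep it fast:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline(
    [
        ("imputer", SimpleImputer()),
        ("scaler", StandardScaler()),
        ("regressor", KNeighborsRegressor()),
    ]
)

param_grid = {
    "imputer__strategy": ["mean", "median"],
    # A step can be switched off entirely with "passthrough".
    "scaler": [StandardScaler(), "passthrough"],
    # The step itself is a parameter, so whole algorithms can be compared.
    "regressor": [
        KNeighborsRegressor(),
        LinearRegression(),
        RandomForestRegressor(n_estimators=20, random_state=0),
    ],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
best_regressor = search.best_params_["regressor"]
n_candidates = len(search.cv_results_["params"])
print(type(best_regressor).__name__, n_candidates)
```

Passing a list of dictionaries as param_grid instead of a single dictionary gives each algorithm its own set of hyperparameters to tune.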
We can do even more than this.
Now that we have narrowed things down to a few algorithms that perform well on this dataset, we can further improve the result by tuning the parameters of each model separately. Here, we use a separate dictionary for each of the algorithms that we want to tune.
Feature Selection with pipelines –
We can also do feature selection with a pipeline. There are various ways to do feature selection in scikit-Learn but we will only look at one of these. Later, I will write more about it in my future posts so make sure to subscribe to the blog.
We will do feature selection based on the p-value of each feature. If it is less than 0.5, we will select that feature for building the model and ignore the rest of the features.

ColumnTransformer with Pipelines –
So far, we have only worked with numerical data to keep things simple, but this is not always going to be the case. You are also going to have some categorical data like sex (Male, Female), and you can’t apply the same transformations, like mean and median imputation, to it. You have to apply a different transformation to the categorical data.
One of the easiest ways we can apply a different transformation to numerical and categorical columns in scikit-learn is by using the ColumnTransformer.
We will read a new data set which has mixed data types (numerical and categorical) and see how to apply everything that we have learned so far using a pipeline.

Now, we will build separate pipelines for numerical and categorical data and combine them using a ColumnTransformer that applies the appropriate transformations based on the column data type.
The ColumnTransformer requires a list of tuples where each tuple contains a name, a transformer, and a list of names (or indices) of the columns that the transformer should be applied to.
Here it is. We created a pipeline that encapsulates every step of the process that needs to be done to create the model. Isn’t this awesome? Nice and simple.
We can also do a grid search as before.
And we are done. We created a model from scratch and did everything using a pipeline. Hurray! Happy Days
I hope you enjoyed this post as much as I did. And if you find this post helpful then please subscribe to our blog below. And also share this post with others. Sharing is caring. And if you have any questions then feel free to ask me in the comment section below.
What is exactly sklearn.pipeline.Pipeline?
I can’t figure out how the sklearn.pipeline.Pipeline works exactly.
There are a few explanations in the doc. For example, what do they mean by:
Pipeline of transforms with a final estimator.
To make my question clearer, what are steps ? How do they work?
Edit
Thanks to the answers I can make my question clearer:
When I call pipeline and pass, as steps, two transformers and one estimator, e.g:
What happens when I call this?
I can’t figure out how an estimator can be a transformer and how a transformer can be fitted.
4 Answers
Transformer in scikit-learn: some class that has fit and transform methods, or a fit_transform method.
Predictor — some class that has fit and predict methods, or fit_predict method.
Pipeline is just an abstract notion, it’s not some existing ML algorithm. Often in ML tasks you need to perform a sequence of different transformations (find a set of features, generate new features, select only some good features) of the raw dataset before applying the final estimator.
Here is a good example of Pipeline usage. Pipeline gives you a single interface for all 3 steps of transformation and resulting estimator. It encapsulates transformers and predictors inside, and now you can do something like:
With pipelines you can easily perform a grid search over a set of parameters for each step of this meta-estimator, as described in the link above. All steps except the last one must be transformers; the last step can be a transformer or a predictor.
Answer to edit: When you call pipln.fit(), each transformer inside the pipeline is fitted on the output of the previous transformer (the first transformer is fitted on the raw dataset). The last estimator may be a transformer or a predictor. You can call fit_transform() on the pipeline only if the last estimator is a transformer (one that implements fit_transform, or transform and fit separately), and you can call fit_predict() or predict() only if the last estimator is a predictor. So you can’t call fit_transform or transform on a pipeline whose last step is a predictor.
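To make the fitting order concrete, here is a sketch that does the same work by hand and through a pipeline. The scaler, PCA and classifier are assumptions chosen for the illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# By hand: each transformer is fitted on the output of the previous one.
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression(max_iter=1000)
X_scaled = scaler.fit_transform(X)
X_reduced = pca.fit_transform(X_scaled)
clf.fit(X_reduced, y)
manual_pred = clf.predict(pca.transform(scaler.transform(X)))

# The pipeline performs the same sequence inside fit() and predict().
pipln = Pipeline(
    [
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=2)),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)
pipln.fit(X, y)
pipe_pred = pipln.predict(X)

same = np.array_equal(manual_pred, pipe_pred)
# The last step is a predictor, so the pipeline exposes no transform method.
can_transform = hasattr(pipln, "transform")
print(same, can_transform)
```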
I think that M0rkHaV has the right idea. Scikit-learn’s pipeline class is a useful tool for encapsulating multiple different transformers alongside an estimator into one object, so that you only have to call your important methods once ( fit() , predict() , etc). Let’s break down the two major components:
Transformers are classes that implement both fit() and transform() . You might be familiar with some of the sklearn preprocessing tools, like TfidfVectorizer and Binarizer . If you look at the docs for these preprocessing tools, you’ll see that they implement both of these methods. What I find pretty cool is that some estimators can also be used as transformation steps, e.g. LinearSVC !
Estimators are classes that implement both fit() and predict() . You’ll find that many of the classifiers and regression models implement both these methods, and as such you can readily test many different models. It is possible to use another transformer as the final estimator (i.e., it doesn’t necessarily implement predict() , but definitely implements fit() ). All this means is that you wouldn’t be able to call predict() .
As for your edit: let’s go through a text-based example. Using LabelBinarizer, we want to turn a list of labels into a list of binary values.
Now, when the binarizer is fitted on some data, it will have a structure called classes_ that contains the unique classes that the transformer ‘knows’ about. Without calling fit() the binarizer has no idea what the data looks like, so calling transform() wouldn’t make any sense. This is true if you print out the list of classes before trying to fit the data.
I get the following error when trying this:
But when you fit the binarizer on the vec list:
I get the following:
And now, after calling transform on the vec object, we get the following:
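The walkthrough above can be sketched as follows. The contents of vec are a hypothetical stand-in, since the original list is not shown:

```python
from sklearn.preprocessing import LabelBinarizer

vec = ["cat", "dog", "bird", "cat"]  # hypothetical labels

binarizer = LabelBinarizer()

# Before fit, classes_ does not exist yet, so accessing it fails.
try:
    binarizer.classes_
    known_before_fit = True
except AttributeError:
    known_before_fit = False

# After fit, the binarizer knows the sorted unique classes.
binarizer.fit(vec)
classes = list(binarizer.classes_)

# transform maps each label to a one-hot row over those classes.
encoded = binarizer.transform(vec)
print(classes)
print(encoded)
```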
As for estimators being used as transformers, let us use the DecisionTree classifier as an example of a feature extractor. Decision Trees are great for a lot of reasons, but for our purposes, what’s important is that they can rank the features that the tree found useful for predicting. In older scikit-learn releases, calling transform() on a fitted Decision Tree took your input data and kept what it thought were the most important features; that method has since been removed, and SelectFromModel wrapped around the tree now plays the same role. Either way, you can think of it as transforming your data matrix (n rows by m columns) into a smaller matrix (n rows by k columns), where the k columns are the k most important features that the Decision Tree found.
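In current scikit-learn releases, a tree is used as a feature selector by wrapping it in SelectFromModel. A minimal sketch, assuming the iris data and k = 2:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest tree importances;
# threshold=-np.inf makes max_features the only selection criterion.
selector = SelectFromModel(
    DecisionTreeClassifier(random_state=0), max_features=2, threshold=-np.inf
)
X_small = selector.fit_transform(X, y)
kept = selector.get_support()
print(X.shape, X_small.shape)
```

Because the selector has fit and transform, it can sit in the middle of a Pipeline like any other transformer.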