An introduction to machine learning with scikit-learn
In this section, we introduce the machine learning vocabulary that we use throughout scikit-learn and give a simple learning example.
Machine learning: the problem setting
In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.
Learning problems fall into a few categories:
- supervised learning, in which the data comes with additional attributes that we want to predict. This problem can be either:
  - classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem would be handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning where one has a limited number of categories and for each of the n samples provided, one is to try to label them with the correct category or class.
  - regression: if the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.
- unsupervised learning, in which the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data, where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.
Training set and testing set
Machine learning is about learning some properties of a data set and then testing those properties against another data set. A common practice in machine learning is to evaluate an algorithm by splitting a data set into two. We call one of those sets the training set, on which we learn some properties; we call the other set the testing set, on which we test the learned properties.
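As a rough sketch of this idea, the snippet below splits the iris dataset into a training set and a testing set with train_test_split. The 25% test fraction and the random_state value are arbitrary choices made for illustration, not values taken from the text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load features X and labels y as NumPy arrays.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for testing (fraction chosen arbitrarily).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```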
Loading an example dataset
scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for classification and the diabetes dataset for regression.
In the following, we start a Python interpreter from our shell and then load the iris and digits datasets. Our notational convention is that $ denotes the shell prompt while >>> denotes the Python interpreter prompt:
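A minimal version of that loading step might look like this. It is written as a plain script rather than an interpreter session, so the shell and >>> prompts are omitted.

```python
from sklearn import datasets

# Both loaders return dictionary-like objects bundled with scikit-learn.
iris = datasets.load_iris()
digits = datasets.load_digits()

print(iris.data.shape)    # (150, 4)
print(digits.data.shape)  # (1797, 64)
```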
A dataset is a dictionary-like object that holds all the data and some metadata about the data. This data is stored in the .data member, which is an (n_samples, n_features) array. In the case of supervised problems, one or more response variables are stored in the .target member. More details on the different datasets can be found in the dedicated section.
For instance, in the case of the digits dataset, digits.data gives access to the features that can be used to classify the digits samples:
and digits.target gives the ground truth for the digit dataset, that is the number corresponding to each digit image that we are trying to learn:
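For illustration, both members can be inspected like this (a small sketch that only prints the first sample and the first few targets):

```python
from sklearn import datasets

digits = datasets.load_digits()

# Each row of .data holds the 64 pixel intensities of one image.
print(digits.data[0])

# .target holds the digit (0 to 9) that each image represents.
print(digits.target[:10])
```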
Shape of the data arrays
The data is always a 2D array, shape (n_samples, n_features), although the original data may have had a different shape. In the case of the digits, each original sample is an image of shape (8, 8) and can be accessed using:
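A short sketch of that access, which also checks that .data is simply the flattened version of .images:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# The first sample in its original 8x8 image form.
first_image = digits.images[0]
print(first_image.shape)

# Flattening every image to 64 values reproduces digits.data.
flattened = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flattened, digits.data))
```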
The simple example on this dataset illustrates how starting from the original problem one can shape the data for consumption in scikit-learn.
Loading from external datasets
To load from an external dataset, please refer to loading external datasets.
Learning and predicting
In the case of the digits dataset, the task is to predict, given an image, which digit it represents. We are given samples of each of the 10 possible classes (the digits zero through nine) on which we fit an estimator to be able to predict the classes to which unseen samples belong.
In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T).
An example of an estimator is the class sklearn.svm.SVC, which implements support vector classification. The estimator’s constructor takes as arguments the model’s parameters.
For now, we will consider the estimator as a black box:
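A sketch of constructing such an estimator is shown here. The particular values of gamma and C are assumptions chosen for illustration; the text only says that gamma is set manually.

```python
from sklearn import svm

# gamma and C are illustrative values, not tuned ones.
clf = svm.SVC(gamma=0.001, C=100.0)
print(clf)
```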
Choosing the parameters of the model
In this example, we set the value of gamma manually. To find good values for these parameters, we can use tools such as grid search and cross validation.
The clf (for classifier) estimator instance is first fitted to the data; that is, it must learn from the data. This is done by passing our training set to the fit method. For the training set, we’ll use all the images from our dataset, except for the last image, which we’ll reserve for our predicting. We select the training set with the [:-1] Python syntax, which produces a new array that contains all but the last item from digits.data:
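The fitting step could be sketched as follows, again assuming illustrative values for gamma and C:

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.0)  # illustrative parameter values

# Train on every image except the last one.
clf.fit(digits.data[:-1], digits.target[:-1])
print(clf.classes_)
```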
Now you can predict new values. In this case, you’ll predict using the last image from digits.data. By predicting, you’ll determine which digit class the classifier assigns to the last image, based on what it learned from the training set.
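A sketch of the prediction step follows. Note the [-1:] slice, which keeps the input two-dimensional as predict expects; the parameter values are the same illustrative assumptions as before.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.0)  # illustrative parameter values
clf.fit(digits.data[:-1], digits.target[:-1])

# Predict the class of the single held-out image.
prediction = clf.predict(digits.data[-1:])
print(prediction)
```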
The corresponding image is:
As you can see, it is a challenging task: after all, the images are of poor resolution. Do you agree with the classifier?
A complete example of this classification problem is available as an example that you can run and study: Recognizing hand-written digits.
Conventions
scikit-learn estimators follow certain rules to make their behavior more predictable. These are described in more detail in the Glossary of Common Terms and API Elements.
Type casting
Where possible, input of type float32 will maintain its data type. Otherwise input will be cast to float64 :
In this example, X is float32, and is unchanged by fit_transform(X).
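One way to observe this behavior is sketched below. StandardScaler is used here as an assumption, since it is a transformer that keeps float32 input as float32; it is not necessarily the transformer the original example relied on.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Build a small float32 input array.
X = rng.rand(10, 5).astype(np.float32)

# StandardScaler is an assumed stand-in transformer for this sketch.
X_new = StandardScaler().fit_transform(X)
print(X.dtype, X_new.dtype)
```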
Using float32-typed training (or testing) data is often more efficient than using the usual float64 dtype: it reduces memory usage and sometimes also reduces processing time by leveraging the vector instructions of the CPU. However, it can sometimes lead to numerical stability problems, making the algorithm more sensitive to the scale of the values and requiring adequate preprocessing.
Keep in mind however that not all scikit-learn estimators attempt to work in float32 mode. For instance, some transformers will always cast their input to float64 and return float64 transformed values as a result.
Regression targets are cast to float64 and classification targets are maintained:
Here, the first predict() returns an integer array, since iris.target (an integer array) was used in fit. The second predict() returns a string array, since iris.target_names was used for fitting.
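The two fits described above can be sketched like this, using SVC with default parameters as an assumed classifier:

```python
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
clf = SVC()

# Fit on integer targets: predictions come back as integers.
clf.fit(iris.data, iris.target)
int_pred = clf.predict(iris.data[:3])

# Fit on string targets: predictions come back as strings.
clf.fit(iris.data, iris.target_names[iris.target])
str_pred = clf.predict(iris.data[:3])

print(int_pred.dtype, str_pred.dtype)
```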
Refitting and updating parameters
Hyper-parameters of an estimator can be updated after it has been constructed via the set_params() method. Calling fit() more than once will overwrite what was learned by any previous fit():
Here, the default kernel rbf is first changed to linear via SVC.set_params() after the estimator has been constructed, and changed back to rbf to refit the estimator and to make a second prediction.
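A sketch of that sequence, assuming the iris data as training input:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

clf = SVC()  # the default kernel is "rbf"

# Switch to a linear kernel, fit, and predict.
clf.set_params(kernel="linear").fit(X, y)
linear_pred = clf.predict(X[:5])

# Switch back to rbf; this second fit discards the linear model.
clf.set_params(kernel="rbf").fit(X, y)
rbf_pred = clf.predict(X[:5])

print(clf.get_params()["kernel"])
```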
Multiclass vs. multilabel fitting
When using multiclass classifiers, the learning and prediction task that is performed is dependent on the format of the target data fit upon:
In the above case, the classifier is fit on a 1d array of multiclass labels and the predict() method therefore provides corresponding multiclass predictions. It is also possible to fit upon a 2d array of binary label indicators:
Here, the classifier is fit() on a 2d binary label representation of y, using the LabelBinarizer. In this case predict() returns a 2d array representing the corresponding multilabel predictions.
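The two cases can be sketched side by side. The toy X and y values and the use of OneVsRestClassifier around SVC are assumptions made for illustration, so the predicted values may differ from the output discussed in the text.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer
from sklearn.svm import SVC

# Toy data chosen for illustration.
X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]

classif = OneVsRestClassifier(estimator=SVC(random_state=0))

# 1d multiclass labels in, 1d multiclass predictions out.
pred_1d = classif.fit(X, y).predict(X)

# 2d binary indicators in, 2d indicator predictions out.
y_bin = LabelBinarizer().fit_transform(y)
pred_2d = classif.fit(X, y_bin).predict(X)

print(pred_1d.shape, pred_2d.shape)
```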
Note that the fourth and fifth instances returned all zeroes, indicating that they matched none of the three labels fit upon. With multilabel outputs, it is similarly possible for an instance to be assigned multiple labels:
In this case, the classifier is fit upon instances each assigned multiple labels. The MultiLabelBinarizer is used to binarize the 2d array of multilabels to fit upon. As a result, predict() returns a 2d array with multiple predicted labels for each instance.
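A comparable sketch for the multilabel case, where each instance carries several labels (the toy data is again an assumption):

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Toy data chosen for illustration.
X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]

# Turn the lists of labels into a 2d binary indicator matrix.
y_bin = MultiLabelBinarizer().fit_transform(y)

classif = OneVsRestClassifier(estimator=SVC(random_state=0))
pred = classif.fit(X, y_bin).predict(X)

print(y_bin.shape, pred.shape)
```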
What does the "fit" method in scikit-learn do? [closed]
Could you please explain what the "fit" method in scikit-learn does? Why is it useful?
Answer
In a nutshell: fitting is equal to training. Then, after it is trained, the model can be used to make predictions, usually with a .predict() method call.
To elaborate: Fitting your model to (i.e. using the .fit() method on) the training data is essentially the training part of the modeling process. It finds the coefficients for the equation specified via the algorithm being used (take for example umutto’s linear regression example, above).
Then, for a classifier, you can classify incoming data points (from a test set, or otherwise) using the predict method. Or, in the case of regression, your model will interpolate/extrapolate when predict is used on incoming data points.
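As a small illustration of fitting followed by predicting, here is a sketch with LinearRegression on made-up points that lie exactly on the line y = 2x + 1:

```python
from sklearn.linear_model import LinearRegression

# Made-up training data on the line y = 2x + 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

model = LinearRegression()
model.fit(X, y)  # training: finds the slope and intercept

print(model.coef_, model.intercept_)

# Extrapolate to a new point outside the training range.
prediction = model.predict([[4.0]])
print(prediction)
```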
It should also be noted that the "fit" nomenclature is sometimes used for non-machine-learning methods, such as scalers and other preprocessing steps. In this case, fit() computes whatever the transformation needs from your data (for example, the minimum and maximum for a min-max scaler, or the vocabulary and document frequencies for TF-IDF), and a separate transform() call applies it.
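For a preprocessing step, the same pattern might look like this sketch with MinMaxScaler on made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [3.0], [5.0]])  # made-up values

scaler = MinMaxScaler()
scaler.fit(X)  # records the per-feature minimum and maximum

print(scaler.data_min_, scaler.data_max_)

# transform() rescales the data to the [0, 1] range.
X_scaled = scaler.transform(X)
print(X_scaled.ravel())
```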
A Quick Introduction to the Sklearn Fit Method
In this tutorial, I’ll show you how to use the Sklearn Fit method to “fit” a machine learning model in Python.
So I’ll quickly review what the method does, I’ll explain the syntax, and I’ll show you a step-by-step example of how to use the technique.
A Quick Introduction to Model Fitting with Sklearn Fit
To understand what the sklearn fit function does, you need to know a little bit about the machine learning process.
Typically, when we build a machine learning model, we have a machine learning algorithm and a training data set.
Remember that a machine learning algorithm is a type of algorithm that learns as we expose it to data. To paraphrase Tom Mitchell: a machine learning algorithm is an algorithm that improves performance on a task as it is exposed to data.
So in order for a machine learning algorithm to learn, it must be exposed to some data.
We need to ‘train’ a machine learning algorithm with data
That’s where we use the training data.
The training dataset is an input that we use to enable the machine learning algorithm to “learn”, so it can improve its performance on the task.

Scikit-learn is a machine learning toolkit for Python. As such, it has tools for performing steps of the machine learning process, like training a model.
The scikit learn ‘fit’ method is one of those tools. The ‘fit’ method trains the algorithm on the training data, after the model is initialized. That’s really all it does.
So the sklearn fit method uses the training data as an input to train the machine learning model.
Then once it’s trained, we can use other scikit learn methods – like predict and score – to continue with the machine learning process.
The Syntax of the Sklearn Fit Method
Now that we’ve reviewed what the sklearn fit method does, let’s look at the syntax.
Keep in mind that the syntax explanation here assumes that you’ve imported scikit-learn and you already have a model initialized, such as LinearRegression, RandomForestRegressor, etc.
‘Fit’ syntax
Ok. Let’s look at the syntax.
When we call the fit method, we need to call it from an existing instance of a machine learning model (for example, LinearRegression, LogisticRegression, DecisionTreeRegressor, SVC).
Once you’ve initialized an instance of a model, then you can call the method.
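In outline, the call has the form model.fit(X, y), where X holds the training features and y the training targets. Here is a sketch with DecisionTreeRegressor and made-up data; it also shows that fit returns the model instance itself.

```python
from sklearn.tree import DecisionTreeRegressor

# Made-up training data: X must be 2-dimensional, y is 1-dimensional.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 4.0, 9.0]

model = DecisionTreeRegressor(random_state=0)

# fit() trains the model in place and returns the same object.
fitted = model.fit(X, y)
print(fitted is model)
```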

Example: How to Use Sklearn Fit
Now that we’ve looked at the syntax, let’s look at an example of how to use sklearn fit.
Here, I’ll show you an example of how to use the sklearn fit method to train a model.
There are several things you need to do in the example, including running some setup code, and then fitting the model.
Run Setup Code
Before you fit the model, you’ll need to do a few things.
- import scikit-learn and other packages
- create some training data
- initialize a model
Let’s quickly do each of those.
Import Scikit Learn and other packages
First, let’s import the packages that we’ll use
We’re going to import scikit learn.
And we’ll also import Numpy and Seaborn. We’ll use Numpy to create some dummy training data, and we’ll use Seaborn to plot the data.
You can import these packages with the following code:
Create Training Data
Next, we’ll create some training data.
Specifically, we’re going to create some data that’s roughly linear, with a little noise built in.
To do this, we’ll:
- create 51 evenly spaced numbers for the x-axis variable
- create a y-axis variable that’s linearly related to the x-axis variable, with some normally distributed noise
So here, we’ll use Numpy linspace and Numpy random normal to create our variables x_var and y_var .
Notice that we’re also using Numpy random seed, to set the seed for Numpy’s pseudo-random number generator, which is used by np.random.normal.
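A sketch of that data creation step is shown here. The range of the x values, the seed, and the noise parameters are assumptions, so results from this sketch will not necessarily match the specific numbers quoted later in the tutorial.

```python
import numpy as np

observation_count = 51

# 51 evenly spaced x values (the 0 to 10 range is an assumption).
x_var = np.linspace(start=0, stop=10, num=observation_count)

# Seed the generator so the noise is reproducible (seed value assumed).
np.random.seed(22)

# y is linear in x, plus normally distributed noise (loc and scale assumed).
y_var = x_var + np.random.normal(size=observation_count, loc=1, scale=2)

print(x_var[:5], y_var[:5])
```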
Let’s also plot the data with Seaborn:

Next, we’ll split the data into training and test sets using the train-test split function from scikit learn.
This gives us 4 datasets:
- training features (X_train)
- training target (y_train)
- test features (X_test)
- test target (y_test)
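The split could be sketched as follows. The data is recreated with assumed parameters, and test_size and random_state are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Recreate the dummy data (range, seed and noise are assumptions).
np.random.seed(22)
x_var = np.linspace(start=0, stop=10, num=51)
y_var = x_var + np.random.normal(size=51, loc=1, scale=2)

# Split into the four datasets listed above.
X_train, X_test, y_train, y_test = train_test_split(
    x_var, y_var, test_size=0.2, random_state=23
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```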
Initialize Model
Now, we’ll initialize a model object.
Here, we’ll use DummyRegressor for the sake of simplicity.
Once you run this, dummy_regressor is an sklearn model object, from which we can call the fit method.
Fit the Model
Now, we’ll fit the model:
Here, we’re fitting the model with X_train and y_train. As you can see, the first argument to fit is X_train and the second argument is y_train.
That’s typically what we do when we fit a machine learning model. We commonly fit the model with the “training” data.
Note that X_train has been reshaped into a 2-dimensional format.
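Putting the initialization and fitting steps together gives something like the sketch below. The data parameters and split settings are assumptions, and reshape(-1, 1) turns the 1-dimensional X_train into a single-column 2D array.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

# Recreate and split the dummy data (all parameters are assumptions).
np.random.seed(22)
x_var = np.linspace(start=0, stop=10, num=51)
y_var = x_var + np.random.normal(size=51, loc=1, scale=2)
X_train, X_test, y_train, y_test = train_test_split(
    x_var, y_var, test_size=0.2, random_state=23
)

# Initialize the model, then fit it on the reshaped training features.
dummy_regressor = DummyRegressor()
dummy_regressor.fit(X_train.reshape(-1, 1), y_train)

print(dummy_regressor.constant_)
```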
Predict
Commonly, after we fit a model, we then predict new output values, based on the test features (X_test). (Note that X_test needs to be in a 2D format, so we’ll reshape it with Numpy reshape.)
Let’s quickly do that:
Here, the model predicts the value 5.5831811 for any input, which may seem strange. That’s because we’re using the DummyRegressor model, for which the prediction is the average of the training y values (the mean of y_train).
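The baseline behavior can be checked with a sketch like this one. Because the data parameters here are assumptions, the predicted constant will generally differ from the value quoted above, but it should equal the mean of y_train.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

# Recreate and split the dummy data (all parameters are assumptions).
np.random.seed(22)
x_var = np.linspace(start=0, stop=10, num=51)
y_var = x_var + np.random.normal(size=51, loc=1, scale=2)
X_train, X_test, y_train, y_test = train_test_split(
    x_var, y_var, test_size=0.2, random_state=23
)

dummy_regressor = DummyRegressor()
dummy_regressor.fit(X_train.reshape(-1, 1), y_train)

# Every prediction is the same constant: the mean of y_train.
predictions = dummy_regressor.predict(X_test.reshape(-1, 1))
print(predictions[:3], y_train.mean())
```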
Again: this might seem strange, but it’s useful to use as a baseline, against which you can judge the performance of other machine learning models.
And in this case, it’s just a simple example that we can use when trying to learn how to fit a model with sklearn fit.
fit() vs predict() vs fit_predict() in Python scikit-learn
What’s the difference between fit, predict and fit_predict methods in sklearn
Towards Data Science
scikit-learn (or commonly referred to as sklearn) is probably one of the most powerful and widely used Machine Learning libraries in Python. It comes with a comprehensive set of tools and ready-to-train models — from pre-processing utilities, to model training and model evaluation utilities.
Many sklearn objects implement three specific methods, namely fit(), predict() and fit_predict(). Essentially, they are conventions applied in scikit-learn and its API. In this article, we are going to explore how each of these works and when to use one over the other.
Note that in this article we are going to explore the aforementioned functions using specific examples, but the concepts explained here are applicable to most (if not all) objects that implement these methods.
Before explaining the intuition behind fit(), predict() and fit_predict(), it is important to first understand what an estimator is in scikit-learn API. The reason why we need to know about estimators is simply because such objects implement the methods we are interested in.
What are estimators in scikit-learn
In scikit-learn, an estimator is an object that fits a model based on the input data (i.e. training data) and can then infer properties of new, unseen data. An estimator can be, for example, a classifier, a regressor, a clusterer or a transformer.
The library comes with the base class sklearn.base.BaseEstimator and all estimators should inherit from that class. The base class comes with two methods, namely get_params() and set_params() that can be used to get and set the parameters of an estimator respectively. Note that the estimators must explicitly provide all of their parameters in the constructor method (i.e. the __init__ method).
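A brief sketch of these two methods, using SVC as an example estimator:

```python
from sklearn.svm import SVC

clf = SVC(C=1.0, kernel="rbf")

# get_params() returns the constructor parameters as a dict.
params = clf.get_params()
print(params["C"], params["kernel"])

# set_params() updates parameters in place and returns the estimator.
returned = clf.set_params(C=10.0)
print(clf.get_params()["C"])
```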
What does fit() do
fit() is implemented by every estimator and it accepts an input for the sample data (X) and for supervised models it also accepts an argument for labels (i.e. target data y). Optionally, it can also accept additional sample properties such as weights etc.
fit methods are usually responsible for numerous operations. Typically, they should start by clearing any attributes already stored on the estimator and then perform parameter and data validation. They are also responsible for estimating the attributes from the input data, storing them as model attributes, and finally returning the fitted estimator.
Now as an example, let’s consider a classification problem where we need to train a SVC model to recognise hand-written images. In the code below, we first load our data and then split it into training and testing sets. Then we instantiate a SVC classifier and finally call fit() to train the model using the training data.
fit(X, y, sample_weight=None): Fit the SVM model according to the given training data.
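A sketch of those steps, assuming the bundled digits dataset and an arbitrary split:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the hand-written digits data.
X, y = load_digits(return_X_y=True)

# Split into training and testing sets (split settings are arbitrary).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Instantiate the classifier and train it on the training data.
clf = SVC()
clf.fit(X_train, y_train)
```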
X — Training vectors, where n_samples is the number of samples and n_features is the number of features.
y — Target values (class labels in classification, real numbers in regression).
sample_weight — Per-sample weights. Rescale C per sample. Higher weights force the classifier to put more emphasis on these points.
Now that we have successfully trained our model, we can now access the fitted parameters, as shown below:
Note that every estimator might have different parameters that you can access once the model is fitted. You can find which parameters you can access in the official documentation and in the ‘Attributes’ section of the specific estimator you are working with. Typically, fitted parameters use an underscore _ as a suffix. For the SVC classifier in particular, you can find the available fitted parameters in this section of the documentation.
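For instance, two of the fitted attributes of SVC can be inspected as in this sketch (the split settings are arbitrary assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
clf = SVC().fit(X_train, y_train)

# Attributes ending in an underscore exist only after fitting.
print(clf.classes_)
print(clf.support_vectors_.shape)
```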
What does predict() do
Now that we have trained our model, the next step typically involves predictions over the testing set. To do so, we need to call the method predict(), which uses the parameters learned by fit() to perform predictions on new, unseen test data points.
Essentially, predict() will perform a prediction for each test instance and it usually accepts only a single input (X). For classifiers and regressors, the predicted value will be in the same space as the one seen in training set. In clustering estimators, the predicted value will be an integer. The predicted values of the provided test instances will be returned as an array or sparse matrix.
Note that if you attempt to run predict() without first executing fit(), you will receive a sklearn.exceptions.NotFittedError.
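The following sketch triggers that error on purpose, catches it, and then shows that predict() works once the model has been fitted:

```python
from sklearn.datasets import load_digits
from sklearn.exceptions import NotFittedError
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
clf = SVC()

# Calling predict() before fit() raises NotFittedError.
try:
    clf.predict(X[:1])
    raised = False
except NotFittedError as exc:
    raised = True
    print("Not fitted yet:", exc)

# After fitting, the same call succeeds.
clf.fit(X, y)
prediction = clf.predict(X[:1])
print(prediction)
```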
What does fit_predict() do
Going forward, fit_predict() is more relevant to unsupervised or transductive estimators. Essentially, this method fits the model and performs predictions over the training data, and is thus more appropriate for operations such as clustering.
fit_predict(X, y=None, sample_weight=None)
Compute cluster centers and predict the cluster index for each sample. Equivalent to calling fit(X) followed by predict(X).
- clustering estimators in scikit-learn must implement fit_predict() method but not all estimators do so
- the arguments passed to fit_predict() are the same as those to fit()
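As an illustration, here is a sketch of fit_predict() with KMeans on a made-up set of points forming two well-separated groups. The choice of KMeans and its settings are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up groups of points, far apart along the first feature.
X = np.array(
    [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float
)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

# Fit the clustering and return a cluster index for each training sample.
labels = kmeans.fit_predict(X)
print(labels)
```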
Conclusion
In this article, we discussed the purpose of the three most commonly implemented functions in sklearn, namely fit(), predict() and fit_predict(). We explored what each does, how they differ, and in which use-cases you should use one over the other.
As mentioned in the introduction of this article, even though we used specific examples to demonstrate their behaviour, the concepts explained in the article are applicable to pretty much all estimators implementing these methods in scikit-learn.
The fit() method will fit the model to the input training instances while predict() will perform predictions on the testing instances, based on the parameters learned during fit. On the other hand, fit_predict() is more relevant to unsupervised learning where we don’t have labelled inputs.
A very similar topic is probably the comparison between fit(), transform() and fit_transform() methods which are implemented by scikit-learn transformers that are used to transform features. If you want to learn more about them you can read my Medium article below.