9. Model persistence
After training a scikit-learn model, it is desirable to have a way to persist the model for future use without having to retrain. The following sections give you some hints on how to persist a scikit-learn model.
9.1. Python specific serialization
It is possible to save a model in scikit-learn by using Python’s built-in persistence model, namely pickle:
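As a minimal sketch of the pickle approach, the snippet below fits a small classifier and round-trips it through an in-memory bytes object. The choice of `SVC` and the Iris dataset is an assumption made for illustration only.

```python
import pickle

from sklearn import datasets, svm

# Fit a small classifier on the Iris dataset (illustrative choice).
X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC()
clf.fit(X, y)

# Serialize the fitted estimator to a bytes object, then restore it.
s = pickle.dumps(clf)
clf2 = pickle.loads(s)

# The restored model should reproduce the original predictions.
print((clf2.predict(X) == clf.predict(X)).all())
```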
In the specific case of scikit-learn, it may be better to use joblib’s replacement for pickle (dump and load), which is more efficient on objects that carry large NumPy arrays internally, as is often the case for fitted scikit-learn estimators. Unlike pickle.dumps, it writes to disk (or to a file-like object) rather than returning a string.
Later you can load back the pickled model (possibly in another Python process) with:
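The dump and load steps described above might look like the following sketch. The filename `model.joblib` and the temporary directory are assumptions chosen so the example cleans up after itself.

```python
import os
import tempfile

from joblib import dump, load
from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical filename; any writable path works.
    path = os.path.join(tmp_dir, "model.joblib")
    dump(clf, path)        # write the fitted estimator to disk
    restored = load(path)  # read it back (possibly in another process)

same_predictions = bool((restored.predict(X) == clf.predict(X)).all())
print(same_predictions)
```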
The dump and load functions also accept file-like objects instead of filenames. More information on data persistence with Joblib is available in the Joblib documentation.
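To illustrate the file-like variant, here is a small sketch that uses an in-memory `io.BytesIO` buffer in place of a filename. The estimator and dataset are illustrative assumptions.

```python
import io

from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Write the model to an in-memory buffer instead of a named file.
buffer = io.BytesIO()
dump(clf, buffer)

# Rewind the buffer before reading the model back.
buffer.seek(0)
restored = load(buffer)

print((restored.predict(X) == clf.predict(X)).all())
```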
9.1.1. Security & maintainability limitations
pickle (and joblib by extension) has some issues regarding maintainability and security. Because of this:
Never unpickle untrusted data as it could lead to malicious code being executed upon loading.
While models saved using one version of scikit-learn might load in other versions, this is entirely unsupported and inadvisable. It should also be kept in mind that operations performed on such data could give different and unexpected results.
In order to rebuild a similar model with future versions of scikit-learn, additional metadata should be saved along the pickled model:
The training data, e.g. a reference to an immutable snapshot
The Python source code used to generate the model
The versions of scikit-learn and its dependencies
The cross validation score obtained on the training data
This should make it possible to check that the cross-validation score is in the same range as before.
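One way to record the metadata listed above is to store it in a small JSON document next to the pickled model. The sketch below is only one possible layout: the dictionary keys, the script name `train.py`, and the dataset reference are all hypothetical placeholders.

```python
import json
import pickle
import platform

import numpy
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Cross-validation score obtained on the training data.
cv_scores = cross_val_score(clf, X, y, cv=5)
clf.fit(X, y)

# Hypothetical metadata layout; adapt the keys to your project.
metadata = {
    "training_data": "iris (sklearn.datasets.load_iris)",  # placeholder snapshot reference
    "source_code": "train.py",  # hypothetical script name
    "python_version": platform.python_version(),
    "sklearn_version": sklearn.__version__,
    "numpy_version": numpy.__version__,
    "cv_score_mean": float(cv_scores.mean()),
}

model_bytes = pickle.dumps(clf)
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

When the model is rebuilt with a later scikit-learn version, the stored `cv_score_mean` gives a reference value to compare the new cross-validation score against.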
Aside from a few exceptions, pickled models should be portable across architectures, assuming the same versions of dependencies and Python are used. If you encounter an estimator that is not portable, please open an issue on GitHub. Pickled models are often deployed in production using containers, like Docker, in order to freeze the environment and dependencies.
If you want to know more about these issues and explore other possible serialization methods, please refer to this talk by Alex Gaynor.
9.1.2. A more secure format: skops
skops provides a more secure format via the skops.io module. It avoids using pickle and only loads files whose types and function references are trusted, either by default or by the user. The API is very similar to pickle, and you can persist your models as explained in the docs using skops.io.dump and skops.io.dumps.
You can load them back using skops.io.load and skops.io.loads. However, you need to specify the types that you trust. You can get the unknown types in a dumped object or file using skops.io.get_untrusted_types and, after checking its contents, pass it to the load function:
If you trust the source of the file or object, you can pass trusted=True.
Please report issues and feature requests related to this format on the skops issue tracker.
9.2. Interoperable formats
For reproducibility and quality control needs, when different architectures and environments should be taken into account, exporting the model in Open Neural Network Exchange (ONNX) format or Predictive Model Markup Language (PMML) format might be a better approach than using pickle alone. These formats are helpful when you want to use your model for prediction in a different environment from the one where it was trained.
ONNX is a binary serialization of the model. It has been developed to improve the usability of the interoperable representation of data models. It aims to facilitate the conversion of data models between different machine learning frameworks, and to improve their portability on different computing architectures. More details are available from the ONNX tutorial. To convert a scikit-learn model to ONNX, a specific tool, sklearn-onnx, has been developed.
PMML is an implementation of the XML document standard defined to represent data models together with the data used to generate them. Being human and machine readable, PMML is a good option for model validation on different platforms and for long-term archiving. On the other hand, as with XML in general, its verbosity does not help in production when performance is critical. To convert a scikit-learn model to PMML, you can use, for example, sklearn2pmml, distributed under the Affero GPLv3 license.
How to save a machine learning model in Python
In this article, let’s learn how to save and load your machine learning model in Python with scikit-learn.
Once we create a machine learning model, our job doesn’t end there. We can save the model to use in the future, using either the pickle or the joblib library. The dump method saves (serializes) the model, and the load method restores the saved model so it can be used again. Now let’s demonstrate how to do it. The dump and load functions of pickle and joblib take similar core arguments: the object to save and the file to write to or read from.
Syntax of the dump() method:
pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)
- obj: The Python object to pickle.
- file: The file or buffer to which the pickled object is written.
- fix_imports: If true and protocol is less than 3, pickle tries to map the new Python 3 names to the old module names used in Python 2, so that the pickle data stream is readable with Python 2. True is the default value. This is a keyword-only parameter, so it must be passed by name.
The load() method reads the pickled representation of an object from the open file object file and returns the reconstituted object hierarchy specified therein.
Example 1: Saving and loading models using pickle
Pickle is Python’s default method for serializing objects. Your machine learning models can be serialized/encoded using the pickling process, and the serialized format can then be saved to a file. When you want to deserialize/decode your model and use it to produce new predictions, you can load this file later. The example that follows trains a linear regression model. We fit the model on the training data and use the dump() method to save it; dump() takes the machine learning model and an open file. After loading the model with the load() method, we compute predictions on the test data. The root mean square error metric is used to evaluate the predictions of the model.
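The example described above could be sketched as follows. The synthetic dataset from `make_regression`, the filename `model.pkl`, and the split parameters are assumptions, since the original data is not shown here.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption; stands in for the real dataset).
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the model on the training data.
model = LinearRegression().fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")  # hypothetical filename
    # Save the fitted model to a file.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Load the model back from the file.
    with open(path, "rb") as f:
        loaded_model = pickle.load(f)

# Evaluate the loaded model with the root mean square error.
predictions = loaded_model.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))
print(f"RMSE: {rmse:.3f}")
```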
How to save and load your Scikit-learn models in a minute
Have you ever built a machine learning model and wondered how to save it? Well, in a minute, I will show you how to save your scikit-learn models to a file.
The saving of data is called Serialization, where we store an object as a stream of bytes to save on a disk. Loading or restoring the model is called Deserialization, where we restore the stream of bytes from the disk back to the Python object.
Why should you save your model?
- In case you need to recreate the trained model.
- To share the model with others. We can save the model to a file and share the file with others, who can load it to make predictions.
- When you need to use the model in production. Once we have trained the model on a huge data set and have a well-performing predictive model, saving it avoids long training times later.
Tools to save and restore models in Scikit-learn
The first tool we describe is Pickle, the standard Python tool for object serialization and deserialization. Afterwards, we look at the Joblib library which offers easy (de)serialization of objects containing large data arrays, and finally, we present a manual approach for saving and restoring objects to/from JSON (JavaScript Object Notation). None of these approaches represents an optimal solution, but the right fit should be chosen according to the needs of your project.
Model Initialisation
First, let’s create a scikit-learn model. In our example, we’ll use a Logistic Regression model and the Iris dataset. Let’s import the needed libraries, load the data, and split it into training and test sets.
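A minimal sketch of this setup step is given below. The `test_size`, `random_state`, and `stratify` values are assumptions, not the article's original settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris data as feature matrix X and target vector y.
X, y = load_iris(return_X_y=True)

# Hold out part of the data as a test set (split parameters are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4, stratify=y
)

print(X_train.shape, X_test.shape)
```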
Now let’s create the model with some non-default parameters and fit it to the training data. We assume that you have previously found the optimal parameters of the model, i.e. the ones which produce the highest estimated accuracy.
Using the fit method, the model has learned its coefficients which are stored in model.coef_ . The objective is to save the model's parameters and coefficients to file, so you don't need to repeat the model training and parameter optimization steps on new data.
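The fitting step could look like the sketch below. The hyperparameter values (`C=0.1`, `max_iter=1000`) are placeholders for whatever your own parameter search produced.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4, stratify=y
)

# Non-default parameters (placeholder values, not tuned).
model = LogisticRegression(C=0.1, max_iter=1000)
model.fit(X_train, y_train)

# The learned coefficients and intercepts are what we want to persist.
print(model.coef_.shape, model.intercept_.shape)
print(model.score(X_test, y_test))
```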
Our Model is trained now. We can save the model and later load the model to make predictions on unseen data.
Save your model with Pickle
Pickle is used for serializing and de-serializing Python object structures, a process also called marshalling or flattening. Serialization refers to the process of converting an object in memory to a byte stream that can be stored on a disk or sent over a network. Later on, this byte stream can be retrieved and de-serialized back to a Python object. Pickle is very useful when you are working with machine learning algorithms, where you need to save a trained model to generate predictions at a later time without having to retrain it.
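Here is a short sketch of saving and restoring the classifier with Pickle. The filename `pickle_model.pkl` and the model settings are illustrative assumptions.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "pickle_model.pkl")  # hypothetical filename
    # Pickle needs an open file object in binary mode.
    with open(pkl_path, "wb") as file:
        pickle.dump(model, file)
    with open(pkl_path, "rb") as file:
        pickle_model = pickle.load(file)

# The restored model can score and predict like the original.
score = pickle_model.score(X_test, y_test)
print(f"Test score: {score:.3f}")
```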
Save your model with Joblib
Joblib is part of the SciPy ecosystem and provides utilities for pipelining Python jobs. It also provides utilities for efficiently saving and loading Python objects that make use of NumPy data structures. This can be useful for machine learning algorithms that require a lot of parameters or store the entire dataset (like K-Nearest Neighbors). While Pickle requires a file object to be passed as an argument, Joblib works with both file objects and string filenames. Older versions of Joblib stored each large array of a model in a separate file, whereas current versions write a single file; either way, the save and restore procedure remains the same.
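As a sketch, the snippet below saves a K-Nearest Neighbors model by passing a plain string filename to Joblib. The filename `knn_model.joblib` and the value of `n_neighbors` are assumptions.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4, stratify=y
)

# K-Nearest Neighbors keeps the training data inside the fitted model.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "knn_model.joblib")  # hypothetical filename
    # A string filename is enough; dump returns the list of files written.
    written = joblib.dump(knn, path)
    joblib_model = joblib.load(path)

print(len(written))
print((joblib_model.predict(X_test) == knn.predict(X_test)).all())
```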
Save your model Using JSON format
Depending on your project, you may often find Pickle and Joblib unsuitable. Some of the reasons are discussed later in the Consideration section. We first import the json library and create a dictionary containing the coefficients and intercept. The coefficients and intercept are NumPy arrays, which cannot be dumped to JSON directly, so we convert each array to a list and store it in the dictionary.
We convert the Python dictionary to a JSON string using json.dumps. We want indented output, so we set the indent parameter to 4. Then we save the JSON string to a file.
To restore the parameters, we open the file in read mode and load the JSON data into a Python object, which in our case is a dictionary.
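The manual JSON steps above can be sketched as follows. Two things here are assumptions beyond the text: the class labels are saved alongside the coefficients and intercept, and the model is rebuilt by assigning the fitted attributes to a fresh estimator. That assignment is a convenience that works for this simple linear model, not an officially supported loading API.

```python
import json
import os
import tempfile

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# NumPy arrays are not JSON serializable, so convert them to lists.
params = {
    "coef": model.coef_.tolist(),
    "intercept": model.intercept_.tolist(),
    "classes": model.classes_.tolist(),  # assumption: also needed to predict labels
}
json_text = json.dumps(params, indent=4)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "regressor_param.txt")
    with open(path, "w") as f:
        f.write(json_text)
    # Read the file back into a plain Python dictionary.
    with open(path, "r") as f:
        loaded = json.load(f)

# Rebuild a model by restoring the fitted attributes by hand.
restored = LogisticRegression(max_iter=1000)
restored.coef_ = np.array(loaded["coef"])
restored.intercept_ = np.array(loaded["intercept"])
restored.classes_ = np.array(loaded["classes"])

print((restored.predict(X) == model.predict(X)).all())
```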
Since the data serialization using JSON actually saves the object into a string format, rather than byte stream, the ‘regressor_param.txt’ file could be opened and modified with a text editor. Although this approach would be convenient for the developer, it is less secure since an intruder can view and amend the content of the JSON file.
Consideration section
- The biggest drawback of the Pickle and Joblib tools is their limited compatibility across different model and Python versions.
- The internal structure of the model should remain the same between saving and restoring the model.
- Only restore models received from a trusted source. Both Pickle and Joblib files can contain malicious code, so it is not recommended to restore data from untrusted or unauthenticated sources.
Conclusion
Congratulations! You’re now ready to start pickling and unpickling files with Python. You’ll be able to save your machine learning models and resume work on them later on. The Pickle and Joblib libraries are quick and easy to use but have compatibility issues across different Python versions and changes in the learning model.
New to data science? Need mentoring?
You can reach out here
- Twitter: @akinwhande
- Linkedin: akinwande
I worked on an open source project that helps data scientists navigate basic data wrangling and preprocessing steps. The idea behind Slik is to jump-start supervised learning projects. A link to the documentation can be found here.
How to save a trained model by scikit-learn? [duplicate]
I am trying to re-create the predictions of a trained model, but I don’t know how to save a model. For example, I want to save a trained Gaussian process regressor and re-create its predictions later. The package I used to train the model is scikit-learn.
2 Answers
1. pickle
2. joblib
In the specific case of scikit-learn, it may be better to use joblib’s replacement for pickle (dump and load), which is more efficient on objects that carry large NumPy arrays internally, as is often the case for fitted scikit-learn estimators. Unlike pickle.dumps, it writes to disk (or to a file-like object) rather than returning a string.
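Applied to the Gaussian process regressor from the question, both options might look like the sketch below. The training data, kernel, `alpha` value, and the filename `gpr_model.joblib` are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import joblib
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy training data (assumption; replace with your own).
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(30, 1))
y = np.sin(X).ravel()

gpr = GaussianProcessRegressor(kernel=RBF(), alpha=1e-6, random_state=0)
gpr.fit(X, y)

# 1. pickle: round-trip through an in-memory bytes object.
gpr_from_pickle = pickle.loads(pickle.dumps(gpr))

# 2. joblib: round-trip through a file on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "gpr_model.joblib")  # hypothetical filename
    joblib.dump(gpr, path)
    gpr_from_joblib = joblib.load(path)

# Both restored models should reproduce the original predictions.
X_new = np.linspace(0, 5, 10).reshape(-1, 1)
expected = gpr.predict(X_new)
print(np.allclose(gpr_from_pickle.predict(X_new), expected))
print(np.allclose(gpr_from_joblib.predict(X_new), expected))
```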