How to Make Synthetic Datasets with Python: A Complete Guide for Machine Learning
A good dataset is difficult to find, and sometimes you just want to make a point. Tedious loading and preparation can be overkill in those cases.
Today you’ll learn how to make synthetic datasets with Python and Scikit-Learn — a fantastic machine learning library. You’ll also learn how to play around with noise, class balance, and class separation.
You can download the Notebook for this article here.
Make your first synthetic dataset
Real-world datasets are often overkill for demonstrating concepts and ideas. Imagine you want to visually explain SMOTE (a technique for handling class imbalance). You first have to find a class-imbalanced dataset and project it to 2–3 dimensions for visualizations to work.
There’s a better way.
The Scikit-Learn library comes with a handy make_classification() function. It’s not the only one for creating synthetic datasets, but you’ll use it heavily today. It accepts various parameters that let you control the look and feel of the dataset, but more on that in a bit.
To start, you’ll need to import the required libraries. Refer to the following snippet:
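The original snippet isn’t reproduced here, but a minimal set of imports that covers everything used below looks like this (pandas for tabular handling, Matplotlib for plotting, and make_classification from scikit-learn):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
```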
You’re ready to create your first dataset. It’ll have 1000 samples assigned to two classes (0 and 1) with a perfect balance (50:50). All samples belonging to each class are centered around a single cluster. The dataset has only two features — to make the visualization easier:
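A sketch of that call; the column names x1, x2, and y are a convention chosen for this walkthrough rather than anything the API requires:

```python
X, y = make_classification(
    n_samples=1000,          # 1000 data points
    n_features=2,            # two features, for easy plotting
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,  # a single cluster per class
    weights=None,            # None keeps the classes balanced (50:50)
    random_state=42,
)
df = pd.DataFrame({'x1': X[:, 0], 'x2': X[:, 1], 'y': y})
```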
A call to sample() prints out five random data points:
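Continuing with the df built above:

```python
df.sample(5)
```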
Image 1 — Random sample of 5 rows (image by author)
This doesn’t give you the full picture behind the dataset. It’s two dimensional, so you can declare a function for data visualization. Here’s one you can use:
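Here is one possible implementation, assuming the df with columns x1, x2, and y from the previous snippet:

```python
def visualize(df: pd.DataFrame, title: str = 'Synthetic dataset') -> None:
    """Scatter plot of a two-feature dataset, colored by class."""
    plt.figure(figsize=(10, 7))
    plt.scatter(df.loc[df['y'] == 0, 'x1'], df.loc[df['y'] == 0, 'x2'], label='Class 0')
    plt.scatter(df.loc[df['y'] == 1, 'x1'], df.loc[df['y'] == 1, 'x2'], label='Class 1')
    plt.title(title)
    plt.legend()
    plt.show()

visualize(df)
```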
Here’s how it looks visually:
Image 2 — Visualization of a synthetic dataset (image by author)
That was fast! You now have a simple synthetic dataset you can play around with. Next, you’ll learn how to add a bit of noise.
Add noise
You can use the flip_y parameter of the make_classification() function to add noise.
This parameter represents the fraction of samples whose class is assigned randomly. Larger values introduce noise in the labels and make the classification task harder. Note that the default setting flip_y > 0 might lead to less than n_classes in y in some cases[1].
Here’s how to use it with our dataset:
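A sketch of the call; the flip_y value of 0.15 is an illustrative choice, not necessarily the one behind the original figure:

```python
X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, flip_y=0.15, random_state=42,
)
df = pd.DataFrame({'x1': X[:, 0], 'x2': X[:, 1], 'y': y})
visualize(df, title='Synthetic dataset with label noise')
```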
Here’s the corresponding visualization:
Image 3 — Visualization of a synthetic dataset with added noise (image by author)
You can see many more orange points in the blue cluster and vice versa, at least when compared with Image 2.
That’s how you can add noise. Let’s shift the focus to class balance next.
Tweak class balance
It’s common to see at least a bit of class imbalance in real-world datasets, and some datasets suffer from severe imbalance. For example, one in 1,000 bank transactions might be fraudulent, a balance ratio of 1:1000.
You can use the weights parameter to control class balance. It accepts a list of N or N − 1 values, where N is the number of classes. We only have two classes, so a single value is enough: it sets the proportion of samples assigned to class 0, and the remainder goes to class 1.
Let’s see what happens if we specify 0.95 as a value:
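Something along these lines, reusing the setup from before; weights=[0.95] assigns roughly 95% of the samples to class 0:

```python
X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.95], random_state=42,
)
df = pd.DataFrame({'x1': X[:, 0], 'x2': X[:, 1], 'y': y})
print(df['y'].value_counts(normalize=True))  # roughly 0.95 for class 0, 0.05 for class 1
visualize(df)
```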
Here’s how the dataset looks visually:
Image 4 — Visualization of a synthetic dataset with a class imbalance on positive class (image by author)
As you can see, only 5% of the dataset belongs to class 1. You can turn this around easily. Let’s say you want 5% of the dataset in class 0:
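The same call with the weight flipped, so class 0 gets roughly 5% of the samples:

```python
X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.05], random_state=42,
)
df = pd.DataFrame({'x1': X[:, 0], 'x2': X[:, 1], 'y': y})
visualize(df)
```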
Here’s the corresponding visualization:
Image 5 — Visualization of a synthetic dataset with a class imbalance on negative class (image by author)
And that’s all there is to class balance. Let’s finish by tweaking class separation.
Tweak class separation
By default, there are some overlapping data points (class 0 and class 1). You can use the class_sep parameter to control how separated the classes are. The default value is 1.
Let’s see what happens if you set the value to 5:
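For example, keeping the rest of the setup unchanged:

```python
X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, class_sep=5, random_state=42,
)
df = pd.DataFrame({'x1': X[:, 0], 'x2': X[:, 1], 'y': y})
visualize(df)
```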
Here’s how the dataset looks:
Image 6 — Visualization of a synthetic dataset with severe class separation (image by author)
As you can see, the classes are much more separated now. Higher parameter values result in better class separation, and vice versa.
You now know everything to make basic synthetic datasets for classification. Let’s wrap things up next.
Conclusion
Today you’ve learned how to make basic synthetic classification datasets with Python and Scikit-Learn. You can use them whenever you want to prove a point or implement some data science concept. Real datasets can be overkill for that purpose, as they often require rigorous preparation.
Feel free to explore the official documentation to learn about other useful parameters.
An introduction to machine learning with scikit-learn
In this section, we introduce the machine learning vocabulary that we use throughout scikit-learn and give a simple learning example.
Machine learning: the problem setting
In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.
Learning problems fall into a few categories:
- supervised learning, in which the data comes with additional attributes that we want to predict (see the scikit-learn supervised learning page). This problem can be either:
  - classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem would be handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning where one has a limited number of categories and, for each of the n samples provided, one tries to label them with the correct category or class.
  - regression: if the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.
- unsupervised learning, in which the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data, where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization (see the scikit-learn unsupervised learning page).
Training set and testing set
Machine learning is about learning some properties of a data set and then testing those properties against another data set. A common practice in machine learning is to evaluate an algorithm by splitting a data set into two. We call one of those sets the training set, on which we learn some properties; we call the other set the testing set, on which we test the learned properties.
Loading an example dataset
scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for classification and the diabetes dataset for regression.
In the following, we start a Python interpreter from our shell and then load the iris and digits datasets. Our notational convention is that $ denotes the shell prompt while >>> denotes the Python interpreter prompt:
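For example:

```
$ python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> digits = datasets.load_digits()
```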
A dataset is a dictionary-like object that holds all the data and some metadata about the data. This data is stored in the .data member, which is an (n_samples, n_features) array. In the case of supervised problems, one or more response variables are stored in the .target member. More details on the different datasets can be found in the dedicated section.
For instance, in the case of the digits dataset, digits.data gives access to the features that can be used to classify the digits samples:
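For example, the digits data is an array of 1797 samples with 64 features each:

```
>>> digits.data.shape
(1797, 64)
```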
and digits.target gives the ground truth for the digit dataset, that is the number corresponding to each digit image that we are trying to learn:
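For example:

```
>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])
```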
Shape of the data arrays
The data is always a 2D array, shape (n_samples, n_features), although the original data may have had a different shape. In the case of the digits, each original sample is an image of shape (8, 8) and can be accessed using:
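For example:

```
>>> digits.images[0].shape
(8, 8)
>>> digits.images.shape
(1797, 8, 8)
```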
The simple example on this dataset illustrates how starting from the original problem one can shape the data for consumption in scikit-learn.
Loading from external datasets
To load from an external dataset, please refer to loading external datasets.
Learning and predicting
In the case of the digits dataset, the task is to predict, given an image, which digit it represents. We are given samples of each of the 10 possible classes (the digits zero through nine) on which we fit an estimator to be able to predict the classes to which unseen samples belong.
In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T).
An example of an estimator is the class sklearn.svm.SVC, which implements support vector classification. The estimator’s constructor takes as arguments the model’s parameters.
For now, we will consider the estimator as a black box:
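With manually chosen parameter values (gamma is set by hand, as noted just below):

```
>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)
```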
Choosing the parameters of the model
In this example, we set the value of gamma manually. To find good values for these parameters, we can use tools such as grid search and cross-validation.
The clf (for classifier) estimator instance is first fitted to the data; that is, it must learn from the data. This is done by passing our training set to the fit method. For the training set, we’ll use all the images from our dataset except for the last image, which we’ll reserve for our prediction. We select the training set with the [:-1] Python syntax, which produces a new array that contains all but the last item from digits.data:
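For example (the printed repr may differ slightly between scikit-learn versions):

```
>>> clf.fit(digits.data[:-1], digits.target[:-1])
SVC(C=100.0, gamma=0.001)
```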
Now you can predict new values. In this case, you’ll predict using the last image from digits.data . By predicting, you’ll determine the image from the training set that best matches the last image.
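For example:

```
>>> clf.predict(digits.data[-1:])
array([8])
```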
The corresponding image is:
As you can see, it is a challenging task: after all, the images are of poor resolution. Do you agree with the classifier?
A complete example of this classification problem is available as an example that you can run and study: Recognizing hand-written digits .
Conventions
scikit-learn estimators follow certain rules to make their behavior more predictable. These are described in more detail in the Glossary of Common Terms and API Elements.
Type casting
Where possible, input of type float32 will maintain its data type. Otherwise input will be cast to float64 :
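A sketch of what this looks like with a transformer that preserves float32 (kernel_approximation.RBFSampler is one such estimator):

```
>>> import numpy as np
>>> from sklearn import kernel_approximation

>>> rng = np.random.RandomState(0)
>>> X = rng.rand(10, 2000)
>>> X = np.array(X, dtype='float32')
>>> X.dtype
dtype('float32')

>>> transformer = kernel_approximation.RBFSampler()
>>> X_new = transformer.fit_transform(X)
>>> X_new.dtype
dtype('float32')
```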
In this example, X is float32 , and is unchanged by fit_transform(X) .
Using float32-typed training (or testing) data is often more efficient than using the usual float64 dtype: it reduces memory usage and sometimes also reduces processing time by leveraging the vector instructions of the CPU. However, it can sometimes lead to numerical stability problems, causing the algorithm to be more sensitive to the scale of the values and to require adequate preprocessing.
Keep in mind, however, that not all scikit-learn estimators attempt to work in float32 mode. For instance, some transformers will always cast their input to float64 and return float64 transformed values as a result.
Regression targets are cast to float64 and classification targets are maintained:
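For example, fitting an SVC on integer targets and then on string targets (exact reprs can vary slightly with the NumPy version):

```
>>> from sklearn import datasets
>>> from sklearn.svm import SVC

>>> iris = datasets.load_iris()
>>> clf = SVC()
>>> clf.fit(iris.data, iris.target)
SVC()
>>> list(clf.predict(iris.data[:3]))
[0, 0, 0]

>>> clf.fit(iris.data, iris.target_names[iris.target])
SVC()
>>> list(clf.predict(iris.data[:3]))
['setosa', 'setosa', 'setosa']
```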
Here, the first predict() returns an integer array, since iris.target (an integer array) was used in fit. The second predict() returns a string array, since iris.target_names was used for fitting.
Refitting and updating parameters
Hyper-parameters of an estimator can be updated after it has been constructed via the set_params() method. Calling fit() more than once will overwrite what was learned by any previous fit() :
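For example, on the iris data (the first five samples all belong to class 0, so both fits predict zeros for them):

```
>>> from sklearn.datasets import load_iris
>>> from sklearn.svm import SVC

>>> X, y = load_iris(return_X_y=True)

>>> clf = SVC()
>>> clf.set_params(kernel='linear').fit(X, y)
SVC(kernel='linear')
>>> clf.predict(X[:5])
array([0, 0, 0, 0, 0])

>>> clf.set_params(kernel='rbf').fit(X, y)
SVC()
>>> clf.predict(X[:5])
array([0, 0, 0, 0, 0])
```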
Here, the default kernel rbf is first changed to linear via SVC.set_params() after the estimator has been constructed, and changed back to rbf to refit the estimator and to make a second prediction.
Multiclass vs. multilabel fitting
When using multiclass classifiers , the learning and prediction task that is performed is dependent on the format of the target data fit upon:
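For example, with a small toy dataset and a one-vs-rest classifier:

```
>>> from sklearn.svm import SVC
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn.preprocessing import LabelBinarizer

>>> X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
>>> y = [0, 0, 1, 1, 2]

>>> classif = OneVsRestClassifier(estimator=SVC(random_state=0))
>>> classif.fit(X, y).predict(X)
array([0, 0, 1, 1, 2])
```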
In the above case, the classifier is fit on a 1d array of multiclass labels and the predict() method therefore provides corresponding multiclass predictions. It is also possible to fit upon a 2d array of binary label indicators:
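Continuing the example above:

```
>>> y = LabelBinarizer().fit_transform(y)
>>> classif.fit(X, y).predict(X)
array([[1, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 0]])
```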
Here, the classifier is fit() on a 2d binary label representation of y , using the LabelBinarizer . In this case predict() returns a 2d array representing the corresponding multilabel predictions.
Note that the fourth and fifth instances returned all zeroes, indicating that they matched none of the three labels fit upon. With multilabel outputs, it is similarly possible for an instance to be assigned multiple labels:
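Continuing the same example; the exact predictions are not shown here, but the output is a 2d indicator array with one row per instance and one column per label, possibly with several 1s per row:

```
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
>>> y = MultiLabelBinarizer().fit_transform(y)
>>> classif.fit(X, y).predict(X)
```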
In this case, the classifier is fit upon instances each assigned multiple labels. The MultiLabelBinarizer is used to binarize the 2d array of multilabels to fit upon. As a result, predict() returns a 2d array with multiple predicted labels for each instance.
How to Make a Dataset in Python
We are starting to get acquainted with Pandas, the main Python library for data analysis.
It is a very convenient tool for a data journalist, and not the hardest one to master. It lets you work with data in the familiar tabular form. In Pandas you can explore a dataset, clean it, make changes, analyze it and draw conclusions, build charts, and much more.
Unlike Excel, which also has broad data-handling capabilities, Pandas copes even with very large files of hundreds of thousands or millions of rows. Standard Excel cannot handle that, nor can it work with files in the json format, in which open data is often published. For example, on the ГосРасходы (GosRaskhody) portal, data on government contracts and subsidies can only be exported as json.
We will devote several lessons to the Pandas library. Today is a first introduction: we will learn the most basic things in Pandas, which are already enough to analyze real data.
1. Creating dataframes from scratch
To begin, let's load the pandas library; for short we will refer to it as pd.
If you are not working through Anaconda, install Pandas with the command pip3 install pandas.
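In a notebook cell:

```python
import pandas as pd
```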
In Pandas, data is represented as a dataframe, i.e. a table of data. It can be created from scratch, for example from dictionaries, lists, or tuples.
Let's create our first dataframe, for the three largest cities of Russia. Below is a dictionary whose keys are city and population, and whose values are lists with the city names and the corresponding population figures.
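A sketch of what this looks like; the population figures are rough, illustrative values:

```python
cities = {
    'city': ['Moscow', 'Saint Petersburg', 'Novosibirsk'],
    'population': [12_600_000, 5_400_000, 1_600_000],  # approximate figures
}
df_cities = pd.DataFrame(cities)
df_cities
```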
The first column shows the index (row number); the rest looks like a familiar table.
You can also create a dataframe from a list of lists.
This time the values were written in row by row, so it is more convenient to swap the rows and columns, that is, to transpose the dataframe.
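For example, with the same illustrative values; .T swaps rows and columns so that city and population become columns again:

```python
rows = [
    ['Moscow', 'Saint Petersburg', 'Novosibirsk'],
    [12_600_000, 5_400_000, 1_600_000],
]
df_cities = pd.DataFrame(rows, index=['city', 'population']).T
df_cities
```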
Now we know how to create a dataframe from scratch. In practice, though, it usually has to be loaded from a file in excel, csv, or json format.
2. Creating a dataframe from a file
Today we will work with data on deaths from various causes. Go to the website of the Institute for Health Metrics and Evaluation (IHME) and select the mortality indicators we need, broken down by country.
Select the following settings:
- Location: Countries and territories
- Year: 2000-2019
- Content: Cause (cause of death)
- Age: All ages
- Metric: Rate (number of deaths from a given cause per 100,000 inhabitants)
- Measure: Deaths
- Sex: Male, Female, Both
- Cause: level 2 causes (the second level of detail for causes of death)
Then click Download csv.
You will need to tick the boxes agreeing not to use the data for commercial purposes, choose Names, and enter your email address.
The data may not arrive right away, so the same dataset can also be downloaded from this link. It is a large dataset of 260 thousand rows, which would be hard to work with in Excel.
Copy the path to the file on your computer and load it straight into a dataframe with the pd.read_csv() method.
Print the first 5 rows of the dataframe with the .head() method.
To print an arbitrary number of rows, put the desired number in the parentheses.
The last rows are printed with the .tail() method.
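A sketch of these steps; the file name below is a placeholder for wherever you saved the download:

```python
df = pd.read_csv('ihme_mortality.csv')  # path to the downloaded file

df.head()    # first 5 rows
df.head(10)  # first 10 rows
df.tail()    # last 5 rows
```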
The df.info() method describes the dataframe: how many rows and columns it has, which data types it contains, how many non-null values there are, and how much memory it takes up.
If we only want to know how many rows and columns there are, we use the shape attribute.
The describe method gives statistics for the numeric columns: mean, maximum and minimum, quartiles, standard deviation.
If we want statistics not only for the numeric columns but for all the others as well, we add the include='all' argument.
The table then gains the fields unique (how many unique values the column has), top (the most frequent value), and freq (how many times the most frequent value occurs).
Our dataframe has no missing values (NA), but real data often does: they "pollute" the data and get in the way. Missing values can be removed with the dropna method.
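The corresponding calls:

```python
df.info()                   # rows, columns, dtypes, non-null counts, memory usage
df.shape                    # (number of rows, number of columns)
df.describe()               # statistics for the numeric columns
df.describe(include='all')  # adds unique, top and freq for the other columns
df = df.dropna()            # drop rows with missing values (none in this dataset)
```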
3. Filtering a dataframe by column names and indexes
A particular column can be selected in two ways: with square brackets or with dot notation.
You can also print several columns at once by wrapping them in double square brackets.
There are also the loc and iloc methods for selecting columns and rows: loc selects by name (label), iloc by index (ordinal position).
If we only need rows 100 through 110, we state that condition on the left.
If we want to select columns and rows not by name but by number, we use the iloc method. This comes in handy, for example, when the column names are too long.
Important: loc includes every number specified in the condition, which is why we see row 110, whereas iloc excludes the right endpoint (like standard Python slices), so the last row we see is 109. To print rows 100:110 we would have to write df.iloc[100:111, 0:3].
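A sketch of these selections; the column names (location, sex, year, val) follow the usual layout of the IHME export and may need adjusting for your file:

```python
df['location']                   # one column, with square brackets...
df.location                      # ...or the same column with dot notation
df[['location', 'year', 'val']]  # several columns: double square brackets

df.loc[100:110, ['location', 'sex', 'year']]  # rows 100-110 by label, columns by name
df.iloc[100:110, 0:3]                         # rows 100-109, first three columns by position
```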
4. Filtering a dataframe by conditions
Selecting data by condition is very convenient in Pandas. For example, suppose we only need the mortality figures for Both sexes.
Or by three conditions: both sexes, the year 2019 only, cardiovascular diseases only.
Or by four: both sexes, only 2019, only cardiovascular diseases, only more than 600 deaths per 100,000 inhabitants.
We now see the countries with the highest mortality from cardiovascular diseases. Let's save the result to a separate dataframe and call it cardio.
And let's sort it by the cardiovascular mortality rate. It turns out that deaths from heart problems are most common in Bulgaria.
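A sketch with the same assumed column names; the cause label 'Cardiovascular diseases' matches the GBD level 2 naming:

```python
df[df['sex'] == 'Both']  # one condition

df[(df['sex'] == 'Both') &
   (df['year'] == 2019) &
   (df['cause'] == 'Cardiovascular diseases')]  # three conditions

cardio = df[(df['sex'] == 'Both') &
            (df['year'] == 2019) &
            (df['cause'] == 'Cardiovascular diseases') &
            (df['val'] > 600)]  # four conditions, saved to a new dataframe

cardio.sort_values('val', ascending=False)  # highest cardiovascular mortality first
```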
We can also require that either one condition or another holds, using the "or" operator (|).
If we don't know exactly what a cause of death is called, or the name is too long, the contains method is handy. str.contains('HIV') returns the rows that mention HIV.
If, on the contrary, we need the rows that do NOT mention HIV, we can either add == False at the end or put the ~ sign in front of the condition.
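For example ('Neoplasms' is just an illustrative second cause):

```python
# Either of two conditions, with the "or" operator |
df[(df['cause'] == 'Cardiovascular diseases') | (df['cause'] == 'Neoplasms')]

# Rows whose cause mentions HIV
df[df['cause'].str.contains('HIV')]

# Rows that do NOT mention HIV: compare with False, or negate with ~
df[df['cause'].str.contains('HIV') == False]
df[~df['cause'].str.contains('HIV')]
```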
5. Deleting and adding columns, renaming
Notice that the values in the measure, metric, and age columns are all the same. We also won't be working with the lower and upper columns (the lower and upper bounds of the mortality estimate), so they can be removed. For this we use the drop method.
To remove several columns at once, wrap them in square brackets.
The axis=1 argument indicates that it is columns that should be dropped (1 for columns, 0 for rows).
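For example:

```python
df = df.drop('upper', axis=1)                                # drop a single column
df = df.drop(['measure', 'metric', 'age', 'lower'], axis=1)  # drop several at once
```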
Now let's learn how to add columns. For example, let's add a column that stores the val values rounded to one decimal place.
Renaming columns is just as simple. Let's rename val to value, for example.
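For example:

```python
df['val_rounded'] = df['val'].round(1)    # new column: val rounded to one decimal place
df = df.rename(columns={'val': 'value'})  # rename val to value
```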
6. Saving the dataframe
A dataframe can be saved to csv or xlsx and worked with further in any other program. The file name should be unique; otherwise, if a file with that name already exists on your computer, it will be overwritten.
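For example (the file names here are placeholders):

```python
df.to_csv('mortality_clean.csv', index=False)
df.to_excel('mortality_clean.xlsx', index=False)  # requires the openpyxl package
```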
That covers the essential minimum for working in Pandas. The notebook for this lesson can be downloaded here, and you can ask questions and chat in the «Мастерская» chat on Telegram.
Build Your Own Custom Dataset using Python
How I constructed thousands of rows of data points from scratch
Data is the foundation of data science and machine learning. Thousands upon thousands of data points are needed in order to analyze, visualize, draw conclusions, and build ML models. In some situations, data is readily available and free to download. Other times, the data is nowhere to be found.
In situations where data is not readily available but needed, you’ll have to resort to building up the data yourself. There are many methods you can use to acquire this data, from web scraping to APIs. But sometimes, you’ll end up needing to create fake or “dummy” data. Dummy data can be useful when you know the exact features you’ll be using and the data types involved, but you just don’t have the data itself.
Here, I’ll show you how I created 100,000 rows of dummy data. This data is also not purely random. If it was, building it would’ve been much easier. Sometimes when creating dummy data from scratch, you’ll need to develop tendencies or patterns within the data that might reflect real world behavior. You’ll see what I mean later on.
The Need to Construct a Dataset
Let’s say you are building an app from the ground up and need to establish a large user base for testing, for example. You are provided with a list of features and their respective data types.
This user base also needs to somewhat accurately reflect real world users and tendencies so it can’t be completely random. For example, you don’t want a user to have a college degree who is also 10 years old. Or, you might not want an overrepresentation of a specific data point, such as more males than females. These are all points of consideration to be aware of as you create your dataset.
Since real world data is rarely, truly random, a dataset such as this would be a great simulation for other datasets you’ll handle in the future. It could also serve as a testing ground for any machine learning models you wish to train.
Now let’s get started, feel free to follow along, and I’ll show you how I built this dataset…
Constructing the Dataset
To code along, start by importing the following libraries:
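The original import list isn’t shown here, but based on the libraries used throughout the article it was roughly:

```python
import random
import uuid
from datetime import datetime, timedelta

import pandas as pd
from faker import Faker
```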
The dataset size will be 100,000 data points (you can do more but it may take longer to process). I assigned this amount to a constant variable, which I used throughout:
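Something like the following; the variable name is my own choice:

```python
num_users = 100_000  # number of rows to generate
```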
Features
I picked 10 features I expected to be the most common in a regular dataset of users. These features and respective data types are:
- ID — a unique string of characters to identify each user.
- Gender — string data type of three choices.
- Subscriber — a binary True/False choice of their subscription status.
- Name — string data type of the first and last name of the user.
- Email — string data type of the email address of the user.
- Last Login — string data type of the last login time.
- Date of Birth — string format of year-month-day.
- Education — current education level as a string data type.
- Bio — short string descriptions of random words.
- Rating — integer type of a 1 through 5 rating of something.
I inputted the above as a list of features to initialize a Pandas dataframe:
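A sketch of that initialization; the exact column names are assumptions based on how the attributes are referred to later:

```python
features = [
    "id", "gender", "subscriber", "name", "email",
    "last_login", "dob", "education", "bio", "rating",
]
df = pd.DataFrame(columns=features)
```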
Creating Imbalanced Data
Some attributes above should normally contain imbalanced data. With some quick research, it can be safely assumed that certain choices will not be equally represented. For a more realistic dataset, these tendencies need to be reflected.
For the ID attribute, I used the uuid library to generate a random string of characters 100,000 times. Then, I assigned it to the ID attribute in the dataframe.
UUID is a great library to generate unique IDs for each user because of its astronomically low chance of duplicating an ID. It is a great go-to when it comes to generating unique identifying sets of characters. But, if you want to be sure that no IDs were repeated, then you can do a simple check on the dataframe with the following:
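A sketch of both steps, assuming the df and num_users defined above:

```python
df['id'] = [uuid.uuid4().hex for _ in range(num_users)]

# Sanity check: prints True if every generated ID is unique
print(df['id'].nunique() == num_users)
```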
This will return True if all the IDs in the dataset are unique.
Gender
This attribute is one of the instances where an equally weighted random choice should probably not be used, because each option is not equally likely to occur.
For gender, I provided three options: male, female, and na. However, if I were to use Python’s random library naively, each choice might be equally represented in the dataset. In reality, there would be significantly more male and female than na choices. So I decided to reflect that imbalance in the data:
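A sketch of that call; the exact weights are a guess that puts "na" at roughly 6%:

```python
genders = ["male", "female", "na"]
df['gender'] = random.choices(
    genders,
    weights=(47, 47, 6),  # "na" comes up about 6% of the time
    k=num_users,
)
```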
By using the random library, I provided the choices() function with the list of gender options, weights for each choice, and how many choices to make, represented by “k”. The result was then assigned to the dataframe’s “gender” attribute. The imbalance I described before is captured in the weights, with the “na” choice appearing about 6% of the time.
Subscriber
For this attribute, the choices can be randomly selected between True and False, since it can reasonably be expected that about half of the users would be subscribers.
Just like “Gender” before, I used random.choices() but without weights because this attribute can be randomly split between the two choices.
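For example:

```python
df['subscriber'] = random.choices([True, False], k=num_users)  # no weights: ~50/50 split
```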
Name
Here I used the Faker library to create thousands of names for all these users. The Faker library is great in this situation because it has an option for male and female names. In order to process the gendered names, I created a function to assign names based on a given gender.
I used my simple function to quickly produce a list of names based on the “Gender” attribute data before and assigned it to the dataframe.
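A sketch of that helper, using Faker's gendered name providers:

```python
faker = Faker()

def make_name(gender: str) -> str:
    """Return a fake name matching the given gender where possible."""
    if gender == "male":
        return faker.name_male()
    if gender == "female":
        return faker.name_female()
    return faker.name()

df['name'] = [make_name(g) for g in df['gender']]
```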
Email
Email turned out to be one of the trickier attributes. I wanted to create email addresses related to the names generated. However, there would probably be a chance of duplication because people can share the same name but not the same email.
First, I created a new function that would format names into email addresses with a default domain name. It would also handle duplicate addresses by simply adding a random number at the end of the formatted name:
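A sketch of such a function; the domain name is a placeholder, not a real provider:

```python
def make_email(name: str, duplicate_found: bool = False) -> str:
    """Turn 'First Last' into an email address at a dummy domain.

    If the address already exists, append a random number to make it unique.
    """
    domain = "fakemail.com"  # placeholder domain
    username = name.strip().lower().replace(" ", ".")
    if duplicate_found:
        username += str(random.randint(0, 100))
    return f"{username}@{domain}"
```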
To put this function to use, I created a loop that iterates through the “Name” attribute and reruns the function whenever a duplicate appears, so it keeps regenerating until a unique email address is produced.
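One way to write that loop (a set tracks the addresses seen so far, to keep the membership check fast):

```python
emails = []
seen = set()
for name in df['name']:
    email = make_email(name)
    while email in seen:  # regenerate until the address is unique
        email = make_email(name, duplicate_found=True)
    seen.add(email)
    emails.append(email)

df['email'] = emails
```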
After all emails have been generated, I assigned them to the dataframe’s “Email” attribute. You can also do an optional check to see that each email is unique with the same method as IDs.
Last Login
This attribute requires specific formatting, which the datetime library makes easier. Here I wanted users to have a history of logging in within the past month or so. I used another custom function to help:
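A sketch of that helper; the one-month window and the timestamp format are assumptions:

```python
def random_timestamps(n, start, end, fmt="%Y-%m-%d %H:%M:%S"):
    """Return n random, formatted timestamps between start and end."""
    span = (end - start).total_seconds()
    return [
        (start + timedelta(seconds=random.random() * span)).strftime(fmt)
        for _ in range(n)
    ]

now = datetime.now()
df['last_login'] = random_timestamps(num_users, now - timedelta(days=30), now)
```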
The function generates random timestamps between two given times; the resulting list of login times is then assigned to the dataframe.
Date of Birth
This attribute is an easy one, since it is similar to “Last Login”. All I did was change the time format by removing the hour, minutes, and seconds. I used datetime again to help randomly choose a date for each user, but this time the date range runs from 1980 to 2006 to get a nice spread of ages. The code below is largely the same as before but with a different format and date range:
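Reusing the helper from "Last Login", just with a date-only format and the 1980 to 2006 range:

```python
df['dob'] = random_timestamps(
    num_users,
    datetime(1980, 1, 1),
    datetime(2006, 12, 31),
    fmt="%Y-%m-%d",  # year-month-day, no time of day
)
```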
Education
The “education” attribute is dependent on “dob”. In these instances, the level of education is based on the user’s age and not the highest level of education they attained. This is another one of those attributes where randomly picking a level of education would not reflect real world tendencies.
I created another simple function that checks a user’s age based on today’s date and returns their current education level.
After generating a list of education levels, I then assigned it to the dataframe.
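A sketch of that logic; the age brackets below are illustrative guesses, not the article's exact rules:

```python
def education_level(dob: str) -> str:
    """Map a user's age (computed from their date of birth) to an education level."""
    age = (datetime.now() - datetime.strptime(dob, "%Y-%m-%d")).days // 365
    if age < 18:
        return "High School"
    if age < 22:
        return "Undergraduate"
    return "Graduate"

df['education'] = [education_level(d) for d in df['dob']]
```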
Bio
For this attribute, I wanted to vary the length of the bio depending on the user’s subscription status. If a user was a subscriber, then I would assume that their bio would be longer than a non-subscriber’s.
To accommodate this aspect, I built a function that checks their subscription status and returns a random sentence from Faker that varies in length.
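A sketch of that function; the word counts are arbitrary choices:

```python
def make_bio(subscriber: bool) -> str:
    """Longer bios for subscribers, shorter ones for everyone else."""
    n_words = random.randint(20, 40) if subscriber else random.randint(5, 15)
    return faker.sentence(nb_words=n_words)

df['bio'] = [make_bio(s) for s in df['subscriber']]
```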
In the above function, I randomly chose the length of the fake sentences based on the subscription status. If they were a subscriber then their bios leaned towards being longer than usual and vice-versa.
Rating
To round out this dataset, I wanted to include a numerical data type, so I chose “rating” to be an integer. The 1-to-5 rating could represent anything; it’s just there for discretionary purposes.
For the ratings themselves, I chose to skew the distribution of 1 to 5 towards the extremes, reflecting the tendency of users to be more absolute with their ratings.
I used the random.choices() function once again but with the weights skewed towards 1 and 5.
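For example, with weights that are a rough guess at that skew:

```python
df['rating'] = random.choices(
    [1, 2, 3, 4, 5],
    weights=(30, 10, 10, 10, 40),  # skewed towards the extremes
    k=num_users,
)
```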
Saving the Dataset
Now that the data is complete, if you were coding along, feel free to view the dataframe before you decide to save it. If it all looks good, then save the dataframe as a .csv file with this simple command:
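For example (the file name is a placeholder):

```python
df.to_csv('dataset.csv', index=False)
```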
This saves the dataset as a fairly large CSV file in your local directory. If you want to check on your saved dataset, use this command to view it:
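For example:

```python
pd.read_csv('dataset.csv').head()
```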
Everything should look good, and now, if you wish, you can perform some basic data visualization. If you want to learn more about data visualization, check out my following article:
Why I Use Plotly for Data Visualization
Closing Thoughts
I hope you learned something new from my experience building this dataset. If anything, building it provided a bit of practice in creating Python functions and algorithms. Also, I believe building your own dataset can give you a better understanding of datasets in general, and it can perhaps prepare you to handle whatever large amounts of data come your way.
If you wish, you can improve upon it by adding more restrictions or dependencies to the data distribution or even add more attributes to the dataframe. Feel free to explore the data or expand upon it!