How to merge multiple CSV files into one in Python

How do I merge several CSV files into one?

I have several CSV files that I want to merge into one.

At first I used an answer that suggests the following:

The problem turned out to be that no line break is added after each file during merging, so the first line of the second file ends up joined to the last line of the first file. In addition, every CSV file has a header (the same for all of them), which should be removed from every file except the first.
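One possible way to address both problems is sketched below using only the standard library. The function name merge_csv_files and the demo file names are hypothetical, and the sketch assumes that every file is non-empty, shares the same header, and that the header fits on a single physical line.

```python
import itertools
import tempfile
from pathlib import Path


def merge_csv_files(paths, out_path, encoding="utf-8"):
    """Concatenate CSV files, keeping only the header of the first file."""
    with open(out_path, "w", encoding=encoding, newline="") as out:
        for index, path in enumerate(paths):
            with open(path, "r", encoding=encoding, newline="") as src:
                header = src.readline()
                # Keep the header line for the first file only.
                pending = [header] if index == 0 else []
                last_written = ""
                for line in itertools.chain(pending, src):
                    out.write(line)
                    last_written = line
                # Add the missing line break so the next file starts on a new line.
                if last_written and not last_written.endswith("\n"):
                    out.write("\n")


# Small demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.csv"
    second = Path(tmp) / "b.csv"
    first.write_text("id,name\n1,Ann", encoding="utf-8")  # no trailing newline
    second.write_text("id,name\n2,Bob\n", encoding="utf-8")
    merged = Path(tmp) / "merged.csv"
    merge_csv_files([first, second], merged)
    merged_text = merged.read_text(encoding="utf-8")

print(merged_text)
```

Because the files are copied line by line, this approach does not load whole files into memory, which may matter for large inputs.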

How to combine multiple CSV files using Python for your analysis

Stella Joshua

Often, as a data analyst, you may find yourself with multiple CSV files that need to be combined before you can even start your analysis. However, the files are not always extracted from the same data source, and they may not share the same columns or follow the same structure.

In this tutorial, you will learn how to combine multiple CSVs with either similar or varying column structure, and how to use the append(), concat(), merge() and combine_first() functions to do so.

Before we do that, let’s see how to import a single CSV file into a DataFrame using the pandas package.

1. Importing the File into pandas DataFrames:

To import a single file into a DataFrame, use the pd.read_csv() function.
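As a minimal sketch, the call could look like this. The column names and rows are placeholders, and an in-memory buffer stands in for a file on disk (pd.read_csv() accepts either a path or a file-like object).

```python
import io

import pandas as pd

# Placeholder data standing in for a CSV file on disk.
sample = io.StringIO("Name,Email\nAnn,ann@example.com\nBob,bob@example.com\n")

# With a real file you would pass its path instead, e.g. pd.read_csv("some_file.csv").
df = pd.read_csv(sample)
print(df)
```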

When you have multiple files to work with, a convenient approach is to place them all in a single directory and then read each one with pd.read_csv().

2. Setting up the working directory:

One method is to pass the path of the directory into a variable and then list all the files in that directory.

Alternatively, if you want to read files from the same directory as your .ipynb notebook, you can use the current working directory.

The third method is to use the glob.glob() function to list only the CSV files in the working directory.
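The three methods might look roughly like this. The temporary directory and the file notes.txt are assumptions made only so the sketch runs on its own; in practice data_dir would be your own folder.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as data_dir:
    # Create a few throwaway files so there is something to list.
    for name in ("csv_sample1.csv", "csv_sample2.csv", "notes.txt"):
        with open(os.path.join(data_dir, name), "w") as handle:
            handle.write("col\n1\n")

    # Method 1: store the directory path in a variable and list everything in it.
    all_files = sorted(os.listdir(data_dir))

    # Method 2: use the current working directory (where the notebook runs).
    working_dir = os.getcwd()

    # Method 3: use glob to keep only the CSV files.
    csv_paths = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
    csv_names = [os.path.basename(path) for path in csv_paths]

print(all_files)
print(csv_names)
```

Note that os.listdir() returns every entry in the directory, so the glob pattern is the simpler choice when the folder also holds non-CSV files.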

To read multiple CSV files with a similar table structure, you can use the pandas.DataFrame.append() or pd.concat() functions.

Let’s look at the 3 sample CSV files we’ll be working with.

csv_sample1.csv

csv_sample2.csv

csv_sample3.csv

All three files have the same column headers, except that csv_sample2.csv has an additional column named “Birthdate”. Also, note that two entries are common to csv_sample1.csv and csv_sample2.csv. The entry for “Tom R. Powell” has different “Joined Date” values in the two files. Note how these entries get combined in each of the methods used below.

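3. Combining multiple files with a similar table structure using pandas.DataFrame.append()

Use the code below to read and combine all the CSV files from the directory set earlier. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions use pd.concat() (section 4) instead.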

The output after using the append() function is as below.

Here, you can see that all the data rows from the files have been appended one below the other. However, NaN values have been inserted in the “Birthdate” column as these values are not present in csv_sample1.csv and csv_sample3.csv files.

4. Combining multiple files with a similar table structure using pandas.concat()

Another way to combine the files is to use pandas.concat(), as shown below.
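A minimal sketch of the concat() approach follows. The rows are invented placeholders, not the contents of the real sample files; only the extra “Birthdate” column in the second file mirrors the description above.

```python
import io

import pandas as pd

# Placeholder stand-ins for csv_sample1.csv, csv_sample2.csv and csv_sample3.csv.
sample1 = io.StringIO("Name,Email\nAnn,ann@example.com\n")
sample2 = io.StringIO("Name,Email,Birthdate\nBob,bob@example.com,1990-01-01\n")
sample3 = io.StringIO("Name,Email\nCid,cid@example.com\n")

frames = [pd.read_csv(source) for source in (sample1, sample2, sample3)]

# Stack the rows; columns missing from a file are filled with NaN.
combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined)
```

Passing ignore_index=True renumbers the rows from zero instead of repeating each file's own index.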

Now, if you want to join data rows of the files based on related columns then you may use pandas.DataFrame.merge() function.

5. Using pandas.DataFrame.merge() to join the data rows

First read the files into separate dataframes as below.

Now, while using merge() between these dataframes, you need to specify the related columns on which you want to join the rows.

The function joined rows only where all the values of the specified columns matched.

Here, we have used the outer join method to merge the files. To learn more about the types of merge that can be performed, refer to the pandas.merge() documentation.

In the above example, we passed a list of column names on which we wanted to join the rows. Instead, if we join the rows only on the “Email” column then we would get an output as below.
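The difference between the two calls can be sketched as follows. The frames are small invented placeholders; only the idea of one person appearing in both with different “Joined Date” values is taken from the text.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Ann", "Tom"],
    "Email": ["ann@example.com", "tom@example.com"],
    "Joined Date": ["2020-01-01", "2020-02-01"],
})
df2 = pd.DataFrame({
    "Name": ["Tom", "Eve"],
    "Email": ["tom@example.com", "eve@example.com"],
    "Joined Date": ["2021-03-01", "2021-04-01"],
    "Birthdate": ["1985-05-05", "1992-09-09"],
})

# Join on several columns: Tom's rows differ in "Joined Date", so they stay separate.
on_all = pd.merge(df1, df2, how="outer", on=["Name", "Email", "Joined Date"])

# Join on "Email" only: Tom's rows are matched, and the other shared
# columns are kept twice with the suffixes _x (left) and _y (right).
on_email = pd.merge(df1, df2, how="outer", on="Email")

print(on_all)
print(on_email)
```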

Now, if you want to create a DataFrame with the values of, say, csv_sample1.csv and, wherever they are null, take the values from a different file such as csv_sample2.csv, use combine_first().

6. Updating null values in columns from other columns using pandas.DataFrame.combine_first()

Replace ‘_x’ from the column headers.

Pass all the column names on which you want to apply combine_first() . An easy way is to fetch columns with ‘_y’ in the headers and then remove ‘_y’ from them, as below.
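One way these two steps could be written is sketched below. The merged frame is an invented placeholder shaped like the output of a merge on “Email”, and Series.combine_first() is applied column by column.

```python
import pandas as pd

# Placeholder for the result of merging two files on "Email".
merged = pd.DataFrame({
    "Email": ["ann@example.com", "tom@example.com", "eve@example.com"],
    "Name_x": ["Ann", "Tom", None],
    "Name_y": [None, "Tom", "Eve"],
})

# Step 1: drop the "_x" suffix from the column headers.
merged = merged.rename(columns=lambda c: c[:-2] if c.endswith("_x") else c)

# Step 2: find the "_y" columns and fill nulls in the matching base column.
y_columns = [c for c in merged.columns if c.endswith("_y")]
for y_column in y_columns:
    base = y_column[:-2]
    merged[base] = merged[base].combine_first(merged[y_column])

# The "_y" columns are no longer needed once their values are folded in.
result = merged.drop(columns=y_columns)
print(result)
```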

How to merge multiple CSV files with Python

In this guide, I’ll show you several ways to merge multiple CSV files into a single one using Python (the same approach also works for text and other files). As a bonus, there are one-liners for merging multiple CSV files on Linux and Windows. Finally, with a few lines of code you will be able to combine hundreds of files with full control over the loaded data: you can load all the CSV files into a pandas DataFrame and mark each row with the CSV file it came from.

  • data_201901.csv
  • data_201902.csv
  • data_201903.csv

Steps to merge multiple (identical) CSV files with Python

Note: we assume that all files have the same columns and the same kind of information inside.

Short code example — concatenating all CSV files in Downloads folder:

Step 1: Import modules and set the working directory

First we will start with loading the required modules for the program and selecting working folder:

Step 2: Match CSV files by pattern

The next step is to collect all the files that need to be combined. This will be done by:

The pattern data_*.csv matches only files:

  • starting with data_
  • with file extension .csv

You can customize the selection for your needs, keeping in mind that glob uses shell-style wildcards (*, ?, [abc]) rather than regular expressions.

Step 3: Combine all files in the list and export as CSV

The final step is to load all selected files into a single DataFrame and convert it back to CSV if needed:

Note that you can change the separator with sep=',' and control which header and rows are loaded.

You can find more about converting DataFrame to CSV file here: pandas.DataFrame.to_csv

Full Code

Below you can find the full code which can be used for merging multiple CSV files.
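The three steps might be put together roughly as follows. The temporary working folder, the generated sample files and the output name merged.csv are assumptions added so the sketch runs on its own; with real data you would point workdir at your own folder.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as workdir:
    # Create three small sample files named like the ones listed earlier.
    for month, value in (("201901", 1), ("201902", 2), ("201903", 3)):
        sample = pd.DataFrame({"col": [value], "col2": [value * 10]})
        sample.to_csv(os.path.join(workdir, f"data_{month}.csv"), index=False)

    # Step 2: match the CSV files by pattern (sorted for a stable order).
    all_files = sorted(glob.glob(os.path.join(workdir, "data_*.csv")))

    # Step 3: load every file and stack the rows into one DataFrame.
    combined = pd.concat([pd.read_csv(path) for path in all_files],
                         ignore_index=True)

    # Write the result to a name the pattern does not match.
    out_path = os.path.join(workdir, "merged.csv")
    combined.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(combined)
```

Choosing an output name that the pattern does not match avoids picking up the merged file on a later run.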


Steps to merge multiple CSV(identical) files with Python with trace

Now let’s say that you want to merge multiple CSV files into a single DataFrame and also add a column that records which file each row came from. Something like:

row  col  col2  file
1    A    B     data_201901.csv
2    C    D     data_201902.csv

This can be achieved easily with a small change to the code above:

In this example we iterate over all selected files, extract each file name, and create a column that contains this name.
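A sketch of that loop is shown below. The temporary folder and the two generated files are assumptions made for the demonstration; the column name file follows the table above.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as workdir:
    # Two throwaway input files for the demonstration.
    pd.DataFrame({"col": ["A"], "col2": ["B"]}).to_csv(
        os.path.join(workdir, "data_201901.csv"), index=False)
    pd.DataFrame({"col": ["C"], "col2": ["D"]}).to_csv(
        os.path.join(workdir, "data_201902.csv"), index=False)

    frames = []
    for path in sorted(glob.glob(os.path.join(workdir, "data_*.csv"))):
        frame = pd.read_csv(path)
        # Record the source file name (without the folder) on every row.
        frame["file"] = os.path.basename(path)
        frames.append(frame)

    traced = pd.concat(frames, ignore_index=True)

print(traced)
```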

Combine multiple CSV files when the columns are different

Sometimes the CSV files differ in some columns, or they have the same columns but in a different order. This example shows how to combine CSV files that do not have an identical structure:

pd.concat aligns the data by column name. If a column is missing from a given CSV file, the rows from that file will contain NaN values in that column:

row  col  col2  col_201901  file
1    A    B     AA          data_201901.csv
2    C    D     NaN         data_201902.csv
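The alignment behaviour can be seen in a small sketch with in-memory frames. The values are placeholders modelled on the table above; the second frame lists its columns in a different order and lacks col_201901.

```python
import pandas as pd

first = pd.DataFrame({"col": ["A"], "col2": ["B"], "col_201901": ["AA"]})
# Same data columns in a different order, and one column missing.
second = pd.DataFrame({"col2": ["D"], "col": ["C"]})

# Columns are matched by name, not by position.
combined = pd.concat([first, second], ignore_index=True, sort=False)
print(combined)
```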

If you need to compare two csv files for differences with Python and Pandas you can check: Python Pandas Compare Two CSV files based on a Column

More about pandas concat: pandas.concat

Bonus: Merge multiple files with Windows/Linux

Linux

Sometimes the tools that ship with your OS are enough, especially for huge files, where concatenating with Python can be challenging. On Linux you can use:

In this case we are working in the current folder by matching all files starting with data_ . This is important because if you try to execute something like:

You will try to merge the newly output file as well which may cause issues. Another important note is that this will skip the first lines or headers of each file. In order to include headers you can do:

If the commands above do not work for you, you can try the next two. The first one merges all CSV files but has problems if a file ends without a newline:

The second one will merge the files and will add new line at the end of them:

How to Join Two CSV Files in Python Using Pandas? 3 Steps Only

Sometimes a dataset is not provided as a single CSV file but is split across several separate files or sheets. It is usually better to do all computational and preprocessing tasks on a single dataset than on several, because it saves time. If that is what you want to do, this post is for you. In this tutorial, you will learn how to join or merge two CSV files using the popular Python pandas library.

Step by Step: Merging Two CSV Files

Step 1: Import the Necessary Libraries

Everything here is done with the pandas library, so pandas is the only import.

Step 2: Load the Dataset

I have created two CSV datasets of stock data: one is a set of stocks and the other is the turnover of those stocks. Read them using the pandas read_csv() function. All the datasets are included in the Conclusion section.

dataset 1 csv

dataset 2 csv

Step 3: Merge the Sheets

To merge the two CSV files, use the DataFrame.merge() method and specify the column to merge on. By default merge() performs an inner join, so rows whose key value has no match in the other file are dropped from the result. You can verify this by comparing the shape attribute of the DataFrames before and after the merge. Use the following code.
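A minimal sketch of this step is shown here. The column names Symbol, Price and Turnover and all values are hypothetical stand-ins for the two stock datasets, which are not reproduced in the text.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the stocks file and the turnover file.
stocks = pd.read_csv(io.StringIO("Symbol,Price\nAAA,10\nBBB,20\nCCC,30\n"))
turnover = pd.read_csv(io.StringIO("Symbol,Turnover\nAAA,1000\nCCC,3000\n"))

# Inner join (the default): BBB has no turnover row, so it is dropped.
merged = stocks.merge(turnover, on="Symbol")

print(stocks.shape, turnover.shape, merged.shape)
```

If the unmatched rows should be kept, passing how="left" or how="outer" to merge() retains them and fills the missing values with NaN.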

Other Things you can Do

There is also the case where you only want to append the rows of one sheet to another. For this, use the pd.concat() function. Suppose I have two sheets from the same dataset and I want to work on a single sheet. Then I first have to add all the rows of one sheet to the other, and after that I can work with the combined dataset. Below is the code for appending the rows in a DataFrame.

data1 from the same dataset

data2 from the same dataset

Conclusion

Most data scientists do their analysis on a single sheet. When you search online for a dataset, you will mostly find it provided as a single sheet. You should do the same, since working on a single sheet is more efficient and reduces the amount of repeated preprocessing.

I hope you have understood how to Join Two CSV Files in Python Using Pandas. If you have any query please contact us for more information. Below is the dataset for all the examples taken here.

Other Questions

1. You are getting the error “Columns not found in either dataset: …”

You may get this error while joining two CSV files. It occurs when one of the two CSV files does not have the column you are merging on. To solve it, make sure the column exists in both CSV files before joining.

2. Getting “MergeError: No common columns to perform merge on”

If you get this error, pandas is telling you that the CSV files you want to join do not have any columns in common. To solve this, either merge on differently named columns or add a common column to the CSV files you want to merge.
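The second error and one possible fix can be reproduced with a small sketch. The frames and the column names Symbol and Ticker are hypothetical; the fix shown uses the left_on and right_on parameters of pd.merge() to join on differently named columns.

```python
import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame({"Symbol": ["AAA"], "Price": [10]})
right = pd.DataFrame({"Ticker": ["AAA"], "Turnover": [1000]})

# Check up front which column names the two frames share.
common = left.columns.intersection(right.columns)

# With no shared names and no "on" argument, merge() raises MergeError.
try:
    pd.merge(left, right)
    raised = False
except MergeError:
    raised = True

# Possible fix: name the key column of each frame explicitly.
fixed = pd.merge(left, right, left_on="Symbol", right_on="Ticker")
print(fixed)
```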
