How to remove duplicate lines from a file in Linux

The Linux uniq command

In Linux, the uniq command filters out repeated adjacent lines in a file; that is, it helps detect such lines and remove them.

Using the uniq command in Linux, with examples

The utility writes the filtered data to standard output or, if one is named, to an output file.

Syntax

Options

A field is a run of blank characters (usually spaces and/or TABs) followed by non-blank characters. Fields are skipped first, then characters.
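
A small sketch of how field skipping works, assuming a made-up log file whose first field is a timestamp (the file and its contents are hypothetical). With -f 1 the first field is skipped, so lines that differ only in the timestamp count as repeats:

```shell
# Hypothetical sample: the first field (a time) differs, the rest repeats.
log=$(mktemp)
printf '%s\n' '10:01 disk full' '10:02 disk full' '10:03 link down' > "$log"

# -f 1 ignores the first field when comparing neighbouring lines;
# the first line of each group is the one that is printed.
skipped=$(uniq -f 1 "$log")
printf '%s\n' "$skipped"
```

The -s N option works the same way for characters and, as noted above, is applied after any fields have been skipped.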

Examples of using the uniq command in Linux

A simple example

Contents of the source file:

The file above contains repeated lines. Use the uniq command to remove them:

uniq neither sorts the lines nor modifies the source file: it prints the input to standard output with adjacent repeats collapsed. To keep the result, name an output file. Repeat the command, this time writing the output to usa-states-example2.txt:

Contents of the source file:

Run the Linux uniq command, writing the result to the output file:
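
A minimal sketch of the two invocations, using a hypothetical five-line sample in place of the original file, whose contents are not reproduced here. The second operand to uniq names the output file:

```shell
# Hypothetical stand-in for usa-states-example.txt, with adjacent repeats.
dir=$(mktemp -d)
printf '%s\n' Alabama Alabama Alaska Alaska Arizona > "$dir/usa-states-example.txt"

# Without an output operand the result only goes to standard output.
uniq "$dir/usa-states-example.txt"

# With one, the result is written to the named file; the input is untouched.
uniq "$dir/usa-states-example.txt" "$dir/usa-states-example2.txt"
cat "$dir/usa-states-example2.txt"
```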

The Linux uniq command does not detect repeated lines unless they are adjacent. You can sort the input first, or use sort -u without uniq.
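
To illustrate, here is a sketch with a hypothetical file in which the repeats are not adjacent. Keep in mind that sorting changes the order of the lines:

```shell
# Hypothetical sample: "Alaska" repeats, but not on neighbouring lines.
states=$(mktemp)
printf '%s\n' Alaska Arizona Alaska > "$states"

# uniq alone compares only neighbouring lines, so all three lines survive.
plain=$(uniq "$states")

# sort -u sorts and drops repeats in one step.
sorted=$(sort -u "$states")

printf 'uniq alone:\n%s\n\nsort -u:\n%s\n' "$plain" "$sorted"
```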

Counting how many times a line repeats

The -c option is used for this:

Each line is prefixed with the number of times it occurs.
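
A sketch of -c on a hypothetical sample. The width of the count column differs between uniq implementations, so the example relies only on the count and the line text:

```shell
# Hypothetical sample with adjacent repeats.
states=$(mktemp)
printf '%s\n' Alabama Alabama Alabama Alaska Arizona Arizona > "$states"

# Each output line is "<count> <line>", with the count padded on the left.
counted=$(uniq -c "$states")
printf '%s\n' "$counted"

# A common follow-up: order the groups by how often they occur.
by_count=$(uniq -c "$states" | sort -rn)
printf '%s\n' "$by_count"
```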

Printing repeated lines

To print only the repeated lines, use the -D option, which prints every copy (the lower-case -d prints one line per group):
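
A sketch contrasting the two options on a hypothetical sample. -D as used here is the GNU coreutils form; other implementations may not accept it:

```shell
# Hypothetical sample: two repeated groups and one single line.
states=$(mktemp)
printf '%s\n' Alabama Alabama Alaska Arizona Arizona > "$states"

# -d prints one line for each group of repeated lines.
one_per_group=$(uniq -d "$states")

# -D (GNU) prints every line that belongs to such a group.
all_copies=$(uniq -D "$states")

printf 'with -d:\n%s\n\nwith -D:\n%s\n' "$one_per_group" "$all_copies"
```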

Printing unique lines

To print only the lines that have no duplicates, use the -u option:

The output is empty because every line in the file is duplicated.
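
A sketch of that behaviour on a hypothetical sample: while every line has a copy, -u prints nothing; once a line without a copy is added, it is the only one reported.

```shell
# Hypothetical sample in which every line has a neighbouring copy.
all_dup=$(mktemp)
printf '%s\n' Alabama Alabama Alaska Alaska > "$all_dup"
only_unique=$(uniq -u "$all_dup")     # empty: nothing occurs exactly once

# Add one line without a copy and it is the only line reported.
printf '%s\n' Arizona >> "$all_dup"
with_single=$(uniq -u "$all_dup")
printf 'before: "%s"\nafter: "%s"\n' "$only_unique" "$with_single"
```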

Case insensitivity

The -i option makes the comparison case-insensitive:
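
A sketch with a hypothetical sample in which the same name is spelled three ways:

```shell
# Hypothetical sample: one name in three spellings, then a different name.
mixed=$(mktemp)
printf '%s\n' ALABAMA Alabama alabama Alaska > "$mixed"

# Without -i the three spellings are three different lines.
case_sensitive=$(uniq "$mixed")

# With -i they form one group; the first spelling seen is the one kept.
case_insensitive=$(uniq -i "$mixed")
printf '%s\n' "$case_insensitive"
```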

Conclusion

This short article covered the basic ways of using the uniq command in Linux. As the examples show, the utility is very useful for anyone who needs to filter information out of large files quickly but has no time to proofread them by hand.

How to remove duplicate lines inside a text file?

A huge (up to 2 GiB) text file of mine contains about 100 exact duplicates of every line in it (useless in my case, as the file is a CSV-like data table).

What I need is to remove all the repetitions while (preferably, but this can be sacrificed for a significant performance boost) maintaining the original sequence order. In the result each line is to be unique. If there were 100 equal lines (usually the duplicates are spread across the file and won’t be neighbours) there is to be only one of the kind left.

I have written a program in Scala (consider it Java if you don’t know about Scala) to implement this. But maybe there are native tools written in C that can do this faster?

UPDATE: the awk '!seen[$0]++' filename solution seemed to work just fine for me as long as the files were near 2 GiB or smaller, but now that I have to clean up an 8 GiB file it doesn’t work any more. It seems to take forever on a Mac with 4 GiB RAM, and a 64-bit Windows 7 PC with 4 GiB RAM and 6 GiB swap just runs out of memory. And I don’t feel enthusiastic about trying it on Linux with 4 GiB RAM given this experience.

An awk solution seen on #bash (Freenode):
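
The snippet itself is missing from this copy of the answer. Judging by the update in the question, it is most likely the one-liner below; the sample file is hypothetical:

```shell
# Hypothetical input with non-adjacent repeats.
input=$(mktemp)
printf '%s\n' b a b c a > "$input"

# Print a line only the first time it is seen; the original order is kept.
deduped=$(awk '!seen[$0]++' "$input")
printf '%s\n' "$deduped"
```

Because every distinct line is kept in memory, this fits files with many repeats better than files made up mostly of distinct lines.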

If you want to edit the file in-place, you can use the following command (provided that you use a GNU awk version that implements this extension):
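
The command is missing here as well. In GNU awk 4.1 and later the extension in question is -i inplace. Below is a sketch of that form plus a portable fallback that goes through a temporary file; dedup_in_place is a name made up for this example:

```shell
# GNU awk only (assumes gawk 4.1 or later):
#   gawk -i inplace '!seen[$0]++' file

# Portable fallback. dedup_in_place is a hypothetical helper name.
dedup_in_place() {
    tmp=$(mktemp) || return 1
    # Copy back with cat so the original file keeps its permissions.
    awk '!seen[$0]++' "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}

data=$(mktemp)
printf '%s\n' b a b c a > "$data"
dedup_in_place "$data"
cat "$data"
```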

There’s a simple (which is not to say obvious) method using standard utilities which doesn’t require a large memory except to run sort , which in most implementations has specific optimizations for huge files (a good external sort algorithm). An advantage of this method is that it only loops over all the lines inside special-purpose utilities, never inside interpreted languages.
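
The command itself is missing from this copy. One pipeline that fits the description (number the lines, drop repeats while ignoring the numbers, restore the original order, strip the numbers) is sketched below. It assumes a sort whose -u keeps the first line of each group of equal keys, as GNU sort does; the file names and data are hypothetical:

```shell
# Hypothetical CSV-like input; the lines themselves contain ":".
input=$(mktemp)
output=$(mktemp)
printf '%s\n' 'b:1' 'a:2' 'b:1' 'c:3' 'a:2' > "$input"

# LC_ALL=C keeps the comparison byte-wise and predictable.
nl -b a -s : < "$input" |        # number every line, ":" after the number
  LC_ALL=C sort -t : -k 2 -u |   # drop repeats, ignoring the line number
  LC_ALL=C sort -t : -k 1n |     # back to the original order
  cut -d : -f 2- > "$output"     # strip the line numbers
cat "$output"
```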

If all lines begin with a non-whitespace character, you can dispense with some of the options:
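
Again the command is missing here. A shorter pipeline in the same spirit, relying on nl's default TAB separator and, as stated above, on every line starting with a non-whitespace character, might look like this (sample data is hypothetical, and GNU sort behaviour for -u is assumed):

```shell
# Hypothetical input: non-empty lines that start with a letter.
input=$(mktemp)
printf '%s\n' 'beta 1' 'alpha 2' 'beta 1' 'gamma 3' > "$input"

# nl numbers the lines ("<number><TAB><line>"), sort -k 2 -u drops repeats
# by everything after the number, sort -k 1n restores the order, and cut
# removes the number again (TAB is cut's default delimiter).
short=$(nl < "$input" | LC_ALL=C sort -k 2 -u | LC_ALL=C sort -k 1n | cut -f 2-)
printf '%s\n' "$short"
```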

For a large amount of duplication, a method that only requires storing a single copy of each line in memory will perform better. With some interpretation overhead, there’s a very concise awk script for that (already posted by enzotib): awk '!seen[$0]++'

Removing duplicate lines from a file with Linux tools

How do you remove duplicate lines from a text file using Linux tools? It is not difficult: the standard programs sort and uniq are all you need.

Suppose we have a text file garbage.txt with the following contents:

The standard utilities sort and uniq let us sort the lines and keep only the unique ones:
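
The original listing is not reproduced here, so the sketch below uses made-up contents for garbage.txt; clean.txt is likewise a hypothetical output name:

```shell
# Hypothetical contents for garbage.txt.
dir=$(mktemp -d)
printf '%s\n' pear apple pear plum apple > "$dir/garbage.txt"

# sort brings identical lines together; uniq then collapses each group.
sort "$dir/garbage.txt" | uniq > "$dir/clean.txt"
cat "$dir/clean.txt"
```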

How to delete duplicate lines in a file without sorting it in Unix

Is there a way to delete duplicate lines in a file in Unix?

I can do it with sort -u and uniq commands, but I want to use sed or awk .

Is that possible?

seen is an associative array that AWK indexes with every line of the file ($0). If a line isn’t in the array then seen[$0] will evaluate to false. The ! is the logical NOT operator and will invert the false to true. AWK will print the lines where the expression evaluates to true.

The ++ increments seen so that seen[$0] == 1 after the first time a line is found and then seen[$0] == 2 , and so on. AWK evaluates everything but 0 and "" (empty string) to true. If a duplicate line is placed in seen then !seen[$0] will evaluate to false and the line will not be written to the output.
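
Spelled out as an explicit rule, the same logic might read as follows. This is only an illustrative expansion of awk '!seen[$0]++', not a different algorithm, and the sample input is hypothetical:

```shell
# Hypothetical input with non-adjacent repeats.
input=$(mktemp)
printf '%s\n' one two one three two one > "$input"

expanded=$(awk '{
    if (seen[$0] == 0) {   # an unset element compares equal to 0: first sighting
        print
    }
    seen[$0]++             # count this occurrence
}' "$input")

terse=$(awk '!seen[$0]++' "$input")
printf '%s\n' "$expanded"
```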

sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P' means, roughly, "Append the whole hold space to this line, then if you see a duplicated line throw the whole thing out, otherwise copy the whole mess back into the hold space and print the first part (which is the line you just read)."

[ -~] represents a range of ASCII characters from 0x20 (space) to 0x7E (tilde). These are considered the printable ASCII characters (linked page also has 0x7F/delete but that doesn't seem right). That makes the solution broken for anyone not using ASCII or anyone using, say, tab characters. The more portable [^\n] includes a whole lot more characters: all of 'em except one, in fact.
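
A sketch of that command on a small, hypothetical ASCII-only input, assuming GNU sed (where . also matches a newline inside the pattern space). The whole list of lines seen so far lives in the hold space, so this suits small inputs:

```shell
# Hypothetical input with non-adjacent repeats.
input=$(mktemp)
printf '%s\n' b a b c a > "$input"

# G             append the hold space (lines seen so far) to the current line
# s/\n/&&/      double the first newline so the current line stands apart
# /.../d        current line already in the list: drop it
# s/\n//; h; P  otherwise undo the doubling, save the list, print the line
sed_out=$(LC_ALL=C sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P' "$input")
printf '%s\n' "$sed_out"
```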

Perl one-liner similar to jonas’s AWK solution:

This variation removes trailing white space before comparing:

This variation edits the file in-place:

This variation edits the file in-place, and makes a backup file.bak :

An alternative way using Vim (Vi compatible):

Delete duplicate, consecutive lines from a file:

vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq

Delete duplicate, nonconsecutive and nonempty lines from a file:

The one-liner that Andre Miller posted works except for recent versions of sed when the input file ends with a blank line and no characters. On my Mac my CPU just spins.

This is an infinite loop if the last line is blank and doesn’t have any characters:

It doesn’t hang, but you lose the last line:

The explanation is at the very end of the sed FAQ:

The GNU sed maintainer felt that despite the portability problems
this would cause, changing the N command to print (rather than
delete) the pattern space was more consistent with one’s intuitions
about how a command to "append the Next line" ought to behave.
Another fact favoring the change was that "{N;command;}" will
delete the last line if the file has an odd number of lines, but
print the last line if the file has an even number of lines.

To convert scripts which used the former behavior of N (deleting
the pattern space upon reaching the EOF) to scripts compatible with
all versions of sed, change a lone "N;" to "$d;N;".
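
For reference, the consecutive-duplicates one-liner under discussion is presumably the well-known sed1line form shown below (the answer's own snippets are missing from this copy). This is a sketch on hypothetical input that does not end in a blank line, so the end-of-file behaviour of N does not come into play:

```shell
# Hypothetical input with consecutive repeats and no trailing blank line.
input=$(mktemp)
printf '%s\n' a a b b b c > "$input"

# $!N               on every line but the last, append the next line
# /^\(.*\)\n\1$/!P  print the first line unless both lines are identical
# D                 drop the first line and restart with what is left
sed_uniq=$(sed '$!N; /^\(.*\)\n\1$/!P; D' "$input")
printf '%s\n' "$sed_uniq"
```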