Python Bytes to String – How to Convert a Bytestring
Shittu Olumide
In this article, you will learn how to convert a bytestring. I know the word bytestring might sound technical and difficult to understand. But trust me – we will break the process down and understand everything about bytestrings before writing the Python code that converts bytes to a string.
So let’s start by defining a bytestring.
What is a bytestring?
A bytestring is a sequence of bytes, a fundamental data type in computing. It is often displayed as a sequence of characters, with each character standing for one byte of data.
Bytes are often used to represent information that is not character-based, such as images, audio, video, or other types of binary data.
In Python, a bytestring is a bytes object: an immutable sequence of integers in the range 0 to 255. When it holds text, that text has been encoded with a character encoding such as UTF-8, ASCII, or Latin-1. It can be created with a b"..." literal or with the bytes() constructor (bytearray() creates the mutable variant), and you convert between strings and bytes with the str.encode() and bytes.decode() methods.
Note that in Python 3.x, bytestrings and strings are distinct data types, and cannot be used interchangeably without encoding or decoding.
This is because a string in Python 3.x is a sequence of Unicode characters, whereas in Python 2.x the str type was itself a byte string and implicit conversions assumed ASCII. So when working with bytestrings in Python 3.x, it’s important to be aware of the encoding used and to properly encode and decode data as needed.
How to Convert Bytes to a String in Python
Now that we have a basic understanding of what a bytestring is, let’s take a look at how we can convert bytes to a string using Python methods, constructors, and modules.
Using the decode() method
decode() is a method that you can use to convert bytes into a string. It is commonly used when working with text data that is encoded in a specific character encoding, such as UTF-8 or ASCII. It simply works by taking an encoded byte string as input and returning a decoded string.
The general form is byte_string.decode(encoding), where byte_string is the input byte string that we want to decode and encoding is the character encoding used by the byte string. If encoding is omitted, Python 3 uses UTF-8.
Here is some example code that demonstrates how to use the decode() method to convert a byte string to a string:
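A minimal sketch of such an example could look like the following. The variable names byte_string and decoded_string are chosen for illustration only.

```python
# Illustrative names; any bytes object works the same way.
byte_string = b"hello world"

# Decode the bytes as UTF-8 text to obtain a str object.
decoded_string = byte_string.decode("utf-8")

print(decoded_string)
```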
In this example, we define a byte string b"hello world" and convert it to a string using the decode() method with the UTF-8 character encoding. The resulting decoded string is "hello world", which is then printed to the console.
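Note that the decode() method also accepts an errors parameter, which controls how decoding errors are handled (for example strict, ignore, or replace). The final flag, which tells a decoder whether to expect more input, belongs to the incremental decoders in the codecs module rather than to bytes.decode().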
Using the str() constructor
You can use the str() constructor in Python to convert a byte string (bytes object) to a string object. This is useful when we are working with data that has been encoded in a byte string format, such as when reading data from a file or receiving data over a network socket.
The str() constructor takes the byte string that we want to convert as its first argument, followed by the encoding. The encoding (or errors) argument is required for a real conversion: calling str() on a bytes object without it does not decode anything and simply returns the printable representation, such as "b'Hello, world!'".
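As a small sketch of this approach, assuming the illustrative names byte_string and string:

```python
byte_string = b"Hello, world!"

# Passing an encoding makes str() decode the bytes instead of
# returning their printable representation.
string = str(byte_string, encoding="utf-8")

print(string)
```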
In this example, we define a byte string b"Hello, world!" and use the str() constructor to convert it to a string object. We specify the encoding format as utf-8 using the encoding parameter. Finally, we print the resulting string to the console.
Using the bytes() constructor
We can also use the bytes() constructor, a built-in Python function used to create a new bytes object. It accepts either an iterable of integers (each from 0 to 255) or a string together with an encoding, and returns a new bytes object that contains the corresponding bytes. Note that this goes from a string to bytes, so it is useful for round trips when we are working with binary data, or when converting between different types of data that use bytes as their underlying representation.
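A short sketch of this round trip, with illustrative variable names, might look like this:

```python
string = "Hello, world!"

# Encode the string into a bytes object.
byte_string = bytes(string, "utf-8")
print(byte_string)

# Decode the bytes back into a string with the same encoding.
decoded_string = byte_string.decode("utf-8")
print(decoded_string)

# bytes() also accepts an iterable of integers in the range 0..255.
from_ints = bytes([72, 105])
print(from_ints)
```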
In this example, we start by defining a string variable string . We then use the bytes() constructor to convert the string to a bytes object, passing in the string and the encoding ( utf-8 ) as arguments. We print the resulting bytes object to the console.
Next, we use the decode() method to convert the bytes object back to a string, passing in the same encoding ( utf-8 ) as before. We print the decoded string to the console as well.
Using the codecs module
The codecs module in Python provides a way to convert data between different encodings, such as between byte strings and Unicode strings. It contains a number of classes and functions that you can use to perform various encoding and decoding operations.
To convert Python bytes to a string, we can use the decode() function provided by the codecs module. This function takes the byte string that we want to decode as its first argument and the encoding as its second. If the encoding is omitted, utf-8 is used.
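For illustration, here is a minimal sketch. The sample bytes (the UTF-8 encoding of "café") and the names b_string and u_string are assumptions made for the example.

```python
import codecs

# UTF-8 bytes for "café"; the last character takes two bytes.
b_string = b"caf\xc3\xa9"

# Decode the byte string into a Unicode string.
u_string = codecs.decode(b_string, "utf-8")

print(u_string)
```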
In this example, we have a byte string b_string which contains some non-ASCII characters. We use the codecs.decode() function to convert this byte string to a Unicode string.
The first argument to this function is the byte string to be decoded, and the second argument is the encoding used in the byte string (in this case, it is utf-8 ). The resulting Unicode string is stored in u_string .
To convert a Unicode string to a byte string using the codecs module, we use the encode() function. Here is an example:
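A minimal sketch of the reverse direction, again with an assumed sample string:

```python
import codecs

u_string = "caf\u00e9"  # the text "café"

# Encode the Unicode string into UTF-8 bytes.
b_string = codecs.encode(u_string, "utf-8")

print(b_string)
```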
In this example, we have a Unicode string u_string . We use the codecs.encode() function to convert this Unicode string to a byte string. The first argument to this function is the Unicode string to be encoded, and the second argument is the encoding to use for the byte string (in this case, it is utf-8 ). The resulting byte string is stored in b_string .
Conclusion
Understanding bytestrings and string conversion is important because it is a fundamental aspect of working with text data in any programming language.
In Python, this is particularly relevant due to the increasing popularity of data science and natural language processing applications, which often involve working with large amounts of text data.
How to Convert Bytes to a String in Python 2 and 3
This tutorial explains how to convert bytes to strings in Python 2.x and Python 3.x.
Convert Bytes to a String in Python 2.x
In Python 2.7, bytes is identical to str , so a variable initialized as bytes is intrinsically a string.
Convert Bytes to a String in Python 3.x
bytes is a new data type introduced in Python 3.
The type of the elements in a bytes object is int .
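A quick sketch that shows the element type, using an assumed sample value:

```python
data = b"ABC"

# Indexing a bytes object yields an int, not a one-byte bytes object.
first = data[0]
print(first, type(first))

# Iterating yields the integer value of every byte.
values = list(data)
print(values)
```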
Convert Bytes to a String With decode in Python 3.x
The .decode method of bytes converts bytes to a string using the given encoding . In most cases it is fine to leave the encoding at its default, utf-8 , but this is not always safe, because the bytes may have been encoded with an encoding other than utf-8 .
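The following sketch shows three equivalent ways to call decode on an assumed two-byte sample:

```python
data = b"\x50\x51"

# Keyword argument, positional argument, and the default encoding.
with_keyword = data.decode(encoding="utf-8")
with_positional = data.decode("utf-8")
with_default = data.decode()

print(with_keyword, with_positional, with_default)
```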
The three ways of calling decode (with encoding="utf-8" , with "utf-8" as a positional argument, and with no argument) are identical, because utf-8 is the default encoding.
decode can raise an error when utf-8 is used but the bytes were not encoded with it.
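A small sketch of such a failure, using an assumed sample in which the byte 0x80 cannot start a UTF-8 sequence:

```python
data = b"\x50\x80\x51"  # 0x80 is not a valid UTF-8 start byte

try:
    data.decode("utf-8")
    error_name = None
except UnicodeDecodeError as exc:
    # Record the exception type and show the codec's message.
    error_name = type(exc).__name__
    print(error_name, exc)
```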
We get a UnicodeDecodeError , which says that utf-8 is not the right codec.
There are two approaches to solving this encoding problem.
backslashreplace , ignore , or replace as the errors parameter
decode has another parameter besides encoding : errors . It defines the behaviour when an error occurs. The default value of errors is strict , which means that an exception is raised if an error occurs during decoding.
errors has other options such as ignore , replace , or other names registered with codecs.register_error , for example backslashreplace .
ignore ignores the decoding errors and builds the output string as far as it can.
replace replaces each undecodable byte with the replacement character defined by the codec (U+FFFD when decoding to str). backslashreplace replaces the bytes that could not be decoded with backslash escape sequences that show the original byte values.
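The three error handlers can be compared in a short sketch, using the same assumed sample bytes with one invalid byte in the middle:

```python
data = b"\x50\x80\x51"  # "P", an invalid byte, "Q"

# Drop the byte that cannot be decoded.
ignored = data.decode("utf-8", errors="ignore")

# Substitute U+FFFD, the Unicode replacement character.
replaced = data.decode("utf-8", errors="replace")

# Keep the raw byte value visible as a backslash escape.
escaped = data.decode("utf-8", errors="backslashreplace")

print(ignored, replaced, escaped)
```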
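The MS-DOS cp437 encoding can be used if the encoding of the byte data is unknown. It assigns a character to every possible byte value, so decoding never fails, although the result is not necessarily the text that was originally intended.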
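As a sketch of this fallback, assuming the same sample bytes as before:

```python
data = b"\x50\x80\x51"

# cp437 maps every byte value to some character, so this cannot fail.
text = data.decode("cp437")
print(text)

# Even all 256 possible byte values decode without an error.
all_bytes_text = bytes(range(256)).decode("cp437")
print(len(all_bytes_text))
```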
chr to Convert Bytes to a String in Python 3.x
chr(i, /) returns a Unicode string of one character with ordinal i . It can convert a single element of bytes to a string, but not a whole bytes object.
We could use a list comprehension or map to obtain the converted string from bytes , applying chr to each element.
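Both variants can be sketched as follows, with an assumed ASCII sample. Because chr maps each byte value to the code point with the same number, the result matches a latin-1 decode rather than a utf-8 decode for bytes above 127.

```python
data = b"ABC"

# List comprehension: convert every byte value to a character.
from_comprehension = "".join([chr(value) for value in data])

# map: the same conversion without an explicit loop.
from_map = "".join(map(chr, data))

print(from_comprehension, from_map)
```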
Performance Comparison and Conclusion of the Different Methods of Converting Bytes to a String
We use timeit to compare the performance of the methods introduced in this tutorial: decode and chr .
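One way to set up such a comparison is sketched below. The sample data, the helper names with_decode and with_chr, and the repetition count are assumptions for illustration, and the actual timings depend on the machine.

```python
import timeit

data = b"ABC" * 100  # hypothetical sample payload


def with_decode():
    # latin-1 is used so that both helpers return the same string.
    return data.decode("latin-1")


def with_chr():
    return "".join(map(chr, data))


# Time each approach over the same number of calls.
t_decode = timeit.timeit(with_decode, number=10_000)
t_chr = timeit.timeit(with_chr, number=10_000)

print(f"decode: {t_decode:.4f}s, chr: {t_chr:.4f}s")
```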
The measured execution times show that decode() is much faster, while chr() is relatively inefficient because it has to rebuild the string from single-character strings.
We recommend using decode in performance-critical applications.
Conversion Between Bytes and Strings
Working with bytes cannot be avoided. For example, when working with the network or the file system, results are most often returned as bytes.
Accordingly, you need to know how to convert bytes to a string and back. This is what an encoding is for.
An encoding can be thought of as an encryption key that specifies:
- how to "encrypt" a string into bytes (str -> bytes), using the encode method (similar to encrypt)
- how to "decrypt" bytes into a string (bytes -> str), using the decode method (similar to decrypt)
This analogy makes it clear that string-to-bytes and bytes-to-string conversions must use the same encoding.
encode, decode
To convert a string to bytes, use the encode method:
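A minimal sketch, assuming a non-ASCII sample string stored in a variable named hi:

```python
hi = "привет"

# Each Cyrillic letter becomes two bytes in UTF-8.
hi_bytes = hi.encode(encoding="utf-8")

print(hi_bytes)
```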
To get a string from bytes, use the decode method:
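The reverse step can be sketched like this, with assumed sample bytes that hold UTF-8 encoded Cyrillic text:

```python
hi_bytes = b"\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82"

# Decode with the same encoding that produced the bytes.
hi = hi_bytes.decode(encoding="utf-8")

print(hi)
```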
str.encode, bytes.decode
The encode method is also available on the str class (like the other string methods):
And the decode method is available on the bytes class (like its other methods):
In these methods the encoding can be given as a keyword argument ( encoding='utf-8' ) or as a positional one:
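The class-level calls and the positional form can be sketched together, with an assumed sample string:

```python
hi = "привет"

# Calling the methods through the classes, with a keyword argument.
hi_bytes = str.encode(hi, encoding="utf-8")
hi_again = bytes.decode(hi_bytes, encoding="utf-8")

# The same conversions with the encoding passed positionally.
hi_bytes_positional = hi.encode("utf-8")
hi_positional = hi_bytes.decode("utf-8")

print(hi_bytes, hi_again)
```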
How to Work With Unicode and Bytes
There is a very simple rule that helps avoid at least some of the problems. It is called the "Unicode sandwich":
- bytes that the program reads should be converted to Unicode (a string) as early as possible
Python How to Convert Bytes to String (5 Approaches)
To convert bytes into a string in Python, use the bytes.decode() method.
This is the quick answer.
However, depending on the context and your needs, there are other ways to convert bytes to strings.
In this guide, you will learn how to convert bytes to a string in 5 different ways for different situations.
Here’s a short review of the byte-to-string converting methods:
Method | Example |
---|---|
1. The decode() method of a byte string | byte_string.decode('UTF-8') |
2. The built-in str() function | str(byte_string, 'UTF-8') |
3. Codecs decode() function | codecs.decode(byte_string) |
4. Pandas dataframe decode() method | df['column'].str.decode("utf-8") |
5. The join() method with map() function | "".join(map(chr, byte_str)) |
Let’s jump to it!
Bytes vs Strings in Python
There is a chance you are looking to convert bytes to strings without being sure what bytes are. Before jumping into the conversions, let’s take a quick look at what bytes are in the first place.
Why Bytes?
A computer doesn’t understand the notion of “text” or “number” as is. This is because computers operate on bits, that is, 0s and 1s.
Storing data to a computer happens by using groups of bits, also known as bytes. Usually, there are 8 bits in a byte. But this might vary depending on what system you’re using.
Byte Strings in Python
In Python, a byte string is a sequence of bytes that the computer understands but humans can’t.
A string is a sequence of characters and is something we humans can understand but cannot directly store in a computer.
This is why any string needs to be converted to a byte string before the computer can use it.
In Python, a bytes object is the byte representation of a string. A bytes literal is prefixed with the letter 'b'.
For example, take a look at these two variables:
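A sketch of the two variables, with values inferred from the surrounding text:

```python
name1 = "Alice"   # a regular string
name2 = b"Alice"  # a bytes literal, marked by the b prefix
```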
- name1 is a str object.
- name2 is a bytes object.
You can verify this by printing out the data types of these variables:
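For instance, a minimal check could look like this (the variables are redefined so the snippet runs on its own):

```python
name1 = "Alice"
name2 = b"Alice"

# type() reports the class of each object.
print(type(name1))
print(type(name2))
```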
As I mentioned earlier, a byte string is hard to read. In the above code, this isn’t obvious, because you can read b'Alice' very clearly.
Byte String vs String in Python
To see the main difference between the byte string and a string, let’s print the words character by character.
First, let’s do the name1 variable:
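A short sketch of that loop, assuming name1 holds the string "Alice":

```python
name1 = "Alice"

# Iterating over a str yields one-character strings.
characters = [character for character in name1]

for character in characters:
    print(character)
```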
Now, let’s print each byte in the name2 bytes object:
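The same loop over the bytes object can be sketched like this, assuming name2 holds b"Alice":

```python
name2 = b"Alice"

# Iterating over bytes yields integers, one per byte.
byte_values = [value for value in name2]

for value in byte_values:
    print(value)
```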
Here you can see there is no way for you to tell what those numbers mean. Those numbers are the byte values of the characters in a string. Something that a computer can understand.
To make one more thing clear, let’s see what happens if we print the bytes object name2 as-is:
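A one-line sketch of that experiment, with name2 assumed to be b"Alice":

```python
name2 = b"Alice"

# print() shows the printable representation of the bytes object.
shown = repr(name2)
print(name2)
```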
To your surprise, it clearly says “Alice”. This isn’t too hard to read, is it?
The byte string prints out as readable text because what you see is actually a string representation of the bytes object.
Python does this for the developer’s convenience.
If there was no special string representation for a bytes object, printing bytes would be nonsense.
Anyway, now you understand what is a bytes object in Python, and how it differs from the str object.
Now, let’s see how to convert between bytes and string.
1. The decode() Function
Given a bytes object, you can use its decode() method to convert the bytes to a string.
You can also pass the encoding to this method as an argument.
For example, let’s use the UTF-8 encoding for converting bytes to a string:
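A minimal sketch, assuming the illustrative names name_byte and name_str:

```python
name_byte = b"Alice"

# Decode the bytes as UTF-8 to get a regular string.
name_str = name_byte.decode("UTF-8")

print(name_str)
```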
2. The str() Function
Another approach to convert bytes to string is by using the built-in str() function.
When you pass an encoding, this function does exactly the same thing as the decode() method in the previous example.
Perhaps the only downside to this approach is in the code readability.
If you compare these two lines:
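The two calls can be sketched side by side, with name_byte as an assumed sample value:

```python
name_byte = b"Alice"

# The str() form, then the decode() form of the same conversion.
via_str = str(name_byte, "UTF-8")
via_decode = name_byte.decode("UTF-8")

print(via_str, via_decode)
```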
You can see the latter is more explicit about decoding the bytes to a string.
3. Codecs decode() Function
Python also has a built-in codecs module for text decoding and encoding.
This module also has its own decode() function. You can use this function to convert bytes to strings (and vice versa).
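A short sketch of this function, again with an assumed sample value:

```python
import codecs

name_byte = b"Alice"

# codecs.decode() uses utf-8 when no encoding is given.
name_str = codecs.decode(name_byte)

print(name_str)
```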
4. Pandas decode() Function
If you are working with pandas and you have a data frame that consists of bytes, you can easily convert them to strings by calling the str.decode() function on a column.
5. map() Function: Convert a Byte List to String
In Python, a string is a group of characters.
Each character is associated with a Unicode code point, which is an integer.
Thus, you can convert an integer to a character in Python.
To do this, you can call the built-in chr() function on an integer.
Given a list of integers, you can use the map() function to map each integer to a character.
Here is how it looks in code:
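A minimal sketch, assuming a list of integers named byte_data:

```python
byte_data = [65, 108, 105, 99, 101]

# map() applies chr() to every integer; join() merges the characters.
text = "".join(map(chr, byte_data))

print(text)
```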
This piece of code:
- Converts the integers to corresponding characters.
- Returns an iterator of characters.
- Merges the list of characters to a single string.
Be Careful with the Encoding
There are dozens of byte-to-string encodings out there.
In this guide, we only used the UTF-8 encoding, which is the most popular encoding type.
UTF-8 is also the default encoding in Python 3. However, UTF-8 is not always the correct encoding, and decoding bytes that were written with another encoding can fail with a UnicodeDecodeError.
This error means that the bytes do not form a valid UTF-8 sequence, so there is no character that UTF-8 can map them to.
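Here is a sketch of a decode that fails, using an assumed sample byte string whose last three bytes are not valid UTF-8:

```python
s = b"test \xe7\xf8\xe9"

try:
    s.decode("UTF-8")
    message = None
except UnicodeDecodeError as error:
    # The message names the codec and the offending byte.
    message = str(error)
    print(message)
```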
In other words, you should be using a different encoding.
You can use a module like chardet to detect the character encodings. (Notice that this module is not maintained, but most of the info you learn about it is still applicable.)
However, no approach is 100% foolproof. This module gives you its best guess about the encoding and the probability associated with it.
Anyway, let’s say the above byte string can be decoded using the latin1 encoding as well as the iso8859_5 encoding.
Now let’s make the conversion:
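A sketch of both conversions, using an assumed sample byte string with three non-ASCII bytes:

```python
s = b"test \xe7\xf8\xe9"

# The same bytes read as Western European text...
as_latin1 = s.decode("latin1")

# ...and as Cyrillic text.
as_cyrillic = s.decode("iso8859_5")

print(as_latin1)
print(as_cyrillic)
```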
This time there is no error. Both encodings work, but each produces a different result.
So be careful with the encodings!
If you see an error when doing a conversion, the first thing you need to do is to figure out the encoding used. Then you should use that particular encoding to encode/decode your values to get it right.
Conclusion
Today you learned how to convert bytes to strings in Python.
To recap, there are several ways to convert bytes to strings in Python.
- To convert a byte string to a string, use the bytes.decode() method.
- If you have a list of bytes, call the chr() function on each byte using the map() function (or a for loop) and join the results.
- If you have a pandas dataframe with bytes, call the .str.decode() method on the column with bytes.
In Python 3, the default character encoding for encode() and decode() is UTF-8.
However, this is not always applicable. Trying to decode bytes that are not valid UTF-8 with UTF-8 produces an error. In this situation, you should determine the right character encoding before encoding/decoding. You can use a module like chardet to do this.