How to set UTF-8 encoding in PHP

Setting a UTF-8 locale in PHP

Every PHP application should configure its own locale and encoding, regardless of the server settings. This prevents broken output and misbehavior when the site moves to another host, and in similar situations.

Setlocale

The locale is set with setlocale(). Several spellings of the locale name can be passed, and the first one available on the system is used.

Instead of LC_ALL you can specify a single category of functions that the locale will affect:

LC_COLLATE – string comparison functions,
LC_CTYPE – character classification and conversion functions,
LC_MONETARY – for the localeconv() function,
LC_NUMERIC – sets the decimal separator,
LC_TIME – date/time formatting,
LC_MESSAGES – system messages.
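
A minimal sketch of the setlocale() call this section describes. The locale names are assumptions: which ones exist depends on the locales installed on the server, so several spellings are tried and the code falls back to the C locale, which is always available.

```php
<?php
// Candidate UTF-8 locale names (assumed; they vary between systems).
$candidates = ['ru_RU.UTF-8', 'ru_RU.utf8', 'en_US.UTF-8', 'C.UTF-8'];

// setlocale() tries the names in order and returns the one it applied,
// or false when none of them is installed.
$locale = setlocale(LC_ALL, $candidates);

if ($locale === false) {
    // The "C" locale always exists, so behaviour stays predictable.
    $locale = setlocale(LC_ALL, 'C');
}

// Optional: keep "." as the decimal separator regardless of the language.
setlocale(LC_NUMERIC, 'C');

echo "Active locale: {$locale}\n";
```

Checking the return value matters here: a failed setlocale() call does not raise an error, it just leaves the previous locale in place.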

mbstring

Configure the functions that work with multibyte strings.
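
A short sketch of that configuration, assuming the mbstring extension is loaded. Once the internal encoding is set, the mb_* functions use UTF-8 whenever no encoding argument is passed.

```php
<?php
// Default encoding for all mb_* functions in this request.
mb_internal_encoding('UTF-8');

// With the internal encoding set, the encoding argument can be omitted.
$upper = mb_strtoupper("\u{044F}"); // Cyrillic small "ya" becomes capital "YA"

echo mb_internal_encoding(), ' ', $upper, "\n";
```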

Time zone

The results of the date and time functions depend on the time zone, so set it explicitly as well.
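
For illustration, the time zone can be fixed in code like this. The identifier Europe/Moscow is only an assumed example; use the zone your application actually needs.

```php
<?php
// 'Europe/Moscow' is an illustrative choice, not a value from the article.
date_default_timezone_set('Europe/Moscow');

// All date functions now use this zone instead of the server default.
echo date_default_timezone_get(), ' ', date('Y-m-d H:i:s'), "\n";
```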

Content encoding

You can also state explicitly which encoding the content is sent in by sending a Content-Type header.
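
A sketch of sending that header. The helper contentTypeHeader() is a hypothetical name introduced here so that every response type gets a charset; only header() and headers_sent() are PHP built-ins.

```php
<?php
/**
 * Build a Content-Type header line that always carries a charset.
 * The function name and its defaults are illustrative, not a PHP built-in.
 */
function contentTypeHeader(string $mime = 'text/html', string $charset = 'utf-8'): string
{
    return sprintf('Content-Type: %s; charset=%s', $mime, $charset);
}

// Headers can only be sent before the first byte of output.
if (!headers_sent()) {
    header(contentTypeHeader());
}
```

The same helper can be called as contentTypeHeader('application/json') or contentTypeHeader('text/xml') for other kinds of output.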

Setting the UTF-8 encoding

Silly as it may seem, setting the encoding successfully takes no fewer than 11 (!) rules.
A warning up front: if any of the .htaccess settings below causes a 500 error, the hosting provider has forbidden changing that parameter on the server. In that case, check that the server really is using UTF-8 and, if it is not, contact the hosting administrators.
And for those who came to this page with questions about Ajax: Ajax works in UTF-8.

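Rule 1: In the HTML markup, put the charset declaration on the first line inside the <head> tag, for example <meta charset="utf-8">. The only exception is when another tag has to come first because, like the encoding declaration, its position takes priority.

Rule 2: Set the encoding for PHP and for the file itself by sending a header with the header() function, for example header('Content-Type: text/html; charset=utf-8');. Put the call at the very beginning of the file (absolutely the first thing), right after setting the error reporting level.

Rule 3: Set the encoding for the MySQL database connection. This is done after connecting and selecting the database (mysql_connect, mysql_select_db). With the mysql module this can be done with mysql_set_charset(), and with the improved mysqli module with mysqli_set_charset(). Note that the old mysql module was removed in PHP 7, so current code should use mysqli.

Rule 4: Set the encoding in .htaccess, for example with the Apache directive AddDefaultCharset UTF-8.

Rule 5: Set the encoding for the mb library (mbstring). In recent PHP versions (5.6 and later) this can be omitted, because mbstring falls back to default_charset, which is UTF-8 by default. On older versions, set it in .htaccess with a php_value line for mbstring.internal_encoding, or in PHP itself with mb_internal_encoding('UTF-8'), which has the same effect.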

Rule 6: When saving files (ALL of them, without exception), choose the encoding UTF-8 without BOM. To repeat: without BOM. This setting is required, because the BOM bytes are sent to the browser as output before anything else and the site will not work properly. For those who use DreamWeaver:
Modify => Page Properties => Title/Encoding, set "Encoding: UTF-8", click Reload, clear the "Include Unicode Signature (BOM)" checkbox, then Apply and OK.
In the Russian-language interface the same path is Модификации => Свойства страницы => Заголовок/Кодировка.
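
To check existing files for a BOM, something along these lines can help. It is a sketch: the function names are hypothetical, and the only fact relied on is that the UTF-8 BOM is the three bytes EF BB BF.

```php
<?php
// The UTF-8 byte order mark.
const UTF8_BOM = "\xEF\xBB\xBF";

/** Return true when the given file contents start with a UTF-8 BOM. */
function hasBom(string $bytes): bool
{
    return strncmp($bytes, UTF8_BOM, 3) === 0;
}

/** Return the contents without a leading BOM (unchanged if there is none). */
function stripBom(string $bytes): string
{
    return hasBom($bytes) ? substr($bytes, 3) : $bytes;
}

// Example with an in-memory string; for a real file, pass file_get_contents($path).
var_dump(hasBom(UTF8_BOM . 'hello')); // bool(true)
```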

Rule 7: If any text has already been entered on the page or in the database, it has to be retyped. The reason is that the same Russian character is represented by one bit pattern in one encoding and by a different one in another. So the text must either be retyped or re-encoded. Modern programs can convert text from one encoding to another; consult the manuals of the programs you use.

Rule 8: There is an exception: text may arrive on your page from another site in a different encoding. For that case PHP has a convenient function that converts text from one encoding to another.
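
The text does not name the function, so the sketch below assumes mb_convert_encoding() from the mbstring extension (iconv() is another common choice). The input is the word "привет" written out as raw Windows-1251 bytes, so the example does not depend on how this file is saved.

```php
<?php
// "привет" as Windows-1251 bytes, the way a legacy site might send it.
$cp1251 = "\xEF\xF0\xE8\xE2\xE5\xF2";

// Arguments: the string, the target encoding, the source encoding.
$utf8 = mb_convert_encoding($cp1251, 'UTF-8', 'Windows-1251');

echo $utf8, "\n";                   // the same word, now in UTF-8
echo strlen($cp1251), ' bytes -> ', strlen($utf8), " bytes\n";
```

The source encoding has to be known in advance; converting text that is already UTF-8 a second time produces garbage rather than an error.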

Rule 9: Replace string functions such as strlen and substr with their mbstring counterparts, mb_strlen and mb_substr; that is, add the mb_ prefix to the function name.
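
A small comparison of the byte-oriented functions and their mb_ counterparts, assuming the mbstring extension is available. The sample word is written with \u{...} escapes so it is UTF-8 regardless of the file encoding.

```php
<?php
$word = "\u{043F}\u{0440}\u{0438}\u{0432}\u{0435}\u{0442}"; // "привет", 6 letters

$bytes = strlen($word);                    // counts bytes
$chars = mb_strlen($word, 'UTF-8');        // counts characters

$broken = substr($word, 0, 3);             // 3 bytes: one and a half letters
$clean  = mb_substr($word, 0, 3, 'UTF-8'); // 3 characters

printf("strlen=%d mb_strlen=%d\n", $bytes, $chars);
var_dump(mb_check_encoding($broken, 'UTF-8')); // the byte cut is not valid UTF-8
var_dump(mb_check_encoding($clean, 'UTF-8'));
```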

Rule 10: When working with regular expressions, add the u modifier to the pattern. This is mandatory: without it the PCRE functions treat the string as bytes rather than as UTF-8 characters.
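
The effect of the u modifier can be seen with a pattern that must match exactly one character. This is an illustrative sketch using only preg_match() and preg_split().

```php
<?php
$letter = "\u{044F}"; // Cyrillic "я": one character, two bytes

// Without /u the dot matches a single byte, so a two-byte letter fails.
$withoutU = preg_match('/^.$/', $letter);

// With /u the dot matches a whole UTF-8 character.
$withU = preg_match('/^.$/u', $letter);

// Splitting a string into characters also needs /u.
$chars = preg_split('//u', "\u{043F}\u{0440}\u{0438}", -1, PREG_SPLIT_NO_EMPTY);

var_dump($withoutU, $withU, count($chars));
```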

Rule 11: For CSS files, declare the encoding on the first line of the file with @charset "utf-8";.

In conclusion: characters in WIN-1251 occupy 1 byte (8 bits), whereas in UTF-8 a character can take from 1 to 4 bytes. UTF-8 makes multilingual sites possible because it can represent every character defined in Unicode.
As a curiosity, a Russian letter occupies 2 bytes in UTF-8, which is why strlen returns 2 for a single character (2 bytes), while mb_strlen returns the correct length of 1 character.

How to Set HTTP Header to UTF-8 in PHP

While working with PHP, you sometimes need to set the charset in the HTTP Content-Type header to UTF-8. This tutorial shows how to do that with the header() function, which sends a raw HTTP header.

Using header()

To declare UTF-8, call header('Content-Type: text/html; charset=utf-8');.

Run this call before any output is sent to the client.

The simplest way is to put it at the beginning of the page. Be careful not to leave any blank space before the opening PHP tag: whitespace counts as output, and PHP will then warn that the headers have already been sent. To check whether the headers have already been sent, use the headers_sent() function.
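
One way to combine the two functions is sketched below. sendUtf8Header() is a hypothetical helper name; headers_sent() fills its two optional by-reference arguments with the file and line where output started, which helps track down stray whitespace.

```php
<?php
/**
 * Send the UTF-8 Content-Type header if that is still possible.
 * Returns false (and logs where output began) when it is too late.
 * The function name is illustrative.
 */
function sendUtf8Header(): bool
{
    if (headers_sent($file, $line)) {
        error_log("Headers already sent; output started at {$file}:{$line}");
        return false;
    }

    header('Content-Type: text/html; charset=utf-8');
    return true;
}

$sent = sendUtf8Header();
```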

Describing the header() Function

The header() function is available in all PHP versions.

It is used for sending a raw HTTP header.

An essential thing to note: header() must be called before any actual output is sent.

For more examples of using header(), see the PHP manual.

Describing the headers_sent Function

This PHP function checks whether, and optionally where, the headers were sent. Once the header block has been sent, no more header lines can be added with header().

For more examples of using headers_sent(), see the PHP manual.

A Guide to UTF-8 Encoding in PHP and MySQL

By Francisco Clariá

Francisco is an engineer focused on cross-platform apps (Ionic/Cordova) and specialized in hardware-software technology integration.

As a MySQL or PHP developer, once you step beyond the comfortable confines of English-only character sets, you quickly find yourself entangled in the wonderfully wacky world of UTF-8 encoding.

Unicode is a widely-used computing industry standard that defines a comprehensive mapping of unique numeric code values to the characters in most of today’s written character sets to aid with system interoperability and data interchange.

UTF-8 is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32. UTF-8 has become the dominant character encoding for the World Wide Web, accounting for more than half of all Web pages.

UTF-8 encodes each character using one to four bytes. The first 128 characters of Unicode correspond one-to-one with ASCII, making valid ASCII text also valid UTF-8-encoded text. It is for this reason that systems that are limited to use of the English character set are insulated from the complexities that can otherwise arise with UTF-8.
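
The one-to-four-byte range is easy to observe from PHP. The sketch below prints the UTF-8 bytes of four sample characters chosen for illustration; the \u{...} escapes (PHP 7 and later) always produce UTF-8.

```php
<?php
// One sample character for each possible UTF-8 length.
$samples = [
    'U+0041 LATIN CAPITAL LETTER A'   => "\u{0041}",
    'U+044F CYRILLIC SMALL LETTER YA' => "\u{044F}",
    'U+20AC EURO SIGN'                => "\u{20AC}",
    'U+1F600 GRINNING FACE'           => "\u{1F600}",
];

foreach ($samples as $name => $char) {
    // strlen() counts bytes; bin2hex() shows them in hexadecimal.
    printf("%-32s %d byte(s): %s\n", $name, strlen($char), bin2hex($char));
}
```

The letter A comes out as the single byte 41, identical to its ASCII code.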

For example, the Unicode hexadecimal code for the letter A is U+0041, which in UTF-8 is simply encoded with the single byte 41. In comparison, the Unicode hexadecimal code for the character

utf8 symbol

At a previous job, we began running into data encoding issues when displaying bios of artists from all over the world. It soon became apparent that there were problems with the stored data, as sometimes the data was correctly encoded and sometimes it was not.

This led programmers to implement a hodge-podge of patches, sometimes with JavaScript, sometimes with HTML charset meta tags, sometimes with PHP, and so on. Soon, we ended up with a list of 600,000 artist bios with double- or triple-encoded information, with data being stored in different ways depending on who programmed the feature or implemented the patch. A classic technical rat’s nest.

Indeed, navigating through UTF-8 data encoding issues can be a frustrating and hair-pulling experience. This post provides a concise cookbook for addressing these UTF-8 issues when working with PHP and MySQL in particular, based on practical experience and lessons learned.

Specifically, we’ll cover the following in this post:

Mods you’ll need to make to your php.ini file and PHP code.
Mods you’ll need to make to your my.ini file and other MySQL-related issues to be aware of (including config mods needed if you’re using Sphinx)
How to migrate data from a MySQL database previously encoded in latin1 to instead use a UTF-8 encoding

PHP UTF-8 Encoding – modifications to your php.ini file:

The first thing you need to do is to modify your php.ini file to use UTF-8 as the default character set, by setting default_charset = "utf-8".

(Note: You can subsequently use phpinfo() to verify that this has been set properly.)
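
Besides phpinfo(), the setting can be checked, and if necessary overridden, at runtime. This is a sketch for code that cannot rely on the server's php.ini; default_charset may be changed with ini_set().

```php
<?php
// Read the value PHP is currently using.
$before = ini_get('default_charset');

// Override it for this request when the server is configured differently.
if (strcasecmp((string) $before, 'UTF-8') !== 0) {
    ini_set('default_charset', 'UTF-8');
}

echo 'default_charset: ', ini_get('default_charset'), "\n";
```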

OK cool, so now PHP and UTF-8 should work just fine together. Right?

Well, not exactly. In fact, not even close.

While this change will ensure that PHP always outputs UTF-8 as the character encoding (in browser response Content-type headers), you still need to make a number of modifications to your PHP code to make sure that it properly processes and generates UTF-8 characters.

PHP UTF-8 Encoding – modifications to your code:

To be sure that your PHP code plays well in the UTF-8 data encoding sandbox, here are the things you need to do:

Set UTF-8 as the character set for all headers output by your PHP code

In every PHP output header, specify UTF-8 as the encoding, for example header('Content-Type: text/html; charset=utf-8');.

Specify UTF-8 as the encoding type for XML

Declare the encoding in the XML declaration at the top of the document, with version="1.0" and encoding="UTF-8".

Strip out unsupported characters from XML

Since not all UTF-8 characters are accepted in an XML document, you’ll need to strip any such characters out from any XML that you generate. A small helper function that removes them can be applied to every string before it is written into the document.
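
The author's original helper is not preserved here, so the following is a reconstruction under stated assumptions rather than that function. It keeps only the characters that the XML 1.0 specification allows (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF); the name stripInvalidXmlChars() is hypothetical.

```php
<?php
/**
 * Remove characters that XML 1.0 does not allow.
 * The function name is illustrative; input is expected to be valid UTF-8.
 */
function stripInvalidXmlChars(string $text): string
{
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u';

    // preg_replace() returns null when $text is not valid UTF-8;
    // in that case an empty string is returned instead.
    return preg_replace($pattern, '', $text) ?? '';
}

// A NUL byte and a vertical tab are dropped; the accented letter and tab stay.
$clean = stripInvalidXmlChars("caf\u{00E9}\x00\x0B menu\t");
var_dump($clean);
```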

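Specify UTF-8 as the character set for all HTML content

For HTML content, specify UTF-8 as the encoding with a meta tag in the document head, such as <meta charset="utf-8">.

In HTML forms, specify UTF-8 as the encoding with the accept-charset="UTF-8" attribute on the form element.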
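
Put together, a page template might look like the sketch below. renderPage() and its parameters are hypothetical names; the point is only where the meta tag and the accept-charset attribute go.

```php
<?php
/**
 * Render a minimal HTML page that declares UTF-8 for the document
 * and for form submissions. Names are illustrative.
 */
function renderPage(string $title, string $formFieldsHtml): string
{
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

    return <<<HTML
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>{$safeTitle}</title>
</head>
<body>
<form method="post" accept-charset="UTF-8">
{$formFieldsHtml}
</form>
</body>
</html>
HTML;
}

$html = renderPage('A & B', '<input name="q">');
```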

Specify UTF-8 as the encoding in all calls to htmlspecialchars

Pass 'UTF-8' as the third (encoding) argument of htmlspecialchars.

Note: As of PHP 5.6.0, the default_charset value is used as the default. From PHP 5.4.0, UTF-8 was the default, but prior to PHP 5.4.0, ISO-8859-1 was used as the default. It’s therefore a good idea to always explicitly specify UTF-8 to be safe, even though this argument is technically optional.

Also note that, for UTF-8, htmlspecialchars and htmlentities can be used interchangeably.
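
A common way to follow this advice is a thin wrapper so the encoding is never forgotten. The wrapper name escapeHtml() is an assumption made for this sketch; the flags and the encoding argument are standard htmlspecialchars() parameters.

```php
<?php
/**
 * Escape text for HTML output, always as UTF-8.
 * The function name is illustrative.
 */
function escapeHtml(string $text): string
{
    // ENT_QUOTES escapes both quote styles; ENT_SUBSTITUTE replaces invalid
    // byte sequences with U+FFFD instead of returning an empty string.
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

echo escapeHtml('<a href="x">caf' . "\u{00E9}" . '</a>'), "\n";
```

Non-ASCII characters pass through unchanged; only the HTML special characters are converted.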

Set UTF-8 as the default character set for all MySQL connections

Specify UTF-8 as the default character set to use when exchanging data with the MySQL database using mysql_set_charset.

Note that, as of PHP 5.5.0, mysql_set_charset is deprecated (the whole mysql extension was removed in PHP 7.0), and mysqli::set_charset should be used instead.
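
A sketch of the mysqli variant, wrapped in a function so that nothing connects until it is called. The function name, its parameters and the default of utf8mb4 are assumptions for illustration; the host, credentials and database name must come from your own configuration.

```php
<?php
/**
 * Open a mysqli connection and switch it to a UTF-8 character set.
 * All connection details are placeholders supplied by the caller.
 */
function connectUtf8(
    string $host,
    string $user,
    string $password,
    string $database,
    string $charset = 'utf8mb4'
): mysqli {
    // Make mysqli throw exceptions instead of returning false on errors.
    mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

    $link = new mysqli($host, $user, $password, $database);

    // set_charset() also informs the client library of the encoding,
    // which keeps real_escape_string() consistent with the connection.
    $link->set_charset($charset);

    return $link;
}
```

A call would look like connectUtf8('localhost', 'app_user', 'secret', 'app_db'), with every value replaced by real settings.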

Always use UTF-8 compatible versions of string manipulation functions

There are several PHP functions that will fail, or at least not behave as expected, if a character needs more than 1 byte (as many do in UTF-8). An example is the strlen function, which returns the number of bytes rather than the number of characters.

Two options are available for dealing with this:

The iconv functions that are available by default with PHP provide multibyte compatible versions of many of these functions (e.g., iconv_strlen, etc.). Remember, though, that the strings you provide to these functions must themselves be properly encoded.

There is also the mbstring extension to PHP. This extension provides a comprehensive set of functions that properly account for multibyte encoding.
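
The two options can be combined, as in this sketch: mbstring validates the input and iconv counts the characters. safeLength() is a hypothetical helper name, and both extensions are assumed to be loaded.

```php
<?php
/**
 * Return the length of a UTF-8 string in characters,
 * or null when the input is not valid UTF-8.
 * The function name is illustrative.
 */
function safeLength(string $text): ?int
{
    // iconv_* functions expect properly encoded input, so validate first.
    if (!mb_check_encoding($text, 'UTF-8')) {
        return null;
    }

    $length = iconv_strlen($text, 'UTF-8');

    return $length === false ? null : $length;
}

var_dump(safeLength("\u{044F}\u{044F}")); // two Cyrillic letters, four bytes
var_dump(safeLength("\xD1"));             // a truncated sequence
```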

MySQL UTF-8 Encoding – modifications to your my.ini file:

On the MySQL/UTF-8 side of things, modifications to the my.ini file are required as follows:

Set the UTF-8 character set parameters under each corresponding section of the file.

After making the above changes to your my.ini file, restart your MySQL daemon.

To verify that everything has properly been set to use the UTF-8 encoding, list the server’s character set variables (for example with SHOW VARIABLES LIKE 'char%';).

If you instead see latin1 listed for any of these, double-check your configuration and make sure you’ve properly restarted your MySQL daemon.

MySQL UTF-8 Encoding – other things to consider:

MySQL’s utf8 character set is actually a partial implementation of UTF-8. Specifically, it uses a maximum of 3 bytes per character, whereas 4 bytes are required for encoding the full Unicode range. This is fine for the characters of most languages, but if you need to support astral symbols (whose code points range from U+010000 to U+10FFFF, which includes emoji), those require a four byte encoding which is not supported by MySQL utf8. In MySQL 5.5.3, this was addressed with the addition of support for the utf8mb4 character set, which uses a maximum of four bytes per character and thereby supports the full UTF-8 encoding. So if you’re using MySQL 5.5.3 or later, use utf8mb4 instead of utf8 as your database/table/column character set.
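
To find out whether existing text would be affected, the astral range can be tested from PHP. This is an illustrative sketch; needsUtf8mb4() is a hypothetical name.

```php
<?php
/**
 * Return true when the text contains a character outside the Basic
 * Multilingual Plane, i.e. one that takes four bytes in UTF-8.
 * The function name is illustrative; input should be valid UTF-8.
 */
function needsUtf8mb4(string $text): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

var_dump(needsUtf8mb4("price: 5 \u{20AC}"));  // euro sign: three bytes
var_dump(needsUtf8mb4("nice \u{1F600}"));     // emoji: four bytes
```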

If the connecting client has no way to specify the encoding for its communication with MySQL, after the connection is established you may have to run a SET NAMES statement (for example, SET NAMES utf8mb4).

When determining the size of varchar fields when modeling the database, don’t forget that UTF-8 characters may require as many as 4 bytes per character.

MySQL UTF-8 Encoding – if you use Sphinx:

In your Sphinx configuration file (i.e., sphinx.conf):

Set your index definition to use UTF-8 (in older Sphinx releases this is the charset_type = utf-8 option).

Add a pre-query to your source definition that switches the MySQL connection to UTF-8 (for example, sql_query_pre = SET NAMES utf8).

Restart the engine and remake all indices.

If you want to configure Sphinx so that letters like C c Ć ć Ĉ ĉ Ċ ċ Č č are all treated as equivalent for search purposes, you will need to configure a charset_table (a.k.a. character folding), which is essentially an equivalency mapping between characters.

Migrating database data that is already encoded in latin1 to UTF-8

If you have an existing MySQL database that is already encoded in latin1, here’s how to convert the latin1 to UTF-8:

1. Make sure you’ve made all the modifications to the configuration settings in your my.ini file, as described above.

2. Change the default character set of the database to UTF-8 (an ALTER SCHEMA statement with a DEFAULT CHARACTER SET clause).

3. Via command line, verify that everything is properly set to UTF-8.

4. Create a dump file with latin1 encoding for the table you want to convert (for example with mysqldump and its --default-character-set=latin1 option).

5. Do a global search and replace of the charset in the dump file from latin1 to UTF-8.

Note to Windows users: This charset string replacement (from latin1 to UTF-8) can also be done using find-and-replace in WordPad (or some other text editor, such as vim). Be sure to save the file just as it is though (don’t save it as a Unicode text file!).

6. From this point, we will start messing with the database data, so it would be prudent to back up the database if you haven’t already done so. Then, restore the dump into the database.

7. Search for any records that may not have converted properly and correct them. Since non-ASCII characters are multi-byte by design, we can find them by comparing the byte length to the character length (in MySQL, LENGTH() against CHAR_LENGTH()), i.e., to identify rows that may hold double-encoded UTF-8 characters that need to be fixed.

a. See if there are any records with multi-byte characters (if this query returns zero, then there don’t appear to be any records with multi-byte characters in your table and you can proceed to step 8).

b. Copy rows with multi-byte characters into a temporary table.

c. Convert double-encoded UTF-8 characters to proper UTF-8 characters.

This is actually a bit tricky. A double-encoded string is one that was properly encoded as UTF-8. However, MySQL then did us the erroneous favor of converting it (from what it thought was latin1) to UTF-8 again, when we set the column to UTF-8 encoding. Resolving this therefore requires a two step process through which we “trick” MySQL in order to preclude it from doing us this “favor”.

First, we set the encoding type for the column back to latin1, thereby removing the double encoding.

Note: Be sure to use the correct field type for your table. For our table, the correct field type for ‘ArtistName’ was varchar(128), but the field in your table could be text or any other type. Be sure to specify it properly!

The problem is that now, if we set the column encoding back to UTF-8, MySQL will run the latin1 to UTF-8 data encoding for us again and we’ll be back to where we started. To avoid this, we change the column type to blob and THEN we set it to UTF-8. This exploits the fact that MySQL will not attempt to encode a blob. We are thereby able to “fool” the MySQL charset conversion to avoid the double encoding issue.

(Again, as noted above, be sure to use the proper field type for your table.)

d. Remove rows with only single-byte characters from the temporary table.

e. Re-insert fixed rows back into the original table (before doing this, you may want to run some selects on the temptable to verify that it appears to be properly corrected, just as a sanity check).

8. Verify the remaining data and, if necessary, repeat the process in step 7 (this could be necessary, for example, if the data was triple encoded). Further errors, if any, may be easiest to resolve manually.
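
Double encoding is easier to recognise once it has been reproduced on a small string. The sketch below does this in PHP with mbstring rather than in MySQL, and it uses ISO-8859-1 as a stand-in for MySQL's latin1, which is an approximation (the two differ for some bytes in the 0x80 to 0x9F range). The sample word is invented for illustration.

```php
<?php
$original = "caf\u{00E9}"; // "café" in correct UTF-8: 4 characters, 5 bytes

// Simulate the bug: the UTF-8 bytes are read as latin1 and encoded again.
$double = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Undo one layer: map each character back to the single byte it came from.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');

foreach (['original' => $original, 'double' => $double, 'repaired' => $repaired] as $label => $text) {
    // Byte length versus character length, the same comparison used in step 7.
    printf("%-9s bytes=%d chars=%d hex=%s\n", $label, strlen($text), mb_strlen($text, 'UTF-8'), bin2hex($text));
}
```

Triple-encoded data would need the repair conversion applied twice, which mirrors the advice to repeat step 7.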

Source code and resource files

One other thing to remember and verify is that your source code files, resources files, and so on, are all being saved properly with UTF-8 data encoding. Otherwise, any “special” characters in these files may not be handled correctly.

In NetBeans, for example, you can right-click on your project, choose Properties and then in “Sources” you will find the data encoding option (it usually defaults to UTF-8, but it’s worth checking).

Or in Windows Notepad, use the “Save As…” option in the File menu, and select the UTF-8 encoding option at the bottom of the dialog. (Note that the “Unicode” option that Notepad provides is actually UTF-16, so that’s not what you want.)

Wrap-up

Although it can be somewhat tedious, taking the time to go through these steps to systematically address your MySQL and PHP UTF-8 data encoding issues can ultimately save you a great deal of time and grief. In the long run, this type of methodical approach is far superior to the all-too-common tendency to just keep patching the system.

This guide hopefully emphasizes the importance of taking the charset definition into consideration when setting up a project environment in the first place and working in a software project environment that properly accounts for character encoding in its manipulation of text and strings.

Understanding the basics

What is UTF-8 character set?

Defined by the Unicode standard, UTF-8 is a variable-width character encoding built on 8-bit code units and capable of storing any Unicode character. It is backwards compatible with ASCII.

What does UTF-8 stand for?

UTF is short for Unicode Transformation Format, while the “8” suffix denotes the use of 8-bit blocks to represent characters.

How to insert Unicode characters in MySQL using PHP?

In order to insert Unicode characters in MySQL, you need to create a table with Unicode support, select the appropriate encoding/collation settings, and specify the charset in the MySQL connection. Then, you can proceed and employ PHP code to insert Unicode as you please.
