How to insert a line break in Java

Chapter 3. Lexical Structure

This chapter specifies the lexical structure of the Java programming language.

Programs are written in Unicode (§3.1), but lexical translations are provided (§3.2) so that Unicode escapes (§3.3) can be used to include any Unicode character using only ASCII characters. Line terminators are defined (§3.4) to support the different conventions of existing host systems while maintaining consistent line numbers.

The Unicode characters resulting from the lexical translations are reduced to a sequence of input elements (§3.5), which are white space (§3.6), comments (§3.7), and tokens. The tokens are the identifiers (§3.8), keywords (§3.9), literals (§3.10), separators (§3.11), and operators (§3.12) of the syntactic grammar.

3.1. Unicode

Programs are written using the Unicode character set (§1.7). Information about this character set and its associated character encodings may be found at https://www.unicode.org/.

The Java SE Platform tracks the Unicode Standard as it evolves. The precise version of Unicode used by a given release is specified in the documentation of the class Character.

Versions of the Java programming language prior to JDK 1.1 used Unicode 1.1.5. Upgrades to newer versions of the Unicode Standard occurred in JDK 1.1 (to Unicode 2.0), JDK 1.1.7 (to Unicode 2.1), Java SE 1.4 (to Unicode 3.0), Java SE 5.0 (to Unicode 4.0), Java SE 7 (to Unicode 6.0), Java SE 8 (to Unicode 6.2), Java SE 9 (to Unicode 8.0), Java SE 11 (to Unicode 10.0), Java SE 12 (to Unicode 11.0), Java SE 13 (to Unicode 12.1), and Java SE 15 (to Unicode 13.0).

The Unicode standard was originally designed as a fixed-width 16-bit character encoding. It has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, using the hexadecimal U+n notation. Characters whose code points are greater than U+FFFF are called supplementary characters. To represent the complete range of characters using only 16-bit units, the Unicode standard defines an encoding called UTF-16. In this encoding, supplementary characters are represented as pairs of 16-bit code units, the first from the high-surrogates range (U+D800 to U+DBFF), and the second from the low-surrogates range (U+DC00 to U+DFFF). For characters in the range U+0000 to U+FFFF, the values of code points and UTF-16 code units are the same.

The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding.

Some APIs of the Java SE Platform, primarily in the Character class, use 32-bit integers to represent code points as individual entities. The Java SE Platform provides methods to convert between 16-bit and 32-bit representations.
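The difference between code points and UTF-16 code units can be seen in a small sketch. The class name CodePointDemo and the choice of U+1F600 as a sample supplementary character are assumptions made for illustration; the Character and String methods used are standard Java SE APIs.

```java
class CodePointDemo {
    // U+1F600 lies above U+FFFF, so it is a supplementary character.
    static final int SAMPLE_CODE_POINT = 0x1F600;

    static String fromCodePoint(int codePoint) {
        // Character.toChars yields one or two UTF-16 code units.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(SAMPLE_CODE_POINT);
        System.out.println(s.length());                              // UTF-16 code units: 2
        System.out.println(s.codePointCount(0, s.length()));         // code points: 1
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));   // true
        System.out.println(Integer.toHexString(s.codePointAt(0)));   // 1f600
    }
}
```

For a character in the range U+0000 to U+FFFF the same helper returns a string of length 1, since the code point and the code unit coincide.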

This specification uses the terms code point and UTF-16 code unit where the representation is relevant, and the generic term character where the representation is irrelevant to the discussion.

Except for comments (§3.7), identifiers (§3.8), and the contents of character literals, string literals, and text blocks (§3.10.4, §3.10.5, §3.10.6), all input elements (§3.5) in a program are formed only from ASCII characters (or Unicode escapes (§3.3) which result in ASCII characters).

ASCII (ANSI X3.4) is the American Standard Code for Information Interchange. The first 128 characters of the Unicode UTF-16 encoding are the ASCII characters.

3.2. Lexical Translations

A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:

1. A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.

2. A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators (§3.4).

3. A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of input elements (§3.5) which, after white space (§3.6) and comments (§3.7) are discarded, comprise the tokens that are the terminal symbols of the syntactic grammar (§2.3).

The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would. There are two exceptions to account for situations that need more granular translation: in step 1, for the processing of contiguous \ characters (§3.3), and in step 3, for the processing of contextual keywords and adjacent > characters (§3.5).

The input characters a--b are tokenized as a, --, and b, which is not part of any grammatically correct program, even though the tokenization a, -, -, b could be part of a grammatically correct program. The tokenization a, -, -, b can be realized with the input characters a- -b (with an ASCII SP character between the two - characters).
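A minimal sketch of the longest-match rule in practice. The class and method names are hypothetical; the point is only that a space between the two minus signs yields two separate tokens, which the compiler accepts.

```java
class LongestMatchDemo {
    static int subtractNegated(int a, int b) {
        // The space splits the input into two '-' tokens:
        // a binary minus followed by a unary minus.
        return a - -b;
    }

    public static void main(String[] args) {
        // Without the space, the lexer would produce the single token "--"
        // and the expression would be rejected by the compiler.
        System.out.println(subtractNegated(5, 2)); // prints 7
    }
}
```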

It might be supposed that the raw input \\u1234 is translated to a \ character and (following the "longest possible" rule) a Unicode escape of the form \u1234 . In fact, the leading \ character causes this raw input to be translated to seven distinct characters: \ \ u 1 2 3 4 .

3.3. Unicode Escapes

A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its raw input, translating the ASCII characters \u followed by four hexadecimal digits to a raw input character which denotes the UTF-16 code unit (§3.1) for the indicated hexadecimal value. One Unicode escape can represent characters in the range U+0000 to U+FFFF; representing supplementary characters in the range U+010000 to U+10FFFF requires two consecutive Unicode escapes. All other characters in the compiler's raw input are recognized as raw input characters and passed unchanged.

This translation step results in a sequence of Unicode input characters, all of which are raw input characters (any Unicode escapes having been reduced to raw input characters).

The \ , u , and hexadecimal digits here are all ASCII characters.

The UnicodeInputCharacter production is ambiguous because an ASCII \ character in the compiler’s raw input could be reduced to either a RawInputCharacter or the \ of a UnicodeEscape (to be followed by an ASCII u ). To avoid ambiguity, for each ASCII \ character in the compiler’s raw input, input processing must consider the most recent raw input characters that resulted from this translation step:

If the most recent raw input character in the result was itself translated from a Unicode escape in the compiler’s raw input, then the ASCII \ character is eligible to begin a Unicode escape.

For example, if the most recent raw input character in the result was a backslash that arose from a Unicode escape \u005c in the raw input, then an ASCII \ character appearing next in the raw input is eligible to begin another Unicode escape.

Otherwise, consider how many backslashes appeared contiguously as raw input characters in the result, back to a non-backslash character or the start of the result. (It is immaterial whether any such backslash arose from an ASCII \ character in the compiler’s raw input or from a Unicode escape \u005c in the compiler’s raw input.) If this number is even, then the ASCII \ character is eligible to begin a Unicode escape; if the number is odd, then the ASCII \ character is not eligible to begin a Unicode escape.

For example, the raw input "\\u2122=\u2122" results in the eleven characters " \ \ u 2 1 2 2 = ™ " because while the second ASCII \ character in the raw input is not eligible to begin a Unicode escape, the third ASCII \ character is eligible, and \u2122 is the Unicode encoding of the character ™ .

If an eligible \ is not followed by u , then it is treated as a RawInputCharacter and remains part of the escaped Unicode stream.

If an eligible \ is followed by u , or more than one u , and the last u is not followed by four hexadecimal digits, then a compile-time error occurs.

The character produced by a Unicode escape does not participate in further Unicode escapes.

For example, the raw input \u005cu005a results in the six characters \ u 0 0 5 a , because 005c is the Unicode value for a backslash. It does not result in the character Z , which is Unicode value 005a , because the backslash that resulted from processing the Unicode escape \u005c is not interpreted as the start of a further Unicode escape.

Note that \u005cu005a cannot be written in a string literal to denote the six characters \ u 0 0 5 a . This is because the first two characters resulting from translation, \ and u , are interpreted in a string literal as an illegal escape sequence (§3.10.7).

Fortunately, the rule about contiguous backslash characters helps programmers to craft raw inputs that denote Unicode escapes in a string literal. Denoting the six characters \ u 0 0 5 a in a string literal simply requires another \ to be placed adjacent to the existing \ , such as "\\u005a is Z" . This works because the second \ in the raw input \\u005a is not eligible to begin a Unicode escape, so the first \ and the second \ are preserved as raw input characters, as are the next five characters u 0 0 5 a . The two \ characters are subsequently interpreted in a string literal as the escape sequence for a backslash, resulting in a string with the desired six characters \ u 0 0 5 a . Without the rule, the raw input \\u005a would be processed as a raw input character \ followed by a Unicode escape \u005a which becomes a raw input character Z ; this would be unhelpful because \Z is an illegal escape sequence in a string literal. (Note that the rule translates \u005c\u005c to \\ because the translation of the first Unicode escape to a raw input character \ does not prevent the translation of the second Unicode escape to another raw input character \ .)

The rule also allows programmers to craft raw inputs that denote escape sequences in a string literal. For example, the raw input \\\u006e results in the three characters \ \ n because the first \ and the second \ are preserved as raw input characters, while the third \ is eligible to begin a Unicode escape and thus \u006e is translated to a raw input character n . The three characters \ \ n are subsequently interpreted in a string literal as \ n which denotes the escape sequence for a linefeed. (Note that \\\u006e may be written as \u005c\u005c\u006e because each Unicode escape \u005c is translated to a raw input character \ and so the remaining raw input \u006e is preceded by an even number of backslashes and processed as the Unicode escape for n .)
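The effect of the contiguous-backslash rule on string literals can be checked with a short sketch. The class and method names are hypothetical; the three literals are the ones discussed in the examples above.

```java
class UnicodeEscapeDemo {
    // One Unicode escape, translated before the literal is tokenized.
    static String escapeForZ() {
        return "\u005a";
    }

    // Two backslashes: the second one cannot begin a Unicode escape,
    // so the literal holds a backslash followed by u 0 0 5 a.
    static String literalBackslashU() {
        return "\\u005a";
    }

    // Three backslashes: the third one is eligible, so the escape becomes n,
    // and the literal holds a backslash followed by n (not a linefeed).
    static String backslashThenN() {
        return "\\\u006e";
    }

    public static void main(String[] args) {
        System.out.println(escapeForZ());                 // prints Z
        System.out.println(literalBackslashU().length()); // prints 6
        System.out.println(backslashThenN().length());    // prints 2
    }
}
```

Because Unicode escapes are translated everywhere in a source file, including comments, the comments in this sketch deliberately avoid writing a backslash directly before the letter u.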

The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u (for example, \uxxxx becomes \uuxxxx) while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.

This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.

A Java compiler should use the \uxxxx notation as an output format to display Unicode characters when a suitable font is not available.
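One half of the ASCII transformation described above, replacing non-ASCII characters with single-u escapes, might be sketched as follows. The class AsciiEscaper and its method are hypothetical helpers written for illustration; the sketch does not add the extra u to escapes already present in the input, so it is not a complete implementation of the transformation.

```java
class AsciiEscaper {
    // Replaces every non-ASCII UTF-16 code unit with a single-u Unicode escape.
    // Escapes already present in the input are left untouched (assumption:
    // the input contains none), so only part of the transformation is covered.
    static String escapeNonAscii(String source) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < source.length(); i++) {
            char unit = source.charAt(i);
            if (unit < 128) {
                out.append(unit);
            } else {
                // Four lowercase hexadecimal digits, zero-padded.
                out.append(String.format("\\u%04x", (int) unit));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The copyright sign is written as an escape to keep this file ASCII-only.
        System.out.println(escapeNonAscii("\u00a9 example"));
    }
}
```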

Escaping characters

A long time ago you learned that to write a string of characters in code, you wrap those characters in double quotes: the result is a string literal.

But what if the double quotes need to appear inside the string literal? A string that contains quotes: what could be simpler?

Suppose we want to print the text The film "Friends" was nominated for an "Oscar". How do we do that?

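Code	Notes
"The film "Friends" was nominated for an "Oscar""	This version will not work!

The problem is that, as far as the compiler is concerned, quite different code is written here:

Code	Notes
"The film " Friends " was nominated for an " Oscar ""	This version will not work!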

Once the compiler encounters a double quote in the code, it treats it as the start of a string literal. The next double quote is treated as the end of that literal.

So how do we write double quotes inside a literal?

2. Escaping characters

There is a way, and it even has a name: character escaping. You simply write the quotes inside the string and put a \ character (a backslash) in front of each of them.

Here is what a correctly written string literal looks like:

Code	Notes
"The film \"Friends\" was nominated for an \"Oscar\""	This will work!

The compiler will interpret everything correctly and will not treat quotes preceded by a backslash as string delimiters.

Moreover, if you print this string, the escaped quotes are processed correctly and the text appears without the backslashes: The film "Friends" was nominated for an "Oscar"

One more important point. A quote preceded by a backslash is a single character: we use this notation only so that the compiler can still recognize string literals in the code. You can assign a double quote to a char variable:

Code	Notes
'\"'	\" is one character, not two
'"'	this also works: a double quote inside single quotes
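The escaping of double quotes can be tried out with a small sketch. The class name QuoteEscapeDemo and the English sample text are assumptions made for illustration.

```java
class QuoteEscapeDemo {
    static String title() {
        // Each \" puts one double-quote character into the string.
        return "The film \"Friends\" was nominated for an \"Oscar\"";
    }

    public static void main(String[] args) {
        // Printed without any backslashes.
        System.out.println(title());

        char escaped = '\"'; // one character, written with an escape
        char plain = '"';    // the same character, no escape needed in a char literal
        System.out.println(escaped == plain); // prints true
    }
}
```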

3. Common cases of character escaping


Besides double quotes, there are many other characters that the compiler handles specially. A line break is one example.

How do you add a line break to a literal? There is a special combination for that as well.

If you want to add a line break to a string literal, you simply add the pair of characters \n.
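A minimal sketch of a string literal with an embedded line break. The class name and the sample text are assumptions made for illustration.

```java
class NewlineDemo {
    static String twoLines() {
        // \n inserts a single line feed character (U+000A) into the string.
        return "Hello,\nworld";
    }

    public static void main(String[] args) {
        // Printed as two lines: "Hello," and "world".
        System.out.println(twoLines());
    }
}
```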


Eight such special combinations, also called escape sequences, are used most often (the language defines a few others as well, such as octal escapes):

Code	Description
\t	Insert a tab character
\b	Insert a backspace character
\n	Insert a newline character
\r	Insert a carriage return character
\f	Insert a form feed character
\'	Insert a single quote
\"	Insert a double quote
\\	Insert a backslash

You have already met two of them. What do the other six mean?
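Each escape sequence in the table stands for exactly one character, which can be confirmed by looking at the numeric codes. The class name EscapeCodes is an assumption made for illustration.

```java
class EscapeCodes {
    // Character codes in the same order as the table:
    // tab, backspace, newline, carriage return, form feed,
    // single quote, double quote, backslash.
    static int[] codes() {
        return new int[] {'\t', '\b', '\n', '\r', '\f', '\'', '\"', '\\'};
    }

    public static void main(String[] args) {
        // Prints 9, 8, 10, 13, 12, 39, 34, 92, one per line.
        for (int code : codes()) {
            System.out.println(code);
        }
    }
}
```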

Tab character: \t

In text, this character is equivalent to pressing the Tab key while typing. It shifts the text that follows it in order to align it.
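A short sketch of tab-separated output. The class name and the sample data are hypothetical, and how far the tab shifts the text depends on the console that displays it.

```java
class TabDemo {
    static String row(String name, int quantity) {
        // \t inserts a single tab character between the two columns.
        return name + "\t" + quantity;
    }

    public static void main(String[] args) {
        System.out.println(row("apples", 3));
        System.out.println(row("pears", 12));
    }
}
```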


Backspace: \b

In text, this character is equivalent to pressing the Backspace key while typing. It removes the last character output before it, although the exact effect depends on the console that displays the text.


Carriage return: \r

This character moves the cursor to the start of the current line without changing the text. Text output afterwards overwrites what is already there.


Form feed: \f

This character has come down to us from the era of the first dot-matrix printers. Sending it to a printer made the printer feed the current sheet through without printing until a new sheet began.

Today we would call it a page break or a new page.

Backslash: \\

This one is simple. If we use the backslash to escape characters in text, how do we write the backslash character itself in a string?

To add a backslash to the text, write it twice in a row.

Code	Screen output
The compiler will complain about unknown escape sequences.
This is the correct way!
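A minimal sketch of doubled backslashes in a string literal. The class name and the Windows-style path are hypothetical and serve only to illustrate the notation.

```java
class BackslashDemo {
    static String windowsPath() {
        // Each \\ in the source becomes a single backslash in the string.
        // Writing single backslashes here would not compile, because the
        // compiler would read them as the start of unknown escape sequences.
        return "C:\\Users\\guest\\notes.txt";
    }

    public static void main(String[] args) {
        // The printed path contains three single backslashes.
        System.out.println(windowsPath());
    }
}
```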

4. Unicode encoding

As you already know, every character shown on the screen corresponds to a specific numeric code. A standardized set of such codes is called an encoding.

Long ago, in the early days of computing, seven bits (less than one byte) were enough to encode all characters: one of the early encodings contained only 128 characters. It was called ASCII.

ASCII (American Standard Code for Information Interchange) is the American standard code table for printable characters and some special codes.

It consisted of 33 non-printing control characters (which affect how text and spacing are processed) and 95 printable characters, including digits, uppercase and lowercase Latin letters, and a number of punctuation marks.


The growing popularity of computers led each country to release its own encoding. Usually ASCII was taken as the base, and rarely used characters were replaced with characters of national alphabets.

Over time an idea emerged: create a single encoding that holds all the characters of all the world's encodings.


The first version of the Unicode standard was published in 1991, and Java was one of the first programming languages to adopt it as the standard for storing text. Today Unicode is the standard for the entire IT industry.

Although Unicode is a single standard, it has several encoding forms (Unicode transformation formats, UTF), including UTF-8, UTF-16, and UTF-32.

Java uses UTF-16: each char value is a 16-bit (2-byte) code unit, so a single char can take one of 65,536 values. Characters with code points above U+FFFF are represented by a pair of code units (a surrogate pair).

Unicode contains almost all the characters of all the world's alphabets. Naturally, nobody knows it by heart: you cannot know everything, but you can look everything up.

To write a Unicode character in program code by its code, write \u followed by exactly four hexadecimal digits of the code. For example, \u00A9
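A short sketch of a character written by its Unicode code. The class and method names are hypothetical; U+00A9 is the copyright sign.

```java
class CopyrightDemo {
    static char copyrightSign() {
        // The escape is replaced by the character U+00A9 before compilation proper.
        return '\u00A9';
    }

    public static void main(String[] args) {
        System.out.println((int) copyrightSign());                      // prints 169
        System.out.println(Integer.toHexString(copyrightSign()));       // prints a9
    }
}
```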

How to add a line break to a String

NetBeans suggests only string concatenation. But how do I actually write the string so that it includes a line break?

Use System.lineSeparator(). It produces the correct line break on both Windows and Linux.
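A minimal sketch of this answer. The class and method names are hypothetical; System.lineSeparator() is the standard Java SE call that returns the line separator of the current platform.

```java
class LineSeparatorDemo {
    static String joinLines(String first, String second) {
        // The separator is "\n" on Unix-like systems and "\r\n" on Windows.
        return first + System.lineSeparator() + second;
    }

    public static void main(String[] args) {
        System.out.println(joinLines("first line", "second line"));
    }
}
```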

Writing \n inside the string literal (or the same thing with concatenation) will be the most common answer, and it applies to many C-like languages. The line break can differ between platforms and their versions, but Java uses Unicode strings, so \n works.

However, in formatted strings, use the "%n" specifier to insert the platform-dependent line separator.

With a format string that contains two "%s" placeholders, the second and third arguments of format() are substituted for them.
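A sketch of the "%n" specifier with two "%s" placeholders. The class name, method name, and format string are assumptions made for illustration, not the code from the original answer.

```java
class FormatNewlineDemo {
    static String twoLines(String first, String second) {
        // %n expands to the platform line separator; each %s takes one argument.
        return String.format("%s%n%s", first, second);
    }

    public static void main(String[] args) {
        System.out.println(twoLines("first line", "second line"));
    }
}
```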


Printing a new line in Java

In this post we look at how to print a new line in Java.

A newline (also called end of line (EOL), line feed, or line break) marks the end of one line and the start of a new one. Different operating systems represent a newline with one or two control characters. On Unix/Linux and macOS, a newline is represented by "\n"; on Microsoft Windows, by "\r\n"; and on classic Mac OS, by "\r".
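Text that mixes these conventions can still be split into lines in one pass. The sketch below relies on the \R linebreak matcher of java.util.regex (available since Java 8); the class name and the sample input are hypothetical.

```java
class LineBreakSplitter {
    static String[] splitLines(String text) {
        // \R matches "\r\n" as a single unit, as well as a lone "\n" or "\r".
        return text.split("\\R");
    }

    public static void main(String[] args) {
        String mixed = "unix\nwindows\r\nclassic mac\rend";
        // Prints four lines: unix, windows, classic mac, end.
        for (String line : splitLines(mixed)) {
            System.out.println(line);
        }
    }
}
```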

1. Using platform-dependent newline characters

A commonly used approach is to hard-code platform-dependent newline characters, for example "\n" on Unix and "\r\n" on Windows. The problem with this approach is that your program will not be portable.