How to properly parse the XML if the text inside the tags has line breaks?

Question

There is an XML file containing the following construction:

<text>
Какой-то текст
в две строки
</text>

If you strictly follow the specification, how should the contents of the <text> element be read?

A) 'Какой-то текст в две строки'
B) #32'Какой-то текст в две строки'#32
C) #10'Какой-то текст'#10'в две строки'#10
D) 'Какой-то текст'#10'в две строки'
E) Another option (which one)?

I have read the specification, but with that many cross-references and my level of English I could not find an unambiguous answer.

What programming language do you wish to use in solving this problem?

Community spirit ♦ one · Accepted Answer · 2016-05-20T13:47:54

The relevant paragraphs of the specification say:

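In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, "significant" white space that should be preserved in the delivered version is common, for example in poetry and source code.
An XML processor MUST always pass all characters in a document that are not markup through to the application.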

My translation:

When editing XML documents, it is often convenient to use "whitespace" (spaces, tabs, blank lines) to set the markup apart and improve readability. Such whitespace is usually not meant to be included in the delivered version of the document. On the other hand, there is often significant whitespace that must be preserved, for example in poetry or program code.
The XML processor must pass to the application all characters in the document that are not markup.

(There is one detail, however: non-standard line endings are normalized to a single \n / #xA / 10. Thanks to @ru-volt.)
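The normalization mentioned in the note above can be sketched in a few lines of plain Ruby. This is only an illustration of the XML 1.0 end-of-line rule (a CR LF pair, or a CR not followed by LF, becomes a single LF), not a parser; `normalize_line_ends` is a hypothetical helper name, and XML 1.1 treats a few more characters as line ends.

```ruby
# Sketch of XML 1.0 end-of-line handling (section 2.11), not a real parser.
# normalize_line_ends is a hypothetical helper name used for illustration.
def normalize_line_ends(raw)
  # A CR LF pair and a CR not followed by LF both become a single LF (#xA).
  raw.gsub(/\r\n?/, "\n")
end

windows = "<text>\r\nКакой-то текст\r\nв две строки\r\n</text>"
unix    = "<text>\nКакой-то текст\nв две строки\n</text>"

# After normalization the two inputs are indistinguishable to the application.
puts normalize_line_ends(windows) == unix  # true
```

So the application sees #10 line breaks regardless of how the file was saved.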

The markup in this case consists only of the tags, delimited by angle brackets. Therefore everything between the end of the opening tag (>) and the start of the closing tag (<), including whitespace characters, must be preserved in the text.

In fact, libxml (called from Ruby through Nokogiri) does exactly that:

    require 'nokogiri'
    Nokogiri::XML(<<XML).first_element_child.text
    <text>
    Какой-то текст
    в две строки
    </text>
    XML

    "\nКакой-то текст\nв две строки\n"

Here Ruby displays the string in escaped form: \n is the newline character, and the double quotes are part of the representation, not part of the data. The technology you use may display such strings differently.
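To separate the data from its printed representation, here is a small sketch in plain Ruby. The string literal is typed in by hand to match the result shown above (an assumption standing in for the parser's return value), so no XML library is needed to run it.

```ruby
# The value is written out by hand to match the result above;
# in real code it would come from the XML parser.
text = "\nКакой-то текст\nв две строки\n"

puts text.inspect      # escaped, quoted representation (exact form depends on the environment)
puts text.count("\n")  # 3 actual line-feed characters in the data
puts text.ord          # 10: the first character is #10 / #xA, not a backslash
p text.lines.length    # 3 lines: an empty first line plus the two lines of text
```

In the notation used in the question, this is the variant with #10 before, between, and after the two lines of text.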

This section may also be relevant here: w3.org/TR/2004/REC-xml11-20040204/#sec-line-ends Roughly speaking, it says "translate all line breaks into an ordinary line feed".
It also says that in attribute values all line breaks are replaced with spaces, and that spaces at the beginning and end are removed.
@Alekcvp There is no point in describing this particular case separately if the more general rule is sufficient.
@D-side Yes, but there are nuances: for example, #xD #xA is replaced with #xA before the file is parsed, and this normalization implicitly contradicts the quoted text.
But ultimately, line breaks remain line breaks, just replaced with standard ones.
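The attribute rule mentioned in the comments above can be sketched as follows. This is a simplified reading of attribute-value normalization in the XML specification: it ignores character and entity references, which a real processor has to handle, and `normalize_attribute` with its `cdata:` flag is a hypothetical helper, not a library API.

```ruby
# Simplified sketch of attribute-value normalization (XML 1.0, section 3.3.3).
# Character and entity references are deliberately ignored here.
# normalize_attribute and its cdata: flag are hypothetical, for illustration only.
def normalize_attribute(raw, cdata: true)
  # Line ends are normalized first, then each white space character becomes a space.
  value = raw.gsub(/\r\n?/, "\n").gsub(/[\t\n\r]/, " ")
  return value if cdata

  # Non-CDATA attribute types also collapse runs of spaces and drop
  # the spaces at the beginning and end of the value.
  value.squeeze(" ").gsub(/\A | \z/, "")
end

p normalize_attribute("a\r\nb")                   # "a b"
p normalize_attribute("  a \n b ", cdata: false)  # "a b"
```

Note that trimming applies only to attributes declared with a non-CDATA type; as far as I understand the specification, attributes without a declaration are treated as CDATA, so only the replacement with spaces takes place.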
