In a DTD, you can declare that an element may contain both #PCDATA and other elements. This kind of content is called mixed. To declare mixed content, list #PCDATA together with the allowed child elements.

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE DOCUMENT [
      <!ELEMENT DOCUMENT (CUSTOMER)*>
      <!ELEMENT CUSTOMER (NAME,DATE,ORDERS)>
      <!ELEMENT NAME (LAST_NAME,FIRST_NAME)>
      <!ELEMENT LAST_NAME (#PCDATA)>
      <!ELEMENT FIRST_NAME (#PCDATA)>
      <!ELEMENT DATE (#PCDATA)>
      <!ELEMENT ORDERS (ITEM)*>
      <!ELEMENT ITEM (PRODUCT, NUMBER, PRICE)>
      <!-- mixed -->
      <!ELEMENT PRODUCT (#PCDATA | PRODUCT_ID)*>
      <!ELEMENT NUMBER (#PCDATA)>
      <!ELEMENT PRICE (#PCDATA)>
      <!ELEMENT PRODUCT_ID (#PCDATA)>
    ]>
    <DOCUMENT>
      <CUSTOMER>
        <NAME>
          <LAST_NAME>Smith</LAST_NAME>
          <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
        <DATE>October 15, 2003</DATE>
        <ORDERS>
          <ITEM>
            <PRODUCT>Tomatoes</PRODUCT>
            <NUMBER>8</NUMBER>
            <PRICE>$1.25</PRICE>
          </ITEM>
          <ITEM>
            <PRODUCT>
              <PRODUCT_ID> 124829548702121 </PRODUCT_ID>
            </PRODUCT>
            <NUMBER>24</NUMBER>
            <PRICE>$4.98</PRICE>
          </ITEM>
        </ORDERS>
      </CUSTOMER>
    </DOCUMENT>

When checking the file with validators (.NET XML Parser, MSXML SAX, MSXML DOM, Java built-in), I noticed that validation passes if #PCDATA is first in the list. If any element name precedes #PCDATA, validation errors appear (each parser words them differently, but the gist is the same).

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE DOCUMENT [
      <!ELEMENT DOCUMENT (CUSTOMER)*>
      <!ELEMENT CUSTOMER (NAME,DATE,ORDERS)>
      <!ELEMENT NAME (LAST_NAME,FIRST_NAME)>
      <!ELEMENT LAST_NAME (#PCDATA)>
      <!ELEMENT FIRST_NAME (#PCDATA)>
      <!ELEMENT DATE (#PCDATA)>
      <!ELEMENT ORDERS (ITEM)*>
      <!ELEMENT ITEM (PRODUCT, NUMBER, PRICE)>
      <!-- mixed -->
      <!-- error. Why? -->
      <!ELEMENT PRODUCT (NUMBER | #PCDATA | PRODUCT_ID)*>
      <!ELEMENT NUMBER (#PCDATA)>
      <!ELEMENT PRICE (#PCDATA)>
      <!ELEMENT PRODUCT_ID (#PCDATA)>
    ]>
    <DOCUMENT>
      <CUSTOMER>
        <NAME>
          <LAST_NAME>Smith</LAST_NAME>
          <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
        <DATE>October 15, 2003</DATE>
        <ORDERS>
          <ITEM>
            <PRODUCT>Tomatoes</PRODUCT>
            <NUMBER>8</NUMBER>
            <PRICE>$1.25</PRICE>
          </ITEM>
          <ITEM>
            <PRODUCT>
              <PRODUCT_ID> 124829548702121 </PRODUCT_ID>
            </PRODUCT>
            <NUMBER>24</NUMBER>
            <PRICE>$4.98</PRICE>
          </ITEM>
        </ORDERS>
      </CUSTOMER>
    </DOCUMENT>

Why must #PCDATA come first in a mixed content declaration?

  • Would the answer "because the specification says so" satisfy you? - Roman
  • @Roman Thank you very much for the link! Unfortunately, it does not say there that #PCDATA must always come first, which is why I asked the question. The specification only gives this production: Mixed ::= '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*' | '(' S? '#PCDATA' S? ')' That is, the specification says that the data element S may come first (or not, because of the ?). There are examples, of course: <!ELEMENT p (#PCDATA|a|ul|b|i|em)*> <!ELEMENT b (#PCDATA)> But they do not cover all possible cases. - java1cprog
  • Maybe the parsers are simply imperfect? There is still room for development, and this may not be implemented yet. - java1cprog
  • I figured it out. In the specification, S? stands for optional white space. Now everything is clear! - java1cprog

1 answer

The W3C XML specification, §3.2.2, defines mixed content as:

    [51] Mixed ::= '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*' | '(' S? '#PCDATA' S? ')'

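where S is white space (one or more space, tab, carriage-return, or line-feed characters) and S? means that it is optional. In this production the literal '#PCDATA' immediately follows the opening parenthesis, with only optional white space in between, so the grammar itself requires #PCDATA to come first; element names can only follow it, each after a '|'. A declaration such as (NUMBER | #PCDATA | PRODUCT_ID)* matches neither the mixed-content production nor the element-content production, because #PCDATA is not a Name, so parsers reject it.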
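
For illustration, here is a minimal sketch of a corrected declaration. It assumes the intent was for PRODUCT to allow text together with NUMBER and PRODUCT_ID children. The reduced document, with PRODUCT as the root element, is made up for this example and is not taken from the question.

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE PRODUCT [
  <!-- #PCDATA comes first; the element names follow, each after a '|' -->
  <!ELEMENT PRODUCT (#PCDATA | NUMBER | PRODUCT_ID)*>
  <!ELEMENT NUMBER (#PCDATA)>
  <!ELEMENT PRODUCT_ID (#PCDATA)>
]>
<!-- Hypothetical instance: text and child elements mixed in one PRODUCT -->
<PRODUCT>Tomatoes <PRODUCT_ID>124829548702121</PRODUCT_ID> <NUMBER>8</NUMBER></PRODUCT>
```

The order of the names after #PCDATA does not restrict the document: in mixed content, the listed children may appear in any order and any number of times, interleaved with text. The only constraints are that #PCDATA is first and that no element name is listed twice in the same declaration.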