I need help parsing HTML pages. The program lets users attach comments that are stored as an HTML page. The HTML holds whatever the user put there: mostly text, tables, and lists. For example:

<BODY> <P>Комментарий</P> <P>1.</P> <P> <TABLE cellSpacing=0 cellPadding=0 width=249 border=1> <COLGROUP> <COL span=3 width=83> <TBODY> <TR height=20> <TD class=xl66 height=20 width=83> <FONT face=Calibri>&nbsp;1:1</FONT> </TD> <TD class=xl64 width=83> <FONT face=Calibri>&nbsp;1:2</FONT> </TD> <TD class=xl65 width=83> <FONT face=Calibri>1:3&nbsp;</FONT> </TD> </TR> <TR height=20> <TD class=xl67 height=20> <FONT face=Calibri>&nbsp;2:1</FONT> </TD> <TD class=xl63> <FONT face=Calibri>&nbsp;2:2</FONT> </TD> <TD class=xl63> <FONT face=Calibri>&nbsp;2:3</FONT> </TD> </TR> <TR height=20> <TD class=xl63 height=20> <FONT face=Calibri>&nbsp;</FONT> </TD> <TD class=xl63> <FONT face=Calibri>&nbsp;</FONT> </TD> <TD class=xl63> <FONT face=Calibri>&nbsp;</FONT> </TD> </TR> <TR height=20> <TD class=xl63 height=20> <FONT face=Calibri>&nbsp;</FONT> </TD> <TD class=xl63> <FONT face=Calibri>&nbsp;</FONT></TD> <TD class=xl63> <FONT face=Calibri>&nbsp;</FONT> </TD> </TR> </TBODY> </TABLE>&nbsp; </P> <P>2.</P> <P> <TABLE cellSpacing=0 cellPadding=0 width=192 border=1> <COLGROUP> <COL span=3 width=64> <TBODY> <TR height=20> <TD class=xl66 height=40 rowSpan=2 width=64> <FONT face=Calibri>&nbsp;1:1</FONT> </TD> <TD class=xl64 width=128 colSpan=2> <FONT face=Calibri>1:2</FONT> </TD> </TR> <TR height=20> <TD class=xl63 height=20> <FONT face=Calibri>2:2</FONT> </TD> <TD class=xl63> <FONT face=Calibri>&nbsp;2:3</FONT> </TD> </TR> <TR style="HEIGHT: 15pt" height=20> <TD class=xl63 height=20> <FONT face=Calibri>&nbsp;3:1</FONT> </TD> <TD class=xl63> <FONT face=Calibri>3:2</FONT> </TD> <TD class=xl63> <FONT face=Calibri>3:3</FONT> </TD> </TR> <TR height=20> <TD class=xl63 height=20> <FONT face=Calibri>&nbsp;</FONT> </TD> <TD class=xl63> <FONT face=Calibri>&nbsp;</FONT> </TD> <TD class=xl63> <FONT face=Calibri>&nbsp;</FONT> </TD> </TR> </TBODY> </TABLE> </P> <P>3.</P> <P> <TABLE cellSpacing=0 cellPadding=0 width=256 border=1> <COLGROUP> <COL span=4 width=64> <TBODY> <TR height=20> <TD class=xl63 
height=20 width=64 align=right><FONT face=Calibri>1</FONT></TD> <TD class=xl63 width=64 align=right><FONT face=Calibri>2</FONT></TD> <TD class=xl63 width=64 align=right><FONT face=Calibri>3</FONT></TD> <TD class=xl63 width=64 align=right><FONT face=Calibri>4</FONT></TD></TR> <TR height=20> <TD class=xl63 height=20 align=right><FONT face=Calibri>1</FONT></TD> <TD class=xl63 align=right><FONT color=#ff8000 face=Calibri>2</FONT></TD> <TD class=xl63 align=right> <P align=center><FONT color=#ff8000 face=Calibri>3</FONT></P></TD> <TD class=xl63 align=right><FONT color=#ff8000 face=Calibri>4</FONT></TD></TR> <TR height=20> <TD class=xl63 height=20 align=right><FONT color=#ff8000 face=Calibri>1</FONT></TD> <TD class=xl63 align=right><FONT color=#ff8000 face=Calibri>2</FONT></TD> <TD class=xl63 align=right><FONT face=Calibri>3</FONT></TD> <TD class=xl63 align=right><FONT face=Calibri>4</FONT></TD></TR> <TR style="HEIGHT: 15pt" height=20> <TD class=xl63 height=20 align=right><FONT face=Calibri>1</FONT></TD> <TD class=xl63 align=right><FONT face=Calibri>2</FONT></TD> <TD class=xl63 align=right> <P align=center><FONT face=Calibri>3</FONT></P></TD> <TD class=xl63 align=right><FONT face=Calibri>4</FONT></TD></TR> <TR height=20> <TD class=xl63 height=20 align=right><FONT face=Calibri>1</FONT></TD> <TD class=xl63 align=right><FONT face=Calibri>2</FONT></TD> <TD class=xl63 align=right><FONT face=Calibri>3</FONT></TD> <TD class=xl63 align=right><FONT face=Calibri>4</FONT></TD></TR> <TR height=20> <TD class=xl63 height=20 align=right><FONT face=Calibri>1</FONT></TD> <TD class=xl63 align=right><FONT face=Calibri>2</FONT></TD> <TD class=xl63 align=right> <P align=center><FONT face=Calibri>3</FONT></P></TD> <TD class=xl63 align=right><FONT face=Calibri>4</FONT></TD></TR></TBODY></TABLE></P> <P>&nbsp;4.</P> <P class=MsoListParagraphCxSpFirst style="MARGIN: 0cm 0cm 0pt 36pt; TEXT-INDENT: -18pt; mso-list: l0 level1 lfo1"><SPAN lang=EN-US style="FONT-FAMILY: Symbol; mso-fareast-font-family: 
Symbol; mso-bidi-font-family: Symbol; mso-ansi-language: EN-US"><SPAN style="mso-list: Ignore">·<SPAN style='FONT: 7pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </SPAN></SPAN></SPAN><SPAN lang=EN-US style="mso-ansi-language: EN-US"><FONT face=Calibri>Ntcn 1<?xml:namespace prefix = "o" ns = "urn:schemas-microsoft-com:office:office" /><o:p></o:p></FONT></SPAN></P> <P class=MsoListParagraphCxSpMiddle style="MARGIN: 0cm 0cm 0pt 36pt; TEXT-INDENT: -18pt; mso-list: l0 level1 lfo1"><SPAN lang=EN-US style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-ansi-language: EN-US"><SPAN style="mso-list: Ignore">·<SPAN style='FONT: 7pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </SPAN></SPAN></SPAN><SPAN lang=EN-US style="mso-ansi-language: EN-US"><FONT face=Calibri>Ntcn 2<o:p></o:p></FONT></SPAN></P> <P class=MsoListParagraphCxSpLast style="MARGIN: 0cm 0cm 8pt 36pt; TEXT-INDENT: -18pt; mso-list: l0 level1 lfo1"><SPAN lang=EN-US style="FONT-FAMILY: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol; mso-ansi-language: EN-US"><SPAN style="mso-list: Ignore">·<SPAN style='FONT: 7pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </SPAN></SPAN></SPAN><SPAN lang=EN-US style="mso-ansi-language: EN-US"><FONT face=Calibri>Ntcn 3<o:p></o:p></FONT></SPAN></P> <P class=MsoNormal style="MARGIN: 0cm 0cm 8pt"><SPAN lang=EN-US style="mso-ansi-language: EN-US"><o:p><FONT face=Calibri>&nbsp;</FONT></o:p></SPAN></P> <P class=MsoListParagraphCxSpFirst style="MARGIN: 0cm 0cm 0pt 36pt; TEXT-INDENT: -18pt; mso-list: l1 level1 lfo2"><SPAN lang=EN-US style="mso-bidi-font-family: Calibri; mso-ansi-language: EN-US; mso-bidi-theme-font: minor-latin"><SPAN style="mso-list: Ignore"><FONT face=Calibri>1.</FONT><SPAN style='FONT: 7pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </SPAN></SPAN></SPAN><SPAN lang=EN-US style="mso-ansi-language: EN-US"><FONT face=Calibri>Ntcn 
1<o:p></o:p></FONT></SPAN></P> <P class=MsoListParagraphCxSpMiddle style="MARGIN: 0cm 0cm 0pt 36pt; TEXT-INDENT: -18pt; mso-list: l1 level1 lfo2"><SPAN lang=EN-US style="mso-bidi-font-family: Calibri; mso-ansi-language: EN-US; mso-bidi-theme-font: minor-latin"><SPAN style="mso-list: Ignore"><FONT face=Calibri>2.</FONT><SPAN style='FONT: 7pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </SPAN></SPAN></SPAN><SPAN lang=EN-US style="mso-ansi-language: EN-US"><FONT face=Calibri>Ntcn 2<o:p></o:p></FONT></SPAN></P> <P class=MsoListParagraphCxSpLast style="MARGIN: 0cm 0cm 8pt 36pt; TEXT-INDENT: -18pt; mso-list: l1 level1 lfo2"><SPAN lang=EN-US style="mso-bidi-font-family: Calibri; mso-ansi-language: EN-US; mso-bidi-theme-font: minor-latin"><SPAN style="mso-list: Ignore"><FONT face=Calibri>3.</FONT><SPAN style='FONT: 7pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </SPAN></SPAN></SPAN><SPAN lang=EN-US style="mso-ansi-language: EN-US"><FONT face=Calibri>Ntcn 3</FONT></SPAN></P> <P class=MsoListParagraphCxSpLast style="MARGIN: 0cm 0cm 8pt 36pt; TEXT-INDENT: -18pt; mso-list: l1 level1 lfo2">&nbsp;</P> <P class=MsoListParagraphCxSpLast style="MARGIN: 0cm 0cm 8pt 36pt; TEXT-INDENT: -18pt; mso-list: l1 level1 lfo2">5.</P> <P class=MsoListParagraphCxSpLast style="MARGIN: 0cm 0cm 8pt 36pt; TEXT-INDENT: -18pt; mso-list: l1 level1 lfo2">&nbsp;</P> </BODY> 

Each such HTML comment is arbitrary and has no predefined structure. I need to walk the whole page in order and determine which blocks follow which, and then build my own representation for printing from that data. Because the comment is created by the user, the markup is irregular in places: in the example, the first table sits inside a <p>, and in the third table there are <p> tags inside <td> cells. The tables are created in Word / Excel and pasted into the comments from there, which is how they end up this crooked; the lists are also built from separate <p> tags. I would like to get something like this:

  1. Text (Comment)
  2. Text (1.)
  3. Table
  4. Text (2.)
  5. Table
  6. Text (3.)
  7. Table
  8. Text (4.)
  9. List (1.)
  10. List (2.)
  11. etc.
  • Ok, what's the problem? - Nick Volynkin
  • The problem is that I can't get what I want because the HTML does not follow the standard; for example, <table> cannot contain <table>. As a result I end up with a table inside a text block, when I would like just a table. - Nova
  • Can the user insert HTML markup into comments directly, or is the markup generated programmatically? If programmatically, is the resulting markup valid as XML (let's forget about the HTML standards for a moment)? If it is valid XML, you could write a clever XSLT transform and get valid HTML out of it. I don't think I've mixed anything up, but without answers to the first two questions this is only a guess. - rdorn
  • Comments are entered with this component: code.msdn.microsoft.com/windowsapps ... . If you create text and tables in Word / Excel and paste them into this component, you get HTML; the user has no access to the markup. - Nova

1 answer

Find a C# port of this marvel and learn to work with it: HTML Tidy. Searching NuGet for "Tidy" seems to turn up several more options, or you can try the libraries from these answers on the English Stack Overflow. I think this is the most you can squeeze out of invalid markup without reinventing the wheel.
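For illustration only, here is a minimal sketch of the in-order walk the question asks for. It is written in Python with the standard-library html.parser, not with HTML Tidy and not in C#, so treat it as an outline of the idea rather than the approach recommended above. The names BlockCollector and extract_blocks are hypothetical. Treating a <p> whose class starts with MsoListParagraph as a list item is an assumption taken from the sample markup. Nested tables are not handled.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collects text, list-item and table blocks in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []         # result: (kind, payload) tuples
        self.table_depth = 0     # > 0 while inside a <table>
        self.rows = []           # rows of the table being read
        self.cell = None         # text chunks of the open cell, or None
        self.para = None         # text chunks of the open paragraph, or None
        self.para_kind = "text"  # "text" or "list"

    def _flush_para(self):
        # Emit the pending paragraph unless it is empty or whitespace-only.
        if self.para is not None:
            text = " ".join("".join(self.para).split())
            if text:
                self.blocks.append((self.para_kind, text))
        self.para = None

    def _end_cell(self):
        # Close the open cell (if any) and store its normalized text.
        if self.cell is not None:
            self.rows[-1].append(" ".join("".join(self.cell).split()))
            self.cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._flush_para()   # a table ends any <p> wrapped around it
            self.table_depth += 1
            if self.table_depth == 1:
                self.rows = []
        elif self.table_depth:
            if tag == "tr":
                self._end_cell()
                self.rows.append([])
            elif tag in ("td", "th"):
                self._end_cell()
                if not self.rows:
                    self.rows.append([])
                self.cell = []
            # <p>, <font>, ... inside a cell are skipped; only text is kept
        elif tag == "p":
            self._flush_para()
            cls = dict(attrs).get("class") or ""
            # Assumption: Word marks list items with MsoListParagraph* classes.
            self.para_kind = "list" if cls.startswith("MsoListParagraph") else "text"
            self.para = []

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self._end_cell()
            self.table_depth -= 1
            if self.table_depth == 0:
                self.blocks.append(("table", self.rows))
        elif self.table_depth:
            if tag in ("td", "th"):
                self._end_cell()
        elif tag == "p":
            self._flush_para()

    def handle_data(self, data):
        if self.table_depth:
            if self.cell is not None:
                self.cell.append(data)
        else:
            if self.para is None:  # text outside any <p>
                self.para, self.para_kind = [], "text"
            self.para.append(data)


def extract_blocks(html):
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector._flush_para()
    return collector.blocks


sample = (
    "<BODY><P>Comment</P><P>1.</P>"
    "<P><TABLE><TR><TD><FONT>&nbsp;1:1</FONT></TD>"
    "<TD><P>1:2</P></TD></TR></TABLE>&nbsp;</P>"
    "<P class=MsoListParagraphCxSpFirst>Item 1</P></BODY>"
)
for kind, payload in extract_blocks(sample):
    print(kind, payload)
# text Comment
# text 1.
# table [['1:1', '1:2']]
# list Item 1
```

Because the parser only reports tags as they occur and does not build a tree, a table wrapped in a <p> and a <p> nested in a <td> need no special repair: the first ends the pending paragraph, the second is skipped while the cell text is collected. List items keep their leading marker (the bullet or "1.") as part of the text.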

  • Tried HTML Tidy, and it does help in some cases: it removes the <p> tag around the table, but it leaves <td><p>text</p></td> unchanged. The main trouble with this library under .NET is that there is no documentation or description of the methods, and the method names alone do not make things any clearer. - Nova
  • @Nova, since this is a ported library, in theory you can use the documentation from the original site: tidy.sourceforge.net. By the way, a .NET wrapper is hosted there too. - Alex Krass