OpenXML C#: read .docx

Question

I'm trying to get all the text from a .docx document.

    using System;
    using DocumentFormat.OpenXml.Packaging;

    using (var wordDocument = WordprocessingDocument.Open(fileName as string, false))
    {
        // Get all the text
        var text = wordDocument.MainDocumentPart.Document.Body.InnerText;
        Console.WriteLine(text);
    }

All of the text is indeed read into the variable, but it comes out unformatted, and the output looks something like this:

Whereas in the Word file itself it looks like this:

I assumed that the extracted text would at least keep its line breaks, but it turned out not to be that simple.

What are the options for preserving the line breaks?

RusArt · Accepted Answer · 2017-04-21T14:05:15

Try it this way:

    using System;
    using System.Text;
    using DocumentFormat.OpenXml;

    public string GetPlainText(OpenXmlElement element)
    {
        StringBuilder text = new StringBuilder();
        foreach (OpenXmlElement section in element.Elements())
        {
            switch (section.LocalName)
            {
                case "t":   // Text
                    text.Append(section.InnerText);
                    break;
                case "cr":  // Carriage return
                case "br":  // Break (line, page, or column)
                    text.Append(Environment.NewLine);
                    break;
                case "tab": // Tab
                    text.Append("\t");
                    break;
                case "p":   // Paragraph: recurse into it, then end the line
                    text.Append(GetPlainText(section));
                    // AppendLine adds a second line break, leaving a blank line between paragraphs
                    text.AppendLine(Environment.NewLine);
                    break;
                default:    // Any other element: recurse into its children
                    text.Append(GetPlainText(section));
                    break;
            }
        }
        return text.ToString();
    }

    var text = GetPlainText(wordDocument.MainDocumentPart.Document.Body);
    Console.WriteLine(text);
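If only paragraph boundaries matter, a shorter variant may be enough. The sketch below is not part of the answer above: it assumes the Open XML SDK is referenced, and `DocxParagraphReader`, `ReadParagraphs` and `ReadFile` are hypothetical names chosen for illustration. It emits one output line per paragraph and, unlike the recursive walk above, drops tabs and manual line breaks inside a paragraph.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class DocxParagraphReader
{
    // Hypothetical helper: joins the text of every w:p element with a line break.
    public static string ReadParagraphs(Body body)
    {
        // Paragraph.InnerText concatenates the text of all runs in the paragraph.
        // Descendants<Paragraph>() also visits paragraphs inside table cells.
        var lines = body.Descendants<Paragraph>().Select(p => p.InnerText);
        return string.Join(Environment.NewLine, lines);
    }

    // Hypothetical helper: opens the file read-only and extracts its paragraphs.
    public static string ReadFile(string fileName)
    {
        using (var wordDocument = WordprocessingDocument.Open(fileName, false))
        {
            return ReadParagraphs(wordDocument.MainDocumentPart.Document.Body);
        }
    }
}
```

Calling `Console.WriteLine(DocxParagraphReader.ReadFile(fileName))` would then print each paragraph on its own line. One caveat: paragraphs nested inside another paragraph (for example in a text box) may appear twice, so the recursive version is the safer choice for complex documents.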
