.NET Converting HTML to Plain Text in C#

Alastair Grant | Thu 22 Dec 2016

I've got a requirement to process emails automatically and insert them into a legacy system. The system, being of questionable quality, only supports plain-text emails, and since the rise of stupid big company logos being included on emails, people are increasingly sending their email in HTML.

Some sort of conversion is in order. Seems like a simple requirement to start off with - just extract the text. Right?

You cannot use the HTML source, as that is full of tags. And "human readable" is really stretching things, especially after Outlook has had its go at it. A popular method kicking around is to simply strip everything that is enclosed in <> angle brackets. But that leaves in place CSS & JavaScript, and probably some other things that you don't realise until the process has been running for a week. Even if you filter out all the gumph, and deal with decoding &amp; and the like - you still have to address formatting. At the basic level, simple paragraphs are defined as a <p> tag in HTML. If you're lucky, there will be a carriage return in the text along with it to match, but there is no reason why there would be.
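To make the problem concrete, here is a minimal sketch of the strip-the-angle-brackets approach. The class and method names (NaiveStripper, StripTags) are made up for illustration; the regular expression and WebUtility.HtmlDecode are standard .NET calls.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class NaiveStripper
{
    // Hypothetical helper: removes anything between < and >, then decodes entities.
    public static string StripTags(string html)
    {
        string noTags = Regex.Replace(html, "<[^>]*>", string.Empty);
        return WebUtility.HtmlDecode(noTags);
    }

    static void Main()
    {
        string html = "<style>p { color: red; }</style><p>Fish &amp; chips</p><p>Second paragraph</p>";

        // Prints: p { color: red; }Fish & chipsSecond paragraph
        Console.WriteLine(StripTags(html));
    }
}
```

The stylesheet survives as text, and the two paragraphs run together with nothing between them, which is exactly the formatting problem described above.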

The next solution would be to parse the HTML (or SGML) and have your system understand it. There are some libraries out there that do this for various languages. There are two popular approaches for .NET. The first, more "out-of-the-box" solution is to harness mshtml.dll - Internet Explorer's Trident engine. This will render the document and you should be able to access that output (without actually spawning IE). The first hurdle is that you have to interop with COM - which is made easier with Primary Interop Assemblies, but is still a little messy to develop and, more importantly, to distribute.

The second, more critical hurdle is that Trident is a browser engine (not one known for security either), and it will download all content linked to in the page, and run any JavaScript. I don't really fancy having a server process doing that on anything that can be emailed in.

The remaining popular solution is the Html Agility Pack. This is a handy library that allows you to use HTML documents in the same way as you would XmlDocument. It's more forgiving than the XML parser in .NET and will load up badly formed HTML and deal with non-XHTML syntax. There are other similar libraries for other languages/frameworks, and they all work great for targeting specific bits of data on a web-page etc. But for bulk conversion of text? That's tricky, as whilst you can get your "//text()" nodes, you have to reassemble them into something useful.
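As a rough illustration of that reassembly problem, the sketch below pulls the "//text()" nodes with the Html Agility Pack and glues them back together. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names are hypothetical, and putting each text node on its own line is just one arbitrary choice, not a faithful rendering of the layout.

```csharp
using System.Text;
using HtmlAgilityPack;

static class AgilityPackText
{
    // Hypothetical helper: collects text nodes, skipping script and style content.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//text()");
        if (nodes == null)
        {
            return string.Empty;
        }

        var sb = new StringBuilder();
        foreach (HtmlNode node in nodes)
        {
            string parent = node.ParentNode != null ? node.ParentNode.Name : string.Empty;
            if (parent == "script" || parent == "style")
            {
                continue; // not human-readable content
            }

            // Decode entities such as &amp; and drop whitespace-only nodes.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
            {
                sb.AppendLine(text);
            }
        }
        return sb.ToString();
    }
}
```

This gets the words out, but inline elements such as links or bold text end up on separate lines, and tables and lists lose their structure, so a real conversion would need per-tag rules on top.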

My solution to the problem is slightly different and has its own drawbacks. And that is to use the HtmlUtilities.ConvertToText() method. Which, "parses the HTML-formatted data, no scripts are run and no secondary downloads occur". Ideal, except that it's part of the WinRT API - geared for Windows Store apps, and not desktop (or server) applications.

Fortunately, whilst the WinRT API is for Windows Store, Microsoft does support it with desktop applications; you just have to go through a few hoops to get there. This blog post by Andrei Marukovich covers how. In summary:

You have to edit your *.csproj (or whatever proj) file in a text editor and add the line <TargetPlatformVersion>8.0</TargetPlatformVersion> to the top/main PropertyGroup node. You will then be able to add a reference to the "Windows" assembly in a new tab on the Add Reference dialogue. You will also need to reference C:/Program Files (x86)/Reference Assemblies/Microsoft/Framework/.NETCore/v4.5/System.Runtime.WindowsRuntime.dll.
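For orientation, the relevant parts of the project file might end up looking something like the fragment below. Only the TargetPlatformVersion line is taken from the steps above; the Reference entries are an assumption about roughly what Visual Studio writes after you add the two references, and the exact attributes in your file may differ.

```xml
<PropertyGroup>
  <!-- ...existing properties stay as they are... -->
  <TargetPlatformVersion>8.0</TargetPlatformVersion>
</PropertyGroup>

<ItemGroup>
  <!-- Assumed shape of the entries added via the Add Reference dialogue -->
  <Reference Include="Windows" />
  <Reference Include="System.Runtime.WindowsRuntime">
    <HintPath>C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETCore\v4.5\System.Runtime.WindowsRuntime.dll</HintPath>
  </Reference>
</ItemGroup>
```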

You can then simply call Windows.Data.Html.HtmlUtilities.ConvertToText() and receive a text view of your HTML. Formatting such as paragraphs is preserved. It handles tables too, but once they're collapsed, adjacent columns are stuck together with no spaces, which can be confusing.
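A minimal sketch of the call, assuming the project has been set up as described and is running on Windows 8 or later. The sample HTML is made up for illustration.

```csharp
using System;
using Windows.Data.Html;

static class Program
{
    static void Main()
    {
        // Made-up sample input; in practice this would be the HTML body of the email.
        string html = "<html><body><p>First paragraph.</p><p>Fish &amp; chips</p></body></html>";

        // Static WinRT method: takes the HTML as a string and returns the text content.
        string text = HtmlUtilities.ConvertToText(html);

        Console.WriteLine(text);
    }
}
```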

Breaking from the voyeuristic norms of the Internet, any comments can be made in private by contacting me.
