Cleaning HTML Snippets in .NET with HtmlAgilityPack

Sometimes you may have to sanitize a piece of HTML before storing or displaying it.

Say you have the snippet below and you want to:

  • Remove the itemscope and itemtype attributes from the first div
  • Drop the meta tag completely

<div itemscope itemtype="http://schema.org/Product">
    <meta itemprop="name" content="something">
    <p> Some stuff </p>
</div>

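You might be tempted to reach for Regex.Replace, but that would be a terrible idea: HTML is not a regular language, and a pattern that works on this exact snippet breaks as soon as the attribute order, quoting, or whitespace changes.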
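
To make the problem concrete, here is a small sketch of how a naive pattern goes wrong. The class name RegexPitfall, the helper StripWithRegex, and the pattern itself are made up for illustration; they only stand in for the kind of regex one might write in a hurry.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPitfall
{
    // Hypothetical naive pattern: assumes itemscope comes first,
    // a single space between attributes, and double-quoted values.
    private static readonly Regex Naive =
        new Regex(" itemscope itemtype=\"[^\"]*\"");

    public static string StripWithRegex(string html)
    {
        return Naive.Replace(html, "");
    }

    public static void Main()
    {
        // Three equally valid spellings of the same opening tag.
        string a = "<div itemscope itemtype=\"http://schema.org/Product\">";
        string b = "<div itemtype=\"http://schema.org/Product\" itemscope>";
        string c = "<div itemscope itemtype='http://schema.org/Product'>";

        Console.WriteLine(StripWithRegex(a)); // attributes removed
        Console.WriteLine(StripWithRegex(b)); // unchanged: attribute order differs
        Console.WriteLine(StripWithRegex(c)); // unchanged: single-quoted value
    }
}
```

Each variant can be patched into the pattern, but the list of variants never really ends, which is why a parser is the safer tool here.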

Instead, use HtmlAgilityPack, a library that can load, examine, and modify HTML documents and snippets.

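using System.Linq;
using HtmlAgilityPack;

private string Sanitize(string s)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(s);

    // Find the first div that carries the schema.org Product microdata.
    var div = doc.DocumentNode
        .Descendants("div")
        .FirstOrDefault(n => n.GetAttributeValue("itemtype", null) == "http://schema.org/Product");
    if (div != null)
    {
        // Remove the attributes by name through the Attributes collection.
        div.Attributes.Remove("itemtype");
        div.Attributes.Remove("itemscope");
    }

    // Find the meta tag with itemprop="name" and drop the whole node.
    var meta = doc.DocumentNode
        .Descendants("meta")
        .FirstOrDefault(n => n.GetAttributeValue("itemprop", null) == "name");
    if (meta != null)
        meta.Remove();

    // Serialize the modified snippet back to a string.
    return doc.DocumentNode.WriteContentTo();
}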

This example demonstrates how to remove attributes and nodes from an HTML snippet.
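
If a snippet contains more than one annotated element, the same approach can be applied to every match instead of only the first. The following is a rough sketch under that assumption; the class name MicrodataCleaner, the method StripAll, and the AttributesToStrip list are hypothetical names chosen for this example, not part of HtmlAgilityPack.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class MicrodataCleaner
{
    // Hypothetical list; adjust it to the attributes you need to drop.
    private static readonly string[] AttributesToStrip = { "itemscope", "itemtype" };

    public static string StripAll(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ToList() takes a snapshot so that removing nodes
        // does not disturb the enumeration.
        var metas = doc.DocumentNode
            .Descendants("meta")
            .Where(n => n.GetAttributeValue("itemprop", null) != null)
            .ToList();
        foreach (var meta in metas)
            meta.Remove();

        // Strip the listed attributes from every remaining element.
        var elements = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element)
            .ToList();
        foreach (var node in elements)
            foreach (var name in AttributesToStrip)
                node.Attributes.Remove(name);

        return doc.DocumentNode.WriteContentTo();
    }

    public static void Main()
    {
        string html =
            "<div itemscope itemtype=\"http://schema.org/Product\">" +
            "<meta itemprop=\"name\" content=\"something\">" +
            "<p> Some stuff </p></div>";

        Console.WriteLine(StripAll(html));
    }
}
```

Whether to strip every match or only the first depends on where the HTML comes from; for untrusted input, cleaning every match is usually the more cautious choice.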