Cleaning HTML Snippets in .NET With HtmlAgilityPack

Sometimes you may have to sanitize a piece of HTML before storing or displaying it.

Say you have the following snippet and you want to:

  • Remove the itemscope and itemtype attributes from the first div
  • Remove the meta tag entirely
<div itemscope itemtype="http://schema.org/Product">
  <meta itemprop="name" content="something">
  <p> Some stuff </p>
</div>

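You might be tempted to reach for Regex.Replace, but that would be a terrible idea: regular expressions cannot reliably handle nested tags, attribute values that contain angle brackets, or malformed markup.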
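To make the risk concrete, here is a minimal sketch of a naive regex-based approach and one input that breaks it. The class name RegexPitfallDemo, the helper StripMetaTags, and the pattern itself are hypothetical, chosen only for illustration; other patterns fail in other ways.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPitfallDemo
{
    // Naive approach (illustrative only): treat everything from "<meta"
    // up to the next '>' as the whole tag.
    public static string StripMetaTags(string html) =>
        Regex.Replace(html, "<meta[^>]*>", string.Empty);

    public static void Main()
    {
        // Works as long as no attribute value contains '>'.
        Console.WriteLine(StripMetaTags("<meta itemprop=\"name\" content=\"something\">"));
        // prints an empty line

        // Breaks when a quoted attribute value contains '>':
        // the match stops early and leaves part of the tag behind.
        Console.WriteLine(StripMetaTags("<meta itemprop=\"name\" content=\"a > b\">"));
        // prints:  b">   (with a leading space)
    }
}
```

The second call leaves a fragment of the tag in the output, which is exactly the kind of damage a real HTML parser avoids.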

Instead, use HtmlAgilityPack (available as a NuGet package), which can load, examine, and modify HTML documents and snippets.

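// At the top of the file:
using System.Linq;
using HtmlAgilityPack;

private string Sanitize(string s)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(s);

    // Find the first div that carries the schema.org Product microdata.
    var div = doc.DocumentNode
        .Descendants("div")
        .FirstOrDefault(n => n.GetAttributeValue("itemtype", null) == "http://schema.org/Product");

    if (div != null)
    {
        // Removing by name is safe even if the attribute is missing.
        div.Attributes.Remove("itemtype");
        div.Attributes.Remove("itemscope");
    }

    // Find the first meta tag whose itemprop is "name".
    var meta = doc.DocumentNode
        .Descendants("meta")
        .FirstOrDefault(n => n.GetAttributeValue("itemprop", null) == "name");

    // Detach the meta node from the document.
    if (meta != null)
        meta.Remove();

    // Serialize the modified snippet back to a string.
    return doc.DocumentNode.WriteContentTo();
}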

This example demonstrates how to remove attributes and nodes from an HTML snippet.
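If you need to clean every matching element rather than only the first one, an XPath query is one possible alternative. The sketch below is a variation on the approach above, not a replacement for it: the class name MicrodataCleaner and the method StripMicrodata are hypothetical, and it assumes you want the itemscope and itemtype attributes stripped from all elements and every meta tag with an itemprop attribute removed.

```csharp
using HtmlAgilityPack;

public static class MicrodataCleaner
{
    // Hypothetical helper: removes microdata attributes and itemprop meta tags
    // from the whole snippet, not just the first match.
    public static string StripMicrodata(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var tagged = doc.DocumentNode.SelectNodes("//*[@itemscope or @itemtype]");
        if (tagged != null)
        {
            foreach (var node in tagged)
            {
                node.Attributes.Remove("itemscope");
                node.Attributes.Remove("itemtype");
            }
        }

        var metas = doc.DocumentNode.SelectNodes("//meta[@itemprop]");
        if (metas != null)
        {
            // The query result is a separate collection, so detaching
            // nodes from the document while iterating over it is fine.
            foreach (var meta in metas)
                meta.Remove();
        }

        return doc.DocumentNode.WriteContentTo();
    }
}
```

Calling MicrodataCleaner.StripMicrodata on the snippet from the beginning of this post should leave the div and its paragraph in place, without the microdata attributes or the meta tag.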
