HtmlAgilityPack -- Does <form> close itself for some reason?
This is also reported in this workitem. It contains a suggested workaround from DarthObiwan.
You can change this without recompiling. The ElementsFlags list is a
static property on the HtmlNode class. The form entry can be removed with HtmlNode.ElementsFlags.Remove("form");
before doing the document load.
HTML Agility Pack stripping self-closing tags from input
In the end it pains me to say that I fell back on processing the HTML with regex to add in the missing self-closing tag. I'd love a better solution, as this is hacky and not future-proof, since the replacement has to be added for every tag that needs correcting:
sXHTML = Regex.Replace(sXHTML, "<input(.*?)>", "<input $1 />");
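Note that the one-line replacement above also rewrites tags that are already self-closed: an existing <input ... /> ends up with a second slash. A minimal sketch of a more defensive variant follows, assuming attribute values never contain a literal ">". The class and method names (SelfClosingFixer, SelfClose) are made up for illustration and are not part of any library. It is still a regex hack rather than an HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SelfClosingFixer
{
    // Hypothetical helper: rewrites <tag ...>, <tag .../> and <tag ... />
    // to the single form <tag ... /> for each of the given tag names.
    // Assumption: no attribute value contains a literal '>'.
    public static string SelfClose(string html, params string[] tagNames)
    {
        // Escape each name so it is matched literally inside the alternation.
        string names = string.Join("|",
            Array.ConvertAll<string, string>(tagNames, Regex.Escape));

        // $1 = tag name (original casing kept), $2 = attributes without
        // any trailing whitespace or slash.
        string pattern = @"<(" + names + @")\b([^>]*?)\s*/?>";

        return Regex.Replace(html, pattern, "<$1$2 />", RegexOptions.IgnoreCase);
    }
}
```

Usage might look like sXHTML = SelfClosingFixer.SelfClose(sXHTML, "input", "br"); so that the list of tags lives in one place instead of one Regex.Replace call per tag.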
HtmlAgilityPack produces missing closing tags in OuterHtml
There are several options that you can set when you are loading the document.
OptionAutoCloseOnEnd
Defines if closing for non closed nodes must be done at the end or directly in the document. Setting this to true can actually change how browsers render the page.
var document = new HtmlDocument();
document.OptionAutoCloseOnEnd = true;
document.LoadHtml(content);
Related sources worth reading:
HtmlAgilityPack Drops Option End Tags
Image tag not closing with HTMLAgilityPack
Selecting Inner Text Using HtmlAgilityPack
HtmlAgilityPack by default leaves option tags empty (you can see the author's reason for this at HtmlAgilityPack -- Does <form> close itself for some reason?). To fix it, add this line before loading the document (and therefore before selecting the nodes):
HtmlNode.ElementsFlags.Remove("option");
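Put together, the fix might look like the sketch below. The OptionTextReader and ReadOptionTexts names are invented for illustration; only HtmlNode.ElementsFlags, HtmlDocument.LoadHtml, SelectNodes and InnerText come from HtmlAgilityPack. Because the flag influences parsing, the sketch removes it before LoadHtml is called.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class OptionTextReader
{
    // Hypothetical helper: returns the text of every <option> in the markup.
    public static List<string> ReadOptionTexts(string html)
    {
        // ElementsFlags is static, so this changes parsing for the whole
        // process, not just for the document created below.
        HtmlNode.ElementsFlags.Remove("option");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var texts = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection options = doc.DocumentNode.SelectNodes("//option");
        if (options == null)
        {
            return texts;
        }

        foreach (HtmlNode option in options)
        {
            texts.Add(option.InnerText.Trim());
        }
        return texts;
    }
}
```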
Html Agility Pack xPath issue
This is because the FORM tag has a special treatment by the HTML Agility Pack. The reasons are described here: HtmlAgilityPack -- Does <form> close itself for some reason?
So, you basically need to remove that special treatment, like this (must happen before any load):
// instruct the library to treat FORM like any other tag
HtmlNode.ElementsFlags.Remove("form");
HtmlDocument l_missionsDoc = new HtmlDocument();
l_missionsDoc.Load(l_stream);
XPathNavigator l_navigator = l_missionsDoc.CreateNavigator();
XPathNodeIterator l_iterator = l_navigator.Select("//form[@id='formliste']/table");
if (l_iterator.Count <= 0) continue;
Get entire form element as string using Html Agility Pack
Seems you're looking for HtmlNode.OuterHtml:
//
// Summary:
// Gets or Sets the object and its content in HTML.
public virtual string OuterHtml { get; }
So you just have to select your form node and get its OuterHtml property:
HtmlDocument doc = ... // load your HTML
HtmlNode formNode = doc.DocumentNode.SelectSingleNode("//form[@id='aspnetForm']");
string entireElementAsString = formNode.OuterHtml;
UPDATE
It seems there's a very old bug with how HAP treats form tags. Or maybe it's a feature!
In any case, here's a workaround:
HtmlNode.ElementsFlags.Remove("form");
So this should work:
HtmlNode.ElementsFlags.Remove("form");
HtmlDocument doc = ... // load your HTML
HtmlNode formNode = doc.DocumentNode.SelectSingleNode("//form[@id='aspnetForm']");
string entireElementAsString = formNode.OuterHtml;
Add form tag around a body tag using HtmlAgilityPack
The FORM element has a special treatment. See here on SO for more: HtmlAgilityPack -- Does <form> close itself for some reason?
So, you could do this:
var doc = new HtmlDocument();
HtmlNode.ElementsFlags.Remove("form"); // remove special handling for FORM
doc.LoadHtml(input);
var body = doc.DocumentNode.SelectSingleNode("//body");
if (doc.DocumentNode.SelectNodes("//form[@action]") == null)
{
    var form = doc.CreateElement("form");
    form.Attributes.Add("action", "/pages/event/10302");
    body.PrependChild(form);
}
but it will get you this:
<html>
<head>
<title></title>
</head>
<body>
<form action="/pages/event/10302"></form>
<p>Full name: <input name="FullName" type="text" value=""></p>
<p><input name="btnSubmit" type="submit" value="Submit"></p>
</body>
</html>
Which is logical: the new form doesn't surround anything. So, instead you can do this:
var doc = new HtmlDocument();
doc.LoadHtml(input);
var body = doc.DocumentNode.SelectSingleNode("//body");
if (doc.DocumentNode.SelectNodes("//form[@action]") == null)
{
    var form = body.CloneNode("form", true);
    form.Attributes.Add("action", "/pages/event/10302");
    body.ChildNodes.Clear();
    body.PrependChild(form);
}
which will get you this:
<html>
<head>
<title></title>
</head>
<body><form action="/pages/event/10302">
<p>Full name: <input name="FullName" type="text" value=""></p>
<p><input name="btnSubmit" type="submit" value="Submit"></p>
</form></body>
</html>
This is not the only way, but it works, and you don't necessarily have to remove the FORM special treatment.
Problem parsing children of a node with HtmlAgilityPack
Well, I've given up on HtmlAgilityPack for now. Seems like there is still more work to do in that library to get everything working. To solve this problem I've moved the code over to use the SGMLReader library from here: http://developer.mindtouch.com/SgmlReader
Using this library all my unit tests pass properly and the sample code works as expected.