In C#, how can I read an HTML document if the website source is kept in a string variable?

c# html html-agility-pack html-content-extraction

Question

I have tried to do this on my own but couldn't.

I have an html document, and I'm trying to extract the addresses for all the pictures in it into a c# collection and I'm not sure of the syntax. I'm using HTMLAgilityPack... Here is what I have so far. Please advise.

The HTML Code is the following:

<div style='padding-left:12px;' id='myWeb123'>
<b>MyWebSite Pics</b>
<br /><br />
<img src="http://myWebSite.com/pics/HHTR_01.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_02.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_03.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_04.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_05.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_06.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_07.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_08.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_09.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_10.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<a href="http://www.myWebSite.com/" target="_blank" rel="nofollow">Source</a>
</div>

And the c# code is the following:

HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();

document.Load("FileName.html");

// Targets a specific node
HtmlNode someNode = document.GetElementbyId("myWeb123");

//HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//a[@href]");

HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//div[@id='myWeb123']");

if (linkNodes != null)
{
    int count = 0;
    foreach(HtmlNode linkNode in linkNodes)
    {

        string linkTitle = linkNode.GetAttributeValue("src", string.Empty);

        Debug.Print("linkTitle = " + linkTitle);

        if (linkTitle == string.Empty)
        {
            HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");
            if (imageNode != null)
            {
                Debug.Print("imageNode = " + imageNode.Attributes.ToString());
            }
        }
        count++;
        Debug.Print("count = " + count);
    }
}

I tried to use the HtmlAgilityPack Documentation but this pack lacks examples and the information about its methods and classes are really hard for me to understand without examples.

1
2
11/25/2011 9:53:07 AM

Accepted Answer

try this, sorry if it will not be buildable, I have overwritten our code to your situation

List<string> result = new List<string>();
foreach (HtmlNode link in document.DocumentNode.SelectNodes("//img[@src]"))
{
    HtmlAttribute att = link.Attributes["src"];

    string temp = att.Value;
    string urlValue;
    do
    {
        urlValue = temp;
        temp = HttpUtility.UrlDecode(HttpUtility.HtmlDecode(urlValue));
    } while (temp != urlValue);

    result.Add(temp);
}
3
11/25/2011 9:53:29 AM

Popular Answer

You can use the overload of Load which takes a TextReader:

document.Load(new StringReader(text));

(I haven't looked over the rest of the code, but that addresses the "what do I do if I've already got the HTML in a string?" part.)



Related Questions





Related

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow