Extracting image URLs from HTML in C# using Html Agility Pack and writing them to an XML file

c# html-agility-pack xml

Question

I am new to C# and need help with the following problem. I want to extract from a web page the URLs of photos whose file names follow a specific pattern, for example all images named like name_412s.jpg. I use the following code to extract images from the HTML, but I do not know how to adapt it.

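public void Images()
{
    WebClient x = new WebClient();
    string source = x.DownloadString(@"http://www.google.com");

    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    // LoadHtml parses markup held in a string; Load expects a file path or stream.
    document.LoadHtml(source);

    List<string> images = new List<string>();
    // SelectNodes returns null when no node matches.
    HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//img");
    if (nodes != null)
    {
        foreach (HtmlNode link in nodes)
        {
            images.Add(link.GetAttributeValue("src", ""));
        }
    }
}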

I also need to write the results to an XML file. Can you help me with that as well?

Thank you!

Accepted Answer

To limit the query results, add a condition to your XPath expression. For instance, //img[contains(@src, '_412s.jpg')] matches only img elements whose src attribute contains that suffix. Note that contains matches the text anywhere in the attribute value, not only at the end.
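As a rough illustration of the XPath condition, here is a minimal sketch that runs the filter against an inline HTML string instead of a downloaded page. The class name ImageFilterSketch, the method FindImageUrls, and the sample markup are made up for this example; it assumes the HtmlAgilityPack package is referenced and that the fragment contains no single quote.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ImageFilterSketch
{
    // Returns the src values of img elements whose src contains the given fragment.
    // Assumption: fragment contains no single quote, since it is pasted into the XPath literal.
    public static List<string> FindImageUrls(string html, string fragment)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var urls = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes =
            document.DocumentNode.SelectNodes("//img[contains(@src, '" + fragment + "')]");
        if (nodes == null)
        {
            return urls;
        }
        foreach (HtmlNode img in nodes)
        {
            urls.Add(img.GetAttributeValue("src", string.Empty));
        }
        return urls;
    }

    public static void Main()
    {
        // Hypothetical sample markup: one matching image, one non-matching, one without src.
        string html =
            "<html><body>" +
            "<img src='/photos/cat_412s.jpg'>" +
            "<img src='/photos/cat_800l.jpg'>" +
            "<img alt='no source'>" +
            "</body></html>";

        foreach (string url in FindImageUrls(html, "_412s.jpg"))
        {
            Console.WriteLine(url);
        }
    }
}
```

With the sample markup, only the first image satisfies the condition. If the pattern must appear at the very end of the URL, one option is to select //img[@src] and check each value with string.EndsWith in C#.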

As for writing the results to XML, you'll need to create a new XML document and then copy the matching elements into it. Since you can't directly import an HtmlAgilityPack node into an XmlDocument, you'll have to copy the attributes manually. For instance:

using System.Net;
using System.Xml;
using HtmlAgilityPack;

// ...

public void Images()
{
    WebClient x = new WebClient();
    string source = x.DownloadString(@"http://www.google.com");
    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    // LoadHtml parses markup held in a string; Load expects a file path or stream.
    document.LoadHtml(source);
    XmlDocument output = new XmlDocument();
    XmlElement imgElements = output.CreateElement("ImgElements");
    output.AppendChild(imgElements);
    // SelectNodes returns null when no node matches, so guard before iterating.
    HtmlNodeCollection links = document.DocumentNode.SelectNodes("//img[contains(@src, '_412s.jpg')]");
    if (links != null)
    {
        foreach (HtmlNode link in links)
        {
            // Recreate each img element in the XML document and copy its attributes.
            XmlElement img = output.CreateElement(link.Name);
            foreach (HtmlAttribute a in link.Attributes)
            {
                img.SetAttribute(a.Name, a.Value);
            }
            imgElements.AppendChild(img);
        }
    }
    output.Save(@"C:\test.xml");
}
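If only the URLs are needed rather than whole img elements, a smaller output file may be enough. The following is a sketch of one possible alternative, not part of the original answer: it uses LINQ to XML (System.Xml.Linq) instead of XmlDocument, and the names ImageUrlXmlSketch, BuildDocument, Images, Image, and the images.xml path are all assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ImageUrlXmlSketch
{
    // Builds <Images><Image src="..." /></Images> from a sequence of URLs.
    // The element and attribute names are illustrative, not prescribed by any API.
    public static XDocument BuildDocument(IEnumerable<string> urls)
    {
        return new XDocument(
            new XElement("Images",
                urls.Select(u => new XElement("Image", new XAttribute("src", u)))));
    }

    public static void Main()
    {
        // Hypothetical URLs standing in for the values collected from the page.
        var urls = new List<string> { "/photos/cat_412s.jpg", "/photos/dog_412s.jpg" };

        XDocument output = BuildDocument(urls);
        output.Save("images.xml"); // hypothetical relative path
        Console.WriteLine(output);
    }
}
```

The list passed to BuildDocument would come from the src values gathered in the loop over the matching img nodes.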



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why