HTML Agility Pack Ottieni contenuti di <p itemprop>

c# html-agility-pack parsing xpath

Domanda

Sto cercando di ottenere il contenuto dell'utilizzo del pacchetto agilità HTML. Ecco un esempio dell'HTML che sto cercando di analizzare:

         <p itemprop="articleBody">
    Hundreds of thousands of Ukrainians filled the streets of Kiev on Sunday, first to hear speeches and music and then to fan out and erect barricades in the district where government institutions have their headquarters.</p><p itemprop="articleBody">
    Carrying blue-and-yellow Ukrainian and European Union flags, the teeming crowd filled 
Independence Square, where protests have steadily gained momentum since Mr. Yanukovich refused on Nov. 21 to sign trade and political agreements with the European Union. The square has been transformed by a vast and growing tent encampment, and demonstrators have occupied City Hall and other public buildings nearby. Thousands more people gathered in other cities across the country.        </p><p itemprop="articleBody">
    “Resignation! Resignation!” people in the Kiev crowd chanted on Sunday, demanding that Mr. Yanukovich and the government led by Prime Minister Mykola Azarov leave office.        </p>

Sto cercando di analizzare l'HTML sopra usando il seguente codice:

         <p itemprop="articleBody">
    Hundreds of thousands of Ukrainians filled the streets of Kiev on Sunday, first to hear speeches and music and then to fan out and erect barricades in the district where government institutions have their headquarters.</p><p itemprop="articleBody">
    Carrying blue-and-yellow Ukrainian and European Union flags, the teeming crowd filled 
Independence Square, where protests have steadily gained momentum since Mr. Yanukovich refused on Nov. 21 to sign trade and political agreements with the European Union. The square has been transformed by a vast and growing tent encampment, and demonstrators have occupied City Hall and other public buildings nearby. Thousands more people gathered in other cities across the country.        </p><p itemprop="articleBody">
    “Resignation! Resignation!” people in the Kiev crowd chanted on Sunday, demanding that Mr. Yanukovich and the government led by Prime Minister Mykola Azarov leave office.        </p>

MODIFICARE:

Ma sembra che articleBodyScope sia vuoto, perché:

         <p itemprop="articleBody">
    Hundreds of thousands of Ukrainians filled the streets of Kiev on Sunday, first to hear speeches and music and then to fan out and erect barricades in the district where government institutions have their headquarters.</p><p itemprop="articleBody">
    Carrying blue-and-yellow Ukrainian and European Union flags, the teeming crowd filled 
Independence Square, where protests have steadily gained momentum since Mr. Yanukovich refused on Nov. 21 to sign trade and political agreements with the European Union. The square has been transformed by a vast and growing tent encampment, and demonstrators have occupied City Hall and other public buildings nearby. Thousands more people gathered in other cities across the country.        </p><p itemprop="articleBody">
    “Resignation! Resignation!” people in the Kiev crowd chanted on Sunday, demanding that Mr. Yanukovich and the government led by Prime Minister Mykola Azarov leave office.        </p>

Non stampa "CONTENT NOT NULL" e articleBodyText rimane vuoto. Se qualcuno potesse indicarmi la soluzione sarei grato, grazie in anticipo!

Risposta popolare

Sembra che il New York Times in realtà rilevi che non accetti i cookie da loro. Come tali, ti presentano un avvertimento sui cookie e una casella di accesso. Fornendo effettivamente un CookieContainer , .NET è in grado di gestire l'intera attività sui cookie e NYT presenterà effettivamente il suo contenuto:

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace UnitTestProject3
{
    using System.Net;
    using System.Runtime;

    using HtmlAgilityPack;

    [TestClass]
    public class UnitTest1
    {
        [TestMethod]
        public void WhenProvidingCookiesYouSeeContent()
        {
            HtmlDocument doc = new HtmlDocument();

            WebClient wc = new WebClientEx(new CookieContainer());

            string contents = wc.DownloadString(
                "http://www.nytimes.com/2013/12/10/world/asia/thailand-protests.html?partner=rss&emc=rss&_r=1&");
            doc.LoadHtml(contents);

            var nodes = doc.DocumentNode.SelectNodes(@"//p[@itemprop='articleBody']");

            Assert.IsNotNull(nodes);
            Assert.IsTrue(nodes.Count > 0);
        }
    }

    public class WebClientEx : WebClient
    {
        public WebClientEx(CookieContainer container)
        {
            this.container = container;
        }

        private readonly CookieContainer container = new CookieContainer();

        protected override WebRequest GetWebRequest(Uri address)
        {
            WebRequest r = base.GetWebRequest(address);
            var request = r as HttpWebRequest;
            if (request != null)
            {
                request.CookieContainer = container;
            }
            return r;
        }

        protected override WebResponse GetWebResponse(WebRequest request, IAsyncResult result)
        {
            WebResponse response = base.GetWebResponse(request, result);
            ReadCookies(response);
            return response;
        }

        protected override WebResponse GetWebResponse(WebRequest request)
        {
            WebResponse response = base.GetWebResponse(request);
            ReadCookies(response);
            return response;
        }

        private void ReadCookies(WebResponse r)
        {
            var response = r as HttpWebResponse;
            if (response != null)
            {
                CookieCollection cookies = response.Cookies;
                container.Add(cookies);
            }
        }
    }
}

Grazie a questa risposta per la classe estesa WebClient .

Nota

Potrebbe essere contrario ai termini di utilizzo del NYT per cancellare in modo sfacciato le nuove storie dal loro sito web.




Autorizzato sotto: CC-BY-SA with attribution
Non affiliato con Stack Overflow
È legale questo KB? Sì, impara il perché
Autorizzato sotto: CC-BY-SA with attribution
Non affiliato con Stack Overflow
È legale questo KB? Sì, impara il perché