How to read JavaScript object with XPath/HTMLAgilityPack

c# html-agility-pack javascript xpath

Question

I need to extract product information from a JavaScript object for my crawler project.

How can I efficiently extract the object's details from the following JavaScript? I am using HTMLAgilityPack and XPath.

<script type="text/javascript">
    var product = {
        identifier: '2051189775',     //PRODUCT ID
        fn: 'Fit- Whiskered Dark Wash Skirt',
        category: ['sale'],
        brand: 'Brand Name',
        price: '22.90',  // this would be the discount price
        amount: '31.80',  // this would be the original price
        currency: 'USD',
        // The list can be even longer.
    };
</script>

I've never attempted extracting information from JavaScript objects. For other crawlers, I was collecting information straight from HTML.

7/19/2013 9:56:02 AM

Accepted Answer

The JavaScript code should be treated as plain text, because the HTML Agility Pack doesn't evaluate any of the HTML's contents. Use the SelectSingleNode method to locate the script element, then grab its InnerHtml to access the contents.
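
A minimal sketch of that lookup, assuming the page contains a single script block that declares `var product`. The class name `ScriptLocator`, the method name `FindProductScript`, and the XPath filter are illustrative choices, not part of the original answer:

```csharp
using System;
using HtmlAgilityPack;

public static class ScriptLocator
{
    // Returns the raw text of the first <script> element whose text
    // mentions "var product", or null when no such element is found.
    public static string FindProductScript(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: the target script can be recognized by the
        // "var product" declaration. Adjust the filter for other pages.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//script[contains(text(), 'var product')]");

        // The script body comes back unevaluated, as plain text.
        return node?.InnerHtml;
    }
}
```

The returned string is just the script source, so everything after this point is ordinary text processing.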

Either find a JavaScript parser for C# (such as IronJS), or write your own parser using standard text-manipulation methods (String.* or Regex) to extract the desired pieces.
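
The text-manipulation route could look roughly like the sketch below. It assumes the object literal contains no nested braces and that the values of interest are single-quoted strings, as in the question's sample. `ProductTextParser` and its method names are hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ProductTextParser
{
    // Returns the object literal assigned to "var product", braces
    // included, or null if the declaration is not present.
    // Assumption: no nested { } inside the literal.
    public static string ExtractObjectLiteral(string script)
    {
        Match m = Regex.Match(
            script,
            @"var\s+product\s*=\s*(\{.*?\})\s*;",
            RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Reads one single-quoted string property, e.g.  price: '22.90'
    // Returns null when the property is missing.
    public static string ReadStringProperty(string objectLiteral, string name)
    {
        Match m = Regex.Match(
            objectLiteral,
            @"\b" + Regex.Escape(name) + @"\s*:\s*'([^']*)'");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For example, `ReadStringProperty(literal, "price")` would pick out the discounted price from the sample. This approach is brittle if the site changes its markup, which is why a real parser is the more robust option.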

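The portion within the curly braces looks close to JSON, although strict JSON requires double-quoted property names and strings and does not allow comments or trailing commas. Once you have extracted it, you can parse it with the parser mentioned above or with a lenient library like Json.NET.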
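
A hedged sketch of the Json.NET route follows. The sample literal is not strict JSON (single quotes, unquoted names, comments), so this relies on Json.NET's reader being lenient about those; verify that against the version you use. `ProductJsonParser` and `ReadPrice` are hypothetical names:

```csharp
using System;
using System.Globalization;
using Newtonsoft.Json.Linq;

public static class ProductJsonParser
{
    // Parses the extracted object literal and returns the "price"
    // property as a decimal.
    // Assumption: the literal is close enough to JSON for Json.NET
    // to accept it, and "price" is a string like '22.90'.
    public static decimal ReadPrice(string objectLiteral)
    {
        JObject product = JObject.Parse(objectLiteral);
        string price = (string)product["price"];

        // Use the invariant culture so '.' is always the decimal separator.
        return decimal.Parse(price, CultureInfo.InvariantCulture);
    }
}
```

Other properties can be read the same way, for instance `(string)product["brand"]`. If parsing fails on a particular page, falling back to the Regex approach for individual fields is a reasonable alternative.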

7/19/2013 2:52:11 PM


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow