R: How to Check if an XPath Exists

Hoping someone more knowledgeable than me can shed some light here.

As part of a larger web scraper I want to pull metadata out of a set of pages. When I ran this it fell over; investigation shows this was because one of the requested XPaths did not match anything in the page.

One potential solution is to grab ALL the meta tags for a page into a vector, then check that each required one exists before building a new vector of just those I want.

BUT

It would be even better if I only grabbed the bits I want, and only if they exist in the page.

require(XML)
require(RCurl)
parsed <- htmlParse("http://www.coindesk.com/information")

meta <- list()
meta[1] <- xpathSApply(parsed, "//meta[starts-with(@property, 'og:title')]", xmlGetAttr, "content")
meta[2] <- xpathApply(parsed, "//meta[starts-with(@property, 'og:description')]", xmlGetAttr, "content")
meta[3] <- xpathApply(parsed, "//meta[starts-with(@property, 'og:url')]", xmlGetAttr, "content")
meta[4] <- xpathApply(parsed, "//meta[starts-with(@property, 'article:published_time')]", xmlGetAttr, "content")
meta[5] <- xpathApply(parsed, "//meta[starts-with(@property, 'article:modified_time')]", xmlGetAttr, "content")

This will throw an error as og:description isn't in this page.

Error in meta[2] <- xpathApply(parsed, "//meta[starts-with(@property, 'og:description')]",  : 
  replacement has length zero

Can anyone suggest a simple test that will check whether the XPath matches before trying to extract it, failing gracefully with perhaps a NULL response?

Asked By: BarneyC

Answer #1:

Assuming the error comes when you try and process the empty list...

> parsed <- htmlParse("http://www.coindesk.com/information")
> meta <- xpathApply(parsed, "//meta[starts-with(@property, 'og:description')]", xmlGetAttr, "content")
> meta
list()
> length(meta)==0
[1] TRUE

Then test for length(meta)==0, which is TRUE if the element is missing. Otherwise it's FALSE, as in this example of extracting the title property:

> meta <- xpathApply(parsed, "//meta[starts-with(@property, 'og:title')]", xmlGetAttr, "content")
> meta
[[1]]
[1] "Beginner's guide to bitcoin - CoinDesk's Information Center"

> length(meta)==0
[1] FALSE
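
The length test can also be wrapped in a small function so a missing tag yields a default instead of an error. This is only a sketch: get_meta is a hypothetical helper name, NA is an assumed choice of default, and an inline HTML string stands in for the downloaded page so the example runs without network access.

```r
library(XML)

# Inline stand-in for a downloaded page: og:title is present, og:description is not
html <- '<html><head>
  <meta property="og:title" content="Example title"/>
</head><body></body></html>'
parsed <- htmlParse(html, asText = TRUE)

# Hypothetical helper: first matching content attribute, or `default` if no match
get_meta <- function(doc, property, default = NA_character_) {
  xpath <- sprintf("//meta[starts-with(@property, '%s')]", property)
  hits <- xpathSApply(doc, xpath, xmlGetAttr, "content")
  # xpathSApply returns an empty list when the XPath matches nothing
  if (length(hits) == 0) default else hits[[1]]
}

title <- get_meta(parsed, "og:title")
description <- get_meta(parsed, "og:description")
```

Here description comes back as NA rather than stopping the script, because the zero-length result is caught before any assignment into a list.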
Answered By: Spacedman

Answer #2:

The answer to this has been hard to nail down. Whilst there are a couple of custom implementations of xpathApply knocking around that handle NULL results, the solution to the question posed lay in Spacedman's suggestion.

The first part of the if statement runs the XPath and checks whether the result has length 0. If it does, a placeholder message ("Title NA" or "Description NA") is stored in the list; if the length isn't 0 (i.e. there is a match), the XPath result is stored instead.

Simples.

    require(XML)
    require(RCurl)
    parsed <- htmlParse("http://www.coindesk.com/information")

    meta    <- list()
    # Store the og:title content if the XPath matches, otherwise a placeholder.
    # Note: "else" must sit on the same line as the closing brace at top level.
    meta[1] <- if (length(xpathSApply(parsed, "//meta[starts-with(@property, 'og:title')]", xmlGetAttr, "content")) == 0) {
                 "Title NA"
               } else {
                 xpathSApply(parsed, "//meta[starts-with(@property, 'og:title')]", xmlGetAttr, "content")
               }
    # Same check for og:description
    meta[2] <- if (length(xpathApply(parsed, "//meta[starts-with(@property, 'og:description')]", xmlGetAttr, "content")) == 0) {
                 "Description NA"
               } else {
                 xpathApply(parsed, "//meta[starts-with(@property, 'og:description')]", xmlGetAttr, "content")
               }
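
Since the same check is repeated for every property, it could also be applied across a vector of property names in one pass. The following is a rough sketch rather than a definitive version: the variable name wanted, the inline test page and the use of NA (instead of the "Title NA" style messages) are all assumptions made for illustration.

```r
library(XML)

# Inline stand-in for a downloaded page; og:description is deliberately absent
html <- '<html><head>
  <meta property="og:title" content="Example title"/>
  <meta property="og:url" content="http://example.com/page"/>
</head><body></body></html>'
parsed <- htmlParse(html, asText = TRUE)

# Properties to extract (assumed list, for illustration)
wanted <- c("og:title", "og:description", "og:url")

# Run each XPath once; keep the first match, or NA when nothing matches
meta <- lapply(wanted, function(p) {
  xpath <- sprintf("//meta[starts-with(@property, '%s')]", p)
  hits <- xpathSApply(parsed, xpath, xmlGetAttr, "content")
  if (length(hits) == 0) NA_character_ else hits[[1]]
})
names(meta) <- wanted
```

Each XPath is evaluated only once here, and the result is a named list with one entry per requested property, so later code can look values up by name.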
Answered By: BarneyC
The answers/resolutions are collected from stackoverflow, are licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0 .


