I am trying to scrape metadata from a bunch of websites. For most, using Cheerio to get things like $('meta[property="article:published_time"]').attr('content') works fine. For others, however, this metadata property is not explicitly defined, but the data is present in some form in the HTML.
For example, if I scrape this page, there is no published_time metadata property, but this text is present in the file ...
{"@context":"http://schema.org","@type":"NewsArticle","mainEntityOfPage":"https://news.yahoo.com/venezuela-deploys-soldiers-face-guyana-175722970.html","headline":"Venezuela Deploys Troops to East Caribbean Coast, Citing Guyana Threat","datePublished":"2023-12-28T19:53:10.000Z","dateModified":"2023-12-28T19:53:10.000Z","keywords":["Nicolas Maduro","Venezuela","Bloomberg","Guyana","Essequibo","Exxon Mobil Corp"],"description":"(Bloomberg) -- Venezuela has decided to deploy more than 5,000 soldiers on its eastern Caribbean coast after neighboring Guyana received a warship from the...","publisher":{"@type":"Organization","name":"Yahoo News","logo":{"@type":"ImageObject","url":"https://s.yimg.com/rz/p/yahoo_news_en-US_h_p_news_2.png","width":310,"height":50},"url":"https://news.yahoo.com/"},"author":{"@type":"Person","name":"Andreina Itriago Acosta","url":"","jobTitle":""},"creator":{"@type":"Person","name":"Andreina Itriago Acosta","url":"","jobTitle":""},"provider":{"@type":"Organization","name":"Bloomberg","url":"https://www.bloomberg.com/","logo":{"@type":"ImageObject","width":339,"height":100,"url":"https://s.yimg.com/cv/apiv2/hlogos/bloomberg_Light.png"}},"image":{"@type":"ImageObject","url":"https://s.yimg.com/ny/api/res/1.2/hs3Vjof2BqloeagLdsvfDw--/YXBwaWQ9aGlnaGxhbmRlcjt3PTEyMDA7aD0xMjAy/https://media.zenfs.com/en/bloomberg_politics_602/2db14d66c52bec70cb0ec6d0553968c6","width":1200,"height":1202},"thumbnailUrl":"https://s.yimg.com/ny/api/res/1.2/hs3Vjof2BqloeagLdsvfDw--/YXBwaWQ9aGlnaGxhbmRlcjt3PTEyMDA7aD0xMjAy/https://media.zenfs.com/en/bloomberg_politics_602/2db14d66c52bec70cb0ec6d0553968c6"}
There is a "datePublished" field in this object. How do I get this property with Cheerio?
The data you want is in JSON format inside a <script> tag. To find the data, I would select all <script> tags, then loop over them to find one with the '"datePublished":' substring, extract the text, run it through JSON.parse() and finally access the .datePublished property.
See this post for a general tutorial on this particular technique. It's in Python, but the same concepts apply in Node.
Sometimes the data within the <script> is a JS object rather than strict JSON, or is assigned to a variable, which makes the parsing a bit trickier than the straightforward scenario here, typically requiring a bit of regex or JSON5 to parse. See this answer for a more complex example of parsing data from a <script> tag using Cheerio.