Extract an HTML tag that contains a string in OpenRefine?

There is not much to add to the title; it's what I'm trying to do. Any suggestions?

I reviewed the docs on GitHub and googled extensively.

The best I got is:

value.parseHtml().select('p[contains('xyz')]')

It results in a syntax error.

There are 2 answers

Owen Stephens

The 'select' syntax is based on the selector syntax in jsoup (http://jsoup.org/cookbook/extracting-data/selector-syntax).

In this case I believe the syntax you need is:

value.parseHtml().select("p:contains(xyz)")
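To spell out the difference from the expression in the question, as a sketch: the inner single quotes around xyz close the GREL string early, which is the likely source of the syntax error, and square brackets in jsoup selectors are for attributes rather than text. Text matching uses pseudo-selectors instead. Each line below is a separate expression, and xyz is just the placeholder from the question:

```
value.parseHtml().select("p:contains(xyz)")
value.parseHtml().select("p:containsOwn(xyz)")
```

Per the jsoup selector documentation, the first matches any p whose text contains xyz, including text inside nested elements, while the second matches only when xyz appears directly in the p element's own text. Both match case-insensitively.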

Owen

Thad Guidry

Perhaps you missed my write-up (and WARNING) on the wiki :) here:

https://github.com/OpenRefine/OpenRefine/wiki/StrippingHTML#extract-html-attributes-text-links-with-integrated-grel-jsoup-commands

WARNING: Make sure to use .toString() suffixes when needed to output strings into Refine cells while working with the built-in HTML GREL commands (the default output is org.jsoup.nodes objects). Otherwise you'll get a preview just fine in the Expression Editor, BUT no data shown in the Refine cells when you apply it!
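As a rough sketch of what that could look like for the selector in this question (the p:contains(xyz) selector and the "|" separator are illustrative choices here, not taken from the wiki), select returns an array of elements, so one option is to convert each element to a string and join the results:

```
forEach(value.parseHtml().select("p:contains(xyz)"), e, e.toString()).join("|")
```

Swapping e.toString() for e.htmlText() should give the text content of each matching tag instead of its HTML.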

BTW, how and where could we make the docs better, so that someone doesn't miss this in the future?

I even gave folks a nice example in our docs that shows using .toString(): https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions#selectelement-e-string-s