Substr instr help needed in sql oracle

56 views Asked by At

i am trying to fetch some text from HTML string from a table. the text are like this and i am trying to get the output

ID,TEXT
1,<font face="Microsoft Sans Serif" size="8" color="#000000">ABC 123</font>
2,<font face="Microsoft Sans Serif" size="8" color="#000000">1 ETG</font>
3,<font face="Microsoft Sans Serif" size="8" color="#000000">Gatelys PERC</font>
4,<font face="Microsoft Sans Serif" size="8" color="#000000">Gml TK 144</font>

ID,TEXT
1,ABC 123
2,1 ETG
3,Gatelys PERC
4,Gml TK 144

I have tried this but not getting right output.

select ID,regexp_substr(TEXT, '[^>]+',1,2) from TABLE
where TEXT is not null;
2

There are 2 answers

1
MT0 On BEST ANSWER

If your HTML is always valid XHTML then you should use XML functions to parse the string:

SELECT id,
       XMLQUERY('*/text()' PASSING XMLTYPE(text) RETURNING CONTENT) AS text
FROM   table_name;

Which, for the sample data:

CREATE TABLE table_name (id, text) AS
  Select 1, '<font face="Microsoft Sans Serif" size="8" color="#000000">ABC 123</font>' FROM DUAL UNION ALL
  Select 2, '<font face="Microsoft Sans Serif" size="8" color="#000000">1 ETG</font>' FROM DUAL UNION ALL
  Select 3, '<font face="Microsoft Sans Serif" size="8" color="#000000">Gatelys PERC</font>' FROM DUAL UNION ALL
  Select 4, '<font face="Microsoft Sans Serif" size="8" color="#000000">Gml TK 144</font>' FROM DUAL UNION ALL
  Select 5, '<font attr=">">XYZ</font>' FROM DUAL;

Outputs:

ID TEXT
1 ABC 123
2 1 ETG
3 Gatelys PERC
4 Gml TK 144
5 XYZ

If you want to naively remove all tags using string functions then you can use:

SELECT id,
       REGEXP_REPLACE(text, '<.*?>') AS text
FROM   table_name;

Which, for your sample data, outputs:

ID TEXT
1 ABC 123
2 1 ETG
3 Gatelys PERC
4 Gml TK 144
5 ">XYZ

Note, it will fail if you have attributes inside a tag that contain a > character.

If you want a more comprehensive query then:

SELECT id,
       REGEXP_REPLACE(text, q'{<([^"']+?|".*?"|'.*?')+?>}') AS text
FROM   table_name;

Which outputs the same as the XML query.

fiddle

1
SelVazi On

This is a way to do it using regexp_substr :

select ID,regexp_substr(TEXT, '>([^<>]+)<', 1, 1, null, 1) 
from mytable
where TEXT is not null;