Oracle: Optimize large input regex_substr parsing to use POSITION parameter instead of OCCURRENCE parameter

155 views Asked by At

I need help optimizing my code or providing alternatives that will provide time efficient results.

I use Oracle 11GR2 11.2.0.4.0

I am writing a package that will take a clob that is filled with HTML source and grab the text between the html tags. The code that I have works as expected, but depending on the size of the contents in the clob, it can take forever to execute.

I understand that regexp_substr using the occurrence parameter can be a bottle neck, and I believe if I can migrate it to using the position parameter then it will run faster.

I have four functions in total, where two of the functions are to pipe the rows returned from an array of the functions respective list function.

Each list function (return_rows_list or return_rows_list2) will parse the HTML differently. For instance, return_rows_list will use sub-query factoring and return_rows_list2 will use PL/SQL and variables to store the results from the regexp_substr. I created the return_rows_list2 function to get an idea on which method would be the quickest. Both functions are to slow for any real use scenarios and are within 10 seconds of each other.

I need to somehow replace the regexp_substr that uses the occurrence parameter to instead use the position parameter using regexp_instr.

The package below requires some source HTML code to work properly. I suggest taking any webpage and feeding the source into the a variable. I am using select statements to feed into a clob, because in reality the source will be larger than 32767characters and PL/SQL won't accept string literals larger than that.

Package Source:

CREATE OR REPLACE package pipe_html_contents as
type rows_t is record( rec varchar2(32767));

type t_rows_list is table of rows_t index by binary_integer;
type t_rows_tab is table of rows_t;

function return_rows_list return t_rows_list;

function return_rows return t_rows_tab PIPELINED;

function return_rows2 return t_rows_tab PIPELINED;

function return_rows_list2 return t_rows_list;

end pipe_html_contents;
/

Package Body:

CREATE OR REPLACE package body pipe_html_contents as

function return_rows return t_rows_tab PIPELINED IS
l_returnValue t_rows_list;
begin
l_returnValue:=return_rows_list;
for i in 1 .. l_returnValue.count loop
    pipe row (l_returnValue(i));
  end loop;

  return;
end;

function return_rows2 return t_rows_tab PIPELINED IS
l_returnValue t_rows_list;
begin
l_returnValue:=return_rows_list2;
for i in 1 .. l_returnValue.count loop
    pipe row (l_returnValue(i));
  end loop;

  return;
end;

function return_rows_list return t_rows_list IS
l_returnValue t_rows_list;
p_clob clob:='';
c_temp clob;
    l_count                        pls_integer := 0;       
begin
   for l_rec in ( with src    as (select (select SOURCE from temp)) as html from dual) --Source HTML Code goes here. Replace the select SOURCE from temp with the source HTML code from your website of choice.
                    ,fields as (select html
                                      ,regexp_replace(regexp_replace(html
                                                                    ,'<[^<>]+>'
                                                                    ,'▌')
                                                     ,'▌+'
                                                     ,'▌') data
                                      ,length(regexp_replace(regexp_replace(regexp_replace(html
                                                                                          ,'<[^<>]+>'
                                                                                          ,'▌')
                                                                            ,'▌+'
                                                                            ,'▌') 
                                                            ,'[^▌]+'
                                                            ,''))-2 num_fields
                                from src)
                select trim(regexp_substr(data,'[^▌]+',1,level)) field
                from   fields
                connect by level <= num_fields)
                LOOP

                   if length(l_rec.field) > 2 then
                     l_count:=l_count +1;
                     l_returnValue(l_count).rec:=l_rec.field;
                   end if;

                END LOOP;

    return l_returnValue;
end;


function return_rows_list2 return t_rows_list IS
l_returnValue t_rows_list;
p_clob clob;
c_temp clob;
l_Matches pls_integer;
v_match clob:=null;
    l_count                        pls_integer := 0;       
begin

select SOURCE into p_clob from temp; --Source HTML Code goes here. Replace the select SOURCE into p_clob from temp with the source HTML code from your website of choice.

c_temp:=regexp_replace(regexp_replace(p_clob , '<[^<>]+>' ,'▌'),'^[▌]+' ,'▌');

l_Matches:=length(regexp_replace(regexp_replace(regexp_replace(p_clob, '<[^<>]+>','▌'),'^[▌]+','▌'),'[^▌]+',''))-2;

for i in 1 .. l_Matches
LOOP
    v_match:=trim(regexp_substr(c_temp,'[^▌]+',length(regexp_replace(regexp_replace(regexp_replace(c_temp, '<[^<>]+>','▌'),'^[▌]+','▌'),'[^▌]+',''))+1,i));

    if length(v_match) > 2 then
        l_count:= l_count+1;
        l_returnValue(l_count).rec:=v_match;
    end if;
END LOOP;                                         
    return l_returnValue;
exception 
  when others then
     dbms_output.put_line(sqlerrm);
end;
end pipe_html_contents;
/

Usage Example:

select * from table(pipe_html_contents.return_rows); -- for sub-query factoring

select * from table(pipe_html_contents.return_rows2); -- for Make shift sub-query facoring
0

There are 0 answers