I have a problem getting hyperlinks from IHTMLDocument2 in Delphi. For instance, instead of returning the full link "http://ena.ge/explanatory-online", IHTMLDocument2 returns "about:/explanatory-online". Simply substituting "about:" with the root URL does not work in all cases.
Here is the code I am using:
procedure process_url(MyURL: string; var MyHTML, MyHyperlinks: TStrings;
  var MyInnerText, MyInnerHTML: WideString);
var
  resp: TMemoryStream;
  IdHTTP: TIdHTTP;
  v: Variant;
  iDoc: IHTMLDocument2;
  links: OleVariant;
  MyHyperlink, aHref: string;
  i: Integer;
begin
  resp := TMemoryStream.Create;
  IdHTTP := TIdHTTP.Create(nil);
  iDoc := CoHTMLDocument.Create as IHTMLDocument2;
  try
    IdHTTP.Get(MyURL, resp);
    resp.Position := 0;
    MyHTML.LoadFromStream(resp, TEncoding.UTF8);
  finally
    resp.Free;
    IdHTTP.Free;
  end;
  v := VarArrayCreate([0, 0], varVariant);
  v[0] := MyHTML.Text;
  iDoc.write(PSafeArray(System.TVarData(v).VArray));
  iDoc.designMode := 'off';
  while iDoc.readyState <> 'complete' do
    Application.ProcessMessages;
  ShowMessage(iDoc.url);
  MyInnerText := iDoc.body.innerText;
  MyInnerHTML := iDoc.body.innerHTML;
  links := iDoc.all.tags('A');
  if links.Length > 0 then
  begin
    for i := 0 to links.Length - 1 do
    begin
      aHref := links.Item(i).href;
      MyHyperlinks.Add(aHref);
    end;
  end;
end;
Look at the source of the page and you will see that the links are relative, for example: href="/explanatory-online". When you download the page with TIdHTTP and write the HTML into IHTMLDocument2 yourself, the document does not know the original page address, so relative links are resolved against "about:" instead of the site. You can use TWebBrowser, replace the prefix manually, or use IHTMLDocument4.
Example 1 (TWebBrowser):
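A minimal sketch of this approach, assuming a TWebBrowser that is already placed on a VCL form. The procedure name CollectLinksFromBrowser and its parameters are hypothetical. Because the control navigates to the page itself, the document knows its own address and href comes back fully resolved.

```delphi
uses
  System.SysUtils, System.Classes, Vcl.Forms, SHDocVw, MSHTML;

// Hypothetical helper: Browser is an existing TWebBrowser on a form.
procedure CollectLinksFromBrowser(Browser: TWebBrowser; const AUrl: string;
  Links: TStrings);
var
  Doc: IHTMLDocument2;
  Anchors: IHTMLElementCollection;
  Anchor: IHTMLAnchorElement;
  I: Integer;
begin
  Browser.Navigate(AUrl);
  // Pump messages until the control reports that loading has finished.
  while Browser.ReadyState <> READYSTATE_COMPLETE do
    Application.ProcessMessages;

  if not Supports(Browser.Document, IHTMLDocument2, Doc) then
    Exit;

  // Doc.links holds the elements that carry an href; only those that
  // implement IHTMLAnchorElement (the <a> tags) are collected here.
  Anchors := Doc.links;
  for I := 0 to Anchors.length - 1 do
    if Supports(Anchors.item(I, 0), IHTMLAnchorElement, Anchor) then
      Links.Add(Anchor.href);
end;
```

The wait loop has no timeout, so a page that never finishes loading would block the caller; real code would probably want one.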
Example 2 (replace string):
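A minimal sketch of the manual replacement. The function name FixAboutLink and the RootUrl parameter are hypothetical; RootUrl is assumed to be the scheme and host without a trailing slash. It only handles links that the document resolved against "about:", and it treats every such link as relative to the site root.

```delphi
uses
  System.SysUtils, System.StrUtils;

// Hypothetical helper: RootUrl is e.g. 'http://ena.ge' (no trailing slash).
function FixAboutLink(const AHref, RootUrl: string): string;
const
  AboutPrefix = 'about:';
begin
  if StartsText(AboutPrefix, AHref) then
  begin
    // Drop the 'about:' prefix and keep the path that follows it.
    Result := Copy(AHref, Length(AboutPrefix) + 1, MaxInt);
    if not StartsStr('/', Result) then
      Result := '/' + Result;
    Result := RootUrl + Result;
  end
  else
    Result := AHref; // already absolute, leave untouched
end;
```

In the loop from the question this could be used as MyHyperlinks.Add(FixAboutLink(aHref, 'http://ena.ge')), which turns "about:/explanatory-online" into "http://ena.ge/explanatory-online". Links that are relative to the current page's directory, or fragment-only links, would need extra handling, which is likely why a plain substitution does not cover all cases.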
Example 3 (IHTMLDocument4):
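A minimal sketch of this approach, assuming a VCL application where COM is already initialised. IHTMLDocument4.createDocumentFromUrl lets MSHTML download the page itself, so the resulting document knows its address. The procedure name CollectLinksFromUrl is hypothetical, and both the 'null' options string and the designMode step are assumptions taken from commonly seen usage rather than something the documentation requires.

```delphi
uses
  System.SysUtils, System.Classes, Vcl.Forms, MSHTML;

// Hypothetical helper: loads AUrl through MSHTML and collects <a> hrefs.
procedure CollectLinksFromUrl(const AUrl: string; Links: TStrings);
var
  Blank, Doc: IHTMLDocument2;
  Anchors: IHTMLElementCollection;
  Anchor: IHTMLAnchorElement;
  I: Integer;
begin
  Blank := CoHTMLDocument.Create as IHTMLDocument2;
  // Assumption: let the empty document finish initialising first.
  Blank.designMode := 'on';
  while Blank.readyState <> 'complete' do
    Application.ProcessMessages;

  // MSHTML fetches the page, so relative links resolve against AUrl.
  Doc := (Blank as IHTMLDocument4).createDocumentFromUrl(AUrl, 'null');
  while Doc.readyState <> 'complete' do
    Application.ProcessMessages;

  Anchors := Doc.links;
  for I := 0 to Anchors.length - 1 do
    if Supports(Anchors.item(I, 0), IHTMLAnchorElement, Anchor) then
      Links.Add(Anchor.href);
end;
```

With this variant the TIdHTTP download is no longer needed for link extraction, although it may still be wanted if the raw HTML has to be stored separately.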