I am trying to get the links in a web page using core Java. I am following the code given in Extract links from a web page, with some modifications.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

try {
    URL url = new URL("http://www.stackoverflow.com");
    InputStream is = url.openStream(); // throws an IOException
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    String line;
    while ((line = br.readLine()) != null) {
        if (line.contains("href="))
            System.out.println(line.trim());
    }
} catch (IOException e) {
    e.printStackTrace();
}
With respect to extracting each link, most of the answers in the above post suggest pattern matching. However, as I understand it, pattern matching is an expensive operation, so I want to use indexOf and substring operations instead to get the link text from each line, as below:
private static Set<String> getUrls(String line, int firstIndexOfHref) {
    int startIndex = firstIndexOfHref;
    int endIndex;
    Set<String> urls = new HashSet<>();
    while (startIndex != -1) {
        try {
            // "href=\"" is 6 characters, so the URL text starts at startIndex + 6
            endIndex = line.indexOf("\"", startIndex + 6);
            String url = line.substring(startIndex + 6, endIndex);
            urls.add(url);
            // jump to the next href="http occurrence, or -1 to end the loop
            startIndex = line.indexOf("href=\"http", endIndex);
        } catch (Exception e) {
            e.printStackTrace();
            break; // without this, a missing closing quote would loop forever
        }
    }
    return urls;
}
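For example, this is how I call it on a line containing two links (a made-up input, just to show usage):

    String line = "<p><a href=\"http://example.com/a\">A</a> and <a href=\"http://example.com/b\">B</a></p>";
    Set<String> urls = getUrls(line, line.indexOf("href=\"http"));
    System.out.println(urls); // both URLs, e.g. [http://example.com/b, http://example.com/a] (set order may vary)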
I have tried this on a few pages and it's working properly. However, I am not sure if this approach always works. I want to know if this logic can fail in some real-world scenarios.
Please help.
Your code is relying on well-formed HTML with each link on a single line. It will not handle the various other ways a link can be written with
<a href
such as single quotes, no quotes at all, extra whitespace (including newlines) between the "a", the "href" and the "=", relative paths, and other protocols such as file: or ftp:. Some examples you would need to consider (illustrative URLs):

    <a href='http://example.com/page'>

or

    <a href=http://example.com/page>

or

    <a
       href = "http://example.com/page">
That's why the other question has many answers, including ones based on an HTML parser or validator, and on regex patterns.
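For illustration, a minimal sketch using an HTML parser. The specific library here (jsoup) is my choice, not something from the original post, but it copes with all of the quoting and whitespace variants above and can resolve relative paths:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class LinkExtractor {
        public static void main(String[] args) throws Exception {
            // jsoup parses real-world HTML, tolerating odd quoting and whitespace
            Document doc = Jsoup.connect("http://www.stackoverflow.com").get();
            // select every <a> element that carries an href attribute
            for (Element link : doc.select("a[href]")) {
                // "abs:href" resolves relative paths against the page's base URL
                System.out.println(link.attr("abs:href"));
            }
        }
    }

Because the parser works on the DOM rather than on raw lines, it is unaffected by links spanning multiple lines, which your line-by-line approach cannot handle at all.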