How to use RegEx to filter links from an HTML document?


How do I grab specific links in a document using regex? I have an HTML file that contains Google Drive links mixed in with HTML code and other content. I am trying to grab the 50 links by using a regex to search for the keywords they have in common: drive, google, and sharing.

Example: "https://drive.google.com/file/d/1wXbzf0nvddZ0vlz6-fdN7HV/view?usp=sharing"

I want to select each link from beginning to end, and then either copy them all into another file or erase the other content so that only the links remain in the HTML document.

I have tried

`http\:\/\/www\.[a-zA-Z0-9\.\/\-]+` and `.*?(http\:\/\/www\.[a-zA-Z0-9\.\/\-]+)`

I tried searching for drive, which found nothing. Searching for http and www does produce results, but they are other links in the file that I am not trying to match. That at least shows some results, unlike searching for the specific keywords I listed.

I'm not sure if this is the proper way to go about this, or whether I should be using another method, such as JavaScript, to achieve it.

I am using Sublime Text on Mac to try and figure this out. I am new to regular expressions.


There are 2 answers

4
marcos (best answer)

The following should work:

.*drive.google.com.*sharing
  • . matches any single character (except a line break)

  • * means the preceding element can appear zero or more times
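One thing worth checking before relying on this pattern: because `.*` is greedy and the dots in drive.google.com are unescaped, the pattern matches from the start of the line up to the last "sharing" on that line, not just the link itself. The Python sketch below illustrates this on a made-up line. The sample line and the tighter pattern are assumptions for illustration, not part of the answer above.

```python
import re

# Hypothetical sample line: two Drive links embedded in HTML.
line = (
    '<p>https://drive.google.com/file/d/AAA/view?usp=sharing</p>'
    '<p>https://drive.google.com/file/d/BBB/view?usp=sharing</p>'
)

# The pattern from the answer, unchanged.
broad = re.compile(r'.*drive.google.com.*sharing')

# A tighter variant (an assumption, not from the answer): escaped dots,
# and a negated character class so a match cannot cross whitespace,
# angle brackets, or a double quote.
tight = re.compile(r'https?://drive\.google\.com/[^\s<>"]*sharing')

broad_matches = [m.group(0) for m in broad.finditer(line)]
tight_matches = tight.findall(line)

print(len(broad_matches), "match from the broad pattern:")
print(broad_matches[0])
print(len(tight_matches), "matches from the tight pattern:")
for link in tight_matches:
    print(link)
```

The broad pattern is still useful for locating lines that contain a link. The tight one is closer to what is needed when the links themselves have to be copied out.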

0
Automaton

It sounds like you are trying to do this in an editor on a Mac, but the question is tagged with "perl", so here is one way you can do this in Perl.

First, it helps to have a full example input and output to make sure we understand the desired behavior, so here is an example input test.doc:

<p>https://drive.google.com/file/d/0B3GNg0pNzNCWWdFSXNzd00/view?usp=sharing</p><br /><p>https://drive.google.com/sharing/oSmNg0pNzRjWEFyNDRzam8/view?usp=sharing<br /></p></div>
<p>http://drive.google.com/file/d/0B3GNg0pNzNCWWdFSXNzd00/view?usp=sharing</p><br/><p>https://drive.google.com/file/sharing/view?usp=sharing<br /></p></div>
https://drive.abc.com/file/d/efg/view?usp=sharing
https://drive.apple.com/file/d/abc/efg/view?usp=sharing
https://drive.google.com/file/d/xyz/skipme?usp=sharing https://drive.google.com/file/d/ef/view?usp=sharing 

I'll assume links are delimited by whitespace or by *ML tags (< and >) here. Here is a Linux one-liner that will take the input test.doc and print the matching links. The [^\s<>]+ part matches one or more characters that are neither whitespace (\s) nor < or > (the leading ^ inside the brackets makes it a negated character class). This prevents the match from running ahead and swallowing more than one link on the same line:

perl -ne '@m = $_ =~ m{(https?://drive\.google\.com/[^\s<>]+view\?usp=sharing)}g; print "$_\n" for @m;' test.doc

This would give the following output:

https://drive.google.com/file/d/0B3GNg0pNzNCWWdFSXNzd00/view?usp=sharing
https://drive.google.com/sharing/oSmNg0pNzRjWEFyNDRzam8/view?usp=sharing
http://drive.google.com/file/d/0B3GNg0pNzNCWWdFSXNzd00/view?usp=sharing
https://drive.google.com/file/sharing/view?usp=sharing
https://drive.google.com/file/d/ef/view?usp=sharing
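For anyone without Perl at hand, roughly the same extraction can be sketched in Python using the same pattern. The function name `extract_links` and the shortened sample (three lines taken from test.doc above) are hypothetical and only for illustration.

```python
import re

# Same pattern as the Perl one-liner: http or https, the literal host,
# then a run of non-whitespace, non-angle-bracket characters that ends
# in "view?usp=sharing".
LINK_RE = re.compile(r'https?://drive\.google\.com/[^\s<>]+view\?usp=sharing')


def extract_links(text):
    """Return every matching Drive link in text, in order of appearance."""
    return LINK_RE.findall(text)


# Shortened sample: three lines from the example input above.
sample = """\
<p>http://drive.google.com/file/d/0B3GNg0pNzNCWWdFSXNzd00/view?usp=sharing</p><br/><p>https://drive.google.com/file/sharing/view?usp=sharing<br /></p></div>
https://drive.abc.com/file/d/efg/view?usp=sharing
https://drive.google.com/file/d/xyz/skipme?usp=sharing https://drive.google.com/file/d/ef/view?usp=sharing
"""

links = extract_links(sample)
for link in links:
    print(link)
```

To run it against the real file instead of the inline sample, the text could be read with `pathlib.Path("test.doc").read_text()` and passed to `extract_links`.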

If the above doesn't exactly cover what you need, then please give a different input/output text fragment and someone can chime in on how you'd change the one-liner to match it.