I have a text file that I want to extract links from.

The problem is that the text file is only one line with a lot of links!

Or rather, when I open it in Notepad it is wrapped across a lot of lines, but not in any organized way.

Sample text:

[{"participants": ["minanageh379", "xcsadc"], "conversation": [{"sender": "minanageh379", "created_at": "2019-04-12T12:51:56.560361+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:51.923138+00:00", "text": "sd"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:41.689524+00:00", "text": "sdsa"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:50:57.283147+00:00", "text": "‍❤️‍‍"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:35.248517+00:00", "text": "czx"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:34.352752+00:00", "text": "dsad"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:30.889023+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:38:54.823472+00:00", "text": "hi hi hi"}]}]

Expected result:

https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2

https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2

Updated sample:

{"sender": "ncccy", "created_at": "2019-01-28T17:09:29.216184+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/57c43d748xcasc1abf58c890c5a6df042/5D199AE8/t51.2885-15/e15/p480x480/49913269_2181952555454636_8892094125900591548_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg1NjgsdasdAwNjgxNTk1OTY0OTIwMTA1NTMzNDQ%3D.2"},

5 Answers

Julio On Best Solutions

Try with this:

First of all, we remove every character that cannot be part of a valid URL, except quotes and whitespace. This strips the emojis, which seem to cause trouble with the Boost regex engine in Notepad++ under some circumstances.

Our first replacement will be:

Search: [^a-zA-Z0-9_\-.~:\/?#\[\]@!$&'()*+,;=%"\s]

Replace by: (leave empty)

Replace all

(This step may not be needed in future versions of Notepad++.)
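For readers who want to try the cleanup outside Notepad++, here is a minimal sketch using Python's re module as a stand-in for the Boost engine. The function name and the sample string are hypothetical. Note that curly braces are not in the allowed set, so they are removed as well and the cleaned text is no longer valid JSON.

```python
import re

# Same character class as the Notepad++ search above: anything that is not a
# URL character, a double quote, or whitespace is deleted.
NON_URL_CHARS = re.compile(r"""[^a-zA-Z0-9_\-.~:/?#\[\]@!$&'()*+,;=%"\s]""")


def strip_non_url_chars(text):
    """Return text with emojis and other non-URL characters removed."""
    return NON_URL_CHARS.sub("", text)


# Hypothetical one-message sample containing a heart emoji.
sample = '{"sender": "xcsadc", "text": "hi \u2764\ufe0f"}'
cleaned = strip_non_url_chars(sample)
print(cleaned)  # the braces and the emoji are gone
```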

After the clean up, we do the following replacement:

Search: (?i)(?:(?:(?!https?:).(?!https?:))*?"sender"\s*+:\s*+"([^"]*)"|\G)(?:.(?!"sender"\s*+:\s*+))*?(https?:.*?(?=[^a-zA-Z0-9_\-.~:\/?#\[\]@!$&'()*+,;=%]|https?:))|.*

Replacement: (?{1}\n\n\1\t\2:(?{2}\t\2)

Replace all

This should work even with "text" attributes that contain several URLs. The URLs will be separated by tabs.

So after applying the previous procedure to this data:

[{"participants": ["minanageh379", "xcsadc"], "conversation": [{"sender": "minanageh379", "created_at": "2019-04-12T12:51:56.560361+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2   http://foo.barhttps://bar.foo"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:51.923138+00:00", "text": "sd"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:41.689524+00:00", "text": "sdsa"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:50:57.283147+00:00", "text": "‍❤️‍‍"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:35.248517+00:00", "text": "czx"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:34.352752+00:00", "text": "dsad"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:30.889023+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:38:54.823472+00:00", "text": "hi hi hi"}, {"sender": "no_media_no_text", "created_at": "2019-04-12T12:38:54.823472+00:00"}, {"sender": "url_inside_text", "created_at": "2019-04-12T12:38:54.823472+00:00", "text": "Hi! 
{check} this url: \"http://foo.bar\" another url: https://new.url.com/ yet another one: https://google.com/"}, {"sender": "ncccy", "created_at": "2019-01-28T17:09:29.216184+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/57c43d748xcasc1abf58c890c5a6df042/5D199AE8/t51.2885-15/e15/p480x480/49913269_2181952555454636_8892094125900591548_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg1NjgsdasdAwNjgxNTk1OTY0OTIwMTA1NTMzNDQ%3D.2"}, {"sender": "ny", "created_at": "2017-10-22T20:49:50.042588+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/19d94ea45c2102a0f7c97838ef546b93/5D14B3C3/t51.2885-15/e15/22708873_149637425772501_5029503881546039296_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjc4MzA3MDIyMTI3NDE3Njc3NTQxNTM1NTI2MjQyMjIyMDg%3D.2"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:35.248517+00:00", "text": "czx"}]}]

We get:

minanageh379    https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2    http://foo.bar  https://bar.foo

xcsadc  https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2

url_inside_text http://foo.bar  https://new.url.com/    https://google.com/

ncccy   https://scontent-lax3-1.cdninstagram.com/vp/57c43d748xcasc1abf58c890c5a6df042/5D199AE8/t51.2885-15/e15/p480x480/49913269_2181952555454636_8892094125900591548_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg1NjgsdasdAwNjgxNTk1OTY0OTIwMTA1NTMzNDQ%3D.2

ny  https://scontent-lax3-1.cdninstagram.com/vp/19d94ea45c2102a0f7c97838ef546b93/5D14B3C3/t51.2885-15/e15/22708873_149637425772501_5029503881546039296_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjc4MzA3MDIyMTI3NDE3Njc3NTQxNTM1NTI2MjQyMjIyMDg%3D.2

You may get duplicate URLs if they are duplicated in the original input (in the same or in different attributes).

Once processed you can remove duplicates with this regex:

Search: (?i)\t(https?:\S++)(?=[^\n]+\1)

Replace by: (nothing)

Replace All
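The same de-duplication can be sketched in Python for one line of the tab-separated output. This is an illustration, not a port of the regex: it keeps the first occurrence of each URL, whereas the lookahead above removes the earlier copies and keeps the last one. The function name and the sample line are hypothetical.

```python
def dedupe_urls(line):
    """Drop repeated URLs from one 'sender<TAB>url<TAB>url...' line."""
    sender, *urls = line.split("\t")
    # dict.fromkeys keeps only the first occurrence and preserves order.
    unique = list(dict.fromkeys(urls))
    return "\t".join([sender, *unique])


line = "minanageh379\thttp://foo.bar\thttps://bar.foo\thttp://foo.bar"
deduped = dedupe_urls(line)
print(deduped)
```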

Toto On
  • Ctrl+H
  • Find what: (?:^|\G).*?"media": "(https://[^"]+)(?:(?!https:).)*
  • Replace with: $1\n
  • check Wrap around
  • check Regular expression
  • UNCHECK . matches newline
  • Replace all

Explanation:

(?:^|\G)            # beginning of line OR restart from last match position
.*?                 # 0 or more any character but newline, not greedy
"media": "          # literally
(                   # start group 1
  https://[^"]+     # https:// followed by 1 or more non-double-quote characters, the url
)                   # end group 1
(?:(?!https:).)*    # Tempered greedy token: consume the rest as long as no "https:" follows

Replacement:

$1         # content of group 1, the URL

Result for given example:

https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2
https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2
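As a rough equivalent outside Notepad++, the capture group of the search above can be used on its own with Python's re.findall, which collects every match instead of rewriting the text. The sample data and names below are hypothetical, and the closing quote is added to the pattern only to make the end of the value explicit.

```python
import re

# Capture the https URL that follows the literal key "media": "
MEDIA_URL = re.compile(r'"media": "(https://[^"]+)"')


def media_links(text):
    """Return every URL stored under a "media" key, in document order."""
    return MEDIA_URL.findall(text)


# Hypothetical sample with the same shape as the question's data.
sample = (
    '[{"conversation": ['
    '{"sender": "a", "media": "https://example.com/one.jpg?x=1"}, '
    '{"sender": "b", "text": "hi hi hi"}, '
    '{"sender": "b", "media": "https://example.com/two.jpg"}]}]'
)
links = media_links(sample)
print("\n".join(links))
```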


robinCTS On

To extract just the links from the text file, do a regular expression Replace All using the following:

Find what:

.*?(https?:[^"]+)(?(?!.*?https?:).*)

Replace with:

$1\n\n


Note that you need to have Wrap around checked in case the insertion point is not at the start of the text.

Explanation:

.*?(https?:[^"]+)(?(?!.*?https?:).*)
|_||____________||_________________|
 |    ____|               |
 |   |    ________________|
 |   |   |
 |   |  [3] If there are no more following links, grab and discard the rest of the text
 |  [2] Store the link in $1 (starting with http and ending just before the first following ")
[1] Grab and discard everything up 'til the first link (i.e. starting with http: or https:)

With Replace All, the search and replace repeats until the regex no longer matches. Each new search starts where the previous match ended: just before the double quote that closes the current link if more links follow, or at the end of the text otherwise.
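The link part of this regex, https?:[^"]+, also works unchanged in Python. A minimal sketch with hypothetical sample data follows; unlike a "media"-only search, it picks up http and https links under any key.

```python
import re

# Step [2] of the explanation: from http: or https: up to the next double quote.
LINK = re.compile(r'https?:[^"]+')


def all_links(text):
    """Return every http(s) link, each ending just before the next double quote."""
    return LINK.findall(text)


sample = (
    '[{"sender": "a", "text": "see http://foo.bar"}, '
    '{"sender": "b", "media": "https://example.com/p.jpg?x=1"}]'
)
found = all_links(sample)
print("\n\n".join(found))
```

Because the match only stops at a double quote, any words that follow a link inside a "text" value would be captured along with it.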



To also extract the sender, use the following:

Find what:

.*?\{(?:([^"]*)"){4}[^{}]*?(https?:[^"]+)(?(?!.*?https?:).*)

Replace with:

$1 $2\n\n


Explanation:

Coming tomorrow


An alternative regex to do the same, but which is probably a little clearer is:

.*?"sender": "([^"]*)[^}]*?(https?:[^"]+)(?(?!.*?https?:).*)


Explanation:

.*?"sender": "([^"]*)[^}]*?(https?:[^"]+)(?(?!.*?https?:).*)
|_||_________||_____||____||____________||_________________|
 |   ___|  ______|  ___|  _______|  _____________|
 |  |   __|  ______|  ___|  _______|
 |  |  |   _|   _____|  ___|
 |  |  |  |   _|  _____|
 |  |  |  |  |   |
 |  |  |  |  |  [6] If there are no more following links, grab and discard the rest of the text
 |  |  |  | [5] Store the link in $2 (starting with http and ending just before the first following ")
 |  |  | [4] Grab and discard everything within the current set of braces up 'til the link
 |  | [3] Store the sender name in $1 
 | [2] Grab and discard "sender": " (i.e. up to the opening quote of the sender name)
[1] Grab and discard everything up 'til the first "sender" key which has an associated link

Step [1] works by initially starting at the beginning of the text and grabbing everything up to the first sender key, then grabbing the key via [2], grabbing the sender name in [3], and grabbing everything up to the associated link if it exists in [4]. If there is no associated link, [5] fails, and the regex backtracks to step [1] which continues grabbing everything after the first sender key up to the second sender key. This cycle is repeated until a sender key is found which has an associated link.

At this point, step [5] succeeds and then step [6] either grabs the rest of the text or nothing.

Finally, all the grabbed text is replaced by $1 $2\n\n, i.e. the sender name followed by a space, the link and then two newline characters.

This completes the first "replace". Since Replace All was selected, the whole process starts again, but with the text pointer either at the double quote at the end of the previously found link, or at the end of the text instead of at the start.
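Steps [2] to [5] can be tried in Python as well. The sketch below reuses the middle of the regex as-is and lets re.findall do the repeated searching; the leading .*? and the trailing conditional are only needed for the replace-based approach. The sample data and names are hypothetical.

```python
import re

# [2] literal key, [3] sender name, [4] rest of the current object, [5] the link.
SENDER_LINK = re.compile(r'"sender": "([^"]*)[^}]*?(https?:[^"]+)')


def sender_links(text):
    """Return (sender, link) pairs for every message that has a link."""
    return SENDER_LINK.findall(text)


sample = (
    '[{"sender": "a", "created_at": "2019-04-12T12:51:51+00:00", "text": "sd"}, '
    '{"sender": "b", "created_at": "2019-04-12T12:39:30+00:00", '
    '"media": "https://example.com/p.jpg"}]'
)
pairs = sender_links(sample)
for sender, link in pairs:
    print(sender, link)
```

The message from sender "a" is skipped because [^}]*? cannot cross the closing brace of its object, which mirrors the backtracking described above.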

JKing On

While the other answers do exactly what you need, note that the string you gave is valid JSON. You can verify that with any JSON validator.

If you're dealing with this string in a program, you may want to consider using a JSON parser for your language. Python, for example, ships one in its standard json module.
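A minimal sketch of that approach in Python, assuming the structure shown in the question (a list of threads, each with a "conversation" list whose messages may carry a "media" key). The function name and the sample data are hypothetical.

```python
import json


def media_urls(raw):
    """Parse the JSON text and return the value of every "media" key."""
    data = json.loads(raw)
    return [
        message["media"]
        for thread in data
        for message in thread["conversation"]
        if "media" in message
    ]


# Hypothetical sample with the same shape as the question's data.
raw = json.dumps([{
    "participants": ["a", "b"],
    "conversation": [
        {"sender": "a", "media": "https://example.com/one.jpg"},
        {"sender": "b", "text": "hi hi hi"},
        {"sender": "b", "media": "https://example.com/two.jpg"},
    ],
}])
urls = media_urls(raw)
print("\n".join(urls))
```

To run it on a real file, read the file's text and pass it to media_urls.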

Julio On

Yet another alternative would be to parse the JSON data.

You can do this with JavaScript.

The following snippet should work for parsing your data. It should even work with several URLs inside the same text message:

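// yourJSON must hold the parsed data (paste the array literal, or use JSON.parse).
yourJSON[0].conversation
    // Keep messages that have media, or text that contains a URL.
    .filter(x => x.media !== undefined || (x.text !== undefined && /https?:/i.test(x.text)))
    .map(x => {
        // Join only the fields that exist, so "undefined" never enters the string.
        const tmp = [x.text, x.media].filter(s => s !== undefined).join(' ');
        // match() returns null when nothing matches, so fall back to an empty list.
        const urls = tmp.match(/https?:[\w\-.~:\/?#\[\]@!$&'()*+,;=%]*/gi) || [];
        return x.sender + ":\n" + urls.join("\n");
    })
    .join("\n\n");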

You can paste that JavaScript (replacing yourJSON with your data) into the JavaScript console of a browser such as Firefox or Chrome. In Firefox you can open the console with Ctrl+Shift+K, and in Chrome with Ctrl+Shift+I followed by a click on the Console tab.

As an alternative, you may use this jsfiddle instead.

Edit the JavaScript pane to use your data and then push the "Run" button.
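If you would rather not use a browser, the same idea can be sketched in Python. This is a loose port rather than an exact translation: the function name and the sample data are hypothetical, and re.ASCII is set so that \w matches the same characters as in JavaScript.

```python
import json
import re

# Same URL character class as the JavaScript snippet above.
URL = re.compile(r"https?:[\w\-.~:/?#\[\]@!$&'()*+,;=%]*", re.IGNORECASE | re.ASCII)


def urls_by_sender(raw):
    """Return 'sender:' blocks listing the URLs found in text and media."""
    conversation = json.loads(raw)[0]["conversation"]
    blocks = []
    for message in conversation:
        # Missing keys become empty strings, so only real content is searched.
        combined = " ".join(message.get(key, "") for key in ("text", "media"))
        urls = URL.findall(combined)
        if urls:  # skip messages without any URL
            blocks.append(message["sender"] + ":\n" + "\n".join(urls))
    return "\n\n".join(blocks)


raw = json.dumps([{
    "participants": ["a", "b", "c"],
    "conversation": [
        {"sender": "a", "media": "https://example.com/p.jpg"},
        {"sender": "b", "text": "hi hi hi"},
        {"sender": "c", "text": "see https://new.url.com/ and https://google.com/"},
    ],
}])
report = urls_by_sender(raw)
print(report)
```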