How to extract video urls and titles from OK.ru/video using the CLI

2.6k views Asked by At

EDIT 1: I'd like to extract video urls and titles from "https://ok.ru/video/c1404844" results using the CLI.

Here's want I've done so far :

The ERE pattern for each video relative URL is : /video/\d+ and the video absolute URL looks like this : https://ok.ru$videoRelativeURL

I can use this command to extract the video urls (I use uniq because many video IDs appear 3 times) :

$ curl -s https://ok.ru/video/c1404844 | grep -oP "/video/\d+" | uniq | sed "s|^|https://ok.ru|" | head -5
https://ok.ru/video/1896971373228
https://ok.ru/video/1896971438764
https://ok.ru/video/1896971569836
https://ok.ru/video/1896971635372
https://ok.ru/video/1898415590060

Then I tried extracting the video relativeURLs + title with pup.

EDIT 3 : I replaced the class name video-card_n ellip by video-card_n.ellip. However pup only outputs the attribute of the second class (video-card_n.ellip), strange :

$ curl -s https://ok.ru/video/c1404844 | pup '.video-card_lk attr{href}, .video-card_n.ellip attr{title}' | head -5
Death.in.Paradise.S02E05.WEBRip.x264-ION10
Death.in.Paradise.S02E02.WEBRip.x264-ION10
Death.in.Paradise.S02E04.WEBRip.x264-ION10
Death.in.Paradise.S02E03.WEBRip.x264-ION10
Death.in.Paradise.S02E06.WEBRip.x264-ION10

It didn't work so I converted the expanded html to json with this command :

$ curl -s https://ok.ru/video/c1404844 | pup 'json{}' > c1404844.json

Now I want to try and extract the title from video-card_n ellip and the href from video-card_lk from the resulting json file with the jq tool but I know how to use jq enough.

I'd like jq (or pup) to output a flat file : the url as the first column and the title as the second column.

EDIT 2 : A big thank you to @peak for his help on jq !

DONE :

$ curl -s https://ok.ru/video/c1404844 | pup 'json{}' | jq -r 'recurse | arrays[] | select(.class == "video-card_lk").href,select(.class == "video-card_n ellip").title' | awk '{videoRelativeURL = $0;url="https://ok.ru"gensub("?.*$","",videoRelativeURL); getline title; print url" # "title}' | head
https://ok.ru/video/1898417425068 # Death.in.Paradise.S02E05.WEBRip.x264-ION10
https://ok.ru/video/1898417359532 # Death.in.Paradise.S02E02.WEBRip.x264-ION10
https://ok.ru/video/1898417293996 # Death.in.Paradise.S02E04.WEBRip.x264-ION10
https://ok.ru/video/1898417228460 # Death.in.Paradise.S02E03.WEBRip.x264-ION10
https://ok.ru/video/1898417162924 # Death.in.Paradise.S02E06.WEBRip.x264-ION10
https://ok.ru/video/1898417097388 # Death.in.Paradise.S02E07.WEBRip.x264-ION10
https://ok.ru/video/1898417031852 # Death.in.Paradise.S02E08.WEBRip.x264-ION10
https://ok.ru/video/1898416966316 # Death.in.Paradise.S02E01.WEBRip.x264-ION10
https://ok.ru/video/1898416769708 # Death.in.Paradise.S07E02.The.Stakes.Are.High.WEBRip.x264-ION10
https://ok.ru/video/1898416704172 # Death.in.Paradise.S07E03.Written.in.Murder.WEBRip.x264-ION10
...
2

There are 2 answers

3
Reino On BEST ANSWER

If you want to scrape specific information from a HTML-source, then there's no need for 5 different tools! Please have a look at . It can do it all.

$ xidel -s https://ok.ru/video/c1404844 -e '
  //div[@data-id]/join(
    (
      div[@class="video-card_img-w"]/a/resolve-uri(substring-before(@href,"?")),
      div[@class="video-card_n-w"]/a
    ),
    " # "
  )
'
https://ok.ru/video/1898417425068 # Death.in.Paradise.S02E05.WEBRip.x264-ION10
https://ok.ru/video/1898417359532 # Death.in.Paradise.S02E02.WEBRip.x264-ION10
https://ok.ru/video/1898417293996 # Death.in.Paradise.S02E04.WEBRip.x264-ION10
https://ok.ru/video/1898417228460 # Death.in.Paradise.S02E03.WEBRip.x264-ION10
https://ok.ru/video/1898417162924 # Death.in.Paradise.S02E06.WEBRip.x264-ION10
https://ok.ru/video/1898417097388 # Death.in.Paradise.S02E07.WEBRip.x264-ION10
https://ok.ru/video/1898417031852 # Death.in.Paradise.S02E08.WEBRip.x264-ION10
https://ok.ru/video/1898416966316 # Death.in.Paradise.S02E01.WEBRip.x264-ION10
https://ok.ru/video/1898416769708 # Death.in.Paradise.S07E02.The.Stakes.Are.High.WEBRip.x264-ION10
https://ok.ru/video/1898416704172 # Death.in.Paradise.S07E03.Written.in.Murder.WEBRip.x264-ION10
[...]
7
peak On

After using to convert the HTML of the top-level page to JSON, the following jq filter produces 24 pairs, the first two of which are shown under "Output" below:

[ [ .. | arrays[] | select(.class == "video-card_n ellip").title],
  [ .. | arrays[] | select(.class == "video-card_lk").href]]
| transpose

Output


[
  [
    "Замечательная пара, красивая песня и чудесное исполнение! Золотые голоса!",
    "/video/2406311403450?st._aid=VideoState_open_top"
  ],
  [
    "#СидимДома",
    "/video/1675421949619?st._aid=VideoState_open_top"
  ],
  ...