Scrape automatic captions from YouTube using Selenium and R

I'm trying to get the auto-generated captions from YouTube videos for a research project. Before turning to Selenium, I tried the YouTube Data API through the "tuber" package. That only works when the uploader has provided a caption track, and unfortunately none of the videos I need to analyze have uploaded captions.

My idea was to use Selenium to read the captions directly from the player. The HTML looks something like this:

<div class="caption-window ytp-caption-window-bottom ytp-caption-window-rollup" id="caption-window-1" dir="ltr" tabindex="0" aria-live="assertive" style="touch-action: none; text-align: left; left: 21.2%; height: 40px; width: 287px; bottom: 2%;" data-layer="4" lang="en"><span class="captions-text"><span style="background: rgba(8, 8, 8, 0.75) none repeat scroll 0% 0%; box-decoration-break: clone; border-radius: 2px; font-size: 16px; color: rgb(255, 255, 255); fill: rgb(255, 255, 255); font-family: &quot;YouTube Noto&quot;,Roboto,&quot;Arial Unicode Ms&quot;,Arial,Helvetica,Verdana,&quot;PT Sans Caption&quot;,sans-serif;">&nbsp;load the our selenium package into this<span style="color: rgb(204, 204, 204); fill: rgb(204, 204, 204);">&nbsp;<br>&nbsp;session</span> so it's loaded now&nbsp;</span></span></div>

As you can see, the plain caption text is nested inside the <span class="captions-text"> element. I used this code to retrieve the caption text:

# install once if needed
install.packages("RSelenium")

library(RSelenium)

# starting driver on port/browser
rD <- rsDriver(port = 4555L, browser = "firefox")
# remote driver client-side
remDr <- rD[["client"]]
# navigate to web page
remDr$navigate("https://www.youtube.com/watch?v=qUKEPurS6-s")

# stop autoplay ("class name" is the full name of the locator strategy)
play_button <- remDr$findElement(using = "class name", value = "ytp-play-button")
play_button$clickElement()

# activate subtitles
subtitle_button <- remDr$findElement(using = "class name", value = "ytp-subtitles-button")
subtitle_button$clickElement()

# captions text element
caption_window <- remDr$findElement(using = "class name", value = "captions-text")
# retrieve plain text; getElementText() returns a list, so take its first element
text <- caption_window$getElementText()[[1]]

Now to my question:

How can I capture the changes made to the DOM element and retrieve the text every time a new word appears? I suspect an AJAX call is updating the element, but I'm not sure.
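
To make the goal concrete, the only fallback I can think of is naive polling. The following is a minimal sketch, not a tested solution: it assumes the player keeps rendering captions into an element with class "captions-text", that the video is playing with subtitles switched on, and that a poll interval of 0.25 seconds is fine-grained enough. The helper names append_if_changed and poll_captions are my own and are not part of RSelenium.

```r
# Pure helper (hypothetical name): keep a snapshot only if it is non-empty
# and differs from the most recent snapshot already stored.
append_if_changed <- function(seen, snapshot) {
  snapshot <- trimws(snapshot)
  if (nzchar(snapshot) &&
      (length(seen) == 0 || seen[length(seen)] != snapshot)) {
    seen <- c(seen, snapshot)
  }
  seen
}

# Poll the caption element for a fixed time and collect distinct snapshots.
# remDr is an RSelenium remote driver that is already on the video page.
poll_captions <- function(remDr, seconds = 30, interval = 0.25) {
  seen <- character(0)
  end_time <- Sys.time() + seconds
  while (Sys.time() < end_time) {
    # findElements() returns an empty list when no caption is on screen,
    # whereas findElement() would raise an error
    elems <- remDr$findElements(using = "class name", value = "captions-text")
    if (length(elems) > 0) {
      # the element can be replaced between lookup and read, so guard the call
      snapshot <- tryCatch(elems[[1]]$getElementText()[[1]],
                           error = function(e) "")
      seen <- append_if_changed(seen, snapshot)
    }
    Sys.sleep(interval)
  }
  seen
}
```

With the session from above this would be called as captions <- poll_captions(remDr, seconds = 60). Polling misses any caption that appears and disappears between two polls, and consecutive snapshots overlap because the caption window rolls up line by line, so something event-driven would be preferable.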

Thanks :)
