Convert WebVTT to a Transcript using Python

I want to convert YouTube’s auto-generated subtitles into a plain transcript. Why is this so hard?

This blog post gives a more detailed explanation than my answer to this StackOverflow question.

Here’s what the subtitles look like when you view a video:

And here’s what the markup that generates those subtitles looks like:

00:00:00.930 --> 00:00:03.080 align:start position:0%

and<00:00:01.230><c> now</c><00:00:01.439><c> can</c><00:00:01.709><c> we</c><00:00:01.800><c> have</c><c.colorCCCCCC><00:00:01.920><c> a</c></c><c.colorE5E5E5><00:00:01.979><c> round</c><00:00:02.370><c> of</c><00:00:02.460><c> applause</c></c>

00:00:03.080 --> 00:00:03.090 align:start position:0%
and now can we have<c.colorCCCCCC> a</c><c.colorE5E5E5> round of applause
 </c>

00:00:03.090 --> 00:00:04.849 align:start position:0%
and now can we have<c.colorCCCCCC> a</c><c.colorE5E5E5> round of applause
for</c><c.colorCCCCCC><00:00:03.120><c> Terrence</c><00:00:03.629><c> Edwards</c><00:00:03.899><c> and</c><00:00:04.170><c> his</c></c><c.colorE5E5E5><00:00:04.200><c> talk</c><00:00:04.529><c> the</c></c>

00:00:04.849 --> 00:00:04.859 align:start position:0%
for<c.colorCCCCCC> Terrence Edwards and his</c><c.colorE5E5E5> talk the
 </c>

WTF? You’re looking at WebVTT, the Web Video Text Tracks format, which allows words to be displayed as they’re said. Each cue and each word is given a timecode and a position, and colours can be used to identify multiple speakers. It’s great for subtitles, but it is lousy if all you want to do is read a transcript.
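To see how little of that markup is actual text, here is a minimal sketch that strips the inline tags from one line of a cue with a regular expression. The name `strip_cue_tags` is my own, and the pattern assumes everything between `<` and `>` is a tag, which holds for the sample above but is not a full WebVTT parser.

```python
import re

# Matches any inline tag in the sample: <c>, </c>, <c.colorCCCCCC>, <00:00:01.230>, ...
# Assumption: anything between "<" and ">" is markup, never transcript text.
TAG = re.compile(r"<[^>]+>")


def strip_cue_tags(payload: str) -> str:
    """Remove inline tags from one line of cue text and normalise whitespace."""
    return " ".join(TAG.sub("", payload).split())


raw = (
    "and<00:00:01.230><c> now</c><00:00:01.439><c> can</c><00:00:01.709><c> we</c>"
    "<00:00:01.800><c> have</c><c.colorCCCCCC><00:00:01.920><c> a</c></c>"
    "<c.colorE5E5E5><00:00:01.979><c> round</c><00:00:02.370><c> of</c>"
    "<00:00:02.460><c> applause</c></c>"
)

print(strip_cue_tags(raw))  # and now can we have a round of applause
```

This only cleans a single line. It does nothing about the same line being repeated across several cues, which is the real problem the rest of this post deals with.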

So, how do we convert the above to something like:

and now can we have a round of applause for Terrence Edwards and his talk the connected house of horrors

Python – the quick and dirty way

Using the open source webvtt-py Python library, we can directly get the raw text of each subtitle cue:

>>> import webvtt
>>> vtt = webvtt.read('subtitles-en.vtt')
>>> vtt[0].text
' \nand now can we have a round of applause'
>>> vtt[1].text
'and now can we have a round of applause\n '
>>> vtt[2].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
>>> vtt[3].text
'for Terrence Edwards and his talk the\n '
>>> vtt[4].text
'for Terrence Edwards and his talk the\nconnected house of horrors good'
>>> vtt[5].text
'connected house of horrors good\n '
>>> vtt[6].text
'connected house of horrors good\nafternoon'

Manually looking through the text, we can see that the element at index 2 holds the first complete pair of lines, and the element at index 6 holds the next pair. Starting at 2, we can increment by 4 and grab elements 6, 10, 14, etc. to build up a transcript. Does that work?

Yes! This is what happens if we slice the array:

>>> sub = vtt[2::4]
>>> sub[0].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
>>> sub[1].text
'connected house of horrors good\nafternoon'
>>> sub[2].text
'AMF thank you so much for for coming\nhere my name is Terrence Eaton I need to'
>>> sub[3].text
'tell you three things about this talk so\nthe first thing is that this does'

But are we sure that will work for all the subtitles? Or even for the entirety of this subtitle file?
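One way to answer that question is to check whether the slice drops any text. The sketch below is my own helper, not part of webvtt-py: `lines_missed_by_stride` takes the cue texts as plain strings and returns every line that appears somewhere in the file but not in the sliced cues. With the library you could feed it `[caption.text for caption in vtt]`.

```python
def lines_missed_by_stride(cue_texts, start=2, step=4):
    """Return the lines that a fixed-stride slice of the cues would drop."""

    def split(text):
        # Ignore the whitespace-only padding lines YouTube adds to cues.
        return [ln.strip() for ln in text.splitlines() if ln.strip()]

    # Every line the slice keeps.
    kept = set()
    for text in cue_texts[start::step]:
        kept.update(split(text))

    # Every line in the whole file that the slice never saw, in order.
    missing = []
    for text in cue_texts:
        for ln in split(text):
            if ln not in kept and ln not in missing:
                missing.append(ln)
    return missing


cues = [
    " \nand now can we have a round of applause",
    "and now can we have a round of applause\n ",
    "and now can we have a round of applause\nfor Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the\n ",
    "for Terrence Edwards and his talk the\nconnected house of horrors good",
    "connected house of horrors good\n ",
    "connected house of horrors good\nafternoon",
]

print(lines_missed_by_stride(cues))              # []
print(lines_missed_by_stride(["intro"] + cues))  # one extra cue shifts the pattern
```

On the sample above nothing is lost, but a single extra cue at the start is enough to make the slice skip lines. An empty result only shows the stride worked for that particular file.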

Python – the hard way

Let’s take another look at the first 5 subtitle entries.

>>> vtt[0].text
' \nand now can we have a round of applause'
>>> vtt[1].text
'and now can we have a round of applause\n '
>>> vtt[2].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
>>> vtt[3].text
'for Terrence Edwards and his talk the\n '
>>> vtt[4].text
'for Terrence Edwards and his talk the\nconnected house of horrors good'

We can split those two-line entries using splitlines():

>>> vtt[2].text.splitlines()
['and now can we have a round of applause', 'for Terrence Edwards and his talk the']

Let’s create a new list and add every line to it, splitting each entry on \n. Stripping each entry first removes the whitespace-only lines.

lines = []
for line in vtt:
    lines.extend(line.text.strip().splitlines())

Which gives us:

>>> lines[0]
'and now can we have a round of applause'
>>> lines[1]
'and now can we have a round of applause'
>>> lines[2]
'and now can we have a round of applause'
>>> lines[3]
'for Terrence Edwards and his talk the'
>>> lines[4]
'for Terrence Edwards and his talk the'
>>> lines[5]
'for Terrence Edwards and his talk the'
>>> lines[6]
'connected house of horrors good'

And now, to de-duplicate them:

transcript = ""
previous = None
for line in lines:
    if line == previous:
        continue
    transcript += " " + line
    previous = line
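The same consecutive de-duplication can also be written with itertools.groupby from the standard library, which groups runs of equal adjacent items. This is just an alternative sketch; `collapse_repeats` is a name I made up, and joining with " " avoids the leading space that string concatenation leaves behind.

```python
from itertools import groupby


def collapse_repeats(lines):
    """Join lines into one string, keeping a single copy of each consecutive run."""
    # groupby yields one (key, group) pair per run of equal adjacent items.
    return " ".join(key for key, _ in groupby(lines))


lines = [
    "and now can we have a round of applause",
    "and now can we have a round of applause",
    "and now can we have a round of applause",
    "for Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the",
    "connected house of horrors good",
]

print(collapse_repeats(lines))
```

Like the loop above, this only removes repeats that sit next to each other, so a line the speaker genuinely says again later in the talk is kept.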

Putting it all together

Ta-da!

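import webvtt

vtt = webvtt.read('subtitles.vtt')

# Split every subtitle entry into its individual lines,
# dropping the whitespace-only padding.
lines = []
for caption in vtt:
    lines.extend(caption.text.strip().splitlines())

# Keep a line only if it differs from the one immediately before it.
transcript = ""
previous = None
for line in lines:
    if line == previous:
        continue
    transcript += " " + line
    previous = line

# strip() removes the leading space added by the first concatenation.
print(transcript.strip())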

One thing to note is that the auto-generated subtitles contain no punctuation, so the result is not as good as a proper transcription.

LyricFind: The World’s Largest Lyric Licensing Service

LyricFind is the world’s leader in legal lyric solutions. ..

In order to provide successful lyrics services, LyricFind has not only amassed licensing from over 4,000 music publishers, including all four majors —

  1. EMI Music Publishing,
  2. Universal Music Publishing Group,
  3. Warner/Chappell Music Publishing, and
  4. Sony/ATV Music Publishing —

but has also built a quality-controlled, vetted database of lyrics for licensing and service to over 100 countries. Behind the scenes, LyricFind tracks, reports, and pays royalties to those publishers on a song-by-song and territory-by-territory basis.

Additionally, LyricFind has a customized search solution available to licensees to identify music based on lyrics, and answer that age-old question of “What’s that song?”