Convert WebVTT to a Transcript using Python

I want to convert YouTube’s auto-generated subtitles into a plain transcript. Why is this so hard?

This blog post gives a more detailed explanation than my answer to this Stack Overflow question.

Here’s what the subtitles look like when you view a video:

And here’s what the code which generates those subtitles looks like:

00:00:00.930 --> 00:00:03.080 align:start position:0%

and<00:00:01.230><c> now</c><00:00:01.439><c> can</c><00:00:01.709><c> we</c><00:00:01.800><c> have</c><c.colorCCCCCC><00:00:01.920><c> a</c></c><c.colorE5E5E5><00:00:01.979><c> round</c><00:00:02.370><c> of</c><00:00:02.460><c> applause</c></c>

00:00:03.080 --> 00:00:03.090 align:start position:0%
and now can we have<c.colorCCCCCC> a</c><c.colorE5E5E5> round of applause
 </c>

00:00:03.090 --> 00:00:04.849 align:start position:0%
and now can we have<c.colorCCCCCC> a</c><c.colorE5E5E5> round of applause
for</c><c.colorCCCCCC><00:00:03.120><c> Terrence</c><00:00:03.629><c> Edwards</c><00:00:03.899><c> and</c><00:00:04.170><c> his</c></c><c.colorE5E5E5><00:00:04.200><c> talk</c><00:00:04.529><c> the</c></c>

00:00:04.849 --> 00:00:04.859 align:start position:0%
for<c.colorCCCCCC> Terrence Edwards and his</c><c.colorE5E5E5> talk the
 </c>

WTF? You’re looking at WebVTT, the Web Video Text Tracks format, which allows words to be displayed as they’re said. Each sentence and word is given a time-code and a position; colours can also be used to identify multiple speakers. It’s great for subtitles, but it is lousy if all you want to do is read a transcript.
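To get a feel for how much of that markup is noise for a reader, the inline tags can be stripped with a regular expression. This is only a rough sketch, not a WebVTT parser: `TAG` and `strip_tags` are names made up for this example, and it assumes every `<...>` run in the cue text is a tag.

```python
import re

# Matches inline tags such as <c>, </c>, <c.colorCCCCCC> and <00:00:01.230>
TAG = re.compile(r"<[^>]+>")


def strip_tags(cue_line: str) -> str:
    """Remove inline tags from one line of cue text (hypothetical helper)."""
    return TAG.sub("", cue_line)


raw = "and<00:00:01.230><c> now</c><00:00:01.439><c> can</c><00:00:01.709><c> we</c>"
print(strip_tags(raw))
```

That deals with the tags, but not with the fact that the same words are repeated in several subtitle entries, which is the real problem.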

So, how do we convert the above to something like:

and now can we have a round of applause for Terrence Edwards and his talk the connected house of horrors

Python – the quick and dirty way

Using the open source webvtt-py Python library, we can directly get the raw text of each subtitle entry:

import webvtt
vtt = webvtt.read('subtitles-en.vtt')

vtt[0].text
' \nand now can we have a round of applause'
vtt[1].text
'and now can we have a round of applause\n '
vtt[2].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
vtt[3].text
'for Terrence Edwards and his talk the\n '
vtt[4].text
'for Terrence Edwards and his talk the\nconnected house of horrors good'
vtt[5].text
'connected house of horrors good\n '
vtt[6].text
'connected house of horrors good\nafternoon'

Looking through the text manually, we can see that the element at index 2 holds the first complete pair of lines, and the element at index 6 holds the next. Starting at index 2, we can step by 4 and grab elements 6, 10, 14, etc. to build up a transcript. Does that work?

Yes! This is what happens if we slice the array:

sub = vtt[2::4]

sub[0].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
sub[1].text
'connected house of horrors good\nafternoon'
sub[2].text
'AMF thank you so much for for coming\nhere my name is Terrence Eaton I need to'
sub[3].text
'tell you three things about this talk so\nthe first thing is that this does'
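To turn that slice into a single string, something like the following might do. It is a sketch under a few assumptions: `transcript_from_slice` is a hypothetical helper, it works on a plain list of strings (here, the seven entry texts copied from above) rather than on the `vtt` object, and it trusts that the "every 4th entry" pattern holds.

```python
def transcript_from_slice(texts, start=2, step=4):
    """Join the two-line entries found at every `step`-th position."""
    parts = []
    for text in texts[start::step]:
        # Each selected entry holds two lines separated by a newline
        parts.extend(text.split("\n"))
    return " ".join(parts)


# The first seven entry texts shown earlier in this post
texts = [
    " \nand now can we have a round of applause",
    "and now can we have a round of applause\n ",
    "and now can we have a round of applause\nfor Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the\n ",
    "for Terrence Edwards and his talk the\nconnected house of horrors good",
    "connected house of horrors good\n ",
    "connected house of horrors good\nafternoon",
]

print(transcript_from_slice(texts))
```

With a real file, the list could be built with `[caption.text for caption in vtt]`.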

But are we sure that will work for all the subtitles? Or even for the entirety of this subtitle file?

Python – the hard way

Let’s take a look again at the first 5 subtitle entries.

vtt[0].text
' \nand now can we have a round of applause'
vtt[1].text
'and now can we have a round of applause\n '
vtt[2].text
'and now can we have a round of applause\nfor Terrence Edwards and his talk the'
vtt[3].text
'for Terrence Edwards and his talk the\n '
vtt[4].text
'for Terrence Edwards and his talk the\nconnected house of horrors good'

We can split those two-line entries using splitlines():

vtt[2].text.splitlines()
['and now can we have a round of applause', 'for Terrence Edwards and his talk the']

Let’s create a new list and add every line to it, stripping the surrounding whitespace from each entry and then splitting it on \n.

lines = []
for line in vtt:
    lines.extend(line.text.strip().splitlines())

Which gives us:

>>> lines[0]
'and now can we have a round of applause'
>>> lines[1]
'and now can we have a round of applause'
>>> lines[2]
'and now can we have a round of applause'
>>> lines[3]
'for Terrence Edwards and his talk the'
>>> lines[4]
'for Terrence Edwards and his talk the'
>>> lines[5]
'for Terrence Edwards and his talk the'
>>> lines[6]
'connected house of horrors good'
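The strip() call is doing more work than it looks. Many entries start or end with a line containing only a space, and without strip() those would end up in the list too. A small sketch using one of the entries above (the variable names are just for illustration):

```python
# Text of the first subtitle entry, as shown earlier
entry = ' \nand now can we have a round of applause'

# Without strip(), the lone-space line survives as its own element
without_strip = entry.splitlines()

# With strip(), the leading ' \n' is removed before splitting
with_strip = entry.strip().splitlines()

print(without_strip)
print(with_strip)
```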

And now, to de-duplicate them:

transcript = ""
previous = None
for line in lines:
    if line == previous:
        continue
    transcript += " " + line
    previous = line
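The same idea can be written with itertools.groupby from the standard library, which groups runs of equal consecutive items. This is an alternative sketch, not what the rest of this post uses: `dedupe_consecutive` is a made-up name, and the sample `lines` are the first seven values shown above.

```python
from itertools import groupby


def dedupe_consecutive(lines):
    """Collapse each run of identical consecutive lines into one line."""
    return [key for key, _group in groupby(lines)]


lines = [
    "and now can we have a round of applause",
    "and now can we have a round of applause",
    "and now can we have a round of applause",
    "for Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the",
    "for Terrence Edwards and his talk the",
    "connected house of horrors good",
]

# Joining with a space also avoids a stray leading space in the result
transcript = " ".join(dedupe_consecutive(lines))
print(transcript)
```

Either way, only consecutive repeats are dropped, so a phrase that the speaker genuinely says again later in the talk is kept.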

Putting it all together

Ta-da!

import webvtt

vtt = webvtt.read('subtitles.vtt')
transcript = ""

# Collect every line of text from every subtitle entry
lines = []
for line in vtt:
    lines.extend(line.text.strip().splitlines())

# Append each line unless it repeats the one before it
previous = None
for line in lines:
    if line == previous:
        continue
    transcript += " " + line
    previous = line

print(transcript)

One thing to note is that the auto-generated subtitles contain no punctuation, so the result is not as good as a proper transcription.