1. 7

I’m working on a side project to identify audio clips that have been (poorly) pieced together. A short clip may actually be the result of splicing together clips from one or more larger works of approximately the same audio quality. These are voice clips from podcasts. A goal is to detect alterations that significantly change what the speaker said, e.g. by leaving out context, from something as large as part of a sentence down to something as small as cutting out “not”.

Ideally, I’d like to split the smaller clip at the silent moments (actual silence according to the waveforms in Audacity) and then identify where in the other clips these small sections came from.
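
To make that concrete, here’s roughly the splitting step I’m imagining, sketched with librosa (which I’ve only just discovered; the top_db threshold and the file name are placeholders I’d need to tune and fill in):

```python
# Split a clip into chunks at the silent moments.
import librosa

y, sr = librosa.load("needle.wav", sr=None)  # keep the original sample rate

# Intervals of *non*-silence, i.e. the chunks between the silent moments.
# top_db is a guess: anything more than 40 dB below peak counts as silence.
intervals = librosa.effects.split(y, top_db=40)

chunks = [y[start:end] for start, end in intervals]
for i, (start, end) in enumerate(intervals):
    print(f"chunk {i}: {start / sr:.2f}s - {end / sr:.2f}s")
```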

I’ve found a host of tools aimed at music: AcoustID, echoprint, etc. The tooling I’ve found seems to check for the presence of a needle in the haystack but doesn’t expose where in the clip it came from: music identification doesn’t need the precision I seek.

I’d appreciate pointers to software that might be able to just do it, or advice on how to implement something like this. This is my first time doing anything with audio programmatically, so I’m out of my element but looking to learn.

  1. 3

    Not really ready-made tools, but a good overview of the theory behind audio segmentation:

    A NN/GRU-based approach like the one integrated into Opus 1.3 might be a good practical starting point: https://people.xiph.org/~jm/opus/opus-1.3/
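
    To give a flavor of the idea, here’s a toy GRU classifier in PyTorch; the feature count and layer sizes are made up, and Opus’s actual network is its own implementation, so treat this as a sketch of the shape, not the real thing:

    ```python
    # Toy per-frame voice activity detector built around a GRU.
    import torch
    import torch.nn as nn

    class TinyVAD(nn.Module):
        def __init__(self, n_features=20, hidden=24):
            super().__init__()
            self.gru = nn.GRU(n_features, hidden, batch_first=True)
            self.out = nn.Linear(hidden, 1)

        def forward(self, frames):  # frames: (batch, time, n_features)
            h, _ = self.gru(frames)
            return torch.sigmoid(self.out(h))  # per-frame speech probability

    vad = TinyVAD()
    probs = vad(torch.randn(1, 100, 20))  # 100 frames of dummy features
    ```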

    1. 1

      Neat, thanks for the links!

    2. 2

      A question: how could you tell who did the audio edit?

      A splice by the creator of the podcast should be indistinguishable from a malicious one from someone else.

      1. 1

        should be indistinguishable

        It should be, but in comparing one target needle with a few candidate haystacks, the haystacks were never truly silent: there was always a light hum or some background noise when no one was talking, i.e. when the active speaker was taking a moment to pause.

      2. 2

        Something you might want to try instead is looking for an open-source closed-captioning tool. If you run CC over a variety of news sources, speeches, etc., you can create a big database of speech transcriptions. Later, when you have a “suspicious” recording, you could CC that too and then do an index lookup against your DB to see how closely it matches the original context.

        It’s not the same as detecting edits “from nowhere”, but it could be useful for quickly spotting shady editing for high-profile sources like political figures and celebrities.
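
        As a sketch of the lookup step (transcribe() here is a stand-in for whatever open-source speech-to-text tool you pick; the matching itself is just stdlib difflib):

        ```python
        import difflib

        def transcribe(path):
            """Stand-in: run your chosen speech-to-text tool, return plain text."""
            raise NotImplementedError

        # In practice: needle = transcribe("suspicious.wav"), haystack from the DB.
        needle = "we are going to raise prices"
        haystack = "... we are not going to raise prices this year ..."

        m = difflib.SequenceMatcher(None, needle, haystack)
        match = m.find_longest_match(0, len(needle), 0, len(haystack))
        print(haystack[match.b:match.b + match.size])  # context the clip maps to
        ```

        Handy side effect: a dropped “not” is exactly what stops the longest common run from covering the whole sentence.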

        1. 1

          I’d considered trying automatic transcription first, then falling back to the harder stuff, but I’m concerned the quality might not be good enough. However, presumably the transcription system would output the same text for the same input, even if cut up, right?

        2. 1

          If I remember right, the most common way to do this is to build a spectrogram of your sampled audio (which is basically an FFT over time) and look at which spectrograms of reference audio it appears to be a subset of. There’s no reason you couldn’t adapt an existing implementation to report not just a match but also where the match was found. You might find it needs tuning because there’s more information carried in music than in speech, or that the overall approach doesn’t work too well, but it’s what I know for now.
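
          Something like this brute-force version, say; real fingerprinting is much smarter about features, but it shows how you’d get an offset out (file names, sample rate, and hop size are arbitrary):

          ```python
          import numpy as np
          import librosa

          needle, sr = librosa.load("needle.wav", sr=16000)
          hay, _ = librosa.load("haystack.wav", sr=16000)

          hop = 512
          N = np.abs(librosa.stft(needle, hop_length=hop))  # needle spectrogram
          H = np.abs(librosa.stft(hay, hop_length=hop))     # haystack spectrogram

          # Slide the needle across the haystack; normalising each window keeps
          # loud sections of the haystack from dominating the score.
          width = N.shape[1]
          scores = [
              np.sum(N * H[:, i:i + width]) / (np.linalg.norm(H[:, i:i + width]) + 1e-9)
              for i in range(H.shape[1] - width + 1)
          ]
          best = int(np.argmax(scores))
          print(f"best match at ~{best * hop / sr:.2f}s into the haystack")
          ```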

          As an aside, what does “silence on the waveform” actually mean? A zero crossing point? A number of samples all at 0? This might be a worthwhile step forward but it’s trivially defeated by overlaying small amounts of noise, or carefully putting the two subsections back together after removing a word, etc.

          1. 1

            what does “silence on the waveform” actually mean

            Forgive me, I’m still building my vocabulary in this context! I think I mean 0 for an extended time: Audacity shows the waveform flat at 0 when zoomed really far in. In the candidate haystacks, there’s virtually none of that.
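
            Concretely, I think I’m looking for something like this (numpy sketch; the eps threshold and minimum run length are guesses I’d calibrate against what Audacity shows):

            ```python
            import numpy as np

            def flat_zero_runs(y, sr, eps=1e-4, min_ms=20):
                """Return (start, end) sample ranges where the waveform sits at ~0."""
                quiet = np.abs(y) < eps
                min_len = int(sr * min_ms / 1000)
                runs, start = [], None
                for i, q in enumerate(quiet):
                    if q and start is None:
                        start = i                    # a quiet run begins
                    elif not q and start is not None:
                        if i - start >= min_len:
                            runs.append((start, i))  # long enough to count
                        start = None
                if start is not None and len(y) - start >= min_len:
                    runs.append((start, len(y)))
                return runs
            ```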

            trivially defeated

            Yeah, it would be. Detection beyond a sloppy “copy and paste” job is out of scope right now.

            1. 1

              Cool. I think it’s a worthwhile and probably interesting project anyway. As mentioned, one approach would be to create a spectrogram, and then identify features in the time-frequency-intensity space, and look for those same features in other places.

              There’s probably useful research on this in computer vision, where they instead view it as X-Y-intensity for b/w images.
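
              E.g. treat the spectrogram as an image and keep its local maxima, roughly Shazam-style. A sketch (min_distance and the dB cutoff are arbitrary starting points):

              ```python
              import numpy as np
              import librosa
              from skimage.feature import peak_local_max

              y, sr = librosa.load("clip.wav", sr=16000)
              S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

              # (freq_bin, frame) coordinates of prominent time-frequency peaks;
              # these act like the keypoints a vision pipeline would extract.
              peaks = peak_local_max(S_db, min_distance=10, threshold_abs=-40)
              print(peaks[:5])
              ```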

          2. 1

            This is a pretty interesting area. I think there are a few cool ways of doing it. The most brute-force way would be to create audio edits en masse and then approximate a function that classifies those edits against unedited audio. Whichever method you choose, if there’s a particular type of audio you want to detect on, you’ll want to gather a test set.

            For a classification function you could try messing with sklearn or tensorflow. Getting the labelled data might be the hardest part.
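
            A sketch of that pipeline; MFCC summary features and the half-second cut are just assumptions to make it concrete:

            ```python
            import numpy as np
            import librosa
            from sklearn.ensemble import RandomForestClassifier

            def features(y, sr):
                # Summarise a clip as MFCC means and standard deviations.
                mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
                return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

            def splice(y, sr):
                # Crude synthetic edit: drop a random half second from the clip.
                cut = np.random.randint(0, len(y) - sr // 2)
                return np.concatenate([y[:cut], y[cut + sr // 2:]])

            clips = [librosa.load(p, sr=16000) for p in ["a.wav", "b.wav"]]
            X = [features(y, sr) for y, sr in clips] + \
                [features(splice(y, sr), sr) for y, sr in clips]
            labels = [0] * len(clips) + [1] * len(clips)  # 0 = untouched, 1 = spliced

            clf = RandomForestClassifier().fit(X, labels)
            ```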