May 12, 2016
Ever wonder how we built the Political TV Ad Archive? This post explains what happens backstage: how we use audio fingerprinting to count how many times a particular ad has aired on television, where, and when, in the markets that we track.
There are three pieces to the Political TV Ad Archive:
- The Internet Archive collects, prepares, and serves the TV content in markets where we have feeds. Collection of TV is part of a much larger effort to meet the organization’s mission of providing “Universal Access to All Knowledge.” The Internet Archive is the online home to millions of free books, movies, software, music, images, web pages and more.
- The Duplitron 5000 is our whimsical name for an open source system responsible for taking video and creating unique, compressed versions of the audio tracks. These are known as audio fingerprints. We create an audio fingerprint for each political ad that we discover, which we then match against our incoming stream of broadcast television to find each new copy, or airing, of that ad. These results are reported back to the Internet Archive.
- The Political TV Ad Archive is a WordPress site that presents our data and our videos to the rest of the world. On this website, for the sake of posterity, we also archive copies of political ads that may be airing in markets we don’t track, or exclusively on social media. But for the ads that show up in areas where we’re collecting TV, we are able to present the added information about airings.
Step 1: recording television
We have a whole bunch of hardware spread around the country to record television. That content is then pieced together to form the programs that get stored on the Internet Archive’s servers. We have a few ways to collect TV content. In some cases, such as the San Francisco market, we own and manage the hardware that records local cable. In other cases, such as markets in Ohio and Iowa, the content is provided to us by third-party services.
Regardless of how we get the data, the pipeline takes it to the same place. We record in minute-long chunks of video and stitch them together into programs based on what we know about the station’s schedule. This results in video segments of anywhere from 30 minutes to 12 hours. Those programs are then turned into a variety of file formats for archival purposes.
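A rough sketch of that stitching step follows. It is an illustration rather than the Archive's actual pipeline code: it only groups the start times of minute-long chunks under whichever scheduled program covers them, and the function name, schedule format, and sample data are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def stitch_chunks(chunk_starts, schedule):
    """Group one-minute chunk start times under the program that covers them.

    chunk_starts: iterable of datetime, one per recorded chunk.
    schedule: list of (title, start, end) tuples; end is exclusive.
    Returns a dict mapping title -> chronologically ordered chunk start times.
    """
    programs = {title: [] for title, _, _ in schedule}
    for chunk in sorted(chunk_starts):
        for title, start, end in schedule:
            if start <= chunk < end:
                programs[title].append(chunk)
                break  # a chunk belongs to at most one program
    return programs


# Hypothetical example: 90 minutes of recording split across two programs.
base = datetime(2016, 2, 1, 18, 0)
chunks = [base + timedelta(minutes=m) for m in range(90)]
schedule = [
    ("Evening News", base, base + timedelta(minutes=30)),
    ("Game Show", base + timedelta(minutes=30), base + timedelta(minutes=90)),
]
programs = stitch_chunks(chunks, schedule)
print({title: len(parts) for title, parts in programs.items()})
```

A real pipeline would also have to concatenate the video itself and cope with chunks that fall outside any scheduled program; here such chunks are simply dropped.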
The ad counts we publish are based on actual airings, as opposed to reported airings. This means that we are not estimating counts by analyzing Federal Election Commission (FEC) reports on spending by campaigns. Nor are we digitizing reports filed by broadcasting stations with the Federal Communications Commission (FCC) about political ads, though that is a worthy goal. Instead we generate counts by looking at what actually has been broadcast to the public.
Because we are working from the source, we know we aren’t being misled. On the flip side, this means that we can only report counts for the channels we actively track and record. In the first phase of our project, we tracked more than 20 markets in 11 key primary states (details here.) We’re now in the process of planning which markets we’ll track for the general elections. Our main constraint is simple: money. Capturing TV comes at a cost.
A lot can go wrong here. Storms can affect reception, and packets can be lost or corrupted before they reach our servers. The result can be time shifts or missing content. But most of the time the data winds up sitting comfortably on our hard drives unscathed.
Step 2: searching television
Video is terrible when you’re trying to find a specific piece of it. It’s slow and heavy, and it is far better suited to watching than to working with. But sometimes you need to find a way.
There are a few things to try. One is transcription: if you have a time-coded transcript you can do almost anything, such as create a text editor for video or search for key phrases like “I approve this message.”
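To make the idea concrete, here is a minimal sketch of a phrase search over a time-coded transcript. The transcript format (a list of start-time and text pairs), the function name, and the sample lines are assumptions made for illustration.

```python
def find_phrase(transcript, phrase):
    """Return the start times (in seconds) of lines that contain the phrase.

    transcript: list of (start_seconds, text) pairs.
    The comparison ignores upper/lower case.
    """
    needle = phrase.lower()
    return [start for start, text in transcript if needle in text.lower()]


# Hypothetical transcript of a short stretch of television.
transcript = [
    (0.0, "Our schools need a leader."),
    (12.5, "I'm Jane Doe and I approve this message."),
    (30.0, "Back to your local news."),
]
hits = find_phrase(transcript, "I approve this message")
print(hits)
```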
The problem is that most television is not precisely transcribed. Closed captions are required for most U.S. TV programs, but not for advertisements. Shockingly, most political ads are not captioned. There are a few open source tools out there for automated transcript generation, but the results leave much to be desired.
Introducing audio fingerprinting
We use a free and open tool called audfprint to convert our audio files into audio fingerprints.
An audio fingerprint is a summarized version of an audio file, one from which everything has been removed except the most “interesting” pieces of every few milliseconds of sound. The trick is that the summaries are formed in a way that makes it easy to compare them, and because they are summaries, the resulting fingerprint is a lot smaller and faster to work with than the original.
The audio fingerprints we use are based on frequency. Sounds are made up of waves, and each wave repeats (oscillates) at a different rate. Faster repetitions produce higher sounds; slower repetitions produce lower sounds.
An audio file contains instructions that tell a computer how to generate these waves. Audfprint breaks the audio files into tiny chunks (around 20 chunks per second) and runs a mathematical function on each fragment to identify the most prominent waves and their corresponding frequencies.
The rest is thrown out, the summaries are stored, and the result is an audio fingerprint.
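The chunk-and-summarize process described above can be sketched in a few lines. This is a deliberately simplified illustration and not audfprint's actual code or algorithm: it splits the samples into fixed-size chunks, runs a plain discrete Fourier transform on each, and keeps only the index of the strongest frequency. The sample rate, chunk size, and test tones are assumptions chosen so that there are 20 chunks per second.

```python
import cmath
import math


def dominant_bins(samples, chunk_size, keep=1):
    """For each chunk, return the indices of its `keep` strongest frequency bins."""
    fingerprint = []
    for start in range(0, len(samples) - chunk_size + 1, chunk_size):
        chunk = samples[start:start + chunk_size]
        magnitudes = []
        # Bin 0 (the constant offset) is skipped; bins above chunk_size // 2
        # mirror the lower ones for real-valued audio.
        for k in range(1, chunk_size // 2):
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / chunk_size)
                for n, x in enumerate(chunk)
            )
            magnitudes.append((abs(coeff), k))
        strongest = sorted(magnitudes, reverse=True)[:keep]
        # Everything except the strongest bins is thrown away.
        fingerprint.append(tuple(sorted(k for _, k in strongest)))
    return fingerprint


# Hypothetical test signal: half a second at 200 Hz, then half a second at 400 Hz.
RATE = 2000   # samples per second (assumed)
CHUNK = 100   # samples per chunk, giving 20 chunks per second
samples = [
    math.sin(2 * math.pi * (200 if n < RATE // 2 else 400) * n / RATE)
    for n in range(RATE)
]
fingerprint = dominant_bins(samples, CHUNK)
print(fingerprint[0], fingerprint[-1])
```

Each bin covers RATE / CHUNK = 20 Hz, so the 200 Hz tone lands in bin 10 and the 400 Hz tone in bin 20. One second of audio shrinks from 2,000 samples to 20 small numbers.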
If the same sound exists across two files, a common set of dominant frequencies will be seen in both fingerprints. Audfprint makes it possible to compare the chunks between two sound files, count how many they have in common, and check how many appear at roughly the same distance from one another.
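One common way to do that kind of comparison is to let every shared chunk "vote" for the time offset at which the two files would line up, then pick the offset with the most votes. The sketch below assumes each fingerprint is simply a list of one number per chunk; that representation, the function name, and the sample values are illustrative assumptions, not audfprint's internal format.

```python
from collections import Counter


def best_alignment(ad_fp, program_fp):
    """Return (offset, matches) for the chunk offset where the ad best fits.

    offset is the program chunk index at which the ad would start;
    matches is how many ad chunks agree with the program at that offset.
    """
    # Index every program chunk value by the positions where it occurs.
    positions = {}
    for i, value in enumerate(program_fp):
        positions.setdefault(value, []).append(i)

    # Each shared value votes for the offset that would align the two chunks.
    votes = Counter()
    for j, value in enumerate(ad_fp):
        for i in positions.get(value, []):
            votes[i - j] += 1

    if not votes:
        return None, 0
    return votes.most_common(1)[0]


# Hypothetical fingerprints: the ad appears inside the program at chunk 4.
ad = [3, 7, 7, 2, 9]
program = [1, 4, 4, 8, 3, 7, 7, 2, 9, 5, 1]
offset, matches = best_alignment(ad, program)
print(offset, matches)
```

Chunks that match by coincidence scatter their votes across many offsets, while a true copy piles its votes onto one, which is why the spacing between matches matters as much as the raw count.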
This is what we use to find copies of political ads.
Step 3: cataloguing political ads
When we discover a new political ad the first thing we do is register it on the Internet Archive, kicking off the ingestion process. The person who found it types in some basic information such as who the ad mentions, who paid for it, and what topics are discussed.
The ad is then sent to the system we built to manage our fingerprinting workflow, which we whimsically call the Duplitron 5000, or the “DT5k.” It uses audfprint to generate fingerprints, organizes how the fingerprints are stored, processes the comparison results, and allows us to scale across millions of minutes of television.
DT5k generates a fingerprint for the ad, stores it, and then compares that fingerprint with hundreds of thousands of existing fingerprints for the shows that have previously been ingested into the system. It takes a few hours for all of the results to come in. When they do, the Duplitron makes sense of the numbers and tells the archive which programs contain copies of the ad and at what time each copy aired.
These results end up being fairly accurate, but not perfect. The matches are based on audio, not video, which means we run into trouble when a political ad uses the same soundtrack as something else, for instance an infomercial.
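As a sketch of that last step, the snippet below turns raw match results into airing times. It is not the Duplitron's actual code: the program identifier, the match tuple layout, the 20-chunks-per-second rate, and the minimum-match threshold are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumed rate; the real chunk rate is only "around 20" per second.
CHUNKS_PER_SECOND = 20


def airing_time(program_start, chunk_offset, chunks_per_second=CHUNKS_PER_SECOND):
    """Convert a chunk offset within a program into a wall-clock time."""
    return program_start + timedelta(seconds=chunk_offset / chunks_per_second)


def report_airings(matches, program_starts, min_matches):
    """Keep matches with enough agreeing chunks and attach an airing time.

    matches: list of (program_id, chunk_offset, match_count) tuples.
    program_starts: dict mapping program_id -> datetime the program began.
    """
    airings = []
    for program_id, offset, count in matches:
        if count >= min_matches:
            start = program_starts[program_id]
            airings.append((program_id, airing_time(start, offset)))
    return airings


# Hypothetical results: one strong match and one weak, probably spurious, match.
program_starts = {"KXYZ_20160201_1800": datetime(2016, 2, 1, 18, 0)}
matches = [
    ("KXYZ_20160201_1800", 12000, 480),
    ("KXYZ_20160201_1800", 30000, 12),
]
airings = report_airings(matches, program_starts, min_matches=100)
print(airings)
```

The strong match sits 12,000 chunks in, which at 20 chunks per second is 600 seconds, so the ad is reported as airing at 18:10.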
We are working on improving the system to filter out these kinds of false positives, but even with no changes these fingerprints have provided solid data across the markets we track.
Step 4: enjoying the results
And so now you understand a little bit more about our system. You can download our data and watch the ads at the Political TV Ad Archive. (For more on our metadata, including what’s in it and what you can do with it, read here.)
Over the coming months we will be working to make the system more accurate. We are also exploring ways to identify newly released political ads without any need for manual entry.
P.S. We’re also working to make it as easy as possible for any researchers to download all of our fingerprints to use in their own local copies of the Duplitron 5000. Would you like to experiment with this capability? If so, contact me on Twitter at @slifty.