January 20, 2016
What is the Political TV Ad Archive?
A new searchable, viewable, shareable—and free—online archive of 2016 primary election political TV ads, married with fact-checking and reporting citizens can trust.
How does the Political TV Ad Archive collect ads?
Building on techniques used for the Internet Archive’s broadcast news library, we are collecting local television from 20 markets in eight key early primary states: Iowa, New Hampshire, Nevada, Ohio, Colorado, South Carolina, North Carolina, and Florida.
How do you find the political ads?
We find political ads in one of two ways: either a team member identifies an ad on the TV News Archive, YouTube, or some other site, or our Duplitron, as we call it, identifies it. (More on that below.) Either way, once we confirm that a particular video segment is indeed a political ad, we create an audio fingerprint of it using a tool called audfprint. This is like a snapshot of the sound of that video segment and, like a human fingerprint, it is unique. Once we have the fingerprint for a political ad, we can match it to other identical or similar fingerprints; in this way, we determine how many times a particular ad has aired, as well as when and where.
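To make the counting step concrete, here is a minimal sketch in Python. It is an illustration only, not the Duplitron's code: it models each fingerprint as a plain set of integer hashes and compares sets by their overlap, and every name, market, timestamp, and threshold in it is made up for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Airing:
    """One detected broadcast of an ad (all field names are hypothetical)."""
    ad_id: str
    market: str
    start: str  # ISO 8601 timestamp of the matched segment


def similarity(a: frozenset, b: frozenset) -> float:
    """Jaccard overlap between two fingerprints modeled as sets of hashes."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_airings(known_ads, segments, threshold=0.5):
    """Compare every broadcast segment with every known ad fingerprint."""
    airings = []
    for market, start, segment_fp in segments:
        for ad_id, ad_fp in known_ads.items():
            if similarity(ad_fp, segment_fp) >= threshold:
                airings.append(Airing(ad_id, market, start))
    return airings


def count_by_market(airings):
    """Tally how often each ad aired in each market."""
    return Counter((a.ad_id, a.market) for a in airings)


# Made-up fingerprints and broadcast segments, for illustration only.
known_ads = {
    "ad-001": frozenset({1, 2, 3, 4}),
    "ad-002": frozenset({10, 11, 12, 13}),
}
segments = [
    ("Des Moines", "2016-01-18T18:05:00", frozenset({1, 2, 3, 4})),      # identical
    ("Des Moines", "2016-01-18T21:30:00", frozenset({1, 2, 3, 9})),      # similar
    ("Las Vegas", "2016-01-19T07:15:00", frozenset({10, 11, 12, 13})),   # identical
    ("Las Vegas", "2016-01-19T08:00:00", frozenset({20, 21})),           # not an ad
]

airings = find_airings(known_ads, segments)
counts = count_by_market(airings)
for (ad_id, market), n in sorted(counts.items()):
    print(ad_id, market, n)
```

Each match keeps its market and start time, so the same list of airings answers "how many times," "when," and "where."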
What exactly is audfprint?
Developed by Dan Ellis at Columbia University, audfprint is a tool that converts media files to audio fingerprints, and is able to compare audio fingerprints with one another to identify overlaps. This tool is open source, meaning anybody may use it to build their own applications.
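For readers curious how one fingerprint can be compared with another, the toy sketch below shows the general landmark-style idea behind tools of this kind: pair up prominent peaks in the sound, turn each pair into a small hash, and look for a time offset at which many hashes line up. It is not audfprint's code or API. The function names, parameters, and peak values are assumptions made for illustration, and a real tool derives the peaks from a spectrogram of the audio.

```python
from collections import Counter


def landmark_hashes(peaks, fan_out=3, max_dt=10):
    """Pair each peak with up to `fan_out` later peaks.

    peaks: list of (time_frame, frequency_bin) tuples sorted by time.
    Returns a list of (hash, anchor_time) where hash = (f1, f2, dt).
    """
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break  # later peaks are even further away
            if dt > 0:
                hashes.append(((f1, f2, dt), t1))
                paired += 1
                if paired == fan_out:
                    break
    return hashes


def best_alignment(reference, query):
    """Vote on the offset (query time minus reference time) with most shared hashes."""
    index = {}
    for h, t_ref in reference:
        index.setdefault(h, []).append(t_ref)
    votes = Counter()
    for h, t_query in query:
        for t_ref in index.get(h, []):
            votes[t_query - t_ref] += 1
    if not votes:
        return None, 0
    return votes.most_common(1)[0]


# Made-up peaks: the "ad" appears in the broadcast starting at frame 100.
ad_peaks = [(0, 40), (2, 55), (5, 31), (7, 62), (9, 48)]
broadcast_peaks = [(90, 12), (95, 70)] + [(t + 100, f) for t, f in ad_peaks]

ad_hashes = landmark_hashes(ad_peaks)
broadcast_hashes = landmark_hashes(broadcast_peaks)
offset, shared = best_alignment(ad_hashes, broadcast_hashes)
print(offset, shared)
```

Because the vote is taken over offsets rather than over raw hash overlap, the sketch reports where in the longer recording the overlap begins as well as whether one exists.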
What is the Duplitron?
Developed by Dan Schultz (@slifty), the Duplitron, powered by audfprint, is our system for identifying a segment of TV out in the wild of the airwaves that may be a political ad, bringing it to the attention of a member of our team, fingerprinting it, and then using those fingerprints to find other copies. The Duplitron is open source and freely available for anyone to adapt.
How does the Political TV Ad Archive relate to the TV News Archive?
If the Internet Archive is the grandparent to the Political TV Ad Archive, the TV News Search and Borrow service is the parent. Launched in 2012, the TV news research library leverages the Internet Archive’s TV collection and offers access to programming dating back to 2009. It repurposes closed captioning as a searchable index. The Political TV Ad Archive is built on top of the engineering of Tracey Jaquith, the architect of the TV Archive. Within the TV News Archive, TV is collected either in MPEG-TS format, as transmitted, or as a smaller video file. Whatever the source, the video is recorded in one-minute increments, which are merged with information from TV program guides to define shows and then uploaded to servers. From there, audfprint audio fingerprints are made of the segments and fed to the Duplitron, which in turn feeds the Political TV Ad Archive.
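As a rough illustration of the merge step, the Python sketch below assigns one-minute recordings to shows using program guide entries. It is a simplified assumption about how such a merge could work, not the TV Archive's actual pipeline; the GuideEntry type, the function name, and the sample schedule are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class GuideEntry:
    """One program guide listing (a hypothetical structure)."""
    title: str
    start: datetime
    end: datetime  # exclusive


def group_minutes_into_shows(minute_starts, guide):
    """Map each guide entry to the one-minute recordings that fall inside it."""
    shows = {}
    for minute in sorted(minute_starts):
        for entry in guide:
            if entry.start <= minute < entry.end:
                shows.setdefault((entry.title, entry.start), []).append(minute)
                break  # a minute belongs to at most one show
    return shows


# Made-up schedule: 90 recorded minutes, but guide data for only the first 60.
base = datetime(2016, 1, 18, 18, 0)
minutes = [base + timedelta(minutes=i) for i in range(90)]
guide = [
    GuideEntry("Evening News", base, base + timedelta(minutes=30)),
    GuideEntry("Game Show", base + timedelta(minutes=30), base + timedelta(minutes=60)),
]

shows = group_minutes_into_shows(minutes, guide)
for (title, start), recorded in shows.items():
    print(title, start.isoformat(), len(recorded), "minutes")
```

Keying each show by title and start time keeps two airings of the same program on the same day from being merged into one.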
What technical challenges did the team encounter?
Plenty of them. Dan Schultz and Tracey Jaquith, along with other senior Internet Archive engineers, have encountered issues ranging from poor video quality, to stubborn audio drift in clips, to the sheer amount of computing power needed to process and compare hundreds of thousands of minutes of video. Once the launch is behind us, we will release more detailed information about the challenges we faced and the approaches we took to solve them.
What kind of data does the project generate?
The Duplitron automatically records data on how many ads have been discovered, and on where and how often they have aired on television. We are also partnering with the Center for Responsive Politics (OpenSecrets.org) to obtain their research on the sponsors of political TV ads and what type of legal entity each one is: a candidate committee, a super PAC, a nonprofit 501(c) group, etc. In addition, our metadata features information on the subjects covered by each ad, which we are classifying using partner PolitiFact’s subject index. We will also include searchable transcripts of the ads, as well as searchable fields on the candidates mentioned in each ad and whether the ad is positive, negative, or mixed in message. All of these data are downloadable in CSV format.
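As an example of what can be done with a CSV download, the short Python sketch below totals airings by sponsor type and by message tone. The column names (ad_id, sponsor, sponsor_type, message, market, air_count) and the sample rows are assumptions invented for illustration; check the header row of the actual file before adapting it.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for a downloaded file; real column names may differ.
SAMPLE_CSV = """ad_id,sponsor,sponsor_type,message,market,air_count
ad-001,Example Candidate Committee,candidate committee,positive,Des Moines,12
ad-002,Example Super PAC,super PAC,negative,Las Vegas,7
ad-003,Example Super PAC,super PAC,mixed,Des Moines,3
"""


def airings_by(field, csv_text):
    """Sum the air_count column, grouped by the given column."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[field]] += int(row["air_count"])
    return totals


by_type = airings_by("sponsor_type", SAMPLE_CSV)
by_message = airings_by("message", SAMPLE_CSV)
print(dict(by_type))
print(dict(by_message))
```

To run the same tally on a real download, pass the file's contents in place of SAMPLE_CSV and adjust the column names to match.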