Here are two provocative examples of publishing source data sets, one by journalists and the other by historians:
Example 1: Following the takeover of the US Capitol on Jan. 6, 2021, ProPublica acquired, curated and published a set of videos taken by rioters. I appreciate that they have also published their rationale and a description of the diligent process they created to do this responsibly. They persuasively argue that what they've done is "transformative fair use." More details below.
Example 2: In their research, environmental health historians David Rosner and Gerald Markowitz have very productively used industry documents given to them by lawyers involved in cases against chemical companies. They've now built a platform, ToxicDocs, where these documents (thousands of pages) are freely and publicly accessible. More details below.
On Sunday, Jan. 17, we published a trove of more than 500 videos posted on the social media site Parler by participants and witnesses to the Jan. 6 assault on the Capitol. The material was an instant sensation, scoured by ordinary readers, amateur sleuths and FBI agents.
The story of how we got this story is a classic ProPublica caper, a mix of shoe-leather journalism, 21st-century programming prowess and teamwork. By the time we surfaced the videos, Amazon Web Services had indefinitely suspended Parler’s hosting service, taking the site and all of its content offline. But the posts had been saved by several groups of computer-savvy technologists determined to preserve the historical record.
As is often the case, this tale of internet ingenuity features an intriguing cast of characters. One central figure was a self-described hacker who goes by the Twitter handle @donk_enby. There was a “collective” of “rogue programmers” dedicated to archiving internet content, a confidential source whose identity we’ve promised to protect and our very own team of computer nerds.
The story of how they preserved this crucial evidence and brought it to public view is a reminder of something that’s been obscured as the modern internet became a place to organize riots, propagate hate and spread disinformation. Every now and again, cyberspace still fulfills the vision of its idealistic pioneers, serving as a virtual town hall for civic-minded people to gather and serve the public good.
When thousands of people marched on the Capitol and forced their way into the building, smashing windows and beating police officers, the members of the crowd we saw on cable news and social media appeared giddy at what they’d done. Although their conduct involved an array of criminal acts, we witnessed many recording the events on their cellphones. In the minutes and hours that followed, many of those people posted a high-definition video record of their actions to Parler, a social networking site created in 2018.
By Saturday, Jan. 9, the tech giants who had made it possible for Parler to operate — Google, Apple and Amazon — were moving to suspend their contracts with the company, citing its failure to “moderate” or remove violent content. The most significant of these was Amazon, which had provided the web hosting infrastructure needed to keep Parler online. Citing what it termed “the serious risk” that Parler users would use the platform to “further incite violence,” Amazon announced that it would cut off the app “effective January 10th at 11:59 PST.”
The race to preserve Parler’s posts, what @donk_enby, whose bio describes her love for “data spelunking” and “free speech as in free-for-everyone,” later called “the big pull,” was on.
As it happens, there is no easy way to copy and store an entire website. The posts on social networking sites like Twitter each have unique internet addresses, but there is usually no way to know what all of those addresses are, which makes it extremely difficult to write a program that vacuums up everything a site makes public.
The Parler posts seemed destined to turn into a pumpkin at midnight on Sunday. But then @donk_enby made an important discovery. Parler created the web addresses for its posts with a numbering scheme that could be easily predicted. This made the task of scraping massive numbers of posts from the site far easier than it should have been. (Ars Technica details precisely how Parler’s coding diverged from the industry standard for those who follow such things.)
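To see why predictable, sequential IDs matter so much for scraping, here is a minimal sketch. The URL pattern below is a made-up placeholder, not Parler's actual address scheme (Ars Technica documents the real details); the point is only that integer IDs let a scraper enumerate every candidate address, whereas random identifiers would not.

```python
# A sketch of why sequential post IDs make scraping trivial.
# The base URL is a hypothetical placeholder, not Parler's real scheme.
def candidate_urls(start_id, count, base="https://example-social.invalid/post/"):
    """Enumerate candidate post URLs from a predictable integer ID sequence.

    With sequential IDs, a scraper needs no index of the site: it can
    simply count. With random UUIDs, this enumeration is impossible.
    """
    return [f"{base}{i}" for i in range(start_id, start_id + count)]

urls = candidate_urls(1000, 3)
```

A scraper would then fetch each URL in turn, skipping the ones that return errors.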
@donk_enby wrote a script that would harvest the Parler material. The task was gargantuan, involving many terabytes of information. The ArchiveTeam, computer hackers who describe themselves as “a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage,” set to work to preserve the Parler material. The team invited people around the world to donate computing power from their own machines, and many did so, allowing the downloading of nearly all of the Parler content. Some of the files were briefly posted on archive.org but then taken down.
That didn’t stop ProPublica’s Jeff Kao. A gifted programmer who is capable of uncovering internet skullduggery in both Chinese and English, Jeff and his boss Scott Klein immediately understood the potential value in obtaining the Parler posts. ProPublica has an entire team of journalists whose work involves creating searchable databases of newsworthy material, from names of doctors who are paid by Big Pharma companies to the identities of Catholic priests accused of misconduct.
One of our journalists, Derek Willis, was already trying to download the posts identified by @donk_enby. It was slow going and Jeff turned to an old-fashioned methodology to solve the problem: reporting. He searched for people who were working on archiving Parler and found someone who had downloaded every post with video and was willing to share them. (Our source, let’s call him or her Deep Parler, prefers to remain anonymous because of personal safety concerns.)
When Scott, Jeff, Derek and senior reporter Jack Gillum scrolled through some of the links, it was immediately clear they were onto something extraordinary. The videos were chaotic, jumbled. But taken together, they offered an immersive experience of what it was like to be inside the Capitol that afternoon.
As they analyzed the posts, our team looked for a very specific piece of text in the video files. Social networking sites typically remove any data that shows precisely when and where a particular video was recorded. A fair number of the Parler videos went up with their metadata intact, meaning that it was possible to know which of them were shot near the Capitol. Jeff quickly wrote a program that identified any videos shot from near the building around Jan. 6.
Then, our team had a second insight. Thousands of Parler videos from Jan. 6 lacked geographic data, and most of them were not useful — for example, there were videos copied from other websites. However, many included a tiny piece of text noting the make and model of the device, meaning you could search just for original videos shot by smartphones. Sifting through the videos posted at the time of the riot gave us a whole lot more firsthand footage from the sacking of the Capitol.
The next step was the most time consuming. We decided that every video should be manually reviewed by a ProPublica staffer to make sure it was newsworthy — directly related to the storming of the Capitol — and did not violate our basic standards of decency. An email went out to the staff asking for volunteers and dozens stepped forward to help. They marked the videos by significance, allowing our veteran political reporter, Alec MacGillis, to zero in on key moments for an essay he was writing to accompany the package. We whittled the pile of more than 2,500 to a selection of a little more than 500 videos.
Lena Groeger, Moiz Syed and Al Shaw on our news applications team worked on the design and the technical complexities of creating the interactive. By Sunday afternoon, the editor of that team, Ken Schwencke, was satisfied that everything was ready and we pushed the button. The story and interactive carried the names of 37 ProPublicans, eight of whom were bylined.
Within 24 hours, more than 1 million people had dipped into the files. Within days, FBI agents began filing affidavits in support of arrests. Several drew on images from the interactive.
We thought a lot about privacy, and how the material could be used for good and for ill, before publishing it. All of the information we published was posted publicly by the participants. We took care to remove videos that were irrelevant, that showed graphic violence or that didn’t contribute to our readers’ understanding of the event. Assembled together, the trove has obvious news value, and we worked hard to give the videos context. Still, these were not straightforward decisions. Scott and Jeff explained our thinking in an editors’ note we published with the interactive and Alec’s story.
We journalists like to complain about the internet. And, to be fair, it has destroyed the old business model of nearly all of the world’s newspapers. It can be a monument to mankind’s worst instincts, an endless flow of misinformation, disinformation and misogyny. But it’s also an amazing reporting tool, and it’s given us reporting capabilities we are only starting to understand and exploit. “We learned a ton from this,” Schwencke said. “Several of our competitors did very high-quality, well-curated walkthroughs of a limited number of videos from Jan. 6. But there’s also real power in giving people the ability to see things through the eyes of the people who took part as they move from a Trump rally to a set of aggrieved people approaching the police to an all-out riot.”