What Should We Do About Big Data Leaks?
What is transparency in the age of massive database drops? The data is available, but locked in MP3s and PDFs and other documents; it’s not searchable in the way a web page is searchable, not easy to comment on or share.
.. This, said the consortium of journalists that notably did not include The New York Times, The Washington Post, etc., is the big one.
.. Organs of journalism are among the only remaining cultural institutions that can fund investigations of this size and tease the data apart, identifying linkages and thus constructing informational webs that can, with great effort, be turned into narratives, yielding something like what we call “a story” or “the truth.”
.. If this is the age of the citizen journalist, or at least the citizen opinion columnist, it’s also the age of the data journalist, with the news media acting as product managers of data leaks, making the information usable, browsable, attractive.
.. There’s a glut of data, but most of it comes to us in ugly formats. What would happen if the things released in the interest of transparency were released in actually transparent formats? By which I mean, not as a pile of unstructured documents, not even as pure data, but, well, as software? Put cost aside and imagine for a minute that the Financial Crisis Inquiry Commission (FCIC) report were delivered not as web pages, PDFs, finding aids, and the like, but as a database filled with searchable, formatted text, with documents attributed to the individuals within, audio files transcribed, and so forth.
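To make the thought experiment concrete, here is a minimal sketch of what "a database filled with searchable text, attributed to individuals" might look like, using Python's built-in sqlite3 module. The table layout, the function names `build_archive` and `search`, and the sample records are all invented for illustration; nothing here describes how the FCIC material is actually stored.

```python
import sqlite3


def build_archive(records):
    """Load (person, kind, body) tuples into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        " id INTEGER PRIMARY KEY,"
        " person TEXT NOT NULL,"  # who the document is attributed to
        " kind TEXT NOT NULL,"    # e.g. 'email', 'transcript', 'report'
        " body TEXT NOT NULL)"    # the searchable text itself
    )
    conn.executemany(
        "INSERT INTO documents (person, kind, body) VALUES (?, ?, ?)", records
    )
    conn.commit()
    return conn


def search(conn, term):
    """Return (person, kind, body) rows whose text contains the term.

    Uses a plain substring match via LIKE, so '%' and '_' in the term
    act as wildcards. A real archive would want a full-text index.
    """
    cursor = conn.execute(
        "SELECT person, kind, body FROM documents WHERE body LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return cursor.fetchall()


# Hypothetical sample records, purely for illustration.
SAMPLE = [
    ("Witness A", "transcript", "We discussed mortgage risk in the spring."),
    ("Witness B", "email", "Please send the quarterly figures."),
    ("Witness A", "email", "The mortgage desk has not replied."),
]

archive = build_archive(SAMPLE)
for person, kind, body in search(archive, "mortgage"):
    print(f"{person} ({kind}): {body}")
```

The point of the sketch is only that once text is attributed and stored in one place, a question like "who mentioned mortgages, and where?" becomes a one-line query rather than an afternoon of opening PDFs.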
.. I look at that FCIC data and see at least 300 hours of audio. At a dollar a minute, that’s $18,000 worth of transcription. Those documents could be similarly turned into searchable text, as could any of the PDFs. We can do the same for emails. These tools exist and are open. If there are any faxes, they can be OCRed.
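The transcription figure is simple arithmetic, sketched below. The rate of one dollar per minute is an assumption: it is the rate implied by the 300-hour and $18,000 figures, not a quoted price, and the function name `transcription_cost` is made up for this example.

```python
def transcription_cost(hours, dollars_per_minute=1.00):
    """Estimate the cost of transcribing a given number of hours of audio.

    The default rate of $1.00 per minute is an assumption inferred from
    the essay's own numbers, not a vendor quote.
    """
    minutes = hours * 60
    return minutes * dollars_per_minute


print(transcription_cost(300))  # 18000.0
```

Changing the rate shows how sensitive the total is: at half the assumed rate, the same 300 hours would come to $9,000.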