CW Anderson, Culture Daily. The Materiality of Algorithms.
In what reads like a starting point for more posts on the subject, CUNY professor Chris Anderson discusses which documents journalists may want to design algorithms for, and just how hard that task will be.
Algorithms doing magic inside massive data sets and search engines, while not mathematically simple, are generally easy to conceptualize: the algorithm and its data sit in the computer, the algorithm sifts through the Excel sheet in the background and bam! You have something.
But if you’re working with poorly organized documents, it’s difficult to simply plug them in.
Chris writes that the work required to include any document in a set will shape the algorithm that makes sense of the whole bunch. This will be a problem for journalists who want to examine documents made without much forethought, which is to say: government documents, phone records from different companies and countries, eyewitness reports, police sketches, mugshots, bank statements, tax forms, and hundreds of other things worth investigating.
The recovered text [from these documents] is a mess, because these documents are just about the worst possible case for OCR [optical character recognition]: many of these documents are forms with a complex layout, and the pages have been photocopied multiple times, redacted, scribbled on, stamped and smudged. But large blocks of text come through pretty well, and this command extracts what text there is into one file per page.
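The command itself isn't reproduced in the excerpt, so what follows is not Stray's method, only a minimal sketch of the "one file per page" idea. It assumes the OCR text has already been recovered as a single string with pages separated by form feed characters, which is how some tools (pdftotext, for example) mark page breaks. The function name `split_pages` and the `page-0001.txt` naming scheme are hypothetical.

```python
from pathlib import Path


def split_pages(text, out_dir, prefix="page"):
    """Write each form-feed-delimited page of `text` to its own file.

    Assumption: pages are separated by "\f". This is an illustration,
    not the command used in the quoted account.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    pages = text.split("\f")
    # A trailing form feed leaves an empty final chunk; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()

    paths = []
    for number, page in enumerate(pages, start=1):
        path = out / f"{prefix}-{number:04d}.txt"
        # Messy OCR output is kept as-is; cleanup is a separate step.
        path.write_text(page, encoding="utf-8")
        paths.append(path)
    return paths
```

Keeping each page in its own file means a badly smudged page can be inspected or redone on its own without touching the rest of the set.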
To read the rest of Stray’s account, see his Overview Project.
And to see more with Chris Anderson, see our recent video interviews with him.