Posts tagged with ‘data’
I’d bumped into a friend who I’d not seen in a while and this was the first question I asked him. He didn’t realise at the time that I’d be in self-imposed smutty exile for an untold number of weeks, working on the largest study of porn stars ever undertaken, and now I was out and eager to spread the news.
‘Erm, yeah, I suppose,’ he said.
‘What does she look like?’ I asked, struggling to hide my smile.
When he replied by saying ‘a blonde with big boobs’, I must admit I relished the opportunity to lean in, let the grin spread across my tired face, and say ‘That’s what everyone says. And in fact, it’s wrong’.
‘Oh,’ he said, after I explained how I knew what the average porn star actually looks like, as well as what her name probably is, how many films she’s most likely done and the probability of her having a tattoo or body piercing.
‘So you’ve spent all this time watching hundreds of porn movies?’
‘No,’ I said. ‘I’ve spent all this time analysing the demographic profiles and filmographies of ten thousand adult performers. There is a difference.’
‘I see,’ he then said. ‘And how, dare I ask, does one go about doing that?’
There’s data porn and there’s porn data. Combining the two is Jon Millward, a self-described “Ideas Detective”.
Millward spent six months combing through a database of ten thousand porn stars to determine “what the average performer looks like, what they do on film, and how their role has evolved over the last forty years.”
The result is both a longread analysis and multiple data visualizations of things you never knew you’d want to know.
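As a rough sketch of what that kind of analysis involves (not Millward’s actual pipeline; the file and column names here are hypothetical, since his dataset isn’t public), the “average performer” mostly boils down to modes, medians, and proportions computed over the database:

```python
import pandas as pd

# Hypothetical file and column names; Millward's dataset isn't public.
performers = pd.read_csv("performers.csv")

profile = {
    "most_common_first_name": performers["stage_name"].str.split().str[0].mode()[0],
    "most_common_hair_color": performers["hair_color"].mode()[0],
    "median_film_count":      performers["num_films"].median(),
    "share_with_tattoo":      performers["has_tattoo"].mean(),    # True/False column
    "share_with_piercing":    performers["has_piercing"].mean(),
}
print(profile)
```

Note the mode, not the mean: for categorical traits like hair color there is no average, only a most common value, which is why “a blonde with big boobs” is the kind of claim data can actually test.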
Jon Millward, Deep Inside: A Study of 10,000 Porn Stars and Their Careers.
Somewhat related: Sex Diseases Cost $16 Billion a Year to Treat, CDC Says
He’s wrong again, but he’s getting warmer. According to FBI data, 8,583 people were murdered with firearms in 2011. Only 496 people were killed by blunt objects, a category that includes not just hammers and baseball bats but crowbars, rocks, paving stones, statuettes, and electric guitars. Broun was off by a factor of at least 17 this time, a significant improvement on his estimate of the age of the Earth. The blue planet is 4.54 billion years old, or more than 500,000 times older than Broun believes it to be.
FJP: …but he’s getting warmer.
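The arithmetic behind both factors is easy to verify. (The 9,000-year figure below is Broun’s widely reported young-earth claim, supplied here as an assumption since the excerpt doesn’t state it.)

```python
gun_murders = 8583            # FBI homicide data, 2011
blunt_object_murders = 496    # hammers, bats, crowbars, rocks...

print(gun_murders / blunt_object_murders)  # ~17.3: off by a factor of at least 17

earth_age = 4.54e9   # years
broun_age = 9000     # Broun's widely reported young-earth figure, in years
print(earth_age / broun_age)               # ~504,000: more than 500,000 times older
```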
Well, it’s not just the Times: scientists are also digging through Wikipedia, among many other sites.
Researchers at Microsoft and the Technion-Israel Institute of Technology are creating software that analyzes 22 years of New York Times archives, Wikipedia and about 90 other web resources to predict future disease outbreaks, riots and deaths — and hopefully prevent them.
The new research is the latest in a number of similar initiatives that seek to mine web data to predict all kinds of events. Recorded Future, for instance, analyzes news, blogs and social media to “help identify predictive signals” for a variety of industries, including financial services and defense. Researchers are also using Twitter and Google to track flu outbreaks.
Technology Review outlines how it can work.
The system provides striking results when tested on historical data. For example, reports of droughts in Angola in 2006 triggered a warning about possible cholera outbreaks in the country, because previous events had taught the system that cholera outbreaks were more likely in years following droughts. A second warning about cholera in Angola was triggered by news reports of large storms in Africa in early 2007; less than a week later, reports appeared that cholera had become established. In similar tests involving forecasts of disease, violence, and a significant number of deaths, the system’s warnings were correct between 70 and 90 percent of the time.
See Kira Radinsky and Eric Horvitz, Mining the Web to Predict Future Events (PDF).
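The underlying idea is simple to sketch: learn from historical archives how often one event type follows another in the same place, then raise a warning when a trigger event shows up in fresh news. A toy illustration (not Radinsky and Horvitz’s actual model, which uses far richer features and sources):

```python
# (year, country, event) tuples extracted from historical news archives (toy data)
events = [
    (2006, "Angola", "drought"),
    (2007, "Angola", "cholera"),
    (1997, "Bangladesh", "storm"),
    (1998, "Bangladesh", "cholera"),
    (2002, "Chile", "drought"),
]

def follow_probability(trigger, outcome):
    """P(outcome in the same country the following year | trigger)."""
    trigger_count, follow_count = 0, 0
    for year, country, event in events:
        if event != trigger:
            continue
        trigger_count += 1
        if (year + 1, country, outcome) in events:
            follow_count += 1
    return follow_count / trigger_count if trigger_count else 0.0

p = follow_probability("drought", "cholera")
print(f"P(cholera next year | drought) ~= {p:.2f}")  # 0.50 on this toy data

# On live news, raise a warning when a trigger appears and the learned
# conditional probability clears a threshold.
if p > 0.4:
    print("WARNING: elevated cholera risk following reported drought")
```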
Jeff Jarvis, BuzzMachine. Public is public… except in journalism?
While the above quote may stand on its own, a little context: not everyone liked the map of gun permit owners that was published in the aftermath of the Sandy Hook shooting. Jarvis believes the judgment of whether the map is morally sound belongs to the public, not to journalists.
Other media thinkers have said otherwise. The Times’ David Carr argued yesterday that the map, which showed the addresses of gun permit owners in New York’s Westchester and Rockland counties, isn’t journalism.
Well, is it?
The Breaking News team to Twitter: Your Mechanical Turk team can’t compete with our actual journalists. (via shortformblog)
FJP: Some Background — The Twitter Engineering blog posted yesterday about how it uses real people alongside its search algorithms to determine the “meaning” of trending terms. It does this with both in-house evaluators and Amazon’s Mechanical Turk, a crowdsourced marketplace for accomplishing (relatively) small tasks.
The goal is to contextualize and understand, for example, that something like #BindersFullOfWomen is related to politics.
Here’s what Twitter has to say about what happens when topics begin to trend:
As soon as we discover a new popular search query, we send it to our human evaluators, who are asked a variety of questions about the query… For example: as soon as we notice “Big Bird” spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant Tweets and ads.
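Twitter doesn’t publish its pipeline, but posting such a categorization task to Mechanical Turk with boto3 looks roughly like this (a hedged sketch: the question form, category list, and reward are all invented for illustration):

```python
import boto3

# Sandbox endpoint, so experimenting with the sketch costs nothing.
mturk = boto3.client(
    "mturk",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

query = "#BindersFullOfWomen"

# Minimal HTMLQuestion asking workers to categorize the trending query.
question_xml = f"""
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <html><body>
      <script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
      <p>What is the trending search query "{query}" about?</p>
      <crowd-form>
        <select name="category">
          <option>politics</option><option>sports</option>
          <option>entertainment</option><option>other</option>
        </select>
      </crowd-form>
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>300</FrameHeight>
</HTMLQuestion>"""

hit = mturk.create_hit(
    Title=f"Categorize the search query {query}",
    Description="Pick the category that best describes this trending query.",
    Reward="0.05",                # dollars, passed as a string
    MaxAssignments=3,             # ask several workers, then take a majority vote
    LifetimeInSeconds=3600,
    AssignmentDurationInSeconds=120,
    Question=question_xml,
)
print("HIT created:", hit["HIT"]["HITId"])
```

Asking three workers and taking the majority vote is the standard way to smooth over any single evaluator’s mistakes, which is presumably why Twitter sends each query to multiple judges.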
The Wall Street Journal reports that a little-known government agency now has the authority to hold and monitor data on US citizens for up to five years, even if the individual has never committed a crime.
The goal, it appears, is to use the data to predict future — or potential — criminal activity.
Via the Wall Street Journal*:
[New] rules now allow the little-known National Counterterrorism Center to examine the government files of U.S. citizens for possible criminal behavior, even if there is no reason to suspect them. That is a departure from past practice, which barred the agency from storing information about ordinary Americans unless a person was a terror suspect or related to an investigation.
Now, NCTC can copy entire government databases—flight records, casino-employee lists, the names of Americans hosting foreign-exchange students and many others. The agency has new authority to keep data about innocent U.S. citizens for up to five years, and to analyze it for suspicious patterns of behavior. Previously, both were prohibited…
The changes also allow databases of U.S. civilian information to be given to foreign governments for analysis of their own. In effect, U.S. and foreign governments would be using the information to look for clues that people might commit future crimes.
Under the new rules, the NCTC can request access to any governmental database that it “reasonably believes” contains “terrorism information.”
Considering the National Security Agency is currently building a massive information center in Utah to monitor almost “all forms of communication, including the complete contents of private emails, cell phone calls, and Google searches, as well as all sorts of personal data trails,” the NCTC won’t want for information.
BONUS: Looking for more about government surveillance? Check the FJP Surveillance Tag.
Wall Street Journal, U.S. Terrorism Agency to Tap a Vast Database of Citizens.
* This WSJ article is paywalled if you go directly to the site. If you want to read it, copy the title, paste it into Google, and follow the search result back to the WSJ.
CW Anderson, Culture Daily. The Materiality of Algorithms.
In what reads like a starting point for more posts on the subject, CUNY Prof Chris Anderson discusses what documents journalists may want to design algorithms for, and just how hard that task will be.
Algorithms doing magic inside massive data sets and search engines, while not mathematically simple, are generally easy to conceptualize: the algorithm and its data sit together in the computer, the algorithm sifts through the Excel sheet in the background, and bam! you have something.
But if you’re working with poorly organized documents, it’s difficult to simply plug them in.
Chris writes that the work required to include any document in a set will shape the algorithm that makes sense of the whole bunch. This will be a problem for journalists who want to examine documents made without much forethought, which is to say: government documents, phone records from different companies and countries, eyewitness reports, police sketches, mugshots, bank statements, tax forms, and hundreds of other things worth investigating. Jonathan Stray describes the problem in his work on the Overview Project:
The recovered text [from these documents] is a mess, because these documents are just about the worst possible case for OCR [optical character recognition]: many of these documents are forms with a complex layout, and the pages have been photocopied multiple times, redacted, scribbled on, stamped and smudged. But large blocks of text come through pretty well, and this command extracts what text there is into one file per page.
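Stray’s exact command isn’t shown in this excerpt, but a comparable pipeline is easy to sketch in Python: rasterize each PDF page and run it through Tesseract, writing one text file per page. (A hedged example assuming the pdf2image and pytesseract packages and a local Tesseract install; not necessarily Stray’s tooling.)

```python
import pytesseract                       # wrapper around the Tesseract OCR engine
from pdf2image import convert_from_path  # rasterizes PDF pages via poppler

# Rasterize at 300 dpi; smudged photocopies need all the resolution they can get.
pages = convert_from_path("scanned_documents.pdf", dpi=300)

for i, page in enumerate(pages, start=1):
    text = pytesseract.image_to_string(page)   # complex forms will come out messy
    with open(f"page-{i:04d}.txt", "w") as f:  # one file per page, as in the quote
        f.write(text)
```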
To read the rest of Stray’s account, see his Overview Project.
And for more from Chris Anderson, see our recent video interviews with him.