posts about or somewhat related to ‘data’

All LeBron Shots: Last 5 Seasons

FJP: Crazy balance.

Image: Kirk Goldsberry, Grantland. The Evolution of King James. Select to embiggen.

Mostly Cloudy During Sunshine Week

To coincide with Sunshine Week, the Sunlight Foundation released their Open Legislative Data Report Card. Some states are doing well, many aren’t, with most scoring a Gentleman’s C or below.

Grades are based on what Sunlight calls the Ten Principles for Opening Up Government Information. For this report card, it draws on six of them: completeness, timeliness, ease of access, machine readability, use of commonly owned standards and permanence.
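A report card like this boils down to a rubric: score each criterion, average, map to a letter. A hypothetical Python sketch, with an assumed 0-4 scale and grade cutoffs (Sunlight's actual weighting may differ):

```python
# Hypothetical rubric in the spirit of Sunlight's report card: each state
# is scored 0-4 on the six criteria named in the post, and the average
# maps to a letter grade. Scale and cutoffs are assumptions.

CRITERIA = [
    "completeness", "timeliness", "ease_of_access",
    "machine_readability", "common_standards", "permanence",
]

def letter_grade(scores: dict) -> str:
    """Average the six criterion scores (0-4) and map to A-F."""
    avg = sum(scores[c] for c in CRITERIA) / len(CRITERIA)
    for cutoff, grade in [(3.5, "A"), (2.5, "B"), (1.5, "C"), (0.5, "D")]:
        if avg >= cutoff:
            return grade
    return "F"

print(letter_grade({c: 2 for c in CRITERIA}))  # a Gentleman's "C"
```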

The Society of Professional Journalists takes a different tack to explore government openness, examining what obstacles reporters face when interviewing employees of federal agencies:

[A] survey of journalists who cover federal agencies found that information flow in the United States is highly regulated by public affairs officers, to the point where most reporters considered the control to be a form of censorship and an impediment to providing information to the public. According to a survey of 146 reporters who cover federal agencies, conducted by the Society of Professional Journalists in February 2012, journalists indicated that public information officers often require pre-approval for interviews, prohibit interviews of agency employees, and often monitor interviews. Journalists overwhelmingly agreed with the statement that “the public was not getting all the information it needs because of barriers agencies are imposing on journalists’ reporting practices.”

Meantime, over at the Washington Post, Josh Hicks gives a rundown of what’s going on with FOIA requests:

The Center for Effective Government said Wednesday that the administration’s rate of response to FOIA requests had improved in 2012 but that the percentage of replies with redacted information had grown.

“While processing has gone up, we see a record-setting rate of partial grantings,” said Sean Moulton, the center’s director of open-government policy.

Federal agencies averaged a “C-minus” grade for FOIA compliance in Cause of Action’s analysis, also released Wednesday.

The group sent identical FOIA requests to 16 federal agencies in April. In its report, it said that one-quarter of the agencies provided no information and that the average response time for the others was 75 business days — more than double what the law requires.

Reporters filing FOIA requests with the Commerce Department have to wait even longer. The average turnaround time there is 239 days.

And then there’s national security and whistleblowing. We’ll let the Guardian’s Glenn Greenwald take it away. The gist of it runs like so:

Along with others, I’ve spent the last four years documenting the extreme, often unprecedented, commitment to secrecy that this president has exhibited, including his vindictive war on whistleblowers, his refusal to disclose even the legal principles underpinning his claimed war powers of assassination, and his unrelenting, Bush-copying invocation of secrecy privileges to prevent courts even from deciding the legality of his conduct.

Looking for more opinion and updates on Sunshine Week? Visit The SPJ or SunshineWeek.org’s opinions page.

Images: Screenshots, best and worst of the Sunlight Foundation’s Open Legislative Data Report Card. Select to embiggen.

‘Without any mental deliberation, picture the average female porn star. Just let her spring into your mind’s eye looking however she looks. Can you see her?’

I’d bumped into a friend who I’d not seen in a while and this was the first question I asked him. He didn’t realise at the time that I’d be in self-imposed smutty exile for an untold number of weeks, working on the largest study of porn stars ever undertaken, and now I was out and eager to spread the news.

‘Erm, yeah, I suppose,’ he said.

‘What does she look like?’ I asked, struggling to hide my smile.

When he replied by saying ‘a blonde with big boobs’, I must admit I relished the opportunity to lean in, let the grin spread across my tired face, and say ‘That’s what everyone says. And in fact, it’s wrong’.

‘Oh,’ he said, after I explained how I knew what the average porn star actually looks like, as well as what her name probably is, how many films she’s most likely done and the probability of her having a tattoo or body piercing.

‘So you’ve spent all this time watching hundreds of porn movies?’

‘No,’ I said. ‘I’ve spent all this time analysing the demographic profiles and filmographies of ten thousand adult performers. There is a difference.’

‘I see’, he then said. ‘And how, dare I ask, does one go about doing that?’

There’s data porn and there’s porn data. Combining the two is Jon Millward, a self-described “Ideas Detective”.

Millward spent six months going over a database of ten thousand porn stars to determine “what the average performer looks like, what they do on film, and how their role has evolved over the last forty years.”

The result is both a longread analysis and multiple data visualizations of things you never knew you’d be interested to know.

Jon Millward, Deep Inside: A Study of 10,000 Porn Stars and Their Careers.
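The kind of aggregation Millward describes, finding the most common value of each categorical attribute and averaging the numeric ones, is straightforward. A sketch with invented field names and sample rows (the real dataset is his ten-thousand-performer database):

```python
# Toy version of a demographic-profile aggregation: take the modal
# (most common) value per categorical field and the mean for numeric
# fields. The fields and rows below are made up for illustration.
from collections import Counter
from statistics import mean

performers = [
    {"hair": "brown",  "tattoo": True,  "films": 40},
    {"hair": "blonde", "tattoo": False, "films": 12},
    {"hair": "brown",  "tattoo": True,  "films": 25},
]

def modal(field):
    """Most common value of `field` across all records."""
    return Counter(p[field] for p in performers).most_common(1)[0][0]

print(modal("hair"))                          # the "average" performer's hair
print(mean(p["films"] for p in performers))   # average filmography length
```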

Somewhat related: Sex Diseases Cost $16 Billion a Year to Treat, CDC Says

Georgia congressman Paul Broun claimed after Tuesday’s State of the Union address that “There are more people killed with baseball bats and hammers than are killed with guns.” Explainer readers may remember Broun as the congressman who believes the Earth is 9,000 years old. What about his hammer and baseball bat claim?

He’s wrong again, but he’s getting warmer. According to FBI data, 8,583 people were murdered with firearms in 2011. Only 496 people were killed by blunt objects, a category that includes not just hammers and baseball bats but crowbars, rocks, paving stones, statuettes, and electric guitars. Broun was off by a factor of at least 17 this time, a significant improvement on his estimate of the age of the Earth. The blue planet is 4.54 billion years old, or more than 500,000 times older than Broun believes it to be.
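The Explainer's arithmetic checks out; with the quoted figures:

```python
# Checking the Explainer's arithmetic against the figures quoted above.
gun_murders_2011 = 8583          # FBI: people murdered with firearms, 2011
blunt_object_murders_2011 = 496  # FBI: people killed by blunt objects, 2011

ratio = gun_murders_2011 / blunt_object_murders_2011
print(round(ratio, 1))  # 17.3, i.e. "off by a factor of at least 17"

earth_age_years = 4.54e9  # scientific estimate of Earth's age
broun_age_years = 9_000   # Broun's claim
print(round(earth_age_years / broun_age_years))  # ~504,444 times older
```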

Sunlight Foundation Launches Open States

Via the Sunlight Foundation:

After more than four years of work from volunteers and a full-time team here at Sunlight we’re immensely proud to launch the full Open States site with searchable legislative data for all 50 states, D.C. and Puerto Rico. Open States is the only comprehensive database of activities from all state capitols that makes it easy to find your state lawmaker, review their votes, search for legislation, track bills and much more.

If you watch the video, one of the important points about tracking legislative data is that laws often flow up from the state to the federal level rather than the other way around.

Consider it an early warning system of a type.

Data is available on the Open States Web site, through APIs and through bulk downloads.
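As a rough illustration of the API route, here is a sketch of building a bill-search request. The endpoint path and parameter names (`state`, `q`, `apikey`) are assumptions based on the API conventions of the time and may differ from the current documentation:

```python
# Hypothetical sketch of querying Open States over its API. The v1 path
# and query parameters are assumptions; bulk downloads are also offered.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://openstates.org/api/v1"

def bill_search_url(state: str, query: str, api_key: str) -> str:
    """Build a bill-search URL for one state (hypothetical parameters)."""
    params = urlencode({"state": state, "q": query, "apikey": api_key})
    return f"{BASE}/bills/?{params}"

def search_bills(state, query, api_key):
    """Fetch and decode the JSON response (requires network access)."""
    with urlopen(bill_search_url(state, query, api_key)) as resp:
        return json.load(resp)

print(bill_search_url("ca", "open data", "YOUR_KEY"))
```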

Words and phrases are fundamental building blocks of language and culture, much as genes and cells are to the biology of life. And words are how we express ideas, so tracing their origin, development and spread is not merely an academic pursuit but a window into a society’s intellectual evolution.

Predicting the Future via New York Times Archives →

Well, not just the Times, scientists are also digging through Wikipedia among many other sites.

Via GigaOm:

Researchers at Microsoft and the Technion-Israel Institute of Technology are creating software that analyzes 22 years of New York Times archives, Wikipedia and about 90 other web resources to predict future disease outbreaks, riots and deaths — and hopefully prevent them.

The new research is the latest in a number of similar initiatives that seek to mine web data to predict all kinds of events. Recorded Future, for instance, analyzes news, blogs and social media to “help identify predictive signals” for a variety of industries, including financial services and defense. Researchers are also using Twitter and Google to track flu outbreaks.

Technology Review outlines how it can work.

The system provides striking results when tested on historical data. For example, reports of droughts in Angola in 2006 triggered a warning about possible cholera outbreaks in the country, because previous events had taught the system that cholera outbreaks were more likely in years following droughts. A second warning about cholera in Angola was triggered by news reports of large storms in Africa in early 2007; less than a week later, reports appeared that cholera had become established. In similar tests involving forecasts of disease, violence, and a significant number of deaths, the system’s warnings were correct between 70 and 90 percent of the time.

See Kira Radinsky and Eric Horvitz, Mining the Web to Predict Future Events (PDF).
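The core idea, learning how often one type of event historically follows another and alerting when the precursor appears, can be sketched in a few lines. This toy Python version, with an invented event log and a simple time window, illustrates the principle; it is not the paper's actual model:

```python
# Toy precursor-event predictor: estimate, from a historical log of
# (time, event) pairs, the probability that event `b` occurs within
# `window` time steps after event `a`. Events and data are invented.

def follow_probability(log, a, b, window=1):
    """P(b occurs within `window` steps after a), from a list of (t, event)."""
    trials = hits = 0
    times = sorted(log)
    for i, (t, e) in enumerate(times):
        if e != a:
            continue
        trials += 1
        if any(e2 == b and 0 < t2 - t <= window for t2, e2 in times[i + 1:]):
            hits += 1
    return hits / trials if trials else 0.0

log = [(2004, "drought"), (2005, "cholera"),
       (2006, "drought"), (2007, "cholera"),
       (2008, "storm")]
print(follow_probability(log, "drought", "cholera"))  # 1.0 in this toy log
```

A real system would learn such conditional probabilities across thousands of event types and weigh them against base rates before issuing a warning.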

In the report, Twitter said that, worldwide, it received 1,858 requests from governments for information about users in 2012, as well as 6,646 reports of copyright violations, and 48 demands from governments that content they deem illegal be removed.

I say that news organizations should become advocates for open information, demanding that government not only make more of it available but also put it in standard formats so it can be searched, visualized, analyzed, and distributed. What the value of that information is to society is not up to the gatekeepers — officials or journalists — to decide. It is up to the public.

Jeff Jarvis, BuzzMachine. Public is public… except in journalism?

While the above quote may stand on its own, a little context: not everyone liked the map of gun permit owners that was published in the aftermath of the Sandy Hook shooting. Jarvis believes that the decision of whether or not the map is morally sound belongs to the public — not to journalists.

Other media thinkers have said otherwise. The Times’ David Carr argued yesterday that the map, which showed the addresses of gun permit owners in New York’s Westchester and Rockland counties, isn’t journalism.

Well, is it?

While Twitter’s Turks will help bring much-needed context to the platform, they’re not journalists who verify whether something is true. As we’ve seen with the shootings in Newtown, Connecticut and Superstorm Sandy, Twitter rumors ran rampant. Some rumors turned out to be true, but many were inaccurate or even malicious. Some were important, others were trivial. At Breaking News, we rely on experienced journalists (that’s one of them, Stephanie Clary, above) to verify real-time reports and prioritize their importance. We also add context, associating reports with ongoing stories, topics and locations. But accuracy and importance — along with speed — are the essence of breaking news for any news organization.

The Breaking News team to Twitter: Your Mechanical Turk team can’t compete with our actual journalists. (via shortformblog)

FJP: Some Background — The Twitter Engineering blog posted yesterday about how it uses real people alongside its search algorithms to determine the “meaning” of trending terms. It does this with both in-house evaluators and Amazon’s Mechanical Turk, a crowdsourced marketplace for accomplishing (relatively) small tasks. 

The goal is to contextualize and understand, for example, that something like #BindersFullOfWomen is related to politics.

Here’s what Twitter has to say about what happens when topics begin to trend:

As soon as we discover a new popular search query, we send it to our human evaluators, who are asked a variety of questions about the query… For example: as soon as we notice “Big Bird” spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant Tweets and ads.

(via shortformblog)

Letters, Words and the English Language

In the 1960s, Mark Mayzner culled 20,000 words from newspapers, magazines and books to study the frequency of letters and words, analyze word length and explore where letters appeared within words.

Last month he contacted Google research chief Peter Norvig to see what Norvig could do with Google’s much larger sample size and contemporary computational power.

Norvig complied, downloaded the Google books Ngrams raw data set, and came up with the following after analyzing 97,565 distinct words which were mentioned over 743 billion times.

Some takeaways:

  • Word Counts: The, Of, And, To, In and A are the English language’s most popular words.
  • Word Length: The average length of English words weighted by their popularity is 4.79 letters long.
  • Word Length, Part II: The average length of all 97,565 distinct words is 7.6 letters long.
  • Popular Letters: E, T and A are the most common letters in the English alphabet.
  • Popular Letters Within Words: T most frequently begins a word, E most frequently ends a word.

Back in the day, Mayzner used IBM punchcards to sort his data. Today, Norvig used his personal computer and writes:

Here’s where you would typically see a comparison saying that if you punched the 743 billion words one to a card and stacked them up, then assuming 100 cards per inch, the stack would be 100,000 miles high; nearly halfway to the moon. But that’s silly, because the stack would topple over long before then. If I had 743 billion cards, what I would do is stack them up in a big building, like, say, the Vehicle Assembly Building (VAB) at Kennedy Space Center, which has a capacity of 3.6 million cubic meters. The cards work out to only 2.9 million cubic meters; easy peasy; room to spare. And an IBM model 84 card sorter could blast through these at a rate of 2000 cards per minute, which means it would only take 700 years per pass (but you’d need multiple passes to get the whole job done).

Read through for more findings along with Norvig’s methodology for exploring the data.

Peter Norvig, English Letter Frequency Counts: Mayzner Revisited.

Image: Letter Counts by Position Within Words, by Peter Norvig. Select to embiggen.
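The two headline statistics, letter frequency and popularity-weighted word length, can be reproduced from any word-count table. A minimal sketch in Python, using a tiny made-up sample rather than the full Ngrams data:

```python
# Letter frequency and popularity-weighted word length, Norvig-style,
# over a tiny invented word-count table (the real input is the Google
# books Ngrams data; 4.79 letters is this statistic over that corpus).
from collections import Counter

word_counts = {"the": 500, "of": 300, "letter": 20, "frequency": 10}

# Count each letter once per occurrence of the word containing it.
letters = Counter()
for word, n in word_counts.items():
    for ch in word:
        letters[ch] += n

# Average word length, weighted by how often each word occurs.
total = sum(word_counts.values())
weighted_len = sum(len(w) * n for w, n in word_counts.items()) / total

print(letters.most_common(3))  # most frequent letters in this sample
print(round(weighted_len, 2))  # weighted average word length
```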

For Those Who Want to Prepare for the End of the World

Off World Backup:

Our proprietary process gets your data encrypted, transmitted and stored through our state of the art satellite array. Your content is initially stored locally within our super secret server bunker protected by a MagnetoPlasmic Repulsar Field (trademark pending) powered by a completely green geothermal energy transducers located several miles under a remote mountain range. Data is then methodically broadcast up to our geosynchronous satellite web where your data is encrypted using our quantum bilateral encryption technology. From there we bounce your data through a series of parallel redundant transitional satellites spanning all the way to our various data centers sprinkled around the solar system, with our main facility located within the walls of Olympus Mons on Mars. In case of disruption our satellites implements various algorithms derived from the Nash Equilibrium to find the most beneficial and efficient path to store your data safely and securely. Our martian armed guards are on staff 25 hours a day to provide that last bit of security, to allow you to have one last comfortable night’s sleep, knowing that your data, business and personal, will be ready for you when you need it in the post-apocalyptic rebuild.

What documents and memories would you keep safe for the end of the world?

Image: Screenshot of Off World Backup
H/T: NPR for the find.

US Expands Citizen Data Surveillance to Predict Future Crimes →

The Wall Street Journal reports that a little-known government agency now has the authority to hold and monitor data on US citizens for up to five years, even if the individual has never committed a crime.

The goal, it appears, is to use the data to predict future — or potential — criminal activity.

Via the Wall Street Journal*:

[New] rules now allow the little-known National Counterterrorism Center to examine the government files of U.S. citizens for possible criminal behavior, even if there is no reason to suspect them. That is a departure from past practice, which barred the agency from storing information about ordinary Americans unless a person was a terror suspect or related to an investigation.

Now, NCTC can copy entire government databases—flight records, casino-employee lists, the names of Americans hosting foreign-exchange students and many others. The agency has new authority to keep data about innocent U.S. citizens for up to five years, and to analyze it for suspicious patterns of behavior. Previously, both were prohibited…

The changes also allow databases of U.S. civilian information to be given to foreign governments for analysis of their own. In effect, U.S. and foreign governments would be using the information to look for clues that people might commit future crimes.

Under the new rules, the NCTC can request access to any governmental database that it “reasonably believes” contains “terrorism information.”

Considering the National Security Agency is currently building a massive information center in Utah to monitor almost “all forms of communication, including the complete contents of private emails, cell phone calls, and Google searches, as well as all sorts of personal data trails,” the NCTC won’t want for information.

BONUS: Looking for more about government surveillance? Check the FJP Surveillance Tag.

Wall Street Journal, U.S. Terrorism Agency to Tap a Vast Database of Citizens.

* This WSJ article is paywalled if you go directly to the site. If you want to read it, copy the title, paste it in Google and follow the search result back to the WSJ.

And You Wonder Why You’re Exhausted

Background via Fast Company:

In The Human Face of Big Data, Rick Smolan, a former Time, Life, and National Geographic photographer famous for creating the Day in the Life book series, and author Jennifer Erwitt examine how today’s digital onslaught and emerging technologies can help us better understand and improve the human condition—ourselves, interactions with each other, and the planet.

Susan Karlin, FastCo Create. Earth’s Nervous System: Looking at Humanity Through Big Data.