Posts tagged data

In the report, Twitter said that, worldwide, it received 1,858 requests from governments for information about users in 2012, as well as 6,646 reports of copyright violations, and 48 demands from governments that content they deem illegal be removed.

I say that news organizations should become advocates for open information, demanding that government not only make more of it available but also put it in standard formats so it can be searched, visualized, analyzed, and distributed. What the value of that information is to society is not up to the gatekeepers — officials or journalists — to decide. It is up to the public.

Jeff Jarvis, BuzzMachine. Public is public… except in journalism?

While the above quote may stand on its own, a little context: not everyone liked the map of gun permit owners that was published in the aftermath of the Sandy Hook shooting. Jarvis believes that the decision of whether or not the map is morally sound belongs to the public — not to journalists.

Other media thinkers have said otherwise. The Times’ David Carr argued yesterday that the map, which showed the addresses of gun permit owners in New York’s Westchester and Rockland counties, isn’t journalism.

Well, is it?

While Twitter’s Turks will help bring much-needed context to the platform, they’re not journalists who verify whether something is true. As we’ve seen with the shootings in Newtown, Connecticut and Superstorm Sandy, Twitter rumors ran rampant. Some rumors turned out to be true, but many were inaccurate or even malicious. Some were important, others were trivial. At Breaking News, we rely on experienced journalists (that’s one of them, Stephanie Clary, above) to verify real-time reports and prioritize their importance. We also add context, associating reports with ongoing stories, topics and locations. But accuracy and importance — along with speed — are the essence of breaking news for any news organization.

The Breaking News team to Twitter: Your Mechanical Turk team can’t compete with our actual journalists. (via shortformblog)

FJP: Some Background — The Twitter Engineering blog posted yesterday about how it uses real people alongside its search algorithms to determine the “meaning” of trending terms. It does this with both in-house evaluators and Amazon’s Mechanical Turk, a crowdsourced marketplace for accomplishing (relatively) small tasks. 

The goal is to contextualize and understand, for example, that something like #BindersFullOfWomen is related to politics.

Here’s what Twitter has to say about what happens when topics begin to trend:

As soon as we discover a new popular search query, we send it to our human evaluators, who are asked a variety of questions about the query… For example: as soon as we notice “Big Bird” spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant Tweets and ads.

Letters, Words and the English Language

In the 1960s, Mark Mayzner culled 20,000 words from newspapers, magazines and books to study the frequency of letters and words, analyze word length and explore where letters appeared within words.

Last month he contacted Google research chief Peter Norvig to see what Norvig could do with Google’s much larger sample size and contemporary computational power.

Norvig complied, downloaded the Google Books Ngrams raw data set, and came up with the following after analyzing 97,565 distinct words that were mentioned over 743 billion times.

Some takeaways:

  • Word Counts: The, Of, And, To, In and A are the English language’s most popular words.
  • Word Length: The average length of English words weighted by their popularity is 4.79 letters long.
  • Word Length, Part II: The average length of all 97,565 distinct words is 7.6 letters long.
  • Popular Letters: E, T and A are the most common letters in the English alphabet.
  • Popular Letters Within Words: T most frequently begins a word, E most frequently ends a word.
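For readers who want to try the same kind of count on their own text, here is a minimal Python sketch of the statistics listed above. It is not Norvig's code: the function name `summarize` and the five-word sample are invented for illustration, and the real analysis ran over the full Ngrams data set.

```python
from collections import Counter


def summarize(word_counts):
    """Compute word-length and letter statistics.

    word_counts maps each distinct (non-empty, lowercase) word to the
    number of times it was seen. Hypothetical helper for illustration.
    """
    total_mentions = sum(word_counts.values())

    # Average length weighted by popularity: frequent short words dominate.
    weighted_len = sum(len(w) * n for w, n in word_counts.items()) / total_mentions

    # Average length over distinct words: every word counts exactly once.
    distinct_len = sum(len(w) for w in word_counts) / len(word_counts)

    letters, first, last = Counter(), Counter(), Counter()
    for word, n in word_counts.items():
        first[word[0]] += n   # letter that begins the word
        last[word[-1]] += n   # letter that ends the word
        for ch in word:
            letters[ch] += n  # every letter, weighted by mentions

    return {
        "weighted_len": weighted_len,
        "distinct_len": distinct_len,
        "letters": letters,
        "first": first,
        "last": last,
    }


# Made-up toy counts, only to show the shape of the input.
sample = {"the": 50, "of": 30, "and": 20, "information": 2, "journalism": 1}
stats = summarize(sample)
print(round(stats["weighted_len"], 2), stats["distinct_len"])
print(stats["first"].most_common(1), stats["last"].most_common(1))
```

Even in a toy sample the gap between the two averages shows up: short, common words pull the weighted figure well below the average over distinct words.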

Back in the day, Mayzner used IBM punchcards to sort his data. Norvig used his personal computer, and writes:

Here’s where you would typically see a comparison saying that if you punched the 743 billion words one to a card and stacked them up, then assuming 100 cards per inch, the stack would be 100,000 miles high; nearly halfway to the moon. But that’s silly, because the stack would topple over long before then. If I had 743 billion cards, what I would do is stack them up in a big building, like, say, the Vehicle Assembly Building (VAB) at Kennedy Space Center, which has a capacity of 3.6 million cubic meters. The cards work out to only 2.9 million cubic meters; easy peasy; room to spare. And an IBM model 84 card sorter could blast through these at a rate of 2000 cards per minute, which means it would only take 700 years per pass (but you’d need multiple passes to get the whole job done).
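Norvig's back-of-the-envelope figures can be checked with a few lines of Python. The card count, cards per inch and sorter speed come from the quote above. The card footprint (7 3/8 by 3 1/4 inches, the standard IBM punch card size) and the 365-day year are assumptions added here, not numbers from his post.

```python
CARDS = 743_000_000_000          # one word per card, from the quote
CARDS_PER_INCH = 100             # stacking density, from the quote
SORTER_CARDS_PER_MINUTE = 2_000  # IBM model 84 rate, from the quote

INCHES_PER_MILE = 5_280 * 12
MINUTES_PER_YEAR = 60 * 24 * 365  # assumption: sorter runs nonstop, 365-day year

# Assumption: standard punch card footprint, 7 3/8 in x 3 1/4 in.
CARD_WIDTH_IN = 7.375
CARD_HEIGHT_IN = 3.25
CUBIC_METERS_PER_CUBIC_INCH = 0.0254 ** 3

# Height of a single stack of all the cards.
stack_miles = CARDS / CARDS_PER_INCH / INCHES_PER_MILE

# Volume if each card occupies 1/100 inch of stack height.
card_volume_in3 = CARD_WIDTH_IN * CARD_HEIGHT_IN / CARDS_PER_INCH
total_volume_m3 = CARDS * card_volume_in3 * CUBIC_METERS_PER_CUBIC_INCH

# Time for the sorter to read every card once.
years_per_pass = CARDS / SORTER_CARDS_PER_MINUTE / MINUTES_PER_YEAR

print(f"stack height: {stack_miles:,.0f} miles")
print(f"volume: {total_volume_m3 / 1e6:.2f} million cubic meters")
print(f"one sorter pass: {years_per_pass:,.0f} years")
```

Under those assumptions the results land close to the rounded figures in the quote: a stack a little over 100,000 miles high, about 2.9 million cubic meters, and roughly 700 years per pass.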

Read through for more findings along with Norvig’s methodology for exploring the data.

Peter Norvig, English Letter Frequency Counts: Mayzner Revisited.

Image: Letter Counts by Position Within Words, by Peter Norvig. Select to embiggen.

For Those Who Want to Prepare for the End of the World

Off World Backup:

Our proprietary process gets your data encrypted, transmitted and stored through our state of the art satellite array. Your content is initially stored locally within our super secret server bunker protected by a MagnetoPlasmic Repulsar Field (trademark pending) powered by a completely green geothermal energy transducers located several miles under a remote mountain range. Data is then methodically broadcast up to our geosynchronous satellite web where your data is encrypted using our quantum bilateral encryption technology. From there we bounce your data through a series of parallel redundant transitional satellites spanning all the way to our various data centers sprinkled around the solar system, with our main facility located within the walls of Olympus Mons on Mars. In case of disruption our satellites implements various algorithms derived from the Nash Equilibrium to find the most beneficial and efficient path to store your data safely and securely. Our martian armed guards are on staff 25 hours a day to provide that last bit of security, to allow you to have one last comfortable night’s sleep, knowing that your data, business and personal, will be ready for you when you need it in the post-apocalyptic rebuild.

What documents and memories would you keep safe for the end of the world?

Image: Screenshot of Off World Backup
H/T: NPR for the find. 

US Expands Citizen Data Surveillance to Predict Future Crimes

The Wall Street Journal reports that a little-known government agency now has the authority to hold and monitor data on US citizens for up to five years, even if the individual has never committed a crime.

The goal, it appears, is to use the data to predict future — or potential — criminal activity.

Via the Wall Street Journal*:

[New] rules now allow the little-known National Counterterrorism Center to examine the government files of U.S. citizens for possible criminal behavior, even if there is no reason to suspect them. That is a departure from past practice, which barred the agency from storing information about ordinary Americans unless a person was a terror suspect or related to an investigation.

Now, NCTC can copy entire government databases—flight records, casino-employee lists, the names of Americans hosting foreign-exchange students and many others. The agency has new authority to keep data about innocent U.S. citizens for up to five years, and to analyze it for suspicious patterns of behavior. Previously, both were prohibited…

The changes also allow databases of U.S. civilian information to be given to foreign governments for analysis of their own. In effect, U.S. and foreign governments would be using the information to look for clues that people might commit future crimes.

Under the new rules, the NCTC can request access to any governmental database that it “reasonably believes” contains “terrorism information.”

Considering that the National Security Agency is currently building a massive information center in Utah to monitor almost “all forms of communication, including the complete contents of private emails, cell phone calls, and Google searches, as well as all sorts of personal data trails,” the NCTC won’t want for information.

BONUS: Looking for more about government surveillance? Check the FJP Surveillance Tag.

Wall Street Journal, U.S. Terrorism Agency to Tap a Vast Database of Citizens.

* This WSJ article is paywalled if you go directly to the site. If you want to read it, copy the title, paste it in Google and follow the search result back to the WSJ.

And You Wonder Why You’re Exhausted

Background via Fast Company:

In The Human Face of Big Data, Rick Smolan, a former Time, Life, and National Geographic photographer famous for creating the Day in the Life book series, and author Jennifer Erwitt examine how today’s digital onslaught and emerging technologies can help us better understand and improve the human condition—ourselves, interactions with each other, and the planet.

Susan Karlin, FastCo Create. Earth’s Nervous System: Looking at Humanity Through Big Data.

We need, in short, to pay attention to the materiality of algorithmic processes. By that, I do not simply mean the materiality of the algorithmic processing (the circuits, server farms, internet cables, super-computers, and so on) but to the materiality of the procedural inputs. To the stuff that the algorithm mashes up, rearranges, and spits out.

CW Anderson, Culture Daily. The Materiality of Algorithms.

In what reads like a starting point for more posts on the subject, CUNY Prof Chris Anderson discusses what documents journalists may want to design algorithms for, and just how hard that task will be.

Algorithms doing magic inside massive data sets and search engines, while not mathematically simple, are generally easy to conceptualize: the algorithm and its data sit in the computer, the algorithm sifts through the Excel sheet in the background and bam! you have something.

But if you’re working with poorly organized documents, it’s difficult to simply plug them in.

Chris writes that the work required to include any document in a set will shape the algorithm that makes sense of the whole bunch. This will be a problem for journalists who want to examine any documents made without much forethought, which is to say: government documents, phone records from different companies and countries, eyewitness reports, police sketches, mugshots, bank statements, tax forms, and hundreds of other things worth investigating.

Chris quotes Jonathan Stray’s trouble preparing 4500 docs on Iraqi security contractors:

The recovered text [from these documents] is a mess, because these documents are just about the worse possible case for OCR [optical character recognition]: many of these documents are forms with a complex layout, and the pages have been photocopied multiple times, redacted, scribbled on, stamped and smudged. But large blocks of text come through pretty well, and this command extracts what text there is into one file per page.

To read the rest of Stray’s account, see his Overview Project.

And to see more with Chris Anderson, see our recent video interviews with him.

Big Data, Demographics and the Undiscovered Voter

The New York Times has a great piece on the final six weeks of the presidential campaign.

There’s a lot in there in terms of strategies, momentum and setbacks but the use of data and demographics is eye opening:

In Chicago, the [Obama] campaign recruited a team of behavioral scientists to build an extraordinarily sophisticated database packed with names of millions of undecided voters and potential supporters. The ever-expanding list let the campaign find and register new voters who fit the demographic pattern of Obama backers and methodically track their views through thousands of telephone calls every night.

That allowed the Obama campaign not only to alter the very nature of the electorate, making it younger and less white, but also to create a portrait of shifting voter allegiances. The power of this operation stunned Mr. Romney’s aides on election night, as they saw voters they never even knew existed turn out in places like Osceola County, Fla. “It’s one thing to say you are going to do it; it’s another thing to actually get out there and do it,” said Brian Jones, a senior adviser.

New York Times, How a Race in the Balance Went to Obama.

Image: An Obama victory party in Manchester, NH, via the New York Times.

The [New York] Times does not release traffic figures, but a spokesperson said yesterday that [Nate] Silver’s blog provided a significant—and significantly growing, over the past year—percentage of Times pageviews. This fall, visits to the Times’ political coverage (including FiveThirtyEight) have increased, both absolutely and as a percentage of site visits. But FiveThirtyEight’s growth is staggering: where earlier this year, somewhere between 10 and 20 percent of politics visits included a stop at FiveThirtyEight, last week that figure was 71 percent.

But Silver’s blog has buoyed more than just the politics coverage, becoming a significant traffic-driver for the site as a whole. Earlier this year, approximately 1 percent of visits to the New York Times included FiveThirtyEight. Last week, that number was 13 percent. Yesterday, it was 20 percent. That is, one in five visitors to the sixth-most-trafficked U.S. news site took a look at Silver’s blog.

Marc Tracy, The New Republic. Nate Silver Is a One-Man Traffic Machine for the Times.

Takeaway: Stat nerds have clout.

Fortunately, my polling place is around the corner from my apartment.

Not quite sure where yours is? There’s a Web site for that.

Geeky stuff: Fun(ny) design aside, the site pulls data from the Google Civic Information API.

Nate Silver on the Colbert Report

The New York Times’s Nate Silver, creator of the influential 538 election forecasting blog, talks pundits versus statistics, and how probability drives his forecasting methodology. 

He has no love for pundits, and says that given the choice between them and Ebola, he’d go with Ebola.

Bonus: Want more on electoral polling? Jihii has a great piece on what it all means, and where it can go so wrong.

Gendered News

From entertainment to finance to politics to sports, the Guardian Datablog explores how women and men are published in leading UK news sources, and how often articles by gender are shared across social networks.

In the interactive they’ve produced, you can sort across different criteria as well as drill deeper into specific publications and their sections.

At a macro level, UK news publishing is much like what we see in the United States: it’s dominated by men with less than 30% of news articles published by women across the Daily Mail, Telegraph and Guardian.

Drill down a bit, or look at gender participation by subject area, and you see women dominating topics like “lifestyle” and “entertainment” and men dominating, well, most everything else.

But the Datablog isn’t just looking at who gets published, but who gets heard.

You would think it’s one and the same but with the decline of the newspaper front page — and the Web site home page — as a conversation driver, it’s the social ecosystem of readers and their sharing habits that drives audience engagement and interaction.

Via the Guardian:

Online, who gets heard is determined by an ecosystem of actors: individuals sharing on Facebook and Twitter, link-sharing communities, personal algorithms on Google News, and citizen media curators. Newspapers only offer part of the information supply; we readers decide who’s heard every time we click, share or use our own voice…

…Of course, the reach of an article is much more complicated than likes and shares. What gets seen is often dependent on the time of day and the influence of who shares a link.

The definition of likes and shares also changes. Since our measurements in early August, Facebook’s counters have been changed to track links sent within private messages. This year, newsrooms experimented with Facebook social readers and tablet apps to grow their audiences. Bernhard Rieder’s network diagram of the Guardian’s Facebook page illustrates yet another social channel for news. Publishers sometimes can’t agree on what their own data means.

Despite these limitations, data on likes and shares offer the best outside picture of audience interest in women’s writing in the news.

Read through for analysis and more about the methodology and tools used to suss out the data. As usual, the Guardian also lets you download the data so you can work with it yourself.

Image: Screenshot, UK News Gender Ranking: What They Publish vs What Readers Share, via The Guardian. Select to embiggen.

Nulpunt to Give Freedom of Information Some Digital Grunt

Every good design project starts with a problem, and one of the biggest is how to find the key facts in a sea of data. 

A design studio in Amsterdam called Metahaven is developing a product called Nulpunt to do two things: first, it tells journalists and activists when their government has published a document holding information they care about; second, it lets users highlight, annotate and share the important sections.

Metahaven say that Nulpunt will integrate with the new freedom of information laws the Netherlands is drafting. The new legislation will demand the publication of vastly more documents produced by government, the public service or private companies working on publicly funded projects.

It’s great for transparency in theory, but assuming the laws pass and aren’t hobbled on the way through, the FOI “problem” won’t be about scarcity any more. It’ll be about abundance: how to organize and sift through a vast sea of data. That’s the problem Metahaven is aiming to solve with Nulpunt, using two key digital characteristics: personalization and socialization.

They’re not the only people attacking the problem: if you’ve got yourself a huge document dump you can use DocumentCloud to automatically ‘read’ the files for key facts, subjects and dates, or turn to The Overview Project to get a kind of visual table of contents.

The point of difference for Nulpunt, assuming it gets a release, seems to be that it’s designed to integrate with a specific source of information, namely the Dutch government. Metahaven are keen to launch Nulpunt in more countries, although they have also said Nulpunt will not always be non-profit and commercial-free, which is a tough business model to scale.

There’s more on the product at Fast Company Design and The Verge.