posts about or somewhat related to ‘data’

We need, in short, to pay attention to the materiality of algorithmic processes. By that, I do not simply mean the materiality of the algorithmic processing (the circuits, server farms, internet cables, super-computers, and so on) but to the materiality of the procedural inputs. To the stuff that the algorithm mashes up, rearranges, and spits out.

C.W. Anderson, Culture Digitally. The Materiality of Algorithms.

In what reads like a starting point for more posts on the subject, CUNY Prof Chris Anderson discusses what documents journalists may want to design algorithms for, and just how hard that task will be.

Algorithms doing magic inside massive data sets and search engines, while not mathematically simple, are generally easy to conceptualize: the algorithm and its data sit in the computer, the algorithm sifts through the spreadsheet in the background and bam! you have something.

But if you’re working with poorly organized documents, it’s difficult to simply plug them in.

Chris writes that the work required to include any document in a set will shape the algorithm that makes sense of the whole bunch. This will be a problem for journalists who want to examine any documents made without much forethought, which is to say: government documents, phone records from different companies and countries, eyewitness reports, police sketches, mugshots, bank statements, tax forms, and hundreds of other things worth investigating.

Chris quotes Jonathan Stray’s trouble preparing 4,500 documents on Iraqi security contractors:

The recovered text [from these documents] is a mess, because these documents are just about the worst possible case for OCR [optical character recognition]: many of these documents are forms with a complex layout, and the pages have been photocopied multiple times, redacted, scribbled on, stamped and smudged. But large blocks of text come through pretty well, and this command extracts what text there is into one file per page.
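The excerpt doesn’t show the command itself, but here is a rough, hypothetical sketch of the same workflow in Python: rasterize each PDF page, run OCR on it, and end up with one text file per page. The pdftoppm and tesseract tools and the folder names are assumptions, not Stray’s actual setup.

```python
# Rough sketch of the workflow Stray describes: rasterize each PDF page, then
# OCR it so every page ends up in its own text file. This is NOT Stray's actual
# command; pdftoppm, tesseract and the folder names here are assumptions.
import subprocess
from pathlib import Path

pdf_dir = Path("contractor_docs")  # hypothetical folder of scanned PDFs
out_dir = Path("ocr_text")
out_dir.mkdir(exist_ok=True)

for pdf in pdf_dir.glob("*.pdf"):
    # Render each page to a 300 dpi PNG named <stem>-<page>.png
    subprocess.run(
        ["pdftoppm", "-png", "-r", "300", str(pdf), str(out_dir / pdf.stem)],
        check=True,
    )

for png in sorted(out_dir.glob("*.png")):
    # Tesseract writes <page>.txt alongside each image: one text file per page.
    subprocess.run(["tesseract", str(png), str(png.with_suffix(""))], check=True)
```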

To read the rest of Stray’s account, see his Overview Project.

And to see more with Chris Anderson, see our recent video interviews with him.

Big Data, Demographics and the Undiscovered Voter

The New York Times has a great piece on the final six weeks of the presidential campaign.

There’s a lot in there in terms of strategies, momentum and setbacks but the use of data and demographics is eye opening:

In Chicago, the [Obama] campaign recruited a team of behavioral scientists to build an extraordinarily sophisticated database packed with names of millions of undecided voters and potential supporters. The ever-expanding list let the campaign find and register new voters who fit the demographic pattern of Obama backers and methodically track their views through thousands of telephone calls every night.

That allowed the Obama campaign not only to alter the very nature of the electorate, making it younger and less white, but also to create a portrait of shifting voter allegiances. The power of this operation stunned Mr. Romney’s aides on election night, as they saw voters they never even knew existed turn out in places like Osceola County, Fla. “It’s one thing to say you are going to do it; it’s another thing to actually get out there and do it,” said Brian Jones, a senior adviser.

New York Times, How a Race in the Balance Went to Obama.

Image: An Obama victory party in Manchester, NH, via the New York Times.

The [New York] Times does not release traffic figures, but a spokesperson said yesterday that [Nate] Silver’s blog provided a significant—and significantly growing, over the past year—percentage of Times pageviews. This fall, visits to the Times’ political coverage (including FiveThirtyEight) have increased, both absolutely and as a percentage of site visits. But FiveThirtyEight’s growth is staggering: where earlier this year, somewhere between 10 and 20 percent of politics visits included a stop at FiveThirtyEight, last week that figure was 71 percent.

But Silver’s blog has buoyed more than just the politics coverage, becoming a significant traffic-driver for the site as a whole. Earlier this year, approximately 1 percent of visits to the New York Times included FiveThirtyEight. Last week, that number was 13 percent. Yesterday, it was 20 percent. That is, one in five visitors to the sixth-most-trafficked U.S. news site took a look at Silver’s blog.

Marc Tracy, The New Republic. Nate Silver Is a One-Man Traffic Machine for the Times.

Takeaway: Stat nerds have clout.

Fortunately, my polling place is around the corner from my apartment.

Not quite sure where yours is? There’s a Web site for that.

Geeky stuff: Fun(ny) design aside, the site pulls data from the Google Civic Information API.
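For the geekier still, here is a minimal sketch of what a polling-place lookup against that API might look like. The endpoint path, parameters and response fields are our best reading of the public documentation, and the API key and address are placeholders.

```python
# Minimal sketch of a polling-place lookup via the Google Civic Information API.
# Endpoint path, parameters and response fields are assumptions based on the
# public docs; the API key and address are placeholders.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
resp = requests.get(
    "https://www.googleapis.com/civicinfo/v2/voterinfo",
    params={"key": API_KEY, "address": "123 Main St, Manchester, NH"},
)
resp.raise_for_status()
data = resp.json()

for place in data.get("pollingLocations", []):
    addr = place.get("address", {})
    print(addr.get("locationName"), addr.get("line1"), addr.get("city"))
```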

Nate Silver on the Colbert Report

The New York Times’s Nate Silver, creator of the influential 538 election forecasting blog, talks pundits versus statistics, and how probability drives his forecasting methodology. 

He has no love for pundits, and says that given the choice between them and Ebola, he’d go with Ebola.

Bonus: Want more on electoral polling? Jihii has a great piece on what it all means, and where it can go so wrong.

Gendered News

From entertainment to finance to politics to sports, the Guardian Datablog explores how women and men are published in leading UK news sources, and how often articles by gender are shared across social networks.

In the interactive they’ve produced, you can sort across different criteria as well as drill deeper into specific publications and their sections.

At a macro level, UK news publishing is much like what we see in the United States: it’s dominated by men, with fewer than 30% of news articles across the Daily Mail, Telegraph and Guardian written by women.

Drill down a bit, or look at gender participation by subject area, and you see women dominating topics like “lifestyle” and “entertainment” and men dominating, well, most everything else.

But the Datablog isn’t just looking at who gets published, but who gets heard.

You would think it’s one and the same, but with the decline of the newspaper front page — and the Web site home page — as a conversation driver, it’s the social ecosystem of readers and their sharing habits that drives audience engagement and interaction.

Via the Guardian:

Online, who gets heard is determined by an ecosystem of actors: individuals sharing on Facebook and Twitter, link-sharing communities, personal algorithms on Google News, and citizen media curators. Newspapers only offer part of the information supply; we readers decide who’s heard every time we click, share or use our own voice…

…Of course, the reach of an article is much more complicated than likes and shares. What gets seen is often dependent on the time of day and the influence of who shares a link.

The definition of likes and shares also changes. Since our measurements in early August, Facebook’s counters have been changed to track links sent within private messages. This year, newsrooms experimented with Facebook social readers and tablet apps to grow their audiences. Bernhard Rieder’s network diagram of the Guardian’s Facebook page illustrates yet another social channel for news. Publishers sometimes can’t agree on what their own data means.

Despite these limitations, data on likes and shares offer the best outside picture of audience interest in women’s writing in the news.

Read through for analysis and more about the methodology and tools used to suss out the data. As usual, the Guardian also lets you download the data so you can work with it yourself.
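If you do download it, here is a minimal sketch of the kind of tally involved. The file name and column names are hypothetical placeholders; swap in whatever the Guardian’s spreadsheet actually uses.

```python
# Sketch: compute the share of articles bylined by women, per section.
# "guardian_gender.csv" and its column names are hypothetical placeholders.
import csv
from collections import defaultdict

counts = defaultdict(lambda: {"female": 0, "total": 0})

with open("guardian_gender.csv", newline="") as f:
    for row in csv.DictReader(f):
        section = row["section"]  # assumed column name
        counts[section]["total"] += 1
        if row["author_gender"].strip().lower() == "female":  # assumed column name
            counts[section]["female"] += 1

for section, c in sorted(counts.items()):
    share = 100.0 * c["female"] / c["total"] if c["total"] else 0.0
    print(f"{section}: {share:.1f}% of articles by women ({c['total']} total)")
```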

Image: Screenshot, UK News Gender Ranking: What They Publish vs What Readers Share, via The Guardian. Select to embiggen.

Nulpunt to Give Freedom of Information Some Digital Grunt

Every good design project starts with a problem, and one of the biggest is how to find the key facts in a sea of data. 

A design studio in Amsterdam called Metahaven is developing a product called Nulpunt to do two things: first, it tells journalists and activists when their government has published a document containing information they care about; second, it lets users highlight, annotate and share the important sections.

Metahaven say that Nulpunt will integrate with the new freedom of information laws the Netherlands is drafting. The new legislation will demand the publication of vastly more documents produced by government, the public service or private companies working on publicly funded projects.

It’s great for transparency in theory, but assuming the laws pass and aren’t hobbled on the way through, the FOI “problem” won’t be about scarcity any more; it’ll be about abundance: how to organize and sift through a vast sea of data. That’s the problem Metahaven is aiming to solve with Nulpunt, using two key digital characteristics: personalization and socialization.

They’re not the only people attacking the problem space: if you’ve got yourself a huge document dump, you can use DocumentCloud to automatically ‘read’ the files for key facts, subjects and dates, or turn to The Overview Project for a kind of visual table of contents.
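To give a flavor of what that automated ‘reading’ amounts to at its very simplest, here is a toy sketch that scans a folder of extracted text files for dates and a few keywords of interest. It is a stand-in illustration, not how DocumentCloud or Overview actually work, and the folder and keywords are invented.

```python
# Toy sketch of sifting a document dump: pull out dates and flag keyword hits.
# This is a stand-in illustration, not DocumentCloud's or Overview's method.
import re
from pathlib import Path

KEYWORDS = {"contract", "payment", "minister"}  # hypothetical terms of interest
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")

for txt in Path("extracted_text").glob("*.txt"):  # hypothetical folder
    text = txt.read_text(errors="ignore").lower()
    hits = sorted(k for k in KEYWORDS if k in text)
    dates = DATE_RE.findall(text)
    if hits:
        print(f"{txt.name}: keywords={hits}, dates found={len(dates)}")
```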

The point of difference for Nulpunt, assuming it gets a release, seems to be that it’s designed to integrate with a specific source of information: the Dutch government. Metahaven are keen to launch Nulpunt in more countries, although they have also said Nulpunt will always be non-profit and commercial-free, which is a tough business model to scale.

There’s more on the product at FastCompany Design and The Verge.

Mapping Gender Income Inequality

The map is a collaboration between Slate and the New America Foundation; the interactive visualization was created using MapBox.

Via Slate:

Women in Utah have it the worst. There, the average working woman makes 55 cents for every dollar the average working man makes. The state is followed closely by Wyoming, at 56 cents; Louisiana, at 59 cents; North Dakota, at 62 cents; and Michigan, at 62 cents. The best states for income equality are Hawaii, Florida, Nevada, Maryland, and North Carolina. In each, women make about three-fourths of what men make.

County-level data illustrate the best cities for pay equality: Washington, D.C. and Dallas lead, followed by San Francisco, Los Angeles, Austin, Santa Fe, New York, and Boston. In each, women make at least 80 cents per dollar that men make. In most other major cities, they make about 70 cents.

For a biggie version, see Slate, Map Shows the Worst State for Women To Make Money.

Mapping Conflict

Conflict History maps the world’s wars and skirmishes over the millennia. Users control the map with a timeline scrubber or by entering search terms. Data is pulled from Freebase and shown on Google Maps.
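For the curious, a minimal sketch of the kind of Freebase query a site like this might run against the MQL read API as it existed at the time. The type and property names are assumptions about Freebase’s schema, not anything taken from Conflict History’s code.

```python
# Sketch of pulling conflicts from Freebase's MQL read API (as it existed at
# the time). The type and property names are assumptions about Freebase's schema.
import json
import requests

query = [{
    "type": "/military/military_conflict",  # assumed Freebase type
    "name": None,
    "start_date": None,
    "end_date": None,
    "limit": 20,
}]

resp = requests.get(
    "https://www.googleapis.com/freebase/v1/mqlread",
    params={"query": json.dumps(query)},
)
resp.raise_for_status()

for conflict in resp.json().get("result", []):
    print(conflict["name"], conflict.get("start_date"), conflict.get("end_date"))
```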

Image: Screenshot, Conflict History 1998-2007.

H/T: Infosthetics.

Bidding on Your Personal Browser History

Proclivity Media and others are working very hard to find out what you want to buy, and they’re getting to know you very well along the way.

Here’s the backstory: one particularly savvy way of advertising has begun receiving a lot of attention lately. It’s called re-targeting, and it relies on personal browser history to figure out what users may want to buy.

Automated programs bid on the ad space individual users see, drawing on personal search histories, more traditional consumer reports and retailer records, and selling one-time ads at several hundred dollars a pop.

Via Internet Retailer:

Proclivity uses its Consumer Valuation Platform to place cookies in consumers’ web browsers to monitor their browsing behavior around the Internet and tracks their specific interactions on a client retailer’s site using tiny pieces of embedded software code in site content. Proclivity adds data from the retailer, including the merchant’s own web analytics on shoppers’ click activity, and information on sales, merchandizing campaigns and product pricing, then scores it to determine when each customer is likely to buy and at what price point.
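As a purely illustrative sketch of the general idea, not Proclivity’s actual model, here is what scoring a shopper and turning that score into a bid might look like. Every feature, weight and number below is invented.

```python
# Illustrative sketch only: score a shopper's likelihood to buy from browsing
# and purchase signals, then translate that into a bid. Not Proclivity's model;
# the features, weights and bid math are all made up for illustration.
def purchase_score(pages_viewed: int, cart_adds: int, past_orders: int,
                   days_since_visit: int) -> float:
    """Crude propensity score in [0, 1]."""
    raw = 0.05 * pages_viewed + 0.3 * cart_adds + 0.2 * past_orders
    decay = max(0.0, 1.0 - days_since_visit / 30.0)  # interest fades over a month
    return min(1.0, raw * decay)

def bid_cpm(score: float, expected_order_value: float,
            margin: float = 0.1, conv_per_impression: float = 0.001) -> float:
    """Toy CPM bid: expected profit from 1,000 impressions shown to this shopper."""
    return round(1000 * conv_per_impression * score * expected_order_value * margin, 2)

score = purchase_score(pages_viewed=12, cart_adds=1, past_orders=2, days_since_visit=3)
print(score, bid_cpm(score, expected_order_value=120.0))
```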

This is very similar to Facebook Exchange, which has been working well, if cautiously, since June.

Here’s the Wall Street Journal:

Facebook is using its data trove to study the links between Facebook ads and members’ shopping habits at brick-and-mortar stores, part of an effort to prove the effectiveness of its $3.7 billion annual ad business to marketers.

FJP: This is big data at work — for many businesses, there’s a lot to find when comparing data sets that follow consumer behavior online and in stores.

I Love Messing with Data

The Journalist’s Resource, a project that curates media scholarship, created a great reading list on the social, cultural and political issues and possibilities surrounding big data.

Like much in today’s digital world, the promise and hope of using huge data sets to solve significant issues are tempered by the threats that same data can pose, depending on whose hands it is in and what they plan to do with it.

What follows are abstracts from just some of the articles the Journalist’s Resource has pulled together. Read through for more and to access links back to the originals.

danah boyd and Kate Crawford
Will large-scale analysis of DNA help cure diseases? Or will it usher in a new wave of medical inequality? Will data analytics help make people’s access to information more efficient and effective? Or will it be used to track protesters in the streets of major cities? Will it transform how we study human communication and culture, or narrow the palette of research options and alter what ‘research’ means? Some or all of the above?… Given the rise of Big Data as both a phenomenon and a methodological persuasion, we believe that it is time to start critically interrogating this phenomenon, its assumptions and its biases.

Vivek Kundra
If … data isn’t sliced, diced and cubed to separate signal from noise, it can be useless. But, when made available to the public and combined with the network effect — defined by Reed’s Law, which asserts that the utility of large networks, particularly social networks, can scale exponentially with the size of the network — society has the potential to drive massive social, political and economic change.

David M. Berry
In cutting up the world [into data chunks], information about the world necessarily has to be discarded in order to store a representation within the computer. In other words, a computer requires that everything is transformed from the continuous flow of our everyday reality into a grid of numbers that can be stored as a representation of reality which can then be manipulated using algorithms. These subtractive methods of understanding reality (episteme) produce new knowledges and methods for the control of reality (techne). They do so through a digital mediation, which the digital humanities are starting to take seriously as they’re problematic.

Bert-Jaap Koops
Big Data involves not only individuals’ digital footprints (data they themselves leave behind) but, perhaps more importantly, also individuals’ data shadows (information about them generated by others). And contrary to physical footprints and shadows, their digital counterparts are not ephemeral but persistent. This presents particular challenges for the right to be forgotten, which are discussed in the form of three key questions. Against whom can the right be invoked? When and why can the right be invoked? And how can the right be effected?

Janna Anderson and Lee Rainie
While enthusiasts see great potential for using Big Data, privacy advocates are worried as more and more data is collected about people — both as they knowingly disclose such things as their postings through social media and as they unknowingly share digital details about themselves as they march through life. Not only do the advocates worry about profiling, they also worry that those who crunch Big Data with algorithms might draw the wrong conclusions about who someone is, how she might behave in the future, and how to apply the correlations that will emerge in the data analysis.

Image: Calvin and Hobbes.

Imagine if your whole life you’ve looked through one eye, only seeing through one eye and suddenly, scientists can give you the ability to open up a second eye. So what you would see is not just more data but it’s a whole different way of seeing.

So said photojournalist Rick Smolan today, telling the audience at a Human Face of Big Data event the same thing he told his son when, at 2 a.m., the little boy climbed out of bed, snuck into the kitchen and asked why he stayed up late every night on the phone talking about “big data.” Smolan continued:

My son, who again wanted to stay up as late as he could before I sent him back to bed, said: could scientists and computers, like, let us open up a third eye and a fourth and a fifth? And I said yes.

See the group’s phone app, its upcoming book and more here.

New York Times, Washington Post developers team up to create Open Elections database →

shaneguiter:

Senior developers from The New York Times and The Washington Post are looking for volunteers to help collect more than 10 years of federal elections data from each state. With their help — and $200,000 in Knight News Challenge funding — Serdar Tumgoren and Derek Willis are working on creating a free, comprehensive source of official U.S. election results.

The goal is to end up with electoral data that can then be linked to different types of data sets — campaign finance, voter demographics, legislative histories, and so on — in ways that previously haven’t been possible on this scale.

Tumgoren, of The Washington Post, says the idea for Open Elections came from “mutual frustration that there is no single, free source of data — and more importantly, nicely standardized data.” Soothing this frustration isn’t necessarily going to be pretty. The task of finding state elections data — at least some of which will be a godawful, inextricable mess — will require some “brute-forcing,” Tumgoren says.
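The “nicely standardized” part is where the brute force comes in. Here is a hedged sketch of what normalizing one state’s quirky results file into a shared schema might look like; the input headers and the target schema are invented for illustration and are not Open Elections’ actual specification.

```python
# Sketch of normalizing one state's messy results file into a common schema.
# The input column names and the target schema are invented for illustration;
# they are not Open Elections' actual specification.
import csv

TARGET_FIELDS = ["state", "county", "office", "candidate", "party", "votes"]

# Map this particular state's quirky headers onto the shared schema.
COLUMN_MAP = {"County Name": "county", "Race": "office",
              "Candidate Name": "candidate", "Party Cd": "party",
              "Total Votes": "votes"}

def normalize(row: dict, state: str) -> dict:
    out = {"state": state}
    for src, dst in COLUMN_MAP.items():
        out[dst] = row.get(src, "").strip()
    out["votes"] = int(out["votes"].replace(",", "") or 0)
    return out

with open("fl_general_2012.csv", newline="") as src, \
     open("fl_general_2012_clean.csv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=TARGET_FIELDS)
    writer.writeheader()
    for row in csv.DictReader(src):
        writer.writerow(normalize(row, state="FL"))
```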

Access to Full Twitter Archive of Public Posts Now Available →

Gnip, a social data delivery company that offers the full Twitter firehose, announced the release of Historical PowerTrack, a tool for accessing Twitter’s complete public history.

Via Gnip:

This level of access has never been available and we know it is really going to accelerate the rate of innovation going forward. We think there are new products and businesses that will now be possible with access to a “social layer” of historical data. We frequently ask ourselves “If you could know what the world was saying at any moment in time about any topic, what could you build?”

We very much look forward to seeing how that question is answered.
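Gnip’s job-based API is more involved than a quick sketch, but here is a hedged illustration of the “any moment in time, any topic” idea applied to a local archive of tweets stored one JSON object per line. The file name is a placeholder, and the field names assume Twitter’s standard payload.

```python
# Illustration of the "any moment, any topic" idea over a local archive of
# tweets stored one JSON object per line. Not Gnip's Historical PowerTrack API;
# the file name is a placeholder and the field names assume Twitter's payload.
import json
from datetime import datetime, timezone

TOPIC = "election"
START = datetime(2012, 11, 6, tzinfo=timezone.utc)
END = datetime(2012, 11, 7, tzinfo=timezone.utc)

def created_at(tweet: dict) -> datetime:
    # Twitter's classic timestamp format, e.g. "Tue Nov 06 20:15:00 +0000 2012"
    return datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")

matches = 0
with open("tweets_archive.jsonl") as f:  # placeholder archive file
    for line in f:
        tweet = json.loads(line)
        if TOPIC in tweet.get("text", "").lower() and START <= created_at(tweet) < END:
            matches += 1

print(f"{matches} tweets mentioning '{TOPIC}' in the window")
```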