Posts tagged data

Things You Can Do That You Never Used To

Via Archive.org:

For over a decade, CNN (Cable News Network) has been providing transcripts of shows, events and newscasts from its broadcasts. The archive has been maintained and the text transcripts have been dependably available at transcripts.cnn.com. This is a just-in-case grab of the years of transcripts for later study and historical research.

So if you can’t get enough of whatever it is they’re trying to do in the Situation Room, a one gig tarball of text is waiting for your download.

H/T: Flowing Data

Them, not that or there: Bing and the social search engine

Let’s speak cryptically, because the mood today calls for it: the search engine as self has always been a middle man (or woman), pointing us toward Wikipedia, Yelp, or wherever else we want to go online but don’t actually know it yet.

But what if instead of sending us out there, it told us who knew what — who, among my friends and acquaintances, can give me suggestions on where the best hikes are in upstate New York, and help me avoid those old looking state park sites that don’t tell me anything? Well, Bing to the rescue.

via Fast Company:

“We’re literally no longer indexing text,” [Bing director Stefan] Weitz says. “We’re trying to associate data that exists on the web in all forms with the physical object that spawned it in the first place.”

That means that when searching for upstate hiking trails, you’ll be shown which of your friends may have been somewhere up there, perhaps during a summer trip five years ago that they never mentioned but conveniently made into a Facebook photo album that you never saw.

But, like most new ideas, there are hurdles.

via Co.Design:

Bing isn’t taking all user-generated content into consideration when it makes its people-relevance decisions. That’s because it would take an extraordinary amount of computing power to analyze all the free text people generate and determine its meaning (for example, if you write about “turkey,” are you talking about the bird or the country?).

So instead, Bing is simply looking at what your friends Like, share, or search for to assess their expertise on certain topics. But those proxies might not be sufficient to actually get you to the right people. “Just because there’s someone in my social graph who Likes Hawaii doesn’t mean they’re the best person to recommend a hotel on Kauai,” Rebecca Lieb of the Altimeter Group tells Fast Company.

FJP: One oversight on Bing’s part may be the fact that I don’t want to ask that one guy I haven’t seen in three years what the Adirondacks are like. But it’s still a good idea.

“Gay marriage conversation peaked at 7,347 Tweets per minute at 3:22p ET yesterday.” — @gov.

It more or less kicked off with Matthew Keys’ (ProducerMatthew) Twitter post breaking the news.

A Finely Curated List of Data Tools

A fantastic resource for getting started in — and advancing — your work with data from some of the best in the business.

Via Datavisualization.ch:

Datavisualization.ch Selected Tools is a collection of tools that we, the people behind Datavisualization.ch, work with on a daily basis and recommend warmly. This is not a list of everything out there, but instead a thoughtfully curated selection of our favourite tools that will make your life easier creating meaningful and beautiful data visualizations.

As Benjamin Wiederkehr writes on their blog, “It includes libraries for plotting data on maps, frameworks for creating charts, graphs and diagrams and tools to simplify the handling of data. Even if you’re not into programming, you’ll find applications that can be used without writing one single line of code.”

FJP Pro Tip: Jump in and start playing. If you’re just getting started, check out our short videos with Bitly data chief Hilary Mason for her advice on working with data.

1. The model which has guided many people’s thinking in this area, the 1/9/90 rule, is outmoded. The number of people participating online is significantly higher than 10%.

The above is just one of six findings by the BBC’s Holly Goodier, who has spent a good deal of time assessing online participation patterns in the UK. Here are the other five, which she and her team drew from a general agreement that the former audience is becoming more and more active online:

2. Participation is now the rule rather than the exception: 77% of the UK online population is now active in some way.
3. This has been driven by the rise of ‘easy participation’: activities which may have once required great effort but now are relatively easy, expected and every day. 60% of the UK online population now participates in this way, from sharing photos to starting a discussion.
4. Despite participation becoming relatively ‘easy’, almost a quarter of people (23%) remain passive - they do not participate at all.
5. Passivity is not as rooted in digital literacy as traditional wisdom may have suggested. 11% of the people who are passive online today are early adopters. They have the access and the ability but are choosing not to participate.
6. Digital participation now is best characterised through the lens of choice. These are the decisions we take about whether, when, with whom and around what, we will participate. Because participation is now much more about who we are, than what we have, or our digital skill.

See here for more on the 1/9/90 rule.

These are the humans trying to give our jobs to robots

There’s been a lot of talk lately about Narrative Science, its boss Kristian Hammond, and their algorithmic journalist robots of the future. Most of the controversy has been over a few audacious comments, as most controversy usually is (via Wired):

Last year at a small conference of journalists and technologists, I asked Hammond to predict what percentage of news would be written by computers in 15 years. At first he tried to duck the question, but with some prodding he sighed and gave in: “More than 90 percent.”

He also predicted that a computer will win the Pulitzer Prize by 2017. But that’s just talk: judging from what his algorithms have produced so far, a Pulitzer is hard to imagine, though the 90% prediction is not as easy to rebut.

via Slate, on what the robots cover:

Narrative Science is one of several companies developing automated journalism software. These startups work primarily in niche fields—sports, finance, real estate—in which news stories tend to follow the same pattern and revolve around statistics. 

Take the financial articles that NS writes for Forbes, as considered a little later in the article:

Don’t miss the irony here: Automated platforms are now “writing” news reports about companies that make their money from automated trading. These reports are eventually fed back into the financial system, helping the algorithms to spot even more lucrative deals. Essentially, this is journalism done by robots and for robots. The only upside here is that humans get to keep all the cash.

Following the diplomatic/commodity trail that influences stock prices, or tracking stats and numbers in sports to find stories, may eventually become an obsolete task for us humans as robots begin to cover them more efficiently, and faster. And, having begun to crawl through Twitter for election coverage, Narrative Science’s scope may (soon! soon!) slowly grow.

FJP: But as for what this post covers, the concern is a lot like other problems people have with today’s journalism. In the same way that programmers or bloggers won’t replace columnists and reporters, but will instead facilitate, complement, and in all sorts of ways share the new workload, so too might Narrative Science-esque algorithms cover some of the responsibilities that future journalism expects, but which are difficult/unreasonable/impossible for, say, a journalist from ten years ago to handle.

Photo courtesy of Narrative Science.

Data Tools, Data Challenges

Bitly Data Chief Hilary Mason explains how the company’s infrastructure is set up, what challenges she sees ahead for data science and offers a wish list of tools she hopes the community will come together to create.

Last week, we posted other segments from this interview. They include getting started with data and how to work with data. They can be viewed here.

Our talk with Benji and Matt

We emailed the Guardian’s Benji Lanyado about a new project he and Matt Andrews have been working on called Top 5 News, which lists the most popular articles by the most popular news orgs in the US and UK. Here’s what we talked about, short and simple:

FJP (Blake): What is Top 5 News and how did it come together?

Benji: top5news.net (and its British cousin top5news.co.uk) pulls from a number of different news sites, displaying their most popular pieces of content every 15 minutes. We wanted it to be a snapshot of what people are actually reading, rather than the latest news, or editor’s choices. To some extent, it’s an automated Drudge Report. 

FJP (Blake): How does it work? What was used to make it work?

Matt: The site is a fairly customised use of the PHP framework CodeIgniter. It goes off to fetch the page HTML of the source websites every 15 minutes and scans through the code for the relevant links. We store an archive of links as well as the most recent ones so that over time we can attempt some data visualisation to show trends and spikes. Finally, on the front end we use CSS3 media queries to give the site a responsive design so it works well on mobile too.
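FJP: For the curious, here is a rough sketch of the "fetch the page, scan for the relevant links, archive them" idea Matt describes. It is not the Top 5 News code: theirs is PHP on CodeIgniter, while this is plain Python using only the standard library. The `most-popular` container id, the sample markup and the helper names are all made up for illustration, and the fetching and 15-minute scheduling are left out.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PopularLinkParser(HTMLParser):
    """Collect hrefs of <a> tags nested inside the element with a given id."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0   # greater than 0 while inside the container
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth > 0:
            if tag not in VOID_TAGS:
                self.depth += 1
            if tag == "a" and attrs.get("href"):
                self.links.append(attrs["href"])
        elif attrs.get("id") == self.container_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def extract_top_links(page_html, container_id="most-popular", limit=5):
    """Return up to `limit` links found inside the (hypothetical) container."""
    parser = PopularLinkParser(container_id)
    parser.feed(page_html)
    return parser.links[:limit]


def record_snapshot(archive, source, links, fetched_at):
    """Append one timestamped snapshot so trends can be charted later."""
    archive.append({"source": source, "fetched_at": fetched_at, "links": links})
    return archive


# Invented markup standing in for a news site's "most read" box.
SAMPLE_HTML = """
<html><body>
  <a href="/home">Home</a>
  <ol id="most-popular">
    <li><a href="/story-1">One</a></li>
    <li><img src="x.png"><a href="/story-2">Two</a></li>
    <li><a href="/story-3">Three</a></li>
  </ol>
  <a href="/about">About</a>
</body></html>
"""

archive = record_snapshot([], "example-news", extract_top_links(SAMPLE_HTML), "12:00")
```

In a real version, the HTML would come from an HTTP request to each source site on a timer, and every site would need its own rule for where the popular links live, which is presumably where the "fairly customised" part comes in.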

FJP (Blake): Is it “just” an experiment or is it something you plan to build off of?

Benji: For now, it’s a minimum viable product… we want to see how popular the idea is, and gather as much feedback as possible. After that… who knows.

FJP (Blake): Besides that, your work at the Guardian and all your interactive traveling (a la Kerouapp) is very cool. Any plans to expand upon this previous work?

Benji: Yeah, it was a lot of fun working on Kerouapp, and its been great to see Jon Henley, one of the Guardian’s feature writers, using it for his trips across Europe. It’s also been used by the BBC and Time Out, which is great. I’m actually travelling a lot less these days, but would love to see other news organisations use the tool and run with it.

FJP (Blake): Please tell me about any other plans you might have, and what you’d like to do in the near/distant future.

Benji: I’m very keen to keep working on projects like this with developers, both inside the Guardian and outside it. I’m actually starting an intensive front end development course myself in a few weeks, so I can potentially knock together prototypes myself in the near future. I think basic programming skills are going to become an essential skill for future journalists.

Photo: The Guardian.

How to Work With Data

The other day, Bitly’s data chief Hilary Mason explained how to get started with data.

Today, she discusses how to work with data, from getting it, to exploring it to interpreting it.

A while back, Hilary and Columbia mathematician Chris Wiggins wrote about this process, called it a taxonomy of data science, and gave a roughly chronological account of what one does with data: Obtain, Scrub, Explore, Model and iNterpret.

No, that’s not a typo, it’s part of an acronym: OSEMN, which rhymes with possum, which means you pronounce it “awesome”.

To get more details than Hilary offers here, check their article. It offers code examples and tools and tricks to work through each of the steps above.
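FJP: To make the five steps a little more concrete, here is a toy walk-through in Python. It is our own sketch, not taken from Hilary and Chris's article: the click counts are invented, and the "model" is nothing more than an average per section.

```python
from collections import Counter
from statistics import mean

# Obtain: invented raw records (in practice: a file, an API, a scrape).
RAW_ROWS = [
    "politics, 120",
    "Sports,45",
    "politics,  80",
    "tech,n/a",
    "sports,55",
    "",
]


def scrub(rows):
    """Scrub: normalise section names and drop rows that cannot be parsed."""
    clean = []
    for row in rows:
        parts = [p.strip() for p in row.split(",")]
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        clean.append((parts[0].lower(), int(parts[1])))
    return clean


def explore(records):
    """Explore: count how many usable records each section has."""
    return Counter(section for section, _ in records)


def model(records):
    """Model: a deliberately simple one, the mean clicks per section."""
    by_section = {}
    for section, clicks in records:
        by_section.setdefault(section, []).append(clicks)
    return {section: mean(values) for section, values in by_section.items()}


def interpret(averages):
    """iNterpret: turn the numbers into a plain-language statement."""
    top = max(averages, key=averages.get)
    return f"{top} stories average the most clicks ({averages[top]:.0f})"


records = scrub(RAW_ROWS)
print(explore(records))
print(interpret(model(records)))
```

Most of the work in this sketch happens in the scrub step, which quietly throws away the "tech" row because its count is missing. Whether dropping such rows is acceptable is exactly the sort of judgment call the interpret step has to own up to.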

Honolulu’s Civil Beat is a new, public affairs investigative journalism group. Here’s a neat quote by one of their reporters, Chad Blair:

You know, the executive, the legislative, and the judicial branches of government — journalism is that fourth estate that has been around since the beginning of our democracy.

FJP: Journalism… the beach… aaah.

Visualizing Global Migration Flows

via Infosthetics:

Global Migration Patterns [mpg.de] by the German Max Planck Institute for the Study of Religious and Ethnic Diversity contains a set of interactive instruments that visualize the latest global migration data.

The International Migration Flows shows the different flows to - and from - selected OECD-countries between the years 1970-2007. It illustrates the concept of “Superdiversity”, or how during the last 2 decades more people than ever have moved between different locations worldwide.

The outer circle shows the number of emigrants, with each bar representing a country of origin and each color conveying a unique continent. The inner circle shows the number of immigrants. One can “zoom” into the data by choosing a specific threshold, which truncates the bars to a maximum value.

The Global Migration By Origin visualization conveys the societal diversity in about 225 countries according to their historical census results. For each country in the list, a population is grouped by origin or citizenship. Again, an emergent pattern of increasing diversity of societies can be observed. A specific country can be selected in a world map, which also reveals a bar chart that conveys the different countries of origin, with each color representing a continent.

Global Migration By Destination uses the same concept, but from an inverse perspective. It thus shows where people tend to go when they leave their country of birth to move somewhere else.

Image: Visualized societal diversity in 225 countries using census data from 1960, 1970, 1980, 1990, 2000.

FJP: Play with the interactive. It’s really quite cool. They even have a video tutorial to help you.

Getting Started With Data

Hilary Mason, Bitly’s data chief, gives advice on how to get started in data science, from finding a buddy to tutorials you can watch and books you can read.

Bonus: Why you don’t need to be a math whiz.

Double Bonus: See her next video, How to Work With Data.

Ask Clay Shirky a question

Internet scholar, author and NYU professor Clay Shirky is sitting online right now, answering questions at the Guardian website for their Battle for the Internet series. Ask away!

NASA Has a Data Problem, And a Contest to Solve It

NASA has about 100 terabytes of information gathered from its various space missions. The data sits in various databases created over the years and is difficult to get to and manipulate.

So its Tournament Lab is holding a contest to make the data more accessible to both scientists and the public.

Via the NASA Tournament Lab:

[W]hile rich in depth and breadth, the [Planetary Data System] databases have developed in a disparate fashion over the years with different architectures and formats for different scientific needs; thereby making acquisition of data problematic!

So, NASA is holding a series of Challenges to generate some simply awesome ideas for mobile or web based applications that will appeal to general users, to search and display compelling facts about the data. Instead of just scientists, our audience will be the millions of school age students, their teachers and parents, game designers and general civilians of the world. We want to deliver this incredible data to users in a way that excites them – and thus, to help them understand the value and potential of this data.

Contest prizes are up to $10,000 and you can learn about it here. If you want to jump right into the data, you can do so here.

Image: Moscow at Night, captured March 28 by the International Space Station. Via NASA.