How India’s Favorite TV Show Uses Big Data
via Gigaom:
Every Sunday morning, millions of people in India tune in to watch Bollywood star Aamir Khan host one of the country’s highest-rated television shows, Satyamev Jayate. But unlike so many popular programs, Satyamev Jayate doesn’t involve a singing competition or a collection of volatile strangers living under the same roof. It’s a documentary program tackling some of the country’s most sensitive topics, and it has the whole country, indeed the whole world, talking. To funnel millions of messages a week into something valuable, the show’s producers have turned to big data.
Aside from Khan’s star power, the show is so popular because of the types of issues it tackles: female feticide, caste discrimination, dowry deaths, child abuse and medical practice among them. According to one of the show’s producers, the amount of engagement and the number of responses from viewers is “completely unprecedented.” Here’s a sample of what we’re talking about, just 13 episodes into the show’s existence:
- 400 million viewers on Indian television and across the world on YouTube.
- More than 1.2 billion people have connected with Satyamev Jayate across its website, Facebook, Twitter, YouTube and mobile devices.
- More than 8 million people have contributed a total of more than 14 million responses to the show’s content via Facebook, web comments, text-message votes and a telephone hotline. More than 100,000 new people respond each week.
The responses take all sorts of forms, from votes on a weekly poll question to long, heartfelt letters explaining a viewer’s experience with an issue or how the show has changed their thinking on an issue. And although 95 percent of responses come from India, the show has received them from 5,000 locations in 165 countries, including as far away as northern Canada and Alaska. The show’s topics regularly rank among the top trends on Twitter shortly after each episode airs.
The messages are parsed through an automated analysis system developed by Persistent Systems, an Indian IT consultancy.
About a day-and-a-half before each show, Satyamev Jayate’s production company tells Persistent what the issue will be and the two groups come up with a taxonomy that will help the system sort through messages based on what topics will be brought up during Sunday’s show. But it’s not by any means the definitive list. As activity ramps up on Twitter while the show airs (tweet rates are highest during commercials and immediately after it ends, by the way), the team gets a sense of what topics are resonating with viewers and what themes they can expect in the nearly a million responses that will follow.
When the responses actually do start pouring in after lunch, they hit a system designed by Persistent to automatically tag them and score them based on interest level and sentiment. So, as Mukund Deshpande, head of business intelligence and analytics at Persistent, told me, a long message with an interesting story will be marked as higher quality, while a short, congratulatory note will be scored lower. Because so many viewers write in “Hinglish,” a combination of Hindi and English, an off-the-shelf system wouldn’t have been as accurate for processing these messages.
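To make the tagging and scoring idea concrete, here is a minimal sketch in Python. It is not Persistent’s system: the taxonomy, the word lists and the scoring formula are all invented for illustration, and a real system would need far better handling of Hinglish.

```python
# Hypothetical taxonomy for one episode: topic -> keywords (invented for illustration).
TAXONOMY = {
    "dowry": {"dowry", "dahej"},
    "medical": {"doctor", "hospital", "medicine"},
}
# Tiny made-up sentiment word lists; a real lexicon would be much larger.
POSITIVE = {"great", "congratulations", "thanks", "inspiring"}
NEGATIVE = {"angry", "sad", "shameful", "suffered"}


def analyze(message):
    """Tag a message by topic and give it rough sentiment and interest scores."""
    words = [w.strip(".,!?").lower() for w in message.split()]
    tags = sorted(topic for topic, keywords in TAXONOMY.items() if keywords & set(words))
    sentiment = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    # Assumed rule: longer, on-topic messages are more interesting (capped at 1.0).
    interest = min(len(words) / 50, 1.0) * (1.0 if tags else 0.5)
    return {"tags": tags, "sentiment": sentiment, "interest": round(interest, 2)}


short_note = analyze("Great show, congratulations!")
long_story = analyze(
    "Meri behen ki shaadi mein dahej manga gaya and my family suffered for years "
    "before we finally went to the police and told them the whole story."
)
```

The short congratulatory note comes out positive but low on interest, while the longer Hinglish story is tagged `dowry` and scores higher, which is the behavior Deshpande describes.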
Image: Satyamev Data via Gigaom.
Giving you a glimpse of the news in a world without public access to government information.
Well in other news, the Senate wants to make the government less transparent.
FJP: Read through to see what data sources are used for different types of stories. Nicely done, Sunlight Foundation.
Visualizing Government Spending, 1872 Edition
Sometimes we forget that data visualization has been with us for quite some time. Above we see a “Fiscal Chart” created in 1872 that explores US government spending from 1789 to 1870.
The left column shows where revenue came from. The yellow bars represent tariffs, which accounted for most of the nation’s income until Congress introduced the income tax (pink bars) in 1862 to help pay for the Civil War (1861-1865).
The bars in the right column show how the young nation spent its money. The light blue bars represent the Army, making a few other wars easy to spot.
Since we’re going back into the past, now’s a good time to give a shout out to Florence Nightingale, our favorite data journalist of the time.
Image: Fiscal Chart of the United States, by Francis A. Walker. Via Dashboard Insight. A biggie version can be seen here.
PANDA is Waiting for You
2011 Knight News Challenge winner Christopher Groskopf announced at the PBS MediaShift blog today that he’s releasing a stable 1.0 version of the PANDA Project.
PANDA is a data library for newsrooms where reporters can upload data sets to share with others. Instead of making reporters hunt through endless spreadsheets, PANDA provides search capabilities to quickly find what you’re looking for.
An immediate benefit of using PANDA — or a searchable, data storage tool like it — is increased information at your keyboard while reporting.
Via the PANDA Project:
PANDA encourages serendipity in the reporting process. By having access to all the newsworthy data in your newsroom you will uncover information you might have otherwise overlooked. For instance, a search for the name of a state senator might return a dataset of his political affiliations, a record of his graduation, a list of bills he has sponsored and a brother who is an energy lobbyist.
It also improves overall knowledge retention across an organization:
By providing a single place to store all your newsroom’s data PANDA will encourage knowledge sharing, prevent the loss of information and slow knowledge attrition when reporters retire or change jobs. Never again should more than one reporter FOIA the same dataset.
Congratulations on the milestone.
The Twitter Political Index
Via Twitter:
Today, we’re launching the Twitter Political Index, a daily measurement of Twitter users’ feelings towards the candidates as expressed in nearly two million Tweets each week…
…Each day, the Index evaluates and weighs the sentiment of Tweets mentioning Obama or Romney relative to the more than 400 million Tweets sent on all other topics. For example, a score of 73 for a candidate indicates that Tweets containing their name or account name are on average more positive than 73 percent of all Tweets.
Just as new technologies like radar and satellite joined the thermometer and barometer to give forecasters a more complete picture of the weather, so too can the Index join traditional methods like surveys and focus groups to tell a fuller story of political forecasts. It lends new insight into the feelings of the electorate, but is not intended to replace traditional polling; rather, it reinforces it.
For example, the trend in Twitter Political Index scores for President Obama over the last two years often parallel his approval ratings from Gallup, frequently even hinting at where the poll numbers are headed. But what’s more interesting are the periods when these data sets do not align, like when his daily scores following the raid that killed Osama bin Laden dropped off more quickly than his poll numbers, as the Twitter conversation returned to being more focused on economic issues.
By illustrating instances when unprompted, natural conversation deviates from responses to specific survey questions, the Twitter Political Index helps capture the nuances of public opinion.
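The “more positive than 73 percent of all Tweets” description reads like a percentile rank. Here is a rough sketch of that reading in Python. Twitter hasn’t published its formula here, so the function name, the inputs and the idea of ranking a candidate’s average sentiment against every other Tweet’s sentiment are assumptions.

```python
from bisect import bisect_left


def index_score(candidate_sentiments, all_sentiments):
    """Percent of all Tweets that are less positive than the candidate's average."""
    candidate_avg = sum(candidate_sentiments) / len(candidate_sentiments)
    ranked = sorted(all_sentiments)
    # bisect_left counts how many Tweets score strictly below the candidate's average.
    below = bisect_left(ranked, candidate_avg)
    return round(100 * below / len(ranked))


# Invented data: 100 Tweets with sentiment values 0..99, and two candidate Tweets.
everything = list(range(100))
score = index_score([72.5, 73.5], everything)
```

With this made-up data the candidate’s average sentiment is 73.0, which sits above 73 of the 100 Tweets, so `score` is 73.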
Twitter’s @gov team is creating the Index with two polling firms and data analysts from Topsy.
Image: Partial screenshot of the Twitter Political Index.
Networked Donors: Political Moneyball
The Wall Street Journal takes a close look at political contributions in a thorough interactive that pulls data from monthly Federal Election Commission reports.
Pictured above are overall individual and committee contributions (top); contributions and contributors to Restore Our Future, a PAC created to support Mitt Romney (middle left); the balance between ideological or single issue committees and the Democratic and Republican parties (middle right); and who health services and HMOs are donating to (bottom). (Select any to embiggen).
It’s all very clicky, with various data points available under various layers, so explore through.
Meanwhile, via the Wall Street Journal:
We all know that politics is awash in money, money that is accounted for in disclosures made public through the federal government. But the degree to which we understand this universe is limited by how well we can imagine how the players and the money are interconnected.
To better understand, we used social network software to analyze the universe of money in politics.
All the money in politics starts with donors — either individuals or groups like companies and unions. Their donations go to Political Action Committees (which represent the interests of companies or groups) or candidate or party committees (which finance campaigns and other political spending). These committees often send money to one another, which tells us a lot about who their friends are.
Based on the money sent between the players (and other characteristics like party and home state), our presentation pulls players toward similar players and pushes apart those that have nothing in common. The players who are most interconnected (like industry PACs who try to make alliances with everyone) end up close to the center. Those who are less connected (like a donor who only gives money to Ron Paul) are pushed away from the center. The resulting picture is a first-ever interactive portrait of the universe of money in politics, complete with obvious macro lessons (like the gulf between Democrats and Republicans) and with many micro stories that are still emerging.
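The pull-together, push-apart behavior described here is what a force-directed layout does. Below is a toy version in Python, not the Journal’s code: the player names and money links are invented, the force formulas are a simple assumption (a spring along each link, repulsion between every pair), and it leaves out extras such as party, home state or a pull toward the center.

```python
import math


def force_layout(nodes, edges, steps=300, rate=0.05):
    """Toy force-directed layout: linked nodes attract, every pair repels."""
    # Deterministic start: nodes evenly spaced on a unit circle.
    pos = {}
    for i, name in enumerate(nodes):
        angle = 2 * math.pi * i / len(nodes)
        pos[name] = [math.cos(angle), math.sin(angle)]
    linked = {frozenset(edge) for edge in edges}
    for _ in range(steps):
        force = {name: [0.0, 0.0] for name in nodes}
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[b][0] - pos[a][0]
                dy = pos[b][1] - pos[a][1]
                dist = math.hypot(dx, dy) or 1e-9
                # A money link pulls like a spring (dist); every pair repels (1/dist).
                pull = (dist if frozenset((a, b)) in linked else 0.0) - 1.0 / dist
                fx, fy = pull * dx / dist, pull * dy / dist
                force[a][0] += fx
                force[a][1] += fy
                force[b][0] -= fx
                force[b][1] -= fy
        for name in nodes:
            pos[name][0] += rate * force[name][0]
            pos[name][1] += rate * force[name][1]
    return pos


def gap(pos, a, b):
    """Straight-line distance between two placed nodes."""
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])


# Invented players: three that trade money with each other, and a donor with one tie.
nodes = ["pac_a", "pac_b", "cand_x", "lone_donor", "cand_y"]
edges = [("pac_a", "pac_b"), ("pac_b", "cand_x"), ("pac_a", "cand_x"), ("lone_donor", "cand_y")]
pos = force_layout(nodes, edges)
```

After the run, `pac_a` and `pac_b` sit close together because money flows between them, while `lone_donor` has drifted well away from both.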
The interactive was created using CartoDB, a geospatial platform from Vizzuality.
Your Phone is a Surveillance Unit
Yesterday we noted how governments are tracking everyday people via mobile devices. In the United States, this includes “1.3 million government requests for customer data—ranging from subscriber identifying information to call detail records (who is calling whom), geolocation tracking, text messages, and full-blown wiretaps.”
This isn’t specific to the US, of course. In 2010, German politician Malte Spitz went to court in order to obtain all information that Deutsche Telekom had about his activity. The results astonished him. Over the course of five months, they had tracked his geographical location and what he was doing with his phone 35,000 times.
Spitz worked with the German newspaper Die Zeit on an infographic that shows his activity across an interactive timeline.
In this TED Talk, Spitz discusses the threats and repercussions such tracking has on politics and society, and in particular the authoritarian manifestations that can ensue.
As Spitz notes, with current data retention policies, authorities know who we call, how we communicate with each other, when we go to sleep, what our social networks look like and who the leaders within a group are.
“If you have access to this information you can see what your society is doing,” says Spitz. “If you have access to this information, you can control your society. This is a blueprint for countries like China and Iran. This is a blueprint for how to surveil your society.”
Most important, Spitz calls for self determination in the digital age by calling for stronger privacy regulations and getting rid of data retention laws that governments are demanding of Internet and mobile providers around the world.
Run Time: ~9:00.
If you’d like to see a shorter visualization of the actual tracking, I created a brief screencast of it last year. — Michael
Graphing the Influence of Thinkers and Ideas Throughout History
Brendan Griffen has graphed a network of all people on Wikipedia, linking each to the people they influenced and the people who influenced them.
Via Griff’s Graphs:
For those new to this type of thing: the node size represents the number of connections. In short, I used a database version of Wikipedia to extract all people with known influences and made this map. The bigger the node, the bigger influence that person had on the rest of the network. Nietzsche, Kant, Hegel, Hemingway, Shakespeare, Plato, Aristotle, Kafka, and Lovecraft all, as one would expect, appear as the largest nodes. Around these nodes, cluster other personalities who are affiliated (depends on distance). Highlighting communities by colour reveals sub-networks within the total structure. You’ll notice common themes amongst similarly coloured authors.
Griffen’s influence is Simon Raper, who recently graphed the history of philosophy.
The tools used are similar too:
First I queried Snorql and retrieved every person who had a registered ‘influence’ or registered ‘influenced by’ value (restricted to people only so if they were influenced by ‘anime’, they were excluded).
I then decoded these using a neat little URL decoder and imported them into Microsoft Excel for further processing (removing things like ‘(Musician)’ and other annoying syntax).
I then exported these as a csv and imported into Gephi and proceeded as usual. Fruchterman-Reingold algorithm followed by Force Atlas 2. I then identified communities using ‘Modularity’ and edited the rest in Preview. Due to the size, I’ve had to zoom up and take snapshots on regions of interest.
The csv file containing all of the data can be obtained here so you can make your own maps.
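If you grab the csv, the “node size represents the number of connections” idea is easy to try before opening Gephi. Here is a small Python sketch. The column names `source` and `target` and the sample rows are assumptions made up for illustration, so check the real file’s header first.

```python
import csv
import io
from collections import Counter

# Invented sample in an assumed two-column edge-list format.
SAMPLE_CSV = """source,target
Plato,Aristotle
Plato,Kant
Kant,Hegel
Kant,Nietzsche
Hegel,Nietzsche
"""


def connection_counts(csv_text):
    """Count how many influence links touch each person (the node's degree)."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["source"]] += 1
        counts[row["target"]] += 1
    return counts


counts = connection_counts(SAMPLE_CSV)
biggest = counts.most_common(1)
```

In this sample `biggest` is `[("Kant", 3)]`, so Kant would be drawn as the largest node.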
And yes, as Griffen notes, the information and visualization is biased towards Western ideas and cultures since Wikipedia skews heavily toward English speakers.
Meantime, we’re absolutely gobsmacked.
Read Griffen’s post on the project. Check out the zoomable version. Get yourself a pretty print.
Images: Partial screenshots of Graphing Every* Idea in History, by Brendan Griffen. Select to embiggen.
H/T: Flowing Data.
Taking Wikipedia’s Pulse, Musically
What do changes to Wikipedia sound like? Well, if you track all edits, currently about 400 per minute, and map them to sound with Open Sound Control, Pure Data and wikibeat, you come out with some modernist beats.
Watch the above screencast by wikibeat creator Dan Chudnov as he does just this. The audio kicks in about a minute into the video.
As Chudnov describes it, “wikibeat sonifies changes to wikipedia as they happen. it uses Ed Summers’ wikipulse, which monitors changes to each language-specific wikipedia and displays their rates of change as gauges, and creates a series of audible beats based on these change rates. it does this by sending the change rate information to a Pure Data application over OSC.”
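For a feel of what “sending the change rate information over OSC” involves, here is a bare-bones Python sketch that packs one number into an OSC message and sends it over UDP. It is not wikibeat’s code: the address `/wikibeat/rate`, the host and the port are placeholders, and a Pure Data patch would have to be set up to listen for them.

```python
import socket
import struct


def osc_pad(raw):
    """OSC strings end with a null byte and are padded to a multiple of 4 bytes."""
    return raw + b"\x00" * (4 - len(raw) % 4)


def osc_message(address, value):
    """Build an OSC message carrying one 32-bit big-endian float."""
    return osc_pad(address.encode("ascii")) + osc_pad(b",f") + struct.pack(">f", value)


def send_rate(edits_per_minute, host="127.0.0.1", port=9000):
    """Send the change rate to a listener (hypothetical address, host and port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(osc_message("/wikibeat/rate", edits_per_minute), (host, port))


packet = osc_message("/wikibeat/rate", 400.0)
```

Calling `send_rate(400.0)` would fire the packet at whatever is listening on that port. UDP doesn’t confirm delivery, so nothing happens if no one is there.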
Read through for links to source code to try it on your own.
H/T: Dario Taraborelli, Senior Research Analyst at the Wikimedia Foundation, whom we interviewed in January (podcast).
Jonathan Gray, editor of the Data Journalism Handbook, in a Q&A with O’Reilly:
Broadly speaking, “data journalism” is a fairly recent term that is used to describe a set of practices that use data to improve the news. These range from using databases and analytical tools to write better stories and do better investigations, to publishing relevant datasets alongside stories, and using datasets to deliver interactive data visualizations or news apps.
Precisely where one places the emphasis depends on what one thinks is important. This is why in the book we have several sections in the introduction where we’ve asked leading practitioners, advocates and scholars what data journalism means to them, what makes it distinctive and why they think it is important.
Regarding the need for the book: Quite simply, data can help us to answer questions about the world. While it certainly isn’t a panacea, or an objective reflection of the world, data is an increasingly important part of our information landscape. Rather than relying on the analysis of public bodies, public relations agencies, or experts for hire, journalists and their readers should be able to explore, interrogate and critically analyze databases for themselves. The handbook is our attempt to encourage journalists to increase their own data literacy, and hopefully the data literacy of their readers.
FJP: The Data Journalism Handbook is a free and open-source reference guide. Download it here. It’s a very useful resource. We’ve talked about a few other data journalism tools in the past. See some posts here.
Image: Click-through to keep reading the Q&A.
Researchers have harnessed streams of light to transfer massive amounts of data. In a recent test, they hit 2.56 terabits per second. In simpler terms, that’s roughly the equivalent of transferring more than 67,000 songs per second.
Before getting too giddy, this was an experiment over one meter. Still though, add this to a very interesting, data heavy future.
Via TechSpot:
Researchers at USC, JPL and Tel Aviv University have managed to transfer 2.56 terabits of information by multiplexing 8 x 300Gbps “twisted” streams of visible light into a single beam. The feat exploits a phenomenon which, up until recently, scientists thought may have been impossible to achieve with light: orbital angular momentum (OAM).
OAM, the way a wave can be made to twist around itself, is what makes the team’s discovery particularly exciting. It also makes their findings incredibly useful for wireless data transmission. Making light beams spiral to create an optical vortex is not necessarily a new idea, but putting that phenomenon to work for transmitting information is something researchers have been striving for.
TechSpot, Scientists hit wireless speeds of 2.56Tbps using light vortex beams.
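A quick back-of-the-envelope check on those numbers, in Python. It takes TechSpot’s 2.56 terabits per second and the 67,000 songs figure at face value. The song size isn’t stated anywhere, so the script works out what size the comparison implies rather than assuming one.

```python
TERABITS_PER_SECOND = 2.56      # headline rate reported by TechSpot
SONGS_PER_SECOND = 67_000       # comparison figure used in this post

# 8 bits per byte, decimal (SI) prefixes throughout.
bytes_per_second = TERABITS_PER_SECOND * 1e12 / 8
gigabytes_per_second = bytes_per_second / 1e9
implied_song_megabytes = bytes_per_second / SONGS_PER_SECOND / 1e6

# The quote's "8 x 300Gbps" multiplies out to slightly less than the headline rate.
streams_total_terabits = 8 * 300 / 1000

print(f"{gigabytes_per_second:.0f} GB per second")
print(f"about {implied_song_megabytes:.1f} MB per song")
print(f"8 x 300 Gbps = {streams_total_terabits} Tbps")
```

That works out to 320 gigabytes per second, or songs of a little under 5 MB each, which is a plausible MP3. Eight streams of 300Gbps come to 2.4 terabits, so the per-stream figure in the quote is presumably rounded.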
The Common North American Belly
Yes, I’m hungry. No, I haven’t eaten today.
But I have been playing with FoodMood, an interactive visualization project that pulls data from Twitter about how people relate to the foods they mention while posting.
Via FoodMood:
Using geo-located tweets as a primary data source together with natural language processing techniques and public access data from WHO and CIA Factbook, we capture and analyze, in real time, the foods that people are tweeting about in their cities and how they feel about them…
…As a sentiment analysis tool, FoodMood develops a more informed global picture about food and emotion. As a data visualization project, FoodMood shows the connections, patterns and relationships that exist between the variables, insights that are otherwise practically infeasible. Ultimately, FoodMood helps reveal a hidden layer of digital and social data that pushes the boundaries of awareness and understanding of our surroundings one step further.
The data that drives FoodMood is from Twitter. We scrape Twitter in real time and assign a sentiment rating to any tweet about food. So if someone said they just ate a cake and they love it the sentiment rating will be high. If they ate a snail and it made them feel weird (and they tweeted that) then the sentiment rating would be low. We only use English-language tweets on FoodMood.
Got that?
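If not, here is the cake-and-snail idea as a toy Python sketch. FoodMood’s actual natural language processing isn’t described in detail, so the food list, the mood words and their weights are all made up for illustration.

```python
from collections import defaultdict

# Invented vocabulary: foods to watch for and a tiny weighted mood lexicon.
FOODS = {"cake", "snail", "pizza"}
MOOD_WORDS = {"love": 1.0, "delicious": 1.0, "weird": -0.5, "hate": -1.0}


def food_sentiment(tweets):
    """Average a simple word-based sentiment rating for each food mentioned."""
    ratings = defaultdict(list)
    for tweet in tweets:
        words = [w.strip(".,!?").lower() for w in tweet.split()]
        score = sum(MOOD_WORDS.get(w, 0.0) for w in words)
        for food in FOODS.intersection(words):
            ratings[food].append(score)
    return {food: sum(scores) / len(scores) for food, scores in ratings.items()}


moods = food_sentiment([
    "Just ate a cake and I love it!",
    "Ate a snail. Feeling weird.",
])
```

Here `moods` comes out as cake at 1.0 and snail at -0.5: high for the cake, low for the snail.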
So, what we’re looking at above is a comparison of Canada, Mexico and the United States. Each has salad, eggs, pancakes, pizza, cake and sandwiches among their top 10 most mentioned foods, and each has the same mood about them.
Sticking within the top 10, Mexico and the United States share a love for chipotle and tacos. Strong choices and yes I’m getting hungry.
Of the three countries, Canada is the thinnest but least happy. The United States appears (at least for those tweeting away) fat and happy.
I’m off to lunch (tuna melt panini if you’re interested), but give the site a play. You can compare foods, moods, countries, look at data at a particular point in time, or over a period of time. — Michael
Image: Screenshot of FoodMood comparing food sentiment as measured via Twitter Posts in Canada, Mexico and the United States.
Select image to embiggen.
H/T: Infosthetics
Every minute, YouTube users upload 48 hours of video, Facebook users share 684,478 pieces of content, Instagram users share 3,600 new photos, and Tumblr sees 27,778 new posts published.
Modeling Election Forecasts the FiveThirtyEight Way
Via Slashdot:
Years ago Nate Silver of FiveThirtyEight.com, a blog seeking to educate the public about elections forecasting, established his model as one of the most accurate in existence, rising from a fairly unknown statistician working in baseball to one of the most respected names in election forecasting. In this article he describes all the factors that go into his predictions. A fascinating overview of the process of modeling a chaotic system.
FJP: It is fascinating.
With national, regional and statewide polling feeding off oftentimes conflicting information, and then still other polling that has what Silver calls “house effects” (meaning that a poll is an outlier, skewing Democratic or Republican in relation to other polls), all within what will be a tight, tight, tight race, Silver lays out both the art and science of his models.
For example, take Silver’s analysis of Florida:
Right now, the polls there show almost an exact tie. But the model views Florida as leaning toward Mr. Romney, for several reasons.
First, the polls showing a tie there were mostly conducted among registered voters rather than likely voters. Republicans typically improve their standing by a point or two when polling firms switch from registered voter to likely voter polls, probably because Republican voters are older, wealthier, and otherwise have demographic characteristics that make them more reliable bets to turn out. The model anticipates this pattern and adjusts for it, bolstering Mr. Romney’s standing by a point or two whenever it evaluates a registered-voter poll.
In addition, the fundamentals somewhat favor Mr. Romney in Florida. The state has been somewhat Republican-leaning in the past, and its economy is quite poor. Mr. Romney has raised more money than Mr. Obama there, and its demographics are not especially strong for Mr. Obama. The model considers these factors in addition to the polls in each state. In the case of Florida, they equate to Mr. Romney having about a 60 or 65 percent chance of winning it, and Mr. Obama probably has easier paths to 270 electoral votes.
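The registered-voter adjustment Silver describes can be pictured with a few lines of Python. This is only a sketch of that one step, not his model: the 1.5-point shift is an assumed midpoint of “a point or two,” and the two polls are invented.

```python
LIKELY_VOTER_SHIFT = 1.5  # assumed midpoint of "a point or two"


def adjusted_margin(poll):
    """Romney's lead in points, nudged when the poll sampled registered voters."""
    margin = poll["romney"] - poll["obama"]
    if poll["population"] == "registered":
        margin += LIKELY_VOTER_SHIFT
    return margin


# Invented Florida polls: one tied among registered voters, one among likely voters.
polls = [
    {"romney": 46.0, "obama": 46.0, "population": "registered"},
    {"romney": 47.0, "obama": 46.0, "population": "likely"},
]
average_margin = sum(adjusted_margin(p) for p in polls) / len(polls)
```

The tied registered-voter poll becomes a 1.5-point Romney lead after adjustment, and the two polls average out to Romney ahead by 1.25. The fundamentals Silver mentions would be layered on top of this.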
If you’re a political junkie whose heart skips a beat with the daily polls, read through. As said before, it’s a fascinating look at how political forecasting is done by one of the best in the business.
Image: While US presidential politics — and its electoral college — is a winner take all system that leads to strict Red State versus Blue State divisions across the country, this map of the 2008 presidential elections provided by the University of Michigan’s Mark Newman shows that if you look at the country at a county by county level, the country’s political leanings are decidedly purple. Meaning that slight ebbs can turn an entire state red (Republican) or blue (Democratic).