Algorithms Could Save Book Publishing—But Ruin Novels

Over four years, Archer and Jockers fed 5,000 fiction titles published over the last 30 years into computers and trained them to “read”—to determine where sentences begin and end, to identify parts of speech, to map out plots. They then used so-called machine classification algorithms to isolate the features most common in bestsellers.

.. The result of their work—detailed in The Bestseller Code, out this month—is an algorithm built to predict, with 80 percent accuracy, which novels will become mega-bestsellers.

What does the algorithm like? Young, strong heroines who are also misfits (the type found in The Girl on the Train, Gone Girl, and The Girl with the Dragon Tattoo). No sex, just “human closeness.” Frequent use of the verb “need.” Lots of contractions. Not a lot of exclamation marks. Dogs, yes; cats, meh. In all, the “bestseller-ometer” has identified 2,799 features strongly associated with bestsellers.
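A rough sense of what a stylistic “feature” means here can be sketched in a few lines of Python. The signals below (contractions, exclamation marks, the verb “need”) come straight from the article, but the extraction code and the per-1,000-word normalization are illustrative assumptions, not the authors’ actual pipeline:

```python
import re

def extract_features(text):
    """Count a few of the stylistic signals the article mentions
    (contractions, exclamation marks, the verb "need"), normalized
    per 1,000 words. Feature choice and scaling are illustrative only."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    n = max(len(words), 1)
    per_k = lambda count: 1000 * count / n
    return {
        "contractions": per_k(sum(1 for w in words if "'" in w)),
        "exclamations": per_k(text.count("!")),
        "need": per_k(words.count("need") + words.count("needs")
                      + words.count("needed")),
    }

sample = "She didn't wait. She needed answers, and she needed them now!"
print(extract_features(sample))
```

A classifier like the one Archer and Jockers describe would be trained on thousands of such feature vectors, labeled bestseller or not, to learn which combinations predict sales.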

.. “It’s sad to think that data could narrow our tastes and possibilities.”

.. There’s a wrinkle, though: Companies such as Amazon and Apple have the data for books read on their devices, and they aren’t sharing it with publishers.

.. The ability to know who reads what and how fast is also driving Berlin-based startup Inkitt.

.. Albazaz, now 26, sees himself as democratizing the publishing world. “We never, ever, ever judge the books. That’s not our job. We check that the formatting is correct, the grammar is in place, we make sure that the cover is not pixelated,” he says. “Who are we to judge if the plot is good? That’s the job of the market. That’s the job of the readers.”

.. Callisto studied the search terms Amazon suggests when users start typing in the first few letters, and found that people would frequently search for something that led to no results. “Consumers are searching for a piece of information, but no product exists to satisfy that consumer demand.”

.. Don’t we risk losing the distinction between what’s important and what’s popular? As NPR noted last year, books nominated for prestigious prizes like the Man Booker Prize or the National Book Award typically don’t sell many copies.

.. The computer found much to love: a strong, young female protagonist whose most-used verbs are “need” and “want.”

 

What Should We Do About Big Data Leaks?

What is transparency in the age of massive database drops? The data is available, but locked in MP3s and PDFs and other documents; it’s not searchable in the way a web page is searchable, not easy to comment on or share.

.. This, said the consortium of journalists, which notably did not include The New York Times, The Washington Post, etc., is the big one.

.. Organs of journalism are among the only remaining cultural institutions that can fund investigations of this size and tease the data apart, identifying linkages and thus constructing informational webs that can, with great effort, be turned into narratives, yielding something like what we call “a story” or “the truth.” 

.. If this is the age of the citizen journalist, or at least the citizen opinion columnist, it’s also the age of the data journalist, with the news media acting as product managers of data leaks, making the information usable, browsable, attractive.

.. There’s a glut of data, but most of it comes to us in ugly formats. What would happen if the things released in the interest of transparency were released in actual transparent formats? By which I mean, not as a pile of unstructured documents, not even as pure data, but, well, as software? Put cost aside and imagine for a minute that the FCIC report was not delivered as web pages, PDFs, finding aids, and the like, but as a database filled with searchable, formatted text, including documents attributed to the individuals within, audio files transcribed, and so forth.
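What “searchable, formatted text” might look like in practice can be sketched with SQLite’s full-text index, which ships with Python’s standard library (assuming a build with the FTS5 extension, the common case). The table layout and document contents below are hypothetical stand-ins, not the actual FCIC material:

```python
import sqlite3

# A minimal sketch: load released documents into a full-text index
# so they can be queried the way a web page can be searched.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(source, body)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("interview-audio-transcript", "the committee discussed mortgage risk models"),
        ("email", "please forward the quarterly risk report"),
        ("fax-ocr", "loan origination volume exceeded projections"),
    ],
)
# Full-text query: every document mentioning "risk", ranked by relevance.
hits = conn.execute(
    "SELECT source FROM docs WHERE docs MATCH 'risk' ORDER BY rank"
).fetchall()
print(hits)
```

The point is how little machinery is needed once the transcription and OCR work is done: a single table turns a pile of files into something browsable and attributable.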

.. I look at that FCIC data and see at least 300 hours of audio. That’s $18,000 worth of transcription. Those documents could be similarly turned into searchable text, as could any of the PDFs. We can do the same for emails. These tools exist and are open. If there are any faxes, they can be OCRed.

 

The Global Database of Society

Supported by Google Ideas, the GDELT Project monitors the world’s broadcast, print, and web news from nearly every corner of every country in over 100 languages. It identifies the people, locations, organizations, counts, themes, sources, emotions, quotes, and events driving our global society every second of every day, creating a free open platform for computing on the entire world.