Andrew Savikas visits Nature

Last Thursday, Andrew Savikas, VP of Digital Initiatives at O’Reilly Media and ebook expert, paid us a visit. We had some very interesting conversations about the future direction of publishing, and Andrew delivered a great talk on the topic. He kindly provided a copy of his slides (16MB PDF). My (partial and impressionistic) notes are below.

Nature Video presents…

Charlotte Stoddart

Two new Nature Videos have just gone online.

First up, and my first solo video project, a film about Sci Foo 09. Here it is…

If you enjoyed the film and would like to share it, you can embed it in your own blog by going to Nature Video’s YouTube Channel.

Also just out, the trailer for our latest Lindau film series: Nobel Reactions. Every summer an extraordinary meeting between Nobel Laureates and young scientists takes place on Lindau Island in Germany. In 2009 it was the turn of the chemists and we were there to capture moments of this unique meeting of minds on film. The trailer introduces the Lindau Meetings and offers a taster of the films that follow: five short films on chemistry plus a special film feature on climate change. The films will be released one a week from 27 August. Watch them here, or subscribe to the series in iTunes (just search for ‘Nature Video’ in the iTunes store).

Riding a Wave of Science

On Saturday the Science Online London 09 conference took place. The conference tag was #solo09. Martin Fenner has already gathered together some reactions to the conference. In the afternoon I had the pleasure of co-presenting on Google Wave with Cameron Neylon and Chris Thorpe. Cameron has already written up some reactions to our session.

Four short links

Stealing a format from Nat, here are some things on my radar that might interest Nascent readers:

  1. Next week’s Science Online London meeting (co-produced by Nature Network) has been long sold out, but for those who missed getting a ticket, or who can’t travel to London, there will also be live video streaming of the conference into Second Life. Entrance is £10 or $15 and it is open to all. SL attendees will be able to see all the live video and ask questions of the speakers. Jo Scott (avatar name: Joanna Wombat) and colleagues will also be running free orientation sessions for newbies. To avoid missing this one too, register here now.
  2. Nature Methods is seeking votes for its 2009 Method of the Year. The previous winners are next-generation sequencing (2007) and super-resolution fluorescence microscopy (2008). (And if you haven’t seen the video on the latter then you should.) Check out the current 2009 nominees and vote tallies here.
  3. This isn’t exactly new news (hey, it’s been busy round here), but for anyone who hasn’t seen it yet, Edge has some coverage of outcomes and experiences at Sci Foo ’09 (including a contribution from yours truly). We also have a short video from the event coming out soon – watch this space.
  4. Meanwhile elsewhere on Edge, check out this report on another wonderful event (though sadly one I didn’t attend): a master class from George Church and Craig Venter on the brave new world of synthetic genomics.

Lies, damned lies and download counts


Shirley Wu posted on Friendfeed earlier about some of the things she’d overheard people saying about PLoS ONE papers. PLoS ONE Managing Ed Peter Binfield weighed in early to point out that the best way of combating misconceptions about the journal is to push out positive info and mentioned the journal’s article-level metrics program.

Near the end of the (long) thread was this exchange:

“You could try asking them exactly how many downloads their last paper in a ‘high impact’ journal got… – Peter Binfield

Fair enough, but you know, I really don’t think they think about that. They think “what will be in my CV?” and they think any journal that is somewhat competitive [includes other PLoS journals, BMC journals, etc] looks better than one that accepts anything that’s methodologically sound. Again, not my view, but perhaps one that is held by many. Do people list # of downloads on their CV for publications? – Shirley Wu

They dont, because they dont have the data. However, people do list if their paper was rated by F1000; or if BMC designated it a ‘highly accessed’ article. So I think they will start to say “this paper was downloaded 5000 times in the first 3 months which put it in the top x% of all PLoS ONE articles, the top y% of all PLoS articles, and the top z% of ALL articles” (when the rest of the world starts quoting this data) – Peter Binfield”

Do people here think that article downloads stats should be put on academic CVs? (serious question)

It feels wrong to me. IMHO encouraging anybody to take download statistics seriously as a measure of success / quality would be a mistake. Taken on their own they’re meaningless, surely – nice to know for the author, but meaningless. For them to be at all useful you’d have to supply a lot of context – as Peter suggests – though I don’t think the journal level “top 10% of papers in first three months” context he outlined would be enough either.

(just to be clear I don’t think Peter was necessarily saying that people should put only the download count on their CV – am using his comment above simply as a jumping off point for discussion)

A download counter can’t tell if the person visiting your paper is a grad student looking for a journal club paper, a researcher interested in your field or… somebody who typed in an obscure porn related search that turned up unconnected words in the abstract. A search bot. Somebody on Google Images looking for free clipart. Got a blog? Check your traffic stats. Journals get those crazy queries too, lots of them. Mainstream search engines are a major source of traffic for journals but not always for the reasons publishers might want.

As a publisher do you account for this and only record ‘good’ traffic? What if your competition don’t?

Institutions and ISPs transparently cache pages. If my lab mate and I both download your paper depending on the publisher’s stats package it might register as only one hit (from the university proxy server). Do you compensate for that somehow?

Am I going to be penalized if I host my papers on my homepage? In my institutional repository? Should I add all those counts up for my CV? Do I need to cite my sources?

Should I tell my mum to set my paper as her homepage (and to be sure to delete her cookies each morning)?

If Science spends $50m on SEO next year and hits on its article pages double, will the articles in 2010 be twice as good as those in 2009?

As an author should I be repeating keywords in my title to get more Google traffic? Should I try to include a figure of Britney Spears?

If we stick to giving ‘top x percentage’ context then do we make concessions for smaller disciplines publishing in multidisciplinary journals? More people work and publish in genetics than in quantum physics. Even if every important person in your field downloads your paper they might be outnumbered by grad students from the three dozen groups working on Rab4A effectors that download the genetics paper next to yours in the TOC.

I’m not saying that download stats aren’t useful in aggregate or that authors don’t have a right to know how many hits their papers received but they’re so potentially misleading (& open to misinterpretation) that it doesn’t seem to me the type of metric we want to be bandying about as an impact factor replacement.

Igor – a Google Wave robot to manage your references

(Google Wave hasn’t been released yet but if you’re interested in working with the preview you can request a developer account on the sandbox here)

Google Wave is a new open source project from Google that holds a lot of promise as a platform for scholarly communication. It’s a little bit like email but allows for collaborative document editing, versioning and real time conversation within groups – check out Cameron and Martin’s archives for more.

Igor is a proof of concept Wave robot that allows Wave users to pull in citations from Pubmed or their libraries on Connotea and CiteULike as they type.

To use it invite to join a wave.

Streamosphere update

This month’s iteration of Streamosphere is now up. It’s still more a preview than a product but imho it’s approaching usefulness!


The main changes are:

  • a new way of exploring the site – the list view shows you the most popular items within a given time frame. It’s sort of like Digg but to vote an item up you need to have commented on it or shared it on a social media site.
  • simplified sidebar, visual cues on the grid / timeline view and a help link will hopefully help new users work out what they’re seeing
  • the aggregation logic now uses Friendfeed’s SUP feed and connects directly to Twitter, so messages are picked up much faster.
  • trending topics – this is a list of topics that are appearing more frequently than you might expect. Bear in mind that it’s generated algorithmically so items are sometimes grouped together in odd (but technically correct ;) ) ways…
  • clicking on “see details” in the list view or on an item in the grid view brings up a breakdown of comments and tweets which you can use to jump straight into a conversation on, for example, Friendfeed.

There are still lots of little niggles. On smaller timescales (anything under four hours) there are lots of items that aren’t strictly speaking about science, too. Still not sure if that’s a bug or a feature.

The next version will focus on people – both the people being followed by Streamosphere and visitors to the site – and grouping items by topic.

“I am not a scientist, I am a number”

On Monday I was at the BioLINK Special Interest Group at the Intelligent Systems for Molecular Biology meeting in Stockholm. Amongst the many thought-provoking talks was one by Phil Bourne, he of the Protein Data Bank, SciVee and other goodies. Phil made a cogent plea for a system of unique identifiers for scientists.

Welcome to the Streamosphere

Web publishing as a discipline has few tenets but I think release early, release often and don’t be afraid to fail are pretty sound. That was the philosophy behind Connotea when Timo and Ben Lund launched it in 2004 and it’s the spirit in which I’ve just put up an early version of Streamosphere.

Streamosphere is a pet side project which I’m running according to what I guess you could call the Paul Graham principles (it’d be disingenuous to say “as a start-up” as most startups don’t have NPG level resources. OTOH we lack a fussball table and free M&Ms). Think of it as a pre-alpha alpha.

The elevator pitch

Streamosphere lets you track scientific discussion on the web, in real time.

What it does

If you visit you’ll see a page of stacked timelines like these:

Picture 5.png

Each timeline shows discussion around a particular item, for now always a web page. The portrait on the left is of one of the people who first started talking about the item. The slice of time in which the discussion was active (people were leaving comments, tweeting, liking or bookmarking it) is coloured a shade of magnolia. Behind the active slice is a graph – this shows you how much activity there was at any one point.

Click on an item’s active slice to pop up more details about it including an activity breakdown and a selection of associated comments and tweets. If the item is a video or photograph it should be embedded in the popup. If the item description is in a foreign language hover your mouse cursor over it to get the English translation.

Picture 6.png

Streamosphere only ever shows the most active items in a given time period. Use the controls on the right hand side of the screen to see the most active items in the past few hours, day, week or month. You can also filter items by domain or by keywords in their description.
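As a rough illustration of "most active items in a given time period", here is a small sketch. The data shape (a list of timestamp and URL pairs), the function name and the example.org / example.com URLs are assumptions for the example, not the real Streamosphere schema. It counts updates per item inside a time window, optionally restricted to one domain, and keeps the top few.

```python
from collections import Counter
from datetime import datetime, timedelta
from urllib.parse import urlsplit


def most_active(updates, now, window, top_n=10, domain=None):
    """Return the `top_n` (url, update_count) pairs seen within `window` of `now`.

    `updates` is an iterable of (timestamp, url) pairs.
    If `domain` is given, only items hosted on that domain are counted.
    """
    cutoff = now - window
    counts = Counter()
    for timestamp, url in updates:
        if not (cutoff <= timestamp <= now):
            continue  # outside the selected time period
        if domain is not None and urlsplit(url).hostname != domain:
            continue  # filtered out by domain
        counts[url] += 1
    return counts.most_common(top_n)


# Hypothetical updates, for illustration only.
now = datetime(2009, 8, 1, 12, 0)
updates = [
    (now - timedelta(hours=1), "http://example.org/a"),
    (now - timedelta(hours=2), "http://example.org/a"),
    (now - timedelta(minutes=30), "http://example.com/b"),
    (now - timedelta(days=3), "http://example.org/c"),
]
print(most_active(updates, now, timedelta(hours=4)))
print(most_active(updates, now, timedelta(hours=4), domain="example.org"))
```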

In smaller time periods you’ll see some items that aren’t anything to do with science: recently there’s been stuff about Iran and a viral video for example. I’m not sure if this is a bug or a feature, or how to filter out non-science stuff if that’s a requirement – suggestions welcome.

In the future I’d like to see the page update dynamically as new activity gets tracked but for now to refresh the page you need to reload or choose a new time period.

How it works

Streamosphere tracks ~ 4k accounts on half a dozen different social media sites including Friendfeed, Twitter and bookmarking services like Delicious. The account owners have all self-identified (sometimes implicitly) as scientists or people interested in science.

It uses a combination of polling, web hooks (via GNIP) and SUP feeds to aggregate public updates from tracked accounts as soon after they happen as possible. Average latency is ~ 3 minutes for Friendfeed and a few seconds for Twitter.

Right now there’s only one view on the data: by item. Items are the URIs associated with or mentioned in updates: if I tweet “I love” and you bookmark it on delicious then the streamosphere database will record a single item ( associated with two updates.
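The "one item, many updates" idea can be sketched roughly like this. The `Update` class, the `normalise` helper and the example.org URLs are hypothetical, and the normalisation rules (lower-casing the host, dropping the fragment and any trailing slash) are an assumption about what counts as "the same" URI rather than a description of what Streamosphere really does.

```python
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit


@dataclass(frozen=True)
class Update:
    service: str  # e.g. "twitter" or "delicious"
    account: str
    url: str


def normalise(url):
    """Reduce trivially different spellings of a URL to one key."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    # Keep the query string, drop the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def group_by_item(updates):
    """Map each normalised URI to the list of updates that mention it."""
    items = defaultdict(list)
    for update in updates:
        items[normalise(update.url)].append(update)
    return dict(items)


# A tweet and a bookmark pointing at the same (hypothetical) page.
updates = [
    Update("twitter", "me", "http://Example.org/paper/"),
    Update("delicious", "you", "http://example.org/paper#abstract"),
]
items = group_by_item(updates)
print(len(items), "item with", len(items["http://example.org/paper"]), "updates")
```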

Items are currently always websites but in the future I’d like to add views for users and topics; these are non-trivial because of problems with account owner disambiguation and classifying short messages respectively.

Owner disambiguation relies on the Google Social Graph API. We need to disambiguate owners because otherwise the same person could post a single link on multiple services and Streamosphere would believe it’s amazingly popular.
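To show why disambiguation matters for popularity counts, here is a minimal sketch. The `owner_of` mapping is written by hand purely for illustration (in practice it would have to come from some account-linking source, and no real API call is shown here), and the account names are made up. Updates are counted per person rather than per account, so one person cross-posting a link three times counts once.

```python
def count_distinct_people(updates, owner_of):
    """Count the distinct people behind a list of (service, account) updates.

    `owner_of` maps (service, account) to a person identifier. Accounts
    missing from the mapping are treated as belonging to separate people.
    """
    people = set()
    for service, account in updates:
        key = (service, account)
        people.add(owner_of.get(key, key))
    return len(people)


# Hypothetical data: one person on three services, plus one other Twitter user.
owner_of = {
    ("twitter", "alice_sci"): "person-1",
    ("friendfeed", "alice"): "person-1",
    ("delicious", "alice.s"): "person-1",
}
updates = [
    ("twitter", "alice_sci"),
    ("friendfeed", "alice"),
    ("delicious", "alice.s"),
    ("twitter", "bob"),
]
print(count_distinct_people(updates, owner_of))  # people, not accounts
print(count_distinct_people(updates, {}))        # what a naive count would give
```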

Sometimes users have set up rules to automatically route updates from one service to another (e.g. they share an item on Google Reader which appears in their Friendfeed stream which gets pushed out to their Twitter account). Rules like this are the bane of Streamosphere’s existence – it’s non-trivial to detect this kind of thing and handle it correctly.

I’m collecting hashtags, tags and extracting key terms from all updates but don’t quite know what to do with them yet – still need a good algorithm to detect trending topics. Links are extracted from updates but right now there’s no disambiguation for papers (Buggotea is alive and well in Streamosphere). There’s a best effort attempt to resolve shortened URLs though occasionally one will slip through.

There’s no API but if anybody has a good use for the data I’m happy to set something up using GNIP or long polling to support real time updates if necessary – just send me a use case.

Which web 2.0 services do scientists use?

Which web services are scientists actively contributing to?

There are ~ 1,240 Friendfeeders in science related rooms (the-life-scientists, scienceapps, science-2-0, science-online…). What percentage have listed usernames associated with the science related tools supported by Friendfeed?

Picture 10.png

Service Count
citeulike 41
connotea 31
delicious 431
digg 208
googlereader 394
reddit 68
slideshare 143
twitter 675
youtube 341
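To answer the percentage question directly, the counts above can be turned into shares of the ~1,240 Friendfeeders. This is just arithmetic on the table; the only assumption is treating the approximate total as exactly 1,240.

```python
TOTAL_USERS = 1240  # approximate figure ("~ 1,240"), treated as exact here

counts = {
    "citeulike": 41,
    "connotea": 31,
    "delicious": 431,
    "digg": 208,
    "googlereader": 394,
    "reddit": 68,
    "slideshare": 143,
    "twitter": 675,
    "youtube": 341,
}


def percentages(counts, total):
    """Return each service's share of `total` as a percentage."""
    return {service: 100 * n / total for service, n in counts.items()}


shares = percentages(counts, TOTAL_USERS)
# Print the services from most to least used, to one decimal place.
for service, pct in sorted(shares.items(), key=lambda p: p[1], reverse=True):
    print(f"{service:<13}{pct:5.1f}%")
```

On these numbers Twitter comes out a little over half, delicious and Google Reader around a third each, and Connotea and CiteULike in the low single digits.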

Why this dataset isn’t very good…

There’s a bias towards services formally supported by Friendfeed – it’s easy to add feeds from supported services. Connotea and CiteULike aren’t formally supported though you can add your library RSS feeds manually. Many Friendfeed users won’t bother to do this.

People may be contributing to services (like YouTube…) for reasons that have nothing to do with science.

People who use Friendfeed aren’t a representative sample of scientists (though they may well be a representative sample of blog friendly, web savvy scientists).

People sometimes remove their Twitter feeds from Friendfeed to help keep the conversations that they start there in one place.

I picked the set of services to look at which is why you don’t see, say, Wikipedia or OpenWetWare above (some preliminary analysis suggested that the numbers would be negligible).

That said…

We can still use it to guess at broad trends.

Just over a third of Friendfeed scientists (431 of ~ 1,240) have delicious bookmarks. Don’t discount non-academic bookmarking services as a source of paper metadata.

A similar number use the share functionality in Google Reader.

Despite rumors to the contrary not everybody is on Twitter.

A surprising (to me) number of people are uploading and favouriting items on Slideshare.