Data should be public domain, and more esoteric blog-based ‘rasslin’
At the end of last week, I noticed several items coming down the RSS tubes that dealt with the permission barriers we place on scientific data. These posts also seemed to be interrelated. At the time I couldn’t give them the attention they deserved, so I filed them away for later digestion. I’d like to discuss them here now, at the risk of kicking an anthill that seems to have settled down a bit over the weekend.
As far as I can tell, things seemed to start when Chemspider chose to license their data under a Creative Commons license. This is obviously (from my point of view) an attempt on their part to do the right thing - ensure that their data is freely available, and give themselves some controls to “enforce the freedom”. Then the wonderfully muddy communication medium of the internet kicked in, and people started getting angry at one another. It seems that Peter Murray-Rust published (somewhat erroneously) a conversation between himself and John Wilbanks. This conversation was taken somewhat out of context by the folks over at Chemspider, and the ball was rolling.
The Wilbanks comment that set off Chemspider reads:
I would add to it that I’d like to see a meaningful discussion of the risks of Share Alike and Attribution on data integration. Chemspider’s move to CC BY SA fits into this discussion nicely - it’s a total violation of the open data protocol we laid out at SC, which says “Don’t Use CC Licenses on Data” - but it does conform inside the broader [Open Knowledge Definition].
Now the way I read this (and keep in mind I’m doing exactly what caused the problem in the first place - putting my own spin on someone else’s words), I would rephrase this statement from Wilbanks something like this:
It’s good that Chemspider wants to ensure open access to their data; I just don’t think that the Creative Commons license is the correct tool to use here
Chemspider themselves, on the other hand, focused on the “total violation” bit, and came away feeling slighted.
Much as he did when I misinterpreted some of his comments, John Wilbanks quickly implemented diplomacy:
I think they should get complimented for their intentions and that they deserve tea and sympathy, because this licensing stuff is really complicated, and all they wanted to do was share.
Further reading/discussion at Open Reading Frame and chem-bla-ics.
Wilbanks also followed up with a great post on Open Data and the Public Domain in general.
The public domain is not contractually constructed. It just is. It cannot be made more free, only less free. And if we start a culture of licensing and enclosing the public domain (stuff that is actually already free, like the human genome) in the name of “freedom” we’re playing a dangerous game.
This is one of those posts you read that makes you virtually stand up and applaud. This is exactly what I think most of those who support the idea of open data believe in and are working to accomplish in their own way. It also just keeps getting better. As so often happens when the satellite of internet discourse passes close to the gravitational well of misunderstanding, the conversation has slingshot nicely off into an examination of the broader topics confronting Open Data. The OpenWetWare blog lists some of the realities facing creation of a “science exchange”:
I start with a couple of premises:
1. Data belongs in the public domain
2. An effective and useful data commons requires well structured data
3. Preparing high quality data costs money, and the tools do not really exist to support this in general
4. Academic career progression depends at core on one thing. How much money you bring in.
Please read the entire post on the OWW blog; I can’t possibly relay the contents in a few snippets. I’ll see you back here when you’re done.
The post discusses possible logistics of creating an open exchange for scientific knowledge. I have to say that while it might work, I’m not sure if it’s the best way to go about things. First of all, I’m not sure that we need to “pay people to deposit their data”. If you can make deposition easier and more socially acceptable than retention, then the problem will take care of itself. Also, I’m not sure that leveraging “premium” data from the exchange is the best way to monetize it. We should take as our example other fields in which there is an open dataset. What I envision is the development and sale of tools to access the data - be this software or hardware. By charging a nominal license fee for the tools, you are selling things like more efficient use of the data as opposed to the data itself.
The thing is, I don’t really care how we do it. We just have to make it happen. I can’t talk about this without descending into grammar school vocabulary for some reason, but it would be a REALLY REALLY GOOD THING, for science and the world at large. I can’t even wrap my brain around a little corner of how amazing it would be to have a real database of scientific information that is as searchable as the listings at Amazon. I have a feeling it would increase the speed of scientific advancement by a ludicrously large amount. It would save money (less repetition of work) and probably pay for itself in short order in that way.
The startup cost might be high, but its worth is likely to be immense. I’m still not sure how to find the force of will to get the job done, but I’m really glad to see the conversation.
::EDIT:: The Science Commons Blog has posted a relevant and clear guide entitled “How to Free Your Facts”.


May 12th, 2008 at 3:12 pm
Hi PA. I probably do get a bit obsessed about the money side. But, speaking as a cash strapped academic, I do know that providing the carrot of money to academics will get serious action very quickly, whereas the stick, whether hard (funding agencies demanding it) or soft (social pressure) is less effective.
It would be better if it just happened because the social structures moved that way but we do need to actually pay the cost of preparing and depositing the data somewhere along the line. If that gets paid directly by the ‘repository’ there is the opportunity to really demand quality which I think gets us further down the road faster.
As you say, once that’s in place then the money saved is potentially huge and we just need to make sure that flows back into supporting the deposition effort. Monetizing the data is more about trying to keep things running before the benefits kick in. I think we mean the same thing though - I meant premium services rather than premium data, which is kind of the same as your charging for the tools.
But as you say, the important thing is to just get further down this road.
May 12th, 2008 at 8:55 pm
PA, ChemSpiderman here….
The term slighted doesn’t really describe what happened for me. It was more of a hands in the air, deep breath, I don’t have time for this type of response. I’ve spent tens if not hundreds of hours trying to navigate negativity around what we are trying to do. We have over 5000 users a day visiting our site now, and tipped over 6000 last week. They seem happy. They like what we’re doing.
Meanwhile, I have allowed myself to get distracted by the comments of one vocal individual and have spent time addressing those comments rather than addressing the needs of the users of the system. What am I thinking? We are trying to build a community for chemists…ones that use the system and enjoy what we do. We should be focused on their needs rather than a non-user and, for right now, licensing be damned. All was good for us before except for the vocal critic as far as I am concerned. We are going to return it to that place and get the focus back onto delivering for our community of users.
For info about the multiple distractions see: http://www.chemspider.com/blog/another-response-to-constructive-feedback-from-peter-murray-rust.html
May 13th, 2008 at 8:49 am
@Cameron: I’m glad someone is paying attention to money. I tend to have this idealist “it will pay for itself in savings” mentality a lot of the time. While that may or may not be true, it doesn’t really provide enough momentum to get a large project like this off the ground.
@Chemspiderman: I totally understand where you’re coming from. It’s clear to me (from the outside looking in) that everyone in this situation was just trying to do the best thing for Open Science. You guys wanted to make sure the freedom of your data was protected; Wilbanks and PMR wanted to make sure that you were choosing the right kind of protection. There was a bit of a crossing of the wires in communication, and that’s about all it takes on the web to start a dust-up. I think partly it’s due to the fact that text on a screen often can’t convey a person’s emotions and frame of mind.
May 15th, 2008 at 3:41 pm
I think it will pay for itself in the end, but getting there will be expensive. It could take 15 years to recover the investment. More importantly I think unless there is a commercial interest in developing the tools we are going to wallow around not getting too far for several more years yet. I don’t see traditional funders having either the will or the flexibility (nor the international view) to make things happen. We will, however, get a long way with a bit of clever re-purposing of what is already out there so getting the big players involved will be important - and they need a return over 2-5 years at the outside.
May 24th, 2008 at 11:58 am
[...] of a scientific data commons. I [1,2] and others (including John Wilbanks, Deepak Singh, and Plausible Accuracy) have written on this quite a lot recently. The second aspect is that I believe strongly in the [...]