Archive for July, 2008

More on yesterday’s quote of the day and why I admire my undergraduate advisor

Thursday, July 31st, 2008

First, some background. I was lucky enough to be enrolled in the Honors program at my school. In retrospect this is one of the best things that ever happened to me solely because of Honors General Chemistry II lab.  Instead of a “normal” lab where you go in a few times a week and do some experiments out of a manual or something like this, the Hon Gen Chem II lab put us (as second semester Freshmen) into a Real Science Lab to work as assistants.  I don’t even remember exactly how I ended up working in the lab that I did, although I do remember the professor expressing some reluctance at having a couple of freshmen in the lab.  Regardless, he let me and another student come and assigned us to work alongside a tech who was in the lab at the time.  I really fell in love with the whole atmosphere.  The lab dynamic was great, it was exciting to be doing Real Science, and I just felt like this was what life should be like.

At the end of the semester, Dr. L came to me and said he’d be glad to have me back just as an independent research assistant, and I jumped at the offer.  From then on, if I was at the University I was working in the lab (I took at least one summer off to go home and work).  It might also be worth noting that I was a real sucker - most of the time I was working for free.

It was during my time in Dr. L’s lab that I switched my major from Biology to Chemistry, gained my interest in structural biophysics, and decided that I really wanted to go into academia.  Some of these things had been floating around in my mind, but here I could see people living out the life, and it looked great.

Now a few words about Dr. L himself.  For some reason I have a hard time explaining his personality.  I just recently realized what he’s “got” that most people don’t - the ability to constructively criticize.  For instance, when I wrote him the other day I had mentioned what I was thinking about doing for research for the rest of my life (a topic I’ve given some thought to).  In 3 or 4 short sentences in his reply he completely dismantled the idea in a way that not only made a huge amount of sense, but after a moment’s reflection revealed itself as 100% correct.  On top of that, I wasn’t upset that my dreams had been dashed, but rather excited and fired up about his alternate recommendations.

Another anecdote that still sticks with me:  I was taking one of the classes he taught (I forget the name, structural biophysics maybe… anyway, it’s not important).  It was mostly graduate students with a few undergrads sprinkled in.  He handed around a take-home exam to the class, with the admonishment that we were to work on it alone.  I was sitting next to my friend, another undergrad who worked in the lab, and we both sort of gave one another a “yeah right” look.  Then Dr. L gave the class a short speech.  The root message was something like:

You are all here because you plan on being scientists.  As a scientist, everything you do is based on your personal ethics.  If you cheat on this test, you are cheating on yourselves, and there is no way you will make for a decent scientist.

I remember feeling like the biggest chump for ever even thinking about doing otherwise, and determined to take extra measures to avoid even looking like I wasn’t working on the test alone.  Not only that, but this short speech has stuck with me since then and applies to everything I do as a researcher.

I can’t think of a single other person I’ve worked for or with that can so efficiently mentor, inspire, and motivate.  He does it without any obvious effort.  I can’t lie, I think I have a bit of a case of hero worship.  He’s really my role model; I feel like if I ever make it through graduate school and onto an academic position of my own, if I could be 50% of the advisor he is I’d improve on most others that I’ve seen.

The interesting and telling thing is that he is not a “hotshot” scientist as measured by a lot of the metrics considered important (like funding).  I remember one tough period when I was in the lab where we were washing and reusing pipette tips, “borrowing” chemicals and supplies, etc.  Since then I get the impression that things have changed for the better and that the lab is doing fine for itself.  This is so odd to me - I’ve seen dysfunctional, less productive labs that are funded through the ears.  Just another comment on the system, I suppose.

Do you have any figures like this in your scientific or personal life?

Quote of the day

Wednesday, July 30th, 2008

From my undergraduate advisor (and still my favorite faculty member):

Keep an eye on what is exciting because that is what will drag you into the lab everyday

This is a good way to avoid having people look at your content

Monday, July 28th, 2008

I tried to follow a link from another site to MAKE magazine, and was greeted with this welcoming page:

Doesn't this make you want to do what they say?

First of all, why do I have to enable cookies to read one of your articles?  Also, there is no justification given for taking this action (besides getting to see the mysteriously locked down content I suppose).  It takes a lot less time for me to close a browser tab than to jump through your hoops.

Open Science blog carnival - The interest seems to be there, so what about the details?

Saturday, July 26th, 2008

My previous post was really just a signal flare to see if anyone was aware of a pre-established regular blog carnival which focused on Open Science.  The consensus seems to be that no, there is not, and it would be a worthwhile idea to try.

Here are my thoughts on how to organize such a carnival, which I’m tentatively calling “Open Carnival”, but it could really use a more pithy name (hint: put your ideas in the comments):

  • Focus on Open Science, including Open Access, Open Notebooks, and Open Data
  • Monthly “publication”
    • This allows for a nice amount of good posts to build up, and cuts down on overhead for the host
  • Themed
    • The host chooses a theme for the carnival that will be hosted at their blog.  The theme is announced at the publication of the carnival before, to allow people time to prepare their entries
  • Rotating host
    • Anyone interested in hosting could add their name to a list; the carnival migrates through the list
  • Content supply: automatic and manual
    • Automatic content retrieval via tag.  For instance the author could apply a tag “Open Carnival” on a post and the host could retrieve these using a tool like Google Blog search
    • Manual method in which the author emails a link to the upcoming host

Sound reasonable?  Anything I left out or should take away?  I’ve already got some ideas for themes to get things started…

Open Access carnival?

Saturday, July 26th, 2008

This post over at Bora’s blog has me wondering if there is a blog carnival out there (active or defunct) that focuses on Open Science?  I feel like I keep a decent eye on the OA blogs, but I don’t remember seeing one mentioned.  I feel like it would be a good way to knit the community even closer together.

I get the feeling that a lot of us (perhaps it’s a bit presumptuous to include myself here) are keeping an eye on what one another are writing and perhaps creating our own posts or comments in response, but a regular (monthly?) carnival might be a nice way to compile the “best” OA posts in one place.  This could be a traffic driver of course, as well as a nice synopsis or rolling snapshot of the state of the Open Science conversation.

To quote Bora:

Because this is the best way to build a community around a particular topic - the quickest, easiest way for people who are harboring similar interests to find each other, decide if they like each other, to boost each other’s rankings and traffic, and, if needed, to organize together for some kind of action. In best cases, you will meet some of those bloggers in person and forge new friendships, or even scientific collaborations.

So, if there is one out there already that I’m just not aware of, erm.. promote yourself more! If there isn’t a carnival going already, feedback on getting one started?

Gettin’ Anxiety

Friday, July 25th, 2008

worried I'm getting anxiety

Toothpaste For Dinner - “The most addictive comic on the web”

Extracting data from the PDB and the power of FriendFeed, part 2 (IRC edition)

Tuesday, July 22nd, 2008

After another several hours of cranking away at this script, along with some help from my ever-brilliant buddies on IRC, I’ve gotten something that works well enough for now.  The code:

#!/usr/bin/python2.5

import csv, glob
from xml.etree import ElementTree as ET

# Namespace prefix carried by the elements in the PDB's XML files
PDBx = '{http://deposit.pdb.org/pdbML/pdbx.xsd}'
# Start of the file names to process, e.g. the first two characters of a PDB ID
search = raw_input('What grep term? > ')
filelist = glob.glob(search+'*.xml')
itemdict = {}
def parser(filename):
	# Pull the PDB code and the crystallization details out of one file
	tree = ET.ElementTree(file=filename)
	rootelem = tree.getroot()
	pdbcode = rootelem.attrib['datablockName']
	for detailtag in tree.getiterator(PDBx+'pdbx_details'):
		details = detailtag.text
		# Skip empty elements as well as the literal text 'None'
		if details is None or details == 'None':
			pass
		else:
			print 'Protein '+pdbcode+' added to database'
			itemdict[pdbcode] = details
	return itemdict

outfile = open('extracted.csv', 'a')
for filename in filelist:
	parser(filename)

# Append one row of key,value per protein
writer = csv.writer(outfile)
for k,v in itemdict.iteritems():
	writer.writerow((k,v))
print "Complete.  Data added to extracted.csv"
outfile.close()

I’ve wrapped the XML processing in a function called “parser”. This pulls the PDB code and the crystallization details out of each file, then inserts them into a dictionary as a key:value pair. The parser is run on every file matching the search term (entered by the user at runtime), and the data in the resulting dictionary is appended to extracted.csv as rows of key,value. This file can then be opened as a CSV document.
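For anyone reading along on a newer Python, here is a rough Python 3 sketch of the same idea. It is not the script I actually ran: the function names `collect_details` and `append_csv` are made up for illustration, and `SAMPLE` is a heavily trimmed, invented stand-in for a real PDB XML file, written to a scratch directory so nothing real gets touched.

```python
import csv
import os
import tempfile
import xml.etree.ElementTree as ET

# Namespace string copied from the script above.
PDBX = '{http://deposit.pdb.org/pdbML/pdbx.xsd}'

# Invented, heavily trimmed stand-in for a real PDB XML file.
SAMPLE = (
    '<PDBx:datablock xmlns:PDBx="http://deposit.pdb.org/pdbML/pdbx.xsd" '
    'datablockName="1AUN">'
    '<PDBx:pdbx_details>0.7 M MGCL2, 10% GLYCEROL</PDBx:pdbx_details>'
    '</PDBx:datablock>'
)


def collect_details(filenames):
    """Map each PDB code to its crystallization details text."""
    items = {}
    for name in filenames:
        root = ET.parse(name).getroot()
        code = root.attrib['datablockName']
        for tag in root.iter(PDBX + 'pdbx_details'):
            # Skip empty elements and the literal text 'None'.
            if tag.text and tag.text != 'None':
                items[code] = tag.text
    return items


def append_csv(items, out_path):
    """Append the dictionary to out_path as rows of key,value."""
    with open(out_path, 'a', newline='') as handle:
        csv.writer(handle).writerows(items.items())


# Demo in a scratch directory.
workdir = tempfile.mkdtemp()
xml_path = os.path.join(workdir, '1AUN.xml')
with open(xml_path, 'w') as handle:
    handle.write(SAMPLE)

found = collect_details([xml_path])
csv_path = os.path.join(workdir, 'extracted.csv')
append_csv(found, csv_path)
print(found)
```

Passing the list of files in, rather than reading a global, makes it easier to feed the function one chunk of files at a time.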

Here is the top of the output file:

1AF6,pH 7.

1AUN,"PROTEIN WAS CRYSTALLIZED FROM 0.7 M MGCL2, 10% GLYCEROL, 50 MM HEPES, PH 7.5"

1A0R,"THE PROTEIN COMPLEX (10 MG/ML SOLUTION) WAS CRYSTALLIZED FROM 400 MM SODIUM CACODYLATE (PH 6.8), 1 MM ZINC CHLORIDE, 25 % ETHYLENE GLYCOL, 12 % PEG 8000, BY MICROBATCH CRYSTALLIZATION AT 4 DEGREES C., microbatch, temperature 277K"

1A0S,"PROTEIN WAS CRYSTALLIZED BY VAPOR DIFFUSION USING THE SITTING-DROP METHOD. THE DROP CONTAINED 5-7 MG/ML PROTEIN, 20 MM TRIS/CL AT PH 7.7, 100MM LICL, 20MM MGSO4, 1.2% BETA-D-OCTYLGLUCOPYRANOSIDE AND 6-9% PEG-2000. THE CONCENTRATION OF PEG IN THE RESERVOIR WAS 12-15%., vapor diffusion - sitting drop"

1AKN,pH 7.
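In case the mix of quoted and unquoted rows looks odd: as far as I can tell that is just the csv module's default behavior, which only wraps a field in quotes when it contains a comma (or a quote or line break). A small Python 3 sketch, using shortened versions of two of the rows above and an in-memory buffer instead of a real file:

```python
import csv
import io

# Write two rows to an in-memory buffer instead of extracted.csv.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(('1AF6', 'pH 7.'))
writer.writerow(('1AUN', 'PROTEIN WAS CRYSTALLIZED FROM 0.7 M MGCL2, 10% GLYCEROL'))

lines = buffer.getvalue().splitlines()
print(lines[0])  # no comma in the details, so no quotes are added
print(lines[1])  # details contain a comma, so the field is quoted

# Reading the text back with csv.reader strips the quotes again.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[1][1])
```

So any spreadsheet or script that reads the file as proper CSV should get each set of conditions back as a single field.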

I had to split up the grep/glob term because this was the only way I could figure out how to run the program. I’m running it now in chunks, using the first two characters of the PDB identifiers as the search term. This way the memory gets cleared out after every 10 or 20 files and the machine doesn’t lock up.
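The chunking itself could be automated instead of typed in by hand. Below is a minimal Python 3 sketch of just the grouping step; `group_by_prefix` and the file names are hypothetical. Grouping alone does not free any memory: that still depends on handling each batch in a fresh run of the script (or a separate process), which is what re-running with a new search term does.

```python
from collections import defaultdict


def group_by_prefix(filenames, width=2):
    """Group file names by their first `width` characters."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        groups[name[:width]].append(name)
    return dict(groups)


# Hypothetical file names, one per PDB entry.
names = ['1af6.xml', '1aun.xml', '1a0r.xml', '1a0s.xml', '2bg9.xml']
batches = group_by_prefix(names)
for prefix, batch in sorted(batches.items()):
    print(prefix, batch)
```

Each key could then be used as the search term for one run.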

As usual with my code, it’s pretty ugly. It’s functional, but could be better. I could also add to the function to extract the other information I mentioned in the initial posts on this topic (citation info and protein name), but to be honest I’m sort of tired of wrestling with it right now.

Extracting experimental data from the PDB, and the power of FriendFeed, Part 1

Monday, July 21st, 2008

Late last week, one of my committee members posed a suggestion: take a look at all of the known structures of proteins sort of like mine and see if you can find any pattern or ideas for things to try.  It’s a pretty obvious idea, but the implementation is what was a bit complicated.

First of all, I had to define a set of known structures that fall in the set of “similar to my protein”.  Luckily for me I study an interfacial protein and the Orientations of Proteins in Membranes (OPM) database has already done a fantastic job of this.  They even provide a list of all the PDB codes as a plain text list.  Over the weekend I (foolishly) took the Web 1.0 route to solve my problem: manually searching for each code, hunting for the data I wanted, and doing the Ctrl-C Ctrl-V dance until I wanted to cry.  This took about 10 entries.

I realized there had to be a better way.  Enter FriendFeed.  I posted my problem there and within minutes had great suggestions from Deepak and Neil Saunders.  Deepak recommended I write a parser, and Neil mentioned that the PDB offered XML versions of their data.  Both of these things combined would make my life a lot easier.  Using the list of PDB codes from the OPM and the download page at the PDB, I grabbed all of my files.  Then it was off to Python.

Of course once I got my grubby mitts on it, I started running into problems.  After a few hours of toying with it, here is my (non-functional) code:

#! /usr/bin/python

import csv, string, re, glob
import xml.etree.ElementTree as ET
PDBx = '{http://deposit.pdb.org/pdbML/pdbx.xsd}'
filelist = glob.glob('*.xml')
print filelist
def xmlparse(filename):
	xmltree = ET.ElementTree(file=filename)
	rootelem = xmltree.getroot()
	pdbcode = rootelem.attrib['datablockName']
	if pdbcode is None:
		pass
	else:
		print pdbcode
	rootelem.clear()

for filename in filelist:
	xmlparse(filename)

Note: the PDBx variable has to do with the namespace included in the XML file.  I’m not using it here because the PDB code is embedded in the root element, but when I go to extract data which is nested deeper it becomes important.
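To make the namespace point concrete, here is a small Python 3 sketch. The XML string is an invented, simplified stand-in for a real PDB file (the nesting is only illustrative); the namespace string is the one from the code above.

```python
import xml.etree.ElementTree as ET

PDBx = '{http://deposit.pdb.org/pdbML/pdbx.xsd}'

# Invented, simplified stand-in for a real PDB XML file.
SAMPLE = (
    '<PDBx:datablock xmlns:PDBx="http://deposit.pdb.org/pdbML/pdbx.xsd" '
    'datablockName="1AUN">'
    '<PDBx:exptl_crystal_grow>'
    '<PDBx:pdbx_details>50 MM HEPES, PH 7.5</PDBx:pdbx_details>'
    '</PDBx:exptl_crystal_grow>'
    '</PDBx:datablock>'
)

root = ET.fromstring(SAMPLE)

# The attribute on the root element carries no prefix, so no namespace is needed.
code = root.attrib['datablockName']

# Nested elements are stored under their fully qualified names.
bare = list(root.iter('pdbx_details'))              # finds nothing
qualified = list(root.iter(PDBx + 'pdbx_details'))  # finds the element

print(code, len(bare), len(qualified))
print(qualified[0].tag)
```

ElementTree expands the `PDBx:` prefix into the full `{...}` form when it parses the file, which is why the bare tag name comes back empty.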

This should just print out the PDB IDs for each of the files, and indeed does so for the first several of them. The problem is that while I believe the line “rootelem.clear()” should wipe the data from the memory, after about 4 files my system goes to 100% memory usage and bogs down.

I really feel like this should work. In playing with a single file I more or less managed to already pull out the information I was interested in. The only issue now is how to handle so many files. The documentation on ElementTree is a bit lacking, so I feel like it’s just going to take a lot more testing.
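One thing that might help, assuming the memory cost comes from building the full tree for every file (I have not confirmed that this is the cause): ElementTree's `iterparse` can hand back the root element as soon as its opening tag has been read, so the PDB code can be grabbed without loading the rest of the file. A Python 3 sketch, with a made-up helper name and a tiny invented sample in place of a real file:

```python
import io
import xml.etree.ElementTree as ET


def read_datablock_name(stream):
    """Return the root element's datablockName from a binary stream."""
    for _event, elem in ET.iterparse(stream, events=('start',)):
        # The first 'start' event is the root element; its attributes
        # are already available, so stop here.
        return elem.attrib.get('datablockName')
    return None


# Tiny invented sample standing in for a large PDB XML file.
sample = io.BytesIO(
    b'<PDBx:datablock xmlns:PDBx="http://deposit.pdb.org/pdbML/pdbx.xsd" '
    b'datablockName="1AF6"><PDBx:atom_siteCategory/></PDBx:datablock>'
)
name = read_datablock_name(sample)
print(name)
```

With real files the stream would come from `open(filename, 'rb')` inside a `with` block, so each file is closed before the next one is opened. This only covers the root attribute; data nested deeper would need the loop to keep going and clear elements as they are finished.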

Is there a good way to extract experimental data from the PDB?

Monday, July 21st, 2008

What I have: A list of PDB codes (~750)

What I’d like:

  • The protein name
  • Citation
  • Experimental conditions
    • If it’s a crystal, the crystallization conditions
    • If it’s NMR, the sample buffer will do

Most files on the PDB obtained via X-Ray diffraction include a data file like this one, which often lists the crystallization conditions.  The PDB also offers a tabular search function which allows you to choose things like ionic strength, solvent conditions, etc., but for my search every single entry came back with n/a for the parameters I’m interested in.

As of now my method goes like this:

  • Use advanced query to pull out the entries from my list that include experimental data
  • Click link to detailed page
  • Copy and paste title and citation into spreadsheet
  • Open data file in external text editor
  • Search for crystallization conditions
  • Manually extract and enter this information into the spreadsheet

It’s pretty labor-intensive, and doing >700 of these is going to take me a long time.  I was wondering if anyone knew of a better way to go about pulling this information out.

A bit of an Open Science roundup

Sunday, July 20th, 2008

At the end of last week, it seemed there was something in the water that blog authors were drinking.  My RSS feeds kept churning up really interesting articles that I just didn’t have time to blog about, being a bit busy with my list of Things To Do ™ in order to head down the homestretch of my degree.

This, of course, is why some brilliant programmer at Google allows you to “star” items for later.  Now that it’s a lazy Sunday, I thought I’d take the time to add my comments.  In chronological order:

Item 1 is just a short mention that PLoS has upgraded the TOPAZ software they use to power the journals.  I’ve given them some grief in the past for performance, but the general trend for OSS is faster, leaner, better and it’s good to see them following the trend. (via Bora)

Item 2 is a post by the inimitable Cameron Neylon over at the OpenWetWare blog entitled “Policy for Open Science - reflections on the workshop”. It’s a wide-ranging post on the status of Open Access both of literature and data. Although Cameron covers a lot of ground, I especially like the commentary on how to engage scientists in the sharing process as soon as possible:

The same is true for capturing data. We must capture it at source. This is the point where it has the potential to add the greatest value to the scientist’s workflow by making their data and records more available, by making them more consistent, by allowing them to reformat and reanalyse data with ease, and ultimately by making it easy for them to share the full record

Cameron also believes that the software to do this capture should be composed a certain way:

It needs to be an open and standards based ecosystem and in my view needs to be built up of small parts, loosely coupled.

I sort of agree with this. I definitely feel that the software should be open source, with small teams maintaining certain applications that they are most interested in. I don’t necessarily think that the individual applications should be too loosely coupled, but I would like to see them as pluggable, interchangeable components of a larger framework.  Presumably the developers and maintainers of this larger framework could publish a set of standards or API which the individual application developers could adhere to.  A scientist would then simply install the framework and whatever apps they needed for their purposes.

I do recommend reading all of Cameron’s post.  It’s quite interesting.

Finally, Item 3 is a post by Michael Nielsen (via John Wilbanks) which once again takes a look at the state of open data, starting with a discussion of the historical standard practice of keeping discoveries a secret.

The adoption and growth of the scientific journal system has created a body of shared knowledge for our civilization, a collective long-term memory which is the basis for much of human progress. This system has changed surprisingly little in the last 300 years. The internet offers us the first major opportunity to improve this collective long-term memory, and to create a collective short-term working memory, a conversational commons for the rapid collaborative development of ideas.

He goes on in the rest of the post to perform a really comprehensive examination of the entire open data issue.  To me, the prime quote is this one:

the journal system is perhaps the most open system for the transmission of knowledge that could be built with 17th century media. The adoption of the journal system was achieved by subsidizing scientists who published their discoveries in journals. This same subsidy now inhibits the adoption of more effective technologies, because it continues to incentivize scientists to share their work in conventional journals, and not in more modern media.

That pretty much sums up the problems that we proponents of open data face.  He goes on to talk about how we can take lessons from economics and commerce in attempting to overcome these obstacles.  It’s a bit of a long read, but every word of it is intriguing, and I highly recommend it if you can set aside a few minutes.  At the end of the post he mentions that he’s working on a book on the future of science, which I am already anticipating based on this commentary.