Archive for April, 2008

Weekend Liveblog: Switching my home computer from Windows XP to Ubuntu

Saturday, April 19th, 2008

I’ve been using Ubuntu as my operating system at work for about 2 years now. I really love it. I’ve tinkered with making it the OS on my home computer, but since I do a bit of gaming I keep putting it off (as gaming support in Linux tends to lag behind Windows a bit).

I’ve been getting more and more fed up with Windows lately, however, so I’ve decided to give Linux another go. I’ll be liveblogging the changeover for you below.

(more…)

Feasibility of using LaTeX as a means of generating machine-readable papers

Friday, April 18th, 2008

As more papers are published in Open Access journals, it becomes possible to text-mine these articles and build large, centralized, and cross-linked databases of scientific knowledge.  One of the keys to this effort is the ability to read in the content of scientific articles using computational algorithms.  These algorithms need to be able to “understand” the content, at least to the point where they can properly assign things such as the title, authors, system under study, key findings, etc.

Most people who are working on this problem, which is tied in with the semantic web, believe that in order to do this, articles must be accompanied by markup language, or metadata.  This means that in addition to the actual content of the article (the text of the paper), we need to include some information about the content.  For instance, the raw file might look something like this:

<journal>The Journal I Just Made Up</journal>
<issue>1</issue>
<pages>1-2</pages>
<title>A Study of Everything: Not so complicated as you might think</title>
<author1>Accuracy, P.</author1>

The things between the <> are metadata - they tell a computer something about what is inside.  A program could quite easily scan this raw file and then do things with it, whether that is assigning the content between the tags to a database, or formatting the article for easier reading online.
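To make the "quite easily" concrete, here is a minimal sketch in Python that scans the made-up record above and collects each tag and its content. The pattern and the `parse_metadata` name are my own illustration, and a regular expression like this only copes with flat, non-nested tags; a real pipeline would more likely use a proper XML parser and an agreed schema.

```python
import re

# The made-up record from the example above.
RAW = """\
<journal>The Journal I Just Made Up</journal>
<issue>1</issue>
<pages>1-2</pages>
<title>A Study of Everything: Not so complicated as you might think</title>
<author1>Accuracy, P.</author1>
"""

# Match <tag>content</tag>, requiring the closing tag to repeat the opening name.
TAG_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def parse_metadata(raw):
    """Return a dict mapping each tag name to the text between its tags."""
    return {tag: content.strip() for tag, content in TAG_PATTERN.findall(raw)}


record = parse_metadata(RAW)
print(record["title"])
print(record["author1"])
```

Once the record is a dictionary, handing it to a database or a formatter is a separate and much simpler step.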

Markup like this is already very common in web development.  If you right-click this page and then click on “view source”, you’ll see that the raw file used to generate the page is full of tags like this.  Your web browser formats the raw file according to a simple style sheet (also provided by the web developer - you can see the PlausibleAccuracy sheet here) in order to display it on your screen.

So one of the issues of machine-readable scientific articles is that we need to apply some sort of markup like this to the papers.  This requires some cost, both in time and money (you presumably have to pay someone to do it), and as such you have to convince the people involved that it’s worth doing.

There is already document writing software which includes markup and is particularly well-suited to academic papers - LaTeX.  Much in the same way that Cascading Style Sheets can be used to separate web design from site content, LaTeX uses style files to format a content-rich document.  When writing a paper in LaTeX, you simply tag things as you go and focus on writing what is important, rather than wrangling with formatting.  You then process the document and a nice looking file is generated.
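As a rough illustration of how those tags could be read back out by software, the sketch below pulls the title, author, and abstract from a tiny LaTeX source. The sample document and the helper names (`extract_command`, `extract_environment`) are hypothetical, and the regular expressions deliberately ignore nested braces and custom macros, which real papers are full of.

```python
import re

# A hypothetical, minimal LaTeX source; real papers use many more commands.
LATEX_SOURCE = r"""
\documentclass{article}
\title{A Study of Everything}
\author{P. Accuracy}
\begin{document}
\maketitle
\begin{abstract}
Everything is not so complicated as you might think.
\end{abstract}
\section{Introduction}
Some text.
\end{document}
"""


def extract_command(source, name):
    """Return the argument of the first \\name{...}, or None (no nested braces)."""
    match = re.search(r"\\" + re.escape(name) + r"\{([^{}]*)\}", source)
    return match.group(1).strip() if match else None


def extract_environment(source, name):
    """Return the body of the first \\begin{name}...\\end{name}, or None."""
    escaped = re.escape(name)
    pattern = r"\\begin\{" + escaped + r"\}(.*?)\\end\{" + escaped + r"\}"
    match = re.search(pattern, source, re.DOTALL)
    return match.group(1).strip() if match else None


paper = {
    "title": extract_command(LATEX_SOURCE, "title"),
    "author": extract_command(LATEX_SOURCE, "author"),
    "abstract": extract_environment(LATEX_SOURCE, "abstract"),
}
print(paper)
```

Standard commands like `\title` and `\author` only cover the basics; tagging things such as the system under study or the key findings would need additional markup that journals and authors agree on.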

It seems to me that writing papers in LaTeX would be a benefit to everyone involved in the publishing of academic papers.  Authors can just write up their results, and spend less time wrestling with Microsoft Word to try and shoehorn the paper to match the journal’s formatting rules.  Journals should be able to write relatively simple style files for the LaTeX submissions to generate documents suitable for publication.  In the same vein, those interested in machine parsing of the documents could write software which also leveraged the markup for their own uses.  Of course, however, there are some concerns.

The main one, in my view, is reluctance on the part of authors to learn LaTeX.  It does take a (small) amount of effort to learn, and to be sure the actual writing is more complicated than doing so in a What You See Is What You Get (WYSIWYG) editor like Microsoft Word.  It’s my experience that for some reason people see the time and effort put into learning LaTeX and including markup during the writing process as more work than the formatting changes that they have to make after the fact in Word, even though those after-the-fact changes may take more time and be more frustrating.

It’s also harder to just read a LaTeX file.  It’s sort of like reading a raw web page without a browser - you stumble across the tags.  More important than this is group editing of a document; most professors like to receive an electronic copy from the student writing the paper, so they can make their edits directly and return it.  As LaTeX tends to output PDF files, this isn’t so simple.  There are other alternatives, like WYSIWYG editors for LaTeX, but the experience I’ve had with them is less than stellar.

In spite of these issues, I think that LaTeX (or perhaps a modified version of it) is an excellent candidate as we think about ways to generate a semantically addressed body of scientific literature.  Perhaps development of specific markup and improved WYSIWYG editors would aid this effort.

Science Conference to be held in World of Warcraft

Friday, April 18th, 2008

The Science RSS feed this week contained an interesting entry entitled “Scientists, we need your swords”.  Curiosity piqued, I clicked through to check it out.

It turns out that they are organizing a scientific conference in the game World of Warcraft.  To participate, you just need a character on the proper server (Earthen Ring US) and to join the “Science” guild.  The conference takes place May 9-11, and as you might expect is going to focus on virtual worlds research.

If you’re interested, check out the preliminary program.  I recommend creating a character soon… The information states that they plan on raiding an enemy city, and I can promise you that you’ll spend most of that time dead unless you start grinding boars now.

Some good news from my institution

Thursday, April 17th, 2008

Last night I was having a discussion with Mrs. PA.  She was worried that a paper she was writing (yes, it’s a two scientist household) was a bit long for publication.  I told her “Hey, I bet you can find an OA journal, such as PLoS One, that will not bind you to arbitrary page limits”.  I then launched into my (new and unpolished) sales pitch on why publishing in an Open Access journal would be better overall for the paper, and showed her some other papers in her field that have been published in PLoS One.  Then I came to the hard part - the increased page charges that she would have to sell her advisor on.  I mentioned that for PLoS One these come to $1250, which probably isn’t that much more than they would end up paying anyway to the closed-access journal they were thinking of publishing in.  Then I noticed a line on the PLoS site which stated that Institutional Members got a discount on top of this.  Not hoping for much, I went to the list of members, and lo and behold, our school was on it!  I’m really excited about this, because I think it will make convincing faculty to publish (in at least this particular OA journal) much easier.

I also learned that the University has put in place a support system for graduate students who lose their adviser.  If you read my 3-part post on misconduct, you’ll realize that this is very important to me.  I’m checking into it for more details, which I’ll probably write about later.  Unfortunately, the current policy on research misconduct investigations (dated 9/07) fails to mention students once, and is still chock full of the secrecy and careful political maneuvering that caused problems in the situation I was involved in.

Impact factors are hooey

Wednesday, April 16th, 2008

One of the major metrics scientists use to rank journals is the “Impact Factor”.  This is a real number calculated in mysterious ways by the Thomson Corporation.  The problem with it is twofold: first of all, it’s not accurate; secondly, it’s held in far too high a position of importance (especially given point one).

This came up in my thoughts because of a comment from Sciencewoman in response to my recommendation that she look into publishing her latest paper in an OA journal:

PA: I looked at the directory [of open access journals] and the only appropriate OA journal had a significantly lower impact factor. Again, as a pre-tenured person, I need to be aware of those things.

This disheartens me because I think that it is precisely amongst the young faculty that a preference for OA publishing can take hold.

So, what is wrong with impact factors?  I hardly know where to start… Indeed the subject has been covered extensively elsewhere, and I’ll refer the reader to papers from Nature, Seglen, & Postma.

Let’s look at this case specifically.  Sciencewoman is worried that publishing her work in a journal with a low impact factor will reflect negatively on the paper - the tenure committee won’t think that it’s “prestigious”.  First of all, this is a fallacious argument.  Impact factors are an aggregate of citation counts for that journal (more or less), and therefore reflect only on the journal as a whole, not individual papers within it.  For instance, let’s say I wrote a really smashing paper in the Journal I Just Made Up, which was cited by every scientist twice.  Since all the other papers in JIJMU were random scraps I found in rubbish bins, they weren’t cited at all.  Just because JIJMU would have a fantastic impact factor (hauled up by my honestly brilliant paper) that doesn’t mean that my neighbor’s grocery list is also groundbreaking research.  This is an extreme example, but I think you get my point.
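The JIJMU example can be put into numbers. The sketch below uses citation counts I invented for the purpose, and it treats the journal-level figure as a plain average of citations per paper; the real calculation uses a specific citation window and counting rules that are not modeled here.

```python
from statistics import mean, median

# Hypothetical citation counts for ten JIJMU papers:
# one heavily cited paper and nine rubbish-bin scraps with no citations.
citations = [2000] + [0] * 9

# A journal-level average, in the spirit of an impact factor.
journal_average = mean(citations)

# What a typical individual paper in the journal actually received.
typical_paper = median(citations)

print(f"mean citations per paper:   {journal_average}")
print(f"median citations per paper: {typical_paper}")
```

The average looks impressive while the median paper has no citations at all, which is the whole problem with reading a journal-level number as a statement about any one paper.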

There are more specific problems with impact factors as they relate to OA journals.  These journals tend to be younger, and therefore are inherently shortchanged when it comes to impact factors which are tallied over several years.  They often serve niche fields, in which the overall impact factor might be lower (the numbers across fields cannot really be directly compared, because of different audience sizes and referencing traditions).  To be fair, there are certain ways that OA journals have an advantage when it comes to impact factors - it’s been shown that OA papers tend to be referenced more often than their closed counterparts.

The fact is that the impact factor as a metric of research quality is embarrassingly bad, and I’d be happy if we could just do away with the notion altogether.  I definitely think that a young faculty member up for tenure should be able to make a convincing argument that the committee should look past a flawed number and appreciate the desire to make the work accessible to the largest community possible, by publishing it openly.

Another way to reduce “medical losses” - pay out less insurance for life saving drugs

Tuesday, April 15th, 2008

The drug companies and health insurance companies provide a critical service to the people.  They can act either in benevolent ways, ensuring that the most people get the best health care possible, or they can act like mafia extortion men, holding health and happiness hostage for ever-increasing amounts of money.  I think it’s clear which way they have leaned, especially in the recent past.

Case in point, this article from the New York Times.  It describes a practice that insurance providers have put into place, in which for specific drugs which are expensive, the insurance pays a percentage of the cost rather than everything above a limited co-pay.

From the article (emphasis mine):

No one knows how many patients are affected, but hundreds of drugs are priced this new way. They are used to treat diseases that may be fairly common, including multiple sclerosis, rheumatoid arthritis, hemophilia, hepatitis C and some cancers. There are no cheaper equivalents for these drugs, so patients are forced to pay the price or do without.

These are not “optional” drugs, nor is this a case of people choosing name-brand over generic because of some sense that they are better. These patients need these specific drugs to be healthy.  This policy has a terrible impact on people already struggling with serious diseases:

There must be a mistake, Ms. Steinwand said. So the pharmacist checked with her supervisor. The new price was correct. Kaiser’s policy had changed. Now Kaiser was charging 25 percent of the cost of the drug up to a maximum of $325 per prescription. Her annual cost would be $3,900 and unless her insurance changed or the drug dropped in price, it would go on for the rest of her life.
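The arithmetic in that excerpt is easy to check. Here is a small sketch of the pricing rule as quoted: 25 percent of the drug's cost, capped at $325 per prescription. The one-fill-per-month schedule and the $1,500 drug price are my assumptions, since the excerpt doesn't give either; the quoted $3,900 is consistent with twelve fills at the cap.

```python
COINSURANCE_RATE = 0.25      # "25 percent of the cost of the drug"
CAP_PER_PRESCRIPTION = 325   # "up to a maximum of $325 per prescription"
FILLS_PER_YEAR = 12          # assumption: one prescription filled per month


def patient_cost(drug_price):
    """Patient's share of one prescription under the percentage-with-cap rule."""
    return min(COINSURANCE_RATE * drug_price, CAP_PER_PRESCRIPTION)


def annual_cost(drug_price, fills=FILLS_PER_YEAR):
    """Patient's total cost over a year of fills at a constant drug price."""
    return fills * patient_cost(drug_price)


# Hypothetical price: anything at or above $1,300 per fill reaches the cap.
print(annual_cost(1500))
```

Below the $1,300 threshold the patient pays a straight quarter of the price, so a cheaper drug costs less out of pocket; above it, the cap is all that stands between the patient and the full 25 percent.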

Of course the insurance companies sell this using a populist message: make the sick minority pay for their drugs, and your premiums will be lower.  Never mind that this goes against the very purpose of insurance.

I’m not quite sure who to blame here.  The drug companies are almost definitely overcharging for these drugs, based on a business model that relies on double-digit profit margins and advertising budgets that outstrip research budgets.  Insurance companies are literally parasitic, profiting by making America sicker.  The only real option is a comprehensive overhaul of health care, which I hope the next Administration can push through.

Mondays are catch-up days

Monday, April 14th, 2008

I usually take the weekend off from blogging and RSS reading, preferring to play really obscene amounts of EVE Online. This usually means that I spend a fair amount of my Monday morning just catching up on the pile that accumulates in my “unread items” list in Google Reader.

Here are some highlights:
Harold Varmus’ interview on NPR’s Science Friday was a nice public discussion of Open Access.  He talks about the background of PLoS, the new NIH delayed release policy, and the OA movement in general.  As others have noted, the very first listener question asks about peer review in OA journals.  Also as noted in the link above, peer review and OA are entirely separate issues, and I think Dr. Varmus could have been a bit more clear here.  He also at the end of his answer throws in a sentence or two about the source of funding for articles which is a bit misleading.  Regardless, I think overall it was a positive interview, and I recommend listening to it if you haven’t already.  Also, the ScienceCommons blog has a few more related links.

Next, take a break from reading for a bit and watch Bill Maher interviewing Richard Dawkins (via Greg Laden)

Also, if you have some time, check out this screencast on Open Notebook Science from Jean-Claude Bradley (via The Imaginary Journal of Poetic Economics).

Publishers’ talking points put them in a head to head ignorance fight with the RIAA

Friday, April 11th, 2008

OAN linked to a PDF entitled “An Overview of Scientific, Technical, and Medical Publishing and the Value it Adds to Research Outputs” that I would like to discuss.  Unlike Dr. Suber’s version, my PDF doesn’t seem to be locked against copying and pasting excerpts, so I’ll include some tidbits along with my commentary.

The position paper starts out listing the benefits of publishers… it’s really boilerplate.  Eventually they get down to brass tacks and start talking money:

The total cost of publishing a journal article with a print and electronic edition depends on multiple factors, but has been estimated to average between € 1100 and 3000 (US$ 1500 – 4000)

Well, I find that interesting, given my own number crunching from a few days back. Remember, I found that the page charges alone in at least one journal (JBC) cover a fair bit of the total costs of publishing.  Keep in mind that this doesn’t take into account advertising revenue (not mentioned at all in this paper) or the actual subscription fees, both of which are substantial.  But of course, if you don’t publish in a “well-respected” journal, your work is worthless:

The cost of publishing is integral to the cost of doing research and without support in the form of publication in a well-respected journal, research remains largely unrecognised.

They are really a bit full of themselves, actually:

While peer review ensures the quality and scientific integrity of articles, it is the journal “brand name” that places those articles in context for readers.

And how was that brand name developed? That’s right - by publishing high quality research. They are putting the cart in front of the horse and insisting that it is the motive force.

Of course, they aren’t done telling us about all the hard work they do for us:

The costs of journal publishing include the costs of managing not only the peer review
and the creation and management of journals themselves, but also the costs of
substantive editing, verifying references and inserting tags to create the online links,
preparing illustrations or special graphics, typesetting, coding for web dissemination
(e.g., in XML) and layout.

As someone who has actually written a journal article, this paragraph made me laugh out loud. When you submit a manuscript, you have to jump through a remarkable number of hoops to get everything formatted precisely as the journal wants it. I’ve had comments that my references weren’t numbered properly, certain figures weren’t the exact resolution, etc etc. It’s true that the journal has to do some work to generate the final document, but I’m willing to bet that this is highly automated, and is likely not much more complicated than a LaTeX document class slapped around the whole thing.

All right, so now we know how hard the publishers have to work once you’ve actually done all the experiments, written the paper, formatted it to their criteria, and oftentimes told them who you’d like to review it.  Now they give us a list of reasons that prices are increasing:

  • Increased numbers of articles produced by researchers (as described in The Scale
    of STM Publishing section of this paper), at around 3 % annually, and increased
    average length of articles. This is a fundamental driver for journal costs, as it leads
    to the increased size of journals.
  • Increased special requirements of features such as specialised language, graphics,
    chemical compounds, citations, linking, images and links to numeric databases.
  • Value-added attributes associated with electronic publishing, such as the provision
    of navigation, search, retrieval, analysis, and linking options.
  • The follow-on effects of fewer institutions carrying the fixed-cost base of the
    journal, and currency effects.
  • Relative economic inefficiency of new journals when they are started, which
    factored into overall subscription inflation can contribute up to 1 % of a price
    increase.
  • Inflation (especially salary and paper costs) which has run at about 3.3 % per year
    for the last two decades or more.

In the interest of brevity, I’ll leave it as a reader exercise to find the logical fallacies present in this list. Might I recommend beginning with an argument that more articles might mean more revenue, and therefore should decrease the cost of the journal?

All right, this is already dragging on a bit, but now we’ve come to the section entitled “STM Publishers and the Goal of Open Access”, which I’m sure that you’ve all been waiting for.  They definitely don’t take long to start slinging F.U.D.:

The best-known approach is the author-side payment model, where an article processing charge (mostly in the range € 1500 to € 2200) is levied on each accepted article.

Of course, as yet in this paper they haven’t put dollar values on their own processing charges, nor do they mention the many OA journals that charge no or nominal fees to authors. This is a devious bit of wordplay - they say “mostly in the range” rather than citing an average, which would be much lower.

On the subject of self-archiving:

Publishers do not believe that self-archiving offers a sustainable alternative for scientific publishing. Also, there are serious potential risks with institutional repositories in terms of quality control
and the potential for a reduction in journal revenues

I can’t decide if the second half of that last sentence wasn’t meant to be removed when editing. “It’s a bad idea because we would make less money.” Also, I fail to see serious quality control problems with self-archiving of papers that have been through the peer review process.

They don’t care for embargoes either. I won’t bother to directly quote them here and just paraphrase the argument as “we will make less money that way as well”. They then complete the article by roundly patting themselves on the back for all their hard work.

To summarize, this is all bull. They use cherry-picked and mismatched metrics to contrast the “glory” of for-profit publishing against the F.U.D. of Open Access. All of their points are so easily refuted that it’s almost boring to do so. I don’t suppose I really expected anything else, but I guess I had hoped to see some manner of brilliant argument that would at least lead to an interesting discussion.

Chronicle of Higher Education commentary on uninformed students in the information age

Thursday, April 10th, 2008

This commentary from the excellent Chronicle of Higher Education decrying the lack of knowledge of college students is getting some play on popular social news sites.  From the article:

In recent years I have administered a dumbed-down quiz on current events and history early in each semester to get a sense of what my students know and don’t know. Initially I worried that its simplicity would insult them, but my fears were unfounded. The results have been, well, horrifying.

We’ve heard this before - it seems that at least once a month a major news outlet runs a story discussing the inability of U.S. students to place states on a map, or locate countries, or name foreign leaders, what have you. It’s clear that if we want our students to know this information, we are failing to teach it to them.

The author goes on to place the blame for this at the feet of the media, an argument that has been made many times before. He argues that the emphasis of “infotainment” over “actual news” - the environment where what is going on with celebrities is more important than what’s going on in Afghanistan - reduces knowledge of important world affairs. “The Daily Show With Jon Stewart is as close as many dare get to actual news.” He goes on to say that blogs are not a proper replacement for “real” news sources (that is to say: news agencies still actually reporting world news), and that with the explosion of content on the internet, many young people simply don’t bother keeping track of world events.

As I’ve said, this is a common complaint, but I haven’t seen many people really talk about some of the more critical causes, or methods to correct it. In my opinion, one of the real concerns is grade inflation, tied with a seemingly widespread opinion among primary school educators that history/civics just aren’t as important as reading and math. It’s far easier for the high school civics teacher to drill a few facts into the heads of the students, test on those facts once, and allow for that knowledge to atrophy. Everyone passes, they don’t complain about how civics is such a tough class that they can’t study for their math test, and the teacher doesn’t have to deal with irate parents and concerned administrators. The problem is that when courses are taught this way, the students aren’t batteries for storing the knowledge - they are capacitors. They are pumped full of information which they release onto the test page, and it is out of their concern thereafter. This is why the introductory journalism students can’t say what country Kabul is in.

To be fair, this practice takes place to some degree in every class. There is always a sense of “I’ll never have to use this in the real world, so why try to actually comprehend it” on the part of the students. I was always amazed when people in my classes would attempt really remarkable feats of rote memorization of facts rather than attempt to just understand something.

I think that this will only be solved when we can figure out a way to teach a comprehensive understanding rather than a list of bullet point facts.  There are many people in education research (mostly at the college level) who are working on this, but I have to say I’m not sure that they’ve found the right way yet.  We also have to put a dampener on grade inflation.  I can’t stress enough how perilous I think the situation is in this regard.  For too long, teachers have been reducing course content and test difficulty under pressure from administrators and parents to put on a facade of educational success.  We have to return to an educational regime in which failure is an option.  Only by facing the failure head on can we develop new methods and tools to improve the system and better teach the next generation of students.

A call for a “Wikipedia for Data” is close to the mark

Wednesday, April 9th, 2008

I came across a post from a former Google employee entitled “We Need a Wikipedia for Data”. His sentiments echo many of those you hear from the Open Access proponents:

I think all of these barriers to data are holding back innovation at a scale that few people realize. The most important part of an environment that encourages innovation is low barriers to entry. The moment a contract and lawyers are involved, you inherently restrict the set of people who can work on a problem to well-funded companies with a profitable product.

The problem is that the “data” that Mr. Taylor seems to be interested in is not the data that I, as a scientist, am interested in.  He wants people to deposit maps, stocks, and movie times; I want to see genomes, protein structures, and experimental results.

This is not to say that I don’t support his ideas.  It’s entirely possible that if someone ::cough::Google::cough:: developed this product as a commercial idea (as a way to provide software makers with data for their projects), it could then be modified to function as a “repository of scientific knowledge” with much less effort than building such an entity from the ground up.

The comments to the original blog posting list a few potential sites that are already working on this sort of thing, Freebase seeming to be the most popular.  I gave Freebase a quick look, and it seems like they are doing some really interesting things with cross-referencing of datasets.

For science, I think that it’s asking too much for researchers to independently log findings into an external database.  For this type of system to work, it will have to be combined with machine reading of open access publications in order to automatically populate the database.  This could then be monitored and edited by the community in order to expand on cross links and correct any errors in the automated parsing.
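To sketch what I mean, here is a toy version of that workflow in Python, using the standard library's sqlite3 module: records from an automated parser are loaded into a table, and a community edit corrects one of them and marks it as reviewed. The table layout, the `populate` and `correct` helpers, and the sample record are all hypothetical; a real repository would need a far richer schema and an edit history.

```python
import sqlite3

# Hypothetical records, standing in for the output of an automated parser.
parsed_papers = [
    {
        "title": "A Study of Everything",
        "author": "Accuracy, P.",
        "finding": "Not so complicated",
    },
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE findings (
           id INTEGER PRIMARY KEY,
           title TEXT NOT NULL,
           author TEXT,
           finding TEXT,
           reviewed INTEGER NOT NULL DEFAULT 0  -- 0 = machine-parsed, 1 = checked by a person
       )"""
)


def populate(connection, papers):
    """Insert machine-parsed records; they start out unreviewed."""
    connection.executemany(
        "INSERT INTO findings (title, author, finding) "
        "VALUES (:title, :author, :finding)",
        papers,
    )
    connection.commit()


def correct(connection, row_id, finding):
    """Apply a community correction to one record and flag it as reviewed."""
    connection.execute(
        "UPDATE findings SET finding = ?, reviewed = 1 WHERE id = ?",
        (finding, row_id),
    )
    connection.commit()


populate(conn, parsed_papers)
correct(conn, 1, "Not so complicated as you might think")
rows = conn.execute("SELECT title, finding, reviewed FROM findings").fetchall()
print(rows)
```

Keeping a `reviewed` flag on every row is one simple way to let readers tell machine-parsed entries apart from ones a person has checked.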

It’s not the simplest project ever imagined, but I think it’s entirely feasible.  We just need to find the power of will to make it happen.