As more papers are published in Open Access journals, it becomes possible to text-mine these articles and build large, centralized, and cross-linked databases of scientific knowledge. One of the keys to this effort is the ability to read in the content of scientific articles using computational algorithms. These algorithms need to be able to “understand” the content, at least to the point where they can properly assign things such as the title, authors, system under study, key findings, etc.
Most people working on this problem, which is tied in with the semantic web, believe that articles must be accompanied by markup, or metadata, for this to work. This means that in addition to the actual content of the article (the text of the paper), we need to include some information about the content. For instance, the raw file might look something like this:
<journal>The Journal I Just Made Up</journal>
<issue>1</issue>
<pages>1-2</pages>
<title>A Study of Everything: Not so complicated as you might think</title>
<author1>Accuracy, P.</author1>
The tags inside the angle brackets (<>) are metadata - they tell a computer something about the content they enclose. A program could quite easily scan this raw file and then do things with it, whether that is assigning the content between the tags to a database, or formatting the article for easier reading online.
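To make the "scan this raw file" step concrete, here is a minimal Python sketch that reads the made-up record above into a dictionary. The names `RAW`, `TAG_PATTERN`, and `parse_record` are my own inventions for illustration, and the regular expression assumes flat, non-nested tags like the ones in the example; a real pipeline would more likely use a proper XML parser and an agreed-upon schema.

```python
import re

# The example record from above, copied verbatim.
RAW = """<journal>The Journal I Just Made Up</journal>
<issue>1</issue>
<pages>1-2</pages>
<title>A Study of Everything: Not so complicated as you might think</title>
<author1>Accuracy, P.</author1>"""

# Match <tag>content</tag>, requiring the closing tag to repeat the opening name.
# Assumption: tags are flat (no nesting) and carry no attributes.
TAG_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def parse_record(raw):
    """Return a dict mapping each tag name to the text it encloses."""
    return {tag: text.strip() for tag, text in TAG_PATTERN.findall(raw)}


record = parse_record(RAW)
print(record["title"])
print(record["author1"])
```

Once the fields are in a dictionary like this, writing them to a database row or handing them to a page template is a small step.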
Markup like this is already very common in web development. If you right-click this page and then click on “view source”, you’ll see that the raw file used to generate the page is full of tags like this. Your web browser formats the raw file according to a simple style sheet (also provided by the web developer - you can see the PlausibleAccuracy sheet here) in order to display it on your screen.
So one of the issues with machine-readable scientific articles is that we need to apply some sort of markup like this to the papers. This carries a cost, both in time and money (you presumably have to pay someone to do it), and as such you have to convince the people involved that it’s worth doing.
There is already document-writing software which includes markup and is particularly well-suited to academic papers - LaTeX. In much the same way that Cascading Style Sheets can be used to separate web design from site content, LaTeX uses style files to format a content-rich document. When writing a paper in LaTeX, you simply tag things as you go and focus on writing what is important, rather than wrangling with formatting. You then process the document and a nice-looking file is generated.
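Because a LaTeX source file already tags things like the title, author, and section headings, a program can pull them out directly. The Python sketch below is only an illustration: the sample document in `LATEX_SOURCE` is made up (it reuses the invented paper from earlier, and the section names are placeholders), and the helper names `extract_command` and `extract_metadata` are hypothetical. It also assumes the tagged arguments contain no nested braces, which real papers often violate, so treat it as a starting point rather than a robust parser.

```python
import re

# A made-up LaTeX source, standing in for a submitted paper.
LATEX_SOURCE = r"""
\documentclass{article}
\title{A Study of Everything: Not so complicated as you might think}
\author{Accuracy, P.}
\begin{document}
\maketitle
\section{Introduction}
Some text.
\section{Results}
Some more text.
\end{document}
"""


def extract_command(source, name):
    r"""Return the argument of the first \name{...} in source, or None.

    Assumption: the argument contains no nested braces.
    """
    match = re.search(r"\\" + re.escape(name) + r"\{([^{}]*)\}", source)
    return match.group(1).strip() if match else None


def extract_metadata(source):
    """Collect the title, author, and section headings into a dict."""
    return {
        "title": extract_command(source, "title"),
        "author": extract_command(source, "author"),
        "sections": re.findall(r"\\section\{([^{}]*)\}", source),
    }


metadata = extract_metadata(LATEX_SOURCE)
for key, value in metadata.items():
    print(key, "->", value)
```

The same tags that a journal's style file uses for layout are the ones the script reads, which is the point: the author marks things up once and both uses come along for free.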
It seems to me that writing papers in LaTeX would be a benefit to everyone involved in the publishing of academic papers. Authors can just write up their results, and spend less time wrestling with Microsoft Word to try to shoehorn the paper into the journal’s formatting rules. Journals should be able to write relatively simple style files for the LaTeX submissions to generate documents suitable for publication. In the same vein, those interested in machine parsing of the documents could write software which leverages the same markup for their own uses. There are, however, some concerns.
The main one, in my view, is reluctance on the part of authors to learn LaTeX. It does take a (small) amount of effort to learn, and to be sure the actual writing is more complicated than it is in a What You See Is What You Get (WYSIWYG) editor like Microsoft Word. It’s my experience that for some reason people see the time and effort put into learning LaTeX and including markup during the writing process as more work than the formatting changes that they have to make after the fact in Word, even though those after-the-fact changes may take more time and be more frustrating.
It’s also harder to just read a LaTeX file. It’s sort of like reading a raw web page without a browser - you stumble across the tags. More important than this is group editing of a document; most professors like to receive an electronic copy from the student writing the paper, so they can make their edits directly and return it. As LaTeX tends to output PDF files, this isn’t so simple. There are alternatives, like WYSIWYG editors for LaTeX, but the experience I’ve had with them is less than stellar.
In spite of these issues, I think that LaTeX (or perhaps a modified version of it) is an excellent candidate as we think about ways to generate a semantically addressed body of scientific literature. Perhaps development of specific markup and improved WYSIWYG editors would aid this effort.