PubMed publication dates

This post was originally written for my company's internal blog, where I proposed how we should handle PubMed data. It might be useful for other organizations looking to manage a local repository of publication metadata.


When we talk about a publication we frequently want to know when it was published. This concept looks like it should be simple, and it probably was back when journals were published in only one medium - print. But more and more frequently, articles are published in both print and electronic form, and those forms are published on different dates. Which date to use is not defined by a universal standard, but is instead specified by the journal (in citations, both dates often appear). So now we have two different dates with some set of rules that defines how they are to be used.

To make matters more complicated, each of these two dates might not be usable in all contexts. For instance, the date may be a range like "Spring 2014" or even just "2014". That makes no difference when displaying a citation, but what if we want to search or sort based on the date? If we search for publications starting in May 2014, should that capture a publication with date "2014"? What if the print date is an incredibly vague "2013" but the electronic date is "13 November 2013" - do we care that the journal says it considers the print date to be canon?

At this point it should be clear that "what is the publication date" is not a complete or meaningful question! Let's take a step back and look at all of these aspects of a publication date in detail - and see how PubMed deals with them.

Print/Electronic Dates and the PubModel

A publication can have a print and/or electronic publication date. (Side note: in some places we refer to a "pubdate" field, which is a kind of default date field. In most cases it is the print date, but for a publication that only has an electronic format, it is the electronic date.) We have another attribute called "pubmodel", which we get from PubMed as it was specified to them by the journal that published the article. This is the key that tells us how to use these dates. Its effect is most visible in citations, where both dates are frequently used but in very different ways.

PubMed defines the following PubModel values:

  1. Print - use the print publication date in the citation. An electronic publication date may still exist for the publication, but we ignore it when constructing the citation!
  2. Electronic - there is only an electronic date
  3. Print-Electronic - there is an earlier electronic publishing date and a print date, and the journal wants to use the print date. The electronic date will be appended to the citation with "Epub" followed by a date.
  4. Electronic-Print - there is an earlier electronic publishing date and a print date, and the journal wants to use the electronic date. The print date will be appended to the citation with "Print" followed by a date.
  5. Electronic-eCollection - in this case the "pubdate" field actually represents an eCollection date instead of a print date, plus we have an electronic publishing date. We use the electronic publishing date, with "eCollection" appended to the citation followed by a date.

Before we implemented a new field to hold an electronic date and started using pubmodel, our dates were inaccurate for Electronic-Print and Electronic-eCollection. In addition, our citations for Print-Electronic did not include "Epub". The new system corrects this.

Following are some examples for each. Note that I am calling it PubDate instead of print date, because this is how PubMed sends it, though it effectively is the print date for all pubmodels except Electronic.

PubModel: Print
PubDate: 2003 Oct
Citation: Cassetty CT, Leonard AL. Epidermal nevus. Dermatol Online J 2003 Oct;9(4):43

PubModel: Print-Electronic
PubDate: 2000 Jan
Electronic Date: 25 November 1999
Citation: Soon MS, Lin OS. Inflammatory fibroid polyp of the duodenum. Surg Endosc 2000 Jan;14(1):86. Epub 1999 Nov 25

PubModel: Electronic
PubDate: 28 January 2004
Citation: Leslie M. Hampering a heartbreaker. Antibiotic might stem injury from heart attack. Sci Aging Knowledge Environ 2004 Jan 28;2004(4):nf13

PubModel: Electronic-Print
PubDate: 2004
Electronic Date: 16 January 2004
Citation: Edgar RC. Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Res 2004 Jan 16;32(1):380-5. Print 2004

PubModel: Electronic-eCollection
PubDate: 2012
Electronic Date: 25 January 2013
Citation: Wangari-Talbot J, Chen S. Genetics of melanoma. Front Genet 2013 Jan 25;3:330. doi: 10.3389/fgene.2012.00330. eCollection 2012
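
As a rough illustration of the rules and examples above, the following Python sketch picks the date shown in the body of the citation and the trailing note from the PubModel. The function name citation_dates and its arguments are hypothetical and not part of any PubMed API. It also assumes the dates have already been formatted the way they appear in a citation (for example "1999 Nov 25").

```python
from typing import Optional, Tuple


def citation_dates(pubmodel: str, pubdate: str,
                   electronic_date: Optional[str] = None) -> Tuple[str, str]:
    """Return (primary date, trailing note) for a citation.

    Hypothetical helper: both dates are assumed to be preformatted
    citation strings, not raw PubMed values.
    """
    if pubmodel in ("Print", "Electronic"):
        # Print: any electronic date is ignored in the citation.
        # Electronic: pubdate already holds the electronic date.
        return pubdate, ""
    if electronic_date is None:
        raise ValueError(f"{pubmodel} requires an electronic date")
    if pubmodel == "Print-Electronic":
        return pubdate, f"Epub {electronic_date}"
    if pubmodel == "Electronic-Print":
        return electronic_date, f"Print {pubdate}"
    if pubmodel == "Electronic-eCollection":
        return electronic_date, f"eCollection {pubdate}"
    raise ValueError(f"unknown PubModel: {pubmodel}")
```

Called with the Print-Electronic example above, citation_dates("Print-Electronic", "2000 Jan", "1999 Nov 25") yields "2000 Jan" as the primary date and "Epub 1999 Nov 25" as the trailing note.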


Date Representation

The dates that we pull in for both print and electronic publication are simply text that could be a day, a month, a year, or any of several date ranges. We have an extensive parser that takes text in a number of formats and tries to convert it into a date object in our code that represents a specific day. But to do this we need to know which day should be the default for date ranges like "Spring 2014". PubMed already had rules in place for this, to decide how its site's search function would work:

Publication dates without a month are set to January, multiple months (e.g., Oct-Dec) are set to the first month, and dates without a day are set to the first day of the month. Dates with a season are set as: winter = January, spring = April, summer = July and fall = October.

We use the same strategy, and also apply a day: the first of the month.
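
To make the defaulting rules concrete, here is a minimal Python sketch. It is not our actual parser, which handles many more formats. The function name parse_pub_date and the token-based approach are assumptions made for illustration. It only recognizes English month names, their three-letter abbreviations, and the four seasons named above.

```python
import re
from datetime import date

MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
MONTHS = {}
for number, name in enumerate(MONTH_NAMES, start=1):
    MONTHS[name] = number        # full name, e.g. "november"
    MONTHS[name[:3]] = number    # abbreviation, e.g. "nov"

# Season defaults as stated in the PubMed rule quoted above.
SEASONS = {"winter": 1, "spring": 4, "summer": 7, "fall": 10}


def parse_pub_date(text: str) -> date:
    """Convert a textual publication date to a specific day (illustrative)."""
    year = month = day = None
    for token in re.findall(r"[A-Za-z]+|\d+", text):
        word = token.lower()
        if token.isdigit():
            if len(token) == 4 and year is None:
                year = int(token)
            elif len(token) <= 2 and day is None:
                day = int(token)
        elif month is None:
            # The first month or season wins, so "Oct-Dec" becomes October.
            month = MONTHS.get(word) or SEASONS.get(word)
    if year is None:
        raise ValueError(f"no year found in {text!r}")
    if month is None:
        return date(year, 1, 1)          # no month: January 1
    return date(year, month, day or 1)   # no day: first of the month


print(parse_pub_date("Spring 2014"))     # 2014-04-01
print(parse_pub_date("2013 Oct-Dec"))    # 2013-10-01
print(parse_pub_date("2014"))            # 2014-01-01
```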


Putting it all together

So we now effectively have five different values to keep track of, two of which are computed.

  1. PubModel
  2. Print Date (text)
  3. Print Date (computed - a date representation)
  4. Electronic Date (text)
  5. Electronic Date (computed - a date representation)
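
One way to picture these five values is as a single record. The sketch below is only an illustration: the class name PublicationDates and its field names are hypothetical, not our actual schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PublicationDates:
    """The five values tracked per publication (hypothetical field names)."""
    pubmodel: str
    print_date_text: Optional[str]        # as received, e.g. "2000 Jan"
    print_date: Optional[date]            # computed from the text
    electronic_date_text: Optional[str]   # as received, e.g. "25 November 1999"
    electronic_date: Optional[date]       # computed from the text


# The Print-Electronic example from earlier, with defaults already applied.
record = PublicationDates(
    pubmodel="Print-Electronic",
    print_date_text="2000 Jan",
    print_date=date(2000, 1, 1),
    electronic_date_text="25 November 1999",
    electronic_date=date(1999, 11, 25),
)
```

Keeping the original text next to the computed date means a citation can still show "2000 Jan" while searching and sorting use the computed day.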


[this paragraph addresses our internal, curated publication system]
Generally I want to be agnostic about how to use these, unless the client has a very specific need, because the appropriate values to use here can change. Originally for the [redacted], we determined whether print or electronic was to be the primary date and then computed the day/month/year representation of it. This value is now included in all exports that use the "XML (computed)" format as the tag <Computed_pubdate>. For searching within a portal, we present an option: search by print date, by electronic date, by either date, or by the primary preferred date as determined by PubModel.
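
The four search options can be sketched roughly as follows. This is an assumption-laden illustration rather than our portal code: Record, primary_date, and in_range are hypothetical names, and the dates are the computed (day-level) values.

```python
from datetime import date
from typing import NamedTuple, Optional

# PubModels whose primary date is the electronic one.
ELECTRONIC_PRIMARY = {"Electronic", "Electronic-Print", "Electronic-eCollection"}


class Record(NamedTuple):
    pubmodel: str
    print_date: Optional[date]        # computed pubdate
    electronic_date: Optional[date]   # computed electronic date


def primary_date(rec: Record) -> Optional[date]:
    """Pick the preferred date by PubModel, falling back to whichever exists."""
    if rec.pubmodel in ELECTRONIC_PRIMARY:
        return rec.electronic_date or rec.print_date
    return rec.print_date or rec.electronic_date


def in_range(rec: Record, start: date, end: date, mode: str = "primary") -> bool:
    """Check a record against a date range using one of four search modes."""
    if mode == "print":
        dates = [rec.print_date]
    elif mode == "electronic":
        dates = [rec.electronic_date]
    elif mode == "either":
        dates = [rec.print_date, rec.electronic_date]
    elif mode == "primary":
        dates = [primary_date(rec)]
    else:
        raise ValueError(f"unknown search mode: {mode}")
    return any(d is not None and start <= d <= end for d in dates)


# A Print-Electronic record with PubDate "2013" and electronic date 3 April 2013.
rec = Record("Print-Electronic", date(2013, 1, 1), date(2013, 4, 3))
april = (date(2013, 4, 1), date(2013, 4, 30))
print(in_range(rec, *april, mode="primary"))     # False
print(in_range(rec, *april, mode="electronic"))  # True
```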

One of the reasons to have options on which dates to use at any given time is that sometimes what the journal specifies doesn't make sense! For example, consider an article from PLoS One that gives a PubDate of "2013" and an electronic publication date of "3 April 2013". It specifies a PubModel of Print-Electronic, meaning the journal wants you to use "2013" as the primary date - which isn't very helpful. If we wanted to find publications that came out in April 2013 we'd have to search by electronic date and ignore the PubModel! Since roughly 2014 PLoS has actually started using Electronic-eCollection as the PubModel, which makes much more sense, but their older publications have not been updated. Also, it is still not uncommon to see a Print-Electronic publication where the electronic date is more exact, and thus potentially more useful.
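
If a client did want to prefer whichever date is more exact, one possible approach is to score the precision of the date text. The sketch below is hypothetical and not something our system does today. The function name date_precision and the four levels are assumptions, and it treats a month range such as "Oct-Dec" as month-level.

```python
import re

MONTH = re.compile(r"(?i)\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)")
SEASON = re.compile(r"(?i)\b(winter|spring|summer|fall)\b")
DAY = re.compile(r"\b\d{1,2}\b")  # a 1-2 digit number, so a 4-digit year never matches


def date_precision(text: str) -> int:
    """Score a date string: 1 = year, 2 = season, 3 = month, 4 = day."""
    if MONTH.search(text):
        return 4 if DAY.search(text) else 3
    if SEASON.search(text):
        return 2
    return 1


# The PLoS One case: the electronic date is far more exact than the PubDate.
print(date_precision("2013"))          # 1
print(date_precision("3 April 2013"))  # 4
```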