How Pandoc is helping us redefine our Publishing Stack

The publishing process has become simpler in NON-STEM domains

Shanu Kumar
Typeset Blog

--

Converter Cover | credits:unsplash.com

Edit1: This blog post is available in a new location — Typeset Resources for Pandoc for Publishing Stack

Publishing scientific results in a timely fashion is an important step for progressing new research findings into the world of practice and society.

However, scientific publishing models show substantial diseconomies of scale when measured by the cost of publishing a single printed paper. Most of that cost is attributed to the “Production” phase, which may include converting the source document (e.g. MS-Word) into:

  • Typeset PDF
  • ePUB
  • HTML
  • XML (for archiving)

At Typeset, we work with editors and academic publishers to address this problem. Our goal is to build a production process that’s cheap and efficient for the ecosystem.

As part of this goal, we experimented with a variety of technologies in the ecosystem. Different pipelines were explored for the “Production” Stack. Accuracy and time were critical parameters.

After six months of working with the Pandoc community, our development team concluded that compatibility with Pandoc should take priority over other development choices. Once Typeset decided to work with Pandoc rather than against it, it became easier to develop new features and improve existing ones. Plain Markdown is one example: we chose to express document styles as markup rather than as a separate document language or format. This choice is more in line with Pandoc’s philosophy and makes it easier to implement sanity checks on formatting objects.

Core Problem

Now, let’s try to understand the core problem, shall we?

The web started to revolutionize publishing in the 1990s, but it took more than a decade for this to be reflected in scientific publishing. In 2005, the journal “Nature Methods” published an article that marked the beginning of the HTML5 revolution in scientific publishing. Since then, the adoption of HTML5 has been rapidly increasing, especially among publishers of research articles.

However, for most publishers, the source file is still MS-Word, and PDF is the de facto medium of publishing. Melding them into an HTML standard is a challenge, to say the least. Every manuscript has a certain degree of variance, which further increases the complexity of conversion. Learn why PDF to XML conversion is important for an academic publisher.

How to solve it — Building the Publishing Stack

The business use case of publishers would continue to revolve around the generation of PDF, HTML, ePUB, and XML. At least for the time being.

The inefficiencies in the system could be curbed if there were a single source file from which all the other publication formats were generated. A “Universal Format Generator” is the need of the hour.

Now, before we go into specifics, let’s understand what Pandoc actually is.

Pandoc is a universal document converter. It reads text written in one format and produces equivalent output in another.

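Dozens of markup and document formats are supported, including Markdown, HTML, Microsoft Word (docx), LaTeX, MediaWiki, DokuWiki, TikiWiki, DocBook XML, OpenOffice.org/LibreOffice ODT, EPUB, JATS XML, and many more. PDF is supported as an output format only, produced through an external engine such as LaTeX.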

Step 1:

The first step involved is taking the production-ready MS-Word file and converting it into Markdown via Pandoc.

Through Pandoc, this workflow is intended to strip certain “nuances” inherent in Word documents and produce a common structure that is easier to cross-reference. Such nuances include the presentational nature of Word styles, custom formatting created by users of Word, certain protected settings (such as track changes), revisions, comments, hashes, etc.
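As a rough illustration of this first step, the sketch below builds the Pandoc command line in Python. The file names (manuscript.docx, manuscript.md) and the media folder are hypothetical, and the chosen options are one reasonable setup rather than the exact configuration used at Typeset.

```python
import shutil
import subprocess
from pathlib import Path


def docx_to_markdown_cmd(docx_path, md_path, media_dir="media"):
    """Build the Pandoc command for the Word -> Markdown step."""
    return [
        "pandoc",
        str(docx_path),
        "--from", "docx",
        "--to", "markdown",
        # Accept tracked insertions, drop deletions and comments.
        "--track-changes=accept",
        # Write embedded images into a folder next to the Markdown file.
        f"--extract-media={media_dir}",
        # Do not hard-wrap lines, which keeps later diffs readable.
        "--wrap=none",
        "--output", str(md_path),
    ]


if __name__ == "__main__":
    source = Path("manuscript.docx")  # hypothetical input file
    cmd = docx_to_markdown_cmd(source, Path("manuscript.md"))
    print(" ".join(cmd))
    # Run the conversion only when Pandoc is installed and the input exists.
    if shutil.which("pandoc") and source.exists():
        subprocess.run(cmd, check=True)
```

Keeping the command in one function makes it easy to apply the same options to every manuscript in a batch.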

“Keeping as much semantic meaning as possible, while removing complexities”.

A usual question that might follow:

Q.) Isn’t Markdown limiting for complex scientific articles?
Ans.) Plain Markdown is a lightweight markup language, but Pandoc’s Markdown is considerably richer than the original. Like AsciiDoc, it goes well beyond simple formatting: its extensions cover tables, images with captions, syntax-highlighted code blocks, LaTeX math, footnotes, and citations with automatically generated bibliographies. That is enough to express most scientific articles.

Pandoc takes complete advantage of the above philosophy.

In addition, Pandoc is extensible. You can use it as a command-line tool, or you can use its Haskell library to embed conversion functionality in other programs. Custom readers and writers can be written in Lua. Pandoc filters extend the set of things Pandoc can do: they are small programs that modify the document between parsing and writing, so behavior can be customized without changing Pandoc itself.

(P.S. For those truly seeking to understand the power of Pandoc, it is important to know that Pandoc is written in Haskell. Every reader parses its input into a single internal representation, an abstract syntax tree (AST) built from a fixed set of block and inline elements, and every writer renders that tree into an output format. The tree can be transformed with Haskell libraries, with Lua filters, or with filters in any language that can read and write its JSON serialization.)
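To make the idea of transforming the document tree concrete, here is a minimal sketch of a filter that works on Pandoc's JSON serialization of the AST. The `walk` and `demote_headers` helpers are our own illustrative names, not part of Pandoc, and the sample document (including its API version number) is a hand-written stand-in for what `pandoc -t json` would produce.

```python
import json


def walk(node, action):
    """Apply `action` to every element (a dict with a "t" tag), depth first."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            node = action(node)
        return {key: walk(value, action) for key, value in node.items()}
    return node


def demote_headers(element):
    """Push every heading one level down, capped at level 6."""
    if element.get("t") == "Header":
        level, attr, inlines = element["c"]
        return {"t": "Header", "c": [min(level + 1, 6), attr, inlines]}
    return element


if __name__ == "__main__":
    # Hand-written sample in the shape of Pandoc's JSON AST (illustrative only).
    sample = {
        "pandoc-api-version": [1, 23],
        "meta": {},
        "blocks": [
            {"t": "Header",
             "c": [1, ["intro", [], []], [{"t": "Str", "c": "Introduction"}]]},
            {"t": "Para", "c": [{"t": "Str", "c": "Hello"}]},
        ],
    }
    print(json.dumps(walk(sample, demote_headers), indent=2))
```

To use the same logic as a real filter, the script would read the JSON document from standard input and write the transformed JSON to standard output, and would be passed to Pandoc with the `--filter` option.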

Step 2:

Editing the Markdown file in a text editor, then converting it to the required production formats.

Markdown has already become the de facto standard for documentation in the software industry, and many text editors support it. Once the Markdown files are available, the editorial team can use a text editor (e.g. Visual Studio Code) to make minor changes wherever needed.

At this stage, Pandoc is run a second time to convert the edited Markdown file to HTML, ePUB, JATS XML, and PDF.
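The second run can be sketched as one command per output format, all generated from the same Markdown source. The file name manuscript.md is hypothetical, and the choice of xelatex as the PDF engine is an assumption; any PDF engine supported by Pandoc could take its place.

```python
import shutil
import subprocess
from pathlib import Path

# (Pandoc writer, output extension, extra options) for each production format.
TARGETS = [
    ("html5", ".html", []),
    ("epub3", ".epub", []),
    ("jats", ".xml", []),
    # PDF goes through LaTeX; the engine named here is an assumed choice.
    ("latex", ".pdf", ["--pdf-engine=xelatex"]),
]


def production_cmds(md_path):
    """Build one Pandoc command per output format from a single Markdown file."""
    src = Path(md_path)
    common = ["pandoc", str(src), "--from", "markdown", "--standalone"]
    return [
        common + ["--to", fmt, *extra, "--output", str(src.with_suffix(ext))]
        for fmt, ext, extra in TARGETS
    ]


if __name__ == "__main__":
    source = Path("manuscript.md")  # hypothetical edited Markdown file
    for cmd in production_cmds(source):
        print(" ".join(cmd))
        # Run each conversion only when Pandoc is installed and the input exists.
        if shutil.which("pandoc") and source.exists():
            subprocess.run(cmd, check=True)
```

Because every format is produced from the same file, a last-minute correction only has to be made once before the commands are rerun.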

The system is repeatable, predictable, and scalable. Any last-minute production changes can be applied to the enhanced markdown files.

We understand that this system is not perfect. It still depends on manual assistance, in line with the 80–20 rule: since 80% of the work is done by software, it is acceptable for editors to handle the remaining 20% manually.

At Typeset, we are using this stack for the Social Sciences, Economics, and Biology domains. As the stack matures, we plan to introduce it for complex implementations (Maths and Physics manuscripts).

Until then, fingers crossed.
