Convert Docx To Markdown With Pandoc

I prefer to use Microsoft Word for most of my writing because its spelling and grammar checker is superior to that of every other word processor or text editor I have tried. In addition, Word has text-to-speech built in. I have my text read aloud to me in order to catch errors, and I catch a lot of errors this way. While I write my blog posts in English, English is not my first language and I need these tools to keep spelling and grammar errors to a minimum.

I use the static site generator Pelican for this blog, and it generates the blog from either reStructuredText or markdown files. I have written about Pelican in my blog post The Static Site Generator Pelican VS WordPress.

I have been using Pandoc to convert markdown to Word documents or PDFs for years. A Google search for a way to convert from Word to markdown did not give any usable result. Therefore, up until now I have just copied and pasted the text, making sure not to add any markdown syntax until after I had spell-checked the text in Word.

Then a couple of weeks ago I was reading the Pandoc docs to solve a different problem and came across the section describing how Pandoc can convert from docx to markdown. I do not know if this is new or why Google did not find it for me, but I immediately forgot the problem I was trying to solve and began testing it.

It turns out to be quite simple to convert a docx to markdown. The following example is from the Pandoc demos site.
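A minimal sketch of that basic conversion, wrapped in a small shell function so the file names can vary. The function name and the file names in the usage line are placeholders of my own, not taken from the demos page.

```shell
# Sketch only: docx_to_markdown is a hypothetical helper name.
# $1 = input .docx file, $2 = output .md file
docx_to_markdown() {
  # -s          produce a standalone document
  # -t markdown write Pandoc's markdown
  # -o FILE     write the result to FILE instead of stdout
  pandoc -s "$1" -t markdown -o "$2"
}
```

It could be called as `docx_to_markdown post.docx post.md`. Pandoc works out the input format from the .docx extension.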

However, the generated markdown from the above command has a few issues.

The lines are hard-wrapped at a short fixed width (Pandoc's default is 72 columns). I do not know why this is the default but I do not like it. This is fortunately quite easy to fix with the option --no-wrap (spelled --wrap=none in newer Pandoc releases).

Links do not use the reference style. I prefer reference-style links because they make the text less cluttered by moving the link itself to the bottom of the file. This is also easy to fix with the option --reference-links.

With the two options added the command looks like this.
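A sketch of the extended command, again as a small shell function with a placeholder name. It assumes a current Pandoc release, where the unwrapped-lines option is written --wrap=none because the older --no-wrap spelling is no longer accepted.

```shell
# Sketch only: docx_to_markdown_clean is a hypothetical helper name.
# $1 = input .docx file, $2 = output .md file
docx_to_markdown_clean() {
  # --wrap=none        do not hard-wrap paragraphs (replaces --no-wrap)
  # --reference-links  collect link targets at the bottom of the file
  pandoc -s "$1" --wrap=none --reference-links -t markdown -o "$2"
}
```

Usage would be `docx_to_markdown_clean post.docx post.md`.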

Now the generated markdown is very readable and close to what I would write myself. I only use Word to write text with simple formatting like lists, italic, bold, and links. I add the syntax for images and code to the generated markdown file, alongside the metadata that Pelican needs. Although I do not use it at this time, Pandoc can also extract images from a docx.

The option to extract images from the docx file and more can be found at the Pandoc options page.

Edit: The options page URL has changed and is now http://pandoc.org/README.html#reader-options
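For the image case, a sketch using Pandoc's --extract-media option, which I have not needed myself. The function name and the directory name in the usage line are placeholders, and the wrapping option is written the way current Pandoc releases expect it.

```shell
# Sketch only: docx_to_markdown_with_media is a hypothetical helper name.
# $1 = input .docx file, $2 = output .md file, $3 = directory for extracted images
docx_to_markdown_with_media() {
  # --extract-media=DIR saves images embedded in the docx under DIR
  # and makes the image references in the output point at those files
  pandoc -s "$1" --wrap=none --reference-links --extract-media="$3" -t markdown -o "$2"
}
```

Calling `docx_to_markdown_with_media post.docx post.md media` would write the markdown to post.md and the images to a folder named media.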

So there you have it, sometimes what you need is right under your nose :).

3 Comments

  • Elena Sheremetyeva

    April 27, 2015 at 14:49

    Hi, Martin,

    I would suggest you have a look at Writage (www.writage.com). It is possibly exactly what you are looking for! It is a plugin which turns Word into your Markdown WYSIWYG editor, so you will be able to create or open and edit Markdown files from Word. Have a try!

    • martinronn

      April 28, 2015 at 09:49

      Hi Elena,

      Writage is a neat plugin for Word that I definitely did not know existed. I installed it and tried it out. Writage uses Pandoc, just as I do, for the conversion. However, Writage uses the default settings of Pandoc and does not, to my knowledge, provide a way to use any of the Pandoc options. This is a deal breaker for me because, as I wrote in the blog post, I do not like the default output of Pandoc.

      To make Pandoc easy to use I wrote a batch script that lives on my desktop. I can just drag and drop a Word document onto it and it creates the markdown file. This script also adds some metadata to the markdown file that the static site generator I use for my blog (Pelican) needs. While it is not integrated into Word, I find this approach quite efficient and easy to use.

      Another thing to consider is that, according to their website, the program is free to try because it is in development. Therefore, at some point the plugin will presumably start costing money. I do not mind paying for good software; however, I see no reason to pay for what I already have.

      Writage is a nice plugin for Word and a nice wrapper for Pandoc. It absolutely has potential; however, it is too constrained for me and I prefer my own solution at this point.

      Thanks for the suggestion.

      • Luis

        September 29, 2015 at 07:37

        Martin,

        Do you mind sharing part of or the entire script that you developed? I am interested in the automation of creating a .md from a .docx.

        Best,