Markdown for science and academia – working with large files

Posted on May 18, 2021 by Martin Héroux 3 comments

This series of posts has looked at how to use Markdown to prepare scientific and academic documents. Many of us work with large documents that have many parts, and these can quickly become unwieldy. In LaTeX, we can use the \input{} command to combine (or concatenate) several separate .tex files into one master file. This is very useful when working on large documents like a PhD thesis, or on documents that are regularly updated in parts, like study or meeting notes. But how do we combine a series of Markdown files into a single, master Markdown file?

Bash/Zsh to the rescue

Because Markdown files are simply text files, we can use various tools that already exist to work with them. Most shells, including Bash (Linux, Windows) and Zsh (Mac), give us access to a command called cat. Here, cat is short for concatenate.

We can use this command to concatenate a bunch of Markdown files into a master Markdown file, which we will then process using Pandoc.
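Before turning to a real project, here is a minimal sketch of what cat does with two files. The file names and contents are made up for the illustration, and everything happens in a throwaway directory.

```shell
# Work in a throwaway directory so nothing in a real project is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Two tiny Markdown files (hypothetical names and content).
printf '# Part A\n\nFirst part.\n' > a.md
printf '# Part B\n\nSecond part.\n' > b.md

# cat writes the files to standard output one after the other;
# '>' sends that output to combined.md, replacing any older copy.
cat a.md b.md > combined.md

# Show the result: Part A first, then Part B.
cat combined.md
```

The order of the arguments is the order of the content, which is why file order matters in the rest of this post.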

Example project

Let’s pretend we have a project that has the following file and folder structure:

.
├── entries
│   ├── entry_1.md
│   ├── entry_2.md
│   └── entry_3.md
└── make.sh

In our top-level folder, we have a make.sh file. This is a Bash script that we will run. We don’t have to use such a file, but it greatly reduces the amount of typing we have to do, makes it easy to repeat commands to re-generate or update our document, and can also do some clean-up.

Our project also contains an entries folder that includes three different Markdown files.

So, the first thing we need to do is concatenate our various entries into a master Markdown file. From the base directory of our project, we can run the following command:

cat ./entries/entry_1.md ./entries/entry_2.md ./entries/entry_3.md > doc.md

Alternatively, if the alphabetical order of our file names matches the order we want in the document, we can run this command:

cat ./entries/*.md > doc.md
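One detail to watch: cat joins files byte for byte and adds nothing between them. If an entry does not end with a newline, the first line of the next entry is glued onto its last line, and a heading can end up in the middle of a paragraph. A minimal sketch of one way around this is to loop over the files and write a blank line after each one. The demo directory and file contents below are made up for the illustration.

```shell
# Build a throwaway copy of the example layout (contents are made up).
demo_dir=$(mktemp -d)
mkdir "$demo_dir/entries"
# No trailing newline on the first entry, to provoke the problem.
printf '# Entry 1\n\nSome notes.' > "$demo_dir/entries/entry_1.md"
printf '# Entry 2\n\nMore notes.\n' > "$demo_dir/entries/entry_2.md"
cd "$demo_dir" || exit 1

# Plain cat: "Some notes." and "# Entry 2" end up on the same line.
cat entries/*.md > glued.md

# Loop instead, writing a blank line after every file.
for f in entries/*.md; do
    cat "$f"
    printf '\n\n'    # ends the last line and adds an empty one
done > doc.md
```

Extra blank lines between blocks are harmless in Markdown, so entries that already end with a newline are not affected.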

We can now use Pandoc to generate our PDF document:

pandoc doc.md --output=doc.pdf

As mentioned previously, we can use a make.sh file to help automate this process. We can include these commands (and an additional clean-up command that deletes the temporary doc.md file) in our make.sh file. The file would contain the following:

cat entries/*.md > doc.md
pandoc doc.md --output=doc.pdf
rm doc.md

And we would run the file using the following command:

bash make.sh

This will generate our doc.pdf file, made from the concatenated content of our three Markdown files.
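A slightly more defensive variant of make.sh is sketched below, on the assumption that we want the script to stop at the first error and never leave a temporary Markdown file behind. The build_pdf function name and the use of mktemp for the temporary file are choices made for this illustration, not part of the script above.

```shell
#!/usr/bin/env bash
# Sketch of a more defensive make.sh. build_pdf is a made-up helper name.

# The body is wrapped in ( ) so it runs in a subshell: the options and
# the trap set below stay local to one call.
build_pdf() (
    set -eu                        # stop on errors and on unset variables
    entries_dir=$1
    output=$2

    tmp_md=$(mktemp)               # uniquely named temporary file
    trap 'rm -f "$tmp_md"' EXIT    # delete it when the subshell exits

    cat "$entries_dir"/*.md > "$tmp_md"
    # The temporary file has no .md extension, so name the format explicitly.
    pandoc --from=markdown "$tmp_md" --output="$output"
)

# Build only when the entries folder of the example project is present.
if [ -d entries ]; then
    build_pdf entries doc.pdf
fi
```

Run from the project folder with bash make.sh, as before. If pandoc fails, the script exits with a non-zero status and the temporary file is still removed.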

Summary

This short post demonstrated how we can use our scientific software engineering skills to manage and generate large documents. Bash was used to stitch our documents together and run our various commands. Pandoc was used to process and convert our document to PDF (using LaTeX behind the scenes).


3 comments

mokagio
May 19, 2021 7:59 pm
Great post! Thanks for sharing.

Worth pointing out that the * in the glob pattern will return files in alphabetic order. That is, entry_10.md will be before entry_2.md.

You can get around that by using zeroes in the file names, e.g.: entry_001.md, entry_010.md, entry_100.md.

Another option is to sort the entries using the “version number” mode:
```
cat $(ls ./entries/*.md | sort -V) > doc.md
```
- Martin Héroux
  May 20, 2021 12:18 am
  
  Very cool. Thank you for the suggestion, I always struggle with ordering of files!
  
  Also, I was reading the pandoc manual the other day and realised that you can have multiple markdown files in your call, and the files are concatenated in order before being processed! For example:
  
  pandoc -o file.pdf entry_1.md entry_2.md entry_3.md
  
  - mokagio
    May 20, 2021 1:46 am
    
    Neat 🙂
    
