June 23, 2015

Markdown to PDF ::: creating a PDF e-book from a series of Markdown files

I was looking for an easy way to read the 'famous' series of articles by Fabien Potencier (https://github.com/fabpot/Create-Your-Framework). They started as a series of blog posts and were later stored on GitHub as *.rst files (strictly speaking reStructuredText, which this post treats as Markdown).


I searched for quite a while for a tool to get them all into a single PDF e-book, because I had become too lazy to retrieve them from my bookmarks every time I needed them.

I came across http://engineeredweb.com/blog/2014/convert-markdown-pdf-using-php, a wonderfully written tutorial for a similar purpose. While looking into the code, I remembered an old `codelet` of mine that can solve lots of similar problems. It used the 'KnpSnappy' bundle.


So I copied some code by Matt Farina from the above-mentioned blog post and modified my own code block a bit, which gave me a solution to my problem.

I will go step by step for my own record:

(1) Bootstrapping:


a) Let's create a new project in our webroot and name it "MarkDownToPdf".
b) Now inside the project let's create the files we will require:

himel in /var/www/MarkDownToPdf$ touch index.php autoload.php composer.json .gitignore

c) Now let's open the project with a suitable IDE (or any tool you are comfortable with); in my case I am using PhpStorm.

d) After that we will edit composer.json to add our required libraries, as follows:

{
  "require" :  {
    "neurosys/file-merger": "dev-master",
    "h4cc/wkhtmltopdf-i386": "0.12.x",
    "h4cc/wkhtmltopdf-amd64": "0.12.x",
    "knplabs/knp-snappy": "0.3.*@dev",
    "dompdf/dompdf": "0.6.*",
    "michelf/php-markdown": "1.4.*",
    "querypath/querypath": "3.*",
    "masterminds/html5": "1.*"
  }
}

e) Now it's time to install the dependencies with Composer. But before that we need to have Composer on our system. To download it we will run the following:
himel in /var/www/MarkDownToPdf$ curl -sS https://getcomposer.org/installer | php
Then run composer:
himel in /var/www/MarkDownToPdf$ php composer.phar install
All the libraries will be in the "vendor" folder.

f) Now that the vendors are downloaded, it is better to fill in the ".gitignore" file to avoid problems later when we add this project to git. My ".gitignore" contains:

/vendor/
/composer.phar
/composer.lock
/.idea

g) Now it's time to write the ``autoload.php``. It will be used to autoload the vendors: it simply requires the autoload file inside the "vendor" folder.
<?php

require_once __DIR__ . '/vendor/autoload.php';


(2) Downloading the files:

Now I will download the files, and for this I will create a ``files`` folder. Note that we can use files at any location, but to make it a complete solution I am keeping them inside the tiny app.
himel in /var/www/MarkDownToPdf$ mkdir files
himel in /var/www/MarkDownToPdf$ cd files
himel in /var/www/MarkDownToPdf/files$ git clone git@github.com:fabpot/Create-Your-Framework.git fabpot

So the book is downloaded to "fabpot/book" inside the "files" folder.

(3) The Code:


The code will first include the required libraries. So let's start like this:
<?php

require_once __DIR__ . '/autoload.php';

use NeuroSys\FileMerger\Merger;
use NeuroSys\FileMerger\Driver\PdfTkDriver;
use NeuroSys\FileMerger\Transformer\ImageTransformer;
use Knp\Snappy\Pdf;


Now I will take the path of the folder containing the markdown files as input from the command line:
$handle = fopen("php://stdin","r");
echo "Path to markdown files (upto folder without trailing slash '/' ):" . PHP_EOL;
$dir = rtrim(fgets($handle, 1024));
fclose($handle);
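
Since everything downstream depends on this path, it may be worth checking it before going on. A minimal sketch, assuming a hypothetical helper named normalize_dir() (my own name, not part of the original script) that trims the input, drops any trailing slash and fails early if the folder does not exist:

```php
<?php

/**
 * Hypothetical helper (not in the original script): cleans up the path typed
 * by the user and makes sure it points to a readable directory.
 */
function normalize_dir($input)
{
    // Remove surrounding whitespace, including the newline left by fgets().
    $dir = trim($input);

    // Drop trailing slashes; a path made only of slashes means the root.
    $trimmed = rtrim($dir, '/');
    $dir = ($trimmed === '' && $dir !== '') ? '/' : $trimmed;

    if ($dir === '' || !is_dir($dir) || !is_readable($dir)) {
        throw new InvalidArgumentException("Not a readable directory: '$dir'");
    }

    return $dir;
}
```

It could be used as $dir = normalize_dir(fgets($handle, 1024)); so that a typo in the path stops the script right away instead of failing later inside the loop.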

Now that we have the folder path, we can read all the files in it.
$files = scandir($dir);

So let's process the files one by one
$pdfs = array();
$domain_name = 'http://fabien.potencier.org/';

$index = 1;
foreach ($files as $file){
    if (!in_array($file, array('.', '..'))) { // scandir() also returns "." and "..", the current and parent directory entries
        
    }
}
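
One thing to watch for: skipping only the dot entries still lets sub-folders through, including the "pdf" folder this script creates on an earlier run. A minimal sketch of a stricter listing, assuming a hypothetical helper list_source_files() (not in the original code) that keeps only regular files with the wanted extensions and sorts them naturally:

```php
<?php

/**
 * Hypothetical helper: returns the regular files in $dir whose extension is
 * in $extensions, in natural order (so "2.md" comes before "10.md").
 */
function list_source_files($dir, array $extensions)
{
    $result = array();

    foreach (scandir($dir) as $file) {
        $path = $dir . '/' . $file;

        // Skips ".", ".." and any sub-folder such as the generated "pdf" one.
        if (!is_file($path)) {
            continue;
        }

        $extension = strtolower(pathinfo($file, PATHINFO_EXTENSION));
        if (in_array($extension, $extensions, true)) {
            $result[] = $file;
        }
    }

    // Natural sort keeps numbered chapters in reading order.
    sort($result, SORT_NATURAL);

    return $result;
}
```

With this, the loop could iterate over list_source_files($dir, array('md', 'rst')) instead of the raw scandir() result. The extension list is my assumption; adjust it to the files you actually have.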

Inside the loop I will convert markdown files to html first:
// md to html
$markdown = file_get_contents( $dir.'/'.$file );
$markdownParser = new \Michelf\MarkdownExtra();
$html = $markdownParser->transform( $markdown );

The output html needs some cleanup: any root-relative links in it (those starting with a single '/') are turned into absolute ones:
// html clean up and clean up links by assigning absolute path to them
$dom = \HTML5::loadHTML( $html );
$links = htmlqp( $dom, 'a' );

foreach ( $links as $link ) {
    $href = $link->attr( 'href' );
    if ( substr( $href, 0, 1 ) == '/' && substr( $href, 1, 1 ) != '/' ) {
        // $domain_name already ends with '/', so trim it to avoid a double slash
        $link->attr( 'href', rtrim( $domain_name, '/' ) . $href );
    }
}

$html = \HTML5::saveHTML( $dom );
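
To make the link rule easier to see (and to test) on its own, here is a minimal sketch of it as a plain function. The name absolutize_href() is hypothetical; it only mirrors the condition used in the loop above and trims the base URL so the result never contains a double slash:

```php
<?php

/**
 * Hypothetical helper: turns a root-relative href ("/foo") into an absolute
 * URL on $base; any other href is returned unchanged.
 */
function absolutize_href($href, $base)
{
    // Root-relative means one leading slash; "//host/..." is protocol-relative.
    $isRootRelative = substr($href, 0, 1) === '/' && substr($href, 1, 1) !== '/';

    if (!$isRootRelative) {
        return $href;
    }

    return rtrim($base, '/') . $href;
}
```

Absolute URLs, protocol-relative URLs and in-page anchors such as "#top" pass through untouched, which matches what the loop does.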

To use KnpSnappy and its PDF feature we need to install "wkhtmltopdf". We can do this with:
himel in /var/www/MarkDownToPdf$ sudo apt-get install wkhtmltopdf
or
himel in /var/www/MarkDownToPdf$ sudo apt-get install xfonts-base xfonts-75dpi
himel in /var/www/MarkDownToPdf$ wget http://sourceforge.net/projects/wkhtmltopdf/files/0.12.2.1/wkhtmltox-0.12.2.1_linux-wheezy-amd64.deb
himel in /var/www/MarkDownToPdf$ sudo dpkg -i wkhtmltox-0.12.2.1_linux-wheezy-amd64.deb

Then we will use this library to convert html to pdf. We save each output pdf file in a "pdf" folder inside the given path, and store its path in an array, which will help us combine them later.
// html to pdf
$snappy = new Pdf( '/usr/bin/wkhtmltopdf' );
$snappy->generateFromHtml( $html, $dir . '/pdf/' . $index . '.pdf' );
$pdfs[$index] = $dir . '/pdf/' . $index . '.pdf';
$index++; // next file gets the next number, otherwise every pdf would be written to 1.pdf
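
The path '/usr/bin/wkhtmltopdf' is hard-coded above, but the binary may live elsewhere depending on how it was installed. A minimal sketch, assuming a hypothetical helper find_binary() (my own name, not part of KnpSnappy) and a list of candidate locations that are only guesses for a typical Linux setup:

```php
<?php

/**
 * Hypothetical helper: returns the first path in $candidates that is an
 * executable file, or null when none of them is.
 */
function find_binary(array $candidates)
{
    foreach ($candidates as $path) {
        if (is_file($path) && is_executable($path)) {
            return $path;
        }
    }

    return null;
}

// Candidate locations are assumptions; adjust them to your system.
$wkhtmltopdf = find_binary(array(
    '/usr/bin/wkhtmltopdf',
    '/usr/local/bin/wkhtmltopdf',
    __DIR__ . '/vendor/h4cc/wkhtmltopdf-amd64/bin/wkhtmltopdf-amd64',
));

if ($wkhtmltopdf === null) {
    echo "wkhtmltopdf was not found in any of the expected places" . PHP_EOL;
}
```

The result could then be passed to new Pdf( $wkhtmltopdf ) in place of the literal path. The same helper would work for locating pdftk.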


We have to install "pdftk" for the merging task.
himel in /var/www/MarkDownToPdf$ sudo apt-get install pdftk

Now, outside the loop, we will iterate over the array of pdf file paths, combine them, and print the final path:
// merging all files
$driver = new PdfTkDriver("/usr/bin/pdftk");
$merger = new Merger($driver);
$merger->addTransformer(new ImageTransformer($snappy));

foreach ($pdfs as $pdf) {
   $merger->addFile($pdf);
}

$merger->merge($dir . '/pdf/CreateYourOwnFrameWork-FabPot.pdf');
echo PHP_EOL;
echo "Combined e-book is: " . $dir . '/pdf/CreateYourOwnFrameWork-FabPot.pdf';
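
For the curious, merging with pdftk on the command line takes the form "pdftk 1.pdf 2.pdf cat output book.pdf". I have not checked how PdfTkDriver builds its command internally, so the following is only a sketch of the idea, with a hypothetical function name and assuming a Linux shell:

```php
<?php

/**
 * Hypothetical helper: builds the shell command that concatenates $inputs
 * (in the given order) into $output using pdftk's "cat" operation.
 */
function build_pdftk_command(array $inputs, $output, $binary = '/usr/bin/pdftk')
{
    if (count($inputs) === 0) {
        throw new InvalidArgumentException('At least one input pdf is required');
    }

    // Quote every path so spaces or special characters cannot break the call.
    $parts = array(escapeshellarg($binary));
    foreach ($inputs as $input) {
        $parts[] = escapeshellarg($input);
    }

    $parts[] = 'cat';
    $parts[] = 'output';
    $parts[] = escapeshellarg($output);

    return implode(' ', $parts);
}
```

The returned string could be handed to exec() if one wanted to skip the merger library, at the cost of handling errors by hand.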

(4) Running:


Now we will run our file:
himel in /var/www/MarkDownToPdf$ php index.php 
Path to markdown files (upto folder without trailing slash '/' ):

We will input the path of the folder as: /var/www/MarkDownToPdf/files/fabpot/book

(5) The Output:


The output is:
Combined e-book is: /var/www/MarkDownToPdf/files/fabpot/book/pdf/CreateYourOwnFrameWork-FabPot.pdf


The book is under a Creative Commons license, which permits me to share the final output.

Slideshare link of the final output is:


The codebase is on GitHub (https://github.com/himelnagrana/MarkDownToPdf).


Thanks.
And a thousand thanks to Matt Farina.