Serverless PDF Generation from HTML (WYSIWYG as PDF)

GSSwain
10 min read · Mar 14, 2021

In this post, we’ll learn how to create a PDF document from HTML using Puppeteer. We’ll also look at a serverless solution using AWS Lambda. This solution can run on most public clouds and on premises with minor or no modifications.

Prerequisite:

You must be familiar with Node.js and ES6 to follow along with the code examples.
You must be familiar with AWS Lambda and AWS SAM to follow the AWS-based serverless example.

How do you generate PDF?

Generally you have a template with placeholders. You need to replace the placeholders with actual data and generate a PDF using some compute. The text of the template may come from one team (product and/or legal), while the look and feel may come from another team (UX or CX). From a developer’s perspective, you get a sample PDF (with dummy data as placeholders) attached to a story on an agile board. You then reverse engineer this sample PDF and write code to generate similar PDFs with different sets of data.

Throughout my professional career, I have used multiple tools for generating PDFs, including OpenText StreamServe, iText, JasperReports, Apache PDFBox, Adobe ColdFusion and jsPDF. Being a developer at heart, I certainly have my own bias about which of them is best. In 2017, at GDDIndia, I was introduced to Puppeteer. (I was there and it was all real. I badly miss being physically present at such events ever since Covid-19 emerged.) Here is the YouTube video of the session.

With just a few lines of code, one can generate a PDF from an HTML page.
Is generating a PDF this simple? Yes, it is!

Puppeteer:

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. By default, it launches in headless mode and can do everything a modern browser can, including rendering HTML with CSS. It is under Apache License 2.0 and comes with permission for commercial use.

Show me the code

Here is the code block, which generates a PDF given a URL and a name for the PDF file.

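const puppeteer = require('puppeteer');

const generatePDF = async (pageUrl, newPdfFileName) => {
  // Launch a headless browser (Puppeteer's default mode).
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until there have been no network connections for at least 500 ms.
    await page.goto(pageUrl, { waitUntil: 'networkidle0' });
    // Write the rendered page to /tmp as an A4 PDF, including CSS backgrounds.
    await page.pdf({
      path: `/tmp/${newPdfFileName}`,
      format: 'a4',
      printBackground: true
    });
  } finally {
    // Always close the browser, even if navigation or printing fails.
    await browser.close();
  }
};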

What does the above code do?

It uses Puppeteer to launch a browser. Then it opens a new page, navigates to the given URL and waits until the page is fully loaded (with 'networkidle0', "fully loaded" means that there have been no network connections for at least half a second). Then it generates a PDF of the rendered page and finally closes the browser.

Sample Usage:

The following line of code creates a PDF version of a sample HTML page using the above function.

generatePDF('https://gsswain.com/print-sample/', 'Print-sample.pdf');

Do I need to host the HTML pages publicly?

No.
1. The above code only needs to be able to reach your HTML page from wherever it runs. You can host the pages privately (on your intranet, for example, or behind a private API which responds with the HTML pages to be printed).
2. The browser can render local HTML files with file:///<path-to-your-html-page>, e.g. page.goto('file:///Users/gsswain/Documents/gsswain.com/print-sample/index.html').
3. You can directly render HTML by calling the page.setContent(<html-to-be-rendered>) method instead of calling page.goto() in the above code snippet.
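
Options 2 and 3 can be sketched roughly as follows. This is an illustration rather than code from the repo: toFileUrl and renderHtmlToPdf are hypothetical helper names, and renderHtmlToPdf assumes it is handed a browser that was already started with puppeteer.launch().

```javascript
const path = require('path');
const { pathToFileURL } = require('url');

// Option 2: turn a local path into a file:// URL that page.goto() accepts.
const toFileUrl = (localPath) => pathToFileURL(path.resolve(localPath)).href;

// Option 3: render an HTML string directly instead of navigating to a URL.
// `browser` is assumed to be an already launched Puppeteer browser.
const renderHtmlToPdf = async (browser, html, outputPath) => {
  const page = await browser.newPage();
  try {
    await page.setContent(html, { waitUntil: 'networkidle0' });
    await page.pdf({ path: outputPath, format: 'a4', printBackground: true });
  } finally {
    // Close only the page; the caller owns the browser and may reuse it.
    await page.close();
  }
  return outputPath;
};
```

Passing the browser in, instead of launching one per call, is a design choice of this sketch: it lets several PDFs share one browser instance.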

How does this solution help?

The UX/CX team can provide HTML/CSS-based templates rather than a sample PDF. You then only need to focus on replacing the placeholders with actual data, and the above function takes care of generating your PDF. If the UX/CX team provides a sample PDF, you can either use some tool to convert the PDF into HTML and create a template out of that, or use the expertise of your web developers to create the HTML and CSS templates. Convincing your UX/CX team to provide the HTML templates is preferable: it is more efficient and makes the entire process a lot faster.

Side note: I personally like JS template strings for creating templates (instead of using another tool/framework/library), together with some generic JS code which takes care of validation, replaces the placeholders with real data and finally gives me the HTML page to create a PDF from. I will write more on this templating solution in a separate post and update the link here.
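
As a rough illustration of the template-string idea (this is not the templating solution promised for the other post), a template can be a plain function that validates its data and returns HTML. The names invoiceTemplate and escapeHtml and the fields customerName and amount are made up for this sketch.

```javascript
// Escape values before interpolating them, so data cannot inject markup.
const escapeHtml = (value) =>
  String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');

// A template is just a function from data to an HTML string.
const invoiceTemplate = ({ customerName, amount }) => {
  if (!customerName || typeof amount !== 'number') {
    throw new Error('customerName and a numeric amount are required');
  }
  return `<!DOCTYPE html>
<html>
  <body>
    <h1>Invoice for ${escapeHtml(customerName)}</h1>
    <p>Amount due: ${amount.toFixed(2)}</p>
  </body>
</html>`;
};
```

The string returned by such a function can be handed to page.setContent() to produce the PDF.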

Have you used this on production?

Short answer: YES.
Long answer: I always wanted to use this in production, but you know, it’s difficult to convince people 😜. What’s life without meetings! This has been tested in production and has done really well generating documents of 30–40 pages (most testing happens in production anyway 😜). As promised, I’ll now give an AWS Lambda based serverless solution.

Generating PDF from HTML on AWS Lambda

The above image shows all the AWS resources that we are going to create. I’ll briefly explain the solution below.

  • The Lambda sits behind API Gateway. The API accepts a request payload in JSON format. This payload must contain a URL for which a PDF needs to be generated. The Lambda generates the required PDF and puts it in an S3 bucket. Finally the API responds with a 201 HTTP status code and a Location header containing the S3 object URL. (To support CORS, the S3 URL needs to be returned in the response body as well.)
  • For this example, the S3 bucket has a bucket policy which only allows public access to objects tagged with public=yes. (You should ideally block public access to S3 and share either an S3 pre-signed URL or a CloudFront signed URL.)
  • We also have a usage plan to restrict the maximum number of requests one can make, and an API key is mandatory to access the API.
  • The Puppeteer dependency is put into a Lambda Layer. Instead of using the puppeteer npm library, we need to use chrome-aws-lambda as per the troubleshooting guideline here.
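
The tag-based bucket policy from the second bullet could look roughly like the following, written here as a JavaScript object so it can be printed as JSON. This is a sketch and not the policy from the SAM template: buildPublicTagPolicy is a hypothetical helper and my-pdf-bucket is a placeholder bucket name. It relies on the S3 condition key s3:ExistingObjectTag/<tag-key>.

```javascript
// Allows anonymous reads only for objects that carry the tag public=yes.
const buildPublicTagPolicy = (bucketName) => ({
  Version: '2012-10-17',
  Statement: [
    {
      Sid: 'AllowPublicReadForTaggedObjects',
      Effect: 'Allow',
      Principal: '*',
      Action: 's3:GetObject',
      Resource: `arn:aws:s3:::${bucketName}/*`,
      Condition: {
        StringEquals: { 's3:ExistingObjectTag/public': 'yes' }
      }
    }
  ]
});

console.log(JSON.stringify(buildPublicTagPolicy('my-pdf-bucket'), null, 2));
```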

Lambda Source Code

Let’s see the AWS SAM template first

SAM template (template.yaml)

The Lambda runtime is Node.js 14.x and the code is written in ES6. With Node 14 module support (specify "type": "module" in package.json) I thought we might not have to transpile, but it does not work on AWS Lambda yet: the Lambda runtime itself loads the configured handler with require(<path-to-handler>). So we need to transpile, and I’m using Babel for that. Look at the package.json and build.sh files below for the transpilation. Also note that the Handler in the above template points to the dist directory.

Now let’s see the Lambda code.

Lambda Handler (pdf-generator/src/app.js)

The handler simply delegates to the PDF Generation Request handler.

Pdf Generation Request Handler (pdf-generator/src/pdf-generation-request-handler.js)

This request handler orchestrates the request. First it converts the Lambda event into a PDF Generation request in the constructor (you can take care of any validations here using the ValueObject pattern). In the handleRequest method, it creates the PDF using the PDF Generation Service, stores it on S3 using the S3 PDF Storage Service and finally sends the PDF created response.
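
The orchestration described here can be sketched as follows. This is an approximation of the flow, not the code from the repo: the constructor signature, the injected dependency names (toRequest, pdfService, storageService, toResponse), the store method and createHandler are assumptions; only handleRequest and generate are named in this post.

```javascript
// Hypothetical sketch: collaborators are injected so they can be faked in tests.
class PdfGenerationRequestHandler {
  constructor(event, { toRequest, pdfService, storageService, toResponse }) {
    // Convert the raw Lambda event into a PDF generation request (validate here).
    this.request = toRequest(event);
    this.pdfService = pdfService;
    this.storageService = storageService;
    this.toResponse = toResponse;
  }

  async handleRequest() {
    // 1. Render the PDF to a local file and get its path.
    const filePath = await this.pdfService.generate(this.request);
    // 2. Upload the file and get the URL of the stored object.
    const pdfUrl = await this.storageService.store(filePath, this.request.fileName);
    // 3. Shape the result for API Gateway.
    return this.toResponse(pdfUrl);
  }
}

// The Lambda entry point only wires things together and delegates.
const createHandler = (dependencies) => async (event) =>
  new PdfGenerationRequestHandler(event, dependencies).handleRequest();
```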

Pdf Generation Request Adapter (pdf-generator/src/pdf-generation-request-adapter.js)

This adapter just picks up the url from the request body and generates a random file name for the PDF. (The name generation code can be placed in another file, or the name can be taken as an input in the request payload.)
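
A minimal sketch of such an adapter is shown below, assuming an API Gateway proxy event whose body is a JSON string. The function names and the exact file name scheme are assumptions of this sketch, and the http/https check is an extra precaution that the post does not mention.

```javascript
const crypto = require('crypto');

// Builds a file name such as "<32 hex characters>_<timestamp>.pdf".
const generateFileName = () =>
  `${crypto.randomBytes(16).toString('hex')}_${Date.now()}.pdf`;

// Converts an API Gateway proxy event into a PDF generation request.
const toPdfGenerationRequest = (event) => {
  const body = JSON.parse(event.body || '{}');
  // new URL() throws a TypeError for anything that is not an absolute URL.
  const parsed = new URL(body.url);
  // Reject schemes such as file:// so callers cannot print local files.
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new TypeError('Only http and https URLs are supported');
  }
  return { url: parsed.href, fileName: generateFileName() };
};
```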

Pdf Generation Service (pdf-generator/src/pdf-generation-service.js)

This service expects the PDF Generation request. It generates the PDF file in the tmp folder and returns the path to the file from the generate method. Notice the difference in terms of the import of chromium and how we launch Puppeteer. There is also a DEFAULT_PRINT_OPTIONS which can be overridden by accepting the same in the request payload. For this example, the default should do.
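
The difference in importing and launching can be sketched as follows. The launch options follow the usage documented by chrome-aws-lambda; everything else is an assumption of this sketch: the values in DEFAULT_PRINT_OPTIONS are copied from the earlier example, buildPrintOptions is a hypothetical helper, and printOptions is a hypothetical request field carrying the overrides.

```javascript
// Assumed defaults, taken from the basic example earlier in this post.
const DEFAULT_PRINT_OPTIONS = { format: 'a4', printBackground: true };

// Merge request-level overrides over the defaults; the output path always wins.
const buildPrintOptions = (fileName, overrides = {}) => ({
  ...DEFAULT_PRINT_OPTIONS,
  ...overrides,
  path: `/tmp/${fileName}`
});

const generate = async ({ url, fileName, printOptions }) => {
  // Required lazily so this file still loads where the Lambda Layer is absent.
  const chromium = require('chrome-aws-lambda');
  // chrome-aws-lambda supplies the Chromium binary and launch flags for Lambda.
  const browser = await chromium.puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath,
    headless: chromium.headless
  });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' });
    const options = buildPrintOptions(fileName, printOptions);
    await page.pdf(options);
    // Return the path of the generated file in the tmp folder.
    return options.path;
  } finally {
    await browser.close();
  }
};
```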

Pdf Storage Service (pdf-generator/src/s3-pdf-storage-service.js)

This stores the file on S3 and returns the HTTP based object URL.

Note: When running locally with sam local start-api, setting MODE to SAM_LOCAL skips the upload to S3 and returns a dummy URL instead.
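
How the returned URL might be chosen can be sketched like this. resolvePdfUrl and its parameter names are hypothetical; the dummy host matches the one in the sample response of the local test, and the real URL uses the virtual-hosted-style S3 format. The sketch assumes object keys are URL-safe file names without slashes.

```javascript
// Returns the URL to report back to the caller for a stored object.
const resolvePdfUrl = ({ mode, bucket, region, key }) => {
  if (mode === 'SAM_LOCAL') {
    // Local runs skip the upload, so there is no real object to point at.
    return `https://dummyS3Url/${key}`;
  }
  // Virtual-hosted-style S3 object URL.
  return `https://${bucket}.s3.${region}.amazonaws.com/${key}`;
};
```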

PDF Generation Request (pdf-generator/src/pdf-storage-request.js)

S3 PDF Storage Request Adapter (pdf-generator/src/s3-pdf-storage-request-adapter.js)

This takes care of creating a PutObjectCommand for storing the PDF. We are tagging the object with public=yes and the ContentDisposition is set to attachment which would start downloading the file on a browser.
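
The input for such a PutObjectCommand could be built roughly as follows. buildPutObjectInput and its parameters are hypothetical names for this sketch; the property names (Bucket, Key, Body, ContentType, ContentDisposition, Tagging) are the ones PutObjectCommand accepts. Setting ContentType to application/pdf is an assumption, since the post only mentions the tag and the content disposition.

```javascript
// Builds the input object for PutObjectCommand from @aws-sdk/client-s3.
const buildPutObjectInput = ({ bucket, key, body }) => ({
  Bucket: bucket,
  Key: key,
  Body: body, // e.g. a read stream for the generated file
  ContentType: 'application/pdf',
  // "attachment" makes browsers download the file instead of displaying it.
  ContentDisposition: 'attachment',
  // Object tags are passed as a URL-encoded query string.
  Tagging: 'public=yes'
});

// With the SDK installed, the upload itself would look like:
//   const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
//   await new S3Client({}).send(new PutObjectCommand(buildPutObjectInput({ ... })));
```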

File Service (pdf-generator/src/file-service.js)

This file returns a read stream for a file.

Pdf Generation Response Adapter (pdf-generator/src/pdf-generation-response-adapter.js)

This converts the response as expected by AWS API Gateway.
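
A sketch of such an adapter for a Lambda proxy integration is shown below. toPdfCreatedResponse and the allowedOrigin parameter are hypothetical names; the 201 status, the Location header and the pdfUrl body field match the sample responses later in this post.

```javascript
// Shapes a result as an API Gateway Lambda proxy integration response.
const toPdfCreatedResponse = (pdfUrl, allowedOrigin) => ({
  statusCode: 201,
  headers: {
    'Content-Type': 'application/json',
    Location: pdfUrl,
    'Access-Control-Allow-Origin': allowedOrigin
  },
  // The body must be a string; the URL is repeated here for CORS clients.
  body: JSON.stringify({ pdfUrl })
});
```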

Config (pdf-generator/src/config.js)

Use this Config file instead of reading the configuration from the environment variables directly everywhere in the code.
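
One possible shape for such a config module is sketched here. MODE is the variable mentioned earlier and AWS_REGION is set by the Lambda runtime; S3_BUCKET_NAME and ALLOWED_ORIGIN are variable names invented for this sketch, as are the default values.

```javascript
// Reads all settings in one place; the rest of the code uses this object.
const loadConfig = (env = process.env) => ({
  mode: env.MODE || 'AWS',
  bucket: env.S3_BUCKET_NAME,
  region: env.AWS_REGION,
  allowedOrigin: env.ALLOWED_ORIGIN || '*'
});

// Frozen so that no module can change the configuration at runtime.
const config = Object.freeze(loadConfig());
```

Taking env as a parameter keeps the function easy to test with a plain object instead of the real environment.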

package.json (pdf-generator/package.json)

Note: The transpile script uses Babel to transform the modules to CommonJS

build.sh

Takes care of transpiling the JS code before running sam build

.npmignore

This excludes the tests and src files from the Lambda package.

Lambda Layer package.json (dependencies/nodejs/node14/package.json)

This file contains all the dependencies which should go into the Lambda Layer. Also revisit the PuppeteerDependencyLayer in the above SAM template.

main.yml (.github/workflows/main.yml)

This section is optional and is not a requirement for building and deploying AWS Lambda. You can use your favourite CI/CD tool.
The above file is a GitHub Actions workflow configuration file. On git push, it triggers the build, validates the SAM template and then deploys to AWS. Now that’s real power and productivity ❤️. You need to set the secrets (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION) in your GitHub repo for this to work.

I humbly thank the GitHub team and Microsoft for GitHub Actions.

Parameters for sam local (sam-local-env.json)

The above file contains the parameter overrides for testing the Lambda on local. One can also use LocalStack for a complete integration test (a topic which would need a separate post).

Now that’s all the code. The entire repo is available here for reference. Feel free to send a pull request, if you have any suggestions.

Build on Local:

Run the build.sh script to build the project on local.

Test on Local:

Make sure you have Docker installed and running on your machine.

Start the api

sam local start-api --env-vars sam-local-env.json

Now you can test the API deployed on local, with the following request.

Sample Request:

curl -i -X POST \
http://127.0.0.1:3000/generate-pdf/ \
-H 'cache-control: no-cache' \
-H 'content-type: application/json' \
-d '{
"url": "https://gsswain.com/print-sample/"
}'

Sample Response:

HTTP/1.0 201 CREATED
Content-Type: application/json
Access-Control-Allow-Origin: http://localhost:8080
location: https://dummyS3Url/3c8e8e4c-f7b4-4f5b-8e85-8dc06f8a22c3_1615739992609_353075938260911200.pdf
Content-Length: 105
Server: Werkzeug/1.0.1 Python/3.8.8
Date: Sun, 14 Mar 2021 16:39:57 GMT
{"pdfUrl":"https://dummyS3Url/3c8e8e4c-f7b4-4f5b-8e85-8dc06f8a22c3_1615739992609_353075938260911200.pdf"}

Deploy on AWS:

You can use sam deploy --guided to deploy the Lambda.
If you host your repo on GitHub, then you can build and deploy using GitHub Actions, which was covered above.
For the CI/CD deploy to work, you need to have the samconfig.toml file with appropriate values. Refer to the one in my repo.

Test the deployed solution

If you have deployed this to your AWS account, you must pass the API key in the headers. To find it, go to the AWS CloudFormation stack and then to its resources. Go to the ApiKey resource and then click show.

You can get the URL of the API in the Output section of the AWS CloudFormation Stack.

Now you can test the API deployed on AWS with the following request.

Sample Request:

curl -i -X POST \
https://<YOUR-API-ID>.execute-api.ap-southeast-2.amazonaws.com/Prod/generate-pdf/ \
-H 'cache-control: no-cache' \
-H 'content-type: application/json' \
-H 'x-api-key: <YOUR-API-KEY_GOES_HERE>' \
-d '{
"url": "https://gsswain.com/print-sample/"
}'

Sample Response:

HTTP/2 201
content-type: application/json
content-length: 171
location: <GENERATED_PDF_URL>
date: Sun, 14 Mar 2021 15:26:36 GMT
x-amzn-requestid: 8aa9cd12-aaa9-4628-96db-e2d7b952feb5
access-control-allow-origin: https://gsswain.com
x-amz-apigw-id: cLuuTFlYSwMF5dw=
x-amzn-trace-id: Root=1-604e2b28-1cff94e27f34e4e35ec1a6e8;Sampled=0
{"pdfUrl":"<GENERATED_PDF_URL>"}

Copy the pdfUrl value returned in the response and open it in a browser, and you should see the PDF version of the page. Check the tags and metadata in the S3 bucket.
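
The same request can also be made from JavaScript. The sketch below mirrors the curl call; buildGeneratePdfRequest and requestPdf are hypothetical helper names, and requestPdf assumes a runtime with a global fetch (a browser or Node 18 and later).

```javascript
// Builds the same request as the curl example, for use with fetch().
const buildGeneratePdfRequest = (apiBaseUrl, apiKey, pageUrl) => ({
  // Strip trailing slashes from the base URL before appending the path.
  url: `${apiBaseUrl.replace(/\/+$/, '')}/generate-pdf/`,
  options: {
    method: 'POST',
    headers: { 'content-type': 'application/json', 'x-api-key': apiKey },
    body: JSON.stringify({ url: pageUrl })
  }
});

// Sends the request and returns the pdfUrl from the response body.
const requestPdf = async (apiBaseUrl, apiKey, pageUrl) => {
  const { url, options } = buildGeneratePdfRequest(apiBaseUrl, apiKey, pageUrl);
  const response = await fetch(url, options);
  if (response.status !== 201) {
    throw new Error(`PDF generation failed with HTTP ${response.status}`);
  }
  const { pdfUrl } = await response.json();
  return pdfUrl;
};
```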

Note: You can only access an object in the bucket if you know its object key. Also, objects uploaded directly to S3 without the tag public=yes can’t be accessed even if you know their keys. With that we have covered a lot of ground and finally tested our Lambda on AWS as well.

Cleanup:

After you have finished testing, delete all the resources.

Delete all objects in S3 bucket

aws s3 rm s3://<your-s3-bucket-name> --recursive

Delete the CloudFormation Stack

aws cloudformation delete-stack --stack-name <name-of-your-cfn-stack>

Summary:

I hope you learnt a little something and you can use this solution somewhere on production 😊. This is half the story. Ideally you would like to have an API which would take a templateId, some data for the template and generate a PDF with that. This code example generates a PDF, given a URL. You can easily build upon this to support templates. If you are having any issues with Puppeteer while deploying this solution to some other platform, please go through the troubleshooting guide here.

Note: There are at least two sides to everything. Lambda is not a silver bullet, and you should get some expert opinion before using it for your use case. This solution works best when your target is high throughput rather than low latency. I have intentionally set ReservedConcurrentExecutions to 1 in the SAM template. Play around with the memory and concurrency settings for your use case. If you need help, do reach out to me.
