How This Website Works
This website runs on 6 AWS services (plus one optional service for version control and a content repository). I'll discuss each one briefly, although over time I will probably write more, since these articles are not static but fluid brain dumps (at least until all the info is dumped).
- Route 53 (R53) - the DNS offering of AWS. Why is it called Route 53? Because DNS requests typically run on port 53 - originally over UDP only, with TCP used for larger responses such as zone transfers. What does DNS do? Everything on the internet has an address that looks like 4 numbers with periods between them (there are lots of rules that define validity, I won't get into that now) such as 55.25.124.244. Sure, you could remember a few of those, but it would be easier to have a name attached, right? The same way you save contacts in your phone. What if there was a book that listed all the addresses for a particular company? And what if there was a central book that told you where to find the books for each company? And what if computers that went and looked up this information could remember it for a period of time listed along with the address, so the next time anyone needed to know and asked this computer, it could just respond immediately? Great, that's how DNS works. In this case, there are records for tolidano.com and www.tolidano.com that point to another record created by another AWS service called CloudFront (see below). There's a small sketch of creating that kind of record after this list.
- Amazon Certificate Manager (ACM) - the SSL certificate offering of AWS. SSL stands for Secure Sockets Layer. This is how you get the little lock icon when you visit websites using https instead of http (the s is for secure). The lock means that the data moving between your computer and the website is encrypted, so even if someone were intercepting the data (which is often possible), it would take them a very long time to figure out what you sent. How does this work? It's all about trust. When you request a certificate in ACM for a domain you registered and host in Route 53, it creates DNS records to validate that you, in fact, own the domain. Otherwise you could ask for a cert for amazon.com and pretend to be Amazon! Once it validates that you own the domain, the cert is issued and you can attach it to various AWS services (for example, CloudFront, see below) so that when your website is accessed, it is done so securely. It doesn't matter if you don't collect data from your visitors; having a secure site is just a good choice overall, and you never know, one day you might start a newsletter or sell something, and if the site is already secure, you saved yourself some future work. Certificates expire, but ACM renews them automatically every year. One last piece of the trust puzzle - how do you trust ACM? Every browser actually ships with a set of trusted "root" certificates. These are certificates owned by certificate authorities, such as Amazon's. When your browser examines the certificate issued by ACM upon visiting your website, it validates that the cert was in fact signed by a trusted authority before giving you the warm fuzzies with the lock. If someone tried to sign with their own untrusted certificate authority, it would produce a big spooky error message. There's a sketch of this request-and-validate flow after the list, too.
- Simple Storage Service (S3) - An object store where you can drop HTML files and images with a simulated folder structure. When I "publish" new content, I synchronize my local content to the bucket in AWS and tell CloudFront "hey, empty your cache for my website" via an Invalidation so the new content gets picked up.
- CloudFront & CloudFront Functions - CloudFront is the AWS Content Delivery Network (CDN) offering. In this case, I put the content in the S3 bucket as HTML pages and images, and rather than expose that directly to the internet, I use CloudFront, which provides a few different benefits - one is the caching, although that's not really important since my website does not get a lot of traffic. Another is the ability to use my own domain name over HTTPS, which is not possible with S3 website hosting. Lastly, I can take advantage of CloudFront Functions. These are tiny bits of code that can do very simple things such as ensure that if someone hits tolidano.com, it redirects them to www.tolidano.com, and it is important for many reasons to have a canonical website.
- Lambda - The easiest way to get code running anywhere. In six clicks from your new account setup, you can have a code editor with a sample hello world application ready and get to coding. Six more clicks for configuration and deployment and you have, with the recently released Lambda Function URL feature, a web-accessible secure endpoint.
- Lambda@Edge - Using this, I can now dynamically inject the header and footer, which I keep in the code of a function that runs on each origin request. The function grabs the article from S3 itself and returns the full page, short-circuiting CloudFront's trip to the origin bucket.
- CodeCommit - So we do not technically NEED this, and we could use GitHub or GitLab or any other source code management (SCM) tool or service, but it is handy to keep your site as code. S3 offers versioning, so I could, in theory, go back in time and see what the site looked like in the past, but synchronizing to S3 is not trivial, whereas utilizing a source code repository can be extremely simple.
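To make the Route 53 piece above a bit more concrete, here is a minimal sketch of creating the kind of alias record described in that bullet using boto3. The hosted zone ID and CloudFront domain below are placeholders, not my actual values:

```python
import boto3

route53 = boto3.client("route53")

# Placeholders - substitute your own hosted zone ID and distribution domain.
HOSTED_ZONE_ID = "ZEXAMPLE12345"
DISTRIBUTION_DOMAIN = "d1234abcd.cloudfront.net"

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.tolidano.com",
                    "Type": "A",
                    "AliasTarget": {
                        # Fixed hosted zone ID that all CloudFront aliases use.
                        "HostedZoneId": "Z2FDTNDATAQYW2",
                        "DNSName": DISTRIBUTION_DOMAIN,
                        "EvaluateTargetHealth": False,
                    },
                },
            }
        ]
    },
)
```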
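And here is a rough sketch of the ACM DNS-validation flow described above - requesting a certificate and reading back the CNAME record you would add to the zone to prove ownership. For use with CloudFront the certificate must be requested in us-east-1; the rest is illustrative, not my exact setup:

```python
import boto3

# Certificates attached to CloudFront must be issued in us-east-1.
acm = boto3.client("acm", region_name="us-east-1")

response = acm.request_certificate(
    DomainName="tolidano.com",
    SubjectAlternativeNames=["www.tolidano.com"],
    ValidationMethod="DNS",
)
cert_arn = response["CertificateArn"]

# ACM hands back one CNAME per domain; creating those records in Route 53
# proves ownership so ACM can issue (and later automatically renew) the cert.
# The ResourceRecord fields can take a few seconds to appear after the request.
details = acm.describe_certificate(CertificateArn=cert_arn)
for option in details["Certificate"]["DomainValidationOptions"]:
    record = option.get("ResourceRecord")
    if record:
        print(record["Name"], record["Type"], record["Value"])
```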
What About The Content
I keep a clone of the CodeCommit repository on my computer. I write new content by copying the template and writing an article, then commit the article to the repository. I have a small script, `publish.sh`, which I keep in the repository; when I run it, it synchronizes everything to the S3 bucket and invalidates the CloudFront cache. In fact, because I synchronize the entire folder, you can download my publish.sh script from my site. It's very short:
```zsh
#! /bin/zsh
# sync the files to s3
aws s3 sync --exclude '.git/*' . s3://tolidano.com
# invalidate the cloudfront distribution
aws cloudfront create-invalidation --distribution-id E3D7R1KZ2V941B --paths '/*'
```
There isn't anything sensitive here - you knowing the name of the bucket holding my website does not expose me in any way, nor does the publication of my CloudFront distribution ID.
How Much This Website Costs
A key thing people will care about is cost, so the TL;DR: it's less than $1/mo for my site, and in theory, I could host dozens of sites (on a single domain) and it would still be less than $1. This is because of a single key factor: Free Tier. AWS has a free tier where, each month, you get some small allocation of dozens of their services for free, forever. For example:
- The first 5 GB in S3 is free, forever (similar to how Box and Dropbox give you 10 GB except arguably far more versatile). Your average website takes up less than 0.5% of that.
- CloudFront offers 10 million free requests transferring 1 TB of data out (with 2 million CloudFront Function invocations). So until you're getting around 1 million hits per month, it's free.
- Lambda offers 1 million free requests and 400k GB-seconds free per month, metered in 1MB / 1ms increments (minimum 128MB). The lambdas I run consume about 0.1 GB-seconds per hit combined, so roughly 4 million hits per month are free.
- I also run Lambda@Edge; it's $0.03 / 50,000 requests and $0.01 / 200 GB-seconds, but Lambda@Edge meters in 128MB and 50ms increments, so this certainly costs me under $1 per million hits. It allows me to inject the header and footer and manage them in one place (there's a rough back-of-the-envelope calculation after this list).
- ACM certificates issued by the Amazon certificate authority for use with CloudFront are free (thanks to Let's Encrypt, which made free certificates the standard).
- CodeCommit is free for the first 5 users on your repositories, with 10,000 Git requests a month included.
- Route 53 charges $0.50/month for the domain (hosted zone) and then $0.01 / 25,000 queries, so under $1 per million hits - unless you use ALIAS records pointing at AWS resources, in which case those lookups are free.
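To put rough numbers behind the estimates above, here's a small back-of-the-envelope sketch. The unit prices and the ~0.1 GB-seconds-per-hit figure come from the list above; the traffic number is an arbitrary assumption for illustration, well beyond what this site actually gets:

```python
# Rough monthly cost estimate using the figures quoted in the list above.
hits_per_month = 100_000  # assumed traffic for illustration

# Lambda free tier: 1M requests and 400,000 GB-seconds per month.
lambda_gb_seconds = hits_per_month * 0.1  # ~0.1 GB-s per hit (estimate above)
lambda_is_free = hits_per_month <= 1_000_000 and lambda_gb_seconds <= 400_000

# Lambda@Edge has no free tier: $0.60 per million requests, $0.01 per 200 GB-s,
# billed at a minimum of 128MB for 50ms per request.
edge_request_cost = hits_per_month / 1_000_000 * 0.60
edge_compute_cost = (hits_per_month * 0.128 * 0.05) / 200 * 0.01

# Route 53: $0.50 for the zone; alias lookups to CloudFront are free.
dns_cost = 0.50

total = edge_request_cost + edge_compute_cost + dns_cost
print(f"Lambda within free tier: {lambda_is_free}")
print(f"Estimated: ${total:.2f}/month at {hits_per_month:,} hits")
```

At these assumed numbers the total comes out to roughly $0.59/month, which is where the "less than $1/mo" claim comes from.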
Hypothetically...
So one thing that is very annoying and "basic" about this model is that the header and footer are part of the article template, and thus, if I ever want to change anything (like swapping CSS frameworks), I have to do a find-and-replace in every single file. While I have half a dozen articles, that's easy, but if I had 100, that might be rough (although replace-all in VS Code works very well, as does gsed/sed). I thought about how I might fix this, and the art of the possible. First, I thought "maybe I can access the content in the response with CloudFront Functions just before it gets back to the viewer, in a viewer response association?" But this turns out to be false - CloudFront Functions cannot access the content on the way in or out. There is no "server-side includes" here because there is no server processing that I can access in any meaningful way, so the Apache/Nginx solutions are all out the window.

There is, however, a feature called Lambda@Edge, which is the binding of Lambda and CloudFront together - allowing full Lambda functions to run at the edge. They still have limitations versus full Lambda, but they allow you to assign IAM policies and access content. What they still do not offer is the ability to modify content. But you can completely replace the content. So I could trap responses from the origin, and either keep the header and footer in the Lambda code or keep them as separate cacheable objects in the bucket along with the content. But Lambda@Edge has no free tier, so I will pay for each request ($0.60 per million and $0.18 / GB-hour, which I suspect would cover about 250,000 requests, billed at the 1 ms increment). So do I spend another $0.65 - $0.70 a month for the versatility of only updating the header/footer in a single place? I think I'll at least try it out and see.
Ok, so it worked! Now I run an edge lambda that does exactly what's described above. It took 8 versions, and it also has to handle 404s now, but overall, it does a great job. The one "special" piece is that the title of the page has to be kept with the article, so the first line of each file going to S3 is the title. Also, there's an exception: it only handles URIs that contain "html". Here's the code for the lambda:
```python
import io
import json
import boto3
import botocore

session = boto3.Session()
s3 = session.client('s3', region_name="us-east-1")
bucket_name = "tolidano.com"

top = """
<html>
<head>
<title>%TITLE%</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta charset="UTF-8">
<link rel="stylesheet" href="https://unpkg.com/simpledotcss/simple.min.css">
<link rel="stylesheet" href="/site.css" />
<link rel="icon" type="image/png" href="/favicon.ico" />
<link rel="apple-touch-icon" href="/favicon.png" />
<script src="https://unpkg.com/htmx.org@1.8.4" integrity="sha384-wg5Y/JwF7VxGk4zLsJEcAojRtlVp1FKKdGy1qN+OMtdq72WRvX/EdRdqg/LOhYeV" crossorigin="anonymous"></script>
</head>
<body>
<header>
<nav>
<a href="/">Home</a>
</nav>
<h1>%TITLE%</h1>
</header>
<main>
<div id="d">
"""

bottom = """
</div>
</main>
</body>
</html>
"""

not_found = "Not Found. Go <a href='/'>Home</a>"

headers = {
    "cache-control": [{"key": "Cache-Control", "value": "max-age=100"}],
    "content-type": [{"key": "Content-Type", "value": "text/html"}],
}


def lambda_handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    # Only handle HTML pages; pass everything else through to the origin untouched.
    if "html" not in request["uri"]:
        return request
    bytes_buffer = io.BytesIO()
    try:
        # The URI starts with "/", so strip it to get the S3 object key.
        s3.download_fileobj(Bucket=bucket_name, Key=request["uri"][1:], Fileobj=bytes_buffer)
        byte_value = bytes_buffer.getvalue()
        file_data = byte_value.decode()
        # The first line of each object is the page title; the rest is the article body.
        lines = file_data.split("\n")
        title = lines[0]
        body = "\n".join(lines[1:])
        return {
            "status": 200,
            "statusDescription": "OK",
            "headers": headers,
            "body": top.replace("%TITLE%", title) + body + bottom,
        }
    except botocore.exceptions.ClientError:
        return {
            "status": 404,
            "statusDescription": "Not Found",
            "headers": headers,
            "body": not_found,
        }
```
Actually, I did a few more things... First, you now see the created and updated dates on each article - I modified the bottom variable to include a footer:
<footer> <div>%DATES%</div> <div><a href="#top">Back to Top</a></div> </footer>
Then I added a snippet of code to get those dates from S3 (because the bucket has versioning turned on):
```python
# key is the S3 object key (the request URI without the leading "/");
# versions come back newest first.
versions = s3.list_object_versions(Bucket=bucket_name, Prefix=key)["Versions"]
newest, oldest = versions[0]["LastModified"], versions[-1]["LastModified"]
dates = f"Created: {oldest} / Updated: {newest}"
# main holds the article content (what the earlier listing called body).
body = top.replace("%TITLE%", title) + main + bottom.replace("%DATES%", dates)
```
So now you have a created and updated date on all the articles automatically.
Next, I found it would suffer from cold starts (because I'm not very popular). So I wrote a little lambda that is triggered by an EventBridge rate expression - rate(15 minutes) - which should be enough to keep the lambda warm for some segment of users, since cold starts usually happen after about 20 minutes of idleness. The code has no dependencies (I interact with urllib directly, no requests), so it's fast and simple. A sketch of what that warmer looks like is below.
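I haven't published the warmer itself with the site, but it amounts to something like the following sketch - the URL list is illustrative, and the handler simply fetches each page through CloudFront so the edge lambda stays warm:

```python
import urllib.request

# Pages to touch on each run; fetching an HTML page exercises the edge lambda path.
URLS = [
    "https://www.tolidano.com/index.html",  # illustrative, not an exhaustive list
]


def lambda_handler(event, context):
    # Triggered by an EventBridge rule with a rate(15 minutes) expression.
    warmed = 0
    for url in URLS:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                response.read()
            warmed += 1
        except Exception:
            # A failed warm-up request isn't worth alerting on.
            pass
    return {"warmed": warmed}
```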
I got tired of running the publish script every time I pushed to the repo. I am already forced to be logged in (via aws sso login) anyway in order for CodeCommit to work, but it was still an extra step I knew could be automated away with any reasonable CI/CD setup. So I set up a CodeBuild project and dropped a buildspec.yml file into the root of my repo. Then I set up a CodePipeline with a single build step (no "deploy" step is necessary, because the "build" step does everything I need). The buildspec looks just like my publish script, but in YAML:
```yaml
version: 0.2
phases:
  install:
    commands:
      - aws s3 sync --exclude '.git/*' . s3://tolidano.com
  post_build:
    commands:
      - aws cloudfront create-invalidation --distribution-id E3D7R1KZ2V941B --paths '/*'
```