Email series: Basics

18 Dec, 2020

About this series

I am planning to write a few posts describing email technology and its caveats, with a focus on receiving email, from the perspective of a product development team.

Why focus on receiving email

Email is widely used for sending: marketing, promotions, newsletters, or kinda-reliable notification delivery. Many articles on the internet already describe how to base products on that capability, and I don’t have much to add in that space.

Why develop products based on email

Because of its long history and use of open protocols, email remains one of the few (perhaps the only) open standards allowing free communication between users anywhere on the Internet. In practice, nearly every internet user has an email account, and nearly every technology stack and platform has support for dealing with email messages. As such, email remains a widespread common denominator that can help in many situations.

Ultimately, however, emails are a poor substrate for serious implementations, and this series also intends to demonstrate why.

Email basics

Emails as a document and their structure

From a technical point of view, an email is essentially a multiline string with an internal structure:

Headers, such as
- From, Reply-To
- To, Cc, Bcc
- Subject
Body, very commonly split into parts such as a message text part (often in both plain text and HTML) and attachment parts

This means that an entire email, complete with its routing information, content, and attached files or embedded images, can be stored as a single text file. In fact, this is literally how the initial versions of email worked, and it is how the design of SMTP has been described ever since the initial RFC 821 from 1982. SMTP, or Simple Mail Transfer Protocol, specifies how computers can send each other emails.

The email format is described in detail in RFC 5322. As technical specifications, RFCs are structured more like reference documents, but Appendix A includes examples of raw emails for several typical scenarios.

Email examples

We could refer to this format as a raw email. From the examples in RFC 5322, this is the most basic raw email:

From: John Doe <jdoe@machine.example>
To: Mary Smith <mary@example.net>
Subject: Saying Hello
Date: Fri, 21 Nov 1997 09:55:06 -0600
Message-ID: <1234@local.machine.example>

This is a message just to say hello.
So, "Hello".
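
As a rough illustration of "an email is a multiline string with an internal structure", the example above can be parsed with Python’s standard `email` package. This is only a sketch: `RAW_EMAIL` is an illustrative name, and using `policy=default` (the newer API of that package) is a choice made here, not something the RFC prescribes.

```python
from email import message_from_string
from email.policy import default

# The basic RFC 5322 example, kept as one plain multiline string.
RAW_EMAIL = (
    "From: John Doe <jdoe@machine.example>\n"
    "To: Mary Smith <mary@example.net>\n"
    "Subject: Saying Hello\n"
    "Date: Fri, 21 Nov 1997 09:55:06 -0600\n"
    "Message-ID: <1234@local.machine.example>\n"
    "\n"
    "This is a message just to say hello.\n"
    'So, "Hello".\n'
)

# Headers end at the first blank line; everything after it is the body.
msg = message_from_string(RAW_EMAIL, policy=default)

subject = str(msg["Subject"])
sender = msg["From"].addresses[0]   # structured view of the From header
body = msg.get_content()            # body decoded to a Python str

print(subject)
print(sender.display_name, sender.addr_spec)
print(body)
```

The same call works on the contents of a stored email file, which is one reason keeping raw emails around is cheap.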

Email addresses

Notoriously difficult to validate, email addresses have a surprisingly deep feature set, including subaddressing and even comments (!!!). Regardless, for the most part everyone is familiar with the basic structure of user@domain.
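
A small sketch of pulling an address apart, using `email.utils.parseaddr` from the Python standard library. The helper name `split_address` is made up for this post, and treating `+` as the subaddress separator is an assumption: it is a common convention, but each mail system decides its own separator (or none at all).

```python
from email.utils import parseaddr


def split_address(header_value: str) -> dict:
    """Split a From/To style value into its visible pieces (illustrative helper)."""
    # parseaddr separates the display name from the user@domain part.
    display_name, address = parseaddr(header_value)

    # The domain is whatever follows the last "@".
    local_part, _, domain = address.rpartition("@")

    # Assumption: "+" marks a subaddress (tag). This is a convention only.
    mailbox, _, tag = local_part.partition("+")

    return {
        "display_name": display_name,
        "address": address,
        "mailbox": mailbox,
        "tag": tag,
        "domain": domain,
    }


parts = split_address("Mary Smith <mary+billing@example.net>")
print(parts)
```

This does not validate anything; it only shows that even a "simple" address carries more structure than user@domain.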

Fun fact! Email addresses didn’t always have this format - for example there is such a thing as a UUCP bang path address which doesn’t have @ at all but instead specifies an explicit routing path of nodes separated by !.

the first email ever received in Uruguay pic.twitter.com/05vQptqmCU
— Alvaro (@alvrod) December 8, 2020

A more realistic email

End users will not typically interact with email files in this format; instead, their email client of choice will parse the email structure and render it with a nice design, highlighting or hiding its parts for easier usage. It is still possible to see the raw email: in Gmail, for example, we can click on the three-dot menu on the right and select “Show Original” to see or download it.

As part of making it nicer for users, email clients will typically choose to hide some headers (or parts of the values of headers) which could be relevant for troubleshooting the behaviour of your system, so it’s necessary to always keep the raw emails around for debugging.

This is what a raw email could look like today as sent from Gmail:

MIME-Version: 1.0
Date: Fri, 18 Dec 2020 10:10:10 +0100
Message-ID: <CABf2nMZJ-su9ntLF2ugzy=hPFR5+kuauwr9NyQ2q4R-KtA0EZg@mail.gmail.com>
Subject: hello
From: Alvaro Rodriguez <alvaro@alvrod.com>
To: test@alvrod.com
Content-Type: multipart/alternative; boundary="00000000000010b1f705b6c1e5a2"

--00000000000010b1f705b6c1e5a2
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi!

--=20
=C3=81lvaro Rodr=C3=ADguez
---
alvaro@alvrod.com
@alvrod <http://twitter.com/alvrod>

--00000000000010b1f705b6c1e5a2
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">Hi!<br clear=3D"all"><div><br>-- <br><div dir=3D"ltr" clas=
s=3D"gmail_signature" data-smartmail=3D"gmail_signature"><div dir=3D"ltr"><=
div><div dir=3D"ltr"><div>=C3=81lvaro Rodr=C3=ADguez<br>---<br><a href=3D"m=
ailto:alvaro@alvrod.com" target=3D"_blank">alvaro@alvrod.com.=
com</a><br><a href=3D"http://twitter.com/alvrod" target=3D"_blank">@alvrod<=
/a><br></div></div></div></div></div></div></div>

--00000000000010b1f705b6c1e5a2--

Email parts

There are some additional elements in there, and to finish our first post in the series let’s quickly unpack what is going on in this fuller example.

The first line is something new - the MIME version. MIME, short for Multipurpose Internet Mail Extensions, defines how to do more things with email than exchanging basic US-ASCII encoded text.

As the original “you ain’t gonna need it” minimum viable product, US-ASCII happily ignores the needs of every other culture in the world, and we have been dealing with mis-encoded characters ever since.

MIME concepts are also used in HTTP and include key declarations such as:

Content-Type, in this case multipart/alternative, meaning multiple parts: a text/plain body and an alternative text/html body (giving the recipient the ability to choose which one to read, depending on device capabilities or personal choice). Using multipart (typically multipart/mixed) also allows adding attachments, each with its own MIME type.
Content-Transfer-Encoding, describing how to use US-ASCII to encode content that is definitely not US-ASCII; typical values are base64 for attached files and quoted-printable for internationalized text, or for HTML as in this example.
Content-Disposition, to support options for rendering: show the content inline (for example for images) or as an attachment that the user is expected to open or download separately.
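
To make these three headers a bit more concrete, here is a sketch that walks the parts of a small multipart/alternative message with Python’s standard `email` package. The message is a shortened, made-up variant of the Gmail example (the addresses and the `example-boundary` string are placeholders), and `bodies` is just an illustrative name.

```python
from email import message_from_string
from email.policy import default

# Shortened, hypothetical variant of the Gmail example above.
RAW_EMAIL = "\n".join([
    "MIME-Version: 1.0",
    "Subject: hello",
    "From: sender@example.com",
    "To: recipient@example.com",
    'Content-Type: multipart/alternative; boundary="example-boundary"',
    "",
    "--example-boundary",
    'Content-Type: text/plain; charset="UTF-8"',
    "Content-Transfer-Encoding: quoted-printable",
    "",
    "Hi!",
    "=C3=81lvaro",
    "",
    "--example-boundary",
    'Content-Type: text/html; charset="UTF-8"',
    "Content-Transfer-Encoding: quoted-printable",
    "",
    '<div dir=3D"ltr">Hi!<br>=C3=81lvaro</div>',
    "",
    "--example-boundary--",
    "",
])

msg = message_from_string(RAW_EMAIL, policy=default)

bodies = {}
for part in msg.walk():
    if part.is_multipart():
        continue  # the container itself carries no content of its own
    encoding = str(part.get("Content-Transfer-Encoding", "7bit"))
    disposition = part.get_content_disposition()  # None when the header is absent
    # get_content() undoes the transfer encoding and the charset for text parts.
    bodies[part.get_content_type()] = part.get_content()
    print(part.get_content_type(), encoding, disposition)

print(bodies["text/plain"])
```

Note how the quoted-printable `=C3=81` only becomes `Á` after both the transfer encoding and the UTF-8 charset have been applied, in that order.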

And lastly, what about those funny-looking lines like --00000000000010b1f705b6c1e5a2--? As part of the MIME Content-Type header, a “boundary” is provided to help the recipient parse the parts. Any string that is not otherwise present in the body of the email can be used. A line made of -- followed by the boundary indicates that a new part is starting, and the same line with a trailing -- closes the last part. Each part may have its own Content-Type and Content-Transfer-Encoding headers.
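
Going the other way can help to see where the boundary comes from. The sketch below builds a two-part message with Python’s standard `email.message.EmailMessage` and then looks for the boundary lines in the serialized text. The addresses are placeholders, and the boundary string is whatever the library generates, so it will look different from Gmail’s.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "sender@example.com"      # placeholder address
msg["To"] = "recipient@example.com"     # placeholder address
msg["Subject"] = "hello"

# A plain text body plus an HTML alternative gives multipart/alternative.
msg.set_content("Hi!")
msg.add_alternative("<p>Hi!</p>", subtype="html")

# Serializing the message is what makes the library pick a boundary.
raw = msg.as_string()
boundary = msg.get_boundary()

lines = raw.splitlines()
part_starts = [line for line in lines if line == "--" + boundary]
closing = [line for line in lines if line == "--" + boundary + "--"]

print(msg.get_content_type())
print(len(part_starts), "parts opened,", len(closing), "closing line")
```

Two opening lines (one per part) and a single closing line is the same pattern visible in the Gmail example.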

Message-ID

Message identifiers can be generated by the email client or by the first server processing the email, and need to be globally unique. To help with this, they use a subset of the email address format, so that each host may use its own scheme to identify messages.

These identifiers can be used to connect emails together in different ways such as using the In-Reply-To, References or Resent-Message-ID headers.
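
As a sketch of how these headers connect messages, the following uses `email.utils.make_msgid` from the Python standard library to generate identifiers and then links a reply to its parent. The helper `reply_headers` and the domains are illustrative only; real clients differ in how much of the References chain they keep.

```python
from email.message import EmailMessage
from email.utils import make_msgid


def reply_headers(parent_id: str, parent_references: str = "") -> dict:
    """Threading headers for a reply to the given parent (illustrative helper)."""
    # References carries the parent's own chain followed by the parent's id.
    references = f"{parent_references} {parent_id}".strip()
    return {"In-Reply-To": parent_id, "References": references}


# Each host generates ids under its own domain (hypothetical domains here).
original_id = make_msgid(domain="machine.example")
reply_id = make_msgid(domain="example.net")

original = EmailMessage()
original["From"] = "John Doe <jdoe@machine.example>"
original["To"] = "Mary Smith <mary@example.net>"
original["Subject"] = "Saying Hello"
original["Message-ID"] = original_id
original.set_content("This is a message just to say hello.")

reply = EmailMessage()
reply["From"] = "Mary Smith <mary@example.net>"
reply["To"] = "John Doe <jdoe@machine.example>"
reply["Subject"] = "Re: Saying Hello"
reply["Message-ID"] = reply_id
for name, value in reply_headers(original_id).items():
    reply[name] = value
reply.set_content("Hello to you too.")

# A reply to the reply extends the chain with the second id.
second = reply_headers(reply_id, str(reply["References"]))

print(original_id)
print(reply["In-Reply-To"])
print(second["References"])
```

A receiving system can then group messages into a conversation by following In-Reply-To and References back to an id it already knows.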

From La increible historia del Instituto de Computacion en 24 emails