Questions

Design

We need to support two main flows:

Email Sending flow
Email Receiving flow

Email Sending

We use a load balancer and a message queue to ensure our consumers do not become overwhelmed.

Web servers perform basic validation of the email. If a user is sending an email only to themselves, the email is never placed in the outgoing queue.
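
The web-server step described above could be sketched roughly as follows. This is an illustration only: the function name `handle_send_request`, the in-memory `mailboxes` dict and the loose address pattern are assumptions made for the example, not part of the original design.

```python
import re
from collections import deque

# Deliberately loose pattern for a basic sanity check (assumption for this
# sketch); it is not full RFC 5322 address validation.
ADDRESS_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def handle_send_request(sender, recipients, outgoing_queue, mailboxes, body):
    """Validate a send request and decide where the email goes.

    Returns "stored_locally" when the user is emailing only themselves,
    otherwise "queued" after appending to the outgoing queue.
    """
    if not recipients:
        raise ValueError("at least one recipient is required")
    for address in [sender, *recipients]:
        if not ADDRESS_PATTERN.match(address):
            raise ValueError(f"invalid address: {address!r}")

    # Self-addressed email skips the outgoing queue entirely.
    if set(recipients) == {sender}:
        mailboxes.setdefault(sender, []).append(body)
        return "stored_locally"

    outgoing_queue.append((sender, list(recipients), body))
    return "queued"
```

A `deque` works as a stand-in for the outgoing queue here; a real deployment would use the message queue mentioned earlier.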

Outgoing workers (step 5) mainly ensure that every message contains all the metadata required to send the email out. They also check that the email is free of viruses and spam before sending it. If the recipient's mail server is unavailable, a retry is scheduled for a later time.
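
One possible way to schedule those retries is exponential backoff on a time-ordered queue. The sketch below assumes a hypothetical `deliver` callable that returns `True` on success, and the delay values and attempt limit are made up for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class OutgoingEmail:
    # Only next_attempt_at takes part in ordering, so the heap pops the
    # email whose retry is due soonest.
    next_attempt_at: float
    message_id: str = field(compare=False)
    recipient: str = field(compare=False)
    attempts: int = field(default=0, compare=False)


def backoff_seconds(attempts, base=60.0, cap=3600.0):
    """Delay before the next try: base * 2^(attempts - 1), capped at `cap`."""
    return min(cap, base * 2 ** (attempts - 1))


def process(retry_queue, email, now, deliver, max_attempts=5):
    """Try to deliver once; on failure, requeue with a later attempt time.

    Returns "sent", "retry" or "failed".
    """
    if deliver(email):
        return "sent"
    email.attempts += 1
    if email.attempts >= max_attempts:
        return "failed"
    email.next_attempt_at = now + backoff_seconds(email.attempts)
    heapq.heappush(retry_queue, email)
    return "retry"
```

Capping the delay keeps a long outage from pushing retries out indefinitely, and the attempt limit gives the worker a point at which to report the failure back to the sender.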

Email Receiving Flow

We use SMTP to receive emails. Once an email is received, we do a few things:

If attachments are too large, we decouple them from the email and put them in the object store.
We use a queue as a buffer while we process the email. This is an important step because we need to check for spam, viruses and so on.
We then update the email, store it in the mail storage and cache, and upload any other attachments to the object store.
Lastly, if the user is online, we push the email to their device.
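
The first step above, decoupling oversized attachments, might look something like this. The size threshold, the `IncomingEmail` shape and the dict standing in for the object store are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical threshold; the notes do not say what counts as "too large".
MAX_INLINE_ATTACHMENT_BYTES = 1_000_000


@dataclass
class IncomingEmail:
    message_id: str
    body: str
    attachments: Dict[str, bytes] = field(default_factory=dict)
    # Attachment name -> object store key, for attachments moved out of the email.
    attachment_refs: Dict[str, str] = field(default_factory=dict)


def decouple_large_attachments(email, object_store, limit=MAX_INLINE_ATTACHMENT_BYTES):
    """Move attachments larger than `limit` into the object store.

    The email keeps only a reference, so the queue and the mail storage
    never have to carry the large payload.
    """
    for name in list(email.attachments):
        data = email.attachments[name]
        if len(data) > limit:
            key = f"{email.message_id}/{name}"
            object_store[key] = data
            email.attachment_refs[name] = key
            del email.attachments[name]
    return email
```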

Databases

Ideally, we need a database that can serve indexed queries quickly even when an individual email is up to 100 KB.

This makes relational databases a poor fit, since they are optimised for small rows rather than large chunks of email content. Distributed object storage is also a poor fit, since it cannot efficiently support indexing or operations such as marking specific emails as read.

In this case, since our workload will be much more write-heavy than read-heavy, a NoSQL database might be a good solution.

Data Model

We need two things for our databases:

Partition key: How we distribute data across nodes
Clustering key: How we sort data within a single partition

We have a few different types of data we need to store:

Emails By Folder: For user emails, we can partition by user_id and cluster by the timestamp at which the email was received.
User Email Folders: For user email folders, we can partition by user_id (no real need to sort).

Note that we store read and unread emails in separate tables. This is a form of denormalisation, which is common in NoSQL data models, and it lets us answer each query from a single table instead of filtering at read time.
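
To make the two keys concrete, here is a small in-memory model of the emails table. It is not how a real NoSQL store is implemented; the class name `EmailsByUser` and its methods are invented for the example. The partition key (`user_id`) selects a bucket, and the clustering key (`received_at`) keeps rows sorted inside that bucket.

```python
import bisect
from collections import defaultdict


class EmailsByUser:
    def __init__(self):
        # partition key (user_id) -> rows of (received_at, email_id),
        # kept sorted by the clustering key received_at.
        self._partitions = defaultdict(list)

    def insert(self, user_id, received_at, email_id):
        # insort keeps the partition ordered, so reads never need to sort.
        bisect.insort(self._partitions[user_id], (received_at, email_id))

    def latest(self, user_id, limit=10):
        """Return up to `limit` email ids for one user, newest first."""
        rows = self._partitions.get(user_id, [])
        return [email_id for _, email_id in reversed(rows[-limit:])]
```

Because every read touches a single partition and the rows are already ordered, fetching the newest emails for a user is a slice rather than a scan.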

Search

We can support email search in two main ways:

We build search into the database itself with a custom integration
We use a dedicated search engine such as Elasticsearch

Search has demanding requirements: indexing should be near real-time and the results have to be accurate.

We utilise a Kafka cluster here to act as a buffer between the services that perform reindexing and those that trigger the reindexing.
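
The buffering idea can be illustrated with a toy version in which Python's standard `queue.Queue` stands in for the Kafka topic and a dict of sets stands in for the search index. The class and method names are hypothetical, and tokenising by whitespace is a simplification.

```python
import queue
from collections import defaultdict


class SearchIndexer:
    def __init__(self):
        self.events = queue.Queue()     # stand-in for the Kafka topic
        self.index = defaultdict(set)   # token -> ids of emails containing it

    def publish_reindex(self, email_id, text):
        """Called by services that trigger reindexing (send, receive, delete)."""
        self.events.put((email_id, text))

    def drain(self):
        """Called by the indexing service; returns how many events it applied."""
        applied = 0
        while True:
            try:
                email_id, text = self.events.get_nowait()
            except queue.Empty:
                return applied
            for token in text.lower().split():
                self.index[token].add(email_id)
            applied += 1

    def search(self, term):
        return set(self.index.get(term.lower(), set()))
```

The producers never wait on the indexer: they only enqueue an event, and the indexer catches up at its own pace, which is the decoupling the buffer is meant to provide.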

A general rule of thumb is that for a smaller scale email system, Elasticsearch is a good option as it’s easy to integrate and doesn’t require significant engineering effort. For a larger scale, Elasticsearch might work, but we may need a dedicated team to develop and maintain the email search infrastructure. To support an email system at Gmail or Outlook scale, it might be a good idea to have a native search embedded in the database as opposed to the separate indexing approach.

Scalability

We also replicate data across multiple data centers. This provides two main benefits:

Users can communicate with servers that are closer to them, which reduces latency.
User data is safer, since if one data center goes down, we have another one as a backup.
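
Both benefits show up in how a request might be routed. The sketch below assumes we already have a per-user latency estimate and a health flag for each data center; the function name and the data center names in the usage note are made up.

```python
from typing import Dict, Optional


def pick_data_center(latency_ms: Dict[str, float], healthy: Dict[str, bool]) -> Optional[str]:
    """Choose the lowest-latency data center that is currently healthy.

    Returns None when no healthy data center is available.
    """
    candidates = [dc for dc in latency_ms if healthy.get(dc, False)]
    if not candidates:
        return None
    return min(candidates, key=lambda dc: latency_ms[dc])
```

For example, with latencies of 20 ms to `us-east` and 90 ms to `eu-west`, a user is served from `us-east` while it is healthy and falls back to `eu-west` when it is not.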

Concepts

Email Protocols

There are four protocols commonly used to send and receive emails:

SMTP: Used to send emails from one mail server to another
POP: A standard protocol for receiving and downloading emails from a remote mail server to a local email client. Once emails are downloaded to your computer or phone, they are typically deleted from the email server, which means the email is available only on that device
IMAP: Downloads a message only when you open it and does not delete the email from the server. This allows you to view the same email on multiple devices
HTTP: Used to access a mailbox for web-based email. Outlook uses HTTP to talk to mobile devices through a protocol called ActiveSync
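
The practical difference between POP and IMAP can be shown with a toy model. This is not an implementation of either protocol, and the class below is invented for the example; in real POP3 the client decides whether to delete messages after download, and deleting is simply the common default.

```python
class ToyMailServer:
    def __init__(self, messages):
        # message_id -> message body held on the server
        self.messages = dict(messages)

    def pop_download_all(self):
        """POP-style: hand over every message, then remove it from the server."""
        downloaded = dict(self.messages)
        self.messages.clear()
        return downloaded

    def imap_fetch(self, message_id):
        """IMAP-style: fetch one message on demand; the server keeps its copy."""
        return self.messages[message_id]
```

After an IMAP-style fetch a second device can still read the same message, whereas after a POP-style download the server has nothing left to offer it.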

Metadata

Typically, we can utilise an email header to store information about the email. This allows us to embed information such as the email thread within the header itself.

{
  "headers": {
    "Message-Id": "<7BA04B2A-430C-4D12-8B57-862103C34501@gmail.com>",
    "In-Reply-To": "<CAEWTXuPfN=LzECjDJtgY9Vu03kgFvJnJUSHTt6TW@gmail.com>",
    "References": ["<7BA04B2A-430C-4D12-8B57-862103C34501@gmail.com>"]
  }
}
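
As a rough sketch of how these headers could be used to rebuild a thread, the functions below group messages by the first message in their chain. They assume `References` lists ancestors oldest first, as replies conventionally build it, and that headers arrive as plain dicts shaped like the example above; the function names are hypothetical.

```python
from collections import defaultdict


def thread_root(headers):
    """Return the Message-Id that identifies the thread this email belongs to."""
    references = headers.get("References") or []
    if references:
        # Oldest ancestor first, so the first entry is the thread's root.
        return references[0]
    # No References: fall back to the direct parent, or treat the email
    # as the start of a new thread.
    return headers.get("In-Reply-To") or headers["Message-Id"]


def group_by_thread(messages):
    """Map each thread root to the Message-Ids in that thread, in input order."""
    threads = defaultdict(list)
    for headers in messages:
        threads[thread_root(headers)].append(headers["Message-Id"])
    return dict(threads)
```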

Ivan's notes (Powered by 🪴 Quartz 4.0)
