When you send a lot of email you get a lot of messages back from Mail Transport Agents (MTAs). These messages follow some standards but their content is very *home grown* in character with content that tells us a lot about what's going wrong. If one wishes to monitor the frequency of these messages and wishes to know when a *type* of message has increased often hand writing of regular expressions and combing through data by hand is the path taken. I don't want to do that. It's slow, error prone, and only works for messages already seen.

My approach to the problem is to learn the templates of incoming messages from the messages themselves. This is achieved as follows:

- Templates are discovered by recursively isolating the longest common substring of a set of messages until no commonality exists.
- The longest common substring is discovered using generalized suffix trees.

- These constructions occur in such a way as to minimize a theoretical compressed representation of the data with additional constraints.
- I attempt to make jokes along the way, many of which fall flat.

My approach yielded okay results - not great but not bad - but I think the approach is promising.

# Introduction¶

Email RFC 1891 from 1996 and RFC 3461 from 2003 are the standards that describe means of reporting status (failure, delay, success, relay, or expansion) about an email message's fate. These messages are useful in understanding what happened to your messages, but there's not a lot of structure required of them, and frequently it's infeasible to have a human read and understand all such messages.

Email service providers (ESPs) like to use their own message formats that sometimes include identifiers that help them with diagnosing problems. Sometimes ESPs will add or remove third party solutions to help manage spam and phishing attacks. Examples of such security appliances are Barracuda and Proofpoint. Each of these appliances have their own message formats.

The unstructured and shifting nature of these messages makes it difficult to identify what problem you're having when you're trying to deliver messages and getting lots of error messages back from ESPs. We want to know what unit is taking issue with us. Is it the host? Is it a security appliance? Is it a third party black listing? Is it greylisting because you're sending too fast? We also want to know how big of a problem it is, so we need to group messages together that have different diagnostic content within them. I was unable to find approaches to identifying these things that weren't massive hand-crafted collections of regular expressions.

Here, I divide this problem into two steps. The first step is deriving a template from a group of similar messages. The second step is determining a fast/goodish way of clustering messages into groups of similar messages so that a template can be constructed.

## Existing Standards and Technical Hurdles¶

There's good news and bad news.

The existing standards provide for an SMTP status code that will tell us if the message was accepted or if a temporary or permanent error occurred. In addition, often an ESMTP code is provided with more diagnostic information. These pieces of data can be used to group messages into much less diverse collections. That's the good news.

The bad news takes the form of two problems. First, the remainder of the message can be pretty much anything. It might have the user's email address, the domain name of the mail transport agent that received it, a timestamp, some 40 character long hexadecimal sequence that means *who-knows-what*, or even entire giant LDAP identifiers stretching thousands and thousands of characters... Further, messages can sometimes embed other messages that occurred along the way, so you might see the path of a message that bounced around gmail getting connection failures on 6 of their MTAs. Believe it or not, that's the *easy* problem.

The second and harder problem (actually an insurmountable problem but play along for the moment) is that templates are not uniquely *identifiable*. As a toy example, consider two templates:

`delivered to {email}`

`delivered successfully at {timestamp}`

These two templates are different, but how do we know that messages from these templates didn't come from

`delivered {message}`

We *don't* know. It was a trick question. Did I get you? I hope I did. So, there's no way to know we have a "correct" answer from the content of the messages alone. This does not only cause us to be too general - we may also become to specific. Consider the messages:

`delivery succeeded to alice@alice.com on 2018-04-01`

`delivery failed to bob@bob.com on 2018-04-01`

`delivered failed to alice@alice.com on 2018-04-02`

`delivered succeeded bob@bob.com on 2018-04-02`

Here, the template is clearly `delivery {status} to {email} on {date}`

, but is there any reason there can't be two templates:

`delivery succeeded to {email} on {date}`

`delivery failed to {email} on {date}`

That was another trick question. It could be that there are two templates. Thus, the best we can hope for is some kind of *parsimonious* set of templates that captures important structure of the messages. And due to the number of messages, it's likely that we'll need a heuristic approach to grouping such messages.

## Structure of This Post¶

I proceed as follows. First I describe how I derive a template from a collection of similar strings utilizing a data structure called a suffix tree. This includes discussion of a simple method to *smooth* the template to make it more amenable to data outside of the training set. Next, I discuss an edit-distance based measure of message similarity useful to select messages most likely to be part of the same template. The similarity measure is then combined with a template quality measure based on how well it compresses the message data, and a greedy method for constructing the templates is described. I conclude with a demonstration of this process on a data set I found on the Internet of all places, can you believe it?!

# Deriving a Template from a Collection of Similar Strings¶

Constructing a template from a collection of similar strings is achieved by recursively solving the longest common substring problem. Consider a collection of similar strings $\mathcal{S}$:

Beginning with $i=0$, find the longest common substring $s_i$ in $\mathcal{S}$

For each $S \in \mathcal{S}$ do

Split $S$ at $s_i$ and add the left part to $\mathcal{S}_\mathrm{left}$ and the right part to $\mathcal{S}_\mathrm{right}$

Recursively reapply the procedure with $i=i+1$ to $\mathcal{S}_\mathrm{left}$ and $\mathcal{S}_\mathrm{right}$ as long as they are non-empty. Assemble the $s_i$ strings into the template by noting that between adjacent $s_i$ and $s_{i+1}$ (and sometimes before the first or after the last) is a spot where the template can be filled with data.

To do this efficiently I make use of generalized suffix trees. Suffix trees are interesting data structures that "everyone in computer science knows" but that every computer scientist I've asked does not remember. I'll talk quickly about what is a *suffix tree* and how do I use them, but a much more thorough treatment can be found in Gusfield (1997).

### The Dark Art of Suffix Trees¶

Suffix trees - as I use them here - are a fast way to look for substrings in a string. Here, I'll use the ∎ character (`\u220e`

) to represent the end of a string. For example, the word *banana* would be written "banana∎". The non-trivial suffixes of "banana∎" are "banana∎", "anana∎", "nana∎", "ana∎", "na∎", and "a∎". If a substring occurs in "banana∎" then it will match the front of (or even a whole) suffix. If someone you know is feeling down just tell them to look on the bright side, every substring is a prefix of a suffix, so it will get better. It sounds like it should be reassuring.

Let's say we want to test if "nana" occurs in our string and we have a table of suffixes. We can simply go through the entries looking to see if the suffix starts with "nana". If no suffix starts with "nana" then "nana" cannot be a substring of "banana∎".

A suffix tree for as string $S$ is a tree constructed in such a way that every suffix of $S$ exists as a path from the root to some leaf, and where finding substrings of $S$ is very efficient. Each node has the property that no two branches start with the same character, so if you have a candidate string $T$ that you think might be a substring of $S$ simply go to the root of the tree and start walking down the branches that match $T$. If you traverse all of $T$ then $T$ must be a substring of $S$.

A more formal definition is that a suffix tree for a string $S$ of length $s$ is a tree with the following properties:

- It has exactly $s$ leaves.
- Except for the root, every internal node has at least two children.
- Each edge is labeled with a non-empty substring of $S$
- No two edges starting out of a node can have string-labels beginning with the same character.
- Every suffix exists as a concatenation of the sting-labels of a unique path from a leaf to the root.

Adding the terminal character helps ensure these properties exist. As an example, consider the suffix tree for "banana∎" presented below:

```
SuffixTree('banana').plot()
```

It's easy to verify that the five properties are satisfied by this tree.

My naive implementation for constructing suffix trees creates the tree by creating a node for the longest suffix and then splits that node when we would trace out a new suffix and mismatch somewhere in the node, adding to the tree to get to the final result. Note that there exist linear time algorithms to produce this tree (see for example see Ukkonen's algorithm in Gusfield (1997)). A key piece of my naive method is that I allow adding additional strings to the tree, and I keep count of how many different strings touched each node. Consider, for example the following suffix tree for *banana* and *cabana*, where the number in the node indicates how many of the strings touched it.

```
S = SuffixTree('banana')
S.add('cabana')
S.plot(counts=True)
```

In this tree, the value at a node tells you how many of the strings posses the substring made from the root to the node being examined. The sum of the node values of children tell you how many times the substring occurred in all of the strings. For example, "na" has two leaves, one with label 1, and one with label 2, so we know "na" occurred 3 times in all between "banana" and "cabana". The fact that "na" has a value of 2 tells us that it occurs at least once in both "banana" and "cabana". For template searching, we want to know the substrings that occur in all of the strings, so we can discard all the nodes of weight 1 and their edges. This yields the following tree:

```
S = SuffixTree('banana')
S.add('cabana')
S.prune()
S.plot()
```

It's easy to verify that "bana", "ana", "na", and "a" occur in "cabana" and "banana". Let us now work through building a template via suffix tree for this example. First, we find the longest common substring, which will be the longest rooted path on this tree (excluding ∎, and where *length* is the sum of the lengths of the labels of the nodes we pass through). The longest path is "bana", so splitting we get $\mathcal{S}_\mathrm{left}$ `= ['', 'ca']`

and $\mathcal{S}_\mathrm{right}$ `= ['na', '']`

. Since there's no longest common substring in either side we assemble our template:

```
tpl = '{}bana{}'
print(tpl.format('', 'na'))
print(tpl.format('ca', ''))
```

Consider a slightly longer example, where $\mathcal{S}$ is:

```
ss = [
'email to Robert@gmail.com failed.',
'email to Andre@yahoo.com failed.',
'email to Edgar@cox.net failed.'
]
```

The longest common substring is "email to " which leads us to recurse on the right side set. This yields " failed." which leads us to recurse on the left side set. So far so good. But now the longest common substring is "co" from ".com", ".com", and "@cox". This causes us to build the following template:

```
S = SuffixTree(ss[0])
for s in ss[1:]:
S.add(s)
S.prune()
print(repr(str(S.template())))
```

This template isn't great but it isn't bad. The problem is that there are constant parts that a human knows are just coincidentally constant. We could repair this error by noting the class of the characters that compose the non-constant segments of the recursive division of the strings, and then expand the non-constant segments by including the character of the same class from neighboring sets. I call this smoothing the template. To smooth this template note the different sets $\mathcal{S}$ for our example:

```
pprint(S.template().columns)
```

Note that the second element is lower- and upper-case letters, and we could expand it to include the third and fourth elements by just including characters of that class. The sixth and eight element are lower-case letters and period, so expanding those would consume the bothersome "r" and "co" constant elements. We can generate a string that would force this expansion using some character we won't see in the real data, e.g. "⋀" character (\u22e0), to ensure we won't match any of our other strings on accident. Plus "⋀" looks pretty cool, which is an added benefit.

```
smoother = S.template().smoothing_string()
print(smoother)
S.add(smoother)
S.prune()
print(repr(str(S.template())))
```

That's a pretty good template. Looking at the $\mathcal{S}$ sets we find things are nicer:

```
pprint(S.template().columns)
```

This template is much preferable.

Let us now consider a more in-depth example. A specific hour of logs were used from an open PMTA node found via judicious googling (the data itself is from http://ks.api-d.com:8080/). A set of 19,974 success DSN messages such as

`smtp;250 Requested mail action okay, completed: id=1MpVYu-1ejkL32bDI-00pkJ6`

and

`smtp;250 Requested mail action okay, completed: id=1MNLyQ-1esDHg1Gjy-00P3M1`

were collected (all messages starting with the same front up to the `=`

).

It happens sometimes that adding a string doesn't change the suffix tree at all, but *does* change the template. This is because the tree does not possess some of the order information that is in the template. We can make a fast greedy testing function that uses the template and determines if a string is already representable with that template. We can skip every representable string and only add *informative* strings to the tree and template. In this way we can create a template for a collection of millions of messages but have it constructed from a subset of tens of strings. I'll call this subset the *support* of the template. Consider the following:

```
ss = [s for s in messages if s.startswith('smtp;250 Requested mail action okay, completed: id=')]
S = SuffixTree(ss[0])
tpl = S.template()
for s in ss[1:]:
if tpl.test(s) is False:
S.add(s)
S.prune()
tpl = S.template()
print(
'A collection of {} strings was provided, but only {} were used for the following template:'.format(
len(ss), len(tpl.columns[0])
)
)
print(repr(str(S.template())))
```

The 7 strings used to construct this template, the *support* of the template if you will, are:

```
pprint(list(S.strwt.keys()))
```

As you can see, the template is pretty good. Also, it constructed very quickly. When I looked at these messages initially I did not pick up the "-1" and "-0" constants that exist there, instead assuming that something like

`smtp;250 Requested mail action okay, completed: id={}`

would be constructed. I don't know if the "-0" and "-1" have special meaning or are just circumstance of when we sampled our messages.

# Measures of Message Similarity¶

Since we're wanting to construct groups of similar messages we will want to take a candidate string $T$ and decide if we will add it to group $\mathcal{S}$ composed of strings $S_i$, $i=1, \ldots, n$. If we decide to add $T$ to no such group then we'd create a new group based on $T$.

The obvious approach would be to use edit distance. Edit distance (often called Levenshtein distance) counts edits in terms of inserts, deletes, and replacements to convert one string to another, but I use the number of characters inserted or deleted (such that a replace counts for a deletion and then an insert) for this metric, which I will write as $d_\mathrm{edit}(\cdot, \cdot)$. To use edit distance to determine if $T$ should be added to $\mathcal{S}$ we would need some string $S$ representing the *center* of group $\mathcal{S}$, but such an $S$ probably doesn't exist. We could take the average $$
\frac{1}{n} \sum_{i=1}^n d_\mathrm{edit}(T, S_i)
$$
for all $n$ strings in our corpus but that would be expensive to calculate for a group of tens of thousands of messages. But if we take the average over just those strings in the support of the template then it's actually pretty fast. (Another thing to consider is that there's a slight bias towards shorter strings, where only additions are needed where-as comparable longer strings require deletions and additions. I just pretend that doesn't happen.)

*Note: Implementations of edit distance often use dynamic programming (Gusfield, 1997). The python package python-Levenshtein is very fast for calculating this (much faster than the built-in difflib) for these sized messages, and so python-Levenshtein should be used instead of difflib for implementation.*

Consider messages of the form

`smtp;250 Requested mail action okay, completed: id={}`

from the hour's worth of activity. Below is a plot of the distance of each of these messages to the template, compared to the distribution of the distances of all other messages from that hour of activity that do not fit the template. The lower region shows that messages much like our template have very small distances to the template. The upper region shows that messages that are not in the form given above are a large distance away, and that there is no overlap where some message that are *not* of the form above are about the same distance as some that are. This separation is important because it gives us hope that we can detect that our template is goodish when the distribution of messages that fit the template are tight around the template's supports while other messages are farther away.

```
df = pd.DataFrame({
'edit_distance': [S1.edit_distance(s) for s in full1] +
[S1.edit_distance(s) for s in full1_complement],
'group': [1 for _ in full1] + [2 for _ in full1_complement]
})
```

These distances clearly differentiate between our group and those messages outside of group. Longer messages tend to have diagnostic information that changes between messages, and so the distances between members of a group will be larger, but interestingly enough the variance between them is small. Even so, the fact that the edit distances are large makes it difficult to determine if a message is in-group or out of group. Consider messages of the following form:

`smtp;250 2.6.0 <{}.prod.protection.outlook.com> [InternalId={}, Hostname={}.prod.protection.outlook.com] {} bytes in {}, {} KB/sec Queued mail for delivery`

such as

`smtp;250 2.6.0 <96b818e2-67de-486b-a92a-e22cdab054f8@DB5EUR01FT002.eop-EUR01.prod.protection.outlook.com> [InternalId=3994319627907, Hostname=DB5EUR01HT001.eop-EUR01.prod.protection.outlook.com] 9108 bytes in 2.696, 3.298 KB/sec Queued mail for delivery`

and

`smtp;250 2.6.0 <55f64e64-ef30-471c-9c70-c456b0e0cffd@DM3NAM06FT902.Eop-nam06.prod.protection.outlook.com> [InternalId=6167573060640, Hostname=DM3NAM06HT002.Eop-nam06.prod.protection.outlook.com] 9614 bytes in 0.119, 78.535 KB/sec Queued mail for delivery`

The good news is that separation seems severe, but the bad news is that the mean distance and variance is very different for different message templates. To put the bad news another way, there's no clear cutoff to use for all templates. The number of messages is too large to use hierarchical clustering (when you have 10,000 messages you have 49,995,000 comparisons to make). We also have no idea how many clusters exist so things like $k$-means are not viable.

My particular needs require a heuristic that can handle *largish* amounts of data. I also need it to be parallelizable. To those ends I took the approach that follows.

# Clustering Messages¶

My strategy for clustering is to start with a seed message and then look for messages to include that make a template that has the best lossless compression ratio for its messages. As I mentioned before, templates are not identifiable in the sense that one can do either of the following:

*Generalization*: shift characters in the fixed into neighboring variable selections,*Specification*: make part of the variable area fixed, making a version of the template for each such value.

Neither generalization nor specification are necessarily errors. A template may only refer to a specific MTA but the MTA may be variable in case more are added in the future. Similarly, some messages are composed of multiple layers of templates, and so combining the outer layers into a multiple templates is permissible.

Because of this ambiguity I rely on the set of messages with the best compression ratio. Compression, as a metric for goodness of fit, has been around for a while. Grunwald (2007) gives an excellent treatment of the minimum description length principle and the correspondence between information and likelihood.

### Compressability (Compression Ratio)¶

Consider a set of strings $\mathcal{S} = \{S_i, i=1, \ldots, n\}$ that are the support of a template. The template decomposes each of those strings into pieces $S_{[i,j]}, j=1, \ldots, m$ representing the $j$th piece of string $i$ from $\mathcal{S}$. Consider the template for the example message above

`smtp;250 Requested mail action okay, completed: id={}-1{}-0{}`

The support $\mathcal{S}$ is composed of $n=7$ support messages

```
['smtp;250 Requested mail action okay, completed: id=1MpVYu-1ejkL32bDI-00pkJ6',
'smtp;250 Requested mail action okay, completed: id=1MNLyQ-1esDHg1Gjy-00P3M1',
'smtp;250 Requested mail action okay, completed: id=1MoxOk-1ekNMC3m0M-00qVCo',
'smtp;250 Requested mail action okay, completed: id=1M4r4t-1f4Vsd20TH-001ja2',
'smtp;250 Requested mail action okay, completed: id=1N6co2-1eSgoP434Q-017Yvx',
'smtp;250 Requested mail action okay, completed: id=0M1ANs-1eDnhf0QtE-00t9c6',
'smtp;250 Requested mail action okay, completed: id=0MFyek-1fHR353uMQ-00Ewbv']
```

which are divided into $m=6$ parts each by the template. The decomposition for $S_1$, $(S_{[1,j]} ~\mathrm{for}~ j = 1, \ldots, m)$ is:

```
( 'smtp;250 Requested mail action okay, completed: id=',
'1MpVYu',
'-1',
'ejkL32bDI',
'-0',
'0pkJ6')
```

When I write the $j$th column I mean the array of items $S_{[i,j]}$ for $i=1,\ldots,n$. In the above example, the odd-indexed columns are all constant, where as the even indexed columns are all variable.

To calculate the cost of storing a template I do as follows:

- For each constant column we add the length of only one element to the cost plus 1 for a termination character,
- For each variable constant column we add the length of each unique string plus 1 to the cost, and then add 1 for each row to represent a pointer to the unique elements.

I then compare that to the baseline - the sum of $|S_i|$ for $i=1, \ldots, n$, plus $n$ for the termination character of each string. The ratio of the cost to the baseline is the compression ratio. A template with a smaller compression ratio is in some sense a better template than one with a larger ratio.

The use of compression ratio causes a problem in that I'm using it locally. If I have five templates for ten strings, then there's no reason to ever go to one template for all ten since the compression ratio must be worse no matter what those strings are. Boo!

We need other ways to look our performance that capture more global behavior.

### Representability¶

One other thing we can look at is what proportion of the messages we can compress. Thus, if we can compress 50% of the messages with one template at a compression ratio of 0.2 or we can compress 10% of the messages with a ratio of 0.1 we might choose the former over the latter.

### Uniformity¶

The last way I try to look at the quality of the template is by how uniform the occurrences of entries tends to be. The general idea is that, for things that vary in a template, we should see about equal number of each value. We'd expect different templates to occur with differing frequencies as things happen. Thus, we can use how uniform the distribution of values is as a measure of template quality.

For example, if each log message contains a timestamp then we should see the same number of each timestamp we see more-or-less. If it's the domain name of the mail transport agent that accepted the message (say mail1.doodlebug.net, mail2.doodlebug.net, and mail3.doodlebug.net) then we'd expect to see approximately equal numbers of messages to each of them. If, however, we got the templating wrong and we have the following messages:

- 10 copies of "Delivery failed due to act of god"
1000 copies of "Deliver failed due to act of dog"

then the wildly uneven frequency of values is the only clue we get that the template "Delivery failed due to act of {}o{}" is probably no good.

To compute a uniformity statistic I simple look at the relative entropy of the message components in the support. Specifically, for template $t_i$ the entropy $H(t_i)$ is divided by the entropy of a uniform distribution giving us our measure of uniformity $U(t_i)$. We define $U(t_i)$ by summing over each of the $n$ supports with weighted frequency $p_j$ for $j \in \mathcal{S}$ is just the proportion of uniform entropy achieved.

$$ H(t_i) = - \sum_{j \in \mathcal{S}} p_j \log p_j, \\ U(t_i) = \frac{\sum_{j \in \mathcal{S}} p_j \log p_j}{\sum_{j \in \mathcal{S}} 1/n \log 1/n} $$

and thus varies in $[0,1]$.

### Template Quality Metric¶

Combining the above is trivially simple. For compressability $c$, representability $r$, and uniformity $u$ I just take the geometric mean:

$$ \text{score} = 1 - \bigl((1-c) r u\bigr)^{1/3}. $$## Training a Template¶

Training of a template proceeds as follows:

- Chose some message as a
*seed*and construct a template for it. The template will be constant, with the seed as the only support. This is epoch 0. - Order the other messages in ascending order of edit distance and for each do:
- If the message fits the template then remove it and continue.
- If the message does not fit the template then add it to the support and rebuild the template.

- Choose the template with the smallest template quality metric over each constructed this way.
- Go to step 2 and repeat the process until the template quality metric fails to improve.

When we add the wrong message to a template it causes the template to over-generalize and quickly leads to a template that matches everything. This ensures that we're not doing *many* additions to the tree. The most expensive part of this is ranking messages by edit distance, which is easily parallelizable.

# Conclusion¶

So, this has been fun. By incorporating smoothing strings, by stratifying the data by SMTP/ESMTP code, and by pre-processing the messages to remove references to the address and domain you sent from/to you can improve the performance of the clustering further. Suffix trees are fun, and so is clustering. I think the method I described is far from perfect, but it does allow one to do two very important things:

- Establish a type labeling for messages that doesn't rely on just the domain. If an event is occurring on several domains then you'll be able to see the event because we're following the messages and not the points of origin.
- Learn new templates as domains add new security appliances and as software changes over time. We don't need a person manually hand-crafting regexes anymore, or more accurately, we can be a friendly guide to our friends-that-use-back-reference-ungreedy-dot-expressions.

I really only wrote about things that worked *okay* to *well*, and ignore a lot of false starts. One promising false start was using something like dictionary based compression to automatically create a coordinate system for the messages. Dictionary compression schemes operate by building up a dictionary of previously seen information (but I keep keys an absurdly long time compared to actual dictionary based compression implementations) and inserts references to which dictionary element is seen. Other false starts are not mentioned so that I seem cooler than I am.

I debated putting in source code because I really need to rewrite the suffix trees in cython or c and make it efficient. There's several good resources on suffix trees, for example Lloyd Allison has a tutorial and Brenden Kokoszka has an interactive Ukkonen's Algorithm that's pretty cool.

# Appendix: Source Code¶

```
```

## Bibliography

P.D. Grünwald.
*The Minimum Description Length Principle*.
Adaptive computation and machine learning.
MIT Press, 2007.
ISBN 9780262072816.
URL: https://books.google.com/books?id=mbU6T7oUrBgC. ↩

Dan Gusfield.
*Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology*.
Cambridge University Press, New York, NY, USA, 1997.
ISBN 0-521-58519-8. ↩ ^{1} ^{2} ^{3}