Wilkinson-Rogers Notation: A Formula by any Other Name

The formula environment in R and implemented in Python’s statsmodels package is a powerful and expressive way to describe models and transforms to data. It’s often described as Wilkinson’s notation or Wilkinson-Rogers notation. Here, I break down how it works.


July 15, 2019

Wilkinson-Rodgers notation (often just called Wilkinson’s notation) is a domain specific language for defining a transformation of data (often a data matrix in terms of the levels of the source data) into a model form (usually a data matrix in a format for modeling as well as information about the role of the columns and how data maps into the new structure). For example, if the design matrix has a column backgroundColor with levels red, green, and blue then the model matrix would have three indicator columns backgroundColor[red], backgroundColor[green], and backgroundColor[blue] where the column has a 1 when its color is present and -1 otherwise.

Statements in this notation are called formula in several languages. The notation finds itself used in several software products (e.g. R, S, Python’s statsmodels, Matlab) because it is a convenient and flexible way to express the form of a model and the relationship between terms.

The most commonly used symbols in the notation are:

Symbol Example Meaning
`~` `y~x` Usually separates the resonde (y) from the the predictors (x)
`+` `+x` Include a variable
`-` `-x` Remove a variable
`:` `x1:x2` Include an interaction variable (elementwise product of x1 and x2)
`*` `x1*x2` Equivalent to x1 + x2 + x1:x2
`^` `x^3` Equivalent to x * x * x
`1` `-1` Add or remove an intercept term
`(` and `)` `(x1+x2)*(x3+x4)` Force precedence of evaluation

Two things to note about this.

We can use this to describe some complicated models. For example

represents the following facts:

Let’s say we didn’t want the highest order interaction terms. Then we could express that as
y~A*B*C*D - A:B:C:D.
If we only want main effects and two-factor interactions then we could represent this compactly as

Functions and Annotations

Wilkonson’s notation supports functions as well. If an identifier is bare and connected to ( and ) then it is treated as a function invocation. The rules for evaluation get a little different here. Consider the following model:
This model will log-transform y and make the new column log(y) the response, with all orders of interaction between the factors A, B, C, and D represented in the predictors. All expressions within a function call are interpreted to mean their literal arithmetic interpretation and do not follow the rules above. For example, the response in
log(y+1) ~ A*B*C*D

represents the log of the value of y with 1 added to the value of each element.

A utility function I(...) exists to allow passing back the literal value instead of interpreting it with the top-level rules. Consider the following examples where x is a continuous value:
y~x^2 versus y~I(x^2).

The left formula evaluates x^2 as x*x which is then x + x + x*x which is really just x. The right formula treats I(x^2) as the square of each element which is probably what we intended.

In R and S some functions are treated more as annotations. For example, Error(...) is used to mark certain columns as part of the model’s error terms (stratification) and so they are not treated as a factor in the model and they are introduced into the sequential ANOVA with the intercept.


Elements within an annotation are evaluated via the top level rules (so Error(A*B) is equivalent to Error(A+B+A*B) and not Error(A:B)) which is confusing and pollutes the grammar, so some packages will use a | to indicate that all elements to the left are factors and all elements to the right are error terms, e.g.
y~A*B*C | DayOfWeek * DeviceType
would create strata for the device type at each day of the week thus blocking those effects. Another way one could represent this is with an additional ~ instead, e.g. 
y~A*B*C ~ DayOfWeek * DeviceType.

The reason for this second form is that random effects models are often denoted using the | symbol to indicate the level at which the random effect exists. For example, (1|DeviceType) would allow a random intercept for each device type, and (x|Country) would allow a random slope on x for each country. More complicated hierarchical models can be specified this way. In general, R’s formula extensions treat | as a generic separator and allow the consumer to define the behavior, which is a nice way to do it.

Additionally, R uses the / symbol to indicate nesting hierarchical relationships, so
would represent a model where the levels of B only make sense in the context of the levels of A, so this is equivalent to
y~A+A:B which also can be expressed y~ A + B %in% A in R.
Another extension often used is to allow the left hand side of ~ to represent multiple responses. If we wanted to indicate that both y1 and y2 are responses then we could do
y1 + y2 ~ A * B * C or y1 | y2 ~ A * B * C.
Note that in logistic regresson models the left-hand side has a two-part response where the first represents the number of successes and the second represents the number of failures, for example
cbind(succ,fail) ~ A*B

represents a two-column response (using R’s cbind function to bind columns together). In multinomial models we would have any number of columns to bind for the response.

The python statsmodel package allows function calls to have an argument list. This allows then to do things like declare the encoding scheme in the formula, for example
y~C(A, Sum)

will use the sum-to-zero contrast on factor A. This has a lot of flexible applications.


Having an honest-to-God parser and domain specific language for formulas can be of great benefit to us. We can extend the grammar as needed to meet the needs of different models. It’s important to keep its role restricted to expressing relationships and transforming matrices into other matrices, but even in this role it can smooth a lot of bumps. It can be used for all of our model-type endpoints and free the developer to think more about the analysis rather than worrying about how to express the model.

Appendix A - Antlr4 Grammar

    Wilinson Notation Formula Grammar (Antlr4)

    According to Wilkinson's rules lowerExpr are evaluated directly where as topExpr are expanded.

grammar Formula;

// - Parser Declarations ----------------------------------------------------------------------------------------

// A forumal is a binary relation of top level lowerExpressions
formula:   topExpr? ('~' topExpr)+ EOF;

// A function call changes the rules.  lowerExpressions inside the function are not top-level lowerExpressions
funcall:   funcname=ID '(' funcargs=arglist ')';

// Top level lowerExpressions
topExpr:   funcall                                       # topFunctionCall
       |   <assoc=right> lhs=topExpr '^' rhs=topExpr     # topExprExponentiate
       |   ('-'|'+') topExpr                             # topExprUnary
       |   lhs=topExpr ':' rhs=topExpr                   # topExprInteraction
       |   lhs=topExpr ('*'|'/') rhs=topExpr             # topExprMultiplication
       |   topExpr op=('+'|'-') topExpr                  # topExprAddition
       |   topExpr '|' topExpr                           # topExprCondition
       |   '(' topExpr ')'                               # topExprParens
       |   ID                                            # topExprIdentifier
       |   INT                                           # topExprConstant

// lowerExpressions that can exist within function calls
lowerExpr:   funcall                                                  # lowerExprFunctionCall
         |   <assoc=right> lowerExpr '^' lowerExpr                    # lowerExprExponentiate
         |   op=('-'|'+') lowerExpr                                   # lowerExprUnary
         |   lowerExpr ':' lowerExpr                                  # lowerExprInteraction
         |   lhs=lowerExpr op=('*'|'/') rhs=lowerExpr                 # lowerExprMultiplication
         |   lhs=lowerExpr op=('+'|'-') rhs=lowerExpr                 # lowerExprAddition
         |   '(' lowerExpr ')'                                        # lowerExprParens
         |   lowerExpr op=('>'|'>='|'<'|'<='|'=='|'!=') lowerExpr     # lowerExprComparison
         |   ('!') lowerExpr                                          # lowerExprNegation
         |   lowerExpr ('&&') lowerExpr                               # lowerExprLogialAnd
         |   lowerExpr ('||') lowerExpr                               # lowerExprLogicalOr
         |   ID                                                       # lowerExprIdentifier
         |   STRING                                                   # lowerExprString
         |   INT                                                      # lowerExprInt
         |   FLOAT                                                    # lowerExprFloat

arglist :   arg (',' arg)*;

arg :  name=ID '=' val=lowerExpr         #namedArgument
    |  val=lowerExpr                     #unnamedArgument

// - Token Declarations -----------------------------------------------------------------------------------------

HEX :   '0' ('x'|'X') HEXDIGIT+ [Ll]? ;

INT :   DIGIT+ [Ll]? ;

HEXDIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;

    |   DIGIT+ EXP? [Ll]?
    |   '.' DIGIT+ EXP? [Ll]?

DIGIT:  '0'..'9' ;

EXP :   ('E' | 'e') ('+' | '-')? INT ;

    :   '"' ( ESC | ~[\\"] )*? '"'
    |   '\'' ( ESC | ~[\\'] )*? '\''

ESC :   '\\' [abtnfrv"'\\]
    |   HEX_ESCAPE


    :   '\\' [0-3] [0-7] [0-7]
    |   '\\' [0-7] [0-7]
    |   '\\' [0-7]

    :   '\\' HEXDIGIT HEXDIGIT?;

ID  :   '.'
    |   (LETTER|'_') (LETTER|DIGIT|'_')*
    |   '`' (~[\\`])*? '`'

fragment LETTER  : [a-zA-Z];

WS      :   [ \t\u000C\r\n]+ -> skip ;