Symbol | Example | Meaning |
---|---|---|
`~` | `y~x` | Usually separates the resonde (y) from the the predictors (x) |
`+` | `+x` | Include a variable |
`-` | `-x` | Remove a variable |
`:` | `x1:x2` | Include an interaction variable (elementwise product of x1 and x2) |
`*` | `x1*x2` | Equivalent to x1 + x2 + x1:x2 |
`^` | `x^3` | Equivalent to x * x * x |
`1` | `-1` | Add or remove an intercept term |
`(` and `)` | `(x1+x2)*(x3+x4)` | Force precedence of evaluation |
Wilkinson-Rogers Notation: A Formula by any Other Name
The formula environment in R and implemented in Python’s statsmodels package is a powerful and expressive way to describe models and transforms to data. It’s often described as Wilkinson’s notation or Wilkinson-Rogers notation. Here, I break down how it works.
Wilkinson-Rodgers notation (often just called Wilkinson’s notation) is a domain specific language for defining a transformation of data (often a data matrix in terms of the levels of the source data) into a model form (usually a data matrix in a format for modeling as well as information about the role of the columns and how data maps into the new structure). For example, if the design matrix has a column backgroundColor
with levels red
, green
, and blue
then the model matrix would have three indicator columns backgroundColor[red]
, backgroundColor[green]
, and backgroundColor[blue]
where the column has a 1 when its color is present and -1 otherwise.
Statements in this notation are called formula in several languages. The notation finds itself used in several software products (e.g. R, S, Python’s statsmodels, Matlab) because it is a convenient and flexible way to express the form of a model and the relationship between terms.
The most commonly used symbols in the notation are:
Two things to note about this.
- In most models the intercept is always included by default, so if you don’t see
-1
then the intercept will be present. - Interactions of a term with itself returns itself, e.g.
x:x
is equivalent tox
. - Since
+x
indicates adding a termx
, the expressionx + x
is equivalent tox
.
y~A*B*C*D
represents the following facts:
- Our response is
y
. - Our factors are
A
,B
,C
, andD
. - Include all 2-way, 3-way, and the 4-way interaction terms (e.g.
A:B
,A:C:D
, andA:B:C:D
) - The terms
A:B
is secondary toA
andB
and should have its effects shown in a different row of the ANOVA table.
y~A*B*C*D - A:B:C:D
.
y~(A+B+C+D)^2
.
Functions and Annotations
Wilkonson’s notation supports functions as well. If an identifier is bare and connected to(
and )
then it is treated as a function invocation. The rules for evaluation get a little different here. Consider the following model:
log(y)~A*B*C*D
y
and make the new column log(y)
the response, with all orders of interaction between the factors A
, B
, C
, and D
represented in the predictors. All expressions within a function call are interpreted to mean their literal arithmetic interpretation and do not follow the rules above. For example, the response inlog(y+1) ~ A*B*C*D
represents the log of the value of y
with 1 added to the value of each element.
I(...)
exists to allow passing back the literal value instead of interpreting it with the top-level rules. Consider the following examples where x
is a continuous value:
y~x^2
versus y~I(x^2)
.
The left formula evaluates x^2
as x*x
which is then x + x + x*x
which is really just x
. The right formula treats I(x^2)
as the square of each element which is probably what we intended.
In R and S some functions are treated more as annotations. For example, Error(...)
is used to mark certain columns as part of the model’s error terms (stratification) and so they are not treated as a factor in the model and they are introduced into the sequential ANOVA with the intercept.
Extensions
Elements within an annotation are evaluated via the top level rules (soError(A*B)
is equivalent to Error(A+B+A*B)
and not Error(A:B)
) which is confusing and pollutes the grammar, so some packages will use a |
to indicate that all elements to the left are factors and all elements to the right are error terms, e.g.
y~A*B*C | DayOfWeek * DeviceType
~
instead, e.g.
y~A*B*C ~ DayOfWeek * DeviceType
.
The reason for this second form is that random effects models are often denoted using the |
symbol to indicate the level at which the random effect exists. For example, (1|DeviceType)
would allow a random intercept for each device type, and (x|Country)
would allow a random slope on x
for each country. More complicated hierarchical models can be specified this way. In general, R’s formula extensions treat |
as a generic separator and allow the consumer to define the behavior, which is a nice way to do it.
/
symbol to indicate nesting hierarchical relationships, so
y~A/B
B
only make sense in the context of the levels of A
, so this is equivalent to
y~A+A:B
which also can be expressed y~ A + B %in% A
in R.
~
to represent multiple responses. If we wanted to indicate that both y1
and y2
are responses then we could do
y1 + y2 ~ A * B * C
or y1 | y2 ~ A * B * C
.
cbind(succ,fail) ~ A*B
represents a two-column response (using R’s cbind
function to bind columns together). In multinomial models we would have any number of columns to bind for the response.
y~C(A, Sum)
will use the sum-to-zero contrast on factor A
. This has a lot of flexible applications.
Conclusion
Having an honest-to-God parser and domain specific language for formulas can be of great benefit to us. We can extend the grammar as needed to meet the needs of different models. It’s important to keep its role restricted to expressing relationships and transforming matrices into other matrices, but even in this role it can smooth a lot of bumps. It can be used for all of our model-type endpoints and free the developer to think more about the analysis rather than worrying about how to express the model.
Appendix A - Antlr4 Grammar
/*
Wilinson Notation Formula Grammar (Antlr4)
According to Wilkinson's rules lowerExpr are evaluated directly where as topExpr are expanded.
*/
grammar Formula;
// - Parser Declarations ----------------------------------------------------------------------------------------
// A forumal is a binary relation of top level lowerExpressions
formula: topExpr? ('~' topExpr)+ EOF;
// A function call changes the rules. lowerExpressions inside the function are not top-level lowerExpressions
funcall: funcname=ID '(' funcargs=arglist ')';
// Top level lowerExpressions
topExpr: funcall # topFunctionCall
| <assoc=right> lhs=topExpr '^' rhs=topExpr # topExprExponentiate
| ('-'|'+') topExpr # topExprUnary
| lhs=topExpr ':' rhs=topExpr # topExprInteraction
| lhs=topExpr ('*'|'/') rhs=topExpr # topExprMultiplication
| topExpr op=('+'|'-') topExpr # topExprAddition
| topExpr '|' topExpr # topExprCondition
| '(' topExpr ')' # topExprParens
| ID # topExprIdentifier
| INT # topExprConstant
;
// lowerExpressions that can exist within function calls
lowerExpr: funcall # lowerExprFunctionCall
| <assoc=right> lowerExpr '^' lowerExpr # lowerExprExponentiate
| op=('-'|'+') lowerExpr # lowerExprUnary
| lowerExpr ':' lowerExpr # lowerExprInteraction
| lhs=lowerExpr op=('*'|'/') rhs=lowerExpr # lowerExprMultiplication
| lhs=lowerExpr op=('+'|'-') rhs=lowerExpr # lowerExprAddition
| '(' lowerExpr ')' # lowerExprParens
| lowerExpr op=('>'|'>='|'<'|'<='|'=='|'!=') lowerExpr # lowerExprComparison
| ('!') lowerExpr # lowerExprNegation
| lowerExpr ('&&') lowerExpr # lowerExprLogialAnd
| lowerExpr ('||') lowerExpr # lowerExprLogicalOr
| ID # lowerExprIdentifier
| STRING # lowerExprString
| INT # lowerExprInt
| FLOAT # lowerExprFloat
;
arglist : arg (',' arg)*;
arg : name=ID '=' val=lowerExpr #namedArgument
| val=lowerExpr #unnamedArgument
;
// - Token Declarations -----------------------------------------------------------------------------------------
HEX : '0' ('x'|'X') HEXDIGIT+ [Ll]? ;
INT : DIGIT+ [Ll]? ;
fragment
HEXDIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
FLOAT: DIGIT+ '.' DIGIT* EXP? [Ll]?
| DIGIT+ EXP? [Ll]?
| '.' DIGIT+ EXP? [Ll]?
;
fragment
DIGIT: '0'..'9' ;
fragment
EXP : ('E' | 'e') ('+' | '-')? INT ;
STRING
: '"' ( ESC | ~[\\"] )*? '"'
| '\'' ( ESC | ~[\\'] )*? '\''
;
fragment
ESC : '\\' [abtnfrv"'\\]
| UNICODE_ESCAPE
| HEX_ESCAPE
| OCTAL_ESCAPE
;
fragment
UNICODE_ESCAPE
: '\\' 'u' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT
| '\\' 'u' '{' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT '}'
;
fragment
OCTAL_ESCAPE
: '\\' [0-3] [0-7] [0-7]
| '\\' [0-7] [0-7]
| '\\' [0-7]
;
fragment
HEX_ESCAPE
: '\\' HEXDIGIT HEXDIGIT?;
ID : '.'
| (LETTER|'_') (LETTER|DIGIT|'_')*
| '`' (~[\\`])*? '`'
;
fragment LETTER : [a-zA-Z];
WS : [ \t\u000C\r\n]+ -> skip ;