The field of machine learning has evolved rapidly in recent years, and communication between different researchers and research groups has become increasingly important. A key obstacle to communication is inconsistent notation across papers. This proposal suggests a standard for commonly used mathematical notation in machine learning. This first version covers only part of the notation; more will be added later. The proposal will be updated regularly as the field progresses.

You can adopt this notation by downloading the LaTeX macro package MLMath.sty, which is kept in sync with updates to this proposal; see the GitHub repository for more information.
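As a minimal sketch of how the package might be used, the snippet below typesets the hypothesis function with the simplified commands (\vx, \vtheta, \fX, \fY) listed in the table of this proposal; the assumption that MLMath.sty defines exactly these macros is based on that table:

```latex
\documentclass{article}
\usepackage{MLMath} % assumed to provide \vx, \vtheta, \fX, \fY per the table
\begin{document}
The hypothesis function is $f_{\vtheta}\colon \fX \to \fY$, evaluated at an
input $\vx$ with parameters $\vtheta$.
\end{document}
```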

See the full Guide for more

- \(S=\{\mathbf{z}_i\}_{i=1}^n=\{(\mathbf{x}_i,\mathbf{y}_i)\}_{i=1}^n\): dataset
- \(\mathcal{H}\): function space
- \(f_{\mathbf{\theta}}:\mathcal{X}\to \mathcal{Y}\): hypothesis function
- \(L_{S}(\mathbf{\theta}), L_{n}(\mathbf{\theta}), R_{n}(\mathbf{\theta}), R_{S}(\mathbf{\theta})\): empirical risk or training loss
- \(f(\mathbf{x};\mathbf{\theta})=\sum_{j=1}^{m} a_j \sigma (\mathbf{w}_j\cdot \mathbf{x} + b_j)\): two-layer neural network
- \({\rm Rad}_{n} (\mathcal{H})\): Rademacher complexity
- GD: gradient descent
- SGD: stochastic gradient descent
- \(B\): a batch set
- \(|B|\): batch size
- \(\eta\): learning rate
- \(\mathbf{\xi}\): continuous frequency


symbol | meaning | LaTeX | simplified |
---|---|---|---|
x | input | \bm{x} | \vx |
y | output, label | \bm{y} | \vy |
d | input dimension | d | |
d_o | output dimension | d_{\rm o} | |
n | number of samples | n | |
X | instances domain (a set) | \mathcal{X} | \fX |
Y | labels domain (a set) | \mathcal{Y} | \fY |
Z = X × Y | example domain | \mathcal{Z} | \fZ |
H | hypothesis space (a set) | \mathcal{H} | \fH |
θ | a set of parameters | \bm{\theta} | \vtheta |
f_θ : X → Y | hypothesis function | f_{\bm{\theta}} | f_{\vtheta} |
f or f* : X → Y | target function | f, f^* | |
ℓ : H × Z → R_+ | loss function | \ell | |
D | distribution of Z | \mathcal{D} | \fD |
S = {z_i}_{i=1}^n = {(x_i, y_i)}_{i=1}^n | sample set | | |
L_S(θ), L_n(θ), R_n(θ), R_S(θ) | empirical risk or training loss | | |
L_D(θ), R_D(θ) | population risk or expected loss | | |
σ : R → R_+ | activation function | \sigma | |
w_j | input weight | \bm{w}_j | \vw_j |
a_j | output weight | a_j | |
b_j | bias term | b_j | |
f_θ(x) or f(x; θ) | neural network | f_{\bm{\theta}} | f_{\vtheta} |
∑_{j=1}^{m} a_j σ(w_j · x + b_j) | two-layer neural network | | |
VCdim(H) | VC-dimension of H | | |
Rad(H ∘ S), Rad_S(H) | Rademacher complexity of H on S | | |
Rad_n(H) | Rademacher complexity over samples of size n | | |
GD | gradient descent | | |
SGD | stochastic gradient descent | | |
B | a batch set | B | |
\|B\| | batch size | b | |
η | learning rate | \eta | |
k | discretized frequency | \bm{k} | \vk |
ξ | continuous frequency | \bm{\xi} | \vxi |
∗ | convolution operation | * | |
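To illustrate how the optimization symbols fit together, here is a hedged sketch of SGD in plain Python: each step draws a batch set B of size |B| from the sample set S and updates θ with learning rate η. The scalar least-squares loss and all numeric values are illustrative assumptions, not part of the proposal:

```python
import random

def grad_loss(theta, z):
    # z = (x, y); per-example loss l(theta, z) = 0.5 * (theta * x - y)^2
    x, y = z
    return (theta * x - y) * x

def sgd_step(theta, S, batch_size, eta, rng):
    B = rng.sample(S, batch_size)                       # batch set B, |B| = batch_size
    g = sum(grad_loss(theta, z) for z in B) / len(B)    # averaged gradient over B
    return theta - eta * g                              # theta <- theta - eta * g

rng = random.Random(0)
S = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]  # labels from target f*(x) = 2x
theta = 0.0
for _ in range(100):
    theta = sgd_step(theta, S, batch_size=2, eta=0.1, rng=rng)
print(round(theta, 3))  # converges toward 2.0, the slope of f*
```

Taking B = S recovers full-batch GD; smaller |B| trades gradient accuracy for cheaper steps.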
