This is a **work-in-progress** demo of exploiting MathML formulas and subformulas
in topic modelling, for the European Digital Mathematics Library (EuDML) project.
It applies the open-source gensim and MIaS libraries
to 439,423 articles from arXiv.org.

### Dictionary

Ten most common terms (after pruning out terms that appear in more than 50% of documents):

term | document frequency |
---|---|
mass | 216729 |
$R(I(1)O(,)I(2))$ | 216246 |
although | 216105 |
next | 216060 |
$I(r)$ | 215977 |
dimensional | 215718 |
lower | 215577 |
fixed | 214557 |
$I(f)$ | 214239 |
difference | 212747 |

Ten least common terms (after clipping the dictionary to the 100k most common terms):

term | document frequency |
---|---|
$s(I[v=N](Σ)I(S))$ | 141 |
$s(N(¶)I(r))$ | 141 |
alphabetically | 141 |
bruch | 141 |
comenius | 141 |
firsts | 141 |
kei | 141 |
probabilités | 141 |
rationnelles | 141 |
schroer | 141 |

Terms between *$...$* come from mathematical (sub)formulas, not plain text.

The final dictionary contains 100000 items.

(The raw dictionary, before any pruning, takes up 975.6MB!)
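The two pruning steps described in this section (dropping terms that occur in more than 50% of documents, then keeping the 100k most common of the rest) can be sketched in plain Python. This is only an illustration, not the demo's actual code: the function names `document_frequencies` and `prune` are hypothetical, and the toy corpus is made up. gensim's `Dictionary.filter_extremes` accepts `no_above` and `keep_n` parameters for this kind of pruning, but whether the demo calls it this way is an assumption. With 439,423 articles, the 50% cutoff works out to about 219,711 documents, which is consistent with the most common surviving term, *mass*, at 216,729.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents it occurs in."""
    dfs = Counter()
    for tokens in docs:
        dfs.update(set(tokens))  # set(): count each term once per document
    return dfs


def prune(dfs, num_docs, no_above=0.5, keep_n=100_000):
    """Drop terms found in more than `no_above` of the documents,
    then keep only the `keep_n` terms with the highest document frequency."""
    kept = {t: df for t, df in dfs.items() if df <= no_above * num_docs}
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(kept.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:keep_n])


# Toy corpus (made up); formula terms are written between $...$ as on this page.
docs = [
    ["mass", "$I(r)$", "the"],
    ["mass", "the", "lower"],
    ["the", "$I(r)$", "fixed"],
    ["the", "difference"],
]
dfs = document_frequencies(docs)
pruned = prune(dfs, num_docs=len(docs), keep_n=3)
```

In the toy corpus, "the" occurs in all four documents and is removed by the 50% rule, while `keep_n=3` then trims the remaining five terms down to three. Formula terms and plain-text terms go through the same frequency counts, which matches how both kinds appear side by side in the tables above.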