EuDML MIaS4gensim demo

This is a work-in-progress demo of exploiting MathML formulas and subformulas in topic modelling, for the European Digital Mathematics Library (EuDML) project. It uses the open-source gensim and MIaS libraries over 439,423 articles from arXiv.org.


Build stats

Category "all" was built from 439,423 documents (full build log).

Dictionary

Ten most common terms (after pruning terms that appear in more than 50% of documents):

term                  document frequency
mass                  216729
$R(I(1)O(,)I(2))$     216246
although              216105
next                  216060
$I(r)$                215977
dimensional           215718
lower                 215577
fixed                 214557
$I(f)$                214239
difference            212747

Ten least common terms (after clipping the dictionary to the 100k most common terms):

term                  document frequency
$s(I[v=N](Σ)I(S))$    141
$s(N(¶)I(r))$         141
alphabetically        141
bruch                 141
comenius              141
firsts                141
kei                   141
probabilités          141
rationnelles          141
schroer               141

Terms between $...$ come from mathematical (sub)formulas, not plain text.

The final dictionary contains 100000 items.
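
The two pruning steps described above (drop terms that appear in more than 50% of documents, then clip to the 100k most common terms) can be sketched in plain Python. This is only an illustration, not the demo's actual build code: the function name `prune_dictionary`, its parameters, and the toy documents are made up for the example. gensim's `Dictionary.filter_extremes` exposes similar `no_above` and `keep_n` parameters, but whether the demo calls it with these values is an assumption.

```python
from collections import Counter

def prune_dictionary(docs, no_above=0.5, keep_n=100000):
    """Return {term: document frequency} after two pruning steps (hypothetical helper)."""
    # Document frequency: count each term at most once per document.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Step 1: drop terms that appear in more than `no_above` of all documents.
    max_df = no_above * len(docs)
    kept = {term: n for term, n in df.items() if n <= max_df}
    # Step 2: keep only the `keep_n` terms with the highest document frequency
    # (ties broken alphabetically so the result is deterministic).
    ranked = sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:keep_n])

# Toy corpus: formula tokens are written between $...$, as in the tables above.
docs = [
    ["mass", "the", "$I(r)$"],
    ["mass", "the", "fixed"],
    ["the", "lower", "$I(r)$"],
    ["the", "mass", "bruch"],
]
pruned = prune_dictionary(docs, no_above=0.5, keep_n=2)
```

In this toy corpus "the" (4 of 4 documents) and "mass" (3 of 4) exceed the 50% threshold and are dropped, and the clip then keeps the two most frequent remaining terms. Formula tokens and plain-text terms are treated identically by both steps.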

(raw dictionary before any pruning -- 975.6MB!)

Latent Dirichlet Allocation

not available: 'LdaState' object has no attribute 'eta'

Latent Semantic Analysis

not available: 'Projection' object has no attribute 'u'