Evaluation of Extended Word Embeddings
Word embeddings produced by shallow neural networks provide strong baselines for many extrinsic tasks, such as semantic text similarity, text classification, and information retrieval, which correspond to real-world end tasks. Unlike state-of-the-art language models, which are accurate but also slow, opaque, and monolithic, word embeddings lend themselves to solutions that are fast, interpretable, and modular. Improving the accuracy of word embeddings therefore provides an important counterbalance to the ever-increasing computational and architectural complexity of state-of-the-art language models.
Word embeddings of shallow neural networks have a number of extensions that achieve strong results on intrinsic tasks such as word analogy, but they have not been extensively evaluated on multilingual extrinsic tasks. The goal of this project is to prepare a set of tasks for evaluating word embeddings in such multilingual extrinsic settings.
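The sketch below illustrates what one such extrinsic evaluation might look like: documents are represented as averages of pre-trained word vectors and classified with a linear model. The embedding model and dataset named here are illustrative stand-ins, not the project's actual evaluation suite.

```python
# A minimal sketch of an extrinsic (text classification) evaluation of word
# embeddings. The pre-trained vectors and the dataset are placeholders chosen
# for illustration only.
import numpy as np
import gensim.downloader
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Any Gensim KeyedVectors model (e.g. fastText or word2vec) would work here.
vectors = gensim.downloader.load('glove-wiki-gigaword-100')

def embed(text):
    """Represent a document as the average of its in-vocabulary token vectors."""
    tokens = [t for t in text.lower().split() if t in vectors.key_to_index]
    if not tokens:
        return np.zeros(vectors.vector_size)
    return np.mean([vectors[t] for t in tokens], axis=0)

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

X_train = np.array([embed(document) for document in train.data])
X_test = np.array([embed(document) for document in test.data])

# The downstream classification accuracy serves as the extrinsic score
# of the word embeddings.
classifier = LogisticRegression(max_iter=1000).fit(X_train, train.target)
print('Accuracy:', accuracy_score(test.target, classifier.predict(X_test)))
```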
Selected Publications
- NOVOTNÝ, Vít, Michal ŠTEFÁNIK, Eniafe Festus AYETIRAN, Petr SOJKA and Radim ŘEHŮŘEK. When FastText Pays Attention: Efficient Estimation of Word Representations using Constrained Positional Weighting. Journal of Universal Computer Science. New York, USA: J.UCS Consortium, 2022, vol. 28, No 2, p. 181-201. ISSN 0948-695X. doi:10.3897/jucs.69619.
- ŠTEFÁNIK, Michal, Vít NOVOTNÝ and Petr SOJKA. Regressive Ensemble for Machine Translation Quality Evaluation. In Markus Freitag. Proceedings of EMNLP 2021 Sixth Conference on Machine Translation (WMT 21). ACL, 2021. 8 pp.
- NOVOTNÝ, Vít, Eniafe Festus AYETIRAN, Dalibor BAČOVSKÝ, Dávid LUPTÁK, Michal ŠTEFÁNIK and Petr SOJKA. One Size Does Not Fit All: Finding the Optimal Subword Sizes for FastText Models across Languages. In Mitkov, Ruslan and Angelova, Galia. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021). Varna, Bulgaria: INCOMA Ltd., 2021. 7 pp. ISSN 1313-8502.
- AYETIRAN, Eniafe Festus, Petr SOJKA and Vít NOVOTNÝ. EDS-MEMBED: Multi-sense embeddings based on enhanced distributional semantic structures via a graph walk. Knowledge-Based Systems. Elsevier, 2021, vol. 2021, No 219, p. 106902-106918. ISSN 0950-7051. doi:10.1016/j.knosys.2021.106902.
- NOVOTNÝ, Vít, Eniafe Festus AYETIRAN, Dávid LUPTÁK, Michal ŠTEFÁNIK and Petr SOJKA. One Size Does Not Fit All: Finding the Optimal N-gram Sizes for FastText Models across Languages. New York, USA: Cornell University, 2021.
- NOVOTNÝ, Vít. The Art of Reproducible Machine Learning: A Survey of Methodology in Word Vector Experiments. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020. Brno: Tribun EU, 2020. p. 55-64, 10 pp. ISBN 978-80-263-1517-9.
- NOVOTNÝ, Vít, Michal ŠTEFÁNIK, Dávid LUPTÁK and Petr SOJKA. Towards Useful Word Embeddings: Evaluation on Information Retrieval, Text Classification, and Language Modeling. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020. Brno: Tribun EU, 2020. p. 37-46, 10 pp. ISBN 978-80-263-1517-9.
- NOVOTNÝ, Vít, Petr SOJKA, Michal ŠTEFÁNIK and Dávid LUPTÁK. Three is Better than One: Ensembling Math Information Retrieval Systems. CEUR Workshop Proceedings, Thessaloniki, Greece: M. Jeusfeld c/o Redaktion Sun SITE, Informatik V, RWTH Aachen, 2020, vol. 2020, No 2696, p. 1-30. ISSN 1613-0073.
- NOVOTNÝ, Vít, Eniafe Festus AYETIRAN, Michal ŠTEFÁNIK and Petr SOJKA. Text classification with word embedding regularization and soft similarity measure. New York, USA: Cornell University, 2020.
- SOJKA, Petr, Vít NOVOTNÝ, Eniafe Festus AYETIRAN, Dávid LUPTÁK and Michal ŠTEFÁNIK. Quo Vadis, Math Information Retrieval. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019. Brno: Tribun EU, 2019. p. 117-128, 12 pp. ISBN 978-80-263-1517-9.
- NOVOTNÝ, Vít. Implementation Notes for the Soft Cosine Measure. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM '18). Torino, Italy: Association for Computing Machinery, 2018. p. 1639-1642, 4 pp. ISBN 978-1-4503-6014-2. doi:10.1145/3269206.3269317.