DocSim: Semantic Similarity of Text Documents based on Gensim
For EuDML, and originally conceived and motivated by its use in DML-CZ, we provide Gensim as a library for computing similarities between plain-text documents. It is open-source, general-purpose software for scalable topic modelling, based on the Vector Space Model of document representation.
The award-winning Gensim system was developed by Radim Řehůřek. It is widely used and cited in digital libraries, content management systems, the teaching of machine learning methods, and elsewhere.
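To illustrate what similarity under the Vector Space Model means, here is a minimal sketch in plain Python. It does not use Gensim and does not reproduce Gensim's API; the function names (`tokenize`, `tfidf_vectors`, `cosine`), the whitespace tokenizer, the TF-IDF weighting, and the three sample documents are all assumptions chosen for illustration. Each document becomes a sparse weighted term vector, and two documents are compared by the cosine of the angle between their vectors.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Map each document to a sparse {term: weight} TF-IDF vector."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Terms that occur in every document get weight log(1) = 0.
        vectors.append({t: n * math.log(n_docs / df[t]) for t, n in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents, invented for this sketch.
docs = [
    "latent semantic analysis of mathematical papers",
    "semantic analysis of large text corpora",
    "classification of plant species by leaf shape",
]
vectors = tfidf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])  # the two documents share informative terms
sim_02 = cosine(vectors[0], vectors[2])  # the two documents share only "of"
print(sim_01, sim_02)
```

In this sketch the first two documents score higher than the first and third, because the only term the latter pair shares occurs in every document and therefore carries zero weight. A library such as Gensim is designed to do this kind of computation at scale, over large corpora and with richer models.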
Cite as
Text
ŘEHŮŘEK, Radim and Petr SOJKA. Software Framework for Topic Modelling with Large Corpora. In Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks. Valletta, Malta: University of Malta, 2010. pp. 46–50. ISBN 2-9517408-6-7.
BibTeX
@inproceedings{ismu:884893,
  author = "Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka",
  title = "{Software Framework for Topic Modelling with Large Corpora}",
  booktitle = "{Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks}",
  publisher = {ELRA},
  address = {Valletta, Malta},
  year = 2010,
  month = May,
  isbn = "2-9517408-6-7",
  pages = "45--50",
  url = {http://is.muni.cz/publication/884893/en},
}
Selected publications
- RŮŽIČKA, Michal. Maths Information Retrieval for Digital Libraries. Intelligent Computer Mathematics CICM 2014 Doctoral Programme Presentation. 2014.
- LEE, Mark, Petr SOJKA, Radim ŘEHŮŘEK, Radim HATLAPATKA, Maroš KUCBEL, Thierry BOUCHE, Claude GOUTORBE, Romeo ANGHELACHE and Krzysztof WOJCIECHOWSKI. Toolset for Entity and Semantic Associations – Final Release: Deliverable 8.4 of project EuDML. 1.0 as of 8th February 2013. EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library, 2013. 13 pp. Deliverable D8.4.
- LEE, Mark, Petr SOJKA, Radim ŘEHŮŘEK, Łukasz BOLIKOWSKI, Wojtek HURY and Volker SORGE. Toolset for Entity and Semantic Associations – Initial Release: Deliverable 8.2 of project EuDML. 1.0 as of 27th May 2011. EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library, 2011. 12 pp. Deliverable D8.2.
- LEE, Mark, Petr SOJKA and Radim ŘEHŮŘEK. Toolset for Entity and Semantic Associations – Value Release: Deliverable 8.3 of project EuDML. 1.0 as of 31st May 2012. EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library, 2012. 12 pp. Deliverable D8.3.
- ŘEHŮŘEK, Radim. Scalability of Semantic Analysis in Natural Language Processing. 2011.
- ŘEHŮŘEK, Radim and Petr SOJKA. Gensim – Statistical Semantics in Python. In EuroScipy 2011, Paris. 2011.
- ŘEHŮŘEK, Radim. Subspace Tracking for Latent Semantic Analysis. In Clough, P.; Foley, C.; Gurrin, C.; Jones, G.J.F.; Kraaij, W. (Eds.). Proceedings of the 33rd European Conference on Information Retrieval (ECIR). Heidelberg: Springer, 2010. pp. 289–300, 12 pp. ISBN 978-3-642-20160-8.
- ŘEHŮŘEK, Radim. Speeding Up Latent Semantic Analysis: A Streamed Distributed Algorithm for SVD Updates. In Joaquim Filipe. Proceedings of the 3rd International Conference on Agents and Artificial Intelligence (ICAART). Portugal: INSTICC Press, 2010. pp. 446–451, 6 pp. ISBN 978-989-8425-40-9.
- ŘEHŮŘEK, Radim. Fast and Faster: A Comparison of Two Streamed Matrix Decomposition Algorithms. In Michael Mahoney, Ameet Talwalkar, Mehryar Mohri, Arthur Gretton. NIPS 2010 workshop on Low-rank Methods for Large-scale Machine Learning. 2010. 7 pp.
- ŘEHŮŘEK, Radim and Petr SOJKA. Software Framework for Topic Modelling with Large Corpora. In Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks. Valletta, Malta: University of Malta, 2010. pp. 46–50, 5 pp. ISBN 2-9517408-6-7.
- ŘEHŮŘEK, Radim. Constructing High Precision Synonym Sets. In After Half a Century of Slavonic Natural Language Processing. Brno, Czech Republic: Masaryk University, 2009. 5 pp. ISBN 978-80-7399-815-8.
- SOJKA, Petr and Radim ŘEHŮŘEK. Similarity and Classification of Mathematical Papers. 2009.
- ŘEHŮŘEK, Radim and Milan KOLKUS. Language Identification on the Web: Extending the Dictionary Method. In Computational Linguistics and Intelligent Text Processing, 10th International Conference, CICLing 2009, Proceedings. 1st ed. Mexico City, Mexico: Springer-Verlag, 2009. pp. 357–368, 12 pp. ISBN 978-3-642-00381-3.
- ŘEHŮŘEK, Radim. Plagiarism Detection through Vector Space Models Applied to a Digital Library. In RASLAN 2008. 1st ed. Brno: Masarykova Univerzita, 2008. pp. 75–83, 9 pp. ISBN 978-80-210-4741-9.
- ŘEHŮŘEK, Radim and Petr SOJKA. Automated Classification and Categorization of Mathematical Knowledge. In Intelligent Computer Mathematics: AISC/Calculemus/MKM LNAI 5144. 1st ed. Berlin, Heidelberg, New York: Springer-Verlag, 2008. pp. 543–557, 15 pp. ISBN 978-3-540-85109-7.
- SOJKA, Petr and Radim ŘEHŮŘEK. Classification of Multilingual Mathematical Papers in DML-CZ. In Proceedings of First Workshop of Recent Advances in Slavonic Natural Language Processing RASLAN 2007. 1st ed. Brno: Masarykova univerzita, 2007. pp. 89–96. ISBN 978-80-210-4471-5.
- ŘEHŮŘEK, Radim. Text Segmentation Using Context Overlap. Progress in Artificial Intelligence, Guimarães, Portugal: Springer Berlin / Heidelberg, 2007, vol. 2007, no. 4874, pp. 647–658. ISSN 0302-9743.
- ŘEHŮŘEK, Radim. On Dimensionality of Latent Semantic Indexing for Text Segmentation. Proceedings of the International Multiconference on Computer Science and Information Technology, Wisła, Poland, 2007, vol. 2007, no. 2, pp. 347–356. ISSN 1896-7094.
- POMIKÁLEK, Jan and Radim ŘEHŮŘEK. The Influence of Preprocessing Parameters on Text Categorization. International Journal of Applied Science, Engineering and Technology, 2007, vol. 4/2007, no. 1, pp. 430–434. ISSN 1307-4318.