Anomaly Detection Using Deep Sparse Autoencoders for CERN Particle Detector Data

Filip Široký

Certifying CMS particle detector data as usable for physics analysis is a crucial task that ensures the quality of all physics results published by CERN. Currently, the certification is conducted by human experts; it is labor-intensive and can only be performed at the granularity of long data-taking periods.

This contribution focuses on the design and prototype of an automated certification system that assesses data quality per luminosity section (i.e., 23 seconds of data taking). Anomalies caused by detector malfunctions or sub-optimal reconstruction are unpredictable and occur rarely, which makes classical supervised classification methods such as feedforward neural networks difficult to apply. We base our prototype on a semi-supervised model that employs deep sparse autoencoders. This approach has been validated successfully on CMS data collected during the 2016 LHC run: we demonstrate its ability to detect anomalies with high accuracy and a low false-positive rate when compared against the manual certification by experts. A key advantage of this approach over other machine-learning technologies is the interpretability of its results, which can be used to trace the origin of problems in the data to specific sub-detectors or physics objects.
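
As a concrete illustration of the technique, the sketch below trains a sparse autoencoder on luminosity sections certified as good and scores new sections by reconstruction error. It is a minimal PyTorch sketch: the layer sizes, the L1 sparsity weight, and the feature dimensions are illustrative assumptions, not the configuration of the CMS prototype.

    # Minimal sketch of per-section anomaly detection with a sparse
    # autoencoder (PyTorch). All hyperparameters are illustrative.
    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, n_features, n_hidden=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
            self.decoder = nn.Linear(n_hidden, n_features)

        def forward(self, x):
            code = self.encoder(x)
            return self.decoder(code), code

    def train(model, good_sections, epochs=50, l1_weight=1e-4):
        # Semi-supervised: fit only on sections certified as good, with an
        # L1 penalty on the hidden code to encourage sparsity.
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(epochs):
            reconstruction, code = model(good_sections)
            loss = (nn.functional.mse_loss(reconstruction, good_sections)
                    + l1_weight * code.abs().mean())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    def anomaly_scores(model, sections):
        # Per-feature squared errors stay interpretable: consistently large
        # errors on the features of one sub-detector point to its origin.
        with torch.no_grad():
            reconstruction, _ = model(sections)
            return ((reconstruction - sections) ** 2).mean(dim=1)

Sections whose score exceeds a threshold chosen on held-out certified data would then be flagged for expert review.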


Semantically Coherent Vector Space Representations

Petr Sojka, Vít Novotný

Content is king (Gates, 1996). Decomposition of word semantics matters (Mikolov, 2013). Decomposition of sentence, paragraph, and document semantics into semantically coherent vector space representations matters, too. Interpretability of these learned vector spaces is the holy grail of natural language processing today, as it would allow thoughts to be represented accurately and inference to be brought into play.

We will present recent results of our attempts towards this goal, showing how the decomposition of document semantics can improve query-answering performance, and how “horizontal transfer learning” based on word2bits can be achieved.

Representing words as binary features makes it possible to use a word lattice representation for feature inference with the well-studied theory of formal concept analysis, and to define a precise semantic similarity metric based on discriminative features. Incremental learning of word features additionally allows them to be interpreted and reasoned over, targeting the holy grail.
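
As an illustration of similarity over binary word features, the sketch below binarizes word vectors (in the spirit of word2bits) and compares words by their shared “on” features. The median thresholding rule and the choice of the Jaccard coefficient are assumptions of this sketch, not necessarily the authors' method.

    # Illustrative sketch of similarity over binary word features (NumPy).
    import numpy as np

    def binarize(vectors):
        # One bit per dimension: a feature is "on" when its value lies
        # above the per-dimension median (word2bits-style quantization
        # is assumed, not reproduced exactly).
        return vectors > np.median(vectors, axis=0)

    def jaccard(a, b):
        # Similarity from shared discriminative features.
        return (a & b).sum() / (a | b).sum()

    rng = np.random.default_rng(0)
    vectors = rng.normal(size=(1000, 64))  # stand-in for trained embeddings
    bits = binarize(vectors)
    print(jaccard(bits[0], bits[1]))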


Soft Cosine Measure: Capturing Term Similarity in the Bag of Words VSM

Vít Novotný

The standard bag-of-words vector space model (VSM) is efficient and ubiquitous in information retrieval, but it underestimates the similarity of documents that have the same meaning but different terminology. To overcome this limitation, Sidorov et al. (2014) proposed the Soft Cosine Measure (SCM), which incorporates term similarity relations. Charlet and Damnati (2017) showed that the SCM with word-embedding similarity is highly effective in question answering systems. However, the orthonormalization algorithm proposed by Sidorov et al. has an impractical time complexity of O(n⁴), where n is the size of the vocabulary.
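
For concreteness, here is a minimal sketch of the SCM between two bag-of-words vectors in NumPy, following the definition SCM(x, y) = xᵀSy / (√(xᵀSx) · √(yᵀSy)), where S holds pairwise term similarities. The toy vocabulary and the similarity values in S are illustrative, not taken from the paper.

    # Minimal sketch of the Soft Cosine Measure (NumPy).
    import numpy as np

    def soft_cosine(x, y, S):
        # SCM(x, y) = x^T S y / (sqrt(x^T S x) * sqrt(y^T S y))
        return (x @ S @ y) / (np.sqrt(x @ S @ x) * np.sqrt(y @ S @ y))

    # Vocabulary: ["play", "game", "weather"]; "play" and "game" are related.
    S = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
    doc1 = np.array([1.0, 0.0, 0.0])  # "play"
    doc2 = np.array([0.0, 1.0, 0.0])  # "game"
    print(soft_cosine(doc1, doc2, S))  # 0.8; the plain cosine would be 0.0

With S equal to the identity matrix, the SCM reduces to the ordinary cosine similarity of the bag-of-words VSM.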

In our work, we prove a tighter lower worst-case time complexity bound of O(n³). We also present an algorithm for computing the similarity between documents and show that its worst-case time complexity is O(1) under realistic conditions. Lastly, we describe implementations in general-purpose vector databases such as Annoy and Faiss, and in the inverted indices of text search engines such as Apache Lucene and Elasticsearch. Our results enable the deployment of the SCM in real-world information retrieval systems.


Video699: Interconnecting Lecture Recordings with Study Materials

Michal Štefánik, Vít Novotný

Recording lectures is a common practice in academia nowadays and lays the foundation for massive open online courseware. Although lecture slides are often recorded along with the lecturer, machine-readable information about the lecture slides is rarely preserved. This prevents full-text search in the recordings and makes the lectures inaccessible to blind and partially sighted members of the audience.

In our work, we present several neural architectures that work in lockstep to segment lecture recordings and to map the individual segments to the lecture slides being shown. We also present a new dataset, produced at Masaryk University in Brno, Czechia, which is used to train and evaluate our system. We evaluate the performance of the individual neural architectures.
