Researchers at HSE in St Petersburg Develop Superior Machine Learning Model for Determining Text Topics

They also revealed poor performance of neural networks on such tasks

Topic models are machine learning algorithms designed to analyse large text collections based on their topics. Scientists at HSE Campus in St Petersburg compared five topic models to determine which ones performed better. Two models, including GLDAW developed by the Laboratory for Social and Cognitive Informatics at HSE Campus in St Petersburg, made the lowest number of errors. The paper has been published in PeerJ Computer Science.

Determining the topic of a publication is usually not difficult for the human brain. For example, any editor can easily tag this article with science, artificial intelligence, and machine learning. However, the process of sorting information can be time-consuming for a person, which becomes critical when dealing with a large volume of data. A modern computer can perform this task much faster, but it requires solving a challenging problem: identifying the meaning of documents based on their content and categorising them accordingly.

This is achieved through topic modelling, a branch of machine learning that aims to categorise texts by topic. Topic modelling is used to facilitate information retrieval, analyse mass media, identify community topics in social networks, detect trends in scientific publications, and address various other tasks. For example, analysing financial news can accurately predict trading volumes on the stock exchange, which are significantly influenced by politicians' statements and economic events.

Working with a topic model typically unfolds as follows: the algorithm takes a collection of text documents as input and, as output, assigns each document a degree of belonging to specific topics. These assessments are based on the frequency of word usage and the relationships between words and sentences. Thus, words such as ‘scientists,’ ‘laboratory,’ ‘analysis,’ ‘investigated,’ and ‘algorithms’ found in this text categorise it under the topic of ‘science.’

However, many words can appear in texts covering various topics. For example, the word ‘work’ is often used in texts about industrial production or the labour market, but when used in the phrase ‘scientific work,’ it categorises the text as pertaining to ‘science.’ Such relationships, expressed mathematically through probability matrices, form the core of these algorithms.
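To make the idea of probability matrices concrete, here is a minimal sketch. It assumes the common formulation in which one matrix holds the probability of each word within a topic and another holds the probability of each topic within a document. The vocabulary, the topics, and every number are invented for illustration and are not taken from the study.

```python
# Toy vocabulary and topics; all numbers below are made up for illustration.
vocab = ["scientific", "work", "factory", "laboratory"]
topics = ["science", "industry"]

# topic_word[t][w] = P(word w | topic t); each row sums to 1.
# Note that 'work' has non-zero probability under both topics.
topic_word = [
    [0.40, 0.20, 0.00, 0.40],  # science
    [0.00, 0.40, 0.50, 0.10],  # industry
]

# doc_topic[d][t] = P(topic t | document d); each row sums to 1.
doc_topic = [
    [0.9, 0.1],  # a document mostly about science
    [0.2, 0.8],  # a document mostly about industry
]


def word_distribution(doc_row, topic_word):
    """Return P(w | d) = sum over t of P(w | t) * P(t | d) for one document."""
    n_words = len(topic_word[0])
    return [
        sum(doc_row[t] * topic_word[t][w] for t in range(len(doc_row)))
        for w in range(n_words)
    ]


# Expected word probabilities for each document under this toy model.
doc_word = [word_distribution(row, topic_word) for row in doc_topic]

for d, row in enumerate(doc_word):
    print(d, dict(zip(vocab, (round(p, 3) for p in row))))
```

In this toy setup, ‘work’ contributes to both documents, while ‘laboratory’ and ‘factory’ pull them towards different topics. A real topic model runs in the opposite direction: it estimates the two matrices from observed word counts.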

Topic models can be enhanced by creating embeddings—fixed-length vectors that describe a specific entity based on various parameters. These embeddings serve as additional information acquired through training the model on millions of texts.

Any phrase or text, such as this news item, can be represented as a sequence of numbers: a vector, that is, a point in a vector space. In machine learning, these numerical representations are referred to as embeddings. The idea is that once texts are vectors, measuring the distance between them and detecting similarities becomes easier, allowing comparisons between two or more texts. If the embeddings describing two texts are highly similar, the texts likely belong to the same category or cluster, that is, to the same topic.
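One common way to compare embeddings is cosine similarity. This is a generic choice for illustration; the article does not say which measure the models in the study use. The three-dimensional vectors below are hypothetical, whereas real embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings of three texts (values invented for illustration).
science_news = [0.9, 0.1, 0.2]
lab_report = [0.8, 0.2, 0.1]
football_report = [0.1, 0.9, 0.3]

sim_same_topic = cosine_similarity(science_news, lab_report)
sim_other_topic = cosine_similarity(science_news, football_report)

print(round(sim_same_topic, 3), round(sim_other_topic, 3))
```

With these made-up vectors, the two science-like texts score much closer to 1 than the science and football pair, which is the signal a clustering step would use to group them under one topic.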

Scientists at the HSE Laboratory for Social and Cognitive Informatics in St Petersburg examined five topic models (ETM, GLDAW, GSM, WTM-GMM and W-LDA) that are based on different mathematical principles:

ETM is a model proposed by the prominent mathematician David M. Blei, who is one of the founders of the field of topic modelling in machine learning. His model is based on latent Dirichlet allocation and employs variational inference to calculate probability distributions, combined with embeddings.
Two models—GSM and WTM-GMM—are neural topic models.
W-LDA is based on Gibbs sampling and incorporates embeddings, but also uses latent Dirichlet allocation, similar to the Blei model.
GLDAW relies on a broader collection of embeddings to determine the association of words with topics.

For any topic model to perform effectively, it is crucial to determine the optimal number of categories or clusters into which the information should be divided. This is an additional challenge when tuning algorithms.

Sergey Koltsov, primary author of the paper and Leading Research Fellow at the Laboratory for Social and Cognitive Informatics, explains:

‘Typically, a person does not know in advance how many topics are present in the information flow, so the task of determining the number of topics must be delegated to the machine. To accomplish this, we proposed measuring a certain amount of information as the inverse of chaos. If there is a lot of chaos, then there is little information, and vice versa. This allows for estimating the number of clusters, or in our case, topics associated with the dataset. We applied these principles in the GLDAW model.’

The researchers evaluated the models on three measures: stability (number of errors), coherence (establishing connections), and Renyi entropy (measuring the degree of chaos). The algorithms' performance was tested on three datasets: materials from the Russian-language news resource Lenta.ru and two English-language datasets, 20 Newsgroups and WoS. These datasets were chosen because all their texts come with pre-assigned tags, which makes it possible to evaluate how well the algorithms identify topics.
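As a rough illustration of how Renyi entropy captures the ‘degree of chaos’, the sketch below uses the textbook definition for a discrete probability distribution, H_alpha(p) = log(sum of p_i^alpha) / (1 - alpha). This is not the authors' procedure for choosing the number of topics, which the article does not detail. The function name and the example distributions are assumptions made for illustration.

```python
import math


def renyi_entropy(probs, alpha):
    """Textbook Renyi entropy of order alpha (alpha > 0, alpha != 1), in nats."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    # Zero-probability outcomes contribute nothing to the sum.
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1 - alpha)


# A uniform distribution is maximally 'chaotic': every outcome is equally likely.
uniform = [0.25, 0.25, 0.25, 0.25]
# A sharply peaked distribution is highly ordered: one outcome dominates.
peaked = [0.97, 0.01, 0.01, 0.01]

h_uniform = renyi_entropy(uniform, alpha=2)
h_peaked = renyi_entropy(peaked, alpha=2)

print(round(h_uniform, 3), round(h_peaked, 3))
```

The uniform distribution yields the higher entropy (equal to log 4 here), while the peaked one yields a value close to zero, matching the intuition that more chaos means less information.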

The experiment showed that ETM outperformed other models in terms of coherence on the Lenta.ru and 20 Newsgroups datasets, while GLDAW ranked first for the WoS dataset. Additionally, GLDAW exhibited the highest stability among the tested models, effectively determined the optimal number of topics, and performed well on shorter texts typical of social networks.

Sergey Koltsov, primary author of the paper, adds:

‘We improved the GLDAW algorithm by incorporating a large collection of external embeddings derived from millions of documents. This enhancement enabled more accurate determination of semantic coherence between words and, consequently, more precise grouping of texts.’

GSM, WTM-GMM and W-LDA demonstrated lower performance than ETM and GLDAW across all three measures. This finding surprised the researchers, as neural network models are generally considered superior to other types of models in many aspects of machine learning. The scientists have yet to determine the reasons for their poor performance in topic modelling.

Date

19 September 2024

