Researchers at HSE in St Petersburg Develop Superior Machine Learning Model for Determining Text Topics

They also revealed poor performance of neural networks on such tasks

Topic models are machine learning algorithms designed to analyse large text collections based on their topics. Scientists at HSE Campus in St Petersburg compared five topic models to determine which ones performed better. Two models, including GLDAW developed by the Laboratory for Social and Cognitive Informatics at HSE Campus in St Petersburg, made the lowest number of errors. The paper has been published in PeerJ Computer Science.

Determining the topic of a publication is usually not difficult for the human brain. For example, any editor can easily tag this article with science, artificial intelligence, and machine learning. However, the process of sorting information can be time-consuming for a person, which becomes critical when dealing with a large volume of data. A modern computer can perform this task much faster, but it requires solving a challenging problem: identifying the meaning of documents based on their content and categorising them accordingly.

This is achieved through topic modelling, a branch of machine learning that aims to categorise texts by topic. Topic modelling is used to facilitate information retrieval, analyse mass media, identify community topics in social networks, detect trends in scientific publications, and address various other tasks. For example, analysing financial news can help predict trading volumes on the stock exchange, which are significantly influenced by politicians' statements and economic events.

Here's how working with topic models typically unfolds: the algorithm takes a collection of text documents as input. At the output, each document is assessed for its degree of belonging to specific topics. These assessments are based on the frequency of word usage and the relationships between words and sentences. Thus, words such as ‘scientists,’ ‘laboratory,’ ‘analysis,’ ‘investigated,’ and ‘algorithms’ found in this text categorise it under the topic of ‘science.’

However, many words can appear in texts covering various topics. For example, the word ‘work’ is often used in texts about industrial production or the labour market, but in the phrase ‘scientific work’ it points to the topic of ‘science.’ Such relationships between words, topics and documents are expressed mathematically through probability matrices, which form the core of these algorithms.
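The role of the probability matrices can be illustrated with a minimal sketch. It assumes the classic mixture form used by many topic models, in which the probability of a word in a document is the sum over topics of P(word | topic) times P(topic | document). All names and numbers below (`phi`, `theta`, the two topics, the two documents) are hypothetical toy values, not data from the study.

```python
# Toy illustration only: every number here is invented for the example.
# phi[topic][word]  = P(word | topic)  (word-topic matrix, each row sums to 1)
# theta[doc][topic] = P(topic | doc)   (document-topic matrix, each row sums to 1)
phi = {
    "science": {"work": 0.2, "laboratory": 0.5, "factory": 0.0, "algorithm": 0.3},
    "industry": {"work": 0.4, "laboratory": 0.1, "factory": 0.5, "algorithm": 0.0},
}
theta = {
    "doc_a": {"science": 0.9, "industry": 0.1},
    "doc_b": {"science": 0.2, "industry": 0.8},
}


def word_probability(word, doc):
    """P(word | doc) = sum over topics of P(word | topic) * P(topic | doc)."""
    return sum(phi[topic][word] * weight for topic, weight in theta[doc].items())


def dominant_topic(doc):
    """Return the topic with the largest share in the document."""
    return max(theta[doc], key=theta[doc].get)


if __name__ == "__main__":
    for doc in theta:
        print(doc, dominant_topic(doc), round(word_probability("work", doc), 3))
```

The ambiguous word ‘work’ has a non-zero probability under both topics, so it is the document's overall topic mixture, not the single word, that decides the category.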

Topic models can be enhanced by creating embeddings—fixed-length vectors that describe a specific entity based on various parameters. These embeddings serve as additional information acquired through training the model on millions of texts. 

Any phrase or text, such as this news item, can be represented as a sequence of numbers, that is, a vector in a vector space. In machine learning, these numerical representations are referred to as embeddings. The idea is that measuring distances between vectors makes similarities easier to detect, allowing comparisons between two or more texts. If the embeddings describing the texts are sufficiently similar, the texts likely belong to the same category or cluster, that is, to a specific topic.
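A common way to compare two embeddings is cosine similarity, which measures the angle between the vectors. The sketch below is illustrative only: the three-dimensional vectors are invented, real embeddings typically have hundreds of dimensions, and the article does not say which similarity measure the models in the study use.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (made-up values for illustration).
science_news = [0.9, 0.1, 0.2]
lab_report = [0.8, 0.2, 0.1]
sports_news = [0.1, 0.9, 0.3]

if __name__ == "__main__":
    print(cosine_similarity(science_news, lab_report))   # close to 1: similar texts
    print(cosine_similarity(science_news, sports_news))  # much lower: different topics
```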

Scientists at the HSE Laboratory for Social and Cognitive Informatics in St Petersburg examined five topic models (ETM, GLDAW, GSM, WTM-GMM and W-LDA), which are based on different mathematical principles:

  • ETM is a model proposed by the prominent mathematician David M. Blei, who is one of the founders of the field of topic modelling in machine learning. His model is based on latent Dirichlet allocation and employs variational inference to calculate probability distributions, combined with embeddings.
  • Two models—GSM and WTM-GMM—are neural topic models.
  • W-LDA is based on Gibbs sampling and incorporates embeddings, but also uses latent Dirichlet allocation, similar to the Blei model.
  • GLDAW relies on a broader collection of embeddings to determine the association of words with topics.

For any topic model to perform effectively, it is crucial to determine the optimal number of categories or clusters into which the information should be divided. This is an additional challenge when tuning algorithms.

Sergey Koltsov, primary author of the paper, Leading Research Fellow, Laboratory of Social and Cognitive Informatics

Typically, a person does not know in advance how many topics are present in the information flow, so the task of determining the number of topics must be delegated to the machine. To accomplish this, we proposed measuring a certain amount of information as the inverse of chaos. If there is a lot of chaos, then there is little information, and vice versa. This allows for estimating the number of clusters, or in our case, topics associated with the dataset. We applied these principles in the GLDAW model.
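The idea of information as the inverse of chaos can be illustrated with an entropy measure. The sketch below uses the textbook definition of Renyi entropy of order alpha for a discrete probability distribution; it is not the exact procedure from the paper, which the article does not describe. The two example distributions are invented.

```python
import math


def renyi_entropy(probs, alpha=2.0):
    """Textbook Renyi entropy of order alpha (alpha > 0, alpha != 1), in nats."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    # Zero-probability outcomes contribute nothing to the sum.
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1.0 - alpha)


# Hypothetical distributions of one document over four topics.
uniform = [0.25, 0.25, 0.25, 0.25]  # maximal chaos: no topic stands out
peaked = [0.97, 0.01, 0.01, 0.01]   # little chaos: one topic clearly dominates

if __name__ == "__main__":
    print(renyi_entropy(uniform))  # equals ln(4), the maximum for four outcomes
    print(renyi_entropy(peaked))   # close to 0
```

Under this reading, one would run a model for several candidate numbers of topics and prefer the setting where the entropy is lowest, though the precise criterion is an assumption here.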

The researchers investigated the models for stability (number of errors), coherence (establishing connections), and Renyi entropy (measuring the degree of chaos). The algorithms' performance was tested on three datasets: materials from the Russian-language news resource Lenta.ru and two English-language datasets, 20 Newsgroups and WoS. This choice was made because all texts in these sources were initially assigned tags, allowing for evaluation of the algorithms' performance in identifying the topics.
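Pre-assigned tags make it possible to check a model's output against known answers. One simple, widely used check is cluster purity: each predicted topic is credited with its most frequent tag, and the share of documents matching that tag is reported. This is a generic illustration and an assumption on our part; the article does not state that purity was among the measures used in the study, and the labels below are invented.

```python
from collections import Counter, defaultdict


def purity(predicted_topics, true_tags):
    """Share of documents whose tag is the majority tag of their predicted topic."""
    if len(predicted_topics) != len(true_tags):
        raise ValueError("inputs must have the same length")
    if not true_tags:
        raise ValueError("inputs must not be empty")
    tags_by_topic = defaultdict(Counter)
    for topic, tag in zip(predicted_topics, true_tags):
        tags_by_topic[topic][tag] += 1
    # For each topic, count only the documents carrying its most common tag.
    majority_total = sum(c.most_common(1)[0][1] for c in tags_by_topic.values())
    return majority_total / len(true_tags)


# Hypothetical example: six documents, two predicted topics.
predicted = [0, 0, 0, 1, 1, 1]
tags = ["sport", "sport", "politics", "science", "science", "science"]

if __name__ == "__main__":
    print(purity(predicted, tags))  # 5 of 6 documents match their topic's majority tag
```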

The experiment showed that ETM outperformed other models in terms of coherence on the Lenta.ru and 20 Newsgroups datasets, while GLDAW ranked first for the WoS dataset. Additionally, GLDAW exhibited the highest stability among the tested models, effectively determined the optimal number of topics, and performed well on shorter texts typical of social networks.

Sergey Koltsov, primary author of the paper, Leading Research Fellow, Laboratory of Social and Cognitive Informatics

We improved the GLDAW algorithm by incorporating a large collection of external embeddings derived from millions of documents. This enhancement enabled more accurate determination of semantic coherence between words and, consequently, more precise grouping of texts.

GSM, WTM-GMM and W-LDA demonstrated lower performance than ETM and GLDAW across all three measures. This finding surprised the researchers, as neural network models are generally considered superior to other types of models in many aspects of machine learning. The scientists have yet to determine the reasons for their poor performance in topic modelling.
