Natural Language Processing project: A thematic exploration of textual research resources in CSS data repositories

Our pilot project aimed to improve archive searchability by testing various machine text analysis techniques on a sample of interview collections held in the digital social science archives of Voices of the 20th Century and of the Research Documentation Centre (RDC) of CSS. The project was carried out in cooperation between the RDC and SZTAKI’s Department of Distributed Systems. After the most appropriate technique had been selected, applied and validated, the results were integrated into the beta version of a repository search engine. In effect, metadata were automatically assigned to each interview. These metadata tell researchers what the texts (interviews or interview extracts) contain, and where to find texts and excerpts that are related to each other, even across several collections, and relevant to their research questions.

Subject headings and subject indexes were generated for the interview texts, first manually and then with machine assistance, and our researchers verified their adequacy. Validation of the results indicated that subject heading and subject index generation remained efficient even where documents underwent machine analysis only, without subsequent manual verification. Rather than being simply keywords or synonyms of keywords appearing in the texts, the subject headings or labels associated with the interviews are elements of a conceptual network built on ELSST, an international social science thesaurus, which helps to reveal the sociological phenomena inherent in the texts. Steps have also been taken towards named entity recognition (NER). We identified and then wikified name elements and time tags appearing in the texts, linking them to the Wikidata knowledge graph, Geonames, VIAF, PIM and other namespaces.
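The idea of labelling a text with thesaurus concepts rather than with its literal keywords can be illustrated with a minimal sketch. This is not the project's actual pipeline: the toy thesaurus, its cue phrases and the function name `assign_subject_headings` are all assumptions made for illustration, and the entries are not actual ELSST content. The sketch matches cue phrases in a text and then walks up the "broader concept" links, so that the resulting labels belong to a concept hierarchy instead of being words lifted from the text.

```python
import re

# Toy thesaurus for illustration only; these entries are NOT actual ELSST content.
# Each concept has cue phrases that may signal it and an optional broader concept.
TOY_THESAURUS = {
    "UNEMPLOYMENT": {
        "cues": ["lost my job", "out of work", "unemployed"],
        "broader": "EMPLOYMENT",
    },
    "MIGRATION": {
        "cues": ["moved abroad", "emigrated"],
        "broader": "POPULATION DYNAMICS",
    },
    "EMPLOYMENT": {"cues": [], "broader": None},
    "POPULATION DYNAMICS": {"cues": [], "broader": None},
}


def assign_subject_headings(text, thesaurus):
    """Return the concepts signalled by the text plus their broader concepts."""
    lowered = text.lower()
    headings = set()
    for concept, entry in thesaurus.items():
        matched = any(
            re.search(r"\b" + re.escape(cue) + r"\b", lowered)
            for cue in entry["cues"]
        )
        if not matched:
            continue
        # Walk up the hierarchy so broader concepts are assigned as well.
        current = concept
        while current is not None and current not in headings:
            headings.add(current)
            current = thesaurus[current]["broader"]
    return sorted(headings)


sample = "After the factory closed I was out of work for two years."
print(assign_subject_headings(sample, TOY_THESAURUS))
```

In this sketch the sample sentence never contains the word "unemployment", yet it receives that label and its broader concept. A real system would rely on trained models and the full thesaurus rather than on hand-written cue phrases.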

To improve searchability, the abstract subject headings and name elements obtained through machine processing are associated with the documents in several new metadata fields. Existing documents are thus opened up for new research. The subject headings are translated into English, making our archives searchable for researchers abroad. As a result, domestic resources that were previously inaccessible due to the language barrier are now becoming visible and accessible to the international research community. The results of machine processing are visualised by highlighting the name elements and linking them to the dictionary entries, as well as by exploring and displaying the frequency of certain topics and subject headings and the relationships between them.
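A record carrying such additional metadata fields might look like the following sketch. The class names, field names and identifiers here are hypothetical and do not reflect the repository's real schema; the identifier strings are placeholders, not real Wikidata or Geonames IDs. The sketch only shows how bilingual subject headings and linked name elements stored alongside a document would allow a search by English heading.

```python
from dataclasses import dataclass, field


@dataclass
class EntityLink:
    """A name element found in the text and the authority entry it links to."""
    surface_form: str   # the string as it appears in the interview
    kind: str           # e.g. "place", "person", "date"
    authority: str      # e.g. "Wikidata", "Geonames", "VIAF"
    identifier: str     # placeholder values below, not real IDs


@dataclass
class InterviewRecord:
    """Hypothetical metadata record; field names are assumptions."""
    record_id: str
    subject_headings_hu: list = field(default_factory=list)
    subject_headings_en: list = field(default_factory=list)
    entities: list = field(default_factory=list)


def search_by_heading(records, heading_en):
    """Return the IDs of records labelled with the given English heading."""
    wanted = heading_en.casefold()
    return [
        r.record_id
        for r in records
        if any(h.casefold() == wanted for h in r.subject_headings_en)
    ]


records = [
    InterviewRecord(
        record_id="interview-001",
        subject_headings_hu=["munkanélküliség"],
        subject_headings_en=["unemployment"],
        entities=[EntityLink("Budapest", "place", "Wikidata", "PLACEHOLDER-ID")],
    ),
    InterviewRecord(
        record_id="interview-002",
        subject_headings_hu=["migráció"],
        subject_headings_en=["migration"],
    ),
]

print(search_by_heading(records, "Unemployment"))
```

Keeping the Hungarian and English headings in separate fields is one possible design choice. It lets the same record be found through either language without translating the interview text itself.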

In this project, we worked in cooperation with CESSDA (Consortium of European Social Science Data Archives). As part of that cooperation, a Hungarian translation of the English-language ELSST social science thesaurus, containing more than 3,300 terms, was completed in collaboration with the Research Centre of Linguistics (NYTK). It has been available online since September 2022. The project also involved cooperation with the Budapest University of Technology and Economics (BME) to improve the efficiency of BEAST, a Hungarian database speech transcriber. BEAST is an open-source, research-ready system based on the SpeechBrain code and uses state-of-the-art transformer neural network architectures. It was developed by the NYTK and BME with financing from the Hungarian Scientific Research Fund (OTKA) and MILAB.

Researchers interested in the sociological sources will be able to access the results of our exploration of interview documents on a common online search platform created for the repositories of the CSS Research Documentation Centre. The pilot project was supported by the Hungarian Artificial Intelligence National Laboratory (MILAB).

Project participants:

CSS RDC: Gárdos Judit, Egyed-Gergely Júlia, Havadi Gergő, Horváth Anna, Jakab Miklós, Meiszterics Enikő, Vajda Róza

SZTAKI: Micsik András, Kovács László, Martin Dániel, Marx Attila, Pataki Balázs, Siket Melinda

Further researchers: Annus Szabolcs, Antal Emese, Filep Georgina, Lipp Veronika, Matyasovszky-Németh Márton, Neményi Mária, P. Tóth Tamás, Sass Bálint, Szöllősi Melinda

Publications:

Egyed-Gergely Júlia, Vajda Róza, Gárdos Judit, Horváth Anna, Meiszterics Enikő, Micsik András, Martin Dániel, Marx Attila, Pataki Balázs, Siket Melinda. Szociológia, kutatási adatok, mesterséges intelligencia: lehetőségek és tapasztalatok (Sociology, research data and artificial intelligence: opportunities and experiences). In: Tick József, Kokas Károly, Holl András (eds.) Valós térben - az online térért: Networkshop 31: national conference. 20–22 April 2022. University of Debrecen. Budapest, Hungary, HUNGARNET Society, MTA Library and Information Centre, 2022, 364 p., pp. 161–169.

Gárdos Judit, Egyed-Gergely Júlia, Horváth Anna, Pataki Balázs, Vajda Róza, Micsik András: Identification of social scientifically relevant topics in an interview repository. A natural language processing experiment. Under review.

Conferences:

FAIRsFAIR (EOSC subproject) Final Event 2022. Gárdos Judit: The National Perspective, online roundtable talk. 26 January 2022

Networkshop 31: national conference 2022. Egyed-Gergely Júlia, Micsik András, Vajda Róza: Szociológia, kutatási adatok, mesterséges intelligencia: lehetőségek és tapasztalatok (Sociology, research data and artificial intelligence: opportunities and experiences). University of Debrecen, 20–22 April 2022

STS-hub.de 2023. Circulations Conference. Gárdos Judit, Horváth Anna, Vajda Róza: NLP and metadata for large sociological interview collections. Aachen, 16 March 2023