Exploiting Large Language Models to Train Automatic Detectors of Sensitive Data

Conference proceedings contribution
Publication date:
2024
Abstract:
This paper describes a machine learning system designed to identify sensitive data in Italian text documents, in line with the definitions and regulations set out in the General Data Protection Regulation (GDPR). To overcome the lack of suitable training datasets, whose creation would require disclosing sensitive data from real users, the proposed system exploits a Large Language Model (LLM) to generate synthetic documents that can be used to train supervised classifiers to detect the target sensitive data. We show that “artificial” sensitive data can be generated using either proprietary or open-source LLMs, demonstrating that the proposed approach can be implemented either through external services or with locally runnable models. We focus on the detection of six key domains of sensitive data, training supervised classifiers based on the BERT Transformer architecture adapted to text classification and Named-Entity Recognition (NER) tasks. We evaluate the system using fine-grained metrics and show that the NER model achieves an F1 score above 90%, supporting the quality of the synthetic datasets generated with both proprietary and open-source LLMs. The dataset generated with the open-source model is publicly available for download.
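As a rough illustration of the kind of entity-level F1 score the abstract reports for the NER model, the sketch below computes precision, recall, and F1 from BIO-tagged sequences. It is only a minimal sketch under stated assumptions: the abstract does not specify how the metrics were computed, exact span matching is assumed, and the label names `HEALTH` and `RELIGION` as well as the helper names `extract_spans` and `entity_f1` are hypothetical and not taken from the paper.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues_span = tag.startswith("I-") and label == tag[2:]
        if continues_span:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        else:
            # "O" or a stray "I-" tag: no span is open.
            start, label = None, None
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall, and F1 over a corpus."""
    gold_spans, pred_spans = set(), set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        # Prefix each span with the sentence index so spans stay distinct.
        gold_spans |= {(idx, *s) for s in extract_spans(g)}
        pred_spans |= {(idx, *s) for s in extract_spans(p)}
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the model finds one of two gold entities.
gold_tags = [["B-HEALTH", "I-HEALTH", "O", "B-RELIGION"]]
pred_tags = [["B-HEALTH", "I-HEALTH", "O", "O"]]
precision, recall, f1 = entity_f1(gold_tags, pred_tags)
```

In this toy case the single predicted span is correct but one gold span is missed, so precision is 1.0 and recall is 0.5. Exact span matching is a strict choice; a per-token or partial-overlap variant would give different numbers.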
CRIS type:
04.01 - Conference proceedings contribution
Keywords:
BERT; Generative Artificial Intelligence; LLM; NER; Sensitive data detection
Author list:
De Renzis, S.; Dosso, D.; Testolin, A.
University authors:
TESTOLIN ALBERTO
Link to full record:
https://www.research.unipd.it/handle/11577/3524608
Link to full text:
https://www.research.unipd.it//retrieve/handle/11577/3524608/856330/De%20Renzis,%20Dosso,%20Testolin%20-%20IRCDL%20-%202024.pdf
Book title:
CEUR Workshop Proceedings
Published in:
CEUR Workshop Proceedings (journal and series)
General data

URL

https://ceur-ws.org/Vol-3643/paper3.pdf