The renowned Valencian translation agency Pangeanic has recently won a contract worth a million euros to develop a multilingual data anonymization toolkit for the European Union. In this project, the Spanish company is working within a consortium alongside institutions from other European countries.
The anonymization toolkit will use artificial intelligence (AI), specifically Named Entity Recognition. Once the project is completed, the toolkit will be available for anyone to download as a fully deployable Docker container under an open-source license.
As noted, Pangeanic is not working alone on this project. Several EU Public Administrations are involved in the consortium, and many other Member States will probably need the toolkit in the short or medium term, because the need for anonymization is growing in the public sector across Europe. In addition, embassies, chambers of commerce and other institutions within the European Public Administration sphere have a mandate for data transparency and open data. At the same time, these institutions must protect personal data and must not share personal information with third parties.
The project is called MAPA, which stands for Multilingual Anonymization toolkit for Public Administrations. It will use state-of-the-art Natural Language Processing tools to develop the open-source toolkit. The project will focus on two specific domains, legal and medical, and the deliverable will be used by various Public Administrations in EU countries.
The toolkit will help institutions that need to share or release information while protecting the personal data it contains. The key to achieving this is de-identification, together with obfuscation and pseudonymization of personally identifiable data. This requires highly sophisticated technology.
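The difference between obfuscation and pseudonymization can be illustrated with a minimal Python sketch. It is not MAPA's implementation: the function names, the `SECRET_KEY` value and the keyed-hash approach are assumptions chosen for illustration only.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored separately from the data.
SECRET_KEY = b"replace-with-a-real-secret"


def obfuscate(value: str) -> str:
    """Mask the value completely, keeping only its length."""
    return "*" * len(value)


def pseudonymize(value: str, label: str = "PERSON") -> str:
    """Map a value to a stable pseudonym using a keyed hash (HMAC-SHA256).

    The same input always yields the same pseudonym, so references to one
    person stay linked across a document without revealing the name.
    """
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{label}_{digest[:8]}"
```

Obfuscation discards the value altogether, whereas a pseudonym keeps mentions of the same entity consistent, which matters when the released text must remain readable.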
Another important aspect of the project is that it will be language-independent, allowing institutions to remove personal data regardless of the name or language involved. Interest in data protection has grown considerably in recent years because of the GDPR (General Data Protection Regulation), and the rules on data transmission have changed substantially as a result. The project will address every official European language.
The partners participating in the MAPA project are: the University of Malta, the R&D center Vicomtech, the ELRA language resource center, the French National Centre for Scientific Research (LIMSI at CNRS), Tilde, the Barcelona Supercomputing Center (representing SEDIA, the Spanish Language Plan Government Office) and the Valencian translation agency Pangeanic.
Why Anonymize Data?
Organizations are obliged to protect personal data, so people's personal information must not be shared with third parties. The MAPA toolkit will make it possible to share language data while protecting sensitive or personal information.
Communities will benefit greatly from this toolkit because a lot of data will be released. That data could be used as training data for machine learning or for educational purposes, for instance. Corporations with offices around the world will also be able to transfer information across jurisdictions quickly and safely.
The information will also be accessible to healthcare companies, justice departments and health authorities, and organizations of this type will be able to manage their own de-identification strategy. Tailor-made use cases will demonstrate the flexibility and customization capabilities of the toolkit. Most importantly, all GDPR requirements will be taken into account. Just as nobody has yet developed error-free machine translation, no software can be 100% accurate in anonymization, but this toolkit will make document sharing much safer and easier than it is today.
Technical Approach to Anonymization
The toolkit will use NERC (Named-Entity Recognition and Classification) techniques based on deep learning and neural networks. The project faces some challenges, the most prominent being the languages involved. Some, such as Estonian, Latvian, Slovenian, Lithuanian and Croatian, are considered under-resourced. Others, such as Irish and Maltese, are considered ultra-under-resourced, which makes the challenge even greater. All these languages will be addressed using a polyglot NERC method.
Additionally, these novel systems can be trained with only small quantities of manually labelled data, thanks to the transfer learning capabilities of modern deep learning models.
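Once a NERC model has labelled the entities in a text, anonymizing it comes down to replacing the detected spans. The Python sketch below assumes a hypothetical model output of `(start, end, label)` character offsets that do not overlap. It does not reflect any actual MAPA interface.

```python
from typing import List, Tuple

# (start, end, label) with end exclusive; assumed output of a NERC model.
Span = Tuple[int, int, str]


def replace_entities(text: str, spans: List[Span]) -> str:
    """Replace each labelled, non-overlapping span with a [LABEL] placeholder."""
    # Process spans from the end of the text so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text
```

For example, given the text "Maria Lopez was treated in Valencia." and the spans `(0, 11, "PERSON")` and `(27, 35, "LOCATION")`, the function returns "[PERSON] was treated in [LOCATION]."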
Last but not least, MAPA will be feature-rich and the NERC approach will be complemented with other configurable mechanisms such as pattern detection based on regular expressions (passport or ID numbers, telephone numbers, street addresses, blood groups, age, sex, marital status, email addresses, bank accounts, etc.).
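Pattern detection with regular expressions could look like the following Python sketch. The three patterns and the `redact` function are simplified assumptions for illustration. Real identifiers such as passport or telephone numbers vary by country and would need far more careful rules.

```python
import re

# Illustrative patterns only; they are deliberately simple and not exhaustive.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # International-style numbers: "+" country code, then 2 to 4 digit groups.
    "PHONE": re.compile(r"\+\d{1,3}(?:[ -]?\d{2,4}){2,4}"),
    # Blood groups such as A+, O- or AB+, not followed by a word character.
    "BLOOD_GROUP": re.compile(r"\b(?:AB|A|B|O)[+-](?!\w)"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a [LABEL] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Keeping the patterns in a dictionary reflects the "configurable" idea: detectors can be added or removed per deployment without touching the replacement logic.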