By Fiona Nielsen, CEO at DNAdigest and Repositive and Marta Teperek, Research Data Facility Manager at the University of Cambridge
Sharing research data comes with many ethical and legal issues. Since these issues are often complex and can rarely be solved with one-size-fits-all solutions, they tend not to be addressed as topics of conferences and workshops. We therefore thought that the gathering of data curation professionals at IDCC 16 would be an excellent opportunity to start these discussions. This blog post is our informal report from a Birds of a Feather discussion on sharing of personal/sensitive research data, which took place at the International Digital Curation Conference in Amsterdam “Visible data, invisible infrastructure” on 23 February 2016.
The need for good models for sharing personal/sensitive data
Many funders and experts in data curation agree that sharing personal and sensitive data needs to be planned from the start of a research project in order to be successful. Whenever it is possible to anonymise research data, this is the advised procedure to follow before data is shared. For data which cannot be anonymised, governance procedures for data access need to be established. We were interested to find out what practical solutions for sharing personal/sensitive data are offered by the data curators and data managers who came to the meeting. To our surprise, only two data curators said that they provide solutions for hosting personal/sensitive data. Of these two, one repository accepted only anonymised data. The rest were not currently making personal/sensitive data available via their repositories. Why is sharing personal/sensitive data so difficult to manage? Three main issues were discussed: the difficulty of anonymisation, problems with providing managed access to research data, and technical issues.
There was a lot of discussion about data anonymisation. When anonymising data one has to consider both direct and indirect identifiers. One of the data curators present at the meeting explained that their repository would accept anonymised data provided that they had no direct identifiers and a maximum of three indirect identifiers. But sometimes even a small number of indirect identifiers can make participants identifiable, especially in combination with information available in the public domain. So perhaps instead of talking about data anonymisation one should rather focus on estimating the risk of re-identification of participants. It would be useful for the community if tools for assessing the risk of participant re-identification in anonymised datasets were available, giving data curators a means to assess and evaluate these risks objectively.
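To make the idea of estimating re-identification risk more concrete, here is a minimal sketch in Python. It is not a tool that was presented at the session. It uses one common approach that we are assuming for illustration (often called k-anonymity): count how many records share each combination of indirect identifiers, and look at the smallest group. The field names and sample records are hypothetical.

```python
from collections import Counter

# Hypothetical sample records; the field names are assumptions made for this sketch.
SAMPLE_RECORDS = [
    {"age_band": "30-39", "postcode_area": "CB1", "occupation": "nurse"},
    {"age_band": "30-39", "postcode_area": "CB1", "occupation": "teacher"},
    {"age_band": "40-49", "postcode_area": "CB2", "occupation": "nurse"},
    {"age_band": "40-49", "postcode_area": "CB2", "occupation": "nurse"},
]


def group_sizes(records, indirect_identifiers):
    """Count how many records share each combination of indirect-identifier values."""
    return Counter(
        tuple(record[name] for name in indirect_identifiers)
        for record in records
    )


def smallest_group_size(records, indirect_identifiers):
    """Return the size of the smallest group (the 'k' in k-anonymity).

    A value of 1 means at least one participant is unique on these
    identifiers and is therefore easier to re-identify.
    """
    if not records:
        raise ValueError("no records to assess")
    return min(group_sizes(records, indirect_identifiers).values())


def fraction_unique(records, indirect_identifiers):
    """Return the share of records that are unique on the given identifiers."""
    if not records:
        raise ValueError("no records to assess")
    sizes = group_sizes(records, indirect_identifiers)
    unique = sum(
        1 for record in records
        if sizes[tuple(record[name] for name in indirect_identifiers)] == 1
    )
    return unique / len(records)


if __name__ == "__main__":
    for fields in (["age_band", "postcode_area"],
                   ["age_band", "postcode_area", "occupation"]):
        print(fields,
              smallest_group_size(SAMPLE_RECORDS, fields),
              fraction_unique(SAMPLE_RECORDS, fields))
```

In this toy dataset, adding a third indirect identifier is enough to make some records unique, which illustrates why a fixed limit on the number of indirect identifiers does not by itself guarantee low risk. A real assessment would also need to consider information available in the public domain, which this sketch does not attempt.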
Problems with managed access to research data
If repositories accept sensitive/personal research data they need to have robust workflows for managing access requests. The Expert Advisory Group on Data Access (EAGDA) has produced a comprehensive guidance document on governance of data access. However, there are difficulties in putting this guidance into practice. If a request for data access is received by a repository, the request will be forwarded to a person nominated by the research team to handle data requests. However, research data are usually expected to be preserved long-term (5 years plus), and such long-term periods are often longer than the time researchers spend at their institutions. This creates a problem: who will be there to respond to data access requests? One of the institutions accepting sensitive/personal data has a workflow in which the initial request is forwarded to the nominated person. If the nominated person is no longer available, the request is then directed to the faculty’s head. However, this also creates problems:
- Contact details for the nominated person need to be kept up to date and researchers leaving the post might not remember to notify the repository managers.
- The faculty’s head might be too busy to respond to requests and might have insufficient knowledge about the data to be able to manage access requests effectively.
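The fallback workflow described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the system used by the institution in question: the `Contact` class, the `active` flag and the `route_access_request` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Contact:
    """A person who can decide on data access requests (hypothetical structure)."""
    name: str
    email: str
    active: bool = True  # False once the person has left or cannot be reached


def route_access_request(
    nominated_person: Optional[Contact],
    faculty_head: Optional[Contact],
) -> Tuple[str, Contact]:
    """Pick who should receive an access request.

    The nominated person is tried first; if they are no longer available,
    the request falls back to the faculty's head. If neither can be reached,
    an error is raised so that repository staff notice the gap.
    """
    candidates = (
        ("nominated person", nominated_person),
        ("faculty head", faculty_head),
    )
    for role, contact in candidates:
        if contact is not None and contact.active:
            return role, contact
    raise LookupError("no available contact for this dataset")


if __name__ == "__main__":
    researcher = Contact("A. Researcher", "a.researcher@example.org", active=False)
    head = Contact("B. Head", "b.head@example.org")
    role, contact = route_access_request(researcher, head)
    print(f"Forward request to {contact.name} ({role})")
```

Note that the sketch only works if the `active` flag is kept up to date, which is exactly the first problem listed above, and it says nothing about whether the fallback contact knows enough about the data to decide.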
Technical issues and workflows if things go wrong
There are also technical issues associated with sharing of personal/sensitive research data. One of the institutions reported that due to a technical fault in the repository system, restricted research data was released as open access data and downloaded by several users (who did not sign the data access agreement) before the fault had been noticed. Follow-up discussions led to a reflection that a repository can never be 100% sure of the security of personal/sensitive data. Even assuming that technical faults will not happen, repositories can also be subject to hacking attacks. Therefore, when accepting personal/sensitive data for long-term preservation, repository managers should also assess the risks of data being inappropriately released and decide on a suitable risk mitigation strategy. Additionally, institutions should have workflows in place with procedures to be followed should things go wrong and restricted data be inappropriately released.
Apart from the topics mentioned above, we also discussed other issues related to sharing personal/sensitive research data. What workflows do organisations have in place to check that data depositors have the rights to share confidential research data or data generated in collaboration with third parties (external collaborators, external funding bodies, commercial partners)? How can the number of checks required to validate that the data depositor has the rights to share be balanced against the risk of discouraging data depositors from sharing their research via a repository? Or, if research data cannot be safely shared via a repository, do organisations offer the possibility of creating metadata-only records to facilitate data discoverability? What are the implications for DOI creation?
Our discussions revealed that there are clearly more questions than answers available on how to effectively share personal/sensitive data. Therefore it is important that we, as the community of practitioners, start developing workflows and procedures to address these problems. SciDataCon 2016 (11-13 September 2016) is organising a call for session proposals (deadline: 7 March) and we would like to propose a session on sharing of personal/sensitive data. If you have any practice papers that you would like to propose for this session, please fill in a Google form here. Please note that the Google form is for submitting your proposals for the session to us (it is not an official submission form for the conference). We will use your proposed practice papers to form a session proposal for the conference.
Possible topics for practice papers for the session:
- What are the workflows for sharing commercial and sensitive data via repositories?
- How is your organisation trying to balance the protection of confidential data against encouraging sharing?
- What safety mechanisms are there in place at your organisation to safeguard confidential data shared via your repository?
- What are the workflows and procedures in place in case confidential/restricted/embargoed data is accidentally released?
- What procedures are followed to ensure that data depositors have the rights to share confidential research data or data generated in collaboration with third parties (external collaborators, external funding bodies, commercial partners)?
- How do organisations balance the number of checks required to validate that the data depositor has the rights to share against the risk of discouraging data depositors from sharing their research via a repository?
- Other case studies/practice papers on the subject
- Form to submit proposals for practice papers for our session on sharing of personal/sensitive data at SciDataCon 2016: http://goo.gl/forms/K4tlV0HEbT
- RDA/NISO Privacy Implications of Research Data Sets Working Group: https://rd-alliance.org/groups/rdaniso-privacy-implications-research-data-sets-wg.html
- EAGDA’s guidance document on data governance: http://www.wellcome.ac.uk/About-us/Policy/Spotlight-issues/Data-sharing/EAGDA/wtp059350.htm
- Photos of all the flip chart notes taken during the #IDCC16 BoF session on Feb 23, 2016: https://drive.google.com/file/d/0B9f7hIvaxS5eUWZBVTlKQ2JXSG8/view?usp=sharing
- Blog post by Marta Teperek about the Birds of a Feather discussion at IDCC 16: https://unlockingresearch.blog.lib.cam.ac.uk/?p=551