| title | chunk | source | category | tags | date_saved | instance |
|---|---|---|---|---|---|---|
| Open scientific data | 9/11 | https://en.wikipedia.org/wiki/Open_scientific_data | reference | science, encyclopedia | 2026-05-05T03:49:42.862927+00:00 | kb-cron |
=== Integration to the research workflow ===
In contrast with the broad incentives for data sharing included in the early policies in favor of open scientific data, the complexity, underlying costs, and requirements of scientific data management are increasingly acknowledged: "Data sharing is difficult to do and to justify by the return on investment." Open data is not simply a supplementary task but has to be envisioned throughout the entire research process, as it "requires changes in methods and practices of research." The opening of research data redistributes costs and benefits.
Public data sharing introduces a new communication setting that contrasts primarily with private data exchange between research collaborators or partners. The collection, purpose, and limitations of the data have to be made explicit, since it is impossible to rely on pre-existing informal knowledge: "the documentation and representations are the only means of communicating between data creator and user." Without proper documentation, the burden of recontextualization falls on potential users, which may render the dataset useless. Publication also requires further verification of the ownership of the data and of the potential legal liability if the data is misused. This clarification phase becomes even more complex in international research projects that may span several jurisdictions.
Data sharing and the application of open science principles also bring significant long-term advantages that may not be immediately visible. Documenting the dataset helps clarify the chain of provenance and ensures either that the original data has not been significantly altered or, if it has been, that all further treatments are fully documented. Publication under a free license also allows tasks such as long-term preservation to be delegated to external actors.
By the end of the 2010s, a new specialized literature on data management for research had emerged to codify existing practices and regulatory principles.
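One common way to support the claim that "the original data has not been significantly altered" is to record a checksum for every file when the dataset is published and to compare against it later. The following is a minimal sketch under that assumption, not a method described in the sources above; the function names `sha256_of`, `build_manifest`, and `changed_files` are hypothetical and use only the Python standard library.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to `directory`) to its checksum."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(directory, manifest):
    """List files that were added, removed, or modified since `manifest`."""
    current = build_manifest(directory)
    names = set(manifest) | set(current)
    return sorted(n for n in names if manifest.get(n) != current.get(n))
```

A manifest built at publication time can be stored alongside the dataset's documentation; an empty result from `changed_files` then indicates that the files still match what was originally released.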
=== Storage and preservation ===
The availability of non-open scientific data decays rapidly: in 2014, a retrospective study of biological datasets showed that "the odds of a data set being reported as extant fell by 17% per year". Consequently, the "proportion of data sets that still existed dropped from 100% in 2011 to 33% in 1991". Data loss has also been singled out as a significant issue in major journals like Nature or Science. Surveys of research practices have consistently shown that storage norms, infrastructures, and workflows remain unsatisfactory in most disciplines. The storage and preservation of scientific data were identified early on as critical issues, especially for observational data, which are considered essential to preserve because they are the most difficult to replicate.
A 2017-2018 survey of 1372 researchers contacted through the American Geophysical Union shows that only "a quarter and a fifth of the respondents" report good data storage practices. Short-term and unsustainable storage remains widespread, with 61% of the respondents storing most or all of their data on personal computers. Because they are easy to use at an individual scale, unsustainable storage solutions are viewed favorably in most disciplines: "This mismatch between good practices and satisfaction may show that data storage is less important to them than data collection and analysis".
The reference model of the Open Archival Information System (OAIS), revised in 2012, states that scientific infrastructure should seek long-term preservation, that is, "long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community". Consequently, good data management practices involve both storage (to materially preserve the data) and, even more crucially, curation, "to preserve knowledge about the data to facilitate reuse".
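To make the decay figure cited at the start of this section concrete, note that a 17% yearly fall applies to the odds, p / (1 - p), and not to the probability p itself. The sketch below compounds that fall over several years. The starting probability of 0.95 is an illustrative assumption, not a value from the study, and the function names are hypothetical.

```python
def probability_to_odds(p):
    """Convert a probability (0 <= p < 1) to odds."""
    return p / (1.0 - p)


def odds_to_probability(odds):
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)


def extant_probability(initial_probability, years, yearly_odds_drop=0.17):
    """Probability that a dataset is still extant after `years`,
    assuming the odds shrink by `yearly_odds_drop` every year."""
    odds = probability_to_odds(initial_probability)
    odds *= (1.0 - yearly_odds_drop) ** years
    return odds_to_probability(odds)


if __name__ == "__main__":
    # Assumed starting point: 95% of datasets extant at publication.
    for years in (0, 5, 10, 20):
        print(years, round(extant_probability(0.95, years), 3))
```

Because the fall is multiplicative on the odds, availability declines slowly while most datasets are still extant and faster once the probability approaches one half.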
Data sharing on public repositories has helped mitigate preservation risks, thanks to the long-term commitment of data infrastructures and the potential redundancy of open data. A 2021 study of 50,000 data availability statements published in PLOS One showed that 80% of the datasets could be retrieved automatically, and that 98% of the datasets with a data DOI could be retrieved automatically or manually. Moreover, accessibility did not decay significantly for older publications: "URLs and DOIs make the data and code associated with papers more likely to be available over time". No significant benefit was found, however, when open data was not correctly linked or documented: "Simply requiring that data be shared in some form may not have the desired impact of making scientific data FAIR, as studies have repeatedly demonstrated that many datasets that are ostensibly shared may not actually be accessible."
=== Plan and governance ===