Sharing powder diffraction raw data: challenges and benefits

Scientific data are as important as scientific publications. If this statement holds true, why are we not routinely sharing scientific data? The tools are now out there, for instance Zenodo and related repositories. It could be a lack of motivation of researchers derived from an apparent lack of short-term reward. Here the author will try to show the importance of sharing ready-to-analyse raw powder diffraction data with immediate benefits for authors and for the wider community. Moreover, it is speculated that sharing curated scientific data may have more important medium-term benefits, including credibility and not least reproducibility. Raw data sharing is coming.


Introduction
It is nine years since Nature dedicated an editorial and a section to preserving research data and making it accessible (Nature Editorial, 2009). Initially, sharing raw data was conceived mainly for helping experiment replication and data analysis improvement. However, the arrival of artificial intelligence and machine-learning tools makes sharing scientific data even more important as new (unexpected by the original research teams) correlations could emerge when interrogating many related and shared data sets (Warren, 2018). Warren (2018) in his article based on the Fred Kavli Distinguished Lectureship in Materials Science stated '... The existing publication paradigm is an accident of history. If you were designing the scientific publication system today instead of letting it evolve over more than 500 years, it might be a little bit different, optimized toward better scientific outcomes.' I cannot agree more, and to me this is the main explanation of the fact that, as of today, most researchers are not releasing the raw data associated with their publications. It may also justify why well reputed journals are not requesting the sharing of raw data associated with a publication either.
Funding agencies are starting to request research data sharing, although this seems to be still in its infancy. As an example, the European Commission has very recently launched Recommendation C(2018)2375, adopted on 25 April 2018, on access to and preservation of scientific information (https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32018H0790). Open data is just a subset of a much larger framework of 'open science', which has several main pillars: (i) open access (for publications); (ii) open data (for replication and new learning); (iii) open science evaluation (for improving metrics and impacts); and (iv) open science tools (for repositories, data services etc.). From now on I will focus just on scientific data sharing. The International Union of Crystallography (IUCr) has been leading efforts for many years through its journals such as Acta Crystallographica Section C in the publishing and sharing of data (both reduced and derived) linked to the scientific publication narrative in words. In the crystal structure determination field, the raw data are the diffraction images/patterns, the reduced data are the structure factors, and the derived data are the atomic coordinates, atomic displacement parameters and atom occupancies (Kroon-Batenburg et al., 2017). Building on all this, with the advent of new, huge, digital storage opportunities, there have been several recent reports underlining the importance of sharing raw diffraction data [see the article by Helliwell et al. (2017), and references therein, which provides a wide range of case studies in crystallography and related fields]. These case studies span many of the IUCr Commissions and fall into three general categories: those where data sharing is beneficial, those where data preservation is important in allowing further progress and those where the absence of data is a significant problem. Furthermore, the necessity of having accurate metadata associated with the raw data for their correct processing has been highlighted (Kroon-Batenburg et al., 2017).
Concerning powder diffraction data, the International Centre for Diffraction Data (ICDD) has been collecting these types of data since 1941 in evolving formats (Bruno et al., 2017). The data are stored in the Powder Diffraction File database, housing more than 890 000 (reduced) diffraction patterns and 330 000 crystal structures in the 2016 release. A timely development for quite some years has been the archiving of more than 11 000 (unreduced) raw powder diffraction patterns. These data can be used for different purposes. For instance, the crystal structure of trandolapril was solved from its archived raw powder diffraction data (Reid et al., 2016), highlighting the utility of raw data deposition in the Powder Diffraction File.
In this context, the IUCr has over many years developed tools to facilitate the sharing of diffraction data in general and powder diffraction data in particular. For instance, the original development of the CIF format (Hall et al., 1991; Bernstein et al., 2016), already adopted in 1990 for storing and distributing crystallographic derived data, has evolved to be able to archive and share diffraction data (https://www.iucr.org/resources/cif). This mechanism is being extended and updated and now also includes a dictionary specifically for powder diffraction, pdCIF (see https://www.iucr.org/resources/cif/dictionaries/cif_pd).
Data sharing goes beyond replicating and improving analysis results. It can yield new science as machine-learning tools will obtain new outcomes when multiple data sets are investigated, our science's version of big data. However, there are intermediate stages as new databases will be created just by sharing curated raw data in appropriate repositories. For instance, the proposal for constructing an international X-ray absorption fine structure (XAFS) database (Asakura et al., 2018) is very interesting. This database could be developed by sharing curated raw XAFS data recorded at synchrotrons, ideally in cross-validated beamlines. In my opinion, to have such a common international database should be a priority.

Some definitions and scope
Raw data is a very difficult term to define, as even the first data set stored out of a detector (normally termed 'primary data') may already have been processed by the detector firmware. These 'primary data' can be corrected (if needed, for instance for dark and flat fields), and then pre-processed and post-processed to give the final 'ready-to-analyse scientific data'. In some disciplines these are termed reduced data but this is not the case in other fields. In single-crystal work, for example, there are a variety of benefits of not just taking the predicted diffraction spot positions and their intensities; the whole diffraction image instead is preserved. Thus far it is deemed reasonable to assume that the detector firmware corrections are well established and acceptable, and that what we might call the very primary data, before those corrections are applied, need not be preserved.
Let us now consider other fields. A full pipeline for data processing (and analysis) for synchrotron X-ray full-field tomographic microscopy has been recently reported (Marone et al., 2017). For tomographic work, including X-ray diffraction computed tomography, key steps include the sinogram generator in the pre-processing step and the reconstruction module in the processing step. Note that tomographic beamlines can generate more than ten terabits of 'raw and processed data' in a single day. Then, data analysis takes over, with visualization, segmentation, understanding etc.
In the powder diffraction field, the terms raw data, reduced data and derived data are still under debate, and here I give my view. On the one hand, raw data could be considered any patterns (processed at different levels) which still keep their (intensity of scattered photons versus diffraction angle) data-point character (see Fig. 1). On the other hand, there are many types of derived data depending upon the application: (i) atomic parameters in structure determination; (ii) phase contents in quantitative phase analysis; (iii) average coherent diffraction domain size and microstrains in microstructural analysis; and so on. Reduced data could be exemplified by a list of diffraction peak positions and intensities. A similar approach could be taken for the X-ray absorption field, where raw data could be considered any pattern conserving its transmitted/emitted photons versus energy data-point character. Therefore, incident intensity, I0, correction and energy calibration are processing steps yielding still (processed) raw data.
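The distinction drawn above between raw and reduced powder data can be illustrated with a minimal Python sketch. It assumes that a raw pattern is held as (2θ, counts) pairs and reduces it to a peak list by picking local maxima above a threshold. The function name reduce_to_peak_list, the threshold and the toy numbers are hypothetical and serve only as an illustration; real peak-search software relies on far more robust procedures (background subtraction, profile fitting).

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (2theta in degrees, counts)


def reduce_to_peak_list(pattern: List[Point], min_counts: float) -> List[Point]:
    """Return (position, intensity) for strict local maxima above min_counts.

    Hypothetical helper: a crude stand-in for a real peak search.
    """
    peaks = []
    for i in range(1, len(pattern) - 1):
        left = pattern[i - 1][1]
        here = pattern[i][1]
        right = pattern[i + 1][1]
        if here >= min_counts and here > left and here > right:
            peaks.append(pattern[i])
    return peaks


# Toy raw pattern (invented numbers): every data point is kept.
raw_pattern = [
    (10.00, 12.0), (10.02, 15.0), (10.04, 90.0), (10.06, 18.0),
    (10.08, 11.0), (10.10, 14.0), (10.12, 250.0), (10.14, 20.0),
]

# Reduced data: only peak positions and intensities survive.
peak_list = reduce_to_peak_list(raw_pattern, min_counts=50.0)
print(peak_list)
```

The raw pattern keeps its data-point character and can be reprocessed at will, whereas the peak list cannot be turned back into the pattern it came from.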
If raw data are in fact several (related) data sets (for instance, primary, pre-processed, post-processed), a key question arises: which 'raw data' should be shared by a meticulous researcher? There is no community-agreed answer but I offer my opinion. Firstly I state a caveat that concerns me as current Scientific Director of a large facility (the ALBA synchrotron). Data sharing is not free and someone (probably the funding agencies rather than the individual researcher) will pay the costs. The more raw/scientific data we store, the more funding will be needed. So, I think that we have to pay special attention, at least at large facilities, with regard to which raw data to store and how we store them. Furthermore, most science budgets are capped, and this is an additional reason to be efficient. Every large facility is developing a data policy plan and the ALBA data policy was approved in July 2017; it can be found at https://www.cells.es/en/users/callinformation. At this point it should be mentioned that many data sets may never be used in publications. Archived raw data (and metadata) will be made openly available after a number of years of embargo (three for most synchrotron facilities in Europe). This embargo time could be extended if a justified request by the researchers is issued. Some data sets labelled as tests, alignment etc. will not be archived and therefore these data will not be made openly available.
To make things a little easier for facilities but perhaps not straightforward for researchers, the HDF5 file format is becoming a standard, where data sets with different levels of processing are stored in the same file with a hierarchical structure (Könnecke et al., 2015). The outcome of one experiment can be several data sets (for instance in HDF5 file format) with subsets. It could be possible to make openly available some data sets, associated with a given publication, while keeping other data sets under embargo. However, it is not foreseen that a subset of data within a data set would be published. Therefore, researchers must be aware of, and aim at having, the proper granularity when acquiring raw data. So, coming back to our question, an obvious answer could be 'primary raw data', in other words the images/data directly out of the detector. These data should be archived with all associated metadata for data processing and sample characterization. However, are these data ready to be analysed by peer scientists and to be easily interrogated by artificial intelligence and machine-learning tools? My answer, today, is still no. Therefore, I advocate compulsory sharing of processed powder diffraction data associated with publications, knowing that some flaw(s) in the processing steps could exist. There are reasons for this opinion, the most important being to facilitate use of the data. Primary raw data, and processed data at different levels, could be stored by the facilities (at their data centres or in the cloud) and retrieved on demand when needed (for free?). Additional views in this complex issue are given elsewhere [for instance Kroon-Batenburg et al. (2017)].
When the primary raw data have been archived by a facility, the link between the experimental data (primary raw data and metadata) and the publication can be made by providing the doi of the experimental data in the publication. The final goal of the facilities is to provide the doi to the user as soon as the data are produced, but this is still not implemented. For example, ESRF is currently archiving experimental data for macromolecular crystallography beamlines with their associated doi (https://doi.esrf.fr/). See, for instance, https://doi.org/10.15151/ESRF-ES-86533633 for the output of one experiment with the data set(s) archived in 2018 and so under embargo until 2021. However, if only primary data are shared (which will require processing steps that do not need to be necessarily user friendly for every synchrotron user), the risk of a very limited outcome from this (big) effort should not be underestimated. This could lead to a poor perception of scientific research by the layperson and concerns from funding agencies. At this stage, however, it is also fair to say that the new digital storage centres have only recently been made available, and researchers have not taken advantage of them much as yet. Thus the coming decade, say, is in effect a pilot period of gaining much more experience of the opportunities available.
Therefore, the scope of this communication is now restricted to sharing of ready-to-analyse, fully processed when needed, powder diffraction data. A ready-to-analyse data set is defined as data that do not need further treatment/processing in order to be analysed by the software of the powder diffraction scientific community (Rietveld, pair distribution function, auto-indexing etc.).
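As a hedged illustration of what 'no further treatment' can mean in practice, the following Python sketch reads a plain-text pattern and applies some basic sanity checks. The file layout (two whitespace-separated columns, 2θ and counts, with '#' comment lines) is an assumption made for this example and not a format prescribed here; the function names read_xy_pattern and passes_basic_checks are likewise hypothetical. Passing these checks is a necessary condition only and does not by itself make a data set ready to analyse.

```python
import io
from typing import List, TextIO, Tuple

Point = Tuple[float, float]  # (2theta in degrees, counts)


def read_xy_pattern(handle: TextIO) -> List[Point]:
    """Parse a two-column pattern, skipping blank lines and '#' comments.

    The two-column layout is an assumption of this sketch.
    """
    points = []
    for line in handle:
        text = line.strip()
        if not text or text.startswith("#"):
            continue
        fields = text.split()
        points.append((float(fields[0]), float(fields[1])))
    return points


def passes_basic_checks(points: List[Point]) -> bool:
    """Non-empty, strictly increasing angle and non-negative counts."""
    if not points:
        return False
    angles = [p[0] for p in points]
    increasing = all(a < b for a, b in zip(angles, angles[1:]))
    return increasing and all(p[1] >= 0 for p in points)


# Invented values standing in for a deposited ASCII file.
demo = io.StringIO(
    "# 2theta(deg) counts\n"
    "1.000 105\n"
    "1.006 98\n"
    "1.012 110\n"
)
pattern = read_xy_pattern(demo)
print(len(pattern), passes_basic_checks(pattern))
```

A file that loads in this way can be passed on to analysis software without any facility-specific processing code.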

Sharing raw powder diffraction data: what raw data?
We have to lead by example. Since mid-2017 our research group in eco-cements has been openly sharing all scientific diffraction data at Zenodo prior to the submission of all of our manuscripts. Therefore, normally in the supplementary information, we describe every file deposited and give the doi link to the Zenodo-deposited diffraction data sets, including sample description and experimental conditions. For instance, we studied gels in cements by the synchrotron radiation pair distribution function, PDF (Cuesta et al., 2017), where 12 processed raw diffraction data sets were deposited. These data sets included ten patterns for cement pastes, plus the data for the empty capillary for data pre-analysis, and the pattern for the nickel sample employed as standard for data analysis assessment, which was recorded under the same experimental conditions (https://doi.org/10.5281/zenodo.890584). These diffraction data were collected at the MYTHEN-II detector system of the MSPD beamline at the ALBA synchrotron. Our detector contains six modules with 1280 channels each (7680 data points). The standard protocol at ALBA for PDF data acquisition (a macro) is to collect 72 patterns at different starting angular positions, with more acquisitions at higher angles, taking 30 s per position. Considering the recording time and motor movement, a data set takes 37 min. 'Primary data' are stored on the hard disk from the six modules, merged by the firmware, for every starting position. Then the 72 patterns are merged together, with local software, to give a total diffraction raw powder pattern. To improve the statistics, and to mitigate problems due to software failures and beam dumps, five patterns are usually collected and merged (185 min). The final fully processed 'ready-to-analyse' powder diffraction pattern is a single ascii file after all this data processing. I advocate, at present, that this is the scientific data to be deposited and shared and not the 360 (72 × 5) 'primary' raw powder diffraction data sets or the five processed powder diffraction data sets.
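The figures quoted for this acquisition protocol can be cross-checked with a few lines of Python. This is only a bookkeeping sketch: the constant names are hypothetical, the one-minute overhead is inferred here from the quoted 37 min rather than stated by the facility, and merge_by_summing assumes that repeated patterns share one angular grid and are combined by adding counts, which may differ from what the local merging software actually does.

```python
MODULES = 6
CHANNELS_PER_MODULE = 1280
POSITIONS_PER_DATA_SET = 72
SECONDS_PER_POSITION = 30
MINUTES_PER_DATA_SET = 37  # quoted value, includes motor movement
REPEATS = 5

data_points = MODULES * CHANNELS_PER_MODULE
counting_minutes = POSITIONS_PER_DATA_SET * SECONDS_PER_POSITION / 60
overhead_minutes = MINUTES_PER_DATA_SET - counting_minutes  # inferred
total_minutes = MINUTES_PER_DATA_SET * REPEATS
primary_patterns = POSITIONS_PER_DATA_SET * REPEATS


def merge_by_summing(patterns):
    """Sum counts point by point (assumed merging rule, common grid)."""
    lengths = {len(p) for p in patterns}
    if len(lengths) != 1:
        raise ValueError("patterns must have the same number of points")
    return [sum(values) for values in zip(*patterns)]


print(data_points, counting_minutes, total_minutes, primary_patterns)
# Toy example with invented counts: two repeats of a three-point pattern.
print(merge_by_summing([[10, 20, 30], [12, 18, 33]]))
```

The arithmetic shows why depositing the single merged file is attractive: it replaces 360 primary patterns without discarding the data-point character of the measurement.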

A summary of the information-processing scheme for synchrotron powder diffraction, from primary raw data to derived data, is given in Fig. 1. The primary data strongly depend on the detector used and, for just one point detector, raw data processing is probably not needed. For several point detectors, just merging of the primary data is likely to be required. On the other hand, for 2D detectors, several processing steps are needed for attaining validated raw powder diffraction patterns (see Fig. 1). There is an extensive tradition in the crystallography field of sharing derived data through a number of databases. The description of these databases is out of the scope of this manuscript and the reader is referred to specific publications (Gražulis et al., 2009; Hellenbrandt, 2014; Glasser, 2016). However, the situation is evolving and the Cambridge Structural Database (Groom et al., 2016), initially focused on sharing derived data (crystal structures with carbon-carbon bonds), is moving forward by also archiving diffraction data in CIF format. Therefore, it could be a natural evolution that the Crystallography Open Database (Gražulis et al., 2009), also initially centred on sharing derived data, could evolve by sharing raw diffraction data.
A second example is a study comparing the accuracy in Rietveld quantitative phase analysis, RQPA, by using strictly monochromatic Mo and Cu radiation, with synchrotron patterns as benchmarks (León-Reina et al., 2016). This work was the basis for a chapter in International Tables for Crystallography, Vol. H (León-Reina et al., 2019), in relation to which we have deposited 81 processed raw powder diffraction data sets from three different diffractometers: laboratory Cu Kα1, laboratory Mo Kα1 and synchrotron radiation (https://doi.org/10.5281/zenodo.1291899). In this case, and for the synchrotron diffraction patterns, a MYTHEN-II detector system was also used with an RQPA data acquisition protocol: data were collected from three angular positions with a total acquisition time of 20 min. Three patterns, taken at different positions along the capillaries, were collected for each sample to ensure the homogeneity of the filling of the capillary. For the synchrotron data, every deposited diffraction pattern came from nine 'primary' data sets. A third powder diffraction example is even more tricky (Cuesta et al., 2018). In this case, PDF data for cement pastes were recorded at the ID15A beamline (ESRF synchrotron), which is equipped with a Pilatus3 X CdTe 2M hybrid photon-counting two-dimensional detector. In this case eight processed raw diffraction data sets were deposited: six patterns for cement pastes, one pattern for the empty capillary and one pattern for the nickel standard (https://doi.org/10.5281/zenodo.1255629). For the PDF study, the following data acquisition protocol was followed: eight images were collected for each paste, with an acquisition time of 8 s per image, requiring approximately 1 min per pattern with very good statistics. The images were added together with outlier elimination, to remove artefacts from, for example, cosmic rays and decays of thorium contained in the granite upon which the diffractometer is mounted. This detector has 24 modules and each module is composed of six wire-bonded submodules, which makes the pixel size in the bonding region of the submodules three times larger. The detector has an energy cut-off which is typically set at half the incident energy to avoid over- or undercounting photons of wafer borders. The consequence of this (firmware-based) cut-off is to eliminate the vast majority of sample fluorescence (and part of the Compton contribution), particularly when working at high

5. Credibility section

Competing Interests
The author declares no competing interests.

Raw data sharing
All 'ready-to-analyse' raw powder diffraction data described in this communication have been deposited with Zenodo at the links given in §3.

Weakest point(s) self-assessment
I do not have sound grounds to advise which 'raw data' should be shared. This would require a powder diffraction community discussion led by the IUCr's Commission on Powder Diffraction. Suffice it to say, I see that there are advantages and disadvantages for the different options. I advocate in detail here that 'ready-to-analyse' fully processed powder diffraction data should be routinely deposited and shared rather than the data before the instrument, i.e. detector, corrections. This would then be the same approach as the current single-crystal diffractionist approach, where the firmware corrections are fully trusted and it is after those firmware corrections that the raw diffraction images are recommended to be preserved.

Data accountability
Synchrotron powder diffraction data were recorded at the BL04-MSPD beamline, ALBA synchrotron, and the ID15A beamline, ESRF. Laboratory powder diffraction data were recorded at the diffractometers of the SCAI central services, University of Malaga, Spain.
(i) open access (for publications); (ii) open data (for replication and new learning); (iii) open science evaluation (for improving metrics and impacts); and (iv) open science tools (for repositories, data services etc.). An overview of open science is
ISSN 1600-5767 © 2018 International Union of Crystallography

Figure 2
Laboratory Cu Kα1 pattern for NIST reference Portland clinker SRM 2686a. (Top) Powder pattern figure after García-Maté et al. (2019). (Bottom) Powder data downloaded by the reviewer, visualized and then annotated, requesting additional information from us, the authors.