Thought Leaders

Addressing the Challenges When Accessing Scientific Data

Dr. Stuart Chalk, Professor at the University of North Florida, talks to AZoM about the challenges facing researchers and industry when accessing scientific data, the research he is conducting and how resources like SpringerMaterials are helping make it easier to access.

SM: What are the major pain points for researchers when searching for scientific data?

SC: With such a heterogeneous and widely distributed global network of both proprietary and open data there are many issues that researchers must deal with when looking for scientific data. The general areas that these fall into are:

  • Data access – individual data is not copyrightable; however, the aggregation of data into a collection is.  Additionally, some of these collections are freely available and some are behind access restrictions or embargoes. Consequently, it is often hard for researchers to know whether they can legally take and use data.  Access is also restricted if the data is published in inappropriate file formats (e.g. PDF files). Finally, much data that is collected is never published – so-called ‘dark data’ – and is thus not available for researchers to find.

  • Data representation – how the data is reported has a big impact on its findability.  Simple things like the use of a period or comma as the decimal separator, or the annotation of the unit of measure (how it is reported), can impede searches. Data can be poorly characterized – with either too little or inaccurate metadata (contextual information) – and the organization of data (tables vs. databases vs. XML) can be a problem.

  • Data exchange – even if a researcher can obtain the data they need, their ability to “read” it or upload it into their specific research applications to work with it further can be a significant issue if it is not in an open format.  Much data (e.g. spectral outputs) is stored in proprietary formats that may not be readable 20 years after creation.  Individual large datasets (e.g. gigabytes in size or larger) or a large number of small data files may take a long time to load.

SM: Which initiatives take care of these pain points?

SC: There are a number of groups working on policies, tools, standards, or vocabularies/ontologies to move scientific data to an era where it will be significantly easier to find.  These include (but are not limited to):

  • Research Data Alliance (RDA) (https://www.rd-alliance.org/) was launched as a community-driven organization in 2013 by the European Commission, the United States National Science Foundation and National Institute of Standards and Technology, and the Australian Government’s Department of Innovation with the goal of building the social and technical infrastructure to enable open sharing of data. RDA working groups (WGs) focus on (for example) “Persistent Identifier (PID) Information Types”, “Data Foundation and Terminology”, “Wheat Data Interoperability”, and “Research Data Repository Interoperability” and their recommendations are being adopted across the globe.

  • Force11 (https://www.force11.org/) grew out of the “Future of Research Communications” workshop and is focused on development of a new publishing paradigm (including data) for the electronic era. To quote the Force11 website “FORCE11 is a community of scholars, librarians, archivists, publishers and research funders that has arisen organically to help facilitate the change toward improved knowledge creation and sharing. Individually and collectively, we aim to bring about a change in modern scholarly communications through the effective use of information technology”.

  • The Data FAIRport Initiative (http://www.datafairport.org/) came out of a workshop in 2014 where participants were focused on making access to scientific data fair.  FAIR in this context means Findability, Accessibility, Interoperability and Reusability, i.e. traits of FAIR data implementations.  These principles have been adopted by a number of organizations (e.g. RDA and FORCE11 above) and have become a foundation of the open data movement.

  • The Pistoia Alliance (http://www.pistoiaalliance.org/) is trying to tackle access to scientific data in the life science industry.  In a competitive industrial environment data is intellectual property (IP), and the bigger the dataset the bigger the potential IP.  However, the companies that started the alliance realized that collaboration on pre-competitive activities would be beneficial to all involved.  The alliance's projects now include the Chemical Safety Library, Hierarchical Editing Language for Macromolecules (HELM), and an Ontologies Mapping Project (improving consistency and identifying gaps in the semantic representation of knowledge).

  • International Council for Science: Committee on Data for Science and Technology (CODATA) (http://www.codata.org/) serves all the science disciplines and in addition to being responsible for consistency checking and updating the fundamental physical constants also has a number of activities focused on improving access to scientific data.  They include (Working Group (WG), Task Group (TG), Initiative (I)):

    • Uniform Description System v2.0 (for nanoscale materials)
    • Standard Glossary for Research Data Management (IRIDIUM) (WG)
    • Legal Interoperability of Research Data (WG)
    • Coordinating Data Standards amongst Scientific Unions (TG)

  • International Union of Pure and Applied Chemistry (IUPAC) (https://iupac.org/), as one of the scientific unions responsible for data standards, has formed the Subcommittee on Chemical Data Standards (SCDS) under the IUPAC Committee on Publications and Cheminformatics Data Standards (CPCDS).  The subcommittee is focusing on identifying current standards used in chemistry (e.g. JCAMP, ThermoML and InChI) and the development of future standards needed in the chemistry community.  Additionally, the IUPAC Compendium of Chemical Terminology (Gold Book - http://goldbook.iupac.org) is being evaluated as a source for a chemistry concept ontology.

  • National Institute of Standards and Technology (NIST) (https://www.nist.gov/) is the U.S. National Metrology Institute (NMI) and is responsible for high quality measurements, metrology, measurement science, and standard reference materials. It is currently initiating a project to develop a Digital Units Repository based on the QUDT (http://qudt.org/) semantic representation of scientific units.  Once implemented, scientific data can be made available with internationally recognizable digital units with mechanisms for automated conversion into equivalent units.

With all the activities above there is clear evidence that scholarly publications, as we currently know them, are being reinvented – transitioning to a data-centric model where it is as important (if not more so) that published research include the raw data the work is based on, so others can evaluate and reuse it.  In addition, the evolution of tools to manage, integrate and visualize data is closely linked to this change, as users will otherwise find themselves overwhelmed with data. One such recently announced tool is SpringerMaterials Interactive, a system that allows users to interact with data that our research group has captured from volumes of the Landolt-Börnstein series.

SM: How is your work addressing these pain points?

SC: Fundamental to issues around scientific data access/discovery is providing a mechanism to transmit the context of a measurement along with the data. Scientists have historically provided this through the publication of research papers in the peer-reviewed literature; however, this mechanism is quickly being recognized as a poor approach given the complexity, scope, and size of the data that the research is based on.  Currently scientists only publish the ‘important’ research data, normally in a condensed (aggregate) form, and if they do provide the data that a research paper is based on, it is commonly in a format (e.g. PDF) not amenable to use by other scientists – it’s not in a format that is FAIR.

In my group, we are working on providing a framework that will allow any scientific data to be represented along with its metadata but without mandating a structure for the data or requiring a particular platform.  The SciData framework for scientific data (https://doi.org/10.1186/s13321-016-0168-9) is based on the idea that, just like a research paper, scientific data and metadata (its contextual data) can be organized into three basic categories:

  • Methodology – how the data was obtained and what equipment was used
  • System – what is the data about, a chemical, an organism, a material, a molecular system (in computational chemistry)
  • Dataset – the data that was collected logically organized and linked to the information in the methodology and system categories

Along with some additional metadata about who did the research, what project it was part of, a citable link to access the data and a licensing statement, this is a succinct representation of the important information about anything from a single datum all the way up to a project-based dataset.  Examples of the framework can be found on the project's GitHub website (http://stuchalk.github.io/scidata/).
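As an editorial illustration (not taken from the interview or from the SciData specification), the three categories plus the extra metadata could be sketched as a nested Python structure. Every key name, identifier and value below is a hypothetical placeholder; the real framework defines its own vocabulary, which can be seen in the examples on the project site.

```python
import json

# Hypothetical record: every key name below is illustrative and is NOT
# taken from the SciData specification.
record = {
    "metadata": {
        "author": "A. Researcher",             # placeholder name
        "project": "Example project",          # placeholder project
        "link": "https://example.org/data/1",  # placeholder citable link
        "license": "CC BY 4.0",
    },
    # Methodology: how the data was obtained and what equipment was used
    "methodology": {
        "id": "methodology/1",
        "technique": "gravimetric analysis",
        "equipment": ["analytical balance"],
    },
    # System: what the data is about
    "system": {
        "id": "system/1",
        "substances": [{"formula": "CCl3F", "casrn": "75-69-4"}],
    },
    # Dataset: the data itself, linked back to methodology and system
    "dataset": {
        "methodology": "methodology/1",
        "system": "system/1",
        "datapoints": [{"quantity": "mass", "value": 0.1234, "unit": "gram"}],
    },
}


def links_resolve(rec):
    """Check that the dataset points at the methodology and system in the record."""
    ds = rec["dataset"]
    return (ds["methodology"] == rec["methodology"]["id"]
            and ds["system"] == rec["system"]["id"])


if __name__ == "__main__":
    print(links_resolve(record))
    print(json.dumps(record, indent=2))
```

The point of the sketch is only that the dataset carries explicit links to its context, so a single datum and a whole project can share the same outline.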

SM: In an ideal world, what would a scientific database look like? What are the most important features?

SC: This is a very important question and one I have thought about a lot in order to arrive at an answer.  Traditional databases are relational, meaning they have a schema (layout) that defines how data in one table relates to data in another, usually implemented by adding a column to a table that contains the unique foreign key of a row in another table.  This ‘rigid’ structure is not amenable to scientific data because it forces the data into a format that may not be appropriate to represent the data.

More recently, graph databases have become very popular and are based on the idea that any piece of information can be related to any other piece of information using a subject-predicate-object (spo) ‘triple’ (e.g. datapoint (s) - has numeric value (p) – 0.1234 (o); datapoint (s) – has unit (p) – gram (o)).  Clearly, there is no structure to this as anything can be related to anything else and as a result organization of data in such a database can be heterogeneous – making it difficult to search for data in a meaningful way because related data may be characterized differently.
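To make the triple idea concrete, here is a minimal sketch in plain Python (no graph database involved). The first two triples are the ones quoted above; `datapoint2` and its differently named predicate are hypothetical, added only to show how heterogeneous characterization hides related data from a simple search.

```python
# Each triple is (subject, predicate, object).
triples = [
    ("datapoint1", "has numeric value", 0.1234),
    ("datapoint1", "has unit", "gram"),
    # Hypothetical second record that describes the same kind of
    # information with a different predicate.
    ("datapoint2", "value", 5.0),
    ("datapoint2", "has unit", "gram"),
]


def match(store, s=None, p=None, o=None):
    """Return the triples that agree with every field that is not None."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


if __name__ == "__main__":
    # Finds only datapoint1, because datapoint2 used another predicate.
    print(match(triples, p="has numeric value"))
    # Finds both datapoints, because the unit was described consistently.
    print(match(triples, p="has unit", o="gram"))
```

Nothing in the store itself prevents the second predicate from being used, which is why an agreed framework or ontology is needed on top of the triples.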

The answer is, in my opinion, to define a framework (schema) that organizes the data at a more abstract level allowing systematic searches, yet allows the data and metadata to be characterized in a way that fits the data.  This is the approach we took in SciData, where the framework can be implemented in either a relational database or a graph database because it is a hybrid model.  The key to its success is the ongoing development of semantic (ontological) representation of contextual data types and domain specific knowledge mapping using open and dereferenceable ontologies.

SM: You have or are working on the digitization of numerical data from both the IUPAC Solubility Data Series and the Landolt-Börnstein Book Series, both well-established resources in their respective fields. What are or have been the major challenges during your work?

SC: The hardest part about these projects has been the development of the migration strategy for each resource.  By that I mean even though the data is presented in relatively structured formats (tables for instance) there is a lot of interpretation a human does automatically to understand how the information on the page is related.  Take the figure below.  There is a significant amount of information implied by the structure of the page, in addition to the data on the page, that needs to be interpreted by the computer.

For instance, this page contains two completely separate datasets, indicated by the outer black boxes.  In the bottom dataset, a chemist can understand that the string 75-69-4 is a Chemical Abstracts Service Registry Number (CASRN) for the compound CCl3F, and that R-11 is likely a trade name because the compound is fluorinated and could be a refrigerant (this is in fact Freon-11).  The variable temperature is indicated with units of Kelvin, yet in the table these numbers are not to be found because the temperature is reported in °C.  The string ‘100w₁’ represents the mass percent of substance 1, so if you want to convert it to the mass fraction you must divide the values in the column by 100.  Although not stated on the page, this is the original research data value curated from the reference indicated in the ‘ORIGINAL MEASUREMENTS’ box, whereas the data in the ‘10⁴ x₁’ (mole fraction) and ‘100 w₁M₁⁻¹’ (molality in units of mol g⁻¹ rather than mol kg⁻¹) columns were calculated by the compiler.  Finally, the reference at the bottom right is to a paper indicated in the ‘METHOD/APPARATUS/PROCEDURE’ section of the research article referenced at the top right.
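The two conversions mentioned above are simple arithmetic once the implied meaning of the column headers is known. A minimal sketch follows; the function names are hypothetical and the example inputs are arbitrary numbers, not values from the page being discussed.

```python
def mass_percent_to_fraction(mass_percent):
    """Convert a mass percent value (the 100*w1 column) to a mass fraction w1."""
    return mass_percent / 100.0


def celsius_to_kelvin(temp_c):
    """Convert a temperature in degrees Celsius to kelvin."""
    return temp_c + 273.15


if __name__ == "__main__":
    # Arbitrary example inputs, chosen only to exercise the functions.
    print(mass_percent_to_fraction(12.5))
    print(celsius_to_kelvin(25.0))
```

The hard part of the migration is not this arithmetic but recognizing, from layout alone, that a column needs it.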

Our approach to dealing with this in our work on the Landolt-Börnstein Book Series is to use regular expression (regex) matching of text strings and their locations on a page relative to other information on the page.  This allows us to identify strings as identifiers of properties and units on columns, chemical formulas and names, and data in decimal or scientific notation formats.  For instance, the regular expression below can be used to identify a string of numbers and hyphens in the CASRN format.  The ‘[0-9]’ represents any digit, the ‘{2,7}’ indicates a sequence of two to seven consecutive digits and the ‘{2}’ a sequence of two digits.  Whether the CASRN has been correctly transcribed on the page is not known, but that can be checked against a number of online databases.

[0-9]{2,7}-[0-9]{2}-[0-9]
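As an editorial sketch of how such a pattern might be applied, the Python below wraps the expression with word boundaries (`\b`, an addition not in the original pattern) and pulls candidates out of a line of text. It also adds an offline sanity test that the interview does not mention: the final digit of a CASRN is a check digit, equal to the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. The function names are hypothetical, and passing the check does not replace a database lookup.

```python
import re

# The interview's pattern, with capture groups and word boundaries added.
CASRN_PATTERN = re.compile(r"\b([0-9]{2,7})-([0-9]{2})-([0-9])\b")


def find_casrn_candidates(text):
    """Return every substring of text that has the CASRN format."""
    return [m.group(0) for m in CASRN_PATTERN.finditer(text)]


def has_valid_check_digit(casrn):
    """Return True if the string has CASRN format and a consistent check digit."""
    m = CASRN_PATTERN.fullmatch(casrn)
    if m is None:
        return False
    digits = m.group(1) + m.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(i * int(d) for i, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(m.group(3))


if __name__ == "__main__":
    line = "Trichlorofluoromethane; CCl3F; [75-69-4]"
    for candidate in find_casrn_candidates(line):
        print(candidate, has_valid_check_digit(candidate))
```

A failed check digit catches many single-digit transcription errors before any network request is made; a passing one still only says the number is well formed, not that it belongs to the compound named beside it.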

SM: How exactly does this digitization of scientific data help researchers in their daily work?

SC: For a number of years, scientists have been using a myriad of different laboratory information management systems (LIMS), electronic laboratory notebooks (ELNs) and computer databases.  In a general sense, the biggest drawback of each of these systems has been getting the scientist to annotate the data with a rich set of metadata.  This is primarily because this task is boring and scientists are keen to do the research work rather than spend a significant amount of time correctly characterizing their data, even though it will make it easier to find and use that data down the road.  If automated systems can be used to infer as much information about research data as possible (with a way for the scientist to verify it) then the research enterprise can be made more efficient and cost-effective. For this to truly work, three key pieces need to be implemented:

  1. Automated migration (translation and characterization) of data from other systems into the researcher’s data system
  2. Semantic annotation of instrument data (from the instrument data system) in addition to raw data, including unique identifiers for the instrument (an ID equivalent to a serial number) and the software used to collect the data
  3. A Digital Research Notebook (DRN) that is integrated into the laboratory (as well as authenticated online systems) and collects the data from instruments and apparatus, environmental laboratory conditions, photo/video/audio annotations of the laboratory workflow, and equipment used in sample/solution/reaction preparation

SM: What are the benefits of having a database like SpringerMaterials within a multi-disciplinary field such as materials science?

SC: What researchers are looking for in a database like SpringerMaterials is information they would not find otherwise.  In other words, they are looking to find information they don’t currently know exists but that could be important in the research they are doing.  The only way for this to happen is if the database contains information about a chemical from multiple different perspectives – or disciplines.  As a result, databases like SpringerMaterials are vital to the long-term progress of materials research projects.  In addition, as mentioned above, tools are needed to allow the user to take full advantage of the available data – to visualize large datasets or integrate data from multiple sources.  The recent introduction of SpringerMaterials Interactive is an important step in allowing users from a variety of disciplines to leverage large datasets by viewing data from their own perspective.

About Stuart Chalk

Stuart Chalk, Professor at the University of North Florida, is an analytical chemist by training with a research focus in the areas of flow analysis and environmental monitoring. In the last few years he has become more and more interested in cheminformatics, which is now his major focus. Current projects include: development of the ChemExtractor, development of a semantic units repository to support scientific big data, the IUPAC Gold Book Project, and the design and development of a web-based teaching tool for informatics (ChemCurator).

Disclaimer: The views expressed here are those of the interviewee and do not necessarily represent the views of AZoM.com Limited T/A AZoNetwork the owner and operator of this website. This disclaimer forms part of the Terms and conditions of use of this website.
