1 Data handling and processing of scientific data with FAIR principles www.prace-ri.eu
Data handling and processing of scientific data with FAIR principles
University of Maribor
Faculty of Electrical Engineering and Computer Science Koroška cesta 46, 2000 Maribor, Slovenia
Milan.ojstersek@um.si
Milan Ojsteršek
Learning objectives
▶ Open science
▶ Semantic web technologies
▶ FAIR open data
▶ Open data management
▶ Licensing of open data
▶ Ethics issues
▶ EOSC
▶ Slovenian open access infrastructure
Open science
Michael Nielsen defined it as follows: “Open Science is the idea that scientific knowledge of all kinds should be openly shared as early as is practical in the discovery process.” Scientific knowledge of all kinds includes journal articles, data, code, online software tools, questions, ideas, speculations, failures, and anything else that can be considered knowledge.
Foster: What is Open Science? Introduction. Available at
https://www.fosteropenscience.eu/content/what-open-science-introduction
Promoting openness at different stages of the research process
Foster: Open Science and Research Initiative (2014). Open Science and Research Handbook.
[English version]. Available at https://www.fosteropenscience.eu/sites/default/files/pdf/3986.pdf
Source: Why open science? Available at: http://www.researchsupport.uct.ac.za/why-open-science
Open Access
▶ Open Access refers to online, free-of-cost access to peer-reviewed scientific content with limited copyright and licensing restrictions. Its main purpose is to allow use and reuse of peer-reviewed scientific research.
▶ The Green route to open access is delivered by self-archiving (depositing) an output in a repository. There are two types of repositories: institutional and subject repositories. The publisher is the copyright owner and allows the author to make the work openly available under the publisher's conditions.
▶ The Gold route to open access is delivered by publishing an article in a journal. The journal may be an open access journal (pure open access) or a subscription-based journal that offers an open access option (hybrid open access). The author is the copyright owner. Some journals require payment of an article processing charge (APC) for transferring the copyright licence to the author.
Open Metrics and Impact
As an alternative to traditional impact metrics, open metrics offer new ways of evaluating the impact of scholarly outputs.
▶ Bibliometrics: citation and content analysis used in Open Science.
▶ Altmetrics: article-level metrics of scholarly articles, produced from information collected from the Internet, such as social media sites, newspapers, and other sources.
▶ Semantometrics: as opposed to existing bibliometrics, webometrics, altmetrics, etc., semantometrics are not based on measuring the number of interactions in the scholarly communication network, but primarily exploit the full text of manuscripts to assess the value of a publication.
Citizen science
Citizen science is public involvement in inquiry and in the discovery of new scientific knowledge:
▶ Finding data
▶ Collecting data
▶ Classifying data
▶ Analyzing data
▶ Curating data
▶ Home-made set-ups for measuring
▶ etc.
A good example is the COVID-19 tracker in Slovenia.
Open Reproducible Research
Open reproducible research is the practice of Open Science combined with giving users free access to the experimental elements needed to reproduce the research.
▶ Reproducibility means that research data and code are made available so that others are able to reach the same results as are claimed in scientific outputs. Closely related is the concept of replicability, the act of repeating a scientific methodology to reach similar conclusions. These concepts are core elements of empirical research.
▶ Improving reproducibility leads to increased rigour and quality of scientific outputs, and thus to greater trust in science. There has been a growing need and willingness to expose research workflows from the initiation of a project and data collection right through to the interpretation and reporting of results. These developments have come with their own sets of challenges, including designing integrated research workflows that can be adopted by collaborators while maintaining high standards of integrity.
▶ The concept of reproducibility is directly applied to the scientific method, the cornerstone of science, and particularly to the following five steps:
  ▶ Formulating a hypothesis.
  ▶ Designing the study.
  ▶ Running the study and collecting the data.
  ▶ Analyzing the data.
  ▶ Reporting the study.
Each of these steps should be reported with clear and open documentation, making the study transparent and reproducible.
Open Reproducible Research is based on:
▶ Irreproducibility Studies: Studies that test whether the results of a study or an experiment can be replicated and reproduced.
▶ Open Lab/Notebooks: Laboratory research records, diaries, journals, workbooks etc. offered online free of cost with terms that allow reuse and redistribution of the recorded material.
▶ Open Science Workflows: The sequences of processes that scientists use to run and disseminate complex scientific analyses, offered online and free of cost with terms that allow reuse of the material.
▶ Open Source in Open Science: Software where the source code is available free of cost with terms that allow dissemination and adaptation.
▶ Reproducibility Guidelines: Ground rules to assist with the recreation of research experiments and studies.
▶ Reproducibility Testing refers to the process of validating that the reported research results can be obtained in an independent experiment.
Open Science Tools
Open Science tools are tools that assist in delivering and building on Open Science. They include:
▶ Open archives that host scientific literature, data, software and other research objects and make their content freely accessible to everyone in the world.
▶ Open services offered by organisations and institutions that can be used free of cost.
▶ Open workflow tools (apparatuses and services) that promote open scientific projects.
Open Data
“A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike” -- opendefinition.org
This means, according to the Open Knowledge Foundation:
▶ Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the internet. The data must also be available in a convenient and modifiable form.
▶ Reuse and Redistribution: the data must be provided under terms that permit reuse and redistribution including the intermixing with other datasets.
▶ Universal Participation: everyone must be able to use, reuse and redistribute - there should be no discrimination against fields of endeavour or against persons or groups. For example, ‘non-commercial’ restrictions that would prevent ‘commercial’ use, or restrictions of use for certain purposes (e.g. only in education), are not allowed.
RDF in the stack of Semantic Web technologies
▶ RDF stands for:
  ▶ Resource: everything that can have a unique identifier (URI), e.g. pages, places, people, dogs, products...
  ▶ Description: attributes, features, and relations of the resources
  ▶ Framework: model, languages and syntaxes for these descriptions
▶ RDF was published as a W3C recommendation in 1999.
▶ RDF was originally introduced as a data model for metadata.
▶ RDF was generalised to cover knowledge of all kinds.
RDF
The Resource Description Framework (RDF) is a data model for representing data and resources on the Web.
▶ RDF breaks every piece of information down into triples:
  ▶ Subject – a resource, which may be identified with a URI.
  ▶ Predicate – a URI-identified, reusable specification of the relationship.
  ▶ Object – a resource or literal to which the subject is related.
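<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:s="http://description.org/schema/">
  <!-- The s: namespace URI is an assumption; the slide does not declare it. -->
  <rdf:Description rdf:about="http://www.w3.org">
    <s:Publisher>World Wide Web Consortium</s:Publisher>
    <s:Title>W3C Home Page</s:Title>
    <s:Date>1998-10-03T02:27</s:Date>
  </rdf:Description>
</rdf:RDF>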
RDF/XML and N3 notation
RDF/XML:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://en.wikipedia.org/wiki/Maribor">
<dc:title>Maribor</dc:title>
<dc:publisher>Wikipedia</dc:publisher>
</rdf:Description>
</rdf:RDF>
N3 Notation:
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://en.wikipedia.org/wiki/Maribor>
    dc:title "Maribor" ;
    dc:publisher "Wikipedia" .
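The way an RDF/XML description decomposes into triples can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a full RDF parser: it handles only `rdf:Description` elements with `rdf:about` and plain literal properties, and the function name `rdfxml_to_triples` is made up for this example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The Maribor example from the slide.
RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Maribor">
    <dc:title>Maribor</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>"""


def rdfxml_to_triples(xml_text):
    """Return (subject, predicate, literal) tuples for literal properties only."""
    root = ET.fromstring(xml_text)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree stores qualified names as "{namespace}local".
            namespace, local = prop.tag[1:].split("}")
            triples.append((subject, namespace + local, (prop.text or "").strip()))
    return triples


triples = rdfxml_to_triples(RDF_XML)
for s, p, o in triples:
    # Print each triple in an N-Triples-like form.
    print(f'<{s}> <{p}> "{o}" .')
```

Each printed line has the subject, predicate and object spelled out in full, which is the same information the prefixed N3 form abbreviates.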
Turtle notation
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/stuff/1.0/> .
<http://www.w3.org/TR/rdf-syntax-grammar>
    dc:title "RDF/XML Syntax Specification (Revised)" ;
    ex:editor [
        ex:fullname "Dave Beckett" ;
        ex:homePage <http://purl.org/net/dajobe/>
    ] .
N-Triples
N-Triples:
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Document> .
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/terms/title> "N-Triples"@en-US .
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://xmlns.com/foaf/0.1/maker> _:art .
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://xmlns.com/foaf/0.1/maker> _:dave .
_:art <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
_:art <http://xmlns.com/foaf/0.1/name> "Art Barstow" .
_:dave <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
_:dave <http://xmlns.com/foaf/0.1/name> "Dave Beckett" .
SKOS - Simple Knowledge Organization System
SKOS is an area of work developing specifications and standards to support the use of knowledge organization systems (KOS) such as thesauri, classification schemes, subject heading systems and taxonomies within the framework of the Semantic Web.
RDF schema
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xml:base="http://www.animals.fake/animals#">
  <rdf:Description rdf:ID="animal">
    <rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
  </rdf:Description>
  <rdf:Description rdf:ID="horse">
    <rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
    <rdfs:subClassOf rdf:resource="#animal"/>
  </rdf:Description>
</rdf:RDF>
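The practical effect of `rdfs:subClassOf` is that class membership is inherited along the hierarchy. The sketch below shows this with a plain dictionary instead of an RDF library; the class `pony` is a hypothetical addition that is not in the schema above, included only to show that the relation is transitive.

```python
# Direct rdfs:subClassOf statements: class -> set of direct superclasses.
SUBCLASS_OF = {
    "horse": {"animal"},   # from the schema on the slide
    "pony": {"horse"},     # hypothetical extra class, for illustration only
}


def superclasses(cls, subclass_of):
    """Return all direct and inherited superclasses of cls (transitive closure)."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(superclasses("pony", SUBCLASS_OF)))
```

An RDFS reasoner draws the same conclusion: an instance of the narrower class is also an instance of every class returned here.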
Example of RDF resource about Angola
Example of links between instances
Example of links between RDF triples
Example of multilingual use
SPARQL (SPARQL Protocol and RDF Query Language)
PREFIX abc: <http://example.com/exampleOntology#>
SELECT ?capital ?country
WHERE {
  ?x abc:cityname ?capital ;
     abc:isCapitalOf ?y .
  ?y abc:countryname ?country ;
     abc:isInContinent abc:Africa .
}
▶ SPARQL is the standard language for querying graph data represented as RDF triples.
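To see what a SPARQL engine does with the WHERE clause, the following toy matcher evaluates the same four triple patterns over a small in-memory list of triples. It is a sketch of basic graph pattern matching only, not SPARQL itself: the sample triples, the `ex:` names and the functions `match` and `query` are all invented for this illustration.

```python
# Sample data (invented): two capitals, only one of them in Africa.
TRIPLES = [
    ("ex:Luanda", "abc:cityname", "Luanda"),
    ("ex:Luanda", "abc:isCapitalOf", "ex:Angola"),
    ("ex:Angola", "abc:countryname", "Angola"),
    ("ex:Angola", "abc:isInContinent", "abc:Africa"),
    ("ex:Ljubljana", "abc:cityname", "Ljubljana"),
    ("ex:Ljubljana", "abc:isCapitalOf", "ex:Slovenia"),
    ("ex:Slovenia", "abc:countryname", "Slovenia"),
    ("ex:Slovenia", "abc:isInContinent", "abc:Europe"),
]

# The four triple patterns of the query; terms starting with "?" are variables.
PATTERNS = [
    ("?x", "abc:cityname", "?capital"),
    ("?x", "abc:isCapitalOf", "?y"),
    ("?y", "abc:countryname", "?country"),
    ("?y", "abc:isInContinent", "abc:Africa"),
]


def match(pattern, triple, bindings):
    """Extend bindings so that pattern equals triple; return None on a conflict."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(patterns, triples):
    """Join the patterns one by one, keeping every consistent set of bindings."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


rows = [(s["?capital"], s["?country"]) for s in query(PATTERNS, TRIPLES)]
print(rows)
```

The shared variables `?x` and `?y` are what join the patterns together; a real engine performs the same join, with indexes and query optimisation on top.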
SPARQL query - SELECT – Return all books under a certain price
Ontology
Components of ontology are:
• Individuals: Instances or objects (the basic or "ground level" objects).
• Classes: Sets, collections, concepts, classes in programming, types of objects or kinds of things.
• Attributes: Aspects, properties, features, characteristics or parameters that objects (and classes) can have.
• Relations: Ways in which classes and individuals can be related to one another.
• Function terms: Complex structures formed from certain relations that can be used in place of an individual term in a statement.
• Restrictions: Formally stated descriptions of what must be true in order for some assertion to be accepted as input.
• Rules: Statements in the form of an if-then (antecedent-consequent) sentence that describe the logical inferences that can be drawn from an assertion in a particular form.
IoT Lite Ontology example
Source: IoT Lite Ontology. Available at https://www.w3.org/Submission/iot-lite/
Semantic Web Rule Language (SWRL)
The language is used for describing rules of the form antecedent ⇒ consequent.
Example: hasParent(?x1,?x2) ∧ hasBrother(?x2,?x3) ⇒ hasUncle(?x1,?x3)
<ruleml:imp>
<ruleml:_rlab ruleml:href="#example1"/>
<ruleml:_body>
<swrlx:individualPropertyAtom swrlx:property="hasParent">
<ruleml:var>x1</ruleml:var>
<ruleml:var>x2</ruleml:var>
</swrlx:individualPropertyAtom>
<swrlx:individualPropertyAtom swrlx:property="hasBrother">
<ruleml:var>x2</ruleml:var>
<ruleml:var>x3</ruleml:var>
</swrlx:individualPropertyAtom>
</ruleml:_body>
<ruleml:_head>
<swrlx:individualPropertyAtom swrlx:property="hasUncle">
<ruleml:var>x1</ruleml:var>
<ruleml:var>x3</ruleml:var>
</swrlx:individualPropertyAtom>
</ruleml:_head>
</ruleml:imp>
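What the rule above licenses can be shown by applying it to a handful of facts. This is a minimal sketch in plain Python rather than a rule engine; the individuals (Ana, Marko, Peter, Ivan, Luka) and the function name `apply_uncle_rule` are made up for the example.

```python
def apply_uncle_rule(has_parent, has_brother):
    """hasParent(x1, x2) and hasBrother(x2, x3) => hasUncle(x1, x3)."""
    return {
        (x1, x3)
        for (x1, x2) in has_parent
        for (y2, x3) in has_brother
        if x2 == y2  # the shared variable ?x2 joins the two body atoms
    }


# Invented facts: Ana's parent is Marko, and Marko's brother is Peter.
has_parent = {("Ana", "Marko")}
has_brother = {("Marko", "Peter"), ("Ivan", "Luka")}

has_uncle = apply_uncle_rule(has_parent, has_brother)
print(has_uncle)
```

The pair (Ivan, Luka) contributes nothing because no `hasParent` fact mentions Ivan, so the antecedent is not satisfied for it.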
OWL
▶ Functional syntax
DisjointClasses( :Woman :Man )
▶ RDF/XML syntax
<owl:AllDisjointClasses>
<owl:members rdf:parseType="Collection">
<owl:Class rdf:about="Woman"/>
<owl:Class rdf:about="Man"/>
</owl:members>
</owl:AllDisjointClasses>
▶ Turtle syntax
[] rdf:type owl:AllDisjointClasses ; owl:members ( :Woman :Man ) .
▶ Manchester syntax
DisjointClasses: Woman, Man
▶ OWL/XML syntax
<DisjointClasses>
<Class IRI="Woman"/>
<Class IRI="Man"/>
</DisjointClasses>
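All five notations state the same axiom: nothing can be both a Woman and a Man. A reasoner uses such an axiom to detect inconsistent data. The check below imitates that for a dictionary of asserted types; it is a hedged sketch, not an OWL reasoner, and the individuals and the function name `disjointness_violations` are hypothetical.

```python
# Each entry is one set of mutually disjoint classes.
DISJOINT_CLASSES = [{"Woman", "Man"}]


def disjointness_violations(types_by_individual, disjoint_sets):
    """List individuals asserted to belong to two or more disjoint classes."""
    problems = []
    for individual, types in types_by_individual.items():
        for group in disjoint_sets:
            clash = sorted(types & group)
            if len(clash) > 1:
                problems.append((individual, clash))
    return problems


# Invented assertions; the second individual violates the axiom.
asserted_types = {
    "mary": {"Woman", "Person"},
    "record42": {"Woman", "Man"},
}

print(disjointness_violations(asserted_types, DISJOINT_CLASSES))
```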
Metadata – What is it?
▶ Data about Data
▶ => Metadata IS data
▶ A way of describing a research artefact in a structured, machine readable way
▶ Something my community or funder insist I have
▶ Many distinct types of metadata exist, including:
▶ Descriptive metadata is descriptive information about a resource. It is used for discovery and identification. It includes elements such as title, abstract, author, and keywords.
▶ Structural metadata is metadata about containers of data and indicates how compound objects are put together, for example, how pages are ordered to form chapters. It describes the types, versions, relationships and other characteristics of digital materials.
▶ Administrative metadata is information to help manage a resource, like resource type, permissions, and when and how it was created.
▶ Reference metadata is information about the contents and quality of statistical data.
▶ Statistical metadata, also called process data, may describe processes that collect, process, or produce statistical data.
Some examples of metadata schemes are DC, COMARC, PREMIS, DCAT, FOAF, DDI… More about metadata standards can be found at https://www.dcc.ac.uk/guidance/standards/metadata/list
Metadata in life
Metadata Standards - Simple
Dublin Core
• Title
• Creator
• Subject
• Description
• Publisher
• Contributor
• Date
• Type
• Format
• Identifier
• Source
• Language
• Relation
• Coverage
• Rights
DataCite
• Title
• Creator
• Publisher
• Identifier
• Publication Year
• Resource Type
• Subject
• Contributor
• Date
• Related identifier
• Description
• Geolocation
• Language
• Alternate identifier
• Size
• Format
• Version
• Rights
• Funding Reference
EDMI
• Name
• Description
• Identifier
• URL
• Creator
• Date Created
• License
• Data Standard
• Date Modified
• Access URL
• Access Interface
• Structure
• Included In
• Measurement Technique
• Keywords
• Variable Measured
• Format
• Scientific Type
• Includes
• Content Type
• Size
• Authentications
• Version
• Metric
• Same as
• Spatial Coverage
• Temporal Coverage
• Citation
• Reference Citation
• Compression
Metadata standards - complex
The CERIF Standard – Base Entities
Exercises in metadata – Guess the film!
Elements: Subject, Description, Creator, Publisher, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
Values: Italian, English, Spanish; Part of trilogy; US Civil War; Produzioni Europee Associate; Original Work; None; 35mm anamorphic; Spaghetti Western; 23 December 1966; Produzioni Europee Associate; Age & Scarpelli, Luciano Vincenzoni; Three men searching for stolen gold; None
Title: ?
Dublin Core (DC)
▶ The Dublin Core, also known as the Dublin Core Metadata Element Set, is a set of fifteen "core" elements (properties) for describing resources. The resources described using the Dublin Core may be digital resources (video, images, web pages, etc.), as well as physical resources such as books or CDs, and objects like artworks.
▶ Dublin Core metadata may be used for multiple purposes, from simple resource description to combining metadata vocabularies of different metadata standards, to providing interoperability for metadata vocabularies in the linked data cloud and Semantic Web implementations.
▶ The Dublin Core standard originally included two levels: Simple and Qualified. Simple Dublin Core comprised 15 elements; Qualified Dublin Core included three additional elements (Audience, Provenance and RightsHolder), as well as a group of element refinements (also called qualifiers) that could refine the semantics of the elements in ways that may be useful in resource discovery.
Example of metadata for a diploma thesis in the Digital Library of the University of Maribor
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://dkum.uni-mb.si/IzpisiGradiva.php?id=5946">
<dc:title>Placilni procesi multinacionalnega podjetja : diplomsko delo</dc:title>
<dc:creator>Filipan Kraljic, Biserka (Komentor )</dc:creator>
<dc:creator>Zbašnik, Dušan (Mentor)</dc:creator>
<dc:creator>Kutnjak, Matej (Avtor)</dc:creator>
<dc:subject>plačilni promet</dc:subject>
<dc:subject>prenosni sistemi</dc:subject>
<dc:subject>multinacionalne družbe</dc:subject>
<dc:subject>plačilni instrumenti</dc:subject>
<dc:subject>kliring</dc:subject>
<dc:subject>dokumentarni akreditivi</dc:subject>
<dc:subject>bančno poslovanje</dc:subject>
<dc:subject>garancije</dc:subject>
<dc:subject>jamstvo</dc:subject>
<dc:subject>mednarodne finance</dc:subject>
<dc:subject>menice</dc:subject>
<dc:subject>plačilni sistemi</dc:subject>
<dc:subject>poravnava</dc:subject>
<dc:subject>mednarodna podjetja</dc:subject>
<dc:subject>cilj</dc:subject>
<dc:subject>centralizacija</dc:subject>
<dc:subject>zakladnica</dc:subject>
<dc:description>[M. Kutnjak]</dc:description>
<dc:date>2006</dc:date>
<dc:date>2007-09-11 14:51:18</dc:date>
<dc:type>Bibliografija</dc:type>
<dc:identifier>5946</dc:identifier>
<dc:identifier>COBISS_ID: 402472</dc:identifier>
<dc:identifier>UDK: 339.727(043.2):334.726(497.4:4)</dc:identifier>
<dc:source>Velenje</dc:source>
<dc:language>sl</dc:language>
</rdf:Description>
</rdf:RDF>
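A record like the one above can be generated programmatically. The sketch below builds a small Dublin Core description in RDF/XML with Python's standard library and rejects names that are not among the fifteen Simple Dublin Core elements. The function name `build_dc_record`, the example URL and the field values are assumptions made for illustration; this is not how the Digital Library of the University of Maribor produces its records.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen elements of Simple Dublin Core.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def build_dc_record(about, fields):
    """Serialise (element, value) pairs as RDF/XML; elements may repeat."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": about}
    )
    for element, value in fields:
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        ET.SubElement(description, f"{{{DC_NS}}}{element}").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record; the URL and values are placeholders.
record = build_dc_record(
    "http://example.org/item/1",
    [("title", "Sample thesis"), ("subject", "payments"), ("language", "sl")],
)
print(record)
```

Repeating an element, as the example record does with `dc:subject` and `dc:creator`, is allowed because every Dublin Core element is optional and repeatable.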
Knowledge collections
▶ Dictionaries – example: PONS
▶ Thesauri – examples: EuroVoc, AGROVOC, UDC, LCH
▶ Semantic dictionaries – example: WordNet
▶ DBpedia – see also RelFinder and its classes
▶ BabelNet
▶ GeoNames
▶ UMLS
WordNet
Figure: WordNet fragment with synsets connected by hyperonym and meronym relations.
Synsets shown: {conveyance; transport}, {vehicle}, {motor vehicle; automotive vehicle}, {car; auto; automobile; machine; motorcar}, {cruiser; squad car; patrol car; police car; prowl car}, {cab; taxi; hack; taxicab}, {bumper}, {car door}, {car window}, {car mirror}, {hinge; flexible joint}, {doorlock}, {armrest}
Cyc
Linked open data
Some useful links
▶ FAIRsharing.org - a curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.
▶ RDA metadata standards
▶ BioPortal - a comprehensive repository of biomedical ontologies.
▶ AgroPortal - an ontology portal/repository (with periodically updated versions) dedicated to the agronomic and plant domains.
▶ OGC standards - a list of standards and ontologies for the geospatial domain.
▶ Linked Open Vocabularies
▶ QUDT Catalog - Quantities, Units, Dimensions and Data Types ontologies.
▶ Schema.org - a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond.
Research Data
▶ Raw / Cleaned or Filtered
▶ Field data
▶ Experimental data
▶ Derived data
▶ Qualitative / Quantitative
▶ Structured / Semi-structured / Unstructured
▶ Tabular / Hierarchical / Graphs
▶ Open access / Restricted access
▶ Linked data
▶ Metadata
▶ Big data
Source: https://www.bitmat.it/blog/news/83536/sviluppare-applicazioni-iot-riducendo-costi-e-risorse
Data sources
▶ Devices
▶ Instruments
▶ Sensors
▶ Software
▶ People
▶ Observations
▶ Experiments
▶ Simulations
▶ Emulations
▶ Surveys
▶ Interviews
▶ Text analysis
▶ Text mining
▶ …………
Categorization of the file formats of Open Data
5 Star Open Data
Source: http://5stardata.info/
Persistent identifiers
▶ Resolvable links independent of data locations
▶ Common technologies:
  ▶ DOI, PURL, URNs, ARKs, PMID…
Figure: the identifier doi:10.1594/PANGAEA.667386 is sent to a resolver, which points to the current location of the data (Site A or Site B).
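The idea that a persistent identifier stays fixed while the data location changes can be sketched with a tiny in-memory resolver. The class `PidResolver` and the two site URLs are hypothetical; real resolution is done by services such as the DOI proxy at https://doi.org/, which the helper `doi_to_url` only builds a link for.

```python
class PidResolver:
    """Toy resolver: maps a persistent identifier to its current location."""

    def __init__(self):
        self._locations = {}

    def register(self, pid, url):
        # Registering the same PID again updates its location;
        # the identifier itself never changes.
        self._locations[pid] = url

    def resolve(self, pid):
        return self._locations[pid]


def doi_to_url(pid):
    """Turn 'doi:10.1594/...' into a link to the public DOI resolver."""
    return "https://doi.org/" + pid[len("doi:"):]


PID = "doi:10.1594/PANGAEA.667386"

resolver = PidResolver()
resolver.register(PID, "https://site-a.example/data/667386")  # hypothetical Site A
first_location = resolver.resolve(PID)

resolver.register(PID, "https://site-b.example/data/667386")  # data moved to Site B
second_location = resolver.resolve(PID)

print(first_location, second_location, doi_to_url(PID))
```

Anyone who cited the identifier before the move still reaches the data afterwards, which is the property plain URLs lack.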
Structure of a Digital Specimen Digital Object (DSDO)
▶ PIDs are pointers that resolve to the location of an item, e.g. the DO itself, the physical specimen, hi-res images, label information, a tissue sample, a DNA sequence, etc.
Figure labels: DO = an envelope; PID kernel information (metadata about the specimen DO); PID 1, PID 2, …, PID N; metadata 2, …, metadata N; content N; bit sequence.
Source: Alex Hardisty, FAIR Digital Objects as Basic Design Choice and the Need for PIDs
Physical Object, Digital Surrogate
FAIR Digital Object: a machine-actionable knowledge unit
Source: Alex Hardisty, FAIR Digital Objects as Basic Design Choice and the Need for PIDs
FAIR digital objects and persistent identifiers (PIDs)
Source: RDA's Data Foundation & Terminology Group (DFT) 2014: Core Model
FAIR (Findable, Accessible, Interoperable, Reusable)
▶ To be Findable:
  F1. (meta)data are assigned a globally unique and eternally persistent identifier.
  F2. data are described with rich metadata.
  F3. (meta)data are registered or indexed in a searchable resource.
  F4. metadata specify the data identifier.
▶ To be Accessible:
  A1. (meta)data are retrievable by their identifier using a standardized communications protocol.
  A1.1. the protocol is open, free, and universally implementable.
  A1.2. the protocol allows for an authentication and authorization procedure, where necessary.
  A2. metadata are accessible, even when the data are no longer available.
▶ To be Interoperable:
  I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
  I2. (meta)data use vocabularies that follow FAIR principles.
  I3. (meta)data include qualified references to other (meta)data.
▶ To be Re-usable:
  R1. (meta)data have a plurality of accurate and relevant attributes.
  R1.1. (meta)data are released with a clear and accessible data usage license.
  R1.2. (meta)data are associated with their provenance.
  R1.3. (meta)data meet domain-relevant community standards.
Source: https://www.force11.org/group/fairgroup/fairprinciples
Source: https://libereurope.eu/blog/2018/07/13/fairdataconsultation/liber-fair-data-2/
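A few of these principles can be checked mechanically on a metadata record. The sketch below tests four of them against a dictionary; it is a rough illustration only and not an official FAIR assessment. The field names (`identifier`, `license`, `provenance` and so on), the sample record and the function `fair_report` are assumptions made for the example.

```python
# Each check maps a principle label to a test on the metadata record.
# The field names are assumptions, not part of any FAIR specification.
CHECKS = {
    "F1 identifier": lambda m: bool(m.get("identifier")),
    "F2 rich metadata": lambda m: all(
        m.get(key) for key in ("title", "description", "keywords")
    ),
    "R1.1 licence": lambda m: bool(m.get("license")),
    "R1.2 provenance": lambda m: bool(m.get("provenance")),
}


def fair_report(metadata):
    """Return {principle: True/False} for the checks above."""
    return {name: check(metadata) for name, check in CHECKS.items()}


# Hypothetical record with no provenance information.
sample = {
    "identifier": "doi:10.1234/example",
    "title": "Sample dataset",
    "description": "Hourly temperature readings",
    "keywords": ["temperature", "sensor"],
    "license": "CC-BY-4.0",
}

report = fair_report(sample)
for principle, passed in report.items():
    print(f"{principle}: {'ok' if passed else 'missing'}")
```

Principles such as I1 or R1.3 need human or community judgement and cannot be reduced to a presence check like this.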
RDA FAIR Data Maturity Model 1
RDA FAIR Data Maturity Model 2
Research data management plan
A research data management plan (DMP) is a written document that describes the data you expect to acquire or generate during the course of a research project, how you will manage, describe, analyze, and store those data, and what mechanisms you will use at the end of your project to share and preserve your data.
You may have already considered some or all of these issues with regard to your research project, but writing them down helps you formalize the process and identify weaknesses in your plan, and it provides you with a record of what you intend(ed) to do.
Data management is best addressed in the early stages of a research project, but it is never too late to develop a data management plan.
Source: Research data management roles and responsibilities. Available at:
http://www.researchsupport.uct.ac.za/research-data-management-roles-and-responsibilities
Open data life cycle
▶ Discovery and planning
▶ Data collection
▶ Processing and analysis
▶ Publishing and sharing
▶ Long-term management
▶ Reusing data
Data management plan
A data management plan is a formal document that provides a framework for how to handle the data material during and after the research project. There is no universal form for a finished DMP. It is a “living” document that changes together with the needs of a project and its participants. It is updated throughout the project to make sure that it tracks such changes over time and reflects the current state of your project.
For guidance on preparing a data management plan, see:
▶ CESSDA Data Management Expert Guide
▶ Funders’ data plan requirements
▶ Checklist for a Data Management Plan
▶ DMPonline
The Data Pyramid
Figure: pyramid with the levels Working Data, Registered Data and Published Data; axis labels: Increasing Sharing, Increasing Volume.
Is Data Storage the Same as Data Preservation?
▶ No!
▶ Storage is subject to many issues
▶ Media failure
▶ ‘bit rot’
▶ Natural disaster
▶ Format obsolescence
▶ Human deletion – accidental or malicious
▶ Media Loss
▶ Loss of funding
▶ Link rot and reference rot
▶ Data Preservation is ensuring data is available for re-use into the future
▶ Minimising the risk of data loss
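One common safeguard against silent corruption such as bit rot is a fixity check: store a checksum when the data are deposited and recompute it later. The sketch below uses SHA-256 from Python's standard library; the sample bytes and the function names `sha256_of` and `is_intact` are illustrative assumptions.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 checksum of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_checksum: str) -> bool:
    """True if data still match the checksum recorded at deposit time."""
    return sha256_of(data) == expected_checksum


# At deposit time: record the checksum alongside the data (sample content).
original = b"station,temperature\nMaribor,12.5\n"
stored_checksum = sha256_of(original)

# Later: a single changed character is enough to fail the check.
corrupted = b"station,temperature\nMaribor,12.4\n"

print(is_intact(original, stored_checksum))
print(is_intact(corrupted, stored_checksum))
```

A checksum only detects the damage; recovering from it still requires an intact copy stored elsewhere.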
Preservation vs Curation
Definitions (lexico.com)
▶ Preserve: Maintain (something) in its original or existing state.
▶ Curate: Select, organize, and look after items (in a collection or exhibition).
▶ In the digital era the concepts are similar:
  ▶ Preserve: make sure the bits you store stay the same over time
  ▶ Curate: make sure the information you store is discoverable and usable over time
▶ Let's look at some examples…
Clear license information is important because...
▶ It tells users and reusers exactly what they can do with your data and metadata.
▶ It encourages the use and reuse of your data and metadata the way you want them to be used and reused.
▶ It creates visibility of your efforts downstream (if you ask for attribution).
If no explicit licence is provided, a user does not know what can be done with the data/metadata – the default legal position is that nothing can be done without contacting the owner on a case-by-case basis.
Four key tips when publishing information about the licence
1. Make sure your licensing information is easy to find.
2. Include information about the license in the metadata of each data set.
3. Use simple licenses to ensure they are easy to understand by re-users.
4. Check spelling, typos and spacing to ensure consistency in the names of the licenses used.
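Tips 2 and 4 can be checked automatically. The sketch below, in Python, assumes each data set is described by a metadata record with a hypothetical "license" field and maps free-text licence names onto SPDX licence identifiers (CC0-1.0, CC-BY-4.0, PDDL-1.0). The alias table is an illustrative assumption, not a complete list.

```python
# Hypothetical alias table: free-text licence names -> SPDX identifiers.
SPDX_ALIASES = {
    "cc0": "CC0-1.0",
    "cc0 1.0": "CC0-1.0",
    "cc-by 4.0": "CC-BY-4.0",
    "cc by 4.0": "CC-BY-4.0",
    "creative commons attribution 4.0": "CC-BY-4.0",
    "pddl": "PDDL-1.0",
}


def normalise_licence(name):
    """Return the SPDX identifier for a licence name, or None if unknown."""
    # Lower-case and collapse whitespace so spacing variants compare equal.
    key = " ".join(name.lower().replace("_", " ").split())
    return SPDX_ALIASES.get(key)


def check_records(records):
    """List (record id, problem) pairs for records with unusable licence info."""
    issues = []
    for record in records:
        raw = record.get("license")
        if not raw:
            issues.append((record["id"], "no licence stated"))
        elif normalise_licence(raw) is None:
            issues.append((record["id"], f"unrecognised licence name: {raw!r}"))
    return issues
```

Running check_records over a catalogue before publication flags data sets with no licence in their metadata and licence names that do not match any known spelling.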
Clear licence information - example
Different data have different licensing needs
▶ Some data(sets) may be required to be openly available (e.g. under the Constitution of the Republic of Slovenia).
▶ Some data(sets) may be subject to restrictions (e.g. privacy, national security, third party rights).
▶ Some data(sets) may be available for reuse but not for modification (e.g. legal texts, public budgets); if modifications are made, it must be made clear that the data is no longer the authentic version.
▶ Some data(sets) may be published allowing derivations with attribution of the authoritative source (e.g. legal commentary, translations).
Licensing approaches: Creative Commons (1)
Licensing approaches: Creative Commons (2)
Good practices for licensing your data
Good practices:
▶ If the original data is in the public domain (e.g. by law), keep it there – use for example the Creative Commons Zero Public Domain Dedication or the Open Data Commons Public Domain Dedication and License (PDDL).
▶ For some documents, integrity needs to be protected – use a No-Derivatives licence, for example Creative Commons Attribution-NoDerivs, but only if really necessary.
▶ Avoid Non-Commercial licences if at all possible, as these seriously restrict reuse.
Licenses for data should provide appropriate security and control (but not more than that).
Using an open and unrestrictive license for your data
Whenever data is licensed for open and unrestricted access, reusers can create new knowledge by combining it with other data.
For example:
▶ Cross-referencing public spending with geographic data to visualise which regions are better funded.
▶ Matching public transport timetables with GPS data to give real-time information on delays.
▶ Measuring the performance of public services based on transaction counters and waiting times.
▶ Deriving recommendations for prevention policies by relating accident statistics to weather data and road maps.
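The first example amounts to joining two open data sets on a shared key. A minimal sketch in Python, assuming both sets are keyed by region name; the region names and figures are invented purely for illustration.

```python
# Invented illustrative values: spending in euros and inhabitants per region.
spending = {"Region A": 1_200_000, "Region B": 450_000, "Region C": 80_000}
population = {"Region A": 300_000, "Region B": 90_000}


def spending_per_capita(spending, population):
    """Join the two data sets on region name and compute spending per inhabitant."""
    shared = spending.keys() & population.keys()  # regions present in both sets
    return {region: spending[region] / population[region] for region in shared}


per_capita = spending_per_capita(spending, population)
```

Regions that appear in only one of the two sets (here the invented "Region C") drop out of the result, which is why consistent identifiers across data sets matter for this kind of reuse.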
Software licenses
Good practices for licensing your metadata
What you need to think about:
▶ Metadata helps people to discover your data.
▶ The wider your metadata is distributed, the higher your visibility is.
▶ Others may want to add to it, enhance it, link to other resources.
Good practices:
▶ Licences for metadata should be as open as possible.
▶ A public domain licence allows the widest reuse.
▶ An attribution licence ensures you get credit downstream, but may cause problems if data is shared multiple times (attribution stacking).
Machine readability of licenses – use the Open Digital Rights Language
The Open Digital Rights Language (ODRL) is a policy expression language that provides a flexible and interoperable information model, vocabulary, and encoding mechanisms for representing statements about the usage of content and services. The ODRL Information Model describes the underlying concepts, entities, and relationships that form the foundational basis for the semantics of the ODRL policies.
Policies are used to represent permitted and prohibited actions over a certain asset, as well as the obligations required to be met by stakeholders. In addition, policies may be limited by constraints (e.g. temporal or spatial constraints), and duties (e.g. payments) may be imposed on permissions.
Please see examples of machine-readable licenses at http://rdflicense.appspot.com/.
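To make these concepts concrete, the sketch below builds a small ODRL policy as JSON-LD in Python. It is modelled on the style of the examples in the W3C ODRL Information Model 2.2 and has not been checked with an ODRL validator. The policy and dataset URIs are placeholders, and the combination of rules is an assumption chosen to show one permission with a duty and a temporal constraint plus one prohibition.

```python
import json

# Placeholder URIs: the policy and dataset identifiers are invented.
policy = {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Set",
    "uid": "http://example.com/policy/dataset-42",
    "permission": [{
        "target": "http://example.com/dataset/42",
        "action": "distribute",
        # Duty imposed on the permission: credit the source.
        "duty": [{"action": "attribute"}],
        # Temporal constraint: the permission applies before this date.
        "constraint": [{
            "leftOperand": "dateTime",
            "operator": "lt",
            "rightOperand": {"@value": "2030-01-01", "@type": "xsd:date"},
        }],
    }],
    "prohibition": [{
        "target": "http://example.com/dataset/42",
        "action": "derive",
    }],
}


def actions(policy, rule_type):
    """List the actions named under 'permission' or 'prohibition'."""
    return [rule["action"] for rule in policy.get(rule_type, [])]


print(json.dumps(policy, indent=2))
```

Because the policy is plain structured data, a harvester can read which actions are permitted or prohibited (for instance with the actions helper) without parsing a licence text.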
Ethics and research data management
Source: Ethics and research data management. Available at: http://www.researchsupport.uct.ac.za/ethics-and-research-data-management
Sharing sensitive data
▶ Sensitive data that contain potentially identifying information, whether human subject data or other types of sensitive data, will likely need to be modified before they are shared with the public. These modifications protect participant confidentiality, the location of endangered wildlife, or other relevant interests. However, they may alter the data to the point where reproduction or subsequent research by others is no longer possible. Consider retaining multiple versions of the data: one that is suitable for public release, and one that is suitable for further research but is available only on a highly restricted basis.
▶ For protected health information (PHI), the HIPAA Privacy Rule provides two methods for de-identification: the expert determination method and the safe harbor method. See the resources listed below for documentation on these methods from the US Department of Health and Human Services, as well as information on how to satisfy these two methods.
Types of identifying information
▶ Direct identifiers: These data point directly to an individual and are typically removed from data sets before sharing with the public. They may include: name, initials, mailing address, phone number, email address, unique identifying numbers (like Social Security numbers or driver's license numbers), vehicle identifiers, medical device identifiers, web or IP addresses, biometric data, photographs of the person, audio recordings, names of relatives, and dates specific to the individual (like date of birth or marriage).
▶ Indirect identifiers: These may seem harmless on their own, but can point to an individual when combined with other data. It has been recommended that datasets containing three or more indirect identifiers be reviewed by an independent researcher or ethics committee to evaluate identification risk. Any indirect information not needed for the analysis should be removed. It may be reasonable to supply some of these types of data in aggregated form (like ranges of annual incomes instead of exact numbers). Indirect identifiers may include: place of medical treatment or doctor's name, gender, rare disease or treatment, sensitive data like illicit drug use or other "risky behaviors", place of birth, socioeconomic data (like workplace, occupation, annual income, education), general geographic indicators (like postal code of residence), household and family composition, ethnicity, birth year or age, and verbatim responses or transcripts.
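The steps described above (remove direct identifiers, aggregate where possible, flag records with three or more indirect identifiers for review) can be sketched in Python as follows. The field names and the income band width are hypothetical, and this is an illustration of the idea only, not an implementation of the HIPAA expert determination or safe harbor methods.

```python
# Hypothetical field names; adapt to the columns of the actual data set.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "date_of_birth"}
INDIRECT_IDENTIFIERS = {
    "gender", "postal_code", "occupation", "annual_income",
    "birth_year", "ethnicity", "place_of_birth",
}


def income_band(income, width=10_000):
    """Replace an exact income with a range, e.g. 23500 -> '20000-29999'."""
    low = (income // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(record):
    """Drop direct identifiers and aggregate the exact income into a band."""
    public = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "annual_income" in public:
        public["annual_income"] = income_band(public["annual_income"])
    return public


def needs_review(record, threshold=3):
    """True if the record holds at least `threshold` indirect identifiers."""
    return len(record.keys() & INDIRECT_IDENTIFIERS) >= threshold
```

A record that still triggers needs_review after deidentify is a candidate for the independent or ethics committee review mentioned above, or for removing indirect fields that the analysis does not need.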
Sharing Sensitive Data with Confidence: The Datatags System
[Figure: DataTags levels as labelled on the slide]
Non-confidential information
Potentially harmful personal information
Sensitive personal information
Very sensitive personal information
Maximum sensitive personal information
EUDAT Collaborative Data Infrastructure (CDI)
Access and reuse of data:
• Repository service (B2SHARE and B2DROP),
• data discovery service (B2FIND),
• authentication and authorization (B2ACCESS).
Data preservation and storage:
• Data management service (B2SAFE).
• PID service (B2HANDLE).
Data transfer (B2STAGE).
Sensitive data management.
A network diagram of the Slovenian open access infrastructure
Structure of repository in the Slovenian open access infrastructure
Slovenian big data archive structure