
Data handling and processing of scientific data with FAIR principles


Academic year: 2022


1 Data handling and processing of scientific data with FAIR principles www.prace-ri.eu

Data handling and processing of scientific data with FAIR principles

University of Maribor

Faculty of Electrical Engineering and Computer Science Koroška cesta 46, 2000 Maribor, Slovenia

Milan.ojstersek@um.si

Milan Ojsteršek


Learning objectives

Open science

Semantic web technologies

FAIR Open data

Open data management

Licensing of open data

Ethics issues

EOSC

Slovenian Open access infrastructure


Open science

Michael Nielsen said: "Open Science is the idea that scientific knowledge of all kinds should be openly shared as early as is practical in the discovery process." Scientific knowledge of all kinds includes journal articles, data, code, online software tools, questions, ideas, speculations, failures, and anything else which can be considered knowledge.

Source: FOSTER: What is Open Science? Introduction. Available at https://www.fosteropenscience.eu/content/what-open-science-introduction


Promoting openness at different stages of the research process

Source: FOSTER: Open Science and Research Initiative (2014). Open Science and Research Handbook [English version]. Available at https://www.fosteropenscience.eu/sites/default/files/pdf/3986.pdf


Source: Why open science? Available at: http://www.researchsupport.uct.ac.za/why-open-science


Open Access

Open Access refers to online, free-of-cost access to peer-reviewed scientific content with limited copyright and licensing restrictions. Its main purpose is to allow use and reuse of peer-reviewed scientific research.

The Green route to open access is delivered via self-archiving (depositing) an output in a repository. There are two types of repositories: institutional and subject repositories. In this route the publisher is the copyright owner and allows the author to publish the research work in open access under the publisher's conditions.

The Gold route to open access is delivered via publishing an article in a journal. The journal may be an open access journal (pure open access) or a subscription-based journal that offers an open access option (hybrid open access). The author is the copyright owner. Some journals request payment of an article processing charge (APC) for transferring the copyright licence to the authors.


Open Metrics and Impact

As an alternative to traditional impact metrics, open metrics offer new ways of evaluating the impact of scholarly outputs.

Bibliometrics: Citation and content analysis used in Open Science.

Altmetrics: A project that produces article level metrics of scholarly articles from information collected from the Internet, such as social media sites, newspapers, and other sources.

Semantometrics: As opposed to existing Bibliometrics, Webometrics, Altmetrics, etc., Semantometrics are not based on measuring the number of interactions in the scholarly communication network, but primarily exploit the full text of manuscripts to assess the value of a publication.


Citizen science

Citizen science is the public involvement in inquiry and discovery of new scientific knowledge:

Finding data

Collecting data

Classifying data

Analyzing data

Curating data

Home-made set-ups for measuring

etc.

A good example is the COVID-19 tracker in Slovenia.


Open Reproducible Research

Open reproducible research is the practice of Open Science combined with giving users free access to the experimental elements needed to reproduce the research.

Reproducibility means that research data and code are made available so that others are able to reach the same results as are claimed in scientific outputs. Closely related is the concept of replicability, the act of repeating a scientific methodology to reach similar conclusions. These concepts are core elements of empirical research.

Improving reproducibility leads to increased rigour and quality of scientific outputs, and thus to greater trust in science. There has been a growing need and willingness to expose research workflows from initiation of a project and data collection right through to the interpretation and reporting of results. These developments have come with their own sets of challenges, including designing integrated research workflows that can be adopted by collaborators while maintaining high standards of integrity.

The concept of reproducibility is directly applied to the scientific method, the cornerstone of Science, and particularly to the following five steps:

Formulating a hypothesis.

Designing the study.

Running the study and collecting the data.

Analyzing the data.

Reporting the study.

Each of these steps should be clearly reported by providing clear and open documentation, and thus making the study transparent and reproducible.


Open Reproducible Research

Open Reproducible Research is based on:

Irreproducibility Studies: Studies of cases in which the results of a study or an experiment cannot be replicated and reproduced.

Open Lab/Notebooks: Laboratory research records, diaries, journals, workbooks etc. offered online free of cost with terms that allow reuse and redistribution of the recorded material.

Open Science Workflows: Sequences of processes that scientists use to manage and disseminate complex scientific investigations, offered online and free of cost under terms that allow reuse of the material.

Open Source in Open Science: Software where the source code is available free of cost with terms that allow dissemination and adaptation.

Reproducibility Guidelines: Ground rules to assist with the recreation of research experiments and studies.

Reproducibility Testing refers to the process of validating that the reported research results can be obtained in an independent experiment.


Open Science Tools

Open Science tools are tools that can assist in the process of delivering and building on Open Science.

Tools are:

Open archives that host scientific literature, data, software and other research objects and make their content freely accessible to everyone in the world.

Open services offered by organisations and institutions that can be used free of cost.

Open workflow tools (apparatuses and services) that promote open scientific projects.


Open Data

“A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike” -- opendefinition.org

This means, according to the Open Knowledge Foundation:

Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the internet. The data must also be available in a convenient and modifiable form.

Reuse and Redistribution: the data must be provided under terms that permit reuse and redistribution including the intermixing with other datasets.

Universal Participation: everyone must be able to use, reuse and redistribute; there should be no discrimination against fields of endeavour or against persons or groups. For example, ‘non-commercial’ restrictions that would prevent ‘commercial’ use, or restrictions of use for certain purposes (e.g. only in education), are not allowed.


RDF in the stack of Semantic Web technologies

RDF stands for:

Resource: Everything that can have a unique identifier (URI), e.g. pages, places, people, dogs, products...

Description: attributes, features, and relations of the resources

Framework: model, languages and syntaxes for these descriptions

RDF was published as a W3C recommendation in 1999.

RDF was originally introduced as a data model for metadata.

RDF was generalised to cover knowledge of all kinds.


RDF

The Resource Description Framework (RDF) is a data model for representing data and resources on the Web. It can be written down in several syntaxes.

RDF breaks every piece of information down in triples:

Subject – a resource, which may be identified with a URI.

Predicate – a URI-identified reused specification of the relationship.

Object – a resource or literal to which the subject is related.

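<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:s="http://description.org/schema/">
  <!-- the s: namespace URI is assumed here; the slide does not declare it -->
  <rdf:Description rdf:about="http://www.w3.org">
    <s:Publisher>World Wide Web Consortium</s:Publisher>
    <s:Title>W3C Home Page</s:Title>
    <s:Date>1998-10-03T02:27</s:Date>
  </rdf:Description>
</rdf:RDF>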
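The subject, predicate and object structure can be illustrated outside of any RDF syntax. The following is a minimal sketch, not a real RDF library: it stores the three statements about http://www.w3.org as plain Python tuples and filters them by pattern. The `S` namespace URI and the `match` helper are assumptions made for this illustration.

```python
# A triple is simply (subject, predicate, object).
# S is a placeholder namespace for the "s:" prefix used on the slide.
S = "http://description.org/schema/"

triples = [
    ("http://www.w3.org", S + "Publisher", "World Wide Web Consortium"),
    ("http://www.w3.org", S + "Title", "W3C Home Page"),
    ("http://www.w3.org", S + "Date", "1998-10-03T02:27"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every position that is not None."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# All objects attached to the subject through the Title predicate.
titles = [o for (_, _, o) in match(triples, p=S + "Title")]
print(titles)
```

Leaving a position as `None` plays the role of a variable, which is the same idea that SPARQL triple patterns build on.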


RDF/XML and N3 notation

RDF/XML:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Maribor">
    <dc:title>Maribor</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>

N3 notation:

@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://en.wikipedia.org/wiki/Maribor>
    dc:title "Maribor" ;
    dc:publisher "Wikipedia" .
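To see that both notations carry the same two statements, the RDF/XML form can be read with Python's standard `xml.etree.ElementTree` module and printed as one triple per line. This is a rough sketch under stated assumptions: the helper names `simple_triples` and `to_ntriples` are invented here, and the code only handles flat `rdf:Description` elements with literal values, so it is not a general RDF/XML parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

rdf_xml = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Maribor">
    <dc:title>Maribor</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>"""


def simple_triples(xml_text):
    """Extract (subject, predicate, literal) triples from flat RDF/XML."""
    root = ET.fromstring(xml_text)
    out = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for child in desc:
            # ElementTree writes qualified names as "{namespace}local".
            namespace, local = child.tag[1:].split("}")
            out.append((subject, namespace + local, (child.text or "").strip()))
    return out


def to_ntriples(triples):
    """Print-friendly form; literals are not escaped in this sketch."""
    return "\n".join(f'<{s}> <{p}> "{o}" .' for s, p, o in triples)


print(to_ntriples(simple_triples(rdf_xml)))
```

For real data a dedicated RDF library is the safer choice, because RDF/XML allows many more constructs than this sketch covers.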


Turtle notation

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/stuff/1.0/> .

<http://www.w3.org/TR/rdf-syntax-grammar>
    dc:title "RDF/XML Syntax Specification (Revised)" ;
    ex:editor [
        ex:fullname "Dave Beckett" ;
        ex:homePage <http://purl.org/net/dajobe/>
    ] .


N-Triples

N-Triples:

<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Document> .
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/terms/title> "N-Triples"@en-US .
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://xmlns.com/foaf/0.1/maker> _:art .
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://xmlns.com/foaf/0.1/maker> _:dave .
_:art <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
_:art <http://xmlns.com/foaf/0.1/name> "Art Barstow" .
_:dave <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
_:dave <http://xmlns.com/foaf/0.1/name> "Dave Beckett" .


SKOS - Simple Knowledge Organization System

SKOS is an area of work developing specifications and standards to support the use of knowledge organization systems (KOS) such as thesauri, classification schemes, subject heading systems and taxonomies within the framework of the Semantic Web.


RDF schema

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xml:base="http://www.animals.fake/animals#">
  <rdf:Description rdf:ID="animal">
    <rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
  </rdf:Description>
  <rdf:Description rdf:ID="horse">
    <rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
    <rdfs:subClassOf rdf:resource="#animal"/>
  </rdf:Description>
</rdf:RDF>
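The practical effect of `rdfs:subClassOf` is that class membership is inherited: anything typed as a horse is also an animal. The sketch below shows that inference in plain Python under stated assumptions. The `pony` class and the `my_pony` instance are hypothetical additions used only to show transitivity, and the function names are invented for this illustration.

```python
# "horse" and "animal" come from the schema above; "pony" is a hypothetical
# extra class added only to show that subClassOf is transitive.
sub_class_of = {
    "horse": {"animal"},
    "pony": {"horse"},
}


def superclasses(cls, sub_class_of):
    """All direct and inherited superclasses of cls."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in sub_class_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def infer_types(instance_types, sub_class_of):
    """Add every superclass of each asserted class to an instance's types."""
    return {
        inst: set(classes).union(*(superclasses(c, sub_class_of) for c in classes))
        for inst, classes in instance_types.items()
    }


inferred = infer_types({"my_pony": {"pony"}}, sub_class_of)
print(sorted(inferred["my_pony"]))
```

An RDFS reasoner performs this kind of closure automatically; the sketch only makes the rule visible.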


Example of RDF resource about Angola


Example of links between instances


Example of links between RDF triples


Example of multilingual use


SPARQL (SPARQL Protocol and RDF Query Language)

PREFIX abc: <http://example.com/exampleOntology#>
SELECT ?capital ?country
WHERE {
  ?x abc:cityname ?capital ;
     abc:isCapitalOf ?y .
  ?y abc:countryname ?country ;
     abc:isInContinent abc:Africa .
}

SPARQL is the standard language to query graph data represented as RDF triples.
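What the query asks for can be spelled out by hand. The following sketch evaluates the same four triple patterns over a small in-memory set of triples, without a SPARQL engine. The data (Luanda, Angola, Ljubljana, Slovenia) is hypothetical toy data invented for this illustration; only the property names come from the query.

```python
ABC = "http://example.com/exampleOntology#"

# Hypothetical toy data that uses the property names from the query.
triples = {
    (ABC + "Luanda", ABC + "cityname", "Luanda"),
    (ABC + "Luanda", ABC + "isCapitalOf", ABC + "Angola"),
    (ABC + "Angola", ABC + "countryname", "Angola"),
    (ABC + "Angola", ABC + "isInContinent", ABC + "Africa"),
    (ABC + "Ljubljana", ABC + "cityname", "Ljubljana"),
    (ABC + "Ljubljana", ABC + "isCapitalOf", ABC + "Slovenia"),
    (ABC + "Slovenia", ABC + "countryname", "Slovenia"),
    (ABC + "Slovenia", ABC + "isInContinent", ABC + "Europe"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the data."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]


def african_capitals():
    """Bind ?x and ?y as the query does and return (?capital, ?country) rows."""
    rows = []
    for x in {s for (s, _, _) in triples}:
        for capital in objects(x, ABC + "cityname"):
            for y in objects(x, ABC + "isCapitalOf"):
                if ABC + "Africa" not in objects(y, ABC + "isInContinent"):
                    continue
                for country in objects(y, ABC + "countryname"):
                    rows.append((capital, country))
    return sorted(rows)


print(african_capitals())
```

The nested loops correspond to the joins on `?x` and `?y`; a SPARQL engine plans and optimises these joins itself.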


SPARQL query - SELECT – Return all books under a certain price


Ontology

Components of ontology are:

Individuals: Instances or objects (the basic or "ground level" objects).

Classes: Sets, collections, concepts, classes in programming, types of objects or kinds of things.

Attributes: Aspects, properties, features, characteristics or parameters that objects (and classes) can have.

Relations: Ways in which classes and individuals can be related to one another.

Function terms: Complex structures formed from certain relations that can be used in place of an individual term in a statement.

Restrictions: Formally stated descriptions of what must be true in order for some assertion to be accepted as input.

Rules: Statements in the form of an if-then (antecedent-consequent) sentence that describe the logical inferences that can be drawn from an assertion in a particular form.


IoT Lite Ontology example

Source: IoT Lite Ontology. Available at https://www.w3.org/Submission/iot-lite/


Semantic Web Rule Language (SWRL)

The language is used for describing rules of the form antecedent ⇒ consequent.

Example: hasParent(?x1,?x2) ∧ hasBrother(?x2,?x3) ⇒ hasUncle(?x1,?x3)

<ruleml:imp>
  <ruleml:_rlab ruleml:href="#example1"/>
  <ruleml:_body>
    <swrlx:individualPropertyAtom swrlx:property="hasParent">
      <ruleml:var>x1</ruleml:var>
      <ruleml:var>x2</ruleml:var>
    </swrlx:individualPropertyAtom>
    <swrlx:individualPropertyAtom swrlx:property="hasBrother">
      <ruleml:var>x2</ruleml:var>
      <ruleml:var>x3</ruleml:var>
    </swrlx:individualPropertyAtom>
  </ruleml:_body>
  <ruleml:_head>
    <swrlx:individualPropertyAtom swrlx:property="hasUncle">
      <ruleml:var>x1</ruleml:var>
      <ruleml:var>x3</ruleml:var>
    </swrlx:individualPropertyAtom>
  </ruleml:_head>
</ruleml:imp>
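The uncle rule can be applied to concrete facts with a simple join on the shared variable. The sketch below is an illustration in plain Python rather than a rule engine; the individuals (Ana, Luka, Marko, Peter) and the function name `apply_uncle_rule` are hypothetical.

```python
def apply_uncle_rule(has_parent, has_brother):
    """hasParent(x1, x2) AND hasBrother(x2, x3) => hasUncle(x1, x3)."""
    return {
        (x1, x3)
        for (x1, x2) in has_parent
        for (y2, x3) in has_brother
        if x2 == y2  # the shared variable ?x2 joins the two atoms
    }


# Hypothetical individuals, used only for illustration.
has_parent = {("Ana", "Marko"), ("Luka", "Marko")}
has_brother = {("Marko", "Peter")}

has_uncle = apply_uncle_rule(has_parent, has_brother)
print(sorted(has_uncle))
```

The body atoms are the antecedent and the returned pairs are the consequent, so every derived `hasUncle` fact is backed by one matching pair of input facts.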


OWL

Functional syntax

DisjointClasses( :Woman :Man )

RDF/XML syntax

<owl:AllDisjointClasses>
  <owl:members rdf:parseType="Collection">
    <owl:Class rdf:about="Woman"/>
    <owl:Class rdf:about="Man"/>
  </owl:members>
</owl:AllDisjointClasses>

Turtle syntax

[] rdf:type owl:AllDisjointClasses ;
   owl:members ( :Woman :Man ) .

Manchester syntax

DisjointClasses: Woman, Man

OWL/XML syntax

<DisjointClasses>
  <Class IRI="Woman"/>
  <Class IRI="Man"/>
</DisjointClasses>
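All five syntaxes state the same axiom: no individual may belong to both classes. A reasoner uses such axioms to detect inconsistent data. The following is a minimal sketch of that check in plain Python; the individuals `Mary` and `Alex` and the function name are hypothetical, and a real OWL reasoner does considerably more than this.

```python
from itertools import combinations


def disjointness_violations(types_by_individual, disjoint_sets):
    """List (individual, class_a, class_b) where both classes are asserted
    although the classes are declared pairwise disjoint."""
    violations = []
    for individual, types in sorted(types_by_individual.items()):
        for group in disjoint_sets:
            for a, b in combinations(sorted(group), 2):
                if a in types and b in types:
                    violations.append((individual, a, b))
    return violations


disjoint_sets = [{"Woman", "Man"}]

# Hypothetical individuals; "Alex" is deliberately inconsistent.
types_by_individual = {
    "Mary": {"Woman"},
    "Alex": {"Woman", "Man"},
}

print(disjointness_violations(types_by_individual, disjoint_sets))
```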


Metadata – What is it?

Data about Data

=> Metadata IS data

A way of describing a research artefact in a structured, machine-readable way

Something my community or funder insists I have

Many distinct types of metadata exist, including:

Descriptive metadata is descriptive information about a resource. It is used for discovery and identification. It includes elements such as title, abstract, author, and keywords.

Structural metadata is metadata about containers of data and indicates how compound objects are put together, for example, how pages are ordered to form chapters. It describes the types, versions, relationships and other characteristics of digital materials.

Administrative metadata is information to help manage a resource, like resource type, permissions, and when and how it was created.

Reference metadata is information about the contents and quality of statistical data.

Statistical metadata, also called process data, may describe processes that collect, process, or produce statistical data.

Some examples of metadata schemes are: DC, COMARC, PREMIS, DCAT, FOAF, DDI… More about metadata standards can be found at https://www.dcc.ac.uk/guidance/standards/metadata/list


Metadata in life


Metadata in Life


Metadata Standards - Simple

Dublin Core
• Title
• Creator
• Subject
• Description
• Publisher
• Contributor
• Date
• Type
• Format
• Identifier
• Source
• Language
• Relation
• Coverage
• Rights

DataCite
• Title
• Creator
• Publisher
• Identifier
• Publication Year
• Resource Type
• Subject
• Contributor
• Date
• Related identifier
• Description
• Geolocation
• Language
• Alternate identifier
• Size
• Format
• Version
• Rights
• Funding Reference

EDMI
• Name
• Description
• Identifier
• url
• Creator
• Date Created
• license
• Data Standard
• Date Modified
• Access URL
• Access Interface
• Structure
• Included In
• Measurement Technique
• Keywords
• Variable Measured
• Format
• Scientific Type
• Includes
• Content Type
• Size
• Authentications
• Version
• Metric
• Same as
• Spatial Coverage
• Temporal coverage
• Citation
• Reference citation
• compression


Metadata standards - complex

The CERIF Standard – Base Entities


Exercises in metadata – Guess the film!

Elements: Subject, Description, Creator, Publisher, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights

Values shown on the slide:

• Italian, English, Spanish
• Part of trilogy
• US Civil War
• Produzioni Europee Associate
• Original Work
• None
• 35mm anamorphic
• Spaghetti Western
• 23 December 1966
• Produzioni Europee Associate
• Age & Scarpelli, Luciano Vincenzoni
• Three men searching for stolen gold
• None
• ?


Dublin Core (DC)

The Dublin Core, also known as the Dublin Core Metadata Element Set, is a set of fifteen "core" elements (properties) for describing resources. The resources described using the Dublin Core may be digital resources (video, images, web pages, etc.), as well as physical resources such as books or CDs, and objects like artworks.

Dublin Core metadata may be used for multiple purposes, from simple resource description to combining metadata vocabularies of different metadata standards, to providing interoperability for metadata vocabularies in the linked data cloud and Semantic Web implementations.

The Dublin Core standard originally included two levels: Simple and Qualified. Simple Dublin Core comprised 15 elements; Qualified Dublin Core included three additional elements (Audience, Provenance and RightsHolder), as well as a group of element refinements (also called qualifiers) that could refine the semantics of the elements in ways that may be useful in resource discovery.


Example of metadata for a diploma thesis in the Digital Library of the University of Maribor

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://dkum.uni-mb.si/IzpisiGradiva.php?id=5946">
    <dc:title>Placilni procesi multinacionalnega podjetja : diplomsko delo</dc:title>
    <dc:creator>Filipan Kraljic, Biserka (Komentor)</dc:creator>
    <dc:creator>Zbašnik, Dušan (Mentor)</dc:creator>
    <dc:creator>Kutnjak, Matej (Avtor)</dc:creator>
    <dc:subject>plačilni promet</dc:subject>
    <dc:subject>prenosni sistemi</dc:subject>
    <dc:subject>multinacionalne družbe</dc:subject>
    <dc:subject>plačilni instrumenti</dc:subject>
    <dc:subject>kliring</dc:subject>
    <dc:subject>dokumentarni akreditivi</dc:subject>
    <dc:subject>bančno poslovanje</dc:subject>
    <dc:subject>garancije</dc:subject>
    <dc:subject>jamstvo</dc:subject>
    <dc:subject>mednarodne finance</dc:subject>
    <dc:subject>menice</dc:subject>
    <dc:subject>plačilni sistemi</dc:subject>
    <dc:subject>poravnava</dc:subject>
    <dc:subject>mednarodna podjetja</dc:subject>
    <dc:subject>cilj</dc:subject>
    <dc:subject>centralizacija</dc:subject>
    <dc:subject>zakladnica</dc:subject>
    <dc:description>[M. Kutnjak]</dc:description>
    <dc:date>2006</dc:date>
    <dc:date>2007-09-11 14:51:18</dc:date>
    <dc:type>Bibliografija</dc:type>
    <dc:identifier>5946</dc:identifier>
    <dc:identifier>COBISS_ID: 402472</dc:identifier>
    <dc:identifier>UDK: 339.727(043.2):334.726(497.4:4)</dc:identifier>
    <dc:source>Velenje</dc:source>
    <dc:language>sl</dc:language>
  </rdf:Description>
</rdf:RDF>


Knowledge collections

Dictionaries – example Pons

Thesauri – examples: EuroVoc, AGROVOC, UDC, LCH

Semantic dictionaries – example: WordNet

DBpedia – see also RelFinder and its classes

Babelnet

Geonames

UMLS


WordNet

[Figure: WordNet synsets connected by hyperonym and meronym relations]

Synsets shown in the figure:

{vehicle}
{conveyance; transport}
{motor vehicle; automotive vehicle}
{car; auto; automobile; machine; motorcar}
{cruiser; squad car; patrol car; police car; prowl car}
{cab; taxi; hack; taxicab}
{bumper}
{car door}
{car window}
{car mirror}
{hinge; flexible joint}
{doorlock}
{armrest}


Cyc


Linked open data


Some useful links

FAIRsharing.org – a curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.

RDA metadata standards

BioPortal – a comprehensive repository of biomedical ontologies.

AgroPortal – an ontology portal/repository (with periodically updated versions) dedicated to the agronomic and plant domains.

OGC standards – a list of standards and ontologies for the geospatial domain.

Linked open vocabularies

QUDT CATALOG – Quantities, Units, Dimensions and Data Types Ontologies

Schema.org – a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond.


Research Data

Raw / Cleaned or Filtered

Field data

Experimental data

Derived data

Qualitative / Quantitative

Structured / Semi-structured / Unstructured

Tabular / Hierarchical / Graphs

Open access / Restricted access

Linked data

Metadata

Big data

Source: https://www.bitmat.it/blog/news/83536/sviluppare-applicazioni-iot-riducendo-costi-e-risorse


Data sources

Devices

Instruments

Sensors

Software

People

Observations

Experiments

Simulations

Emulations

Surveys

Interviews

Text analysis

Text mining

…………


Categorization of the file formats of Open Data


5 Star Open Data


Source: http://5stardata.info/


Persistent identifiers

Resolvable links independent of data locations

Common technologies:

DOI, PURL, URNs, ARKs, PMID…

Example: doi:10.1594/PANGAEA.667386

[Figure: the identifier is passed to a resolver, which points to Site A or Site B]
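The point of a resolver is that the identifier stays the same while the location behind it may change. The sketch below models that with a toy in-memory resolver. The `Resolver` class and the `site-a.example` and `site-b.example` addresses are hypothetical; the only real-world convention used is that a DOI can be written as a link under https://doi.org/.

```python
class Resolver:
    """Toy in-memory resolver: persistent identifier -> current location."""

    def __init__(self):
        self._locations = {}

    def register(self, pid, url):
        """Create or update the location recorded for a PID."""
        self._locations[pid] = url

    def resolve(self, pid):
        return self._locations[pid]


def doi_to_url(doi):
    """Turn 'doi:10.x/y' into its https://doi.org/ link form."""
    name = doi[4:] if doi.lower().startswith("doi:") else doi
    return "https://doi.org/" + name


resolver = Resolver()
pid = "doi:10.1594/PANGAEA.667386"

# Hypothetical locations standing in for Site A and Site B.
resolver.register(pid, "https://site-a.example/dataset/667386")
first = resolver.resolve(pid)

resolver.register(pid, "https://site-b.example/dataset/667386")  # data moved
second = resolver.resolve(pid)

print(pid, "->", first, "then", second)
print(doi_to_url(pid))
```

Citations keep using the PID, so a move from one site to another only requires updating the resolver record.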


Structure of a Digital Specimen Digital Object (DSDO)

PIDs are pointers that resolve to location of the item e.g., DO itself, physical specimen, hi-res images, label information, tissue sample, DNA sequence, etc.

Source: Alex Hardisty FAIR Digital Objects as Basic Design Choice and the Need for PIDs

[Figure: the DO as an envelope. Labels: PID kernel information (metadata about the specimen DO); PID 1, PID 2, …, PID N; metadata 2 … metadata N; content N; bit sequence.]


Physical Object Digital Surrogate

FAIR Digital Object

A machine-actionable knowledge unit

Source: Alex Hardisty, FAIR Digital Objects as Basic Design Choice and the Need for PIDs


FAIR digital objects and persistent identifiers (PIDs)

Source: RDA's Data Foundation & Terminology Group (DFT) 2014: Core Model


FAIR (Findable, Accessible, Interoperable, Reusable)

To be Findable:

F1. (meta)data are assigned a globally unique and eternally persistent identifier.

F2. data are described with rich metadata.

F3. (meta)data are registered or indexed in a searchable resource.

F4. metadata specify the data identifier.

To be Accessible:

A1. (meta)data are retrievable by their identifier using a standardized communications protocol.

A1.1. the protocol is open, free, and universally implementable.

A1.2. the protocol allows for an authentication and authorization procedure, where necessary.

A2. metadata are accessible, even when the data are no longer available.

To be Interoperable:

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (meta)data use vocabularies that follow FAIR principles.

I3. (meta)data include qualified references to other (meta)data.

To be Re-usable:

R1. (meta)data have a plurality of accurate and relevant attributes.

R1.1. (meta)data are released with a clear and accessible data usage license.

R1.2. (meta)data are associated with their provenance.

R1.3. (meta)data meet domain-relevant community standards.

Source: https://www.force11.org/group/fairgroup/fairprinciples

Source: https://libereurope.eu/blog/2018/07/13/fairdataconsultation/liber-fair-data-2/
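A few of the principles can be turned into simple automated checks on a metadata record. The sketch below is only an illustration of that idea, not an official FAIR assessment: the field names (`identifier`, `title`, `creator`, `description`, `license`, `provenance`), the choice of which principles to check, and the example record with its `10.1234` identifier are all assumptions made here.

```python
# Hypothetical field names; real repositories use their own metadata schemas.
CHECKS = {
    "F1 identifier": lambda m: bool(m.get("identifier")),
    "F2 rich metadata": lambda m: all(
        m.get(key) for key in ("title", "creator", "description")
    ),
    "R1.1 licence": lambda m: bool(m.get("license")),
    "R1.2 provenance": lambda m: bool(m.get("provenance")),
}


def fair_report(metadata):
    """Return {check name: passed?} for one metadata record."""
    return {name: check(metadata) for name, check in CHECKS.items()}


# Hypothetical record: the description is empty and provenance is missing.
record = {
    "identifier": "doi:10.1234/example",
    "title": "Example dataset",
    "creator": "A. Researcher",
    "description": "",
    "license": "CC-BY-4.0",
}

report = fair_report(record)
for name, passed in report.items():
    print(f"{name}: {'ok' if passed else 'missing'}")
```

Checks like these only test for the presence of fields; whether the metadata are actually rich, accurate and community-compliant still needs human or domain-specific review.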


RDA FAIR Data Maturity Model 1


RDA FAIR Data Maturity Model 2


Research data management plan

A research data management plan (DMP) is a written document that describes the data you expect to acquire or generate during the course of a research project, how you will manage, describe, analyze, and store those data, and what mechanisms you will use at the end of your project to share and preserve your data.

You may have already considered some or all of these issues with regard to your research project, but writing them down helps you formalize the process, identify weaknesses in your plan, and provide you with a record of what you intend(ed) to do.

Data management is best addressed in the early stages of a research project, but it is never too late to develop a data management plan.


Source: Research data management roles and responsibilities. Available at: http://www.researchsupport.uct.ac.za/research-data-management-roles-and-responsibilities


Open data life cycle

Discovery and planning

Data collection

Processing and analysis

Publish and share

Long-term management

Reusing data


Data management plan

A data management plan is a formal document that provides a framework for how to handle the data material during and after the research project. The way a DMP will look once it is finished is not universal. It is a "living" document that changes together with the needs of a project and its participants. It is updated throughout the project to make sure that it tracks such changes over time and that it reflects the current state of your project.

For guidance on preparing a data management plan, see:

CESSDA Data Management Expert Guide

Funders' data plan requirements

Checklist for a Data Management Plan

DMPonline


The Data Pyramid

[Figure: pyramid with the layers Working Data, Registered Data and Published Data; axes: Increasing Sharing, Increasing Volume]


Is Data Storage the Same as Data Preservation?

No!

Storage is subject to many issues

Media failure

‘bit rot’

Natural disaster

Format obsolescence

Human deletion – accidental or malicious

Media Loss

Loss of funding

Link rot and reference rot

Data Preservation is ensuring data is available for re-use into the future

Minimising the risk of data loss


Preservation vs Curation

Definitions (lexico.com)

Preserve

Maintain (something) in its original or existing state.

Curate

Select, organize, and look after items (in a collection or exhibition)

In the digital era the concepts are similar:

Preserve: Make sure the bits you store stay the same over time

Curate: Make sure the information you store is discoverable and usable over time

Let's look at some examples…
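One common way to check that stored bits stay the same over time is a fixity check: record a checksum for every file at deposit time and recompute it later. The sketch below is a minimal illustration using only the Python standard library. The function names and the manifest layout are assumptions made for this example and do not describe any particular repository's implementation.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its checksum."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose content changed."""
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

A manifest built at deposit time would be stored separately from the data, and `verify_manifest` would be run periodically. An empty result means no file in the manifest is missing or altered. A checksum only detects damage, so recovery still depends on keeping independent copies.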


Clear license information is important because...

It tells users and reusers exactly what they can do with your data and metadata.

It encourages the use and reuse of your data and metadata the way you want them to be used and reused.

It creates visibility of your efforts downstream (if you ask for attribution).

If no explicit licence is provided, a user does not know what can be done with the data/metadata. The default legal position is that nothing can be done without contacting the owner on a case-by-case basis.


Four key tips when publishing information about the licence

1. Make sure your licensing information is easy to find.

2. Include information about the license in the metadata of each data set.

3. Use simple licenses to ensure they are easy to understand by re-users.

4. Check spelling, typos and spacing to ensure consistency in the names of the licenses used.
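Tips 2 and 4 can be combined by storing a single canonical licence identifier in each data set's metadata rather than free text. The sketch below assumes SPDX licence identifiers as the canonical names. The metadata field names (`license`, `id`, `url`) and the helper functions are hypothetical and chosen only for this illustration.

```python
import re

# Canonical SPDX identifiers mapped to the public licence URL.
KNOWN_LICENCES = {
    "CC0-1.0": "https://creativecommons.org/publicdomain/zero/1.0/",
    "CC-BY-4.0": "https://creativecommons.org/licenses/by/4.0/",
    "CC-BY-ND-4.0": "https://creativecommons.org/licenses/by-nd/4.0/",
}


def _key(name: str) -> str:
    """Reduce a licence name to lowercase letters and digits only."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Lookup table that ignores case, spacing and punctuation differences.
_LOOKUP = {_key(spdx): spdx for spdx in KNOWN_LICENCES}


def normalise_licence(name: str) -> str:
    """Return the canonical identifier, or raise if the name is unknown."""
    try:
        return _LOOKUP[_key(name)]
    except KeyError:
        raise ValueError(f"Unrecognised licence name: {name!r}") from None


def add_licence(metadata: dict, licence_name: str) -> dict:
    """Return a copy of the metadata with a canonical licence entry."""
    spdx = normalise_licence(licence_name)
    return {**metadata, "license": {"id": spdx, "url": KNOWN_LICENCES[spdx]}}
```

With this approach, inputs such as `cc by 4.0` and `CC-BY-4.0` end up as the same identifier, and an unknown or misspelled name is rejected instead of being silently stored.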


Clear licence information - example


Different data have different licensing needs

Some data(sets) may be required to be openly available (e.g. where required by the Constitution of the Republic of Slovenia).

Some data(sets) may be subject to restrictions (e.g. privacy, national security, third party rights).

Some data(sets) may be available for reuse but not for modification (e.g. legal texts, public budgets). If modifications are made, it must be made clear that the data is not the authentic version.

Some data(sets) may be published allowing derivations with attribution of the authoritative source (e.g. legal commentary, translations).


Licensing approaches: Creative Commons (1)


Licensing approaches: Creative Commons (2)


Good practices for licensing your data

Good practices:

If the original data is in the public domain (e.g. by law), keep it there – use for example the Creative Commons Zero Public Domain Dedication or the Open Data Commons Public Domain Dedication and License (PDDL)

For some documentation, integrity needs to be protected; use a No-Derivatives licence, for example Creative Commons Attribution-NoDerivs, but only if really necessary.

Avoid Non-Commercial licences if at all possible, as these seriously restrict reuse.

Licenses for data should provide appropriate security and control (but not more than that).


Using an open and unrestrictive license for your data

Whenever data is licensed for open and unrestricted access, reusers can create new knowledge by combining it with other data.

For example:

Cross-referencing public spending with geographic data to visualise which regions are better funded.

Matching public transport timetables with GPS data to be able to give real time information on delays.

Measuring performance of public services based on transaction counters and waiting times.

Deriving recommendations for prevention policies by relating accident statistics to weather data and road maps.
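The first example above, cross-referencing spending with regional data, amounts to joining two openly licensed data sets on a shared key. The sketch below shows the idea in plain Python. All region codes, names and figures are made up for illustration, and the field names are assumptions.

```python
# Hypothetical open data sets sharing a region code (all values made up).
SPENDING = [
    {"region_code": "R1", "spending_eur": 1_200_000},
    {"region_code": "R2", "spending_eur": 450_000},
    {"region_code": "R9", "spending_eur": 10_000},  # no matching region record
]
REGIONS = [
    {"region_code": "R1", "name": "Region One", "population": 300_000},
    {"region_code": "R2", "name": "Region Two", "population": 90_000},
]


def spending_per_capita(spending: list[dict], regions: list[dict]) -> dict[str, float]:
    """Join the two data sets on region_code and compute spending per person."""
    by_code = {region["region_code"]: region for region in regions}
    result = {}
    for row in spending:
        region = by_code.get(row["region_code"])
        if region is None:
            continue  # skip spending rows that cannot be matched to a region
        result[region["name"]] = row["spending_eur"] / region["population"]
    return result
```

The join only works because both publishers use the same region identifier and both licences permit combination, which is one reason why consistent identifiers and open licences matter for reuse.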


Software licenses


Good practices for licensing your metadata

What you need to think about:

Metadata helps people to discover your data.

The wider your metadata is distributed, the higher your visibility is.

Others may want to add to it, enhance it, link to other resources.

Good practices:

Licences for metadata should be as open as possible.

A public domain licence allows the widest reuse.

An attribution licence ensures you get credit downstream, but may cause problems if data is shared multiple times (attribution stacking).


Machine readability of licenses – use Open Digital Rights Language

The Open Digital Rights Language (ODRL) is a policy expression language that provides a flexible and interoperable information model, vocabulary, and encoding mechanisms for representing statements about the usage of content and services. The ODRL Information Model describes the underlying concepts, entities, and relationships that form the foundational basis for the semantics of the ODRL policies.

Policies are used to represent permitted and prohibited actions over a certain asset, as well as the obligations required to be met by stakeholders. In addition, policies may be limited by constraints (e.g., temporal or spatial constraints) and duties (e.g. payments) may be imposed on permissions.

Please see examples of machine-readable licenses at http://rdflicense.appspot.com/.
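To make the policy structure concrete, the sketch below builds a small ODRL policy as a Python dictionary and serialises it as JSON-LD. It follows the general shape of the ODRL 2.2 JSON-LD serialisation (a `Set` policy with a permission, an attached duty, and a prohibition). The policy and asset identifiers are placeholders, the function name is hypothetical, and the output has not been validated with an ODRL processor.

```python
import json


def odrl_policy(policy_uid: str, asset_uri: str) -> dict:
    """Build a simple ODRL policy: distribution is permitted with
    attribution, and modification of the asset is prohibited."""
    return {
        "@context": "http://www.w3.org/ns/odrl.jsonld",
        "@type": "Set",
        "uid": policy_uid,
        "permission": [
            {
                "target": asset_uri,
                "action": "distribute",
                # Duty attached to the permission: credit the source.
                "duty": [{"action": "attribute"}],
            }
        ],
        "prohibition": [
            {
                "target": asset_uri,
                "action": "modify",
            }
        ],
    }


# Placeholder identifiers, used only for illustration.
example = odrl_policy(
    "http://example.com/policy/1",
    "http://example.com/dataset/1",
)
print(json.dumps(example, indent=2))
```

This policy corresponds roughly to an attribution, no-derivatives style of licence. Constraints such as temporal or spatial limits would be added as further properties on the permission.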


Ethics and research data management

Source: Ethics and research data management. Available at:

http://www.researchsupport.uct.ac.za/ethics-and-research-data-management


Sharing sensitive data

Sensitive data that contain potentially identifying information, whether human subject data or other types of sensitive data, will likely need to be modified before they are shared with the public. These modifications are important for protecting participant confidentiality, the location of endangered wildlife, or other relevant interests. However, they may alter the data to the point where reproducibility or subsequent research by others is no longer possible. You might consider retaining multiple versions of the data: one that is suitable for public release, and one that is suitable for further research but is available only on a highly restricted basis.

For protected health information (PHI), the HIPAA Privacy Rule provides two methods for de-identification: the expert determination method and the safe harbor method. See the resources listed below for documentation on these methods from the US Department of Health and Human Services, as well as information on how to satisfy these two methods.


Types of identifying information

Direct identifiers: These data point directly to an individual and are typically removed from data sets before sharing with the public. These may include: name; initials; mailing address; phone number; email address; unique identifying numbers, like Social Security numbers or driver's license numbers; vehicle identifiers; medical device identifiers; web or IP addresses; biometric data; photographs of the person; audio recordings; names of relatives; and dates specific to an individual, like date of birth or marriage.

Indirect identifiers: These may seem harmless on their own, but can point to an individual when combined with other data. It has been recommended that datasets containing three or more indirect identifiers should be reviewed by an independent researcher or ethics committee to evaluate identification risk. Any indirect information not needed for the analysis should be removed. It may be reasonable to supply some of these types of data in aggregated form (like ranges of annual incomes instead of exact numbers). Indirect identifiers may include: place of medical treatment or doctor's name; gender; rare disease or treatment; sensitive data like illicit drug use or other "risky behaviors"; place of birth; socioeconomic data, like workplace, occupation, annual income, or education; general geographic indicators, like postal code of residence; household and family composition; ethnicity; birth year or age; and verbatim responses or transcripts.
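The handling described above can be sketched in a few lines: drop direct identifiers, replace exact incomes with ranges, and flag records that carry three or more indirect identifiers for review. The field names, the identifier sets and the 10,000-unit range width are assumptions made for this illustration. The sketch does not by itself satisfy either HIPAA de-identification method.

```python
# Hypothetical field names; a real data set needs its own reviewed lists.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "date_of_birth"}
INDIRECT_IDENTIFIERS = {
    "postal_code", "occupation", "annual_income", "birth_year", "ethnicity",
}


def income_range(income: int, width: int = 10_000) -> str:
    """Replace an exact income with the range it falls into."""
    lower = (income // width) * width
    return f"{lower}-{lower + width - 1}"


def deidentify(record: dict) -> dict:
    """Drop direct identifiers and aggregate the annual income."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS
    }
    if "annual_income" in cleaned:
        cleaned["annual_income"] = income_range(cleaned["annual_income"])
    return cleaned


def needs_review(record: dict, threshold: int = 3) -> bool:
    """Flag records with `threshold` or more indirect identifiers."""
    return len(INDIRECT_IDENTIFIERS.intersection(record)) >= threshold
```

Aggregation reduces but does not remove identification risk, so a record flagged by `needs_review` would still go to an independent researcher or ethics committee as recommended above.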


Sharing Sensitive Data with Confidence: The Datatags System

Non-confidential information

Non-confidential information

Potentially harmful personal information

Sensitive personal information

Very sensitive personal information

Maximum sensitive personal information


EUDAT Collaborative Data Infrastructure (CDI)

Access and reuse of data:

• Repository service (B2SHARE and B2DROP),

• data discovery service (B2FIND),

• authentication and authorization (B2ACCESS).

Data preservation and storage:

• Data management service (B2SAFE).

• PID service (B2HANDLE).

Data transfer (B2STAGE).

Sensitive data management.


A network diagram of the Slovenian open access infrastructure


Structure of repository in the Slovenian open access infrastructure


Slovenian big data archive structure


THANK YOU FOR YOUR ATTENTION

www.prace-ri.eu
