Data product manifest format - DRAFT
This format started out as a way to describe to OINK how datasets and other data sources are to be (re)loaded by the OINK Housekeeper, but we believe it can have other uses as well. The standard W3C solution to this problem, accessing RDF content via the URI of the corresponding namespace (content negotiation and other "server magic"), is not always particularly usable.
Basics
In the context of this document, a "manifest" is an RDF file that contains an "opinionated" form of a single DCAT catalog. The catalog (an instance of `dcat:Catalog`) can have one or more datasets (instances of `dcat:Dataset`), each of which has one distribution (an instance of `dcat:Distribution`; possibly multiple distributions in the future). A special case of a dataset with no distributions is also discussed below.
The identifier of the catalog is the URL of the manifest file, so that you can write something like this:
@prefix dcat: <http://www.w3.org/ns/dcat#> .
<> a dcat:Catalog ;
# ...
.
We provide a SHACL shapes document for validating our assumptions of what these catalogs look like.
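For illustration, such a shapes document might contain a shape like the following. This is a hypothetical fragment (the `ex:` namespace and the specific constraints are made up here); the actual shapes document is authoritative:

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/manifest-shapes#> .

# Hypothetical constraint: every catalog in a manifest must list
# at least one dataset, and datasets must be identified by IRIs.
ex:CatalogShape a sh:NodeShape ;
    sh:targetClass dcat:Catalog ;
    sh:property [
        sh:path dcat:dataset ;
        sh:minCount 1 ;
        sh:nodeKind sh:IRI
    ] .
```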
Datasets and their distributions
A simple catalog would look like this:
@prefix dcat: <http://www.w3.org/ns/dcat#> .
<> a dcat:Catalog ;
dcat:dataset <NAMESPACE-URI> .
<NAMESPACE-URI> a dcat:Dataset ;
dcat:distribution [
a dcat:Distribution ;
dcat:downloadURL <DOWNLOAD-URL> ;
dcat:mediaType <MEDIA-TYPE-URI>
] .
This example includes a few things of note. First, each dataset uses the URI of its namespace as its identifier (in the example, `<NAMESPACE-URI>`). Second, a distribution provides a download URL (in this case, `<DOWNLOAD-URL>`) and, optionally, a media type (indicated as `<MEDIA-TYPE-URI>`; see the IANA Media Types registry for URIs to use). These are some of the most common ones:
- Turtle: https://www.iana.org/assignments/media-types/text/turtle
- RDF/XML: https://www.iana.org/assignments/media-types/application/rdf+xml
- N-Triples: https://www.iana.org/assignments/media-types/application/n-triples
- JSON-LD: https://www.iana.org/assignments/media-types/application/ld+json
Note that OINK can only load "triple-based" formats, not "quad-based" ones, because it uses named graphs for its own housekeeping.
Declaring namespaces
If a dataset wants to declare a namespace prefix for its namespace, this can be done as part of the dataset, using vocabulary from SHACL:
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
<NAMESPACE-URI> a dcat:Dataset ;
sh:declare [
sh:namespace <NAMESPACE-URI> ;
sh:prefix "PREFIX"
] ;
dcat:distribution [
a dcat:Distribution ;
dcat:downloadURL <DOWNLOAD-URL> ;
dcat:mediaType <MEDIA-TYPE-URI>
] .
Note that some RDF processors, when loading data, "harvest" namespace prefix declarations for later use, but this cannot necessarily be relied on. Also, in most cases the URI of the dataset is the same as the URI of the namespace being declared, but not always (at the time of writing, gist is the only notable exception we know of).
A special case is when we want to just declare a namespace prefix without an actual distribution to load:
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
<NAMESPACE-URI> a dcat:Dataset ;
sh:declare [
sh:namespace <NAMESPACE-URI> ;
sh:prefix "PREFIX"
] .
In some cases, one may want to declare an "alternate" namespace URI. This happens most typically when an ontology is referred to using a URI without a hash (#) at the end but its namespace declaration includes the hash (e.g., `http://www.w3.org/2000/01/rdf-schema#` vs. `http://www.w3.org/2000/01/rdf-schema`), or when the publishers of an ontology have been confused between `http:` and `https:`. (Note: OINK can perform "canonicalization" of URIs by substituting the canonical URI for the alternate URI.) Example:
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix oink: <https://somanyaircraft.com/data/schema/oink#> .
<http://www.w3.org/2000/01/rdf-schema#> a dcat:Dataset ;
sh:declare [
sh:namespace <http://www.w3.org/2000/01/rdf-schema#> ;
sh:prefix "rdfs" ;
oink:alternateNamespace <http://www.w3.org/2000/01/rdf-schema>
] .
Grouping
Datasets can be grouped into dataset series (instances of `dcat:DatasetSeries`). In this case, the series is included in the catalog as a dataset, and each of the constituent datasets in the series is linked to the series via `dcat:inSeries`.
Here is an example:
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
<> a dcat:Catalog ;
dcat:dataset <#SERIES> .
<#SERIES> a dcat:DatasetSeries .
<NAMESPACE-URI> a dcat:Dataset ;
dcat:inSeries <#SERIES> ;
sh:declare [
sh:namespace <NAMESPACE-URI> ;
sh:prefix "PREFIX"
] ;
dcat:distribution [
a dcat:Distribution ;
dcat:downloadURL <DOWNLOAD-URL> ;
dcat:mediaType <MEDIA-TYPE-URI>
] .
Note that `dcat:DatasetSeries` is a subclass of `dcat:Dataset`, so a dataset series can itself be linked to a "parent series" via `dcat:inSeries`. This allows unlimited hierarchies of datasets in a catalog.
Indirection and "catalog recursion"
Our current scheme allows a manifest to "point" to other manifests rather than enumerating sources explicitly. The `dcat:catalog` property on catalogs is used for this purpose. Since catalogs inside manifest files are identified by the URL of the file, this works out nicely. Processing a manifest file therefore requires recursive traversal and loading of dependent manifest files.
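For example, a top-level manifest might enumerate no sources of its own and simply delegate to two other manifests (the URLs here are placeholders):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .

# This catalog delegates to other manifests; each referenced catalog
# is identified by the URL of its own manifest file.
<> a dcat:Catalog ;
    dcat:catalog <OTHER-MANIFEST-URL-1>, <OTHER-MANIFEST-URL-2> .
```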
Periodicity
Periodicity is specified on a dataset, not on a distribution, and is understood as periodic re-loading of the dataset's distribution.
A simple example, something loaded once a day:
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix freq: <http://purl.org/cld/freq/> .
[
a dcat:Dataset ;
dcterms:accrualPeriodicity freq:daily ;
# dcat:distribution ...
] .
Since the Dublin Core Collection Description Frequency Vocabulary does not specify periods shorter than one day, we adopt the convention that such short periods are specified using a combination of `dcterms:accrualPeriodicity` with the value `freq:continuous` and an `xsd:duration` value for `dcat:temporalResolution`.
Here is an example of something loaded every 15 minutes:
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix freq: <http://purl.org/cld/freq/> .
[
a dcat:Dataset ;
dcterms:accrualPeriodicity freq:continuous ;
dcat:temporalResolution "PT15M"^^xsd:duration ;
# dcat:distribution ...
] .
Loading entire directories
A distribution whose `dcat:packageFormat` is the special value `oink:Directory` is taken to mean that the distribution's `dcat:downloadURL` is a directory, and that the distribution consists of the collection of files in that directory. The implication is that this can only be used with `file:` URLs.
Here is an example:
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix oink: <https://somanyaircraft.com/data/schema/oink#> .
[
a dcat:Distribution ;
dcat:downloadURL <file:///Users/ora/data/> ;
dcat:packageFormat oink:Directory
] .
Loading non-RDF sources
When defining a source in a manifest, it is possible to indicate that some pre-processing needs to happen before the resulting data can be loaded. The most typical use of this facility is to specify an RML mapping to be applied to the raw source data.
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix oink: <https://somanyaircraft.com/data/schema/oink#> .
[
a dcat:Distribution ;
dcat:downloadURL <file:///Users/ora/data/tabular.csv> ;
dcat:mediaType <https://www.iana.org/assignments/media-types/text/csv> ;
oink:mapping <MAPPING-GRAPH-URL>
] .
It is an error to specify a non-RDF data source without specifying a mapping via `oink:mapping`.
Note that any file format accepted by the `tinyrml` and `rdfproducers` mapping implementations is acceptable, for example:
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix oink: <https://somanyaircraft.com/data/schema/oink#> .
[
a dcat:Distribution ;
dcat:downloadURL <file:///Users/ora/data/spreadsheet.xlsx> ;
dcat:mediaType <https://www.iana.org/assignments/media-types/application/vnd.ms-excel> ;
oink:mapping <MAPPING-GRAPH-URL>
] .
The basic assumptions are that the mapping file itself is a `tinyrml`-compatible RML mapping, and that the media type (`dcat:mediaType`) determines which mapper from `rdfproducers` is applied:
+ `text/csv`: `rdfproducers.PlainCSVMapper`
+ `application/vnd.ms-excel`: `rdfproducers.ExcelMapper`
(with more media types to be added...)
To load "plain JSON" as JSON-LD, a context document has to be provided. This can be done as follows:
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix oink: <https://somanyaircraft.com/data/schema/oink#> .
[
a dcat:Distribution ;
dcat:downloadURL <file:///Users/ora/data/stuff.json> ;
dcat:mediaType <https://www.iana.org/assignments/media-types/application/ld+json> ;
oink:context <CONTEXT-URL>
] .
Linking to shapes graphs
A dataset can specify a shapes graph to be used to validate the dataset (a distribution thereof) when loaded. We use the SHACL-provided property `sh:shapesGraph` for this.
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
<> a dcat:Catalog ;
dcat:dataset <NAMESPACE-URI> .
<NAMESPACE-URI> a dcat:Dataset ;
sh:shapesGraph <SHAPES-GRAPH-URI> ;
# dcat:distribution ...
.
Note that OINK validates the source distribution before loading, and will issue a warning if the validation fails (but will still load the source).
Open question: should we assume that `<SHAPES-GRAPH-URI>` merely designates a named graph (one that is accessible to the processor), or should we further assume that the URI can be dereferenced and an actual shapes graph loaded?
Feature wish list
- Read ZIP and `tar` archives like directories; decompress and read `gzip` and `bz2` files
- Provide access credentials
- Read data through an API (e.g., a Google Sheet combined with an RML mapping)
- Recursive traversal of directory trees
- Unix-style pathname expansion (e.g., `glob`)
- Specify sources as `rdfproducer.Producer` or `rdfhelpers.Composable` pipelines
- Ability to "patch" datasets using something like RDF Patch (useful if you do not have control over the original source)