Writing profiles
Writing custom profiles¶
Internally, profiles are classes that define a particular set of methods called during the parsing process.
For instance, the parse_dataset() method is called on each DCAT dataset found when parsing an RDF file, and should return a CKAN dataset. Conversely, the graph_from_dataset() method is called when an RDF representation of a dataset is requested, and needs to generate the necessary RDF graph.
Custom profiles should always extend the ckanext.dcat.profiles.RDFProfile class. This class has several helper functions that make it easier to get metadata from the RDF graph, including helpers for getting fields of FOAF and VCard entities like the ones used to define publishers or contact points. Check the source code of ckanext/dcat/profiles/base.py to see what is available.
Profiles can extend other profiles to avoid repeating rules, or can be completely independent.
The following example shows a complete profile built on top of the European DCAT-AP profile (euro_dcat_ap):
from rdflib import BNode, Literal, URIRef
from rdflib.namespace import Namespace, RDF, RDFS

from ckanext.dcat.profiles import RDFProfile

DCT = Namespace("http://purl.org/dc/terms/")


class SwedishDCATAPProfile(RDFProfile):
    '''
    An RDF profile for the Swedish DCAT-AP recommendation for data portals

    It requires the European DCAT-AP profile (`euro_dcat_ap`)
    '''

    def parse_dataset(self, dataset_dict, dataset_ref):

        # Spatial label
        spatial = self._object(dataset_ref, DCT.spatial)
        if spatial:
            # Graph.value() looks up the rdfs:label of the spatial node
            # (Graph.label() is not available in recent RDFLib releases)
            spatial_label = self.g.value(spatial, RDFS.label)
            if spatial_label:
                dataset_dict['extras'].append({'key': 'spatial_text',
                                               'value': str(spatial_label)})

        return dataset_dict

    def graph_from_dataset(self, dataset_dict, dataset_ref):

        g = self.g

        spatial_uri = self._get_dataset_value(dataset_dict, 'spatial_uri')
        spatial_text = self._get_dataset_value(dataset_dict, 'spatial_text')

        # Use the URI when there is one, otherwise fall back to a blank node
        if spatial_uri:
            spatial_ref = URIRef(spatial_uri)
        else:
            spatial_ref = BNode()

        if spatial_text:
            g.add((dataset_ref, DCT.spatial, spatial_ref))
            g.add((spatial_ref, RDF.type, DCT.Location))
            g.add((spatial_ref, RDFS.label, Literal(spatial_text)))
Note how the dataset dict is passed between profiles so it can be further tweaked.
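The way the dataset dict travels from one profile to the next can be pictured with a small, self-contained sketch. It does not use the extension's classes: BaseProfile, SpatialProfile and run_profiles are hypothetical stand-ins, and a plain dict takes the place of the RDF graph that real profiles read from.

```python
# Simplified illustration of profile chaining; not the extension's own code.
# All class and function names here are hypothetical.


class BaseProfile:
    """Stands in for a base profile such as euro_dcat_ap."""

    def parse_dataset(self, dataset_dict, source):
        dataset_dict['title'] = source.get('title', '')
        dataset_dict.setdefault('extras', [])
        return dataset_dict


class SpatialProfile:
    """Runs after BaseProfile and only adds what the base did not cover."""

    def parse_dataset(self, dataset_dict, source):
        label = source.get('spatial_label')
        if label:
            dataset_dict['extras'].append({'key': 'spatial_text',
                                           'value': label})
        return dataset_dict


def run_profiles(profiles, source):
    """Pass the same dataset dict through each profile, in order."""
    dataset_dict = {}
    for profile in profiles:
        dataset_dict = profile.parse_dataset(dataset_dict, source)
    return dataset_dict


source = {'title': 'Air quality', 'spatial_label': 'Stockholm'}
result = run_profiles([BaseProfile(), SpatialProfile()], source)
print(result)
```

Because each profile receives the dict produced by the previous one, a custom profile only needs to handle the fields that differ from the profile it builds on.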
Extensions define their available profiles using the ckan.rdf.profiles entry point in the setup.py file, as in this example from this same extension:
[ckan.rdf.profiles]
euro_dcat_ap=ckanext.dcat.profiles:EuropeanDCATAPProfile
euro_dcat_ap_2=ckanext.dcat.profiles:EuropeanDCATAP2Profile
euro_dcat_ap_3=ckanext.dcat.profiles:EuropeanDCATAP3Profile
euro_dcat_ap_scheming=ckanext.dcat.profiles:EuropeanDCATAPSchemingProfile
schemaorg=ckanext.dcat.profiles:SchemaOrgProfile
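To check which profiles are registered in the current Python environment, the standard library can list the entry points of that group. This is a minimal sketch: list_rdf_profiles is a hypothetical helper, and it is not necessarily how the extension itself loads profiles.

```python
from importlib.metadata import entry_points


def list_rdf_profiles(group="ckan.rdf.profiles"):
    """Return a {name: target} dict for the entry points in the given group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


# The dict is empty if no installed package registers a profile
for name, target in sorted(list_rdf_profiles().items()):
    print(name, "->", target)
```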
Internals¶
RDF DCAT Parser¶
The ckanext.dcat.processors.RDFParser class reads RDF serializations in different formats and extracts CKAN dataset dicts. It looks for DCAT datasets and distributions and creates CKAN datasets and resources, as dictionaries that can be passed to package_create or package_update.
Here is a quick overview of how it works:
from ckanext.dcat.processors import RDFParser, RDFParserException

parser = RDFParser()

# Parsing a local RDF/XML file
with open('datasets.rdf', 'r') as f:
    try:
        parser.parse(f.read())

        for dataset in parser.datasets():
            print('Got dataset with title {0}'.format(dataset['title']))

    except RDFParserException as e:
        print('Error parsing the RDF file: {0}'.format(e))

# Parsing a remote JSON-LD file
import requests

parser = RDFParser()

content = requests.get('https://some.catalog.org/datasets.jsonld').content
try:
    parser.parse(content, _format='json-ld')

    for dataset in parser.datasets():
        print('Got dataset with title {0}'.format(dataset['title']))

except RDFParserException as e:
    print('Error parsing the RDF file: {0}'.format(e))
The parser is implemented using RDFLib, a Python library for working with RDF. Any RDF serialization format supported by RDFLib can be parsed into CKAN datasets. The examples folder contains serializations in different formats, including RDF/XML, Turtle and JSON-LD.
RDF DCAT Serializer¶
The ckanext.dcat.processors.RDFSerializer class generates RDF serializations in different formats from CKAN dataset dicts, like the ones returned by package_show or package_search.
Here is an example of how to use it:
from ckan.plugins.toolkit import get_action

from ckanext.dcat.processors import RDFSerializer

# Serializing a single dataset
dataset = get_action('package_show')({}, {'id': 'my-dataset'})

serializer = RDFSerializer()

dataset_ttl = serializer.serialize_dataset(dataset, _format='turtle')

# Serializing the whole catalog (or rather part of it)
# package_search returns a dict; the dataset dicts are under 'results'
datasets = get_action('package_search')({}, {'q': '*:*', 'rows': 50})['results']

serializer = RDFSerializer()

catalog_xml = serializer.serialize_catalog({'title': 'My catalog'},
                                           dataset_dicts=datasets,
                                           _format='xml')

# Creating an RDFLib graph from a single dataset
dataset = get_action('package_show')({}, {'id': 'my-dataset'})

serializer = RDFSerializer()

dataset_reference = serializer.graph_from_dataset(dataset)

# serializer.g now contains the full dataset graph, an RDFLib Graph class
The serializer uses customizable profiles to generate an RDF graph (an RDFLib Graph class). By default these use the mapping described in the previous section.
In some cases, if the default CKAN field that maps to a DCAT property is not present, other fallback values are used instead. For instance, if the contact_email field is not found, maintainer_email and author_email will be used (if present) as the email of the adms:contactPoint property.
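The fallback logic can be sketched in a few lines of plain Python. The first_present helper is hypothetical, and the order in which the fallback fields are tried (maintainer_email before author_email) is an assumption based on the order they are listed in above.

```python
def first_present(dataset_dict, keys):
    """Return the first non-empty value found under any of the given keys."""
    for key in keys:
        value = dataset_dict.get(key)
        if value:
            return value
    return None


# contact_email is missing, so the next field in the list is used
dataset = {
    'name': 'my-dataset',
    'maintainer_email': 'maintainer@example.org',
    'author_email': 'author@example.org',
}

email = first_present(
    dataset, ['contact_email', 'maintainer_email', 'author_email'])
print(email)
```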
Note that the serializer looks both for a first-level field and for an extra field with the same key, i.e. both of the following values will be used for dct:accrualPeriodicity:
{
    "name": "my-dataset",
    "frequency": "monthly",
    ...
}

{
    "name": "my-dataset",
    "extras": [
        {"key": "frequency", "value": "monthly"},
    ]
    ...
}
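A lookup that treats both layouts the same way could look like the sketch below. The get_dataset_value function is a hypothetical standalone version written for illustration; the extension's own helper may differ in details such as which location takes precedence.

```python
def get_dataset_value(dataset_dict, key, default=None):
    """Look for key as a first-level field, then among the extras."""
    if dataset_dict.get(key):
        return dataset_dict[key]

    for extra in dataset_dict.get('extras', []):
        if extra.get('key') == key:
            return extra.get('value')

    return default


top_level = {'name': 'my-dataset', 'frequency': 'monthly'}
as_extra = {
    'name': 'my-dataset',
    'extras': [{'key': 'frequency', 'value': 'monthly'}],
}

print(get_dataset_value(top_level, 'frequency'))
print(get_dataset_value(as_extra, 'frequency'))
```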
Once the dataset graph has been obtained, it is serialized into a text format using RDFLib, so any format that RDFLib supports can be obtained (common formats are 'xml', 'turtle' or 'json-ld').