Load Datasets¶
You can upload individual datasets through the CKAN front-end, but for importing datasets en masse, you have two choices:
- Import Data with the CKAN API. You can use the CKAN API to script imports. To simplify matters, we provide standard loading scripts for Google Spreadsheets, CSV and Excel.
- Import Data with the Harvester Extension. The CKAN harvester extension provides web and command-line interfaces for larger import tasks.
If you need advice on data import, contact the ckan-dev mailing list.
Note
If loading your data requires scraping a web page regularly, you may find it best to write a scraper on ScraperWiki and combine this with either of the methods above.
Import Data with the CKAN API¶
You can use the CKAN API to upload datasets directly into your CKAN instance.
The Simplest Approach - CKAN API¶
The simplest way to automate dataset loading is with a Python script using CKAN’s API. Here’s an example script to create a new dataset:
#!/usr/bin/env python3
import json
import pprint
import urllib.parse
import urllib.request

# Put the details of the dataset we're going to create into a dict.
dataset_dict = {
    'name': 'my_dataset_name',
    'notes': 'A long description of my dataset',
}

# Use the json module to dump the dictionary to a string for posting.
# In Python 3, urlopen() needs the POST body as bytes, hence encode().
data_string = urllib.parse.quote(json.dumps(dataset_dict)).encode('utf-8')

# We'll use the package_create function to create a new dataset.
request = urllib.request.Request(
    'http://www.my_ckan_site.com/api/action/package_create')

# Creating a dataset requires an authorization header.
# Replace *** with your API key, from your user account on the CKAN site
# that you're creating the dataset on.
request.add_header('Authorization', '***')

# Make the HTTP request.
response = urllib.request.urlopen(request, data_string)
assert response.status == 200

# Use the json module to load CKAN's response into a dictionary.
response_dict = json.loads(response.read())
assert response_dict['success'] is True

# package_create returns the created package as its result.
created_package = response_dict['result']
pprint.pprint(created_package)
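To load many datasets rather than one, the same request pattern can be driven from a file. The following is a minimal sketch written for Python 3, assuming a CSV file with `name` and `notes` columns; the helper names `rows_to_dataset_dicts` and `build_request`, the sample rows, and the placeholder site URL and API key are illustrative assumptions, not part of CKAN. It only prepares the requests and does not send them.

```python
import csv
import io
import json
import urllib.parse
import urllib.request

# Assumptions: placeholder site URL and API key, as in the script above.
ACTION_URL = 'http://www.my_ckan_site.com/api/action/package_create'
API_KEY = '***'


def rows_to_dataset_dicts(csv_file):
    """Turn CSV rows with 'name' and 'notes' columns into dataset dicts."""
    return [
        {'name': row['name'].strip(), 'notes': row['notes'].strip()}
        for row in csv.DictReader(csv_file)
    ]


def build_request(dataset_dict):
    """Prepare (but do not send) a package_create request for one dataset."""
    data = urllib.parse.quote(json.dumps(dataset_dict)).encode('utf-8')
    request = urllib.request.Request(ACTION_URL, data=data)
    request.add_header('Authorization', API_KEY)
    return request


# Hypothetical sample data standing in for a real CSV file on disk.
sample_csv = io.StringIO(
    'name,notes\n'
    'air_quality_2011,Hourly readings\n'
    'road_traffic_2011,Annual counts\n'
)
requests_to_send = [build_request(d) for d in rows_to_dataset_dicts(sample_csv)]
```

To create the datasets, each prepared request would be passed to `urllib.request.urlopen` and the response checked as in the single-dataset script.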
Loader Scripts¶
‘Loader scripts’ provide a simple way to take metadata in any format and bulk upload it to a remote CKAN instance.
Essentially each set of loader scripts converts the dataset metadata to the standard ‘dataset’ format, and then loads it into CKAN.
To get a flavour of what loader scripts look like, take a look at the ONS scripts.
Loader Scripts for CSV and Excel¶
For CSV and Excel formats, the SpreadsheetPackageImporter (found in ckanext-importlib/ckanext/importlib/spreadsheet_importer.py) loader script wraps the file in SpreadsheetData before extracting the records into SpreadsheetDataRecords.
SpreadsheetPackageImporter copes with multiple title rows, data on multiple sheets, and dates. The loader can reload datasets based on a unique key column in the spreadsheet, choose unique names for datasets if there is a clash, add or merge new resources for existing datasets, and manage dataset groups.
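As an illustration of what choosing a unique name on a clash can look like, here is a minimal sketch in Python 3. It is not the algorithm used by SpreadsheetPackageImporter; the function `choose_unique_name` and the numeric-suffix scheme are assumptions made for this example.

```python
def choose_unique_name(name, existing_names):
    """Return name unchanged, or with the lowest free numeric suffix if taken."""
    if name not in existing_names:
        return name
    suffix = 2
    # Hypothetical naming scheme: census, census_2, census_3, ...
    while '%s_%d' % (name, suffix) in existing_names:
        suffix += 1
    return '%s_%d' % (name, suffix)


existing = {'census', 'census_2'}
clashing = choose_unique_name('census', existing)   # 'census_3'
free = choose_unique_name('budget', existing)       # 'budget'
```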
Loader Scripts for Google Spreadsheets¶
The SimpleGoogleSpreadsheetLoader class (found in ckanclient.loaders.base) simplifies the process of loading data from Google Spreadsheets (there is an additional dependency on the gdata Python package).
This script has a simple example of loading data from Google Spreadsheets.
Write Your Own Loader Script¶
First, you need an importer that derives from PackageImporter (found in ckan/lib/importer.py). This takes whatever format the metadata is in and sorts it into records of type DataRecord.
Next, each DataRecord is converted into the correct fields for a dataset using the record_2_package method. This results in dataset dictionaries.
The PackageLoader takes the dataset dictionaries and loads them onto a CKAN instance using the ckanclient. There are various settings to determine:
- how to identify a dataset that has previously been loaded into CKAN. This can be simply by name or by an identifier stored in another field.
- how to merge in changes to an existing dataset. The loader can simply replace the dataset, or merge in new resources and other changes.
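The flow described above (records, then dataset dictionaries, then a create-or-merge decision) can be sketched in Python 3 as follows. This is a simplified illustration using plain dicts only: it does not use PackageImporter, PackageLoader or ckanclient, and the record fields (`ID`, `Title`, `URL`), the `external_id` key, and the helper names other than `record_2_package` are assumptions made for the example.

```python
def record_2_package(record):
    """Map one raw record onto dataset fields (hypothetical field names)."""
    return {
        'name': record['Title'].lower().replace(' ', '_'),
        'title': record['Title'],
        'extras': {'external_id': record['ID']},
        'resources': [{'url': record['URL']}],
    }


def find_existing(package, existing_packages):
    """Identify a previously loaded dataset by an identifier in another field."""
    wanted = package['extras']['external_id']
    for existing in existing_packages:
        if existing['extras'].get('external_id') == wanted:
            return existing
    return None


def merge_resources(existing, package):
    """Keep the existing dataset, adding only resources whose URL is new."""
    merged = dict(existing)
    known_urls = {r['url'] for r in existing['resources']}
    merged['resources'] = existing['resources'] + [
        r for r in package['resources'] if r['url'] not in known_urls
    ]
    return merged


def load(records, existing_packages):
    """Split records into datasets to create and datasets to update."""
    to_create, to_update = [], []
    for record in records:
        package = record_2_package(record)
        existing = find_existing(package, existing_packages)
        if existing is None:
            to_create.append(package)
        else:
            to_update.append(merge_resources(existing, package))
    return to_create, to_update


# Hypothetical sample data.
records = [
    {'ID': 'A1', 'Title': 'Air Quality', 'URL': 'http://example.org/air-2012.csv'},
    {'ID': 'B2', 'Title': 'Road Traffic', 'URL': 'http://example.org/roads.csv'},
]
existing_packages = [{
    'name': 'air_quality',
    'title': 'Air Quality',
    'extras': {'external_id': 'A1'},
    'resources': [{'url': 'http://example.org/air-2011.csv'}],
}]
to_create, to_update = load(records, existing_packages)
```

Here the first record matches an existing dataset by identifier and gains a second resource, while the second record becomes a new dataset. Swapping `merge_resources` for a function that returns `package` unchanged would give the simpler replace behaviour.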
The loader should be given a command-line interface using the Command base class (ckanext/command.py).
Add a line to the CKAN setup.py (under [console_scripts]). When you then run python setup.py develop, it creates a script for you in your Python environment.
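As a rough sketch, the setup.py entry might look like the fragment below. The script name `my_loader` and the target `ckanext.myloader.command:load` are hypothetical placeholders for your own module and callable; only the `[console_scripts]` section name comes from the text above.

```python
# Fragment of setup.py. The script name and module path are hypothetical.
entry_points = """
[console_scripts]
my_loader = ckanext.myloader.command:load
"""
```

This string would be passed as the `entry_points` argument of the `setup()` call, after which `python setup.py develop` should generate a `my_loader` script that calls the named function.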
Import Data with the Harvester Extension¶
The CKAN harvester extension provides useful tools for more advanced data imports.
These include a command-line interface and a web user interface for running harvesting jobs.
To use the harvester extension, create a class that derives from the harvester extension's base class and implements the harvester interface <https://github.com/okfn/ckanext-harvest/blob/master/ckanext/harvest/interfaces.py>.
For more information on working with extensions, see Add Extensions.
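To give a feel for the shape of a harvester, here is a self-contained Python 3 sketch with no CKAN imports. The method names (`info`, `gather_stage`, `fetch_stage`, `import_stage`) reflect my reading of the linked interfaces.py and should be checked against that file; the in-memory `REMOTE` source, the dict stand-ins for harvest jobs and harvest objects, and the return value of `import_stage` are assumptions made for illustration. A real harvester would derive from the extension's base class and would create datasets in CKAN rather than return dicts.

```python
import json

# Stand-in for a remote catalogue; a real harvester would fetch over HTTP.
REMOTE = {
    'doc-1': {'title': 'Air Quality'},
    'doc-2': {'title': 'Road Traffic'},
}


class ExampleHarvester(object):
    """Hypothetical harvester illustrating the gather/fetch/import stages."""

    def info(self):
        # Describe this harvester type.
        return {
            'name': 'example',
            'title': 'Example',
            'description': 'In-memory example source',
        }

    def gather_stage(self, harvest_job):
        # Work out which remote documents exist: one object per document.
        return [{'guid': guid, 'content': None} for guid in sorted(REMOTE)]

    def fetch_stage(self, harvest_object):
        # Retrieve the raw content for a single document.
        harvest_object['content'] = json.dumps(REMOTE[harvest_object['guid']])
        return True

    def import_stage(self, harvest_object):
        # Convert the fetched content into a dataset dictionary.
        record = json.loads(harvest_object['content'])
        return {'name': harvest_object['guid'], 'title': record['title']}


harvester = ExampleHarvester()
datasets = []
for harvest_object in harvester.gather_stage({'source': 'in-memory'}):
    if harvester.fetch_stage(harvest_object):
        datasets.append(harvester.import_stage(harvest_object))
```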
