Introduction
The HDX Python Library is designed to simplify use of the HDX JSON API, which is built on top of CKAN. The underlying GET and POST requests are wrapped in Python methods. The HDX objects, such as datasets and resources, are represented by Python classes. The API documentation can be found here: http://mcarans.github.io/hdx-python-api/
Keeping it Simple
The major goal of the library is to make interacting with HDX as simple as possible for the user. There are several ways this is achieved.
- The library avoids CKAN syntax and uses HDX terminology instead. Hence there is no reference to CKAN related items, only gallery items. The user does not need to learn about CKAN, which makes it easier to understand what the result will be in HDX when calling a Python method.
- The class structure of the library should be as logical as possible (within the restrictions of the CKAN API it relies on). In HDX, a dataset can contain zero or more resources and a gallery (consisting of gallery items), so the library reflects this even though the CKAN API presents a different interface for gallery items to resources.
The UML diagram below shows the relationships between the major classes in the library.
- Datasets, resources and gallery items support dictionary methods, such as square brackets, for handling metadata, which feels natural. (The HDXObject class extends UserDict.) eg.
dataset['name'] = 'My Dataset'
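To make the dictionary-style access concrete, here is a minimal sketch of a class that extends `UserDict`. The class name `MetadataObject` is hypothetical and is not the library's `HDXObject`; it only illustrates how square brackets and other dictionary methods come for free from `UserDict`.

```python
from collections import UserDict


class MetadataObject(UserDict):
    """Hypothetical stand-in for an HDX object (not the library's HDXObject)."""

    def __init__(self, initial_data=None):
        # UserDict stores the metadata in self.data and exposes dict methods.
        super().__init__(initial_data or {})


dataset = MetadataObject({'title': 'My Title'})
dataset['name'] = 'My Dataset'        # square-bracket assignment
has_title = 'title' in dataset        # membership test
keys = sorted(dataset.keys())         # ordinary dictionary methods
```

Because the metadata lives in an ordinary mapping, anything that works on a `dict` (iteration, `get`, `update`) should work on such an object as well.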
- Static metadata can be imported from a YAML file (recommended because it is very human readable) or a JSON file eg.
dataset.update_yaml([path])
- Static metadata can be passed in as a dictionary on initialisation of a dataset, resource or gallery item eg.
dataset = Dataset(configuration, {
    'name': slugified_name,
    'title': title,
    'dataset_date': dataset_date,  # has to be MM/DD/YYYY
    'groups': iso
})
- The code is very well documented. Detailed API documentation (generated from Google style docstrings using Sphinx) can be found at the link given in the Introduction above.
def load_from_hdx(self, id_or_name: str) -> bool:
    """Loads the dataset given by either id or name from HDX

    Args:
        id_or_name (str): Either id or name of dataset

    Returns:
        bool: True if loaded, False if not
    """
- The method arguments and return values have type hints. (Although this is a feature of Python 3.5, it has been backported.) Type hints enable sophisticated IDEs like PyCharm to warn of any inconsistencies in the use of types, bringing one of the major benefits of statically typed languages to Python.
def merge_dictionaries(dicts: List[dict]) -> dict:
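As a rough illustration of what type hints make available to tools, the sketch below gives a stand-in `merge_dictionaries` a trivial body and reads its annotations back with `typing.get_type_hints`. The shallow merge is an assumption made for the example, not the library's implementation.

```python
from typing import List, get_type_hints


def merge_dictionaries(dicts: List[dict]) -> dict:
    """Illustrative stand-in: shallow merge in which later dictionaries win."""
    merged = {}
    for dictionary in dicts:
        merged.update(dictionary)
    return merged


# The annotations are ordinary runtime data that IDEs and checkers can inspect.
hints = get_type_hints(merge_dictionaries)
result = merge_dictionaries([{'a': 1}, {'a': 2, 'b': 3}])
```

A static checker would be expected to flag a call such as `merge_dictionaries('not a list')`, since a `str` is not a `List[dict]`, even though Python itself does not enforce the hint at runtime.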
- Default parameters mean that there is a very easy default way to get set up and going eg.
def update_yaml(self, path: Optional[str] = join('config', 'hdx_dataset_static.yml')) -> None:
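The effect of a default path can be sketched with the standard library alone. The example below uses JSON rather than YAML so that it has no third-party dependency; the class `StaticMetadata`, the method `update_json` and the file name `hdx_dataset_static.json` are hypothetical names chosen for illustration, not the library's API.

```python
import json
import os
import tempfile
from collections import UserDict
from os.path import join
from typing import Optional


class StaticMetadata(UserDict):
    """Hypothetical dict-like object (not the library's Dataset class)."""

    def update_json(self, path: Optional[str] = join('config', 'hdx_dataset_static.json')) -> None:
        """Merge the top-level keys of a JSON file into this object."""
        with open(path, encoding='utf-8') as json_file:
            self.update(json.load(json_file))


original_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as tmpdir:
    # Create config/hdx_dataset_static.json so the default path resolves.
    os.makedirs(join(tmpdir, 'config'))
    with open(join(tmpdir, 'config', 'hdx_dataset_static.json'), 'w', encoding='utf-8') as json_file:
        json.dump({'description': 'Static description'}, json_file)
    os.chdir(tmpdir)
    try:
        metadata = StaticMetadata({'name': 'my-dataset'})
        metadata.update_json()  # no argument, so the default path is used
    finally:
        os.chdir(original_cwd)
```

A caller who follows the conventional layout never needs to mention the path, while anyone with a different layout can still pass one explicitly.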
- Configuration is made as simple as possible with a Configuration class that handles the HDX API key and the merging of configurations from multiple YAML or JSON files or dictionaries:
class Configuration(UserDict):
    """Configuration for HDX

    Args:
        hdx_key_file (Optional[str]): Path to HDX key file. Defaults to ~/.hdxkey
        **kwargs: See below
        hdx_config_dict (dict): HDX configuration dictionary OR
        hdx_config_json (str): Path to JSON HDX configuration OR
        hdx_config_yaml (str): Path to YAML HDX configuration. Defaults to internal hdx_configuration.yml.
        scraper_config_dict (dict): Scraper configuration dictionary OR
        scraper_config_json (str): Path to JSON Scraper configuration OR
        scraper_config_yaml (str): Path to YAML Scraper configuration. Defaults to internal scraper_configuration.yml.
    """
- Logging is something often neglected so the library aims to make it a breeze to get going with logging and so avoid the spread of print statements. A few logging handlers are created in the default configuration:
console:
    class: logging.StreamHandler
    level: DEBUG
    formatter: color
    stream: ext://sys.stdout

error_file_handler:
    class: logging.FileHandler
    level: ERROR
    formatter: simple
    filename: errors.log
    encoding: utf8
    mode: w

error_mail_handler:
    class: logging.handlers.SMTPHandler
    level: CRITICAL
    formatter: simple
    mailhost: localhost
    fromaddr: noreply@localhost
- There are utility functions to handle dictionary merging, loading multiple YAML or JSON files and a few other helpful tasks eg.
def script_dir_plus_file(filename: str, pyobject: Any, follow_symlinks: Optional[bool] = True) -> str:
    """Get current script's directory and then append a filename

    Args:
        filename (str): Filename to append to directory path
        pyobject (Any): Any Python object in the script
        follow_symlinks (Optional[bool]): Follow symlinks or not. Defaults to True.

    Returns:
        str: Current script's directory and with filename appended
    """
- There are setup wrappers to which the scraper's main function is passed. They neatly cloak the setup of logging and one of them hides the required calls for pushing status into ScraperWiki (used internally in HDX) eg.
from datetime import datetime

from hdx.collector.scraperwiki import wrapper

def main(configuration):
    dataset = generate_dataset(configuration, datetime.now())
    ...

if __name__ == '__main__':
    wrapper(main)
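To show the general shape of such a wrapper, here is a minimal sketch that configures logging with `logging.config.dictConfig` and then calls the scraper's main function. It is not the library's wrapper: the name `sketch_wrapper`, the single console handler, the `simple` formatter string and the way keyword arguments become the configuration are all assumptions made for illustration.

```python
import logging
import logging.config
from typing import Callable

# Assumed logging setup: one console handler, loosely modelled on the defaults.
LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'simple': {'format': '%(asctime)s - %(name)s - %(levelname)s - %(message)s'},
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'level': 'DEBUG',
            'formatter': 'simple',
            'stream': 'ext://sys.stdout',
        },
    },
    'root': {'level': 'INFO', 'handlers': ['console']},
}


def sketch_wrapper(scraper_main: Callable[[dict], None], **kwargs) -> None:
    """Hypothetical wrapper: set up logging, then run the scraper."""
    logging.config.dictConfig(LOGGING_CONFIG)
    logger = logging.getLogger(__name__)
    configuration = dict(kwargs)  # assumption: kwargs become the configuration
    try:
        scraper_main(configuration)
    except Exception:
        # Log the failure with its traceback, then let it propagate.
        logger.critical('Scraper failed', exc_info=True)
        raise


calls = []


def main(configuration):
    calls.append(configuration)


sketch_wrapper(main, my_parameter='my_value')
```

Keeping the setup in one function means that a scraper script only has to define `main` and hand it over, which is the pattern the library's wrappers follow.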
Getting Started
The first task is to decide if the scraper will report status to ScraperWiki or not. Unless you are in the HDX team, you will use the simple wrapper (otherwise replace "simple" with "scraperwiki" in the code below):
from hdx.collector.simple import wrapper

def main(configuration):
    ***YOUR CODE HERE***

if __name__ == '__main__':
    wrapper(main)
The wrapper sets up both logging and the HDX configuration, which is passed to your main function as the "configuration" argument above.
The default configuration assumes the internal HDX configuration (hdx_configuration.yml).
It is possible to pass configuration parameters in the wrapper call eg.
wrapper(main, hdx_key_file=LOCATION_OF_HDX_KEY_FILE, hdx_config_yaml=PATH_TO_HDX_YAML_CONFIGURATION, scraper_config_dict={'MY_PARAMETER': 'MY_VALUE'})
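One way keyword arguments like these could be resolved into a single configuration is sketched below. The helper names `load_one_source` and `build_configuration`, the key `site_url`, the shallow merge and the error raised when two alternatives are given together are all assumptions; only the parameter names ending in `_config_dict` and `_config_json` come from the docstring shown earlier, and YAML is left out to keep the sketch free of third-party dependencies.

```python
import json
from typing import Any, Dict


def load_one_source(prefix: str, kwargs: Dict[str, Any]) -> dict:
    """Return the configuration for one prefix ('hdx' or 'scraper')."""
    as_dict = kwargs.get(prefix + '_config_dict')
    as_json = kwargs.get(prefix + '_config_json')
    if as_dict is not None and as_json is not None:
        # The docstring lists the sources as alternatives (OR), so reject both.
        raise ValueError(f'Pass either {prefix}_config_dict or {prefix}_config_json, not both')
    if as_json is not None:
        with open(as_json, encoding='utf-8') as json_file:
            return json.load(json_file)
    return dict(as_dict or {})


def build_configuration(**kwargs: Any) -> dict:
    """Hypothetical merge: scraper settings are layered over HDX settings."""
    configuration = load_one_source('hdx', kwargs)
    configuration.update(load_one_source('scraper', kwargs))
    return configuration


configuration = build_configuration(
    hdx_config_dict={'site_url': 'https://example.org'},  # placeholder key and value
    scraper_config_dict={'MY_PARAMETER': 'MY_VALUE'},
)
```

Note that the scraper settings must be a dictionary (`{'MY_PARAMETER': 'MY_VALUE'}`); writing the pair with a comma instead of a colon would create a set.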
If you wish to change the logging configuration from the defaults