HDX Python Library

Introduction

The HDX Python Library is designed to enable you to easily develop code that interacts with the Humanitarian Data Exchange platform. It provides a simple interface that communicates with the HDX JSON API which is built on top of CKAN. The underlying GET and POST requests are wrapped in Python methods. The HDX objects, such as datasets and resources, are represented by Python classes. The API documentation can be found here: http://mcarans.github.io/hdx-python-api/

Keeping it Simple

The major goal of the library is to make interacting with HDX as simple as possible for the end user. There are several ways this is achieved.

The library avoids CKAN syntax, using HDX terminology instead. Hence there is no reference to CKAN related items, only gallery items. The user does not need to learn about CKAN, which makes it easier to understand what the result will be in HDX when calling a Python method.
The class structure of the library should be as logical as possible (within the restrictions of the CKAN API it relies on). In HDX, a dataset can contain zero or more resources and a gallery (consisting of gallery items), so the library reflects this even though the CKAN API presents a different interface for gallery items to resources.
The UML diagram below shows the relationships between the major classes in the library.
Datasets, resources and gallery items can use dictionary methods like square brackets to handle metadata which feels natural. (The HDXObject class extends UserDict.) eg.
```
dataset['name'] = 'My Dataset'
```
Static metadata can be imported from a YAML file, recommended for being very human readable, or a JSON file eg.
```
dataset.update_yaml([path])
```
Static metadata can be passed in as a dictionary on initialisation of a dataset, resource or gallery item eg.
```
dataset = Dataset(configuration, {
    'name': slugified_name,
    'title': title,
    'dataset_date': dataset_date, # has to be MM/DD/YYYY
    'groups': iso
})
```

The code is very well documented. Detailed API documentation (generated from Google style docstrings using Sphinx) can be found at the link given in the Introduction above. A typical docstring looks like this:

def load_from_hdx(self, id_or_name: str) -> bool:
    """Loads the dataset given by either id or name from HDX

    Args:
        id_or_name (str): Either id or name of dataset

    Returns:
        bool: True if loaded, False if not

    """

The method arguments and return values have type hints. (Although this is a feature of Python 3.5, it has been backported.) Type hints enable sophisticated IDEs like PyCharm to warn of any inconsistencies in the use of types, bringing one of the major benefits of statically typed languages to Python.
```
def merge_dictionaries(dicts: List[dict]) -> dict:
```

Default parameters mean that there is a very easy default way to get set up and going eg.

def update_yaml(self, path: Optional[str] = join('config', 'hdx_dataset_static.yml')) -> None:

Configuration is made as simple as possible with a Configuration class that handles the HDX API key and the merging of configurations from multiple YAML or JSON files or dictionaries:

class Configuration(UserDict):
    """Configuration for HDX

    Args:
        **kwargs: See below
        hdx_key_file (Optional[str]): Path to HDX key file. Defaults to ~/.hdxkey.
        hdx_config_dict (dict): HDX configuration dictionary OR
        hdx_config_json (str): Path to JSON HDX configuration OR
        hdx_config_yaml (str): Path to YAML HDX configuration. Defaults to library's internal hdx_configuration.yml.
        collector_config_dict (dict): Collector configuration dictionary OR
        collector_config_json (str): Path to JSON Collector configuration OR
        collector_config_yaml (str): Path to YAML Collector configuration. Defaults to config/collector_configuration.yml.
    """

Logging is often neglected, so the library aims to make it easy to get going with logging and so avoid the spread of print statements. A few handlers are defined in the default configuration:

console:
    class: logging.StreamHandler
    level: DEBUG
    formatter: color
    stream: ext://sys.stdout

error_file_handler:
    class: logging.FileHandler
    level: ERROR
    formatter: simple
    filename: errors.log
    encoding: utf8
    mode: w

error_mail_handler:
    class: logging.handlers.SMTPHandler
    level: CRITICAL
    formatter: simple
    mailhost: localhost
    fromaddr: noreply@localhost

There are utility functions to handle dictionary merging, loading multiple YAML or JSON files and a few other helpful tasks eg.

def script_dir_plus_file(filename: str, pyobject: Any, follow_symlinks: Optional[bool] = True) -> str:
    """Get current script's directory and then append a filename

    Args:
        filename (str): Filename to append to directory path
        pyobject (Any): Any Python object in the script
        follow_symlinks (Optional[bool]): Follow symlinks or not. Defaults to True.

    Returns:
        str: Current script's directory and with filename appended
    """
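
Dictionary merging of the kind mentioned above can be sketched as a recursive merge in which later dictionaries override earlier ones. This is an assumption about the behaviour, not the library's actual implementation of `merge_dictionaries`; the function is named `merge_dictionaries_sketch` to make that clear.

```python
from copy import deepcopy
from typing import List


def merge_dictionaries_sketch(dicts: List[dict]) -> dict:
    """Merge dictionaries left to right; nested dictionaries are merged recursively."""
    result: dict = {}
    for d in dicts:
        for key, value in d.items():
            if isinstance(value, dict) and isinstance(result.get(key), dict):
                # Both sides are dictionaries: merge them rather than overwrite
                result[key] = merge_dictionaries_sketch([result[key], value])
            else:
                # Copy so that the inputs are never mutated through the result
                result[key] = deepcopy(value)
    return result


base = {'handlers': {'error_mail_handler': {'level': 'CRITICAL'}}}
override = {'handlers': {'error_mail_handler': {'toaddrs': 'EMAIL_ADDRESSES'}}}
merged = merge_dictionaries_sketch([base, override])
```

Here `merged` keeps the `level` from `base` and gains `toaddrs` from `override`, while both inputs are left unchanged.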

There are setup wrappers to which the collector's main function is passed. They neatly cloak the setup of logging and one of them hides the required calls for pushing status into ScraperWiki (used internally in HDX) eg.
```
from datetime import datetime

from hdx.collector.scraperwiki import wrapper


def main(configuration):
    # generate_dataset is a function you define in your collector
    dataset = generate_dataset(configuration, datetime.now())
    ...


if __name__ == '__main__':
    wrapper(main)
```

Creating the API Key File

The first task is to create an API key file. By default this is assumed to be called .hdxkey and is located in the current user's home directory (~). Assuming you are using a desktop browser, the API key is obtained by:

Browse to the HDX website
Left click on LOG IN in the top right of the web page if not logged in and log in
Left click on your username in the top right of the web page and select PROFILE from the drop down menu
Scroll down to the bottom of the profile page
Copy the API key which will be of the form xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Paste the API key into a text file
Save the text file with filename ".hdxkey" in the current user's home directory
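
Once the file is saved, reading it back is straightforward. The sketch below is only an illustration: `load_api_key` and `looks_like_api_key` are hypothetical helpers, not part of the library, and the check assumes the key consists of alphanumeric groups in the 8-4-4-4-12 layout shown above.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed shape of the key: five alphanumeric groups of 8-4-4-4-12 characters
KEY_PATTERN = re.compile(r'^[0-9A-Za-z]{8}(-[0-9A-Za-z]{4}){3}-[0-9A-Za-z]{12}$')


def load_api_key(path: Optional[str] = None) -> str:
    """Read the API key from a text file, defaulting to ~/.hdxkey."""
    key_path = Path(path) if path else Path.home() / '.hdxkey'
    # Strip the trailing newline most editors add when saving
    return key_path.read_text(encoding='utf-8').strip()


def looks_like_api_key(key: str) -> bool:
    """Return True if the key matches the assumed 8-4-4-4-12 layout."""
    return KEY_PATTERN.match(key) is not None
```

Calling `looks_like_api_key(load_api_key())` after saving the file is a quick way to catch a stray space or a partially copied key.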

Building the Data Collector

The easiest way to get started is to use the wrappers and configuration defaults. You will most likely just need the simple wrapper. If you are in the HDX team, you may need to use the ScraperWiki wrapper which reports status to that platform (in which case replace "simple" with "scraperwiki" in the code below):

from hdx.collector.simple import wrapper

def main(configuration):
    ***YOUR CODE HERE***

if __name__ == '__main__':
    wrapper(main)

The wrapper sets up both logging and HDX configuration, the latter being passed to your main function in the "configuration" argument above.
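
To make this division of labour concrete, here is a minimal sketch of what such a wrapper could look like. `wrapper_sketch` and the plain dictionary standing in for the configuration are hypothetical; the library's real wrappers build a `Configuration` object and use the library's own logging setup.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


def wrapper_sketch(collector_main: Callable[[Dict[str, Any]], None], **kwargs: Any) -> None:
    """Set up logging, build a configuration and hand it to the collector."""
    logging.basicConfig(level=logging.DEBUG)
    # Stand-in for the real Configuration object: just the keyword arguments
    configuration = dict(kwargs)
    try:
        collector_main(configuration)
    except Exception:
        # Log the full traceback at CRITICAL level, then re-raise
        logger.critical('Collector failed', exc_info=True)
        raise


received = {}


def main(configuration):
    # A real collector would create datasets here
    received.update(configuration)


wrapper_sketch(main, collector_config_dict={})
```

Keeping setup in the wrapper means the collector's `main` contains only collector logic and can be tested by calling it directly with a configuration of your choosing.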

Setting up the Configuration

The default configuration loads an internal HDX configuration located within the library. It assumes that there is an API key file called .hdxkey in the current user's home directory and a YAML collector configuration located at config/collector_configuration.yml, which you must create. The collector configuration is used for any configuration specific to your collector.

It is possible to pass configuration parameters in the wrapper call eg.

wrapper(main, hdx_key_file=LOCATION_OF_HDX_KEY_FILE, hdx_config_yaml=PATH_TO_HDX_YAML_CONFIGURATION,
        collector_config_dict={'MY_PARAMETER': 'MY_VALUE'})

If you did not need a collector configuration, you could simply provide an empty dictionary eg.

wrapper(main, collector_config_dict = {})

If you do not use the wrapper, you can use the Configuration class directly, passing in appropriate keyword arguments ie.

from hdx.configuration import Configuration
...
cfg = Configuration(ARGUMENTS)

ARGUMENTS can be:

Choose	Argument	Type	Value	Default
	hdx_key_file	Optional[str]	Path to HDX key file	~/.hdxkey
One of:	hdx_config_dict	dict	HDX configuration dictionary
	hdx_config_json	str	Path to JSON HDX configuration
	hdx_config_yaml	str	Path to YAML HDX configuration	Library's internal hdx_configuration.yml
One of:	collector_config_dict	dict	Collector configuration dictionary
	collector_config_json	str	Path to JSON Collector configuration
	collector_config_yaml	str	Path to YAML Collector configuration	config/collector_configuration.yml
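
The "one of" rule in the table can be sketched as follows. `resolve_config` is a hypothetical helper written for this illustration, not the library's code, and it handles only the dictionary and JSON variants because YAML parsing needs a third-party package.

```python
import json
from typing import Any, Optional


def resolve_config(prefix: str, default: Optional[dict] = None, **kwargs: Any) -> dict:
    """Return the configuration selected by at most one of <prefix>_dict / <prefix>_json."""
    dict_key, json_key = prefix + '_dict', prefix + '_json'
    # An empty dictionary still counts as supplied, so test against None
    supplied = [k for k in (dict_key, json_key) if kwargs.get(k) is not None]
    if len(supplied) > 1:
        raise ValueError('Supply only one of: ' + ', '.join(supplied))
    if dict_key in supplied:
        return dict(kwargs[dict_key])
    if json_key in supplied:
        with open(kwargs[json_key], encoding='utf-8') as f:
            return json.load(f)
    # Nothing supplied: fall back to the default
    return dict(default or {})


collector_config = resolve_config('collector_config',
                                  collector_config_dict={'MY_PARAMETER': 'MY_VALUE'})
```

Raising an error when two sources are given for the same configuration avoids silently preferring one over the other.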

Configuring Logging

The default logging configuration reads a configuration file internal to the library that sets up a coloured console handler outputting at DEBUG level, a file handler writing to errors.log at ERROR level and an SMTP handler sending an email in the event of a CRITICAL error. It assumes that you have created a file config/smtp_configuration.yml which contains parameters of the form:

handlers:
    error_mail_handler:
        toaddrs: EMAIL_ADDRESSES
        subject: "COLLECTOR FAILED: MY_COLLECTOR_NAME"

If you wish to change the logging configuration from the defaults, you will need to call setup_logging with arguments. If you have used the simple or ScraperWiki wrapper, you must make the call after the import line for the wrapper.

import logging

from hdx.logging import setup_logging
...
logger = logging.getLogger(__name__)
setup_logging(ARGUMENTS)

ARGUMENTS can be:

Choose	Argument	Type	Value	Default
One of:	logging_config_dict	dict	Logging configuration dictionary
	logging_config_json	str	Path to JSON Logging configuration
	logging_config_yaml	str	Path to YAML Logging configuration	Library's internal logging_configuration.yml
One of:	smtp_config_dict	dict	Email Logging configuration dictionary
	smtp_config_json	str	Path to JSON Email Logging configuration
	smtp_config_yaml	str	Path to YAML Email Logging configuration	config/smtp_configuration.yml