
Introduction

The HDX Python Library is designed to enable you to easily develop code that interacts with the Humanitarian Data Exchange (HDX) platform. The major goal of the library is to make pushing and pulling data from HDX as simple as possible for the end user, and there are several ways this is achieved. It provides a simple interface that communicates with the HDX JSON API, which is built on top of CKAN, the open-source data management system: the underlying GET and POST requests are wrapped in Python methods, and HDX objects, such as datasets and resources, are represented by Python classes. This should make the learning curve gentle and enable users to quickly get started with using HDX programmatically. The API documentation can be found here: http://mcarans.github.io/hdx-python-api/. The code for the library is here: https://github.com/mcarans/hdx-python-api.

You can jump to the Getting Started page or continue reading below about the purpose and design philosophy of the library.

...

Keeping it Simple

  1. The library avoids CKAN syntax, instead using HDX terminology. Hence there is no reference to CKAN related items, only gallery items. The library hides CKAN's idiosyncrasies and tries to match the HDX user interface experience. The user does not need to learn about CKAN, and the library makes it easier to understand what the result in HDX will be when calling a Python method.

  2. The class structure of the library should be as logical as possible (within the restrictions of the CKAN API it relies on). In HDX, a dataset can contain zero or more resources and a gallery (consisting of gallery items), so the library reflects this even though the CKAN API presents a different interface for gallery items than for resources.

    The UML diagram below shows the relationships between the major classes in the library.  

     

    [UML class diagram: Classes]

    Gallery items come from a plugin and are not part of the core CKAN API.


  3. Datasets, resources and gallery items can use dictionary methods like square brackets to handle metadata, which feels natural. (The HDXObject class extends UserDict.) eg.

    Code Block
    dataset['name'] = 'My Dataset'
     


  4. Static metadata can be imported from a YAML file, which is recommended for being very human readable, or from a JSON file eg.

    Code Block
    dataset.update_yaml([path])

    Static metadata can be passed in as a dictionary on initialisation of a dataset, resource or gallery item eg.

    Code Block
    dataset = Dataset(configuration, {
        'name': slugified_name,
        'title': title,
    })


  5. There are functions to help with adding more complicated types like dates and date ranges, locations etc. eg.

    Code Block
    dataset.set_date_of_dataset('START DATE', 'END DATE')


  6. The code is very well documented. Detailed API documentation (generated from Google style docstrings using Sphinx) can be found in the Introduction above.

    Code Block
    def load_from_hdx(self, id_or_name: str) -> bool:
        """Loads the dataset given by either id or name from HDX

        Args:
            id_or_name (str): Either id or name of dataset

        Returns:
            bool: True if loaded, False if not
        """

    IDEs can take advantage of the documentation.


  7. The method arguments and return values have type hints. (Although this is a feature of Python 3.5, it has been backported.) Type hints enable sophisticated IDEs like PyCharm to warn of any inconsistencies in using types, bringing one of the major benefits of statically typed languages to Python eg.

    Code Block
    def merge_dictionaries(dicts: List[dict]) -> dict:


  8. Default parameters mean that there is a very easy default way to get set up and going eg.

    Code Block
    def update_yaml(self, path: Optional[str] = join('config', 'hdx_dataset_static.yml')) -> None:


  9. There are separate country code and utility libraries that provide functions to handle converting between country codes, dictionary merging, loading multiple YAML or JSON files and a few other helpful tasks eg. 

    Code Block
    Country.get_iso3_country_code_fuzzy('Czech Rep.')

Easy Configuration and Logging

  1. Logging is something often neglected, so the library aims to make it a breeze to get going with logging and so avoid the spread of print statements. A few handlers are created in the default configuration:

    Code Block
    console:
        class: logging.StreamHandler
        level: DEBUG
        formatter: color
        stream: ext://sys.stdout
    Code Block
    error_file_handler:
        class: logging.FileHandler
        level: ERROR
        formatter: simple
        filename: errors.log
        encoding: utf8
        mode: w


  2. If using the default logging configuration, then it is possible to also add the default email (SMTP) handler: 

    Code Block
    error_mail_handler:
        class: logging.handlers.SMTPHandler
        level: CRITICAL
        formatter: simple
        mailhost: localhost
        fromaddr: noreply@localhost
  3. Configuration is made as simple as possible with a Configuration class that handles the HDX API key and the merging of configurations from multiple YAML or JSON files or dictionaries:

    Code Block
    class Configuration(UserDict):
        """Configuration for HDX

        Args:
            **kwargs: See below
            hdx_key_file (Optional[str]): Path to HDX key file. Defaults to ~/.hdxkey.
            hdx_config_dict (dict): HDX configuration dictionary OR
            hdx_config_json (str): Path to JSON HDX configuration OR
            hdx_config_yaml (str): Path to YAML HDX configuration. Defaults to library's internal hdx_configuration.yml.
            collector_config_dict (dict): Collector configuration dictionary OR
            collector_config_json (str): Path to JSON collector configuration OR
            collector_config_yaml (str): Path to YAML collector configuration. Defaults to config/collector_configuration.yml.
        """
     

  5. The library itself uses logging at appropriate levels to ensure that it is clear what operations are being performed eg.

    Code Block
    WARNING - 2016-06-07 11:08:04 - hdx.data.dataset - Dataset exists. Updating acled-conflict-data-for-africa-realtime-2016
     


  6. The library makes errors plain by throwing exceptions rather than returning False or None (except where that would be more appropriate) eg.

    Code Block
    hdx.configuration.ConfigurationError: More than one collector configuration file given!
  7. There are utility functions to handle dictionary merging, loading multiple YAML or JSON files and a few other helpful tasks eg.

    Code Block
    def script_dir_plus_file(filename: str, pyobject: Any, follow_symlinks: Optional[bool] = True) -> str:
        """Get current script's directory and then append a filename

        Args:
            filename (str): Filename to append to directory path
            pyobject (Any): Any Python object in the script
            follow_symlinks (Optional[bool]): Follow symlinks or not. Defaults to True.

        Returns:
            str: Current script's directory and with filename appended
        """
  8. There are setup wrappers to which the collector's main function is passed. They neatly cloak the setup of logging, and one of them hides the required calls for pushing status into ScraperWiki (used internally in HDX) eg.

    Code Block
    from hdx.collector.scraperwiki import wrapper

    def main(configuration):
        dataset = generate_dataset(configuration, datetime.now())
        ...

    if __name__ == '__main__':
        wrapper(main)
...

Documentation of the API

...


...

  1. Browse to the HDX website
  2. If not already logged in, left click on LOG IN in the top right of the web page and log in
  3. Left click on your username in the top right of the web page and select PROFILE from the drop down menu
  4. Scroll down to the bottom of the profile page
  5. Copy the API key which will be of the form xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  6. Paste the API key into a text file
  7. Save the text file with filename ".hdxkey" in the current user's home directory

Starting the Data Collector

To include the HDX Python library in your project, add the following to your requirements.txt file:

git+git://github.com/mcarans/hdx-python-api.git#egg=hdx-python-api

The easiest way to get started is to use the wrappers and configuration defaults. You will most likely just need the simple wrapper. If you are in the HDX team, you may need to use the ScraperWiki wrapper which reports status to that platform (in which case replace "simple" with "scraperwiki" in the code below):

from hdx.collector.simple import wrapper

def main(configuration):
    ***YOUR CODE HERE***

if __name__ == '__main__':
    wrapper(main)

The wrapper sets up both logging and HDX configuration, the latter being passed to your main function in the "configuration" argument above.
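
For example, a minimal sketch of a main function that reads a project-specific value from the configuration it is given (MY_BASE_URL is a hypothetical key you would define in your collector configuration):

from hdx.collector.simple import wrapper

def main(configuration):
    # configuration extends UserDict, so collector settings can be read like a dictionary
    base_url = configuration['MY_BASE_URL']
    ...

if __name__ == '__main__':
    wrapper(main)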

Setting up the Configuration

The default configuration loads an internal HDX configuration located within the library, and assumes that there is an API key file called .hdxkey in the current user's home directory and a YAML collector configuration located at config/collector_configuration.yml which you must create. The collector configuration is used for any configuration specific to your collector.

It is possible to pass configuration parameters in the wrapper call eg.

wrapper(main, hdx_key_file=LOCATION_OF_HDX_KEY_FILE, hdx_config_yaml=PATH_TO_HDX_YAML_CONFIGURATION,
    collector_config_dict={'MY_PARAMETER': 'MY_VALUE'})

If you did not need a collector configuration, you could simply provide an empty dictionary eg.

wrapper(main, collector_config_dict = {})

If you do not use the wrapper, you can use the Configuration class directly, passing in appropriate keyword arguments ie.

from hdx.configuration import Configuration
...
cfg = Configuration(ARGUMENTS)

ARGUMENTS can be:

...
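
For example, a minimal sketch using keyword arguments like those shown for the wrapper above (the uppercase names are placeholders for your own paths):

from hdx.configuration import Configuration
...
# Reads the API key from the given file and merges the library's internal HDX
# configuration with your collector configuration
cfg = Configuration(hdx_key_file=LOCATION_OF_HDX_KEY_FILE,
                    collector_config_yaml=PATH_TO_COLLECTOR_YAML_CONFIGURATION)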

Configuring Logging

The default logging configuration reads a configuration file internal to the library that sets up a coloured console handler outputting at DEBUG level, a file handler writing to errors.log at ERROR level and an SMTP handler sending an email in the event of a CRITICAL error. It assumes that you have created a file config/smtp_configuration.yml which contains parameters of the form:

...

...

If you wish to change the logging configuration from the defaults, you will need to call setup_logging with arguments. If you have used the simple or ScraperWiki wrapper, you must make the call after the import line for the wrapper.

from hdx.logging import setup_logging
...
logger = logging.getLogger(__name__)
setup_logging(ARGUMENTS)

ARGUMENTS can be:

...

To use logging in your files, simply add the line below to the top of each Python file:

logger = logging.getLogger(__name__)

Then use the logger like this:

logger.debug('DEBUG message')
logger.info('INFORMATION message')
logger.warning('WARNING message')
logger.error('ERROR message')
logger.critical('CRITICAL error message')

Operations on HDX Objects

You can create an HDX Object, such as a dataset, resource or gallery item by calling the constructor with a configuration, which is required, and an optional dictionary containing metadata. For example:

dataset = Dataset(configuration, {
    'name': slugified_name,
    'title': title,
    'dataset_date': dataset_date,  # has to be MM/DD/YYYY
    'groups': iso
})

You can add metadata using the standard Python dictionary square brackets eg.

dataset['name'] = 'My Dataset'

You can also do so by the standard dictionary update method, which takes a dictionary eg.

dataset.update({'name': 'My Dataset'})

Larger amounts of static metadata are best added from files. YAML is very human readable and recommended, while JSON is also accepted eg.

dataset.update_yaml([path])
dataset.update_json([path])

The default path if unspecified is config/hdx_TYPE_static.yml for YAML and config/hdx_TYPE_static.json for JSON where TYPE is an HDX object's type like dataset or resource eg. config/hdx_galleryitem_static.json. The YAML file takes the following form:

owner_org: "acled"
maintainer: "acled"
...
tags:
- name: "conflict"
- name: "political violence"
gallery:
  - title: "Dynamic Map: Political Conflict in Africa"
    type: "visualization"
    description: "The dynamic maps below have been drawn from ACLED Version 6."
...

Notice how you can define a gallery with one or more gallery items (each starting with a dash '-') within the file as shown above. You can do the same for resources.

You can check if all the fields required by HDX are populated by calling check_required_fields with an optional list of fields to ignore. This will throw an exception if any fields are missing. Before the library posts data to HDX, it will call this method automatically. An example usage:

resource.check_required_fields(['package_id'])

A dataset can have resources and a gallery so if you wish to add them, you can supply a list and call the appropriate add_update_* function, for example:

resources = [{
    'name': xlsx_resourcename,
    'format': 'xlsx',
    'url': xlsx_url
}, {
    'name': csv_resourcename,
    'format': 'zipped csv',
    'url': csv_url
}]
for resource in resources:
    resource['description'] = resource['url'].rsplit('/', 1)[-1]
dataset.add_update_resources(resources)

Calling add_update_resources creates a list of HDX Resource objects in dataset and operations can be performed on those objects.
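
For example, a minimal sketch of working with those resource objects, assuming the dataset exposes them through a get_resources method:

for resource in dataset.get_resources():
    logger.info('%s points to %s' % (resource['name'], resource['url']))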

Once the HDX object is ready ie. it has all the required metadata, you simply call create_in_hdx eg.

dataset.create_in_hdx()

You can delete HDX objects using delete_from_hdx and update an object that already exists in HDX with the method update_in_hdx. These do not take any parameters or return anything and throw exceptions for failures like the object to delete or update not existing.

You can load an existing HDX object with the load_from_hdx method which takes an identifier parameter and returns True or False depending upon whether the object was loaded eg.

dataset.load_from_hdx('DATASET_ID_OR_NAME')
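
Putting these together, a minimal sketch of loading an existing dataset, modifying it and pushing the change back (the dataset name is taken from the logging example earlier and the new title is illustrative):

dataset = Dataset(configuration)
if dataset.load_from_hdx('acled-conflict-data-for-africa-realtime-2016'):
    dataset['title'] = 'ACLED Conflict Data for Africa (realtime)'
    dataset.update_in_hdx()
else:
    logger.error('Dataset does not exist in HDX!')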

Full Example

An example that puts all this together can be found here: https://github.com/mcarans/hdxscraper-acled-africa

In particular, take a look at the files run.py, acled_africa.py and the config folder.

 

...
