
Introduction

The HDX Python Library is designed to enable you to easily develop code that interacts with the Humanitarian Data Exchange (HDX) platform, which is built on top of the CKAN open-source data management system. The major goal of the library is to make pushing and pulling data from HDX as simple as possible for the end user. It achieves this in several ways. It provides a simple interface that communicates with HDX using the CKAN Python API, a thin wrapper around the CKAN JSON API. The HDX objects, such as datasets and resources, are represented by Python classes. This should make the learning curve gentle and enable users to quickly get started with using HDX programmatically.

You can jump to the Getting Started page or continue reading below about the purpose and design philosophy of the library. With the upgrade to CKAN 2.6, the UML diagram and gallery terminology are outdated (TODO: Update this).

...

Keeping it Simple

  1. The library hides CKAN's idiosyncrasies and tries to match the HDX user interface experience. The user does not need to learn about CKAN, and the library makes it easier to understand what the result in HDX will be when calling a Python method.

  2. The class structure of the library should be as logical as possible (within the restrictions of the CKAN API it relies on). In HDX, a dataset can contain zero or more resources and it can be in one or more showcases (which themselves can contain more than one dataset), so the library reflects this even though the CKAN showcase API comes from a plugin and is not part of the core CKAN API.


  3. Datasets, resources and showcases can use dictionary methods like square brackets to handle metadata, which feels natural. (The HDXObject class extends UserDict.) eg.

    Code Block
    dataset['name'] = 'My Dataset'

     


  4. Static metadata can be imported from a YAML file (recommended because it is very human readable) or a JSON file eg.

    Code Block
    dataset.update_yaml([path])

    Static metadata can be passed in as a dictionary on initialisation of a dataset, resource or showcase eg.

    Code Block
    dataset = Dataset({
        'name': slugified_name,
        'title': title,
    })


  5. There are functions to help with adding more complicated types like dates and date ranges, locations etc. eg.

    Code Block
    dataset.set_date_of_dataset('START DATE', 'END DATE')


  6. There are separate country code and utility libraries that provide functions to handle converting between country codes, dictionary merging, loading multiple YAML or JSON files and a few other helpful tasks eg.

    Code Block
    Country.get_iso3_country_code_fuzzy('Czech Rep.')

    Code Block
    def script_dir_plus_file(filename: str, pyobject: Any, follow_symlinks: Optional[bool] = True) -> str:
        """Get current script's directory and then append a filename

        Args:
            filename (str): Filename to append to directory path
            pyobject (Any): Any Python object in the script
            follow_symlinks (Optional[bool]): Follow symlinks or not. Defaults to True.

        Returns:
            str: Current script's directory with filename appended
        """
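The dictionary-style metadata handling described in point 3 can be illustrated without the library installed. The sketch below is not the library's HDXObject: `MetadataObject` is a hypothetical stand-in that only shows how extending `UserDict` gives square-bracket access and the other standard dictionary operations.

```python
from collections import UserDict


class MetadataObject(UserDict):
    """Hypothetical stand-in for an HDX object (not the library's HDXObject)."""

    def __init__(self, initial_data=None):
        # UserDict keeps the metadata in self.data, a plain dict
        super().__init__(initial_data or {})


# Static metadata passed in as a dictionary on initialisation
dataset = MetadataObject({'name': 'my-dataset'})

# Square brackets set and get metadata, as with a normal dict
dataset['title'] = 'My Dataset'
print(dataset['title'])

# Other dictionary methods work too
dataset.update({'notes': 'Example only'})
print(sorted(dataset.keys()))
print('name' in dataset, len(dataset))
```

Because the metadata lives in an ordinary dictionary, it can be passed directly to code that serialises it to JSON or YAML.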

Easy Configuration and Logging

  1. Logging is something often neglected so the library aims to make it a breeze to get going with logging and so avoid the spread of print statements. A few handlers are created in the default configuration:

    Code Block
    console:
        class: logging.StreamHandler
        level: DEBUG
        formatter: color
        stream: ext://sys.stdout

    error_file_handler:
        class: logging.FileHandler
        level: ERROR
        formatter: simple
        filename: errors.log
        encoding: utf8
        mode: w

     


  2. If using the default logging configuration, then it is possible to also add the default email (SMTP) handler: 

    Code Block
    error_mail_handler:
        class: logging.handlers.SMTPHandler
        level: CRITICAL
        formatter: simple
        mailhost: localhost
        fromaddr: noreply@localhost

  3. Configuration is made as simple as possible with a Configuration class that handles the HDX API key and the merging of configurations from multiple YAML or JSON files or dictionaries:

    Code Block
    class Configuration(UserDict):
        """Configuration for HDX

        Args:
            **kwargs: See below
            hdx_key_file (Optional[str]): Path to HDX key file. Defaults to ~/.hdxkey.
            hdx_config_dict (dict): HDX configuration dictionary OR
            hdx_config_json (str): Path to JSON HDX configuration OR
            hdx_config_yaml (str): Path to YAML HDX configuration. Defaults to library's internal hdx_configuration.yml.
            project_config_dict (dict): Project configuration dictionary OR
            project_config_json (str): Path to JSON Project configuration OR
            project_config_yaml (str): Path to YAML Project configuration. Defaults to config/project_configuration.yml.
        """

  4. The library itself uses logging at appropriate levels to ensure that it is clear what operations are being performed eg.

    Code Block
    WARNING - 2016-06-07 11:08:04 - hdx.data.dataset - Dataset exists. Updating acled-conflict-data-for-africa-realtime-2016

     


  5. The library makes errors plain by raising exceptions rather than returning False or None (except where that would be more appropriate) eg.

    Code Block
    hdx.configuration.ConfigurationError: More than one project configuration file given!

  6. There are facades to simplify setup to which the project's main function is passed. They neatly cloak the setup of logging and one of them hides the required calls for pushing status into ScraperWiki (used internally in HDX) eg.

    Code Block
    from datetime import datetime

    from hdx.facades.scraperwiki import facade

    def main():
        dataset = generate_dataset(datetime.now())
        ...

    if __name__ == '__main__':
        facade(main)
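To make the facade idea in point 6 concrete, here is a minimal sketch using only the standard library. It is not the library's facade: `simple_facade` and `LOGGING_CONFIG` are hypothetical names, the log format is an assumption inferred from the sample log line in point 4, and the `simple` formatter is used for the console because the `color` formatter needs an extra package.

```python
import logging
import logging.config

# Assumed configuration, loosely mirroring the console handler shown above
LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'simple': {
            'format': '%(levelname)s - %(asctime)s - %(name)s - %(message)s',
            'datefmt': '%Y-%m-%d %H:%M:%S',
        },
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'level': 'DEBUG',
            'formatter': 'simple',
            'stream': 'ext://sys.stdout',
        },
    },
    'root': {'level': 'DEBUG', 'handlers': ['console']},
}


def simple_facade(projectmainfn):
    """Hypothetical facade: set up logging, then call the project's main function."""
    logging.config.dictConfig(LOGGING_CONFIG)
    logging.getLogger(__name__).info('Starting %s', projectmainfn.__name__)
    return projectmainfn()


def main():
    # Project code logs instead of printing
    logging.getLogger('example.project').warning('Dataset exists. Updating it')
    return 'done'


result = simple_facade(main)
```

The project's main function stays free of set-up code, so the same function can be run under a different facade without modification.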

Documentation of the API

  1. The code is very well documented. Detailed API documentation (generated from Google style docstrings using Sphinx) is available and mentioned in the Getting Started guide. 

    Code Block
    def load_from_hdx(self, id_or_name: str) -> bool:
        """Loads the dataset given by either id or name from HDX

        Args:
            id_or_name (str): Either id or name of dataset

        Returns:
            bool: True if loaded, False if not
        """

    IDEs can take advantage of the documentation.


  2. The method arguments and return values have type hints. (Although this is a feature of Python 3.5, it has been backported.) Type hints enable sophisticated IDEs like PyCharm to warn of any inconsistencies in using types, bringing one of the major benefits of statically typed languages to Python.

    Code Block
    def merge_dictionaries(dicts: List[dict]) -> dict:



  3. Default parameters mean that there is a very easy default way to get set up and going eg.

    Code Block
    def update_yaml(self, path: Optional[str] = join('config', 'hdx_dataset_static.yml')) -> None:
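The combination of Google style docstrings, type hints and default parameters described in this section can be tried out in isolation. The sketch below is illustrative only: `merge_dictionaries_sketch` is a hypothetical shallow merge and may not match the behaviour of the library's `merge_dictionaries`, and `static_metadata_path` is an invented helper that simply shows a default path parameter.

```python
from os.path import join
from typing import List


def merge_dictionaries_sketch(dicts: List[dict]) -> dict:
    """Merge dictionaries from left to right (shallow merge, later values win)

    Args:
        dicts (List[dict]): Dictionaries to merge

    Returns:
        dict: Merged dictionary
    """
    merged: dict = {}
    for dictionary in dicts:
        merged.update(dictionary)
    return merged


def static_metadata_path(path: str = join('config', 'hdx_dataset_static.yml')) -> str:
    """Return the path to use for static metadata

    Args:
        path (str): Path to YAML file. Defaults to config/hdx_dataset_static.yml.

    Returns:
        str: Path that would be loaded
    """
    return path


merged = merge_dictionaries_sketch([{'a': 1}, {'a': 2, 'b': 3}])
default_path = static_metadata_path()
custom_path = static_metadata_path('my_metadata.yml')
```

A type-aware IDE or checker would flag a call such as `merge_dictionaries_sketch('not a list')`, because a string does not match `List[dict]`.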

 

 

...