
You can jump to the Getting Started page or continue reading below about the purpose and design philosophy of the library. 


The major goal of the library is to make interacting with HDX as simple as possible for the end user. There are several ways this is achieved.

Keeping it Simple

  1. The library avoids CKAN syntax, using HDX terminology instead. Hence there is no reference to CKAN related items, only gallery items. The user does not need to learn about CKAN, which makes it easier to understand what the result will be in HDX when calling a Python method.

  2. The class structure of the library should be as logical as possible (within the restrictions of the CKAN API it relies on). In HDX, a dataset can contain zero or more resources and a gallery (consisting of gallery items), so the library reflects this even though the CKAN API presents a different interface for gallery items than for resources.

    The UML diagram below shows the relationships between the major classes in the library.  

     

    [Diagram: Classes]

  3. Datasets, resources and gallery items can use dictionary syntax such as square brackets to handle metadata, which feels natural. (The HDXObject class extends UserDict.) eg.

    dataset['name'] = 'My Dataset'

     

  4. Static metadata can be imported from a YAML file, recommended for being very human readable, or a JSON file eg.

    dataset.update_yaml([path])

    Static metadata can be passed in as a dictionary on initialisation of a dataset, resource or gallery item eg.

    dataset = Dataset(configuration, {
        'name': slugified_name,
        'title': title,
        'dataset_date': dataset_date,  # has to be MM/DD/YYYY
        'groups': iso
    })

     

  5. There are utility functions to handle dictionary merging, loading multiple YAML or JSON files and a few other helpful tasks eg.

    def script_dir_plus_file(filename: str, pyobject: Any, follow_symlinks: Optional[bool] = True) -> str:
        """Get current script's directory and then append a filename

        Args:
            filename (str): Filename to append to directory path
            pyobject (Any): Any Python object in the script
            follow_symlinks (Optional[bool]): Follow symlinks or not. Defaults to True.

        Returns:
            str: Current script's directory with filename appended
        """
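Two of the ideas in this list, dictionary-style access to metadata and dictionary merging, can be illustrated together. The following is only a minimal sketch, not the library's code: `MetadataObject` and `update_from_dicts` are hypothetical names, and the recursive, "later values win" merge behaviour is an assumption about what a merge utility with the `merge_dictionaries` signature might do.

```python
from collections import UserDict
from typing import List, Optional


def merge_dictionaries(dicts: List[dict]) -> dict:
    """Merge dictionaries left to right (assumed behaviour).

    Later values replace earlier ones; nested dictionaries are merged recursively.
    """
    merged: dict = {}
    for dictionary in dicts:
        for key, value in dictionary.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_dictionaries([merged[key], value])
            else:
                merged[key] = value
    return merged


class MetadataObject(UserDict):
    """Hypothetical stand-in for an HDX object: metadata lives in self.data."""

    def __init__(self, initial_data: Optional[dict] = None) -> None:
        super().__init__(initial_data or {})

    def update_from_dicts(self, dicts: List[dict]) -> None:
        """Merge several static metadata dictionaries into this object."""
        self.data = merge_dictionaries([self.data] + dicts)


dataset = MetadataObject({'name': 'my-dataset'})
dataset['title'] = 'My Dataset'  # square brackets work because of UserDict
dataset.update_from_dicts([{'extras': {'a': 1}}, {'extras': {'b': 2}}])
print(dataset['title'])
print(dataset['extras'])
```

Extending `UserDict` rather than `dict` keeps the metadata in a plain `data` attribute, which makes it straightforward to replace or merge wholesale.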

Easy Configuration and Logging

  1. Logging is often neglected, so the library aims to make it a breeze to get going with logging and so avoid the spread of print statements. A few handlers are created in the default configuration:
    console:
        class: logging.StreamHandler
        level: DEBUG
        formatter: color
        stream: ext://sys.stdout
    error_file_handler:
        class: logging.FileHandler
        level: ERROR
        formatter: simple
        filename: errors.log
        encoding: utf8
        mode: w
    error_mail_handler:
        class: logging.handlers.SMTPHandler
        level: CRITICAL
        formatter: simple
        mailhost: localhost
        fromaddr: noreply@localhost

     

  2. Configuration is made as simple as possible with a Configuration class that handles the HDX API key and the merging of configurations from multiple YAML or JSON files or dictionaries:

    class Configuration(UserDict):
        """Configuration for HDX

        Args:
            **kwargs: See below
            hdx_key_file (Optional[str]): Path to HDX key file. Defaults to ~/.hdxkey.
            hdx_config_dict (dict): HDX configuration dictionary OR
            hdx_config_json (str): Path to JSON HDX configuration OR
            hdx_config_yaml (str): Path to YAML HDX configuration. Defaults to library's internal hdx_configuration.yml.
            collector_config_dict (dict): Collector configuration dictionary OR
            collector_config_json (str): Path to JSON Collector configuration OR
            collector_config_yaml (str): Path to YAML Collector configuration. Defaults to config/collector_configuration.yml.
        """
  3. The library itself uses logging at appropriate levels to ensure that it is clear what operations are being performed eg.

    WARNING - 2016-06-07 11:08:04 - hdx.data.dataset - Dataset exists. Updating acled-conflict-data-for-africa-realtime-2016

     

  4. The library makes errors plain by raising exceptions rather than returning False or None (except where that would be more appropriate) eg.

    hdx.configuration.ConfigurationError: More than one collector configuration file given!

  5. There are setup wrappers to which the collector's main function is passed. They neatly cloak the setup of logging and one of them hides the required calls for pushing status into ScraperWiki (used internally in HDX) eg.

    from datetime import datetime

    from hdx.collector.scraperwiki import wrapper

    def main(configuration):
        # generate_dataset is the collector's own function
        dataset = generate_dataset(configuration, datetime.now())
        ...

    if __name__ == '__main__':
        wrapper(main)
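The pieces described in this list (logging set up from a configuration, errors raised as exceptions, and a wrapper that hides the setup from the collector's main function) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: `LOGGING_CONFIG`, `build_configuration` and `simple_wrapper` are hypothetical names, the configuration is trimmed to a single console handler, and only the dictionary form of the collector configuration is actually read.

```python
import logging
import logging.config
from typing import Any, Callable

# Hypothetical, trimmed-down logging configuration in dictConfig form
LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'simple': {
            'format': '%(levelname)s - %(asctime)s - %(name)s - %(message)s',
            'datefmt': '%Y-%m-%d %H:%M:%S',
        },
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'level': 'DEBUG',
            'formatter': 'simple',
            'stream': 'ext://sys.stdout',
        },
    },
    'root': {'level': 'DEBUG', 'handlers': ['console']},
}


class ConfigurationError(Exception):
    """Raised when the configuration is inconsistent."""


def build_configuration(**kwargs: Any) -> dict:
    """Return the collector configuration, insisting on at most one source."""
    keys = ('collector_config_dict', 'collector_config_json', 'collector_config_yaml')
    given = [key for key in keys if key in kwargs]
    if len(given) > 1:
        # Fail loudly instead of returning False or None
        raise ConfigurationError('More than one collector configuration file given!')
    # Sketch only: file-based sources are not loaded here
    return dict(kwargs.get('collector_config_dict', {}))


def simple_wrapper(main: Callable[[dict], None], **kwargs: Any) -> None:
    """Set up logging and configuration, then call the collector's main function."""
    logging.config.dictConfig(LOGGING_CONFIG)
    logger = logging.getLogger(__name__)
    configuration = build_configuration(**kwargs)
    logger.info('Starting collector')
    main(configuration)
    logger.info('Collector finished')


def main(configuration: dict) -> None:
    logging.getLogger(__name__).info('Running with %s', configuration)


if __name__ == '__main__':
    simple_wrapper(main, collector_config_dict={'country': 'example'})
```

Passing `main` into the wrapper keeps the collector script free of setup code, and a conflicting configuration stops the run immediately with a `ConfigurationError` rather than a silent failure.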

API Documentation

  1. The code is very well documented. Detailed API documentation (generated from Google style docstrings using Sphinx) can be found in the Introduction above. 
    def load_from_hdx(self, id_or_name: str) -> bool:
        """Loads the dataset given by either id or name from HDX

        Args:
            id_or_name (str): Either id or name of dataset

        Returns:
            bool: True if loaded, False if not
        """

  2. IDEs can take advantage of the documentation, eg. by showing a method's docstring as quick help while you type.

  3. The method arguments and return values have type hints. (Although type hints are a feature of Python 3.5, they have been backported.) Type hints enable sophisticated IDEs like PyCharm to warn of any inconsistencies in using types, bringing one of the major benefits of statically typed languages to Python eg.
    def merge_dictionaries(dicts: List[dict]) -> dict:

  4. Default parameters mean that there is a very easy default way to get set up and going eg.
    def update_yaml(self, path: Optional[str] = join('config', 'hdx_dataset_static.yml')) -> None:
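To show how a Google style docstring, type hints and a default parameter fit together in one method, here is a small self-contained sketch. It is not taken from the library: `StaticMetadata`, its `update_json` method and the default file name `static_metadata.json` are all hypothetical, and JSON is used instead of YAML only so that the example needs nothing beyond the standard library.

```python
import json
import tempfile
from collections import UserDict
from os.path import join


class StaticMetadata(UserDict):
    """Hypothetical metadata holder, not a class from the library."""

    def update_json(self, path: str = join('config', 'static_metadata.json')) -> None:
        """Merge metadata read from a JSON file into this object

        Args:
            path (str): Path to JSON file. Defaults to config/static_metadata.json.

        Returns:
            None
        """
        with open(path, encoding='utf-8') as json_file:
            self.data.update(json.load(json_file))


# Usage: write a small JSON file to a temporary directory, then load it explicitly
with tempfile.TemporaryDirectory() as folder:
    example_path = join(folder, 'static_metadata.json')
    with open(example_path, 'w', encoding='utf-8') as json_file:
        json.dump({'title': 'My Dataset'}, json_file)
    metadata = StaticMetadata({'name': 'my-dataset'})
    metadata.update_json(example_path)

print(dict(metadata))
```

With the hint in place, a call such as `metadata.update_json(42)` is something a type-aware IDE can flag before the code runs, and calling `metadata.update_json()` with no argument falls back to the default path.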