Background
...
HDX has a RESTful API, largely unchanged from the underlying CKAN API, which can be used from any programming language that supports HTTP GET and POST requests. The HDX Python API provides a simple interface that communicates with HDX using the CKAN Python API, a thin wrapper around the CKAN REST API. It is a mature library that supports Python 2.7 and 3, and its tests have a high level of code coverage. The major goal of the library is to make pushing and pulling data from HDX as simple as possible for the end user. HDX objects, such as datasets and resources, are represented by Python classes.
Scrapers
The scrapers we will discuss here use this library to communicate with HDX.
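As a rough illustration of the plain HTTP interface described above, the sketch below builds a CKAN action URL and unwraps the JSON envelope that CKAN actions return. It uses only the Python 3 standard library and is not how the HDX Python API is implemented. The site address in HDX_SITE and the helper names action_url and unwrap are assumptions made for this example; the /api/3/action/ route and the "success"/"result" envelope come from CKAN's action API.

```python
import json
from urllib.parse import urlencode

# Assumption for illustration: the public HDX site address.
HDX_SITE = "https://data.humdata.org"


def action_url(action, site=HDX_SITE, **params):
    """Build the URL of a CKAN action, e.g. package_show (hypothetical helper)."""
    url = "{}/api/3/action/{}".format(site, action)
    if params:
        # Query parameters are what a simple HTTP GET request would carry.
        url += "?" + urlencode(params)
    return url


def unwrap(body):
    """Return the 'result' field of a CKAN action response (hypothetical helper).

    CKAN wraps every action response in a JSON object with a boolean
    'success' field and, on success, a 'result' field.
    """
    payload = json.loads(body)
    if not payload.get("success"):
        raise RuntimeError("CKAN action failed: {}".format(payload.get("error")))
    return payload["result"]
```

A caller could fetch action_url("package_show", id="some-dataset-name") with urllib.request.urlopen and pass the decoded response body to unwrap; the dataset name here is a placeholder. In practice the scrapers use the HDX Python API rather than raw requests like this.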
...
Rather than running within a Python virtualenv as they do currently, each scraper will run in its own Docker container. The scraper's Docker image will build upon (inherit from) a base image owned by OCHA IT. The draft base image is here. It inherits from unocha/alpine-base:3.8 and contains a Python 3 environment suitable for running scrapers: it includes the HDX Python API library, awesome-slugify and Pandas (including its dependencies on SciPy and NumPy). An example scraper that inherits this base image is the FTS scraper.
The scrapers need some private information in order to run. Currently it resides in a private OCHA GitHub repository, but it will be moved to Ansible.
The setup will comply with OCHA IT's Hosting in Shared Infrastructure: Project Requirements.