Background

...

The current platform is an external service called ScraperWiki. It has a web-based user interface for viewing the status of scrapers. Through the UI, it is possible to create a new environment on which to run new scrapers: an Amazon Web Services virtual server with various packages, such as Python, preinstalled. The UI provides the user with a URL for the server. The user can SSH into that server and set up the scraper by, for example, cloning its code with git and then creating a virtualenv with the required Python packages. Cron executes the scrapers according to the desired schedule.

...

Requirements for New Platform

With the move to the new platform, it was decided to deprecate many old scrapers, so the range of technologies needed has been reduced dramatically. These are the high-level requirements for the new platform:

  • The new platform needs to have the facility to schedule the running of code.
  • The granularity of the schedule is typically weekly or yearly, but sometimes daily.
  • It needs to be able to execute Python 3 code that will broadly follow this template.
  • It should be able to read from and write to external URLs.
  • Some disk space is needed for temporary files produced by scrapers, on the order of a maximum of 10 MB per scraper, but typically much less.
  • Some disk space will be needed for each scraper's environment, on the order of 300 MB for a Docker image.
  • It should be possible to execute 2 scrapers simultaneously (rare, but possible).
  • It should support scrapers (currently 1) that may have to wait for new quota from servers, which means that they will run for many hours but be mostly idle (e.g. pinging the server once a minute).
  • The platform should be able to cope with on the order of 10,000 calls to the web in a 3-hour period. (This is a worst case based on the FTS scraper, which makes on the order of 2,500 reads from FTS and 1,000 read/writes to HDX in a one-hour period. Planning for the future, I've allowed for 2 more scrapers like this.)
  • It needs to have a user interface where the status of scrapers can be determined.
  • It would be nice to be able to start scrapers from the interface.
  • The process for adding new scrapers should be technically simple and bureaucratically light.
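The Python 3 template mentioned above is linked from the original document rather than reproduced here. As a rough illustration of the read-transform-write shape such a scraper might take, the following is a hypothetical sketch; the function names and structure are assumptions, not the real template's API:

```python
# Hypothetical sketch of a scraper following a read-transform-write template.
# Names and structure are illustrative assumptions, not the actual template.
import logging

logger = logging.getLogger(__name__)


def read_source(url, downloader):
    """Read raw rows from an external URL using an injected download function."""
    logger.info("Reading %s", url)
    return downloader(url)


def transform(rows):
    """Reshape raw source rows into whatever the target platform expects."""
    return [{"name": str(row.get("name", "")).strip()} for row in rows]


def run(source_url, downloader, uploader):
    """One scheduled run: read from an external URL, transform, write back out."""
    rows = read_source(source_url, downloader)
    processed = transform(rows)
    uploader(processed)  # write the result to the external service
    return len(processed)
```

Injecting the download and upload functions keeps the sketch runnable and testable without network access, which is also convenient when scrapers run on a schedule inside containers.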

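The long-running, mostly idle behaviour described in the requirements (a scraper waiting for new quota, pinging the server about once a minute) could be sketched as a simple polling loop. This is an illustrative assumption about how such a scraper might behave, not code from any existing scraper:

```python
# Illustrative polling loop for a quota-limited scraper: mostly idle,
# waking roughly once per interval to retry. Not taken from a real scraper.
import time


def call_with_quota_wait(make_request, max_wait_seconds=6 * 3600, poll_interval=60):
    """Retry a quota-limited call until it succeeds, sleeping between attempts.

    make_request should return a result, or None while the server reports
    that no quota is available yet.
    """
    deadline = time.monotonic() + max_wait_seconds
    while True:
        result = make_request()
        if result is not None:  # quota was available and the call succeeded
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("quota never became available")
        time.sleep(poll_interval)  # mostly idle: one poll per interval
```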
...

Rather than each scraper executing within a Python virtualenv, as at present, each will run in a Docker container. The scraper's Docker image will build upon (inherit from) a base image owned by OCHA IT. The draft base image is here. It inherits from unocha/alpine-base:3.8 and contains a Python 3 environment suitable for running scrapers: it includes the HDX Python API library, awesome-slugify and pandas (including its dependencies on SciPy and NumPy). An example scraper that inherits from this base image is the FTS scraper.
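As an illustration of how a scraper image might inherit from the base image, a minimal Dockerfile could look like the following. The base image name, paths and entry point here are assumptions for illustration, not taken from the actual draft base image or the FTS scraper:

```dockerfile
# Illustrative only: image name, paths and entry point are assumptions.
FROM unocha/hdx-scraper-baseimage:latest

WORKDIR /srv/scraper
COPY . /srv/scraper

# Install any scraper-specific Python packages on top of the base image
RUN pip3 install --no-cache-dir -r requirements.txt

CMD ["python3", "run.py"]
```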

There is some private information that the scrapers need in order to run. Currently it resides in a private OCHA GitHub repository, but it will be moved to Ansible.

The setup will comply with OCHA IT's Hosting in Shared Infrastructure: Project Requirements.