
There are various enhancements to HDX that we could consider to improve the user experience, simplify the quality assurance work of Data Partnerships and support the development of dashboards and other visualisations. I want to document these enhancements here so we can think about if and how they might fit into our development plans for HDX.

Problems

Issues that were already identified 

The following issues have been identified:

  • archiving of "old" datasets (where data is from a long time ago, but is as up to date as it can be eg. ebola 2013) - this is done

  • tag cleanup - dataset tags should not be freeform, they should be from a fixed list with the facility to request to add more - some of this is done

  • fixing data URLs so that charts, reports etc. don't break - USAID have asked for this and there is a Jira Epic for it (the Fixed Data URLs Idea goes further than what USAID have asked for)

  • a workflow that tries to alert a contributor when an update to a resource they are making has unexpected field names, data type changes, etc. - USAID have asked for this and there is a Jira Epic for it

  • a system whereby automated users (and maybe normal users as well) can register to receive important information about a dataset they are using eg. a breaking change to the format, no longer being updated etc. - USAID have asked for this and there is a Jira Epic for it

  • we need to be able to distinguish data resources from auxiliary ones - helpful to DP's work on QAing datasets

  • distinguishing API resources from others

  • resources can't keep growing indefinitely - we need a way to split a resource once it grows beyond a certain size

  • keeping a history of data (versioning) - newly added data may contain errors so it may be helpful to be able to fall back to a previous version of the data eg. if a dashboard cannot load latest/xxx.csv, it could try 1/xxx.csv

  • a service whereby if a contributor uploads data in a particular format/structure that we specify, then the data is served out disaggregated in multiple ways

  • finding data by API needs to be simpler. Currently the limiting factor is the capability of CKAN API search. This could be helped by adding more metadata to the dataset, for example the list of fields and HXL tags in the data

  • more generally a search tailored to what users want to search for eg. if a user types "wheat price kandahar" they would like to get back that price.
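The versioning fallback described above could be as simple as a client-side loop over version prefixes. A minimal sketch, assuming a URL layout like latest/xxx.csv, 1/xxx.csv (the `fetch` callable and path scheme are illustrative assumptions, not an existing HDX API):

```python
def fetch_with_fallback(fetch, filename, versions=("latest", "1")):
    """Try each version prefix in turn.

    fetch is any callable that takes a path like 'latest/xxx.csv' and
    returns bytes, or raises IOError if that version cannot be loaded.
    """
    last_error = None
    for version in versions:
        try:
            return fetch(f"{version}/{filename}")
        except IOError as err:
            last_error = err  # fall back to the next (older) version
    raise IOError(f"no version of {filename} could be fetched") from last_error
```

A dashboard would call this with its own HTTP fetch function, so a bad latest upload degrades gracefully to the previous version instead of breaking the visualisation.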

How data is currently structured

...

Are we approaching the stage where we need to break down data by admin 1 rather than country to enable users of HDX to be able to search for data in the UI at that level? How do we make data available in many forms eg. by country, by indicator, by admin 1?

Ideas for HDX

The ideas presented below were created with consideration for what is feasible given the restrictions of CKAN. The intention was to avoid overly complex solutions that might require forking CKAN to make fundamental changes to its architecture, and instead to come up with something relatively simple to implement given limited development capacity.

...

  • The metadata field that HDX Connect uses should be added to all datasets (field_names)

  • A new metadata field should be added to all datasets: hxl_hashtags

  • In the dataset edit dialog when a contributor adds a resource:

    • Its headers and any HXL tags should be scanned

    • The headers should prepopulate the Field Names text field

    • The contributor can edit the text field

    • The HXL Hashtags should prepopulate the HXL Hashtags (uneditable?) text field

  • CKAN search should be made to use the field_names and hxl_hashtags dataset fields by default

  • The resource_groups field in the dataset should also be included in CKAN search

  • See also “Idea for a Data Cube” later in this page
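The header and HXL tag scanning step in the dataset edit dialog could work roughly as follows. This is a sketch, assuming CSV resources and the HXL convention that the hashtag row directly follows the header row:

```python
import csv
import io

def scan_resource(csv_bytes):
    """Return (field_names, hxl_hashtags) for a CSV resource.

    The second row is treated as the HXL hashtag row only if every
    non-empty cell starts with '#'; otherwise the resource is assumed
    to have no HXL tags.
    """
    rows = list(csv.reader(io.StringIO(csv_bytes.decode("utf-8"))))
    headers = rows[0] if rows else []
    hashtags = []
    if len(rows) > 1 and rows[1] and all(c.startswith("#") for c in rows[1] if c):
        hashtags = [c for c in rows[1] if c]
    return headers, hashtags
```

The headers would prepopulate the Field Names text field and the hashtags the HXL Hashtags field, with both stored in the dataset metadata for CKAN search to index.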

...

Standardising and Categorising Resources

  • We could categorise resources into Data and Auxiliary resources and then further subdivide them in the UI into, say, “Tabular data”, “Geodata” and “Auxiliary data”

    • The "Data and Resources" tab should be divided into the categories

    • We need to define a list of file extensions that we regard as "tabular data" eg. csv, xls, xlsx etc.

    • We need to define a list of file extensions that we regard as "geodata" eg. geojson, zipped shapefile etc.

  • In the dataset edit UI, the resource name field should be relabelled "Title of Resource"

    • It should not be prepopulated from the filename (as it is currently) because we want contributors to write a descriptive resource title

  • If there is a file extension, it should be used to prepopulate the "File type" field

    • The "File type" should not be allowed to be different to any provided file extension

    • If a file extension is not provided, the resource could be sniffed to try to guess the file type to prepopulate the field eg. using https://github.com/ahupp/python-magic

    • When adding a resource, the "File type" field needs to be locked down to a list of possible formats for the contributor to pick from

  • Consider adding an option to suggest a new format not in the list

  • For one resource, there should be an option to add more file types (ie. upload the same data in different formats like csv, xls etc.)

    • One resource with multiple file types would likely translate into separate resources in dataset metadata

    • The "Download" button should be replaced with the "File type" label plus a small download symbol

    • If more than one resource has the same "Title of Resource" but different "File types" (within the same category), they can be shown as one resource with a download button per file type

    • Example: Missing Migrants     ↓csv  ↓xls

  • Categorisation in the dataset edit UI:

    • The resource metadata should have a new field containing the categorisation eg. category which could be "tabular", "geo" or "auxiliary"

    • Once the "File type" is selected, the dataset edit UI should show how the resource was categorised:

      • "Tabular data"

      • "Geodata"

      • "Auxiliary data"

    • The contributor should have the option to recategorise "Tabular data" or "Geodata" to "Auxiliary data"

  • Should showcases just point to websites and visualisations, with pdfs being auxiliary resources?

  • The contributor should have the option to flag to us that "Auxiliary data" has been incorrectly categorised - this will alert us to new tabular or map file formats to add to our lists
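The file-type-based categorisation above could be a simple lookup against the agreed extension lists. A sketch; the extension lists here are illustrative and would need to be defined properly with Data Partnerships:

```python
# Illustrative extension lists - the real lists are still to be agreed.
TABULAR_EXTENSIONS = {"csv", "xls", "xlsx", "json"}
GEODATA_EXTENSIONS = {"geojson", "shp", "kml", "gpkg"}

def categorise_resource(file_type):
    """Map a resource's file type to the proposed category field:
    "tabular", "geo" or "auxiliary" (the default for unknown types)."""
    ext = file_type.lower().lstrip(".")
    if ext in TABULAR_EXTENSIONS:
        return "tabular"
    if ext in GEODATA_EXTENSIONS:
        return "geo"
    return "auxiliary"
```

Anything not on either list falls through to "auxiliary", which is exactly the case the flag-to-us option is meant to catch so that new tabular or geodata formats can be added to the lists.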

...

  • We should have a check in the dataset create dialog for a date (eg. a year) being put into a title

  • Some sort of dialog should appear that guides the user away from doing this (by informing them about Resource Groups within datasets)
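The date-in-title check could start as a simple pattern match for a four-digit year. A minimal sketch, assuming that catching years is enough for a first version (month names, date ranges etc. could be added later):

```python
import re

# Matches a standalone four-digit year from 1900-2099, eg. "Ebola Cases 2014".
YEAR_PATTERN = re.compile(r"\b(19|20)\d{2}\b")

def title_contains_date(title):
    """Return True if the proposed dataset title contains a year,
    so the create dialog can show guidance about Resource Groups."""
    return bool(YEAR_PATTERN.search(title))
```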

Blue Sky Ideas

...

The ideas below would require technological leaps and involve creating new systems outside of CKAN (that could still link back into CKAN). They require research and discussion to flesh out.

Meta Service/API that links to other APIs

The idea is to have a meta service/API that would link to other services/APIs and allow the easy download of data given a standard set of input parameters. On top of such a meta service/API would be a user interface which would allow setting of those parameters and download by non technical users. Using the "wheat price kandahar" example, I could imagine a meta service/API user choosing parameters like "service": "prices", "type": "commodities", "provider": "WFP", "country": "Afghanistan", "adm1": "Kandahar", "Commodity": "wheat", "date": "08/03/2022".
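To make the example concrete, a client call against such a meta service might just be a parameterised URL. A sketch; the endpoint and parameter names below are hypothetical, taken from the "wheat price kandahar" example above:

```python
from urllib.parse import urlencode

# Hypothetical query matching the "wheat price kandahar" example.
params = {
    "service": "prices",
    "type": "commodities",
    "provider": "WFP",
    "country": "Afghanistan",
    "adm1": "Kandahar",
    "commodity": "wheat",
    "date": "08/03/2022",
}
# The host is a placeholder - no such service exists yet.
url = "https://meta-api.example.org/query?" + urlencode(params)
```

The UI layer would simply be a form that builds this parameter set for non-technical users and triggers the download.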

...

  • Resources should have a field "is_queryable" which can be True or False

  • Resources should have a field "queryable_desc_url" which is a link to a webpage that describes the parameters a user can append to the URL to filter or transform the data

  • There should be an explanation of what queryable means in the dataset edit dialog

  • There should be a checkbox in the dataset edit dialog to mark a resource as queryable

  • There should be a url field for adding a link to a webpage that describes the parameters a user can append to the URL to filter or transform the data

  • Queryable resources should be flagged in the dataset view and search UIs (work has been done on this)

  • The link to a webpage describing the parameters should be next to or below the queryable resource
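Putting the bullets above together, the resource metadata and view logic might look like this sketch (field names follow the proposal above; the example resource and helper are illustrative only):

```python
# Illustrative resource metadata carrying the two proposed fields.
resource = {
    "name": "WFP food prices for Afghanistan",
    "url": "https://data.example.org/prices.csv",
    "format": "CSV",
    "is_queryable": True,
    "queryable_desc_url": "https://docs.example.org/prices-api",
}

def queryable_label(resource):
    """Return the flag text the dataset view UI might show next to a
    queryable resource, or '' for ordinary resources."""
    if resource.get("is_queryable"):
        desc = resource.get("queryable_desc_url", "")
        return f"Queryable - parameters: {desc}" if desc else "Queryable"
    return ""
```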

...

Data Cube

Typically we make data available in different forms, like by indicator or by country, by creating separate datasets each with their own copy of the data. This makes accessing the data fast, but it complicates scrapers, which must do the breakdowns themselves, so on HDX data is mostly just broken down by country. A data cube enables data to be modelled and viewed in multiple dimensions. Is there a way that data could be stored in some sort of data cube so that it can be viewed in different ways like by country, admin 1 or indicator without keeping multiple copies of the data?

  • The data cube should be available from the HDX add dataset dialog, perhaps as a third type?

  • Aggregated data should be provided in standardised form with HXL hashtags (to be defined) 

  • The service potentially generates multiple datasets on HDX

  • General metadata that will be used for all generated datasets should be added using a UI similar to the add public dataset UI on HDX

  • The contributor can select to have the full aggregated dataset put into HDX?

  • The service looks at the HXL hashtags

  • It determines the columns which are suitable for disaggregation by looking at the HXL hashtags

  • It offers them as suggestions to the contributor?

  • If the contributor selects #country (or we simply disaggregate along every suitable column detected):

    • It splits the dataset by country, creating a dataset per country in HDX pointing back to the cube data

    • The metadata for each dataset is based on the general metadata

    • It will need to add country information into the dataset title etc.

  • A similar process can be applied for #indicator etc.
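The per-country split at the heart of this flow can be sketched in a few lines, assuming the standardised form above (row 1 headers, row 2 HXL hashtags). The function below is illustrative of the service's core step, not a proposed implementation:

```python
import csv
import io
from collections import defaultdict

def split_by_hashtag(csv_text, hashtag="#country"):
    """Split a HXL-tagged CSV into one CSV per value of the given
    hashtag's column, eg. one output per country for "#country".

    Assumes row 1 holds headers and row 2 holds the HXL hashtags.
    Returns a dict mapping each value to its CSV text.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    headers, hashtags, data = rows[0], rows[1], rows[2:]
    col = next(i for i, tag in enumerate(hashtags) if tag.startswith(hashtag))
    groups = defaultdict(list)
    for row in data:
        groups[row[col]].append(row)
    out = {}
    for value, group in groups.items():
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(headers)
        writer.writerow(hashtags)  # keep the HXL row in each output
        writer.writerows(group)
        out[value] = buf.getvalue()
    return out
```

Each output would become a per-country dataset on HDX, with the general metadata plus country information added to the title; running the same function with "#indicator" would give the by-indicator breakdown.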