There are various enhancements to HDX that we can consider to improve the user experience, simplify the quality assurance work of Data Partnerships and support the development of dashboards and other visualisations. I want to document these enhancements here as we consider whether, and where, they fit into our plans.

...

Issues that were already identified 

The following issues have been identified:

  • archiving of "old" datasets (where the data is from a long time ago, but is as up to date as it can be eg. Ebola 2013) - this is done

  • tag cleanup - dataset tags should not be freeform, they should be from a fixed list with the facility to request to add more - some of this is done

  • fixing data URLs so that charts, reports etc. don't break - USAID have asked for this and there is a Jira Epic for it (the Fixed Data URLs Idea goes further than what USAID have asked for)

  • a workflow that tries to alert a contributor when an update to a resource they are making has unexpected field names, data type changes, etc. - USAID have asked for this and there is a Jira Epic for it

  • a system whereby automated users (and maybe normal users as well) can register to receive important information about a dataset they are using eg. a breaking change to the format, no longer being updated etc. - USAID have asked for this and there is a Jira Epic for it

  • we need to be able to distinguish data resources from auxiliary ones - helpful to DP's work on QAing datasets

  • distinguishing API resources from others

  • resources can't keep growing indefinitely - we need a way to split a resource once it grows beyond a certain size

  • keeping a history of data (versioning) - newly added data may contain errors so it may be helpful to be able to fall back to a previous version of the data eg. if a dashboard cannot load latest/xxx.csv, it could try 1/xxx.csv

  • a service whereby if a contributor uploads data in a particular format/structure that we specify, then the data is served out disaggregated in multiple ways

  • finding data by API needs to be simpler. Currently the limitation is the capabilities of the CKAN API search. This can be helped by adding more metadata to the dataset, for example the list of fields and HXL tags in the data

  • more generally a search tailored to what users want to search for eg. if a user types "wheat price kandahar" they would like to get back that price.
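The versioning fallback mentioned in the list above (try latest/xxx.csv, then 1/xxx.csv) could look roughly like the sketch below. Everything beyond that one example is an assumption made for illustration: the idea that versions are numbered folders counting up from 1, the function names, and the fetch callable that stands in for whatever HTTP client a dashboard uses. None of this is an existing HDX API.

```python
from typing import Callable, List, Optional, Tuple


def candidate_urls(base_url: str, filename: str, latest_version: int) -> List[str]:
    """List URLs to try in order: 'latest' first, then numbered versions, newest first.

    Assumes a hypothetical layout of <base>/latest/<file> and <base>/<n>/<file>.
    """
    base = base_url.rstrip("/")
    urls = [f"{base}/latest/{filename}"]
    urls += [f"{base}/{version}/{filename}" for version in range(latest_version, 0, -1)]
    return urls


def load_with_fallback(
    urls: List[str], fetch: Callable[[str], str]
) -> Optional[Tuple[str, str]]:
    """Return (url, data) for the first URL that fetch() can load, or None if all fail.

    fetch is expected to raise an exception when the data cannot be
    downloaded or parsed; any such failure moves on to the next URL.
    """
    for url in urls:
        try:
            return url, fetch(url)
        except Exception:
            continue
    return None
```

A dashboard would pass in its own fetch function (one that downloads and validates the CSV) so that a newly uploaded file with errors falls through to the previous version.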

How data is currently structured

...

  1. Dataset containing data in xlsx and csv formats as separate resources eg. https://data.humdata.org/dataset/afghan-voluntary-repatriation

  2. Dataset with rolling updates of resource (ie. dataset end date should be DATE) eg. https://data.humdata.org/dataset/inso-key-data-dashboard and https://data.humdata.org/dataset/indonesia-monthly-humanitarian-update

  3. Dataset with metadata in resource eg. https://data.humdata.org/dataset/global-airports and https://data.humdata.org/dataset/drc-health-data (jpeg has graphical metadata)

  4. Dataset with tiff in a zip: https://data.humdata.org/dataset/malawi_national_vulnerability_index_2015 (note the 2015 in the url is incorrect as it is current)

  5. Dataset with pdfs, zips (on OneDrive and filestore), mbtiles, tiff : https://data.humdata.org/dataset/iom-npm-cox-bazar-uav-imagery

  6. Dataset with JSON feed, HXLated JSON feed and xlsx (from automated output): https://data.humdata.org/dataset/migrant-deaths-by-month

  7. Disaggregate by country into datasets and by indicator into resources eg. https://data.humdata.org/dataset/who-data-for-barbados

  8. Disaggregate by date into datasets eg. https://data.humdata.org/dataset/syria-idp-flow-and-returnee-data-october-2018 and https://data.humdata.org/dataset/syria-idp-flow-and-returnee-data-september-2018

  9. Disaggregate by date into resources within one dataset eg. https://data.humdata.org/dataset/nigeria-humanitarian-needs-overview

  10. Disaggregate by indicator into datasets eg. https://data.humdata.org/dataset/gender-development-index-female-to-male-ratio-of-hdi and https://data.humdata.org/dataset/population-in-severe-poverty-headcount

  11. Disaggregate by country into datasets and by date and region into resources eg. https://data.humdata.org/dataset/drc-displacement-data-baseline-assessment-iom-dtm

  12. Disaggregate by country into datasets and by round into resources eg. https://data.humdata.org/dataset/nigeria-baseline-data-iom-dtm

  13. Disaggregate by country and emergency into datasets and by round into resources eg. https://data.humdata.org/dataset/indonesia-displacement-data-sulawesi-earthquake-site-assessment-iom-dtm

  14. Map data for a country at different admin levels for various dates eg. https://data.humdata.org/dataset/administrative-boundaries-of-bangladesh-as-of-2015 (note the 2015 in the url is incorrect as it is current)

  15. Map and population data for a country with varying file formats and metadata in a pdf eg. https://data.humdata.org/dataset/bhutan-administrative-level-0-1-population-statistics

Simon is looking at how to identify data series.

Are we approaching the stage where we need to break down data by admin 1 rather than by country, so that users of HDX can search for data in the UI at that level? How do we make data available in many forms eg. by country, by indicator, by admin 1?

...

  • We need to define a list of file extensions that we regard as "tabular data" eg. csv, xls, xlsx etc.

  • We need to define a list of file extensions that we regard as "map data" eg. geojson, zipped shapefile etc.

  • This might be implemented using CKAN tag vocabularies as described above for tag cleanup

  • In the dataset edit UI, the resource name field should be labelled "Title of Resource"

  • It should not be prepopulated from the filename (as it is currently) because we want contributors to write a descriptive resource title

  • If there is a file extension, it should be used to prepopulate the "File type" field

  • If a file extension is provided, the "File type" should not be allowed to differ from it

  • If a file extension is not provided, the resource could be sniffed to try to guess the file type to prepopulate the field eg. using https://github.com/ahupp/python-magic

  • When adding a resource, the "File type" field needs to be locked down to a list of possible formats for the contributor to pick from

  • There should be a separate option to add a format not in the list

  • There should be an option to add more file types (ie. upload same data in different formats like csv, xls etc.)

  • One resource with multiple file types will translate into separate resources in dataset metadata

  • The resource metadata should have a new field containing the categorisation eg. category which could be "tabular", "map" or "auxiliary"

  • Once the "File type" is selected, the dataset edit UI should show how the resource was categorised:

    • "Tabular data"

    • "Map data"

    • "Auxiliary data"

  • The contributor should have the option to recategorise "Tabular data" or "Map data" to "Auxiliary data"

  • Should showcases just point to websites and visualisations, with pdfs being auxiliary resources?

  • The contributor should have the option to flag to us that "Auxiliary data" has been incorrectly categorised - this will alert us to new tabular or map file formats to add to our lists

  • Yumi's new design has resources in one tab and metadata in another. Building on this, the "Data and Resources" tab should be divided into the categories

  • The "Download" button should be replaced with the "File type" with a small download symbol

  • If more than one resource has the same "Title of Resource", but different "File types" (within the same category), they can be shown as one resource with more than one file type download button

  • Example: Missing Migrants     ↓csv  ↓xls
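As a rough illustration of the list above, the sketch below prefills a file type from a file extension, derives the proposed category from the file type, and groups resources that share a title and category so they can be shown as one entry with several download buttons. The two file type sets only contain the examples given in this page and would need to be replaced by the agreed lists. The function names are hypothetical, "category" is the proposed new metadata field rather than an existing one, and "name" and "format" are the standard CKAN resource fields.

```python
import os
from typing import Dict, List, Optional, Tuple

# Illustrative lists only: the real lists still need to be defined.
TABULAR_FILE_TYPES = {"csv", "xls", "xlsx"}
MAP_FILE_TYPES = {"geojson", "zipped shapefile"}


def prefill_file_type(filename: str) -> Optional[str]:
    """Suggest a "File type" from the file extension, or None if there is none."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    return extension or None


def categorise(file_type: str) -> str:
    """Map a file type to "tabular", "map" or "auxiliary"."""
    normalised = file_type.strip().lower()
    if normalised in TABULAR_FILE_TYPES:
        return "tabular"
    if normalised in MAP_FILE_TYPES:
        return "map"
    return "auxiliary"


def group_for_display(resources: List[dict]) -> Dict[Tuple[str, str], List[str]]:
    """Group resources by (category, title), collecting their file types.

    A category already stored on the resource (for example after the
    contributor recategorised it to "auxiliary") takes precedence over
    the one derived from the file type.
    """
    grouped: Dict[Tuple[str, str], List[str]] = {}
    for resource in resources:
        category = resource.get("category") or categorise(resource["format"])
        key = (category, resource["name"])
        grouped.setdefault(key, []).append(resource["format"].lower())
    return grouped
```

With this grouping, two resources titled "Missing Migrants" in csv and xls formats end up under a single ("tabular", "Missing Migrants") key holding both file types, which matches the example of one row with two download buttons.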

Proposed Solution for Handling Different Structures of data 

...