There are various enhancements to HDX that we could consider to improve the user experience, simplify the quality assurance work of Data Partnerships and support the development of dashboards and other visualisations. I want to document these enhancements here so that we can think about if and how they might fit into our development plans for HDX.
Problems
Issues that were already identified
The following issues have been identified:
archiving of "old" datasets (where data is from a long time ago, but is as up to date as it can be eg. ebola 2013) - this is done
tag cleanup - dataset tags should not be freeform, they should be from a fixed list with the facility to request to add more - some of this is done
fixing data URLs so that charts, reports etc. don't break - USAID have asked for this and there is a Jira Epic for it (the Fixed Data URLs Idea goes further than what USAID have asked for)
a workflow that tries to alert a contributor when an update to a resource they are making has unexpected field names, data type changes, etc. - USAID have asked for this and there is a Jira Epic for it
a system whereby automated users (and maybe normal users as well) can register to receive important information about a dataset they are using eg. a breaking change to the format, no longer being updated etc. - USAID have asked for this and there is a Jira Epic for it
we need to be able to distinguish data resources from auxiliary ones - helpful to DP's work on QAing datasets
distinguishing API resources from others
resources can't keep growing indefinitely - we need a way to split a resource once it grows beyond a certain size
keeping a history of data (versioning) - newly added data may contain errors so it may be helpful to be able to fall back to a previous version of the data eg. if a dashboard cannot load latest/xxx.csv, it could try 1/xxx.csv
a service whereby if a contributor uploads data in a particular format/structure that we specify, then the data is served out disaggregated in multiple ways
finding data by API needs to be simpler. Currently the limitation is the capability of the CKAN API search. It could be helped by adding more metadata into the dataset, for example the list of fields and HXL tags in the data
more generally a search tailored to what users want to search for eg. if a user types "wheat price kandahar" they would like to get back that price.
How data is currently structured
...
Are we approaching the stage where we need to break down data by admin 1 rather than country to enable users of HDX to be able to search for data in the UI at that level? How do we make data available in many forms eg. by country, by indicator, by admin 1?
...
Ideas for HDX
The solution ideas presented below were created with consideration for what is feasible given the restrictions of CKAN. The intention was to avoid overly complex solutions that might require forking CKAN to make fundamental changes to its architecture, and instead to try to come up with something relatively simple to implement given limited development capacity.
...
Tags metadata
Tag mappings from existing tags to desired ones and deletions of extraneous tags have been previously defined in a Google spreadsheet
That Google spreadsheet is already used by the HDX Python API library
There is already a ticket so that any edits made as a result of tag cleanup do not "touch" the datasets: HDX-6404
The locked down list of tags from the Google spreadsheet should be integrated into CKAN
The list of allowed tags should be put into a CKAN vocabulary eg. using the CKAN API call vocabulary_create
Now might be a good time to categorise tags into multiple vocabularies?
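The vocabulary_create step could be sketched as below. This is a hypothetical example against the CKAN action API: the API key, vocabulary name and tag list are placeholders, and in practice the HDX Python API library may be the better client.

```python
# Sketch of creating a locked-down tag vocabulary via the CKAN action API.
# CKAN_URL, API_KEY and the example tags are placeholders.
import json
from urllib.request import Request, urlopen

CKAN_URL = "https://data.humdata.org"
API_KEY = "replace-with-a-sysadmin-api-key"


def build_vocabulary_payload(name, tags):
    """Build the JSON body expected by CKAN's vocabulary_create action."""
    return {"name": name, "tags": [{"name": t} for t in tags]}


def vocabulary_create(name, tags):
    """POST the vocabulary to CKAN and return the created vocabulary dict."""
    request = Request(
        f"{CKAN_URL}/api/3/action/vocabulary_create",
        data=json.dumps(build_vocabulary_payload(name, tags)).encode("utf-8"),
        headers={"Authorization": API_KEY, "Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        return json.loads(response.read())["result"]
```

If tags were split into multiple categories, each category would be a separate vocabulary_create call.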
The way to use the tag vocabulary with the metadata field is described in the CKAN documentation for schemas (I think)
On the dataset edit page, when a contributor starts to type a tag, tag suggestions from the defined vocabulary(ies) should pop up
This might work using the CKAN call tag_autocomplete
Or maybe comes directly from setting the metadata field to use a specific vocabulary(ies)
We should consider establishing a synonym schema so that if a user types "idp" in the tag search, the "internally displaced persons" tag will come up even though there are no matching strings
Or if they are trying to tag "creche", "schools" is suggested
An example to look at for UI is stackoverflow
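The synonym idea could work as a simple lookup alongside the normal substring match. A minimal sketch, assuming a hand-maintained mapping from colloquial terms to canonical tag names (the mapping contents here are illustrative):

```python
def suggest_tags(query, tags, synonyms):
    """Return vocabulary tags whose name contains the query, plus any
    canonical tag mapped from the query in the synonym table."""
    q = query.lower().strip()
    matches = [t for t in tags if q in t.lower()]
    synonym_hit = synonyms.get(q)
    if synonym_hit and synonym_hit not in matches:
        matches.append(synonym_hit)
    return matches


# Illustrative synonym table; the real one would be curated by DP.
SYNONYMS = {"idp": "internally displaced persons", "creche": "schools"}
```

So a search for "idp" surfaces "internally displaced persons" even though the strings share no substring.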
Create a process for suggesting new tags - if a contributor types a tag name that is not in the vocabulary(ies), then:
The contributor should be alerted that the tag is not in the existing list
The contributor should be shown the full list of tags and advised to check if there is an existing tag that matches
If we have categorised tags into multiple vocabularies, that would be quite helpful for the contributor to narrow down their search
If there is none that matches the contributor should be able to click a button to apply for a new tag
If categories are being used, they will need to specify the category?
Should we allow the tag (ie. auto add it) with DP having the option to reject it?
or should the dataset be created minus the tag with DP having to add it later if they approve it?
In either case, the contributor should be allowed to continue creating the dataset
On submit (when all the information about the dataset will be available), DP should receive a mail containing the:
organisation name
contributor username and email
dataset title
tags and proposed tag(s)
There needs to be a way for DP to add, edit and delete tags
Initially this can be by script
In the long run if it happens often that new tags are needed, then there may be some value in either:
making a UI
updating the CKAN tags and vocabulary(ies) from a Google Spreadsheet (which may already have been done in one of the earlier steps to create the locked down list)
The CKAN calls package_create or package_update should fail with an appropriate error message if tags are specified that aren't in the vocabulary(ies) - again this might be a byproduct of defining a vocabulary(ies) on a metadata field
Once this is in place, there needs to be a check to see if any new tags have appeared since the tag cleanup spreadsheet was last worked on
Demo should get the current prod database
The tag cleanup script should be updated to use whatever parameter (or username or whatever) was created so that it does not touch datasets
The tag cleanup script should be run against demo
If all goes well, the tag cleanup should be run against prod
...
Detecting Breaking Changes to Resources
If the generalised solution above is implemented, then according to the expected update frequency:
Compare the data in the resource as of now with the data in the backed up version in the filestore if there is one
If the headers have changed, alert the contributor
Can check other things too like data types
This probably requires alerting by email
This would be an external process like the freshness emailer
This will be very IO intensive
If the filestore solution above is implemented (which is my preferred option), then:
Contributor in the dataset edit UI updates a resource
If there is backup, then compare with it
If the headers have changed, display a warning prompt of some sort
Can check other things too like data types
Some of the above is being looked at for USAID
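The header comparison step could be sketched as below. This is a minimal sketch assuming CSV resources; the function and warning messages are hypothetical, not an existing HDX API:

```python
import csv


def read_headers(path):
    """Read the header row of a CSV resource."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f), [])


def compare_headers(backup, current):
    """Compare the backed-up header row with the freshly updated one and
    return human-readable warnings (empty list means no breaking change)."""
    warnings = [f"column removed: {c}" for c in backup if c not in current]
    warnings += [f"column added: {c}" for c in current if c not in backup]
    if not warnings and backup != current:
        warnings.append("column order changed")
    return warnings
```

The same pattern would extend to data type checks by sampling a few rows from each version.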
Proposed High Level Solution for Users Registering Interest in Datasets
A user views a dataset on HDX
They should have the option to click a button to register for updates about a dataset
The “Follow” option should also be available through API
They should be able to select what updates they are interested in:
Metadata changes
Tabular or Map Data changes (either filestore or freshness-detected changes)
Dataset being "Reviewed"
Breaking changes (which depends upon which solution is chosen under "Detecting Breaking Changes to Resources" above) - this is being looked at for USAID
They should perhaps be able to select whether to get emails and the frequency of emails (but this will complicate the logic)
For USAID, this is being looked at for filestore resources: HDX-8280
Users Registering Interest in Datasets
While it is possible to “Follow” datasets in the UI, that option should also be available through API
They should be able to select what updates they are interested in:
Metadata changes
Tabular or Map Data changes (assuming the later data categorisation idea is implemented)
Dataset being "Reviewed"
Breaking changes (USAID: HDX-8281)
They should perhaps be able to select whether to get emails and the frequency of emails (but this will complicate the logic)
Whenever the dataset is updated and the update corresponds to what they registered for, they should
get an email immediately, or the change should be recorded for sending in a periodic email
Proposed Solution for Data and Auxiliary Resources
...
We need to define a list of file extensions that we regard as "tabular data" eg. csv, xls, xlsx etc.
...
We need to define a list of file extensions that we regard as "map data" eg. geojson, zipped shapefile etc.
...
This might be implemented using CKAN tag vocabularies as described above for tag cleanup
...
In the dataset edit UI, the resource name field should be labelled "Title of Resource"
...
It should not be prepopulated from the filename (as it is currently) because we want contributors to write a descriptive resource title
...
If there is a file extension, it should be used to prepopulate the "File type" field
...
The "File type" should not be allowed to be different to any provided file extension
...
If a file extension is not provided, the resource could be sniffed to try to guess the file type to prepopulate the field eg. using https://github.com/ahupp/python-magic
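The prepopulation logic could be sketched as below: prefer the extension, and only fall back to content sniffing with python-magic when no extension is available. This is a sketch; the behaviour when python-magic is absent is an assumption.

```python
import os


def guess_file_type(filename, sniff=False):
    """Guess the "File type" field value from the filename extension;
    optionally fall back to content sniffing with python-magic
    (https://github.com/ahupp/python-magic) if it is installed."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext:
        return ext
    if sniff:
        try:
            import magic
            return magic.from_file(filename, mime=True)
        except ImportError:
            pass  # sniffing unavailable; leave the field for the contributor
    return None
```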
...
When adding a resource, the "File type" field needs to be locked down to a list of possible formats for the contributor to pick from
...
There should be a separate option to add a format not in the list
...
There should be an option to add more file types (ie. upload same data in different formats like csv, xls etc.)
...
One resource with multiple file types will translate into separate resources in dataset metadata
...
The resource metadata should have a new field containing the categorisation eg. category which could be "tabular", "map" or "auxiliary"
...
Once the "File type" is selected, the dataset edit UI should show how the resource was categorised:
"Tabular data"
"Map data"
"Auxiliary data"
...
The contributor should have the option to recategorise "Tabular data" or "Map data" to "Auxiliary data"
...
Showcases should just point to websites and visualisations with pdfs being auxiliary resources?
...
The contributor should have the option to flag to us that "Auxiliary data" has been incorrectly categorised - this will alert us to new tabular or map file formats to add to our lists
...
Yumi's new design has resources in one tab and metadata in another. Building on this, the "Data and Resources" tab should be divided into the categories
...
Keeping a History of Data
This is about versioning of files on HDX
Assuming we limit it to the filestore, it may be easiest to leverage AWS versioning
If AWS isn’t an option, a solution could be as follows:
For filestore resources, backup when contributor uses dataset edit dialog to update a resource
For external urls, backup according to the expected update frequency into the filestore
Do we ask if contributor wishes to keep a backup? No backup means just do as now (ie. overwrite the resource)
Backups are created in new resource groups
Once there are too many resources, archive the old ones
This will require a lot of storage
Some resources may be too big to keep backing up so regularly
Improving Search
The metadata field that HDX Connect uses should be added to all datasets (field_names)
A new metadata field should be added to all datasets: hxl_hashtags
In the dataset edit dialog when a contributor adds a resource:
Its headers and any HXL tags should be scanned
The headers should prepopulate the Field Names text field
The contributor can edit the text field
The HXL Hashtags should prepopulate the HXL Hashtags (uneditable?) text field
CKAN search should be made to use the field_names and hxl_hashtags dataset fields by default
The resource_groups field in the dataset should also be included in CKAN search
See also “Idea for a Data Cube” later in this page
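The resource scan above could be sketched as follows. This assumes the HDX convention that HXL hashtags sit in the row after the headers; the function name is illustrative:

```python
def scan_headers_and_hxl(rows):
    """Given the first rows of a tabular resource, return (field_names,
    hxl_hashtags) for prepopulating the new dataset metadata fields.
    The second row is treated as HXL tags only if every non-empty cell
    starts with '#'."""
    rows = list(rows)
    field_names = rows[0] if rows else []
    hxl_hashtags = []
    if len(rows) > 1 and all(c.startswith("#") for c in rows[1] if c):
        hxl_hashtags = [c for c in rows[1] if c]
    return field_names, hxl_hashtags
```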
Standardising and Categorising Resources
We could categorise resources into Data and Auxiliary resources and then further subdivide into say “Tabular data”, “Geodata” and “Auxiliary data” in the UI
The "Data and Resources" tab should be divided into the categories
We need to define a list of file extensions that we regard as "tabular data" eg. csv, xls, xlsx etc.
We need to define a list of file extensions that we regard as "geodata" eg. geojson, zipped shapefile etc.
In the dataset edit UI, the resource name field should be relabelled "Title of Resource"
It should not be prepopulated from the filename (as it is currently) because we want contributors to write a descriptive resource title
If there is a file extension, it should be used to prepopulate the "File type" field
The "File type" should not be allowed to be different to any provided file extension
If a file extension is not provided, the resource could be sniffed to try to guess the file type to prepopulate the field eg. using https://github.com/ahupp/python-magic
When adding a resource, the "File type" field needs to be locked down to a list of possible formats for the contributor to pick from
Consider adding an option to suggest a new format not in the list
For one resource, there should be an option to add more file types (ie. upload same data in different formats like csv, xls etc.)
One resource with multiple file types would likely translate into separate resources in dataset metadata
The "Download" button should be replaced with the "File type" with a small download symbol
If more than one resource has the same "Title of Resource", but different "File types" (within the same category), they can be shown as one resource with more than one file type download buttons
Example: Missing Migrants
Proposed Solution for Handling Different Structures of data
We want to move contributors away from a completely freeform experience of structuring data without discouraging them.
We need to introduce the concept of "Resource Groups"
Resource groups allow resources to be grouped in multiple ways, for example:
By date
By round
The logic for resource groups should enable versioning
Resource groups can have the categories tabular, map and auxiliary within them
There should be a resource metadata field "group"
There should be a dataset metadata field "latest_resource_group"
All resource groups should be maintained at the dataset level in a field eg. "resource_groups" which could be a comma separated list for example
In the dataset view UI, if there is only one group then the "Data and Resources" tab should be displayed as now
If there are multiple resource groups, then the "Data and Resources" tab should have within it either tabs:
Sub-tabs for each group "Resources - groupname" in the order in the dataset's "resource_groups" field
The default one shown should be the one specified in the dataset "latest_resource_group"
Or an alternative to tabs within tabs are folders/shortcuts which may be more intuitive:
The resources of the default group (as specified in the dataset "latest_resource_group") should be shown
Below these should be folder icons for each group "Resources - groupname" in the order in the dataset's "resource_groups" field (excluding the "latest_resource_group")
Categorisation in the dataset edit UI:
The resource metadata should have a new field containing the categorisation eg. category which could be "tabular", "geo" or "auxiliary"
Once the "File type" is selected, the dataset edit UI should show how the resource was categorised:
"Tabular data"
"Geodata"
"Auxiliary data"
The contributor should have the option to recategorise "Tabular data" or "Geodata" to "Auxiliary data"
Showcases should just point to websites and visualisations with pdfs being auxiliary resources?
The contributor should have the option to flag to us that "Auxiliary data" has been incorrectly categorised - this will alert us to new tabular or map file formats to add to our lists
Handling Different Structures of data
We want to move contributors away from a completely freeform experience of structuring data without discouraging them. I need to see how this fits with Simon’s ongoing work on data series:
We need to introduce the concept of "Resource Groups"
Resource groups allow resources to be grouped in multiple ways, for example:
By date
By round
The logic for resource groups should enable versioning
Resource groups can have the categories tabular, map and auxiliary within them
There should be a resource metadata field "group"
There should be a dataset metadata field "latest_resource_group"
All resource groups should be maintained at the dataset level in a field eg. "resource_groups" which could be a comma separated list for example
In the dataset view UI, if there is only one group then the "Data and Resources" tab should be displayed as now
If there are multiple resource groups, then the "Data and Resources" tab should have within it either tabs:
Sub-tabs for each group "Resources - groupname" in the order in the dataset's "resource_groups" field
The default one shown should be the one specified in the dataset "latest_resource_group"
Or an alternative to tabs within tabs are folders/shortcuts which may be more intuitive:
The resources of the default group (as specified in the dataset "latest_resource_group") should be shown
Below these should be folder icons for each group "Resources - groupname" in the order in the dataset's "resource_groups" field (excluding the "latest_resource_group")
Double clicking a folder will change the resources shown to those in the resource group corresponding to the folder
In the dataset edit UI, there should either be tabs:
Tabs at the top showing any resource groups that already exist in the order in the dataset's "resource_groups" field
The first and default tab should be what is in latest_resource_group
The resources shown should be whatever are in the resource group specified in latest_resource_group
A "+" tab should allow the creation of a new resource group
Double clicking in an existing tab should allow renaming a resource group
Clicking a not selected resource group tab should cause the resources shown to change to whatever are in that resource group
The tabs should be draggable to change their order which will cause the dataset's "resource_groups" field to update
Or folders/shortcuts:
The resources shown should be whatever are in the resource group specified in latest_resource_group
Below these should be folder icons for any resource groups that already exist in the order in the dataset's "resource_groups" field
A folder with "+" in the icon should allow the creation of a new resource group
Single clicking on an existing folder should allow renaming a resource group
Double clicking a resource group folder should cause the resources shown to change to whatever are in that resource group
The folders should be draggable to change their order which will cause the dataset's "resource_groups" field to update
Each resource should have an option (button with dropdown?) to move it to another resource group
There needs to be an explanation of what a "Resource Group" is and what it is used for on the left
There should be a field "Latest Resource Group" with a dropdown to select from all the created groups
See also Proposed Solution for Improving Search below
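The proposed metadata fields above could be sketched on plain dataset dicts. This is a sketch of the display-order logic only; the field names "resource_groups" (comma separated) and "latest_resource_group" are the ones proposed above, not existing CKAN fields:

```python
def group_display_order(dataset):
    """Return the latest resource group first, followed by the remaining
    groups in their stored order (i.e. the default tab/folder ordering)."""
    groups = [g.strip()
              for g in dataset.get("resource_groups", "").split(",")
              if g.strip()]
    latest = dataset.get("latest_resource_group")
    if not latest:
        return groups
    return [latest] + [g for g in groups if g != latest]
```

Dragging tabs or folders would simply rewrite the "resource_groups" string in the new order.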
...
Fixing URLs
A simpler version of this is being done for USAID whereby there are URLs for resource positions in the dataset.
Based upon the groups and categories outlined in ideas above, the resources should be made available under the urls:
https://data.humdata.org/dataset/mydataset/groupname/tabular/
https://data.humdata.org/dataset/mydataset/groupname/map/
https://data.humdata.org/dataset/mydataset/groupname/auxiliary/
The descriptive resource title should be slugified and used as the resource filename
The extension should be taken from the "File type" field
Example: "Title of Resource" = "Missing Migrants", "File type" = "xlsx", group = "2018" → https://data.humdata.org/dataset/mydataset/2018/tabular/missing-migrants.xlsx
If "2018" were set as the latest group, the same file would be accessible from this url → https://data.humdata.org/dataset/mydataset/latest/tabular/missing-migrants.xlsx
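The slugification and URL construction could be sketched as below. The URL scheme follows the proposed examples above; it is not an existing HDX endpoint:

```python
import re


def slugify(title):
    """Slugify a resource title for use as the fixed-URL filename."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def fixed_url(dataset, group, category, title, file_type):
    """Build the proposed fixed resource URL from group, category,
    slugified title and the "File type" field."""
    return (f"https://data.humdata.org/dataset/{dataset}/{group}/"
            f"{category}/{slugify(title)}.{file_type}")
```

Serving the same resource under both its group name and "latest" would then be a routing concern: "latest" resolves to whatever "latest_resource_group" currently holds.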
...
Archiving Resources
We should determine a sensible maximum number of resources in the UI
When a dataset grows to have that maximum and the contributor tries to add more, they should be prompted to archive or delete resources
The performance issue with datasets that have many resources needs to be examined
If it can be resolved:
Each resource should have a new metadata field "archived"
The dataset edit UI should have an archive resource button (or if we have resource groups, on each resource group)
This should prompt the contributor to select the resource(s) (or resource group(s)) to be archived
For each resource (in the resource group(s)), the archived flag is set to True
Archived resources are not shown by default in the dataset view
If it can't be resolved and performance is an issue:
The dataset edit UI should have an archive resource group button
This should prompt the contributor to select the resource group(s) to be archived
On clicking next, a new dataset should be created with only the resource group(s) that were selected
The metadata for the dataset should be copied from the original
The archived metadata field should be set to True
The dataset title should have "(archived on ISODATETIME)" appended
The original dataset should have the resource groups (and underlying resources) removed from it
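The fallback archiving flow above could be sketched on plain metadata dicts. This is a sketch under the assumption that each resource carries the proposed "group" field; actual implementation would go through package_create/package_update:

```python
from datetime import datetime, timezone


def split_archive(dataset, groups_to_archive):
    """Copy the dataset metadata into a new archived dataset holding only
    the selected resource groups, and strip those groups from the original.
    Returns the new archived dataset dict; mutates the original's resources."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    archived = dict(dataset)  # shallow copy of the metadata
    archived["title"] = f"{dataset['title']} (archived on {stamp})"
    archived["archived"] = True
    archived["resources"] = [r for r in dataset["resources"]
                             if r.get("group") in groups_to_archive]
    dataset["resources"] = [r for r in dataset["resources"]
                            if r.get("group") not in groups_to_archive]
    return archived
```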
...
Excessively Large Resources
We need to define how big is reasonable (can we?)
When a filestore resource reaches that size:
In the dataset edit UI, the contributor should be prompted
The prompt should contain information or link to info on ways to disaggregate data into Resource Groups
The contributor can either go back to the edit UI (file will not be added as resource)
Or ignore and continue (file will be included as resource)
We could look at the HTTP Content-Length header for remote resources and do the same?
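The remote check could be sketched as below. The size threshold is a placeholder to be agreed, and the headers would come from something like a HEAD request (e.g. requests.head(url, allow_redirects=True).headers); Content-Length is optional, so the check must tolerate its absence:

```python
MAX_RESOURCE_BYTES = 1_500_000_000  # placeholder threshold, to be agreed


def is_too_large(headers, limit=MAX_RESOURCE_BYTES):
    """Decide from HTTP response headers whether a remote resource exceeds
    the size limit; returns None when Content-Length is absent or invalid,
    meaning we cannot tell without downloading."""
    length = headers.get("Content-Length")
    if length is None or not str(length).isdigit():
        return None
    return int(length) > limit
```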
Proposed High Level Solutions for Keeping a History of Data
First decision to make is if we only try to solve versioning of files in the filestore as solving it for external URLs is much harder.
A generalised solution for both filestore and external URLs is:
Backup according to the expected update frequency into the filestore
Backups are created in new resource groups
Once there are too many resources, the archiving above is applied
This will require a lot of storage
Some resources may be too big to keep backing up so regularly
This can be done as an external process like freshness
A solution for filestore only (which is my preferred option) is:
Contributor in the dataset edit UI updates a resource
Do we ask if contributor wishes to keep a backup?
No backup means just do as now (ie. overwrite the resource)
Backup involves creating a resource group with the current resource group name + "_backup_ISODATETIME"
The resource being updated should be assigned to this resource group
The new resource should be assigned to the original resource group
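The backup step above could be sketched on plain resource dicts. The naming scheme follows the "_backup_ISODATETIME" convention proposed above; the default group name is an assumption:

```python
from datetime import datetime, timezone


def backup_group_name(current_group):
    """Name of the resource group that receives the old copy of a resource
    when a contributor overwrites it."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    return f"{current_group}_backup_{stamp}"


def backup_resource(old_resource, new_resource):
    """Move the old resource into a freshly named backup group and assign
    the replacement to the original group."""
    group = old_resource.get("group", "latest")  # assumed default group
    old_resource["group"] = backup_group_name(group)
    new_resource["group"] = group
    return old_resource, new_resource
```

A dashboard falling back from latest/xxx.csv could then walk the backup groups in reverse timestamp order.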
Proposed Solution for HDX Curation
If HDX wishes to create new curated version(s) of an existing resource(s) in a dataset:
HDX create a new resource group within the dataset eg. "curated"
HDX add curated resource(s) under that group
URL would be something like: https://data.humdata.org/dataset/mydataset/curated/tabular/missing-migrants.csv
Proposed High Level Solution for Discouraging Dated datasets
We want to discourage dated datasets as this leads to new datasets being created for each new update and inconsistent urls
We should have a check in the dataset create dialog for a date (eg. a year) being put into a title
Some sort of dialog should appear that guides the user away from doing this (by informing them about Resource Groups within datasets)
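The title check could be as simple as a regex for a four-digit year. A minimal sketch (the 1900-2099 range is an assumption about what counts as a date):

```python
import re

# Matches a standalone four-digit year between 1900 and 2099.
YEAR_PATTERN = re.compile(r"\b(19|20)\d{2}\b")


def looks_dated(title):
    """Heuristic used by the dataset create dialog to warn when a
    contributor puts a date into a dataset title."""
    return bool(YEAR_PATTERN.search(title))
```

Month names or full dates could be added to the pattern later if year detection alone proves too narrow.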
Proposed Solution for Queryable Resources
Queryable resources are ones where if you add some parameters to the URL you can filter or transform the returned file.
Resources should have a field "is_queryable" which can be True or False
Resources should have a field "queryable_desc_url" which is a link to a webpage that describes the parameters a user can add to the URL to filter or transform the data
There should be an explanation of what queryable means in the dataset edit dialog
There should be a checkbox in the dataset edit dialog to mark a resource as queryable
There should be a url field for adding a link to the webpage that describes the parameters
Centre Curated Resources
The Centre may wish to create new curated version(s) of an existing resource(s) in a dataset:
HDX create a new resource group within the dataset eg. "curated"
HDX add curated resource(s) under that group
URL would be something like: https://data.humdata.org/dataset/mydataset/curated/tabular/missing-migrants.csv
The curated resource is appropriately highlighted in the UI
Discouraging Dated datasets
We want to discourage dated datasets as this leads to new datasets being created for each new update and inconsistent urls
We should have a check in the dataset create dialog for a date (eg. a year) being put into a title
Some sort of dialog should appear that guides the user away from doing this (by informing them about Resource Groups within datasets)
Blue Sky Ideas
The ideas below would require technological leaps and involve creating new systems outside of CKAN (that could still link back into CKAN). They require research and discussion to flesh out.
Meta Service/API that links to other APIs
The idea is to have a meta service/API that would link to other services/APIs and allow the easy download of data given a standard set of input parameters. On top of such a meta service/API would be a user interface which would allow setting of those parameters and download by non technical users. Using the "wheat price kandahar" example, I could imagine a meta service/API user choosing parameters like "service": "prices", "type": "commodities", "provider": "WFP", "country": "Afghanistan", "adm1": "Kandahar", "Commodity": "wheat", "date": "08/03/2022".
Could this power the HDX UI allowing searches like "wheat price kandahar" to produce helpful results?
Add meta service/API resources as “queryable” resources in HDX - ones where you can add some parameters to filter or transform the returned file:
Resources should have a field "is_queryable" which can be True or False
Resources should have a field "queryable_desc_url" which is a link to a webpage that describes the parameters a user can add to the URL to filter or transform the data
There should be an explanation of what queryable means in the dataset edit dialog
There should be a checkbox in the dataset edit dialog to mark a resource as queryable
There should be a url field for adding a link to a webpage that describes the parameters a user can add to the URL to filter or transform the data
Queryable resources should be flagged in the dataset view and search UIs (work has been done on this)
The link to the webpage describing the parameters should be next to or below the queryable resource
Proposed Solution for Improving Search
The metadata field that HDX Connect uses should be added to all datasets (field_names)
A new metadata field should be added to all datasets: hxl_hashtags
In the dataset edit dialog when a contributor adds a resource:
Its headers and any HXL tags should be scanned
The headers should prepopulate the Field Names text field
The contributor can edit the text field
The HXL Hashtags should prepopulate the HXL Hashtags (uneditable?) text field
CKAN search should be made to use the field_names and hxl_hashtags dataset fields by default
The resource_groups field in the dataset should also be included in CKAN search
See also “Idea for a Data Cube” later in this page
Proposed High Level Solution for an Automated Data Disaggregation Service
The service should reside in the Tools domain
It should be available from the HDX add dataset dialog, perhaps as a third type?
Aggregated data should be provided in standardised form with HXL hashtags (to be defined)
The service potentially generates multiple datasets on HDX
General metadata that will be used for all generated datasets should be added using a UI similar to the add public dataset UI on HDX
The contributor can select to have the full aggregated dataset put into HDX
The service looks at the HXL hashtags
It determines the columns which are suitable for disaggregation by looking at the HXL hashtags
It offers them as suggestions to the contributor?
If the contributor selects #country, or we just disaggregate along every suitable column we detect:
It splits the dataset by country, creating a dataset per country in HDX
The metadata for each dataset is based on the general metadata
It will need to add country information into the dataset title etc.
A similar process can be applied for #indicator etc.
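The split step could be sketched as below, assuming the standardised form has headers in the first row and HXL hashtags in the second. The function is illustrative; a real implementation would likely use the HXL Python library:

```python
def disaggregate(rows, hxl_tag):
    """Split an HXLated table into one table per value of the column
    carrying the given HXL hashtag (e.g. one table per #country).
    rows[0] is the header row, rows[1] is the HXL tag row."""
    headers, tags = rows[0], rows[1]
    col = tags.index(hxl_tag)
    split = {}
    for row in rows[2:]:
        # Each output table keeps the header and HXL tag rows on top.
        split.setdefault(row[col], [headers, tags]).append(row)
    return split
```

Each resulting table would become its own HDX dataset, with the country (or indicator) value injected into the title and metadata.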
Ideas
Idea for a Meta Service/API that links to other APIs
The idea is to have a meta service/API that would link to other services/APIs and allow the easy download of data given a standard set of input parameters. On top of such a meta service/API would be a user interface which would allow setting of those parameters and download by non technical users. Using the "wheat price kandahar" example, I could imagine a meta service/API user choosing parameters like "service": "prices", "type": "commodities", "provider": "WFP", "country": "Afghanistan", "adm1": "Kandahar", "Commodity": "wheat", "date": "08/03/2022".
Could this power the UI allowing searches like "wheat price kandahar"?
Idea for a Data Cube
...
Data Cube
Typically we make data available in different forms, like by indicator or by country, by making separate datasets with their own data. This makes accessing the data fast but complicates scrapers, which must do the breakdowns, so mostly data is just broken down by country on HDX. A data cube enables data to be modelled and viewed in multiple dimensions. Is there a way that data could be stored in some sort of data cube so that it can be viewed in different ways like by country, admin 1 or indicator without keeping multiple copies of the data?
The data cube should be available from the HDX add dataset dialog, perhaps as a third type?
Aggregated data should be provided in standardised form with HXL hashtags (to be defined)
The service potentially generates multiple datasets on HDX
General metadata that will be used for all generated datasets should be added using a UI similar to the add public dataset UI on HDX
The contributor can select to have the full aggregated dataset put into HDX?
The service looks at the HXL hashtags
It determines the columns which are suitable for disaggregation by looking at the HXL hashtags
It offers them as suggestions to the contributor?
If the contributor selects #country or we just disaggregate along every suitable column we detect:
It splits the dataset by country, creating a dataset per country in HDX pointing back to the cube data
The metadata for each dataset is based on the general metadata
It will need to add country information into the dataset title etc.
A similar process can be applied for #indicator etc.