Freshness is exposed in the interface by a green leaf symbol, which indicates that a dataset is up to date: there has been an update to the metadata or the data in the dataset within the expected update frequency plus some leeway. In producing this document, I have examined whether our definition of freshness makes sense and looked at how users react to it. In particular, I have identified some cases where the freshness process needs adjustment to avoid misleading users. Below I outline the most pervasive problems with our freshness feature and then propose a solution that includes renaming and clearly defining Date of Dataset, adding a new Last Modified metadata field for datasets and resources, and three options for presenting freshness in the UI.
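
As a rough illustration of the definition above, the check can be sketched as follows. This is a minimal sketch, not the actual freshness implementation: the function name, the parameters and the leeway value in the example are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def is_fresh(last_update, now, update_frequency_days, leeway_days):
    """Return True if the last update (to data or metadata) falls within
    the expected update frequency plus the leeway, counted back from now.

    All parameter names are hypothetical; the real leeway is not given here.
    """
    window = timedelta(days=update_frequency_days + leeway_days)
    return now - last_update <= window


# A monthly dataset (30 days) with an assumed 7 day leeway.
print(is_fresh(datetime(2019, 1, 1), datetime(2019, 1, 20), 30, 7))
```

Under these assumptions, the dataset keeps its green leaf until 37 days have passed without any update.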

...

The following issues were found prior to the start of this investigation:

  • Exclude "Live", "As Needed" and "Never" datasets from the "no touch if already fresh" rule - DONE
  • Discount edits made by HDX (as these edits cause datasets to be marked as fresh)
  • Restrict which metadata changes count as updates for freshness
  • Offer an "archived" icon in addition to "fresh" to indicate a dataset that is old, up-to-date, and no longer being updated. At the moment these are being called fresh, which is technically true, but tends to present a lot of old data to users.
  • Date of Dataset is used in different ways (captured later in this document)

Discount edits made by HDX

Edits by HDX staff are typically made to fix issues and have no bearing on how up to date the data is, so freshness should ignore them. However, we need to consider what to do about edits to datasets that are maintained by HDX.

The edits that have been performed on a dataset can be seen by looking at package_revision_list. One complication is that we must go through the history of edits because someone outside HDX could make an update, followed not long after by someone in HDX. A naive implementation that looks only at the most recent edit would miss the first edit, which should count towards freshness.
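
Walking the edit history could look something like the sketch below. It assumes each revision is a dictionary with a "timestamp" (ISO 8601 string) and an "author" key, and that HDX staff can be recognised from a known set of user names. Both the record shape and the hdx_authors set are assumptions for illustration, not a statement of what package_revision_list returns.

```python
from datetime import datetime


def latest_external_edit(revisions, hdx_authors):
    """Return the time of the most recent edit not made by HDX staff,
    or None if every edit in the history was made by HDX.

    revisions: list of dicts with assumed keys "timestamp" and "author".
    hdx_authors: hypothetical set of HDX staff user names.
    """
    external = [
        datetime.fromisoformat(rev["timestamp"])
        for rev in revisions
        if rev["author"] not in hdx_authors
    ]
    # Scan the whole history rather than only the newest entry, so that an
    # HDX edit made shortly after an external one does not hide it.
    return max(external) if external else None


history = [
    {"timestamp": "2019-03-02T10:00:00", "author": "hdx_staff"},
    {"timestamp": "2019-03-01T09:00:00", "author": "maintainer"},
]
print(latest_external_edit(history, {"hdx_staff"}))
```

Here the HDX edit on 2 March is skipped and the maintainer's edit on 1 March is the one that would count towards freshness.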

Restrict which metadata changes count as updates for freshness

Currently any dataset metadata change counts as an update from a freshness perspective. Our assumption is:

  • Such changes are taken as signifying that the dataset maintainer has thought about the data and checked it
  • If they had newer data, then we would expect them to put it into HDX while updating the dataset metadata
  • The fact they haven't means the data is as up to date as possible

This proposal limits our assumption to certain fields - it becomes:

  • Changes in certain metadata fields are taken as signifying that the dataset maintainer has thought about the data and checked it
  • If they had newer data, then we would expect them to put it into HDX while updating these specific dataset metadata fields
  • The fact they haven't means the data is as up to date as possible

The fields chosen should be those that directly affect the underlying data or the freshness calculation:

  • Expected update frequency
  • Dataset date
  • Location?
  • Source?

Note that if the number of fields is severely limited, this may render discounting edits by HDX unnecessary.
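
A restricted check of this kind could be sketched as below. The field keys are illustrative stand-ins for the candidate fields listed above and are not necessarily the real metadata keys; the function name is also hypothetical.

```python
# Hypothetical keys standing in for the candidate fields listed above.
FRESHNESS_FIELDS = (
    "expected_update_frequency",
    "dataset_date",
    "location",
    "source",
)


def counts_as_update(old_metadata, new_metadata, fields=FRESHNESS_FIELDS):
    """Return True only if one of the selected fields differs between the
    two versions of the dataset metadata."""
    return any(old_metadata.get(f) != new_metadata.get(f) for f in fields)


before = {"description": "Old text", "dataset_date": "01/01/2019"}
after_description_edit = {"description": "New text", "dataset_date": "01/01/2019"}
after_date_edit = {"description": "Old text", "dataset_date": "01/02/2019"}

print(counts_as_update(before, after_description_edit))
print(counts_as_update(before, after_date_edit))
```

With this sketch, a description edit is ignored while a dataset date edit counts as an update.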

Points to consider:

  • Expected update frequency is used to calculate freshness, but then if someone changes it from yearly to monthly, that doesn't indicate anything about the data having changed. If the dataset was delinquent with yearly update frequency, it should still be delinquent with monthly.
  • Why should someone changing the dataset description be any less of an update from a freshness perspective than changing the dataset date?
  • There doesn't seem to be a compelling reason to do a partial restriction of metadata changes counting for freshness - it's really all or nothing:
    • either we regard any metadata change as someone indicating that the data is as fresh as it can be (as we originally envisaged)
    • or we simply disregard metadata changes altogether from determination of freshness and rely solely on data changes - note that detecting file store changes specifically would need to be investigated

Offer an "archived" icon in addition to "fresh"

The data in some datasets refers to or covers a date or date period which is far in the past, but the data itself is as up to date as it could be and will not be updated again. For these cases, it makes sense to offer an archived icon instead of fresh (which would be the icon used at present for an expected update frequency of "never"). 
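
One way the icon choice could work is sketched below. The one year threshold for "far in the past", the function name and the parameters are assumptions made for illustration only.

```python
from datetime import date, timedelta


def choose_icon(update_frequency, fresh, coverage_end, today,
                old_after=timedelta(days=365)):
    """Return "archived", "fresh" or None (no icon).

    coverage_end is the last date the data covers; old_after is an
    assumed threshold for treating that date as far in the past.
    """
    if not fresh:
        return None
    is_old = today - coverage_end > old_after
    if update_frequency == "never" and is_old:
        return "archived"
    return "fresh"


print(choose_icon("never", True, date(2010, 1, 1), date(2019, 1, 1)))
```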

Date of Dataset is used in different ways

...

More on point 1 below in Confusing concepts related to Date of Dataset.

Discovering Other Issues

To discover other possible issues with how freshness is understood, the following strategy was applied:

  1. Take a random sample of datasets ensuring that among them are fresh, due, overdue, and delinquent datasets and that they represent a cross section of different organisations' datasets
  2. Evaluate what fresh and not fresh mean
  3. Determine if it is clear to users
  4. Collect any cases where the fresh label (or lack of it) is misleading
  5. Categorise misleading cases

With an overview of the misleading cases, we can consider how to change the terminology we use, such as fresh and not fresh, so that it accounts for those cases and is clear to users.

Misleading cases

The misleading cases are documented in the Google spreadsheet here, and the resources for those datasets were all frozen and stored in GitHub for further analysis. From the full analysis, a subset of specific example cases was picked and coloured in red.

...

Confusing concepts related to freshness

The following are possible dates freshness could use:

  • The date or date period that the data in the dataset covers
  • The date the data in the dataset was last modified
    • Was the update significant or minor?
  • The date the metadata of the dataset was last modified
    • Was the change significant or relevant to any dates we report?

...

  1. Rename "Date of Dataset" to "Date Coverage" and keep the current intended usage of indicating the date or date period that the data in the dataset covers (which can be a single moment in time, as for a 3W). Does this account for all cases or does "Date Coverage" not make sense for some datasets?
  2. The underlying "Date Coverage" metadata field needs to allow the date or end date to be the current day (rolling forwards each day), eg. by allowing the value "DATE". Data that is added to with each update can then be given a fixed start date and a floating end date, and a download of live current data can just have a floating date. Maybe it is better to take this opportunity to split the dataset_date field into two fields for the start and end rather than changing the existing field?
  3. Use the last_modified metadata field on resources - I tested it and it indicates when the data was updated, not the resource's metadata. Add a new last_modified field to the dataset metadata. The latest of the resource level last_modified fields should be automatically copied to the dataset level last_modified field, but not the other way round, ie. changing the dataset level last_modified field should not affect the resource level last_modified fields. The dataset list/search UI should show this new field rather than the metadata_modified field it currently shows, and the new field should be added to the dataset page. Freshness will need to be modified to set this field instead of touching resources.
  4. Introduce the concept of "Reviewed" (or "Data is up to date"?) by having a new button in the contributor's (not users') UI, both inside and outside the dataset form, which the maintainer of the dataset or organisation administrator can click to indicate they have reviewed the dataset's data and agree it is as up to date as it can be. When the pointer hovers over the "Reviewed" button, a popup could ask the contributor to ensure the "Date Coverage" field is correct before clicking the button.
  5. Rather than introduce another new metadata field for the concept of "Reviewed", the dataset level last_modified metadata field can be updated when the "Reviewed" button is clicked (regardless of whether any resource's data has actually been modified). Since we also have the resource level last_modified fields, we can determine whether the dataset has been reviewed or its data has actually changed. Freshness will need to check this dataset level field.
  6. The "Dataset Created" field already exists in the metadata
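
The relationship between the resource level and dataset level last_modified fields in points 3 and 5 can be sketched as follows. This is a minimal sketch over plain dictionaries, not CKAN or HDX code: the function names are hypothetical, and copying upwards only when a resource is newer (so that a later review is not overwritten) is an assumption about how the two points would combine.

```python
from datetime import datetime


def latest_resource_modified(dataset):
    """Latest last_modified across the dataset's resources, or None."""
    return max((r["last_modified"] for r in dataset["resources"]), default=None)


def sync_dataset_last_modified(dataset):
    """Copy the latest resource last_modified up to the dataset level.
    Nothing is ever copied down from the dataset to its resources."""
    latest = latest_resource_modified(dataset)
    current = dataset.get("last_modified")
    if latest is not None and (current is None or latest > current):
        dataset["last_modified"] = latest
    return dataset


def mark_reviewed(dataset, now):
    """Clicking "Reviewed" updates only the dataset level field."""
    dataset["last_modified"] = now
    return dataset


def reviewed_without_data_change(dataset):
    """True if the dataset level date is later than every resource date,
    ie. the dataset was reviewed but no resource data changed since."""
    latest = latest_resource_modified(dataset)
    return latest is not None and dataset["last_modified"] > latest


ds = {"resources": [{"last_modified": datetime(2019, 1, 1)},
                    {"last_modified": datetime(2019, 2, 1)}]}
sync_dataset_last_modified(ds)
print(ds["last_modified"], reviewed_without_data_change(ds))
mark_reviewed(ds, datetime(2019, 3, 1))
print(ds["last_modified"], reviewed_without_data_change(ds))
```

After the sync, the dataset carries the date of its most recently changed resource. After the review, the dataset date moves forward while the resource dates stay where they were, which is what lets freshness tell a review apart from a data change.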

...