...
Read the dataset last modified date from HDX (which updates if the data of any resource changes) and the resource last modified dates (each of which updates when that resource's data changes, and so should always fall within the dataset last modified date).
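As a rough sketch, reading those dates with the hdx-python-api library might look like the following; the dataset id is invented, and the Configuration import path differs between library versions:

```python
from hdx.api.configuration import Configuration
from hdx.data.dataset import Dataset

# One-off setup for read-only access to HDX.
Configuration.create(hdx_site="prod", user_agent="freshness-example",
                     hdx_read_only=True)

# 'metadata_modified' (dataset level) and 'last_modified' (resource level)
# are standard CKAN fields.
dataset = Dataset.read_from_hdx("some-dataset-name")  # hypothetical id
print(dataset["metadata_modified"])
for resource in dataset.get_resources():
    print(resource["id"], resource.get("last_modified"))
```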
Load the last modified dates from the previous run and, where a previous date is more recent, use it instead.
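A minimal sketch of that merge, with hypothetical names:

```python
from datetime import datetime
from typing import Dict

def merge_last_modified(current: Dict[str, datetime],
                        previous: Dict[str, datetime]) -> Dict[str, datetime]:
    """Keep the more recent last modified date for each resource id."""
    merged = dict(current)
    for resource_id, prev_date in previous.items():
        if resource_id not in merged or prev_date > merged[resource_id]:
            merged[resource_id] = prev_date
    return merged
```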
Check the dataset's resources. If any have not been hashed for 30 days, and the count of datasets left unhashed within 30 days is still below 1/30 of the number of resources, then mark the resources to be hashed; the 1/30 cap spreads the hashing workload over roughly a month.
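That rationing decision could be expressed like this (names and structure are illustrative, not the project's actual code):

```python
def should_mark_for_hashing(days_since_hash: int,
                            datasets_unhashed_this_period: int,
                            total_resources: int) -> bool:
    """Decide whether a dataset's resources should be queued for hashing.

    Hashing is rationed so that only about 1/30 of resources are hashed
    per day, spreading the load over a month. (Hypothetical helper.)
    """
    return (days_since_hash >= 30 and
            datasets_unhashed_this_period < total_resources / 30)
```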
Get the dataset's update frequency if it has one. If that update frequency is Never (-1 in the API), Live (0) or As Needed (-2, previously called Adhoc), then the dataset is always fresh.
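In code, those special values might be captured as follows (the constant and function names are illustrative):

```python
# Special update frequency values from the HDX API; any of these means
# the dataset is treated as always fresh.
ALWAYS_FRESH_FREQUENCIES = {
    -1,  # Never
    0,   # Live
    -2,  # As Needed
}

def is_always_fresh(update_frequency: int) -> bool:
    return update_frequency in ALWAYS_FRESH_FREQUENCIES
```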
If the dataset is not fresh based on its metadata, then examine the urls of its resources. If they are internal urls (data.humdata.org - the HDX filestore, manage.hdx.rwlabs.org - CPS), then no further checking can be done, because when the files pointed to by these urls update, the HDX metadata is updated too.

If the url is externally hosted, open an HTTP GET request to the file and check the returned header for the Last-Modified field. If that field exists, read the date and time from it, and if it is more recent than the resource last modified date, replace it.

If the resource is still not fresh by this measure, then download the file and calculate an MD5 hash for it.
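A minimal sketch of those two checks, assuming the requests library and the standard-library hashlib; the function names and timeout value are illustrative, not the project's actual code:

```python
import hashlib
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional

import requests

def get_last_modified(url: str) -> Optional[datetime]:
    """Return the Last-Modified header of a url as a datetime, if present."""
    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()
    header = response.headers.get("Last-Modified")
    response.close()
    if header is None:
        return None
    return parsedate_to_datetime(header)

def md5_of_url(url: str) -> str:
    """Download a file in chunks and return its MD5 hash as a hex string."""
    md5 = hashlib.md5()
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=65536):
            md5.update(chunk)
    return md5.hexdigest()
```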
Check if the hash has changed compared with the hash from the previous run stored in the database.
There are some resources where the hash changes constantly because they connect to an api which generates a file on the fly. To identify these, download and hash again, and check whether the hash changes in the few seconds since the previous hash calculation. If it does, freshness cannot be determined.
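One way to express that check, reusing the hypothetical md5_of_url helper sketched above; the return labels mirror the categories in the metrics further down:

```python
import time

def classify_hash_change(url: str, stored_hash: str) -> str:
    """Classify a resource as 'same hash', 'hash' (genuinely changed) or
    'api' (changes on every download, so freshness is undeterminable).

    Illustrative function, not the project's actual code.
    """
    first_hash = md5_of_url(url)
    if first_hash == stored_hash:
        return "same hash"
    # Hash differs: download again a few seconds later to see whether
    # the file is generated on the fly by an api.
    time.sleep(2)
    second_hash = md5_of_url(url)
    if second_hash != first_hash:
        return "api"
    return "hash"
```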
If the hash has changed since the last run, but does not change constantly (api), store the hash and replace the resource last modified date with the current date (now).
Record the dataset and resource last modified dates.
Calculate freshness by measuring the difference between the current date and the last modified date, comparing it with the update frequency, and setting a status: fresh, due, overdue or delinquent.
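For example, a simplified classifier along those lines; the overdue and delinquent thresholds here are assumptions, as the real tolerances depend on the update frequency:

```python
from datetime import datetime

def freshness_status(last_modified: datetime,
                     update_frequency_days: int,
                     now: datetime,
                     overdue_factor: float = 1.5,      # assumed threshold
                     delinquent_factor: float = 2.0    # assumed threshold
                     ) -> str:
    """Classify a dataset as fresh, due, overdue or delinquent.

    A dataset becomes due once its age reaches the update frequency;
    the later thresholds sketched here are illustrative guesses.
    """
    age_days = (now - last_modified).days
    if age_days < update_frequency_days:
        return "fresh"
    if age_days < update_frequency_days * overdue_factor:
        return "due"
    if age_days < update_frequency_days * delinquent_factor:
        return "overdue"
    return "delinquent"
```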
...
It produces some simple metrics, eg. from a first production run (empty database):
*** Resources ***
* total: 149308 *,
api: 14,
api,error: 46,
error: 100,
filestore: 2088,
filestore,error: 1,
filestore,hash: 2,
filestore,same hash: 5,
firstrun: 1,
hash: 33,
hash,error: 2,
internal-filestore: 9,
internal-firstrun: 6,
internal-nothing: 7788,
nothing: 138926,
repeat hash: 1,
same hash: 266,
same hash,error: 20
*** Datasets ***
* total: 22160 *,
0: Fresh, Updated filestore: 4,
0: Fresh, Updated filestore,error: 1,
0: Fresh, Updated filestore,filestore,hash: 2,
0: Fresh, Updated filestore,hash: 2,
0: Fresh, Updated filestore,script update: 414,
0: Fresh, Updated firstrun: 1,
0: Fresh, Updated hash: 22,
...
eg. a second run one day later:
...
0: Fresh, Updated nothing: 16015,
0: Fresh, Updated nothing,error: 19,
0: Fresh, Updated script update: 165,
1: Due, Updated filestore: 2,
1: Due, Updated nothing: 259,
1: Due, Updated nothing,error: 27,
2: Overdue, Updated nothing: 26,
2: Overdue, Updated nothing,error: 1,
3: Delinquent, Updated nothing: 5166,
3: Delinquent, Updated nothing,error: 24,
Freshness Unavailable, Updated no resources: 10
1248 datasets have update frequency of Live
3978 datasets have update frequency of Never
2163 datasets have update frequency of As Needed
For more detailed analysis, the database it builds can be queried, eg.
...