Check that the values from the .csv files do indeed conform to the stated rules
The rules to which the values copied in from the .csv
files must conform are described in the section Inspect the COVIDcast .csv files. The form of each rule is the same for each of the three files. But sometimes, specific values that the rules govern are specific to the file. Each of the rules is tested, using the assert
construct, in a single PL/pgSQL stored procedure, created by the script cr-assert-assumptions-ok.sql
. The tests share these common declarations:
-- Each survey date (i.e. "time_value") has exactly the same number of states (i.e. "geo_value").
-- Each state has the same number, of survey date values.
expected_states constant text[] not null := array[
'ak', 'al', 'ar', 'az', 'ca', 'co', 'ct', 'dc', 'de', 'fl', 'ga',
'hi', 'ia', 'id', 'il', 'in', 'ks', 'ky', 'la', 'ma', 'md',
'me', 'mi', 'mn', 'mo', 'ms', 'mt', 'nc', 'nd', 'ne', 'nh',
'nj', 'nm', 'nv', 'ny', 'oh', 'ok', 'or', 'pa', 'ri', 'sc',
'sd', 'tn', 'tx', 'ut', 'va', 'vt', 'wa', 'wi', 'wv', 'wy'
];
expected_state_count constant int := cardinality(expected_states);
actual_states_qry constant text not null :=
'select array_agg(distinct geo_value order by geo_value) from ?';
actual_states text[] not null := '{}';
expected_dates date[] not null := array[start_survey_date];
actual_dates_qry constant text not null :=
'select array_agg(distinct time_value order by time_value) from ?';
actual_dates date[] not null := '{}';
expected_date_count int not null := 0;
names constant covidcast_names[] not null := (
select array_agg((csv_file, staging_table, signal)::covidcast_names) from covidcast_names);
expected_total_count int not null := 0;
r covidcast_names not null := ('', '', '');
d date not null := start_survey_date;
t text not null := '';
n int not null := 0;
b boolean not null := false;
Tests applied in turn to each staging table
The first several tests are applied, in turn, to each staging table. It is therefore convenient to implement them using dynamic SQL in a loop that iterates over the elements in the "names" array, thus:
foreach r in array names loop
...
end loop;
The "signal" column
Check that every value in the "signal" column is the same and that it is equal to the value that is specific to the source csv
file's name.
execute replace('select distinct signal from ?', '?', r.staging_table) into t;
assert t = r.signal, 'signal from '||r.staging_table||' <> "'||r.signal||'"';
The "geo_type" column
Check that every value in the "geo_type" column is the same and that it is equal to 'state':
execute 'select distinct geo_type from '||r.staging_table into t;
assert t = 'state', 'geo_type from '||r.staging_table||' <> "state"';
The "data_source" column
Check that every value in the "data_source" column is the same and that it is equal to 'fb-survey':
execute 'select distinct data_source from '||r.staging_table into t;
assert t = 'fb-survey', 'data_source from '||r.staging_table||' <> "fb-survey"';
The "direction" column
Check that every value in the "direction" columns NULL
:
execute $$select distinct coalesce(direction, '<null>') from $$||r.staging_table into t;
assert t = '<null>', 'direction from '||r.staging_table||' <> "<null>"';
Expected total count(*)
Check that the total number of rows is equal to "(number of distinct states)*(number of distinct survey dates)":
execute 'select count(*) from '||r.staging_table into n;
assert n = expected_total_count, 'count from '||r.staging_table||' <> expected_total_count';
The "geo_value" column
Check that there are 51 distinct values of "geo_value", corresponding to the 51 actual states (including DC):
execute replace(actual_states_qry, '?', r.staging_table) into actual_states;
assert actual_states = expected_states, 'actual_states <> expected_states';
execute 'select count(distinct geo_value) from '||r.staging_table into n;
assert n = expected_state_count, 'distinct state count per survey date from '||r.staging_table||' <> expected_state_count';
The "time_value" column
Check that there are 50 distinct values of "time_value", corresponding to the values in the dense sequence that the download range, "2020-09-13 - 2020-11-01", specified:
execute replace(actual_dates_qry, '?', r.staging_table) into actual_dates;
assert actual_dates = expected_dates, 'actual_dates <> expected_dates';
execute 'select count(distinct time_value) from '||r.staging_table into n;
assert n = expected_date_count, 'distinct survey date count per state from '||r.staging_table||' <> expected_date_count';
Check that each survey date has the same number of states
Because this is a SELECT DISTINCT ... INTO
query, it will cause an error if the number of states per survey date isn't the same for each date.
execute '
with r as (
select time_value, count(time_value) as n from '||r.staging_table||'
group by time_value)
select distinct n from r' into n;
assert n = expected_state_count, 'distinct state count from '||r.staging_table||' <> expected_state_count';
Check that each state has the same number of survey dates
Because this is a SELECT DISTINCT ... INTO
query, it will cause an error if the number of survey dates per state isn't the same for each date.
execute '
with r as (
select geo_value, count(geo_value) as n from '||r.staging_table||'
group by geo_value)
select distinct n from r' into n;
assert n = expected_survey_date_count, 'distinct state count from '||r.staging_table||' <> expected_survey_date_count';
The "value" column
Check that the "value" column is a number in the range 0..100.
execute '
select
max(value) between 0 and 100 and
min(value) between 0 and 100
from '||r.staging_table into b;
assert b, 'max(value), min(value) from '||r.staging_table||' both < 100 FALSE';
Tests applied in a single check across all three staging tables
These tests need to check that sets of values in each of the three staging tables are the same. They therefore use the UNION ALL
of EXCEPT
queries pattern to perform the test in a single SELECT
that is expected to produce no rows.
Check the one-to-one correspondence between "code" and "geo_code"
Each state has a unique numeric "code". The following SELECT
confirms that this is the case. The "code" column can therefore be safely left out of the final "covidcast_fb_survey_results" table.
declare
chk_code_and_geo_values constant text := $$
with
a1 as (
select to_char(code, '90')||' '||geo_value as v from ?1),
v1 as (
select v, count(v) as n from a1 group by v),
a2 as (
select to_char(code, '90')||' '||geo_value as v from ?2),
v2 as (
select v, count(v) as n from a2 group by v),
a3 as (
select to_char(code, '90')||' '||geo_value as v from ?3),
v3 as (
select v, count(v) as n from a3 group by v),
v4 as (select v, n from v1 except select v, n from v2),
v5 as (select v, n from v2 except select v, n from v1),
v6 as (select v, n from v1 except select v, n from v3),
v7 as (select v, n from v3 except select v, n from v1),
r as (
select v, n from v4
union all
select v, n from v5
union all
select v, n from v6
union all
select v, n from v6)
select count(*) from r$$;
begin
execute replace(replace(replace(chk_code_and_geo_values,
'?1', names[1].staging_table),
'?2', names[2].staging_table),
'?3', names[3].staging_table
) into n;
assert n = 0, '(code, geo_value) tuples from the three staging tables disagree';
end;
Check that each staging table has the same set of (geo_value, time_value) primary key tuples
declare
chk_putative_pks constant text := '
with
v1 as (
select geo_value, time_value from ?1
except
select geo_value, time_value from ?2),
v2 as (
select geo_value, time_value from ?2
except
select geo_value, time_value from ?1),
v3 as (
select geo_value, time_value from ?1
except
select geo_value, time_value from ?3),
v4 as (
select geo_value, time_value from ?3
except
select geo_value, time_value from ?1),
v5 as (
select geo_value, time_value from v1
union all
select geo_value, time_value from v2
union all
select geo_value, time_value from v3
union all
select geo_value, time_value from v4)
select count(*) from v5';
begin
execute replace(replace(replace(chk_putative_pks,
'?1', names[1].staging_table),
'?2', names[2].staging_table),
'?3', names[3].staging_table)
into n;
assert n = 0, 'pk values from ' ||
replace(replace(replace('?1, ?2, ?3',
'?1', names[1].staging_table),
'?2', names[2].staging_table),
'?3', names[3].staging_table) ||
' do not line up';
end;
Create the assert_assumptions_ok() procedure and invoke it manually
Copy the "cr_assert_assumptions_ok()" code and paste it into ysqlsh
. Then do this:
call assert_assumptions_ok(
start_survey_date => to_date('2020-09-13', 'yyyy-mm-dd'),
end_survey_date => to_date('2020-11-01', 'yyyy-mm-dd'));
It will finish silently. You can provoke and assert
failure by invoking it with, say:
end_survey_date => to_date('2020-11-02', 'yyyy-mm-dd')
The error is reported thus:
count from cmnty_symptoms <> expected_total_count
The invocation is included in ingest-the-data.sql
.