Eris: Measuring Discord Among Multidimensional Data Sources (Extended Version)


Abstract. Data integration is a classical problem in databases, typically decomposed into schema matching, entity matching and record merging. To solve the latter, it is mostly assumed that ground truth can be determined, either as master data or from user feedback. However, in many cases this is not possible, because the merging processes cannot be accurate enough, and because the data gathering processes in the different sources are simply imperfect and cannot provide high quality data. Instead of enforcing consistency, we propose to evaluate how concordant or discordant sources are as a measure of trustworthiness (the more discordant the sources are, the less we can trust their data).



Thus, we define the discord measurement problem, in which, given a set of uncertain raw observations or aggregate results (such as case/hospitalization/death data relevant to COVID-19) and information on the alignment of different data (for example, cases and deaths), we wish to assess whether the different sources are concordant, or if not, measure how discordant they are. We also define a set of algebraic operators to describe the alignments, together with two alternative relational implementations that reduce the problem to linear or quadratic programming. These are evaluated against both COVID-19 and synthetic data, and our experimental results show that discordancy measurement can be performed efficiently in realistic situations.






1. Introduction

The United Nations Sustainable Development Goals (SDGs; https://www.un.org/sustainabledevelopment) are 17 goals with 169 targets that all UN Member States have agreed to work towards achieving by the year 2030. For example, the third SDG aims at good health and well-being, and target 3.3 specifically aims at ending the epidemics of different communicable diseases. Thus, the World Health Organization (WHO) and its department of Neglected Tropical Diseases (NTD) have aligned the new road map 2021-2030 with these goals by aiming at the eradication of two NTDs, and the elimination and control of the 18 others, by 2030 (World Health Organization, 2020). Disease eradication requires engaging in a multi-year administrative process comprising different phases of control and elimination (overall, around 900 independent processes of verification of an NTD in a specific country). The WHO declaration of a country or region being free of a disease (or of its transmission) first requires that different actors provide reliable and consistent data, regularly gathered in the corresponding regions for a given period of time. An obvious prerequisite for such a declaration is having high quality data coming from multiple, independent sources that corroborate that the disease is under control. Unfortunately, such idealized, concordant data is not the norm, and it takes great effort to produce. Instead, data sources are often discordant for a variety of reasons. For example, different actors often report measures at different granularities that can only be compared after aggregation or cleaning. On doing this, even if the aggregation performed is correct, it could happen that, due to reporting mistakes, mereological discrepancies, or incompleteness, the indicator for the whole (e.g., cases at country level) differs from the aggregation of the indicators of its parts (e.g., states, districts), if these come from a different source (or even from the same source). This challenge is well illustrated by COVID-19 data, where missing, incomplete, or inconsistent numbers have been blamed for bad decision making and unnecessary suffering (Meyer and Madrigal, 2021).



Scientists often analyse data by placing different indicators (e.g., number of patients or number of deaths) in a multidimensional space (e.g., geography and time). The multidimensional model and OLAP tools (Abelló and Romero, 2018) have been used for this purpose over the last 30 years, as a more powerful and structured alternative to spreadsheets. However, despite these being mature technologies, problems like managing missing and contradictory information are still not solved.



OLAP tools are typically used in data warehousing environments where consistent and well known data go through a well structured cleaning and integration process merging different sources. Nevertheless, in the wild, sources are typically incomplete and not well aligned, and such data cleaning and integration processes are far from trivial, resulting in imperfect comparisons. Like in the parable of the blind men describing an elephant after touching different parts of its body (i.e., touching the trunk, it is like a thick snake; the leg, like a tree stump; the ear, like a sheath of leather; the tail tip, like a furry mouse; etc.), in many areas like epidemiology, different data sources reflect the same reality in slightly different and partial ways.



Thus, in such a complex context, it is necessary to have a tool that measures discrepancies in the available data. Indeed, (Baikousi et al., 2011) and (Golfarelli and Turricchia, 2014) measure the differences in the descriptive multidimensional data and their structure. Instead, we aim at evaluating the reliability of the numerical indicators, given some required alignment declaration (e.g., aggregation or scale correction). At this point, it is important to highlight that, even if some work like (Oukid et al., 2016) proposes to treat textual data as indicators (allowing them to be aggregated too), we restrict ourselves to numerical measures, whose discrepancies cannot be evaluated using string similarity metrics like the ones surveyed in (Yu et al., 2016). These would rather be part of a preliminary step of entity matching over the dimensional descriptors.



Contributions.



Incomplete information is typically handled in relational databases by using NULL values. However, it is well known that NULLs are overloaded with different meanings such as nonexisting, unknown or no-information (Abiteboul et al., 1995). Thus, we propose to use NULL only for nonexisting or no-information, and to enrich the data model with symbolic variables that allow us to represent the partial knowledge we might have about unknown numerical values. While using symbolic variables for NULLs is not a new idea, introduced for example in classical models for incomplete information such as c-tables and v-tables (Imielinski and Jr., 1984) and more recently in data cleaning systems such as LLUNATIC (Geerts et al., 2013), our approach generalizes unknowns to arbitrary (linear) expressions, which in the end defines a setting for evaluating the trustworthiness of different sources of multidimensional data based on their concordancy/discordancy using standard linear or quadratic programming solvers. More concretely, in this paper we contribute:



(1) A definition and formalization of the problem of discord measurement of databases under some merging processes.



(2) A set of algebraic operations to describe merging processes, which allow us to manage multidimensional tables with symbolic numerical expressions.



(3) A coalescing operator that generates concordancy constraints over symbolic tables.



(4) A prototype, Eris, of our approach (Eris is the Greek goddess of discord), and an experimental comparison of two alternative implementations of the algebra using a relational DBMS (PostgreSQL) and an off-the-shelf quadratic programming solver (OSQP (Stellato et al., 2020)).



(5) An analysis of discordances in the epidemiological surveillance systems of six European countries during the COVID-19 pandemic based on EuroStats and Johns Hopkins University (JHU) data.



Organization.



Section 2 presents a motivational example that helps to identify the problem formally defined in Section 3, which also presents our solution based on an algebraic query language for symbolic evaluation. Section 4 details two alternative relational implementations, which are then evaluated in Section 5. Our experimental results show that both approaches provide acceptable performance, and illustrate the value of our approach for assessing the discordance of COVID-19 epidemiological surveillance data at different times and in different countries between March 2020 and February 2021. The paper concludes with the related work and conclusions in Sections 6 and 7.



2. Running example



Unfortunately, there is insufficient publicly available data regarding NTDs. Thus, we used COVID-19 data in our experiments and examples, which is widely available and of varying quality, making it a good candidate for discordance evaluation. We consider the scenario where a network of actors (i.e., governmental institutions) take primary measurements of COVID-19 cases and derive some aggregates from those. We illustrate how to model the scenario using database schemas and views, and describe the different problems we need to solve in this scenario.



Example 2.1.



The statistical institute of Panem (a fictional country) generates census reports on the weekly excess of deaths (assumed attributable to COVID-19) in the country. Since we are just considering Panem, we can model this information using a table (primary key underlined) indicating the number declared per week:



Census(week, deaths)



We would have greater trust in multiple consistent reports of the same quantity if they had been obtained independently; if they all came from a single source, we would be more skeptical. Therefore, in an epidemiological verification process, we gather different complementary sources of information providing surrogates or approximations to the desired measurements.



Example 2.2.



Suppose that Panem, as depicted in Fig. 1, comprises thirteen districts ($D_I, \ldots, D_{XIII}$). In each district, there are several hospitals, and a person living in $D_i$ is monitored by at most one hospital. Hospitals report their number of cases of COVID-19 to their district governments, and each district government reports to the Ministry of Health (MoH).



Given their management autonomy, the different districts in Panem use different and imperfect monitoring mechanisms and report separately the COVID-19 cases they detect every week. Despite the data being gathered at health facilities, Panem reports to the Centre for Disease Prevention and Control (CDC) only partial information at the district level, together with the overall information of the country. We can model this using relational tables with the weekly district and country information.



ReportedDistrict(district, week, cases)



ReportedCountry(week, cases)



In an idealized setting, we would expect to know all the relationships and have consistent measurements for each primary attribute, and each derived result would be computed exactly with no error. However, some relationships may be unknown and both primary and derived attributes can be noisy, biased, unknown or otherwise imperfect.



Example 2.3.



The following view aggregates the district-level cases for each week, which should coincide with the values per country:



CREATE VIEW AggReported AS



SELECT week, SUM(cases) AS cases



FROM ReportedDistrict GROUP BY week;



Moreover, it is well known that COVID-19 mortality depends on the age distribution of the population, but let us assume an average Case-Fatality Ratio (CFR) of 1.5%. In terms of SQL, we would have the following view, which estimates the number of deaths based on the number of reported cases in the country.



CREATE VIEW Inferred AS



SELECT week, 0.015*cases AS deaths



FROM ReportedCountry;



Example 2.4.



Ideally, if all COVID-19 cases were detected, and we knew the exact CFR as well as the effects of the pandemic in other causes of death, the week should unambiguously determine the number of cases and deaths (i.e., information derived from reported cases, both at district and country levels, and mortality in the census must coincide). In terms of SQL, these constraints could be checked using assertions like the following.



CREATE ASSERTION SumOfCases CHECK (NOT EXISTS



(SELECT * FROM ReportedCountry r JOIN AggReported a



ON r.week=a.week WHERE r.cases<>a.cases));



CREATE ASSERTION NumberOfDeaths CHECK (NOT EXISTS



(SELECT * FROM Census c JOIN Inferred i



ON c.week=i.week WHERE c.deaths<>i.deaths));



Thus, we see that SQL already provides the required mechanisms to freely align the different sources and impose the coincidence of values. Nevertheless, as explained above, achieving exact consistency seems unlikely in any real setting. Indeed, using existing techniques it is possible to check consistency among data sources when there is no uncertainty, but in the presence of unknown NULL values or suspected errors in reported values, it is not straightforward to determine whether the various data sources are consistent with the expected relationships.



Example 2.5.



It is easy to see that the following database is not consistent with our view specification, in part because the cases of a district (i.e., $D_{XIII}$) are not reported, but also because the second assertion is violated (i.e., too many people died (20) compared to the number inferred from the reported cases and the CFR (only 15)).



ReportedDistrict(”I”,”2110W25”,75)



ReportedDistrict(”XII”,”2110W25”,75)



AggReported(”2110W25”,900)



ReportedCountry(”2110W25”,1000)



Inferred(”2110W25”,15)



Census(”2110W25”,20)



Consequently, we aim at extending DBMS functionalities for concordancy evaluation as a pragmatic alternative: simply quantifying how far away the data are from being consistent. Given on the one hand the queries and views specifying the expected behavior, and on the other the data corresponding to observations of some of the inputs, intermediate results, or (expected) outputs, is the observed numeric data complete and concordant considering the alignment specification? If there is missing data, can the existing datasets be extended to some complete instance that is concordant? Finally, how far from being fully consistent is the numerical data?



Given such an idealized scenario (specified by its schema and views) and a collection of actual observations (both primary and derived), we can still consider two different problems:



(1) Value estimation: estimate the values of numerical attributes of interest (e.g., the number of cases and deaths across Panem) that make the system consistent.



(2) Discord evaluation: evaluate how far the actual, discordant dataset is from an idealized concordant one.



Problem 1 is the well-studied statistical estimation problem. However, many sources behave as black boxes, and it can be very difficult to precisely quantify the uncertainty and underlying assumptions in many situations, especially where the interrelationships among different data sources are complex. Instead, we consider problem 2. Given a (probably incomplete but overlapping) set of instances, we assume only a merging process specification in the form of expectations about their alignment, expressed using database queries and views. Our goal in this paper is not to find a realistic estimate of the true values of unknown or uncertain data, but instead to quantify how close the data are to our expectations under the given alignment. It is important to clarify that while the approach we will adopt does produce estimates for the uncertain values as a side-effect, they are not guaranteed to have any statistical validity unless additional work is done to statistically characterize the sources of uncertainty, which we see as a separate problem.



Therefore, the key contribution of this paper is to show that both checking concordance and measuring discord can be done by augmenting the data model with symbolic expressions, and that this in turn can be done consistently and efficiently in an RDBMS with the right set of algebraic operations. We formalize this intuition in the next section, after which we will return to this example.



3. Problem formulation



Our goal is to define and measure the degree of discordance of different data sources with complementary multidimensional information, where uncertainty may arise from NULLs standing for unknown values, or from reported measurements that have some unknown error. In this section, we define a variant of relational algebra for queries over (sets of) finite maps represented as symbolic tables, formally define the concordance and discordance problems, and show that they can be solved by reduction to linear or quadratic programming, respectively.

We use the following notational conventions for tuples and relations: a tuple $t$ is a finite map from some set of attribute names $\{A_1, \ldots, A_n\}$ to data values $d \in D$. We use letters $K, U, V, W$, etc. to denote sets of attribute names, and sometimes write $U, V$ to indicate the union of disjoint attribute sets (i.e., $U \cup V$ when $U$ and $V$ are disjoint) or $U, A$ to indicate the addition of a new attribute $A$ to attribute set $U$ (i.e., $U \cup \{A\}$ provided $A \notin U$). We also write $U \backslash V$ for the difference of attribute sets. Data values $D$ include real numbers $\mathbb{R}$, and (as discussed below) value attributes are restricted to be real-valued. The domain of a tuple $\mathrm{dom}(t)$ is the set of attribute names it maps to some value. We write $t.A$ for the value associated with $A$ by $t$, and $t[U]$ for the tuple $t$ restricted to domain $U$. We write $t.A := d$ to indicate the tuple obtained by mapping $A$ to $d$ and mapping all other attributes $B \in \mathrm{dom}(t)$ to $t.B$. Note that $\mathrm{dom}(t.A := d) = \mathrm{dom}(t) \cup \{A\}$, and this operation is defined even if $A$ is not already mapped by $t$. Furthermore, if $V = \{B_1, \ldots, B_n\}$ is an attribute set and $u$ is a tuple with domain $U \supseteq V$, then we write $t.V := u$ as an abbreviation for $(\cdots(t.B_1 := u.B_1).B_2 := u.B_2 \cdots).B_n := u.B_n$, that is, for the result of (re)defining $t$ to match $u$ on the attributes from $V$. Finally, when the range of $t, u$ happens to be $\mathbb{R}$, that is, $t$ and $u$ are real-valued vectors, we write $t + u$ for the vector sum, $\alpha \cdot t$ for scalar multiplication by $\alpha \in \mathbb{R}$, and when $Z$ is a finite multiset of such real-valued vectors we write $\sum Z$ for their vector sum.



Definition 3.1 (Finite map schema).



A finite map schema $\Sigma$ is an assignment of relation names $R_1, \ldots, R_n$ to finite map signatures, which are written $K \triangleright V$, where $K$ and $V$ are disjoint. A relation matches signature $K \triangleright V$ if its attributes are $K \cup V$, the key-fields are elements of the data domain $D$, the value-fields are real numbers $\mathbb{R}$, and it satisfies the Functional Dependency (FD) $K \to V$. A database instance $I$ matches $\Sigma$ if each relation $R$ of $I$ matches the corresponding signature $\Sigma(R)$. We write $Inst(\Sigma)$ for the set of all instances of a schema $\Sigma$.
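To make this concrete, here is a minimal Python sketch (our own illustration, not part of the Eris prototype; Signature and matches are hypothetical names) of a finite map signature $K \triangleright V$ together with a check of the functional dependency $K \to V$ over a relation represented as a list of dicts:

from dataclasses import dataclass

@dataclass(frozen=True)
class Signature:
    key: tuple    # key-field names K
    value: tuple  # value-field names V (disjoint from K)

def matches(rows, sig):
    # True iff `rows` satisfies the FD K -> V
    seen = {}
    for t in rows:
        k = tuple(t[a] for a in sig.key)
        v = tuple(t[a] for a in sig.value)
        if seen.setdefault(k, v) != v:
            return False  # same key, different values
    return True

census = Signature(key=("week",), value=("deaths",))
assert matches([{"week": "2110W25", "deaths": 20}], census)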



Notice that discrepancies can only appear when we can have more than one tuple with the same key, but different values for some other attribute. Thus, finite map signatures are essentially relational schemas with a primary key annotation; that is, we write $R : K \triangleright V$ for what is conventionally written as $R(\underline{A_1, \ldots, A_m}, B_1, \ldots, B_n)$ when $K = \{A_1, \ldots, A_m\}$ and $V = \{B_1, \ldots, B_n\}$. However, as we now discuss, we will work in a setting where query results (and view schemas) also need to satisfy such constraints, and we now introduce a specialized query language and type system to preserve them.



We consider a variation of relational algebra over finite maps, whose type system ensures that the finite map property is preserved in any query result. The syntax is as follows:



$$\begin{array}{rcl}
c & ::= & A = B \;\mid\; A < B \;\mid\; \neg c \;\mid\; c \wedge c' \\
e & ::= & \alpha \in \mathbb{R} \;\mid\; A \;\mid\; e + e' \;\mid\; e - e' \;\mid\; e \cdot e' \;\mid\; e / e' \\
q & ::= & R \;\mid\; \sigma_c(q) \;\mid\; \hat{\pi}_W(q) \;\mid\; q \Join q' \;\mid\; q \uplus_B q' \;\mid\; q \backslash q' \;\mid\; \rho_{B \mapsto B'}(q) \;\mid\; \varepsilon_{B := e}(q) \;\mid\; \gamma_{K;V}(q)
\end{array}$$



Queries $q$ include relation names, selections, projections, set operations and joins. The type system restricts selection criteria to consider only predicates over key values that evaluate to booleans $\mathbb{B} = \{\mathit{true}, \mathit{false}\}$, and only allows projection-away of value-fields; this is to ensure the results of these operations are still finite maps, and to avoid comparisons of values which may be symbolic. Likewise, the arguments to a join may have overlapping key-fields but not shared value-fields, to avoid symbolic value comparisons. We do not allow arbitrary unions because the union of two finite maps is not a finite map if the domains overlap; however, we do allow discriminated unions, which adjoin a new key-field $B$ that distinguishes the tuples coming from the first or second argument. We include a difference operator $R \backslash S$ that removes the keys in $S$ from $R$, and a standard renaming operator $\rho_{B \mapsto B'}$. Finally, to accommodate arithmetic comparisons and aggregation, we include operators $\varepsilon_{B := e}(R)$ and $\gamma_{K;V}(R)$; the former adds a new value-field $B$, which is initialized by evaluating expression $e$ using the field values in each row, and the latter performs grouping on key-fields $K$ and aggregation on value-fields $V$. The constraint that grouping can only be performed on key-fields and aggregation on value-fields is again needed to ensure that the results are still finite maps and to avoid comparisons of symbolic values. These restrictions will allow us to evaluate queries over symbolic values, while avoiding the need for some of the complexities encountered in c-tables (Imielinski and Jr., 1984) or work on provenance for aggregate queries (Amsterdamer et al., 2011).



3.1. Symbolic evaluation



The basic idea is to represent unknown real values with variables $x \in X$. Variables can occur multiple times in a table, or in different tables, representing the same unknown value, and more generally unknown values can be represented by (linear) expressions. However, key values $d \in D$ used in key-fields are required to be known. This reflects the assumption that the source database is partially closed (Fan and Geerts, 2009); that is, we assume the existence of master data for the keys (i.e., all potential keys are coincident and known).



Definition 3.2.



Let $X$ be some fixed set of variables. A symbolic expression $e$ over $X$ is a real-valued expression in $\mathbb{R}[X]$ (e.g., $a_0 + a_1 x_1 + \cdots + a_n x_n$).



Definition 3.3.



A symbolic table, or s-table (over $X$), $R : K \triangleright V$ is a table in which attributes from $K$ are mapped to discrete non-null values in $D$ (as before) and value attributes in $V$ are mapped to symbolic expressions in $\mathbb{R}[X]$.



We define the domain of an s-table, $\mathrm{dom}(R)$, to be the set of values of its key attributes. We say that an s-table is linear if each value attribute is a linear expression in the variables $X$.



An s-table is ground if all of the value entries are actually constants, i.e., if it is an ordinary database table (though still satisfying the functional dependency $K \to V$). The restrictions we have placed on symbolic tables are sufficient to ensure that symbolic evaluation is correct with respect to evaluation over ground tables, as we explain below.



We now clarify how real-world uncertain data (e.g., containing NULLs for unknown values instead of variables, or containing values that might be wrong) can be mapped to s-tables suitable as input to our approach.



Suppose we are given an ordinary database instance $I : \Sigma$, which may have missing values (i.e., NULLs) and uncertain values (i.e., reported values which we do not believe to be exactly correct). To use the s-table framework, we need to translate such instances to s-instances $I'$ that represent the set of possible complete database instances that match the observed data $I$.



To allow for the possibility that some reported values present in the data might be inaccurate, we replace such values with symbolic expressions containing fresh variables. This can be done in many ways, with different justifications based on the application domain. In this paper, we replace uncertain values $v$ with $v \cdot (1 + x)$ (or simply $x$ if $v = 0$), where $x$ is an error variable. The reason for doing this and not simply $v + x$ is that the weight we associate with an error variable $x$ is $x^2$, so the cost of errors is scaled to some extent by the magnitude of the value (e.g., it should be easier to turn 1,000,000 into 2,000,000 than 1 into 10). On the other hand, a natural way to handle NULLs in s-tables is to replace each NULL with a distinct variable with no associated cost. In particular scenarios, we might instead prefer to replace each NULL with an expression $c \cdot (1 + x)$ where $c$ is an educated guess as to the likely value, but in this paper we consider only the simple approach where a NULL is mapped to a variable.
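As a minimal sketch of this translation (our own illustration; lin and encode are hypothetical helpers, and the prototype instead performs this encoding inside the DBMS as described in Section 4), a linear expression $a_0 + a_1 x_1 + \cdots + a_n x_n$ can be represented as a constant together with a map from variable names to coefficients:

import itertools

_fresh = itertools.count()

def lin(const=0.0, **coeffs):
    # linear expression: constant plus {variable: coefficient}
    return (float(const), coeffs)

def encode(v):
    x = "x%d" % next(_fresh)
    if v is None or v == 0:  # NULL (or zero): a fresh variable x
        # (cost bookkeeping, zero for NULLs, is kept separately)
        return lin(0.0, **{x: 1.0})
    return lin(v, **{x: float(v)})  # uncertain v becomes v*(1+x) = v + v*x

print(encode(75))    # (75.0, {'x0': 75.0}), i.e., 75*(1+x0)
print(encode(None))  # (0.0, {'x1': 1.0})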



Example 3.4.



It is easy to see that there are many possibilities of assigning cases of COVID-19 to the different districts of Panem that add up to 1,000 per week, and consequently improve the consistency of our database. This may be easily represented by replacing constants with symbolic expressions "$75(1 + x_i)$", where $x_i$ is an error parameter representing that cases may be missed or overreported in every district. The cases for district $D_{XIII}$, which were not reported at all, can then be simply represented by a variable $x_{XIII}$. On the other hand, we also know that attributing all the excess deaths to COVID-19 involves some imprecision, so we should apply some error term $(1 + z)$ to the numbers coming from the census, too. Nevertheless, this may not completely explain the mismatch between cases reported at the country level and deaths, and there might also be some doubly-counted or hidden cases in Panem (for example in the Capitol, which is assumed not to have any cases), which we represent by the term $(1 + y)$. Therefore, the s-tables $ReportedDistrict : \{district, week\} \triangleright \{cases\}$, $ReportedCountry : \{week\} \triangleright \{cases\}$ and $Census : \{week\} \triangleright \{deaths\}$ would contain:





ReportedDistrict(”I”,”2110W25”,75*(1+x_I))

...

ReportedDistrict(”XII”,”2110W25”,75*(1+x_XII))

ReportedDistrict(”XIII”,”2110W25”,x_XIII)

ReportedCountry(”2110W25”,1000*(1+y))

Census(”2110W25”,20*(1+z))



We now make the semantics of our query operations over s-tables precise in Fig. 2. A key property of this semantics is that (for both ground and symbolic tables) the typing rules ensure that a well-formed query evaluated on a valid instance of the input schema yields a valid result table; that is, the resulting tables are valid s-tables, and the semantics of query operations applied to s-tables is consistent with their behavior on ground tables. Moreover, symbolic evaluation preserves linearity, which is essential for ensuring that the constrained optimization problems arising from symbolic evaluation fit standard frameworks.



The following paragraphs describe and motivate the behavior of each operator, and informally explain and justify the correctness of the well-formedness rules ensuring that the result of (symbolic) evaluation is a valid (symbolic) table. A formal presentation of the well-formedness rules is in Appendix A.



• Selection ($\sigma_c(R) : K \triangleright V$ where $R : K \triangleright V$). We permit $\sigma_c(R)$ when $c$ is a Boolean formula referring only to fields $A, B, \ldots \in K$. If comparisons involving symbolic values were allowed, then the existence of some rows in the output would depend on the unknown results, so it would not be representable just using s-tables.



• Projection-away ($\hat{\pi}_W(R) : K \triangleright V \backslash W$ where $R : K \triangleright V$ and $W \subseteq V$). The projection operator projects-away only value-fields. Discarding key-fields could break the finite map property by leaving behind tuples with the same keys and different values.



• Join ($R \Join S : K_1 \cup K_2 \triangleright V_1 \cup V_2$ where $R : K_1 \triangleright V_1$, $S : K_2 \triangleright V_2$ and $V_1 \cap V_2 = \emptyset$). Joins can only overlap on key-fields, for the same reason that selection predicates can only select on keys: if we allowed joins on value-fields, then the result of a join would not be representable as an s-table.



• Discriminated union ($R \uplus_B S : K, B \triangleright V$ where $R : K \triangleright V$ and $S : K \triangleright V$). The union of two finite maps may not satisfy the functional dependency from keys to values. We instead provide a discriminated union that tags the tuples in $R$ and $S$ with a new key-field $B$ to distinguish the origin.



• Renaming ($\rho_{B \mapsto B'}(R) : K[B \mapsto B'] \triangleright V[B \mapsto B']$ where $R : K \triangleright V$). Note that since $K$ and $V$ are disjoint, the renaming applies to either a key-field or a value-field, but not both. In any case, renaming clearly preserves the finite map property.



• Difference ($R \backslash S$ where $R : K \triangleright V$ and $S : K \triangleright \emptyset$). The difference of two maps discards from $R$ all tuples whose key-fields are present in $S$. The result is a subset of $R$, hence still a valid finite map. We assume $S$ has no value components; if not, this can be arranged by projecting them away in advance.



• Derivation ($\varepsilon_{B := e}(R) : K \triangleright V, B$ where $R : K \triangleright V$ and $e$ is a linear expression over the value-fields $V$). No new keys are introduced, so the finite map property still holds.



• Aggregation ($\gamma_{K';V'}(R) : K' \triangleright V'$ where $R : K \triangleright V$, $K' \subseteq K$ and $V' \subseteq V$). We allow grouping on key-fields and aggregation of value-fields (possibly discarding some of each). We consider SUM as the only primitive form of aggregation; COUNT and AVERAGE can easily be defined from it.
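To illustrate how these restrictions keep symbolic evaluation simple, the following Python sketch (continuing the hypothetical linear-expression encoding of the earlier sketch; gamma and add_lin are illustrative names) implements SUM-aggregation over an s-table. Adding two linear expressions just adds constants and merges coefficient maps, so linearity is preserved, as required:

def lin(const=0.0, **coeffs):
    return (float(const), coeffs)

def add_lin(e1, e2):
    c1, m1 = e1
    c2, m2 = e2
    m = dict(m1)
    for x, a in m2.items():
        m[x] = m.get(x, 0.0) + a
    return (c1 + c2, m)

def gamma(rows, group_keys, sum_fields):
    # grouping on key-fields only; SUM over symbolic value-fields
    groups = {}
    for t in rows:
        k = tuple(t[a] for a in group_keys)
        if k not in groups:
            groups[k] = {a: t[a] for a in sum_fields}
        else:
            for a in sum_fields:
                groups[k][a] = add_lin(groups[k][a], t[a])
    return [dict(zip(group_keys, k), **v) for k, v in groups.items()]

# AggReported := gamma_{week; cases}(ReportedDistrict)
rd = [{"district": "I", "week": "2110W25", "cases": lin(75, xI=75.0)},
      {"district": "XII", "week": "2110W25", "cases": lin(75, xXII=75.0)},
      {"district": "XIII", "week": "2110W25", "cases": lin(0, xXIII=1.0)}]
print(gamma(rd, ("week",), ("cases",)))
# [{'week': '2110W25', 'cases': (150.0, {'xI': 75.0, 'xXII': 75.0, 'xXIII': 1.0})}]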



Example 3.5.



Given the s-tables $ReportedDistrict : \{district, week\} \triangleright \{cases\}$, $ReportedCountry : \{week\} \triangleright \{cases\}$, and $Census : \{week\} \triangleright \{deaths\}$, the SQL views in Section 2 can be expressed in our algebra as:



$AggReported := \gamma_{week;\, cases}(ReportedDistrict)$

$Inferred := \varepsilon_{deaths := 0.015 \cdot cases}(ReportedCountry)$



The above discussion gives a high-level argument that if the input tables are finite maps (satisfying their respective functional dependencies as specified by the schema) then the result table will also be a finite map that satisfies its specified functional dependency. More importantly, linearity is preserved: if the s-table inputs to an operation are linear, and all expressions in the operation itself are linear, then the resulting s-table is also linear.



Correctness



We interpret s-tables as mappings from valuations to ground tables, obtained by evaluating all symbolic expressions in them with respect to a global valuation $h \in \mathbb{R}^X$.



Definition 3.6.



A valuation is a function $h \in \mathbb{R}^X$ assigning constant values to variables. Given a symbolic expression $e$, we write $h(e)$ for the result of evaluating $e$ with each variable $x$ replaced by $h(x)$. We then write $h(t)$ for the tuple obtained by replacing each symbolic expression $e$ in $t$ with $h(e)$, and write $h(R)$ for the result of evaluating the expressions in $R$ to their values according to $h$, that is, $h(R) = \{h(t) \mid t \in R\}$. Likewise, for an instance $I$, we write $h(I)$ for the ground instance obtained by replacing each $R$ with $h(R)$. An s-table represents the set $⦇R⦈ = \{h(R) \mid h \in \mathbb{R}^X\}$ of ground tables obtained by applying all possible $h$ to $R$. We write $⦇I⦈$ for the set of all ground instances obtainable from an instance $I$ by some $h \in \mathbb{R}^X$, defined as $⦇I⦈ = \{h(I) \mid h \in \mathbb{R}^X\}$, and we write $q⦇I⦈$ for $\{q(I') \mid I' \in ⦇I⦈\}$.
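Over the linear-expression encoding used in the earlier sketches, applying a valuation $h$ to a single cell is a one-line fold (evaluate is a hypothetical name):

def lin(const=0.0, **coeffs):
    return (float(const), coeffs)

def evaluate(expr, h):
    # ground a symbolic cell under valuation h : X -> R
    const, coeffs = expr
    return const + sum(a * h[x] for x, a in coeffs.items())

print(evaluate(lin(75, x=75.0), {"x": 0.2}))  # 75*(1+0.2) = 90.0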



Given a query expression in our algebra, we can evaluate it on a ground instance since every ground instance is an s-instance and the s-table operations do not introduce variables that were not present in the input. Further, given a set of ground instances, we can (conceptually) evaluate a query on each element of the set.



Theorem 3.7.



Let $q$ be a well-formed query mapping instances of $\Sigma$ to relations $K \triangleright V$, and let $I$ be an s-instance of $\Sigma$. Then $q⦇I⦈ = ⦇q(I)⦈$.



The proof is in Appendix A, and is similar to correctness proofs for c-tables (Imielinski and Jr., 1984); the main complication is that in s-instances, variables occurring in different tables are intended to refer to the same unknown values, whereas in c-tables such variables are scoped only at the table level.



3.2. Fusion, alignment, and coalescing



We now consider how s-tables and symbolic evaluation can be used to reduce concordance checking and discordance measurement to linear programming and quadratic programming problems, respectively, when we find more than one tuple with the same key and multiple (symbolic) values for the same attribute. We first consider a fusion operator:



Definition 3.8 (Fusion).



Given two ground relations $R, S : K \triangleright V$, their fusion $R \sqcup S$ is defined as $R \cup S$, provided it satisfies the functional dependency $K \to V$; otherwise the fusion is undefined.



We represent the expected relationships between source and derived data using a generalization of view specification called alignment specifications. Alignment specifications may define derived tables as the fusion of multiple views.



Definition 3.9 (Alignment Specification).



Let $\Sigma$ and $\Omega$ be finite map schemas with disjoint table names $R_1, \ldots, R_n$ and $T_1, \ldots, T_m$, respectively. Let $\Delta$ be a sequence of view definitions, one for each $T_i$, of the form $T_i := q_1 \sqcup \cdots \sqcup q_k$, where each $q_j$ is a query over finite maps that refers only to table names in $\Sigma$ and $T_1, \ldots, T_{i-1}$. The triple $\mathit{Spec} = [\Sigma, \Omega, \Delta]$ is called an alignment process specification.



Example 3.10.



Given the s-tables in Example 3.4 and queries in Example 3.5, the SQL assertions in Example 2.4 can be specified as:



$SumOfCases := ReportedCountry \sqcup AggReported$

$NumberOfDeaths := Census \sqcup Inferred$



We implement fusion on s-tables using a mild generalization, called coalescing, whose behavior on sets $\mathcal{R}$ of ground tables $R : K, B \triangleright V$ is $\kappa_B(\mathcal{R}) = \{\hat{\pi}_B(R) \mid R \in \mathcal{R},\ \hat{\pi}_B(R) \text{ satisfies } K \to V\}$. Intuitively, coalescing of a set of tables $R \in \mathcal{R}$ applies a projection $\hat{\pi}_B$ to each $R$, and returns those projected tables that still satisfy the FD $K \to V$. To represent the result of coalescing using s-tables, we augment them with a set of constraints $\phi$. A constraint is simply a conjunction of linear equations; a constrained s-table is a pair $(R, \phi)$ that represents the set of possible ground tables $⦇R, \phi⦈ = \{h(R) \mid h \in \mathbb{R}^X, h \models \phi\}$; finally, a constrained s-instance $(I, \phi)$ likewise represents a set of ground instances of $I$ obtained by valuations satisfying $\phi$. We can implement coalescing as an operation on constrained s-tables as follows:



$$\begin{array}{rcl}
\kappa_D(R, \phi) & = & (T, \phi \wedge \psi) \\
R^+ & = & \{\, t \in R \mid \exists t' \in R,\ t[K \backslash D] = t'[K \backslash D] \wedge t[D] \neq t'[D] \,\} \\
T & = & \{\, t[K \backslash D,\ B_1 := L_{t[K \backslash D], B_1}, \ldots] \mid t \in R^+ \,\} \;\cup\; \{\, t[K \backslash D, V] \mid t \in R \backslash R^+ \,\} \\
\psi & = & \displaystyle\bigwedge_{t \in R^+} \bigwedge_{B_i \in V} L_{t[K \backslash D], B_i} = t[B_i]
\end{array}$$



That is, let $R^+$ be the set of tuples in $R$ for which there exists another tuple that has the same values on $K \backslash D$ but differs on $D$, and let $R \backslash R^+$ be the remaining tuples (for which there are no such sibling tuples). Thus, $R^+$ is the set of tuples of $R$ violating the FD $K \backslash D \to V$, and $R \backslash R^+$ is the largest subset of $R$ that satisfies this FD. Then $\kappa_D(R)$ consists of the table $T$ obtained by filling in new variables $L_{t[K \backslash D], B_i}$ ("L" standing for LLUN, the reverse of NULL, as in the LLUNATIC system (Geerts et al., 2013)) for attribute values where there may be disagreement, and using the value from $R$ otherwise. The constraint $\psi$ consists of equations between the observed values of each attribute and the corresponding $L$-value. No constraints are introduced for tuples in $R \backslash R^+$, where there is no possibility of disagreement.
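The following Python sketch mimics this construction over the hypothetical linear-expression encoding used above (coalesce is an illustrative name; for simplicity it groups directly on the residual key $K \backslash D$, the discriminant having already been projected away). It returns the coalesced table $T$ together with the equations $\psi$, each represented as a pair of expressions asserted equal:

from collections import defaultdict

def lin(const=0.0, **coeffs):
    return (float(const), coeffs)

def coalesce(rows, residual_keys, value_fields):
    groups = defaultdict(list)
    for t in rows:  # group on K\D
        groups[tuple(t[a] for a in residual_keys)].append(t)
    table, psi = [], []
    for k, group in groups.items():
        row = dict(zip(residual_keys, k))
        for a in value_fields:
            if len(group) == 1:  # no disagreement possible: keep the value
                row[a] = group[0][a]
            else:                # potential FD violation: fresh LLUN variable
                llun = lin(0.0, **{"L_%s_%s" % (k, a): 1.0})
                row[a] = llun
                psi += [(llun, t[a]) for t in group]  # L = observed value
        table.append(row)
    return table, psi

# SumOfCases: country-level cases coalesced with aggregated district cases
rows = [{"week": "2110W25", "cases": lin(1000, y=1000.0)},
        {"week": "2110W25", "cases": lin(150, xI=75.0, xXII=75.0, xXIII=1.0)}]
fused, eqs = coalesce(rows, ("week",), ("cases",))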



3.3. Putting it all together



Having introduced (constrained) s-tables, together with the evaluation of query operations and coalescing over them, we now show how these technical devices allow us to define concordance and measure discord.



Definition 3.11 (Concordant Instance).



Given a specification $[\Sigma,\Omega,\Delta]$, an instance $I$ of schema $\Sigma$ is concordant if there exists an instance $J$ of $\Omega$ such that for each view definition $T_i := q_1 \sqcup \cdots \sqcup q_n$ in $\Delta$, we have $J(T_i) = q_1(I,J) \sqcup \cdots \sqcup q_n(I,J)$, where $q(I,J)$ is the result of evaluating $q$ on the combined instance $I,J$ and all of the fusion operations involved are defined. The concordant instances of $\Sigma$ with respect to $\mathit{Spec}$ are written $\mathit{Conc}(\mathit{Spec})$.






Definition 3.12 (Concordance).



Given $[\Sigma,\Omega,\Delta]$, let $I$ be an s-instance. We say $I$ is concordant if there exists a concordant instance $C \in \llparenthesis I \rrparenthesis$.



Given an alignment specification $\mathit{Spec}=[\Sigma,\Omega,\Delta]$ and an s-instance $I$, we can check concordance by symbolically evaluating $\Delta$ on $I$ to obtain an s-instance $J$ as follows. For each view definition $T_i := q_1 \sqcup \cdots \sqcup q_n$ in $\Delta$ in order, evaluate $q_1,\ldots,q_n$ to their s-table results and fuse them using the coalescing operator (repeatedly if $n>2$). This produces a new s-table $T'_i$ and a constraint $\phi_i$. Add $T_i := T'_i$ to $J$ and continue until all of the table definitions in $\Delta$ have been symbolically evaluated (i.e., $J=[T_1:=T'_1,\ldots,T_m:=T'_m]$). Thus, the constrained s-instance $(I',\phi)$, where $\phi=\bigwedge_{i=1}^m \phi_i$, characterizes the set of possible concordant instances based on $I'$, and in particular $I$ is concordant if $\phi$ is satisfiable.
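In pseudocode, this check reads as follows (a sketch under assumed helper functions `eval_query` and `coalesce`, which stand for s-table query evaluation and the $\kappa$ operator and are not part of the paper's formalism):

```python
def symbolically_evaluate(delta, I, eval_query, coalesce):
    """Sketch of the concordance check: delta is a list of
    (view_name, [q_1, ..., q_n]) pairs; eval_query(q, I, J) yields an s-table;
    coalesce(t1, t2) yields (fused s-table, constraint)."""
    J, phi = {}, []
    for name, queries in delta:
        fused = eval_query(queries[0], I, J)
        for q in queries[1:]:                       # repeated fusion when n > 2
            fused, c = coalesce(fused, eval_query(q, I, J))
            phi.append(c)
        J[name] = fused                             # add T_i := T'_i to J
    return J, phi     # I is concordant iff the conjunction of phi is satisfiable
```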



Definition 3.13 (Discordance).



Given $\mathit{Spec}=[\Sigma,\Omega,\Delta]$, let $\delta: \mathit{Inst}(\Sigma)\times \mathit{Inst}(\Sigma)\to\mathbb{R}$ be a measure of distance between pairs of elements of $\mathit{Inst}(\Sigma)$. The discordance of a (constrained) s-instance $I$ is $\inf_{J\in\llparenthesis I\rrparenthesis,\ C\in\mathit{Conc}(\mathit{Spec})}\ \delta(J,C)$.



Now, let $\delta$ be the metric defined by summing the squares of the error variables in $I$. Then the degree of discordance of $I$ given the alignment $\Delta$ and according to $\delta$ (i.e., $\mathit{disc}_\delta(I,\phi)$) equals the solution to the quadratic programming problem formed by minimizing $\delta$ subject to $\phi$. As a side-product, the values of the LLUNs $L$ introduced by coalescing can be used to obtain the concordant instance.



The discord is, intuitively, the shortest $\delta$-distance between the actual observed, uncertain data (represented as a set of possible instances) and a hypothetical concordant database instance that is consistent with the alignment specification. The more distant our data are from any such concordant instance, the more discordant they are.
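This reduction can be posed directly in OSQP's standard form $\min \frac{1}{2}x^{T}Px + q^{T}x$ subject to $l \le Ax \le u$. The sketch below (our own illustration with a hypothetical pair of alignment equations, not taken from the paper) takes $P = 2I$ so the objective is the sum of squared error variables, and sets $l = u$ to encode the equalities in $\phi$:

```python
import numpy as np
import osqp
from scipy import sparse

n = 3                                         # hypothetical: three error variables
P = sparse.identity(n, format='csc') * 2.0    # (1/2) x^T (2I) x = sum of squares
q = np.zeros(n)
A = sparse.csc_matrix([[1.0, 1.0, 0.0],       # hypothetical alignment equations:
                       [0.0, 1.0, 1.0]])      # x1 + x2 = 0.1, x2 + x3 = -0.2
b = np.array([0.1, -0.2])

prob = osqp.OSQP()
prob.setup(P, q, A, b, b, verbose=False)      # l = u = b encodes the equalities
res = prob.solve()
print(res.x, res.info.obj_val)                # minimizing assignment and its discord
```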



Example 3.14.



From the specification in Example 3.10, and by means of the corresponding coalescing operation, we get the following system of equations:



$$1000\,(1+y) \;=\; 75\,(1+x_1) + \cdots + 75\,(1+x_{12}) + x_{13}$$



$$0.015 \cdot 1000\,(1+y) \;=\; 20\,(1+z)$$



Obviously, even considering only positive values for the different variables, this system has many solutions. One solution, $S_1$, takes all $x_i$ to be zero, $y=-0.1$ and $z=-0.325$; this corresponds to assuming there is no error in the twelve districts' reports and there are no cases in District XIII. Another solution, $S_2$, sets $x_I=\cdots=x_{XII}=0$ and $x_{XIII}=100$, hence $y=0$ and $z=-0.25$, which corresponds to assuming $D_{XIII}$ has all of the missing cases. Of course, whether $S_1$ or $S_2$ (or some other solution) is more plausible depends strongly on domain-specific knowledge. Nevertheless, given a cost function assigning a cost to each solution, we can compare different solutions in terms of how much correction is needed (or how much discord exists). For example, we might consider a cost function that simply takes the sum of the squares of the variables:



$$c_1(\vec{x},y,z) = \Big(\sum_{i\in\{I,\ldots,XIII\}} x_i^2\Big) + y^2 + z^2$$

Using this cost function, $S_1$ has cost $\approx 0.116$ while $S_2$ has cost $10000.0625$, so with the above cost function the first solution is much closer to being concordant, because a large change to $x_{XIII}$ is not needed. Alternatively, we might give the unknown number of cases in $D_{XIII}$ no weight, reflecting that we have no knowledge about what it might be, corresponding to the cost function



$$c_2(\vec{x},y,z) = \Big(\sum_{i\in\{I,\ldots,XII\}} x_i^2\Big) + y^2 + z^2$$

which assigns the same cost to $S_1$ but assigns cost $0.0625$ to $S_2$, indicating that if we are free to assign all unaccounted cases to $x_{XIII}$ then the second solution is closer to concordance.
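The example's numbers can be checked directly (a small verification of the arithmetic above, not part of the system):

```python
def cost(xs, y, z):                       # sum of squares, as in c1 / c2
    return sum(x ** 2 for x in xs) + y ** 2 + z ** 2

S1 = ([0.0] * 13, -0.1, -0.325)           # x_I..x_XIII, y, z
S2 = ([0.0] * 12 + [100.0], 0.0, -0.25)

print(cost(*S1))                          # 0.115625   (~0.116; c1 = c2 on S1)
print(cost(*S2))                          # 10000.0625 (c1 on S2)
print(cost(S2[0][:12], S2[1], S2[2]))     # 0.0625     (c2 drops x_XIII)
```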



In the rest of this paper we will only use simplistic cost functions like these, but the underlying quadratic solver we use can accommodate more general quadratic functions of the symbolic variables. We could weight variables to reflect the reliability of the different districts as well as the central government, and the historical information of the census. However, such weights depend on knowledge of the domain, and we leave the exploration of more sophisticated cost functions to future work.



4. Implementation



We now describe the techniques employed in Eris, an implementation of our approach. The systems of equations resulting from constraints generated on coalescing tables or instances are linear, so they can be solved using linear algebra solvers. However, it may not be immediately obvious how to evaluate queries over s-tables to obtain the resulting systems of equations efficiently. One strategy would simply be to load all of the data from the database into main memory and evaluate the s-table query operations in-memory. While straightforward, this may result in poor performance or duplication of effort, as many query operations which are efficiently executed within the database (such as joins) would need to be hand-coded to obtain acceptable performance. Instead, we consider two in-database representations of s-tables, illustrated in Fig. 3: a denormalized sparse-vector representation using nested user-defined data types (NF$^2$), and a normalized representation using multiple flat relations (Partitioning).



In the NF$^2$ approach, we add a user-defined type for the symbolic (linear) expression fields. There are several ways of doing this, for example using arrays to represent vectors of coefficients for a (small) fixed number of variables, or using a sparse representation that can deal with arbitrary numbers of variables efficiently. Having experimented with several options, we chose a representation in which symbolic expressions $\sum_i a_i \cdot x_i + b$ are represented as sparse vectors, specifically as a pair of the value of $b$ and an array of pairs $[(a_i,x_i),\ldots]$ of coefficients and variable names. We implemented addition, scalar multiplication, and aggregation (sum) of linear expressions using PostgreSQL's user-defined function facility. With this encoding, the SQL queries corresponding to our query calculus are straightforward to extract by applying the standard translation and inserting user-defined functions. We therefore do not present this translation in detail.
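As a rough picture of what these UDFs compute, here is the same encoding in Python (illustrative only; in Eris these operations are PostgreSQL user-defined functions over a nested type): an expression $\sum_i a_i\cdot x_i + b$ is a pair of $b$ and a map from variable names to coefficients.

```python
def add(e1, e2):
    """Add two linear expressions (b, coeffs), dropping zero coefficients."""
    b1, v1 = e1; b2, v2 = e2
    coeffs = dict(v1)
    for x, a in v2.items():
        c = coeffs.get(x, 0.0) + a
        if c == 0.0: coeffs.pop(x, None)          # zero-filtering
        else: coeffs[x] = c
    return (b1 + b2, coeffs)

def scale(alpha, e):
    """Scalar multiplication alpha * e."""
    b, v = e
    return (alpha * b, {x: alpha * a for x, a in v.items() if alpha * a != 0.0})

def agg(exprs):
    """Aggregation (sum) of a group of linear expressions."""
    total = (0.0, {})
    for e in exprs:
        total = add(total, e)
    return total

# e.g. 75*(1 + x1):
e = scale(75, add((1.0, {}), (0.0, {'x1': 1.0})))  # -> (75.0, {'x1': 75.0})
```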



Though many RDBMSs support similar mechanisms to create user-defined functions and aggregates, these are not standard, and so NF$^2$ is not very portable. Thus, the alternative approach we present, Partitioning, relies only on standard SQL:1999. In this approach, we represent an s-table $R: K \triangleright V$ with $n$ symbolic value-fields $B_1,\ldots,B_n$ using $n+1$ relational tables, as follows:



• $R_0 : K \triangleright V$ is a ground table collecting all of the constant terms (i.e., the $b$-values for each of the $B_i \in V$).



• For each symbolic field $B_i \in V$, $R_{B_i}: K, X \triangleright C$ is a ground table mapping keys $K$ and an additional key $X$ (corresponding to the variables) to a real value-field $C$, so that $(k,x,c) \in R_{B_i}$ just in case $c$ is the coefficient of $x$ in the $B_i$-attribute of $R(k)$.



We consider the relations corresponding to the symbolic value-fields to be collected in a record $\vec{R} = (B_1 = R_{B_1}, \ldots)$, and we write $(R_0, \vec{R})$ for the full representation. This representation admits relatively simple translations of each of the s-table query operations:



$$
\begin{array}{rcl}
\sigma_P(R_0,\vec{R}) & = & (\sigma_P(R_0),\ \sigma_P(\vec{R}))\\
\hat\pi_W(R_0,\vec{R}) & = & (\hat\pi_W(R_0),\ \vec{R}[V\backslash W])\\
\rho_{[B\mapsto B']}(R_0,\vec{R}) & = & (\rho_{[B\mapsto B']}(R_0),\ \vec{R}[B\mapsto B'])\\
(R_0,\vec{R})\uplus_D(S_0,\vec{S}) & = & (R_0\uplus_D S_0,\ (B := R.B \uplus_D S.B)_{B\in V})\\
\varepsilon_{B:=c}(R_0,\vec{R}) & = & (\varepsilon_{B:=c}(R_0),\ (\vec{R},\ B:=\emptyset))\\
\varepsilon_{B:=B_i+B_j}(R_0,\vec{R}) & = & (\varepsilon_{B:=B_i+B_j}(R_0),\ (\vec{R},\ B := \tilde\gamma_{K,X;C}(R.B_i \uplus_D R.B_j)))\\
\varepsilon_{B:=\alpha\cdot B_i}(R_0,\vec{R}) & = & (\varepsilon_{B:=\alpha\cdot B_i}(R_0),\ (\vec{R},\ B := \tilde\varepsilon_{C:=\alpha\cdot C}(R_{B_i})))\\
(R_0,\vec{R})\backslash(S_0,()) & = & (R_0 \Join (\pi_K(R_0)\backslash S_0),\ (B := R.B \Join \pi_K(R_0)\backslash S_0)_{B\in V})\\
(R_0,\vec{R})\Join(S_0,\vec{S}) & = & (R_0\Join S_0,\ (B := R.B_R\Join\pi_{K_S}(S_0),\ B' := S.B_S\Join\pi_{K_R}(R_0))_{B_R\in V_R,\,B_S\in V_S})\\
\gamma_{K',V'}(R_0,\vec{R}) & = & (\gamma_{K',V'}(R_0),\ (B := \tilde\gamma_{K',X;C}(R.B))_{B\in V'})
\end{array}
$$

The operations $\tilde\gamma$ and $\tilde\varepsilon$ are zero-filtering versions of aggregation and derivation respectively, which remove any rows whose $C$-value is zero. Filtering out zero coefficients is not essential, but avoids retaining unnecessary records, since absent coefficients are assumed to be zero. In the rule for selection, recall that the predicate $P$ only mentions key attributes; we write $\sigma_P(\vec{R})$ for the result of applying the selection to each table in $\vec{R}$. In the rule for projection-away, we assume $R: K \triangleright V$ and $W \subseteq V$, and $\vec{R}[V\backslash W]$ is the record resulting from discarding the fields corresponding to attributes in $W$. Likewise in the rule for renaming, $\vec{R}[B\mapsto B']$ stands for renaming field $B$ of $\vec{R}$ to $B'$ if present, otherwise applying $\rho_{[B\mapsto B']}(-)$ to each table in $\vec{R}$. In the rule for addition, we introduce a dummy discriminant in the union and use zero-filtering aggregation $\tilde\gamma_{K,X;C}$ to sum coefficients grouped by key and variable name (i.e., getting rid of the dummy discriminant). Likewise, in the case for scalar multiplication, the $\tilde\varepsilon_{C:=\alpha\cdot C}(R_{B_i})$ operation does an in-place update and finally filters out any zero coefficients. The rule for difference is slightly tricky because $S_0$ does not have the variable attribute in its key, so just subtracting it from each of the $R.B$ would not work. Instead, we compute the set of keys present after the difference and restrict each $R.B$ to that set of keys (using a join). The rule for join is likewise a little more involved: given $R: K_R \triangleright V_R$ and $S: K_S \triangleright V_S$, since $V_R$ and $V_S$ are disjoint, it suffices for each field $B_R$ of $V_R$ to join the corresponding table $R.B_R$ with the keys of $S_0$, i.e., $\pi_{K_S}(S_0)$, and symmetrically for $S$'s value-fields $B_S$. Finally, for aggregation we assume $R: K \triangleright V$ with $K' \subseteq K$ and $V' \subseteq V$, and again use $\tilde\gamma$.
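To illustrate the representation (not the actual SQL translation), the sketch below models $(R_0,\vec{R})$ as Python dictionaries and implements scalar multiplication and aggregation, including the zero-filtering $\tilde\gamma$; the table contents are hypothetical.

```python
# R0 maps keys K to constant terms b; R_B maps (K, X) to coefficients C.
R0  = {('uk', 'w26'): 10.0, ('uk', 'w27'): 12.0}
R_B = {('uk', 'w26', 'x1'): 1.0, ('uk', 'w27', 'x2'): 1.0}

def scale(alpha, R0, R_B):
    """epsilon_{B := alpha . B}: scale constants and coefficients, dropping zeros."""
    return ({k: alpha * b for k, b in R0.items()},
            {kx: alpha * c for kx, c in R_B.items() if alpha * c != 0.0})

def agg(group_of, R0, R_B):
    """gamma_{K',V'}: sum constants per group; sum coefficients per (group, var)."""
    out0, outB = {}, {}
    for k, b in R0.items():
        g = group_of(k)
        out0[g] = out0.get(g, 0.0) + b
    for (*k, x), c in R_B.items():
        g = (group_of(tuple(k)), x)
        s = outB.get(g, 0.0) + c
        if s == 0.0: outB.pop(g, None)      # zero-filtering gamma~
        else: outB[g] = s
    return out0, outB

# Aggregate the week away, keeping the country:
print(agg(lambda k: k[0], R0, R_B))
# ({'uk': 22.0}, {('uk', 'x1'): 1.0, ('uk', 'x2'): 1.0})
```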



Finally, we comment on how we implemented the mappings from raw data to s-tables, as well as the constraint generation performed by the fusion or coalescing operators. The simple strategy outlined in Section 3.3 materializes a great deal of intermediate data that is not ultimately needed. Instead, we found the most efficient approach to be to define the s-tables as views (typically using SQL:1999 features such as ROW_NUMBER to generate unique ids for variables), and to define coalescing as an operator on constrained s-tables, where the constraints themselves are represented as an extra query (i.e., a query over an s-table is actually a pair of queries, one retrieving the data $I'$ and another creating the constraints $\phi$). With this approach, coalescing (hence fusion) becomes a first-class operator and can be composed with the other operations, so we can convert a specification to a single composed query (referring to the views that define the s-tables) whose translation generates the equations directly. This is the approach we have evaluated; it significantly improves on the naive materialization approach, because it avoids the need to load and scan numerous intermediate s-tables.



5. Evaluation



We evaluated Eris (whose components are depicted in Figure 4) from the perspective of both usefulness and performance. Experiments were run on a workstation equipped with an Intel Xeon E5-1650 with 6 cores and 32 GB RAM, running Ubuntu 16.10, and using a standard installation of PostgreSQL 9.5. Our system was implemented in approximately 4000 lines of Scala code (https://anonymous.4open.science/r/Eris), with approximately 100 lines of SQL defining auxiliary operations, user-defined types, and functions involving sparse vectors. We solved linear and quadratic programming subproblems using version 0.6.1 of OSQP (Stellato et al., 2020), called as a Python library; the Scala code of the main system queries the database, constructs the data needed for an instance of OSQP's quadratic programming problems, and then invokes Python to run and solve the problem. We used OSQP's default configuration with no parameter tuning.



5.1. Performance microbenchmarks



We considered the following questions:



(1) How does the time taken for symbolic query evaluation using NF$^2$ and Partitioning vary depending on data size?



(2) How does the time taken for equation generation vary depending on data size?



(3) How does the time taken by OSQP compare to that needed for equation generation?



The first and second questions measure only the performance of our system and do not consider the time taken by OSQP. The third question is important to determine whether our system produces QP problems that are feasible for OSQP to solve, because such problems could be encoded in several different ways.



Although there are several benchmarks for entity resolution and evaluation of the distance between descriptive data, there is no available benchmark with multiple sources of overlapping numerical data suitable for our system, so we adopted a microbenchmarking approach with synthetic data and simple queries. We defined a simple schema with tables $R: A,B \triangleright C,D$ and $S: B \triangleright E,F$ and a random data generator that populates this schema, for a given parameter $n$, by generating $n$ rows for $S$ and, for each such row $(b,e,f)$, generating between 0 and $\sqrt{n}$ rows for $R$ with the same $B$ field. Thus on average the resulting database contains $n + \frac{n}{2}\sqrt{n}$ rows in total. We generated databases for $n \in \{100;\ 1{,}000;\ 10{,}000\}$; note that $n=10{,}000$ actually corresponds to approximately $510{,}000$ rows. For each $n$, we randomly generated three data sets and used the same three data sets for all experiments for that size, reporting the average of the three runs. We consider the following queries to exercise the most complex cases of the translation of Section 4:



$$
\begin{array}{rcl}
q_1 &=& R' \Join S'\\
q_2 &=& \varepsilon_{W:=C+D}(R')\\
q_3 &=& \varepsilon_{X:=W*C}(\varepsilon_{W:=1}(R'))\\
q_4 &=& \gamma_{A;C}(R')\\
q_5 &=& \gamma_{B;C}(R')\\
q_6 &=& R' \uplus_Z R''\\
q_7 &=& \kappa_Z(q_6)
\end{array}
\qquad
\begin{array}{rcl}
T_1 &=& \kappa_Z(q_1 \uplus_Z (R \Join S))\\
T_2 &=& \kappa_Z(q_2 \uplus_Z \varepsilon_{W:=C+D}(R))\\
T_3 &=& \kappa_Z(q_3 \uplus_Z \varepsilon_{X:=W*C}(\varepsilon_{W:=1}(R)))\\
T_4 &=& \kappa_Z(q_4 \uplus_Z \gamma_{A;C}(R))\\
T_5 &=& \kappa_Z(q_5 \uplus_Z \gamma_{B;C}(R))
\end{array}
$$

Given two source tables $R, S$ in a database generated as explained above, we create observation tables $R_o$, $S_o$ by distorting them as follows: for each row, we randomly replace each value field with NULL with some probability (i.e., $p=0.01$) and otherwise add a normally-distributed distortion. Next, symbolic views $R', S'$ of both distorted tables are defined, as outlined in Section 3.1. Once we have these two versions of the tables (i.e., the source $R$, $S$ and the distorted, symbolic $R'$, $S'$), we considered two modes of execution of these queries: in the first mode, we simply evaluate the query over a symbolic input (i.e., $R'$ and $S'$) and construct the result; in the second mode, we evaluate the result of aligning this symbolic query result with a query result run over the source tables (i.e., $R$ and $S$).
Thus, for example, for $T_1$ we generate the equations resulting from the coalescing expression $q_1 \sqcup (R \Join S) = \kappa_Z(q_1 \uplus_Z (R \Join S))$. Finally, the resulting system of equations is solved, subject to the metric giving each error variable $x$ a weight of $x^2$ and each null variable a weight of 0.
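The generator and distortion steps are simple enough to sketch (our reading of the setup above; the noise scale `sigma` is an assumption, as the text does not fix it):

```python
import random

def generate(n, seed=0):
    """n rows for S : B |> E, F; 0..sqrt(n) rows of R : A, B |> C, D per S row."""
    rng = random.Random(seed)
    S = [(b, rng.random(), rng.random()) for b in range(n)]
    R = [(a, b, rng.random(), rng.random())
         for (b, _, _) in S
         for a in range(rng.randint(0, int(n ** 0.5)))]
    return R, S

def distort(v, rng, p=0.01, sigma=0.1):
    """Replace a value field with NULL with probability p, else add Gaussian noise."""
    return None if rng.random() < p else v + rng.gauss(0.0, sigma)

R, S = generate(1000)
rng = random.Random(1)
R_o = [(a, b, distort(c, rng), distort(d, rng)) for (a, b, c, d) in R]
S_o = [(b, distort(e, rng), distort(f, rng)) for (b, e, f) in S]
```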



For question 1, executions are summarized in Fig. 6, where reported times include the time to receive the symbolic query results. These show that Partitioning and NF$^2$ have broadly similar performance; despite NF$^2$'s comparative simplicity, its running time is often faster, with the exceptions being $q_4$ and $q_5$, the two aggregation queries. Particularly for $q_4$, aggregation can result in large symbolic expressions which are not always handled efficiently by the NF$^2$ sparse vector operations using PostgreSQL arrays; we experimented with several alternative approaches to try to improve performance, without success. Thus, in cases where the symbolic expressions do not grow large, NF$^2$ seems preferable.



For questions 2 and 3, we measured the time taken for equation generation and for OSQP solving for each query, using different database sizes as described above. The results are shown in Fig. 6. In Fig. 6, the time taken for equation generation, including querying and serializing the resulting OSQP problem instances, is shown (again on a logarithmic scale). In Fig. 6, the percentage of time spent on equation generation and on OSQP solving for the largest database instance ($n=10{,}000$) is shown; the two are always of a similar order of magnitude, so neither can be claimed to be a bottleneck relative to the other. The OSQP solving times for Partitioning and NF$^2$ coincide and so are not shown.



5.2. Case study



We might use our tool to get a best-fit database. However, this would only be useful if sources are close to each other (and hence to reality). If they are relatively discordant (like the blind men describing the elephant), all we can aim at is to measure and study the evolution of such discordancy. Thus, we decided to apply our prototype to the study of challenging COVID-19 data, which is publicly available, and to observe how reporting improved in different countries during the pandemic. More specifically, we considered two different sources:



Johns Hopkins University (JHU):



The Center for Systems Science and Engineering (CSSE) at JHU has been gathering COVID-19 data since the very beginning of the pandemic and has become a reference worldwide (Dong et al., 2020). On the one hand, we used its daily time series at country level (https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series), containing both cases and deaths. On the other hand, regional data is unfortunately scattered across different files in the JHU repository, so we used a more compact version (https://github.com/coviddata/coviddata).



EuroStats:



As a second data source for comparison, we used the weekly European mortality from EuroStats (https://ec.europa.eu/eurostat/databrowser/view/demo_r_mwk2_ts/default/table), following the Nomenclature of Territorial Units for Statistics (NUTS, https://ec.europa.eu/eurostat/web/nuts/background).



The weekly mortality per country appears to be historically quite stable (less than 5.5% coefficient of variation for the six countries of our study). Thus, we took the weekly mortality of the five years prior to the pandemic as ground truth. However, for some countries, the most recent figures are tagged as either provisional or estimated. We treated the former as an administrative issue and still part of the error-free ground truth, but put the latter, together with the mortality of 2020/2021, in an s-table, treating those data in the same way as the ones coming from JHU.



We loaded the different data into a PostgreSQL database with Pentaho Data Integration. These were divided into the six tables shown in Table 1, together with the numbers of different locations and times, number of rows, and first and last time point available. Data was split firstly according to the source (namely EuroStats or JHU). Ground-truth mortality (i.e., considered free of errors) is in ground tables $EU$, while estimates of previous years and data of 2020/2021 are in s-tables $\mathit{EUe}$. Different s-tables are also generated for different geographic granularities (namely region $r$ or country $c$); relevantly, data from EuroStats is available per week $w$, while data from JHU is available daily $d$. Both the location and temporal dimensions result in different (underlined) key attributes for the corresponding tables. From EuroStats, we only used the number of deaths $\#d$, while from JHU we took both COVID-19 cases $\#c$ and deaths $\#d$. Attribute $\#d$ is declared as free of variables in both $EU$ ground tables, and its instances are consequently constants. Values coming from EuroStats correspond exactly to the reported ones, but to mitigate the noise in those coming from JHU (e.g., cases not reported during weekends being moved to the next week by some regions), we followed the common practice of taking the average over the previous seven days for both cases and deaths.
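For the smoothing step, a trailing seven-day mean over the JHU series suffices; a minimal sketch follows (column names and data are hypothetical; only the trailing 7-day average is stated in the text):

```python
import pandas as pd

jhu = pd.DataFrame({
    'date': pd.date_range('2020-06-22', periods=10, freq='D'),
    'cases': [120, 80, 90, 150, 130, 60, 40, 170, 110, 95],
})
# Average over the previous seven days (NaN until a full window is available).
jhu['cases_7d'] = jhu['cases'].rolling(window=7).mean()
```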



Fig. 7 shows a logical representation of our alignment of the sources (notational elements are introduced to facilitate understanding, like "avg" instead of the "sum/count" actually used in the current prototype). Dimensional tables like date and firstadminunit, and their corresponding joins to facilitate selections over year and week of year (woy), or the relationships between countries and regions, are omitted for the sake of simplicity. On the one hand, we take the $EU$ and $\mathit{EUe}$ tables and generate the weekly surplus of deaths after the sixth week of 2020 by subtracting from the declared amounts the average deaths in the last five years for the same week. This is done both per region and per country, since these values are not always concordant (even when coming from the same source). Then, regional results are aggregated per country and merged into the same table with the information provided already at that level, using a discriminated union to keep track of the different origins. On the other hand, looking now at the JHU tables, we aggregate regional data in three different ways: deaths per country and day, deaths per region and week, and finally cases per region and week with a lag of three weeks (we will empirically justify this concrete value later). Under the assumption of a Case-Fatality Ratio of 1.5% (the observed median on June 22nd, 2021 is 1.7% according to JHU, https://coronavirus.jhu.edu/data/mortality), this transformation is applied to the cases before merging and coalescing the weekly regional cases and deaths. Daily deaths reported per country and those obtained after aggregating regions are also coalesced and then aggregated per week. Both branches of JHU data are finally merged with a discriminated union into a single table. Finally, the four branches (namely EuroStats regional data, EuroStats country data, JHU regional data aligning cases and deaths, and JHU regional data coalesced with JHU country data) are merged into a single table with a discriminated union and coalesced to generate the overall set of equations.
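Two of these alignment steps can be sketched as dataframe transformations (hypothetical column names and integer week numbering; the actual alignment is expressed with our algebraic operators, not in pandas):

```python
import pandas as pd

def weekly_surplus(deaths, baseline):
    """Surplus deaths: declared weekly deaths minus the 5-year average for the
    same week. deaths: [country, week, deaths];
    baseline: [country, week, deaths_avg5y]."""
    m = deaths.merge(baseline, on=['country', 'week'])
    m['surplus'] = m['deaths'] - m['deaths_avg5y']
    return m[['country', 'week', 'surplus']]

def deaths_from_cases(cases, cfr=0.015, lag_weeks=3):
    """Expected deaths in week w from cases in week w - lag_weeks, under the
    assumed 1.5% Case-Fatality Ratio."""
    out = cases.copy()
    out['week'] = out['week'] + lag_weeks
    out['expected_deaths'] = out['cases'] * cfr
    return out[['region', 'week', 'expected_deaths']]
```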



We restricted our analysis to the six countries in Table 2, chosen because of their relevance in the pandemic and the availability of regional data in both EuroStats and JHU. Regarding time, we only considered data until February 2021, to avoid the impact of vaccination. For each country and week, our alignment generates a different system of equations, which is solved by minimizing the discord (i.e., the sum of squared errors). In the table, we can see for each country the number of systems of equations with the maximum number of variables (i.e., all possible data is available, which happens between weeks 2020W26 and 2021W06, except for the UK, whose data is only available in EuroStats until 2020W51; we ignored 2020W53 because of its exceptional nature, nonexistent in other years), as well as the equations and variables per system in those cases. The average time in seconds to generate each system of equations as a Python input to OSQP and to solve it is also reported.



The line charts in Figs. 8 and 9 plot discordance divided by the number of variables on the vertical axis. Firstly, Fig. 8 varies the lag between reported cases and deaths, for values from one to eight weeks. We can see that the average discordance is minimized in all cases between two and four weeks (three on average). Thus, in the other charts, we used a lag of three weeks.



Fig. 9 shows the evolution of the discordance from 2020W26 until the last week considered. We can appreciate that during the first weeks of reporting regional data, countries adjusted and eventually improved their COVID-19 surveillance mechanisms. However, all of them except the UK are too sensitive to the increase of cases, and their concordance with real deaths is clearly affected by the arrival of the second wave after summer and the third one at the end of the year (we can clearly appreciate the two peaks in the five other countries). Unfortunately, UK data reporting to EuroStats stopped on December 31, 2020 due to Brexit, so we cannot see from the EuroStats data whether the UK's error remained low during the rest of the infection wave in early 2021.



Finally, Fig. 9 shows the clearer but less computationally challenging evolution of discordancy without considering regionally reported data (the small pointer on the horizontal axis indicates the alignment of both charts). We can see a clear peak of discordancy during the first wave that eventually improves, only to be affected again, more or less depending on the country, by the second and third waves. As a derived calculation from the observed discordancy, we can take the Pearson correlation coefficient between it and the running average of the number of cases (i.e., DE: 36%; ES: 80%; IT: 50%; NL: 73%; SE: 23%; UK: 93%). Thus, we can observe that in the case of ES, NL and UK, more than 70% of the variation in the discordancy can be explained by changes in the number of cases.



6. Related work



The problems described above are related to Consistent Query Answering (CQA) (Chomicki, 2007), which tries to identify the subset of a database that fulfills some integrity constraints, and corresponds to the problem of identifying certain answers under the open-world assumption (Baader et al., 2003). In CQA, the distance between two database instances is captured by the symmetric difference of tuples. However, in our case, the effects of an alignment are reflected not only in the presence or absence of a tuple, but also in the values it contains. This leads to the much closer Database Fix Problem (DFP) (Bertossi et al., 2005; Bohannon et al., 2005), which aims at determining the existence of a fix at a bounded distance, measuring variations in the numerical values.



Both DFP and CQA become undecidable in the presence of aggregation constraints. Nonetheless, such constraints have been used to drive deduplication (Chaudhuri et al., 2007). However, our case is different, since we are not questioning correspondences between entities to create aggregation groups, but instead trying to quantify their (in)consistency in the presence of complex transformations.



Another known result in the area of DFP is that prioritizing repairs by considering preferences or priorities (like the data sources in our case) only increases complexity. The use of where-provenance to justify new values has already been explored (Geerts et al., 2013), but only with direct value imputation (without any data transformation). In contrast, we assume that there is no master data, but rather multiple contradictory sources, and we allow aggregates, while (Geerts et al., 2013) only uses pure equalities (neither aggregation nor real arithmetic) between the master and target databases.



From another perspective, our work is related to incompleteness in multidimensional databases, which has typically focused on the problems generated by imprecision in hierarchical information (Dyreson et al., 2003; Baikousi et al., 2011; Golfarelli and Turricchia, 2014). Only more recently has attention shifted to missing values in the measures. Bimonte et al. (Bimonte et al., 2020) present a linear programming-based framework that imputes missing values under constraints generated by sibling data at the same aggregation level, as well as parent data at higher levels. We can consider this a special case of our approach, in which there is a single data source and the alignment is predefined.



The setting we have described shares many motivations with previous work on provenance. The semiring provenance model (Green et al., 2007) is particularly related: it explains why why-provenance (Buneman et al., 2001) is not enough (e.g., in the case of alternative sources for the same data) and why we need how-provenance to really understand how different inputs contribute to a result, proposing provenance polynomials to capture this kind of provenance. Further, Amsterdamer et al. (Amsterdamer et al., 2011) extended the semiring provenance model to aggregation by mixing annotations and values together, but the fine-grained provenance information may become prohibitively large, and to the best of our knowledge no practical implementations exist. In contrast, our approach does not have row-level annotations recording the conditions that make a row present in the result, limits aggregation to value fields, and considers only sum and average forms of aggregation, but we have provided practical implementations of this more limited model. As noted earlier, our s-tables are similar in some respects to the c-tables studied in incomplete information databases (Imielinski and Jr., 1984). Our data model and queries are more restricted in some ways, due to the restriction to finite maps and the fact that we do not allow conditions affecting the presence of entire rows, but our approach supports aggregation, which is critical for our application area and was not handled in the original work on c-tables.



Semiring provenance and c-tables have been implemented in systems such as Orchestra (Ives et al., 2008), ProQL (Karvounarakis et al., 2010), ProvSQL (Senellart et al., 2018), and Mimir (Nandi et al., 2016). In Orchestra, provenance annotations were used for update propagation in a distributed data integration setting. ProQL and ProvSQL implement the semiring model but allow neither symbolic expressions in the data nor aggregation. Mimir is a system for querying uncertain and probabilistic data based on c-tables; however, in Mimir symbolic expressions and conditions are not actually materialized as results; instead, the system fills in their values with guesses in order to make queries executable on standard RDBMSs. Thus, Mimir's approach to c-tables does not suffice for our needs, since we need to generate the symbolic constraints for the QP solver to solve. On the other hand, our work shows how some of the symbolic computation involved in c-tables can be implemented in-database, and it would be interesting to consider whether this approach can be extended to support full c-tables in future work.



We have reduced the concordance evaluation problem to quadratic programming, a well-studied optimization problem. Solvers such as OSQP (Stellato et al., 2020) can handle systems with thousands of equations and variables. However, we have not made full use of the power of linear/quadratic programming. For example, we could impose additional linear inequalities on the unknowns, e.g., to constrain certain error or null values to be positive or to lie within some range. Likewise, we have defined the cost function in one specific way, but quadratic programming permits many other cost functions, for example with different weights for each variable or with additional linear cost factors.
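To make this concrete, the following minimal sketch (with hypothetical sizes and data, not our actual encoding) shows how such extensions map onto OSQP's standard form, minimize 1/2 x'Px + q'x subject to l <= Ax <= u: per-variable weights enter through the diagonal of P, linear cost factors through q, and equality or range constraints through rows of A with matching or differing bounds.

```python
import numpy as np
import osqp
from scipy import sparse

n = 3
weights = np.array([1.0, 2.0, 0.5])      # hypothetical per-variable weights

# OSQP minimizes 1/2 x'Px + q'x, so P = 2*diag(w) yields cost sum_i w_i*x_i^2.
P = sparse.diags(2.0 * weights).tocsc()
q = np.zeros(n)                          # optional linear cost factors

# One alignment equation x0 + x1 - x2 = 10 (encoded by setting l = u),
# plus nonnegativity constraints x >= 0.
A = sparse.vstack([sparse.csc_matrix([[1.0, 1.0, -1.0]]),
                   sparse.eye(n)]).tocsc()
l = np.concatenate([[10.0], np.zeros(n)])
u = np.concatenate([[10.0], np.full(n, np.inf)])

prob = osqp.OSQP()
prob.setup(P, q, A, l, u, verbose=False)
res = prob.solve()
print(res.x)                             # weighted least-cost solution
```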



As noted in Section 2, we have focused on the problem of evaluating concord/discord among data sources and not on using the different data sources to estimate the actual values being measured. It would be interesting to extend our framework by augmenting symbolic tables and queries with a probabilistic interpretation, so that the optimal solution found by quadratic programming produces statistically meaningful consensus values (similarly to the work of Mayfield et al. (Mayfield et al., 2010)).



7. Conclusions



In most data integration and cleaning scenarios, it is assumed that some source of ground truth is available, such as master data or user input. However, in many realistic settings, such as epidemiological surveillance, ground truth is not known or knowable, and we need to integrate discordant data sources with different levels of trustworthiness, completeness and self-consistency. In this setting, we would still like to measure how close the observed data is to our idealized expectations. Thus, we proposed definitions of concordance, capturing when the data sources we wish to fuse are compatible with one another, and of discordance, measuring how far the observed data is from being concordant. Consequently, we can compare discordance measurements over time to understand whether the different sources are becoming more or less consistent with one another. We showed how to solve this problem by extending multidimensional relational queries with symbolic evaluation, and gave two relational implementations of this approach, reducing it to linear or quadratic programming problems that can be solved by an off-the-shelf QP solver. We explored the performance of the two approaches via microbenchmarks assessing scalability in data size and number of variables, illustrated the value of this information using a case study based on COVID-19 case and death reporting from 2020-2021, and found that the error calculated for six European countries at different times correlates with intuition.



Our approach to symbolic evaluation of multidimensional queries appears to have further applications that we plan to explore next, such as supporting other forms of uncertainty expressible as linear constraints, and adapting our approach to produce statistically meaningful estimates of the consensus values.