## Are there unexpected patterns in the relationships among variables?

**Reference Script**: comparability\\Comparability_Relationship_among_variables.Rmd

**Description**: Variables which have a predictable relationship with each other are examined to determine whether that relationship is apparent in the data or not. For example, in some benefit programs, the cash assistance received by the family is expected to be directly proportional to the number of beneficiaries in the household. Additionally, the relationships of other variables are assessed to detect unexpected patterns.  

We use tableplots (Tennekes et al., 2011) to help assess whether a relationship exists between two or more variables. A tableplot aids the exploration of large data files by summarizing the data over a set of variables. Each variable is represented in a column, and each row is an aggregate of a given number of units.

All records are sorted by the values of one variable in the table (the sort variable), and every column follows that ordering. The sort variable should be chosen for its helpfulness in assessing the relationship with the other variables. The default method partitions the sort variable into percentiles and graphs its approximate distribution by percentile. Thus, the y axis presents percentiles of the sort variable.


For continuous variables, the length of the light blue bar represents the mean, and the dark blue bars represent the standard deviation. For discrete variables, different colors indicate the frequency of each category. 
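
The aggregation behind a tableplot can be approximated in base R. The sketch below is illustrative only and is not the tabplot implementation: it sorts a made-up data frame by one variable, splits the sorted rows into 100 equal-sized bins, and computes the mean and standard deviation of each numeric column per bin. The names `toy`, `n_members`, and `cash_amount` are hypothetical, and the descending sort order is an assumption of this example.

```{r sketch_binned_summary, eval = FALSE}
# Toy data: hypothetical household records (not from the toolkit).
set.seed(1)
toy <- data.frame(n_members = sample(1:6, 1000, replace = TRUE))
toy$cash_amount <- 100 * toy$n_members + rnorm(1000, sd = 20)

# Sort all records by the chosen sort variable (largest values first).
sort_col <- "cash_amount"
sorted <- toy[order(toy[[sort_col]], decreasing = TRUE), ]

# Assign each sorted row to one of 100 equal-sized bins (percentiles).
n_bins <- 100
sorted$bin <- ceiling(seq_len(nrow(sorted)) * n_bins / nrow(sorted))

# One row per bin: mean and standard deviation of each numeric variable.
bin_mean <- aggregate(cbind(n_members, cash_amount) ~ bin, data = sorted, FUN = mean)
bin_sd   <- aggregate(cbind(n_members, cash_amount) ~ bin, data = sorted, FUN = sd)
head(bin_mean)
```

Each row of `bin_mean` corresponds to one horizontal slice of a tableplot column, and `bin_sd` gives the spread within that slice.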

We recommend using at least one key variable as the sort variable, which the user can specify. By default, this program selects all numeric key variables and generates one tableplot sorted by each of them. Additionally, the user can specify subsetting parameters for the data file (in Toolkit.Rmd), such as specific time ranges.  
  
  
### Relationships among key variables

To study the relationship between key variables, the user should select the sorting variable from among the key variables.

**Select sorting variable:**
```{r rel_among_key_vars1}
##### SELECT SORTING VARIABLE #####
sort_var <- NULL
###################################

# If no sorting variable is specified, or if any sorting variable is not a key
# variable, we fall back to all numeric key variables as sorting variables.
# If there are no numeric key variables, we use all key variables.
if (is.null(sort_var) || !all(is.element(sort_var, key_vars))) {
  sort_var <- NULL
  for (i in key_vars) {
    if (is.numeric(analysis_file[[i]])) {
      sort_var <- c(sort_var,i)
    }
  }
  if (length(sort_var) == 0) {
    sort_var <- key_vars
  }
}
```

**Tableplots:** We create one tableplot for each sorting variable.
```{r rel_among_key_vars3}
# Tableplot with key variables
if (!is.null(sort_var)) {
  options(ffmaxbytes = min(getOption("ffmaxbytes"),.Machine$integer.max * 12))
  for (i in sort_var) {
    tableplot(subsetted, 
            select_string = key_vars, 
            sortCol = i, 
            scale = "lin", 
            title = paste0("Distribution of key variables. Sorted by: ",i,"\nSubset: ",subset_text))
  }
} else {
  print("No valid sorting variable")
}
```

**What to look for in tableplots**:  
- *Relations between variables*: Use the graph to check whether the relationships between variables show any unexpected trends that warrant further inquiry. 
- *Discontinuities in sorted variables*: Among sorted variables, look for jumps in the distribution of continuous variables. Note that discrete variables may exhibit discrete jumps which are perfectly fine, such as increasing or decreasing the number of household members by one or more.  
- *Discontinuities in unsorted variables*: Among unsorted variables, look for discontinuities in continuous and categorical variables, and note the direction of the jump. Some jumps are fine, such as jumps in the cash received by the household when the number of recipients increased.
- *Range of values*: Look at the range of values of each variable to detect variables with a particularly high or low range, which may cast doubt on the validity of the values.  
- *High standard deviation or discrete jumps in the standard deviation*: Look for high standard deviations, which may indicate problems in the validity of the values. For example, in some cases missing values are coded with values such as 999999 or 0, which will increase the variability of the data. Also, look for drastic changes in the standard deviation.  
- *Highest and lowest values*: Check the highest and lowest values of the sorted variable and how they relate to the other variables.  
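
The effect of missing-value codes on the standard deviation can be seen with a small example. This is a sketch on made-up values: the code `999999` and the variable name `amount` are assumptions and should be replaced with the codes documented for the data file being analyzed.

```{r sketch_sentinel_codes, eval = FALSE}
# Hypothetical benefit amounts; two records carry 999999 as a missing-value code.
amount <- c(250, 300, 275, 999999, 310, 290, 999999, 260)

sd(amount)                      # inflated by the missing-value codes

# Recode the assumed missing-value codes to NA before summarizing.
sentinels <- c(999999)
amount_clean <- replace(amount, amount %in% sentinels, NA)

sd(amount_clean, na.rm = TRUE)  # standard deviation of the valid values only
sum(is.na(amount_clean))        # number of records that were recoded
```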
  
  
### Relationships among key and domain variables

Here we include domain variables to further assess the relationship between key and domain variables. The same sorting variable and subsetting parameters are used.

**Tableplots:**
```{r rel_among_key_domain_vars}
#Tableplot with key and domain variables
if (!is.null(sort_var)) {
  for (i in sort_var) {
    tableplot(subsetted, 
            select_string = c(key_vars, domain_vars), 
            sortCol = i, 
            scale = "lin", 
            title = paste0("Distribution of key/domain variables. Sorted by: ",i,"\nSubset: ",subset_text))
  }
} else {
  print("No valid sorting variable")
}
```

**Interpretation**: In addition to the guidance provided in the previous subsection, check if there are subgroups of units that would be relevant to analyze separately. For example, with administrative data for cash assistance, units with only one recipient may exhibit a different behavior than units with more than one recipient, such as higher average amount received.  
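
One quick way to judge whether a subgroup deserves a separate look is to compare a summary statistic across it. The following is a minimal sketch on made-up data; `n_recipients` and `cash_amount` are hypothetical variable names, and comparing the amount per recipient is only one possible reading of the example above.

```{r sketch_subgroup_means, eval = FALSE}
# Hypothetical cash assistance records (not from the toolkit).
toy <- data.frame(
  n_recipients = c(1, 1, 1, 2, 3, 2, 4, 1),
  cash_amount  = c(300, 320, 310, 400, 450, 380, 520, 290)
)

# Flag units with exactly one recipient.
toy$single_recipient <- toy$n_recipients == 1

# Average amount per recipient within each group.
toy$amount_per_recipient <- toy$cash_amount / toy$n_recipients
group_means <- aggregate(amount_per_recipient ~ single_recipient, data = toy, FUN = mean)
group_means
```

A large gap between the two group means suggests that the groups should be plotted separately, as in the next subsection.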
  
  
### Relationships among key and domain variables by subgroup

In some cases, it is relevant to analyze the relationship among variables for different categories of units. This is the case when we expect that data will exhibit different patterns in each subgroup. For example, in the case of benefit receipt, cases that only have children as recipients show certain characteristics that make them different from other cases, such as the exemption from federal time limits. In time-series or longitudinal analysis, it is possible to assess the relationship between variables for different time intervals.

The user should select one categorical variable that identifies the groups to be analyzed. The analysis works best when the variable has fewer than five categories; variables with 15 or more categories are skipped. By default, the system uses the first domain variable. Tableplots are sorted by one variable that the user can define. If none is specified, the first of the default sorting variables is used, which is normally the first numeric key variable.
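
Before choosing a subgroup variable, it can help to count the categories of each candidate. The sketch below uses a made-up data frame and hypothetical variable names (`case_type`, `region`, `year`); in the toolkit, the candidates would be the key, domain, location, and time variables.

```{r sketch_count_categories, eval = FALSE}
# Hypothetical data with three candidate grouping variables.
toy <- data.frame(
  case_type = c("child_only", "adult", "adult", "child_only", "adult"),
  region    = c("N", "S", "E", "W", "N"),
  year      = c(2015, 2015, 2016, 2016, 2016)
)
candidates <- c("case_type", "region", "year")

# Number of distinct non-missing categories per candidate variable.
n_categories <- sapply(candidates, function(v) nlevels(as.factor(toy[[v]])))
n_categories

# Candidates with fewer than five categories are the easiest to read.
names(n_categories)[n_categories < 5]
```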

**Select the subgroup variable:**
```{r rel_among_key_domain_vars_bysubgroup1}
##### SELECT SUBGROUP VARIABLE #####
subgroup <- domain_vars[1]
####################################

# If no subgroup variable is specified or if subgroup variable is not a key,
# domain, time, or location variable, or if there is more than one subgroup 
# variable, we skip this section.
if (is.null(subgroup) || 
    !all(is.element(subgroup, c(key_vars, domain_vars, location_vars, time_vars))) || 
    !(length(subgroup)==1) )  {
  print("Subgroup variable is missing, is not a key, domain, time, or location variable, or there is more than one subgroup variable")
  subgroup <- NULL
} else if (nlevels(as.factor(subsetted[[subgroup]])) >= 15) {
  print("Too many levels in the subgroup variable")
  subgroup <- NULL
}
```

**Tableplots:** We create one tableplot for each category of the subgroup variable.
```{r rel_among_key_domain_vars_bysubgroup2}
# Select just the first sorting variable 
sort_var <- sort_var[1]

if (!is.null(subgroup) && !is.null(sort_var)) {
  # Coerce to factor so that non-factor subgroup variables also yield levels
  for ( i in levels(as.factor(subsetted[[subgroup]])) ) {
    data5 <- subset(subsetted,subsetted[[subgroup]] == i)
    if (nrow(data5) > 0) {
      tableplot(data5, 
                select_string = c(key_vars, domain_vars), 
                sortCol = sort_var, 
                scale = "lin", 
                title = paste0("Tableplot sorted by: ",sort_var, ". Where ",subgroup, "= ", i,"\nSubset: ",subset_text))
    }
  }
} else {
  print("No valid sorting or subgroup variable")
}
```

**Interpretation**: Check whether there is any difference in the relationship patterns between variables across the different subgroups. Also check whether data quality issues are concentrated in certain subgroups.
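
To check whether a data quality issue is concentrated in a subgroup, a simple tabulation can complement the tableplots. The sketch below computes the share of missing values of one variable per subgroup on made-up data; `case_type` and `cash_amount` are hypothetical names, and missingness is only one example of a quality issue.

```{r sketch_missing_by_subgroup, eval = FALSE}
# Hypothetical records with some missing amounts (not from the toolkit).
toy <- data.frame(
  case_type   = c("child_only", "child_only", "child_only", "adult", "adult", "adult"),
  cash_amount = c(NA, 200, NA, 310, 295, NA)
)

# Share of records with a missing amount within each subgroup.
missing_share <- tapply(is.na(toy$cash_amount), toy$case_type, mean)
missing_share
```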