diff --git a/.gitignore b/.gitignore
index 86d38d2..ddb01cf 100644
--- a/.gitignore
+++ b/.gitignore
@@ -63,6 +63,9 @@ vignettes/*.pdf
*.RDA
*.Rda
!vignettes/*.rda
+!data/outcomesCTN0094.rda
+!data/egOpioidsCTN0094.rda
+
# csv data
*.csv
diff --git a/DESCRIPTION b/DESCRIPTION
index 47fd733..267375a 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -26,9 +26,7 @@ License: GPL-3
Encoding: UTF-8
RoxygenNote: 7.2.1
Depends:
- R (>= 3.1.0),
- ctn0094data,
- ctn0094DataExtra
+ R (>= 3.1.0)
Imports:
stringi,
stringr
diff --git a/R/data_egOpioidsCTN0094.R b/R/data_egOpioidsCTN0094.R
new file mode 100644
index 0000000..30a404e
--- /dev/null
+++ b/R/data_egOpioidsCTN0094.R
@@ -0,0 +1,25 @@
+#' @title Opioid Use by Study Day for Example CTN-0094 Participants
+#'
+#' @description This data set is a table with a daily positive opioid use
+#' indicator for 10 participants from the CTN-0094 harmonized data sets. This
+#' subset is to be used as an example of timeline-style opioid use data.
+#'
+#' @details This data is created in the script
+#' `inst/scripts/create_allDrugs_opioid_subset_20220916.R`. The "when" column
+#' counts the days from the participant's signed consent to the collection of
+#' the opioid-positive urine screen.
+#'
+#' @docType data
+#'
+#' @usage data(egOpioidsCTN0094)
+#'
+#' @format A tibble with `r scales::comma(nrow(egOpioidsCTN0094))` rows and
+#' `r ncol(egOpioidsCTN0094)` columns. These columns include
+#' \describe{
+#' \item{who}{Patient ID}
+#' \item{what}{A factor indicating what substance(s) were present in the
+#' urine on day `when`; trivially, all substances are "opioids" for this
+#' example data.}
+#' \item{when}{The number of days since signed study consent}
+#' }
+"egOpioidsCTN0094"
diff --git a/data/egOpioidsCTN0094.rda b/data/egOpioidsCTN0094.rda
new file mode 100644
index 0000000..f3a5e2e
Binary files /dev/null and b/data/egOpioidsCTN0094.rda differ
diff --git a/data/outcomesCTN0094.rda b/data/outcomesCTN0094.rda
new file mode 100644
index 0000000..9e99fc6
Binary files /dev/null and b/data/outcomesCTN0094.rda differ
diff --git a/inst/scripts/create_allDrugs_opioid_subset_20220916.R b/inst/scripts/create_allDrugs_opioid_subset_20220916.R
new file mode 100644
index 0000000..919161b
--- /dev/null
+++ b/inst/scripts/create_allDrugs_opioid_subset_20220916.R
@@ -0,0 +1,22 @@
+# Excerpt of all_drugs for 10 example participants
+# Gabriel Odom
+# 2022-09-16
+
+library(ctn0094data)
+library(ctn0094DataExtra)
+library(tidyverse)
+
+interestingPeople_int <- c(
+ 163L, 210L, 242L, 4L, 17L, 13L, 1103L, 33L, 233L, 2089L
+)
+
+egOpioidsCTN0094 <-
+ all_drugs %>%
+ filter(who %in% interestingPeople_int) %>%
+ filter(source %in% c("UDS", "UDSAB")) %>%
+ filter(what == "Opioid") %>%
+ select(-source) %>%
+ distinct() %>%
+ arrange(who, when)
+
+usethis::use_data(egOpioidsCTN0094)
diff --git a/man/egOpioidsCTN0094.Rd b/man/egOpioidsCTN0094.Rd
new file mode 100644
index 0000000..a717c45
--- /dev/null
+++ b/man/egOpioidsCTN0094.Rd
@@ -0,0 +1,32 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/data_egOpioidsCTN0094.R
+\docType{data}
+\name{egOpioidsCTN0094}
+\alias{egOpioidsCTN0094}
+\title{Opioid Use by Study Day for Example CTN-0094 Participants}
+\format{
+A tibble with 71 rows and
+3 columns. These columns include
+\describe{
+\item{who}{Patient ID}
+\item{what}{A factor indicating what substance(s) were present in the
+urine on day \code{when}; trivially, all substances are "opioids" for this
+example data.}
+\item{when}{The number of days since signed study consent}
+}
+}
+\usage{
+data(egOpioidsCTN0094)
+}
+\description{
+This data set is a table with a daily positive opioid use
+indicator for 10 participants from the CTN-0094 harmonized data sets. This
+subset is to be used as an example of timeline-style opioid use data.
+}
+\details{
+This data is created in the script
+\code{inst/scripts/create_allDrugs_opioid_subset_20220916.R}. The "when" column
+counts the days from the participant's signed consent to the collection of
+the opioid-positive urine screen.
+}
+\keyword{datasets}
diff --git a/vignettes/CTNote_vignette_20220908.Rmd b/vignettes/CTNote_vignette_20220908.Rmd
index 29ff6e9..1b965ec 100644
--- a/vignettes/CTNote_vignette_20220908.Rmd
+++ b/vignettes/CTNote_vignette_20220908.Rmd
@@ -1,6 +1,6 @@
---
-title: "An n-Ary Word Sufficient Statistic for Sequential Univariate Categorical Values"
-author: "Gabriel Odom, Laura Brandt, Ray Balise, and the CTN-0094 Team"
+title: "COOL TITLE HERE: An n-Ary Word Sufficient Statistic for Sequential Univariate Categorical Values"
+author: "Gabriel Odom, Laura Brandt, Ray Balise, Clinton Castro, and the CTN-0094 Team"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
@@ -30,14 +30,14 @@ knitr::opts_chunk$set(echo = FALSE)
```
```{r, load-packages, include=FALSE}
-library(ctn0094data)
-library(ctn0094DataExtra)
library(CTNote)
library(readxl)
library(kableExtra)
library(tidyverse)
```
+### Abstract
+
# Introduction and Motivation
@@ -45,9 +45,9 @@ library(tidyverse)
## Background
-Over 750,000 Americans have died from a drug overdose [since 1990](https://www.hhs.gov/opioids/about-the-epidemic/opioid-crisis-statistics/index.html) [@us_hhs_opioid_2018]. Even worse, after the onset of the COVID-19 pandemic, millions of Americans have self-reported [increases in their substance use](https://www.samhsa.gov/newsroom/press-announcements/202110260320) "a little more or much more" [@us_samhsa_samhsa_2021], and over 40 million Americans [needed treatment](https://zinniahealth.com/research/substance-use-disorder-treatment-by-state) for a substance use disorder (SUD) in 2020 [@liebhaber_over_2022]. SUDs are treated with a [combination of pharmacological and psychological interventions](https://www.samhsa.gov/medication-assisted-treatment), known as medication assisted treatment [@us_samhsa_medication-assisted_2022]. When evaluating the efficacy of new medication-assisted treatments for SUDs, clinicians require trial participants to provide frequent urine samples, known as urine drug screenings (UDS), which are tested for the presence of substances of misuse.
+Over 750,000 Americans have died from a drug overdose [since 1990](https://www.hhs.gov/opioids/about-the-epidemic/opioid-crisis-statistics/index.html).[@us_hhs_opioid_2018] Even worse, after the onset of the COVID-19 pandemic, millions of Americans have self-reported [increases in their substance use](https://www.samhsa.gov/newsroom/press-announcements/202110260320) "a little more or much more",[@us_samhsa_samhsa_2021] and over 40 million Americans [needed treatment](https://zinniahealth.com/research/substance-use-disorder-treatment-by-state) for a substance use disorder (SUD) in 2020.[@liebhaber_over_2022] SUDs are treated with a [combination of pharmacological and psychological interventions](https://www.samhsa.gov/medication-assisted-treatment), known as medication assisted treatment.[@us_samhsa_medication-assisted_2022] When evaluating the efficacy of new medication-assisted treatments for SUDs, clinicians require trial participants to provide frequent urine samples, known as urine drug screenings (UDS), which are tested for the presence of substances of misuse.
-For a single substance or group of substances, outcomes of these tests are uni-dimensional and categorical. Common UDS categories include "substance positive", "substance negative", "improper urine sample temperature", "missing urine", and others. Further, because these tests are given sequentially over the course of a clinical trial, the tests are correlated within subject and across study week. Because of these properties, single-value summaries of the UDS pattern (such as the proportion of "substance positive" results, or the maximum number of consecutive "substance negative" results) are not *sufficient statistics* (they do not contain all of the information about each trial participants' UDS pattern that was contained in the original UDS data) [@fisher_mathematical_1922].
+For a single substance or group of substances, outcomes of these tests are uni-dimensional and categorical. Common UDS categories include "substance positive", "substance negative", "improper urine sample temperature", "missing urine", and others. Further, because these tests are given sequentially over the course of a clinical trial, the tests are correlated within subject and across study week. Because of these properties, single-value summaries of the UDS pattern (such as the proportion of "substance positive" results, or the maximum number of consecutive "substance negative" results) are not *sufficient statistics* (they do not contain all of the information about each trial participants' UDS pattern that was contained in the original UDS data).[@fisher_mathematical_1922]
## Overview of this Paper
@@ -67,16 +67,16 @@ In this paper, we extend the concept of a binary word to multiple categories. We
We aim to create a sequence of letters and symbols that act as a sufficient statistic for a categorical random variable observed in a longitudinal manner. Our motivating example is a compact and sufficient representation of a subject's weekly UDS results. For laboratory tests which detect a substance of interest in the urine, this statistic is a compact representation of the full pattern of substance use for an individual participant. To be of use, this summary statistic must have the following properties:
-1. (*Machine-Readable*) It can be directly [parsed by a computer](https://opendatahandbook.org/glossary/en/terms/machine-readable/) [@open_data_handbook_machine_2022].
-2. (*Parsimonious*) It can be easily interpreted by a human.
+1. (*Machine-Readable*) It can be directly [parsed by a computer](https://opendatahandbook.org/glossary/en/terms/machine-readable/).[@open_data_handbook_machine_2022]
+2. (*Human-Readable/Parsimonious*) It can be easily interpreted by a human.
3. (*Sufficient*) It represents all of the same information about the sequence of interest that would be present in the full data.
## Binary and $n$-ary "Words"
-[Binary words](https://www.oxfordreference.com/view/10.1093/acref/9780199235940.001.0001/acref-9780199235940-e-293) are a compact representation of a sequence of logical or binary variables [@clapham_binary_2009]. For example, let the random variable $x$ indicate whether or not a clinical trial participant visited their clinic in each week over a four-week period, and assume that we observe the pattern: "visit", "visit", "no visit", "visit". If "1" represents a visit and "0" represents no recorded visit, then we can represent this clinic visit pattern as the binary word "1101". Similarly, we could represent these two categories symbolically as "V" for visit and "_" for no visit, so our binary word becomes "VV_V". Notice that for a short sequence like our 4-week example here, the original data itself is both machine readable and human readable. However, parsimony (human readability) begins to suffer as sequences grow longer.
+[Binary words](https://www.oxfordreference.com/view/10.1093/acref/9780199235940.001.0001/acref-9780199235940-e-293) are a compact representation of a sequence of logical or binary variables.[@clapham_binary_2009] For example, let the random variable $x$ indicate whether or not a clinical trial participant visited their clinic in each week over a four-week period, and assume that we observe the pattern: "visit", "visit", "no visit", "visit". If "1" represents a visit and "0" represents no recorded visit, then we can represent this clinic visit pattern as the binary word "1101". Similarly, we could represent these two categories symbolically as "V" for visit and "_" for no visit, so our binary word becomes "VV_V". Notice that for a short sequence like our 4-week example here, the original data itself is both machine readable and human readable. However, parsimony (human readability) begins to suffer as sequences grow longer.
-Extensions of this concept have been [famously applied to the human genome](https://www.ncbi.nlm.nih.gov/Class/MLACourse/Original8Hour/Genetics/basepair.html) [@geer_genetics_1999]. The four nitrogen bases in DNA are abbreviated by their first letter: A for adenine, T for thymine, G for guanine, and C for cytosine. These four letters are collapsed by their position in the genome into a *quaternary* word (or a *quinary* word if U is included for uracil). For example, the "word" AAACCATTCACAATCAGACA expresses a sequence of 20 nucleic acid bases without loss of information about the bases (the "word" is *sufficient*). The "word" AAACCATTCACAATCAGACA is also much easier to read by a human than the condensed structural formula or a skeletal-structural formula of these 20 compounds (the "word" is *parsimonious*). Furthermore, such a quaternary word can be parsed by a computer with ease (the "word" is *machine readable*).
+Extensions of this concept have been [famously applied to the human genome](https://www.ncbi.nlm.nih.gov/Class/MLACourse/Original8Hour/Genetics/basepair.html).[@geer_genetics_1999] The four nitrogen bases in DNA are abbreviated by their first letter: A for adenine, T for thymine, G for guanine, and C for cytosine. These four letters are collapsed by their position in the genome into a *quaternary* word (or a *quinary* word if U is included for uracil). For example, the "word" AAACCATTCACAATCAGACA expresses a sequence of 20 nucleic acid bases without loss of information about the bases (the "word" is *sufficient*). The "word" AAACCATTCACAATCAGACA is also much easier for a human to understand than the condensed structural formula or a skeletal-structural formula of these 20 compounds (the "word" is *human readable*). Furthermore, such a quaternary word can be parsed by a computer with ease (the "word" is *machine readable*).
## Quinary UDS Pattern Legend
@@ -89,7 +89,7 @@ We now build from the examples above. In order to compactly summarize a particip
- *: inconclusive results or mixed results (e.g. subject provided more than one urine sample in the time interval and they did not agree)
- _: no specimens required (weekends, holidays, pre-randomization period, alternating visit days/weeks)
-Obviously this legend can be modified and extended to better reflect the complexities of individual clinical trials, but we believe that the structure (representing visits / short time intervals as single symbols) will still be beneficial as a quick "snapshot" of the patient's treatment results and adherence to trial protocol, and also as a computable summary of the subject's behavior during the follow-up period and post-trial analysis phase.
+We note that the "+" and "-" symbols have been used in clinical shorthand for decades. Obviously this legend can be modified and extended to better reflect the complexities of individual clinical trials, but we believe that the structure (representing visits / short time intervals as single symbols) will still be beneficial as a quick "snapshot" of the patient's treatment results and adherence to trial protocol, and also as a computable summary of the subject's behavior during the follow-up period and post-trial analysis phase.
## A Use Pattern Example
@@ -97,31 +97,17 @@ Obviously this legend can be modified and extended to better reflect the complex
egWho_int <- 2089L
egUsePattern_char <-
- ctn0094DataExtra::derived_weeklyOpioidPattern %>%
+ CTNote::outcomesCTN0094 %>%
filter(who == egWho_int) %>%
- # select(who, `Weeks of Treatment` = endWeek, `Opioid Use Pattern` = Phase_1)
- pull(Phase_1)
+ pull(usePatternUDS)
```
We will first observe the recorded opioid use data via urine screen for a subject after randomization.
```{r results='asis'}
egTab1_df <-
- ctn0094data::all_drugs %>%
- filter(who == egWho_int) %>%
- filter(source == "UDSAB") %>%
- filter(when >= 0) %>%
- filter(what == "Opioid") %>%
- arrange(when)
+ CTNote::egOpioidsCTN0094 %>%
+ filter(who == egWho_int)
kable(egTab1_df)
-
-# if(!knitr::is_html_output()) {
-# "TABLE OF SUBSTANCE USE FOR EXAMPLE PARTICIPANT HERE; SEE .html VERSION "
-# } else {
-# egTab_df %>%
-# kable() %>%
-# scroll_box(width = "500px", height = "500px")
-# }
-
```
While we observe the study days that this participant used opioids throughout the clinical trial, it's difficult visualize the patient's treatment trajectory using the data in this form. On the one hand, after some preliminary data cleaning, this data can be parsed by a computer (*machine readability*). Additionally, this table represents all of the opioid use for this subject (*sufficiency*). However, data in this form is **not** easy to directly interpret by a clinician or researcher. Can we immediately derive clinical intuition about this patient? Does the treatment appear to be effective?
@@ -130,6 +116,19 @@ While we observe the study days that this participant used opioids throughout th
In contrast, observe the opioid use pattern summary for this same participant:
`++++---+--------------o-`. Using the basic pattern definition legend above, we can clearly see that this subject had a challenging first month (`++++`), improved in the second month (`---+`), and then remained abstinent from opioids for the remainder of the clinical trial (`--------------o-`). Participant substance use pattern data in this form *is* easy to interpret by a clinician. Moreover, this symbolic representation of substance use can be parsed and summarized by a computer. Finally, notice that this pattern represents the entire course of treatment for this subject without loss of any clinically relevant information about their opioid misuse.
+One last major strength is that using these opioid summary words allows for quick inspection of multiple patients' treatment trajectories all at once. Instead of checking multiple charts one by one to get a rough assessment of the treatment efficacy, we can quickly see overall study arm patterns at a glance. The table below is an example of what these urine screen "words" would look like after 6 weeks of the study. This gives us a quick visual summary of which participants have responded well to the treatment so far (4, 13, 33, 163, and 242) and which participants have not (17, 210, 233, 1103, and 2089). Such quick clinical insights would be challenging to achieve from the raw data alone.
+```{r, message=FALSE}
+CTNote::egOpioidsCTN0094 %>%
+ select(who) %>%
+ distinct() %>%
+ left_join(CTNote::outcomesCTN0094, by = "who") %>%
+ select(who, usePatternUDS) %>%
+ mutate(usePatternUDS = str_trunc(usePatternUDS, width = 6, ellipsis = "")) %>%
+ as.data.frame() %>%
+ print(row.names = FALSE)
+```
+
+
*******************************************************************************
@@ -138,7 +137,7 @@ In contrast, observe the opioid use pattern summary for this same participant:
# Results
-We mentioned above that these use pattern summaries are "machine readable" (that a computer can parse them). The software package `CTNote::` is one such tool to enable computers to parse these use patterns [@odom_ctnote_2022]. This software package contains the following groups of routines (also known as functions):
+We mentioned above that these use pattern summaries are "machine readable" (that a computer can parse them). The software package `CTNote::` is one such tool to enable computers to parse these use patterns.[@odom_ctnote_2022] This software package contains the following groups of routines (also known as functions):
- Handle missing UDS data with `recode_missing_visits()` and `impute_missing_visits()`
- Account for the study observation design or trial visit protocol with `collapse_lattice()` and `view_by_lattice()`
@@ -151,7 +150,7 @@ By executing various combinations of these routines, we were able write algorith
## Example: Abstinence Outcome
-One of the [Schottenfeld et al. (2008)](https://doi.org/10.1016/s0140-6736(08)60954-x) treatment endpoint definitions is "the maximum consecutive days abstinent" from opioids, where missing clinic visits are imputed to represent a urine screen positive for the substance of interest [@schottenfeld_maintenance_2008]. Therefore, the algorithm to calculate this definition for our example subject would be:
+One of the [Schottenfeld et al. (2008)](https://doi.org/10.1016/s0140-6736(08)60954-x) treatment endpoint definitions is "the maximum consecutive days abstinent" from opioids, where missing clinic visits are imputed to represent a urine screen positive for the substance of interest.[@schottenfeld_maintenance_2008] Therefore, the algorithm to calculate this definition for our example subject would be:
```{r, echo=TRUE, eval=FALSE}
# The participant's use pattern summary; the %>% symbol is read "and then"
"++++---+--------------o-" %>%
@@ -164,7 +163,7 @@ One of the [Schottenfeld et al. (2008)](https://doi.org/10.1016/s0140-6736(08)60
## Example: Use Reduction Outcome
-The [Haight et al. (2019)](https://doi.org/10.1016/S0140-6736(18)32259-1) treatment endpoint definition is the "percentage of negative UOS [urine opioid screen] from week 5 to week 24" [@haight_efficacy_2019]. Therefore, the code to calculate this definition for our example subject would be:
+The [Haight et al. (2019)](https://doi.org/10.1016/S0140-6736(18)32259-1) treatment endpoint definition is the "percentage of negative UOS [urine opioid screen] from week 5 to week 24".[@haight_efficacy_2019] Therefore, the code to calculate this definition for our example subject would be:
```{r, echo=TRUE, eval=FALSE}
count_matches(
use_pattern = "++++---+--------------o-",
@@ -180,7 +179,7 @@ count_matches(
## Example: Relapse Outcome
-The [Krupitsky et al. (2006)](https://doi.org/10.1016/j.jsat.2006.05.005) treatment endpoint has relapse defined as "3 consecutive positive UOS", where missing clinic visits are imputed to represent a urine screen positive for the substance of interest [@krupitsky_naltrexone_2006]. Therefore, the algorithm to calculate this definition for our example subject would be:
+The [Krupitsky et al. (2006)](https://doi.org/10.1016/j.jsat.2006.05.005) treatment endpoint has relapse defined as "3 consecutive positive UOS", where missing clinic visits are imputed to represent a urine screen positive for the substance of interest.[@krupitsky_naltrexone_2006] Therefore, the algorithm to calculate this definition for our example subject would be:
```{r, echo=TRUE, eval=FALSE}
"++++---+--------------o-" %>%
recode_missing_visits(missing_becomes = "+") %>%
@@ -190,7 +189,7 @@ The [Krupitsky et al. (2006)](https://doi.org/10.1016/j.jsat.2006.05.005) treatm
detect_subpattern(subpattern = "+++")
```
-If we were interested in relapse from a "time-to-event" or reliability perspective [@kleinbaum_introduction_2012], the algorithm above changes only in the last line:
+If we were interested in relapse from a "time-to-event" or reliability perspective,[@kleinbaum_introduction_2012] the algorithm above changes only in the last line:
```{r, echo=TRUE, eval=FALSE}
"++++---+--------------o-" %>%
recode_missing_visits(missing_becomes = "+") %>%
@@ -214,41 +213,40 @@ If we were interested in relapse from a "time-to-event" or reliability perspecti
# Discussion
+## Use Pattern "Word" Strengths
-## Use Pattern "Word" Limitations
+We believe this `CTNote` package and outcomes library will be useful to all future substance use disorder clinical trials in the National Institute on Drug Abuse (NIDA)'s [Clinical Trials Network](https://doi.org/10.1186/s13722-021-00238-6) (CTN) and in other substance use disorder research.[@tai_nida_2021] In developing this construct, we have adhered to the [Open Knowledge](https://doi.org/10.1371/journal.pbio.1001195) philosophy of science,[@molloy_open_2011] so the code for the CTNote package and all related algorithms are open-source.
-While this substance use pattern summary satisfies the three conditions mentioned previously (machine readability, parsimony, and sufficiency), there are three main limitations:
+**HELP!!** I need some help writing this part of the discussion. I think this quinary word structure is an awesome way to represent urine test results over time, but I don't know how to sell a clinician or their statistician on this.
-### Accounting for Multivariate Categorical Variables
-The first limitation surrounds the idea of poly-drug use. These use pattern summaries are substance or substance group specific (*univariate*, statistically speaking). Consider now the poly-drug use for our example trial participant:
-```{r results='asis'}
-egTab2_df <-
- ctn0094data::all_drugs %>%
- filter(who == egWho_int) %>%
- filter(source == "UDSAB") %>%
- filter(when >= 0) %>%
- filter(what %in% c("Opioid", "Cocaine")) %>%
- arrange(when)
-kable(egTab2_df)
-```
-Recall from our examples above that this participant does appear to curb their weekly opioid and heroin use during the course of treatment. However, their cocaine use does *not* decrease over the same time interval. The **opioid** use pattern summary can not display information about concurrent **cocaine** use. While this is a limitation, one of our current areas of research involves expanding the support of substance use pattern "words" to include poly-substance use in a meaningful way (including the ability to preserve cross-substance correlations).
+## Accounting for Multivariate Categorical Variables
-### The Complexity-Parsimony Tradeoff
-The second limitation is one of parsimony. Technically speaking, in order to ensure that use pattern summaries work regardless of a computer's locale or operating system, we limit the symbols in the "word" to proper symbols from the [American Standard Code for Information Interchange](https://www.ascii-code.com/) (ASCII) list [@injosoft_ab_extended_2022], which has been a computing standard [for decades](http://edition.cnn.com/TECH/computing/9907/06/1963.idg/) [@brandel_1963_1999]. Technically speaking, there are 128 [7-bit ASCII](https://www.ibm.com/docs/en/aix/7.1?topic=support-ascii-characters) symbols, but only 96 are visible (printable) characters [@ibm_ascii_2022]. Put simply, these 96 characters are all the symbols that a traditional North American computer keyboard can make. Therefore, we could define a substance use pattern "word" with any combination of these 96 printable symbols.
+These use pattern summaries are specific to a single substance or substance group (*univariate*, statistically speaking). From this opioid use pattern summary word, we can see that our example trial participant quickly reduced their opioid use. However, if we inspected their complete urine screening record, we would see an increase in use of some other substances. For example, this participant does *not* decrease their cocaine use over the same time interval. The **opioid** use pattern summary cannot display information about concurrent **cocaine** use. This is one of our current areas of research: expanding the support of substance use pattern "words" to include poly-substance use in a meaningful way (including the ability to preserve cross-substance correlations).
-However, such a visual summary with 96 possible values would be incredibly challenging to interpret, consequently failing to meet the parsimony requirement (that our summary be easily read by a clinician). In the interest of simplicity, we chose the five symbols mentioned above (`+`, `-`, `o`, `*`, and `_`). We recognize that different clinical trial designs may necessitate the introduction of additional symbols to this use pattern summary, but we strongly recommend (due to human [memory constraints](https://doi.org/10.1037/h0043158)) that such lists be kept to seven symbols or fewer [@miller_magical_1956]. As an example, we discussed the benefit of adding special symbols for a "missing but excused" clinic visit (or for "urine sample of improper temperature"), but we ultimately did not because such instances were rare in our data.
+
+
+
+
+
+
+
+
+
+
-### The Inflexibility of Computers
-The third limitation is due to the cold and unfeeling logic of a computer. The computer cannot understand the many extenuating circumstances that pervade our human experience. Worth repeating is the old maxim of computing: "computers give you what you ask for, not what you want." Consider a concrete example: assume a participant was supposed to visit the clinic for a weekly urine test on Thursday, but they came on Wednesday instead due to a conflict at their work; their urine sample from Wednesday came back "clean". What would a nurse at the substance use clinic do in this case? Most people are reasonable; they would count the urine screen for that week as negative for illicit substances. What would a computer do? Without the direct input of a human to override the clinical trial protocol, a computer would count the negative urine on Wednesday but still mark the participant as a "failure to appear" / "missing urine screen" on Thursday. In many clinical trial protocols, missing urine screens are imputed as "positive" for the substance of interest, so now this participant's use pattern summary is marked `*` (mixed results) instead of `-` (negative results) for that week.
+## The Complexity-Parsimony Tradeoff
+We now discuss some of the technical components surrounding parsimony. In order to ensure that use pattern summaries work regardless of a computer's locale or operating system, we limit the symbols in the "word" to proper symbols from the [American Standard Code for Information Interchange](https://www.ascii-code.com/) (ASCII) list,[@injosoft_ab_extended_2022] which has been a computing standard [for decades](http://edition.cnn.com/TECH/computing/9907/06/1963.idg/).[@brandel_1963_1999] Technically speaking, there are 128 [7-bit ASCII](https://www.ibm.com/docs/en/aix/7.1?topic=support-ascii-characters) symbols, but only 95 are printable characters.[@ibm_ascii_2022] Put simply, these 95 characters are all the symbols that a traditional North American computer keyboard can make. Therefore, we could define a substance use pattern "word" with any combination of these 95 printable symbols.
-## Use Pattern "Word" Strengths
+However, such a visual summary with 95 possible values would be incredibly challenging to interpret, consequently failing to meet the parsimony requirement (that our summary be easily read by a clinician). In the interest of simplicity, we chose the five symbols mentioned above (`+`, `-`, `o`, `*`, and `_`). We recognize that different clinical trial designs may necessitate the introduction of additional symbols to this use pattern summary, but we strongly recommend (due to human [memory constraints](https://doi.org/10.1037/h0043158)) that such lists be kept to seven symbols or fewer.[@miller_magical_1956] As an example, we discussed the benefit of adding special symbols for a "missing but excused" clinic visit (or for "urine sample of improper temperature"), but we ultimately did not because such instances were rare in our data.
-We believe this `CTNote` package and outcomes library will be useful to all future substance use disorder clinical trials in the National Institute of Drug Abuse (NIDA)'s [Clinical Trials Network](https://doi.org/10.1186/s13722-021-00238-6) (CTN) and in other substance use disorder research [@tai_nida_2021]. While developing this construct, our work adheres to the [Open Knowledge](https://doi.org/10.1371/journal.pbio.1001195) philosophy of science [@molloy_open_2011], so the code for the CTNote package and all related algorithms are open-source.
-**HELP!!** I need some help writing this part of the discussion. I think this quinary word structure is an awesome way to represent urine test results over time, but I don't know how to sell a clinician or their statistician on this.
+## The Inflexibility of Computers
+
+When discussing how algorithms make decisions about human behavior, we should never fail to account for the cold and unfeeling logic of a computer. Worth repeating is the old maxim of computing: "computers give you what you ask for, not what you want." The computer cannot understand the many extenuating circumstances that pervade our human experience. Consider a concrete example: assume a participant was supposed to visit the clinic for a weekly urine test on Thursday, but they came on Wednesday instead due to a conflict at their work; their urine sample from Wednesday came back "clean". What would a nurse at the substance use clinic do in this case? Most people are reasonable; they would count the urine screen for that week as negative for illicit substances. What would a computer do? Without the direct input of a human to override the clinical trial protocol, a computer would 1) count the negative urine on Wednesday, and 2) mark the participant as a "failure to appear" / "missing urine screen" on Thursday. Then, because missing urine screens are imputed as "positive" for the substance of interest in many clinical trial protocols, the computer would 3) write this participant's urine screen as positive for Thursday, and then 4) mark the use pattern summary as `*` (mixed results) instead of `-` (negative results) for that week. While automation is incredibly powerful when done correctly, we must make every effort to ensure that our code "first does no harm".
+
*******************************************************************************
@@ -256,4 +254,11 @@ We believe this `CTNote` package and outcomes library will be useful to all futu
+# Conclusion
+We have presented a use case of the quinary word structure to summarize clinical trial participants' opioid screening information. Additionally, we gave some code examples of how to use these quinary word summaries in analysis practice, all using the open-source R package `CTNote::`. Finally, we gave some comments on the strengths and limitations of this approach, and some discussion of applying this data summarization technique in practice.
+
+**NEEDS MORE LOVE HERE**
+
+
+
# References