# 8 Analysis dataset preparation

IQR Tools implements functionality to facilitate the creation of analysis or modeling datasets based on a general dataset format. The general dataset format aims to contain at least the minimum information required and, at the same time, to be flexible enough to store all information that is relevant for the analysis.

The functions provided by IQRtools enable traceable programming of the analysis dataset. Metadata (e.g., covariate value units or covariate category labels) is retained and available for documentation and for producing meaningful graphs and tables. Whenever the data is manipulated for a particular task, e.g., removing outliers or imputing covariates, log files are written.

The first section of this chapter (Example workflow) gives a basic example of establishing an analysis dataset in the following steps:

• Specification of the source dataset
• Import to the IQRdataGENERAL format
• Graphical data exploration
• Cleaning the data to establish analysis dataset
• Export to .csv and .xpt

The second section (Workflow customization) discusses further options for the following steps, with which the basic workflow can be customized:

• Settings for dataset import
• Statistical exploration
• Graphical exploration

## 8.1 Example workflow

### 8.1.1 Original dataset in general row-based format

In our example, the source data contains data from a single ascending dose study in healthy male subjects that should be prepared for a population PK analysis. The dataset is provided as a comma separated file:

# Define dataset location
dataFile <- "material/01-01-DataProgAnal/dataSource01.csv"
sourceData <- read.csv(dataFile)

Table 8.1: Source example dataset 1

| USUBJID | TRTNAME | TIME | NT | TIMEUNIT | NAME | VALUE | VALUETXT | UNIT | LLOQ | ROUTE | CENTER | VISIT | STUDY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IQ00701-0100-0001 | Placebo | -167.88 | -168.00 | Hours | Age | 34 | . | Years | . | . | 100 | 1 | FIH |
| IQ00701-0100-0001 | Placebo | -167.88 | -168.00 | Hours | Gender | 1 | male | . | . | . | 100 | 1 | FIH |
| IQ00701-0100-0001 | Placebo | -1.00 | -1.00 | Hours | Bodyweight | 89 | . | kg | . | . | 100 | 2 | FIH |
| IQ00701-0100-0001 | Placebo | -1.00 | -1.00 | Hours | Height | 185 | . | cm | . | . | 100 | 2 | FIH |
| IQ00701-0100-0001 | Placebo | 23.00 | 23.00 | Hours | Height | 200 | . | cm | . | . | 100 | 2 | FIH |
| IQ00701-0100-0001 | Placebo | 39.00 | 39.00 | Hours | Height | 250 | . | cm | . | . | 100 | 2 | FIH |
| IQ00701-0100-0001 | Placebo | -0.08 | -0.08 | Hours | Plasma concentration IQ0815 | 0 | . | ug/mL | 0.001 | . | 100 | 2 | FIH |
| IQ00701-0100-0001 | Placebo | 0.00 | 0.00 | Hours | Dose IQ0815 | 0 | . | mg | . | oral | 100 | 2 | FIH |
| IQ00701-0100-0001 | Placebo | 0.08 | 0.08 | Hours | Plasma concentration IQ0815 | 0 | . | ug/mL | 0.001 | . | 100 | 2 | FIH |
| IQ00701-0100-0001 | Placebo | 0.25 | 0.25 | Hours | Plasma concentration IQ0815 | 0 | . | ug/mL | 0.001 | . | 100 | 2 | FIH |
| IQ00701-0100-0002 | 1mg oral single dose | -167.35 | -168.00 | Hours | Age | 29 | . | Years | . | . | 100 | 1 | FIH |
| IQ00701-0100-0002 | 1mg oral single dose | -167.35 | -168.00 | Hours | Gender | 1 | male | . | . | . | 100 | 1 | FIH |

The source dataset contains all information (doses, observations, covariates) in a row-based format and uses the reserved column names that IQR Tools expects for data import (see the table below for a description of the columns of the example dataset). Only seven of the columns in the data are required, but further reserved column names such as "STUDY", "PART", or "COMPOUND" exist to cover the information typically used in pharmacometric or systems pharmacology analyses. The general dataset format is described in detail in the chapter General Dataset Format.

An important column is the NAME column, which identifies the type of event that is recorded, e.g., whether the row reports a dosing event, a measured plasma concentration value, or the age of the subject. Typical entries for these examples could be "Aspirin dose", "Aspirin concentration", or "Age at screening visit". The VALUE column exclusively contains numerical values. For categorical covariates, the VALUETXT column is used instead of or in addition to VALUE to record the category value, e.g., "yes"/"no" or "red"/"blue"/"green".

| Column | Description |
|---|---|
| USUBJID | Unique subject identifier |
| TRTNAME | Name of actual treatment given to subject |
| TIME | (NUMERIC) Actual time of event relative to first dose administration |
| NT | (NUMERIC) Nominal event time |
| TIMEUNIT | Unit of all numeric time definitions in the dataset (HOURS, MINUTES, DAYS, SECONDS, WEEKS, MONTHS, YEARS) |
| NAME | Unique short name of event |
| VALUE | (NUMERIC) Value of event defined by NAME |
| VALUETXT | Text version of value, used to code categorical information (if VALUETXT is defined, VALUE can be undefined) |
| UNIT | Unit of the value reported in the VALUE column |
| LLOQ | (NUMERIC) Lower limit of quantification for event defined by NAME (value only interpreted for observation events) |
| ROUTE | Route of administration (e.g., IV, SUBCUT, ORAL, TOPICAL) (value only interpreted for dosing events) |
| CENTER | Center number |
| VISIT | Visit number |
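
To make the row-based layout concrete, the following base R sketch builds a miniature dataset using a subset of the columns described above. The subject identifier, event names, and values are made up for illustration and are not taken from the example file; the sketch does not call any IQRtools function.

```r
# Hypothetical miniature of a row-based general dataset (all values invented)
general <- data.frame(
  USUBJID  = "SUBJ-01",
  TIME     = c(-24, -24, 0, 1),
  TIMEUNIT = "HOURS",
  NAME     = c("Age", "Gender", "Dose X", "Plasma concentration X"),
  VALUE    = c(34, 1, 10, 2.5),
  VALUETXT = c(NA, "male", NA, NA),
  UNIT     = c("Years", NA, "mg", "ug/mL"),
  stringsAsFactors = FALSE
)

# VALUE holds numbers only; the category label lives in VALUETXT
stopifnot(is.numeric(general$VALUE))

# One row per event: the NAME column tells the event types apart
eventCounts <- table(general$NAME)
print(eventCounts)
```

Doses, observations, and covariates all share the same columns, so adding a new event type only requires new rows with a new NAME entry.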

### 8.1.2 Import as IQRdataGENERAL format

To create an analysis dataset, we need to reformat the dataset and add columns needed for the model-based analysis:

• Numerical columns need to be created
  • to discriminate dosing from observation records,
  • to identify different observation and dosing types, and
  • to flag whether records are valid, missing, or out of the measurable range
• Covariates need to be provided in (numerical) columns

Based on the NAME column, we can specify which records are dosing information, observations, and covariate information. The names of the dosing and observation records only need to be listed in a vector. For the covariate records, a named vector is needed: its names are used as the column names of the covariate columns that will be created.

# Define the names (NAME column) of the records you want to consider as dose records
doseNAMES <- "Dose IQ0815"
# Define the names (NAME column) of the records you want to consider as observation records
obsNAMES <- "Plasma concentration IQ0815"
# Define the CONTINUOUS covariate columns (time INDEPENDENT)
cov0 <- c(
WT0  = "Bodyweight",          # COLNAME = "NAME of the event"
AGE0 = "Age",
HT0  = "Height"
)
# Define the CATEGORICAL covariate columns (time INDEPENDENT) that you want to generate
# from event records in the general dataset
cat0 <- c(
SEX  = "Gender"
)
# Convert general dataset as an IQRdataGENERAL object
data1 <- IQRdataGENERAL(input=dataFile,
doseNAMES=doseNAMES, obsNAMES=obsNAMES,
cov0=cov0,cat0=cat0)

The resulting data object data1 contains additional columns that are needed for model-based analysis. Some important columns that have been created are

• the column EVID indicating dosing events (EVID = 1) and observation records (EVID = 0),
• the column YTYPE discriminating different observation types,
• the column ADM discriminating different dosing types,
• the column AMT to annotate dosing amount,
• the column MDV flagging missing data values (MDV = 1),
• the column CENS flagging data values outside the measurable range (e.g., CENS = 1 for BLQ values), and
• the column IXGDF with a unique record number for each row of the dataset.

The covariate columns WT0, HT0, AGE0, and SEX have been added as well containing numerical values.
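
The following base R sketch illustrates the idea behind this conversion on a made-up dataset: rows are classified by NAME, flag columns are derived, and covariate records become columns. It is a simplified illustration under stated assumptions (one covariate record per subject, hypothetical names such as src and task), not the implementation used by IQRdataGENERAL().

```r
# Hypothetical row-based records for two subjects (all values invented)
src <- data.frame(
  USUBJID = rep(c("S1", "S2"), each = 4),
  TIME    = rep(c(-1, 0, 1, 2), times = 2),
  NAME    = rep(c("Bodyweight", "Dose X", "Conc X", "Conc X"), times = 2),
  VALUE   = c(89, 10, 2.1, 1.4, 65, 10, 1.8, NA),
  stringsAsFactors = FALSE
)
doseNAMES <- "Dose X"
obsNAMES  <- "Conc X"
cov0      <- c(WT0 = "Bodyweight")

# Keep dose and observation rows and derive the flag columns
task <- src[src$NAME %in% c(doseNAMES, obsNAMES), ]
task$EVID <- as.integer(task$NAME %in% doseNAMES)   # 1 = dose, 0 = observation
task$AMT  <- ifelse(task$EVID == 1, task$VALUE, 0)  # dose amount, 0 otherwise
task$MDV  <- as.integer(task$EVID == 1 | is.na(task$VALUE))

# Turn each covariate record into a column named after names(cov0);
# this assumes exactly one covariate record per subject
for (col in names(cov0)) {
  covRows <- src[src$NAME == cov0[[col]], c("USUBJID", "VALUE")]
  names(covRows)[2] <- col
  task <- merge(task, covRows, by = "USUBJID", all.x = TRUE)
}
task <- task[order(task$USUBJID, task$TIME), ]
print(task)
```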

Table 8.2: Example dataset 1 imported as IQRdataGENERAL (selected columns and first 12 rows)

| | IXGDF | USUBJID | TRTNAME | TIME | TIMEUNIT | VALUE | AMT | EVID | YTYPE | ADM | MDV | CENS | WT0 | HT0 | AGE0 | SEX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 7 | IQ00701-0100-0001 | Placebo | -0.08 | HOURS | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 89 | 185 | 34 | 1 |
| 6 | 8 | IQ00701-0100-0001 | Placebo | 0.00 | HOURS | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 89 | 185 | 34 | 1 |
| 7 | 9 | IQ00701-0100-0001 | Placebo | 0.08 | HOURS | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 89 | 185 | 34 | 1 |
| 8 | 10 | IQ00701-0100-0001 | Placebo | 0.25 | HOURS | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 89 | 185 | 34 | 1 |
| 15 | 15 | IQ00701-0100-0002 | 1mg oral single dose | 0.00 | HOURS | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 65 | 186 | 29 | 1 |
| 20 | 20 | IQ00701-0100-0003 | 1mg oral single dose | -0.08 | HOURS | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 80 | 180 | 28 | 1 |
| 21 | 21 | IQ00701-0100-0003 | 1mg oral single dose | 0.00 | HOURS | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 80 | 180 | 28 | 1 |
| 22 | 22 | IQ00701-0100-0003 | 1mg oral single dose | 0.08 | HOURS | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 80 | 180 | 28 | 1 |
| 23 | 23 | IQ00701-0100-0003 | 1mg oral single dose | 0.25 | HOURS | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 80 | 180 | 28 | 1 |
| 24 | 24 | IQ00701-0100-0003 | 1mg oral single dose | 0.50 | HOURS | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 80 | 180 | 28 | 1 |
| 25 | 25 | IQ00701-0100-0003 | 1mg oral single dose | 1.00 | HOURS | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 80 | 180 | 28 | 1 |
| 26 | 26 | IQ00701-0100-0003 | 1mg oral single dose | 2.00 | HOURS | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 80 | 180 | 28 | 1 |

The variable data1 has the class IQRdataGENERAL, which stores metadata in its attributes. Among others, the attributes hold the units, category labels, and covariate names, which can be retrieved with the convenience functions covInfo() and catInfo(). Importantly, if the character columns STUDY and TRTNAME are present in the source data, they are used to create the numerical covariate columns STUDYN and TRT.

# Display information on continuous covariates
covInfo(data1)

| COLNAME | NAME | UNIT | TIME.VARYING |
|---|---|---|---|
| WT0 | Bodyweight | kg | FALSE |
| AGE0 | Age | Years | FALSE |
| HT0 | Height | cm | FALSE |

# Display information on categorical covariates
catInfo(data1)

| COLNAME | NAME | UNIT | VALUETXT | VALUES | TIME.VARYING |
|---|---|---|---|---|---|
| SEX | Gender | NA | male | 1 | FALSE |
| STUDYN | Study | | FIH,FIH extension | 1,2 | FALSE |
| TRT | TRTNAME | | Placebo,1mg oral single dose,2mg oral single dose,5mg oral single dose,10mg oral single dose,20mg oral single dose,50mg oral single dose,100mg oral single dose,200mg oral single dose | 9,3,6,8,2,5,7,1,4 | FALSE |

The summary function displays useful information on the data. In addition, it performs data integrity checks.

summary(data1)
##    INFO                                          | NAME                        | VALUE
##    ---------------------------------------------------------------------------------------------------------------------------------
##    Dose events                                   | Dose IQ0815                 | Ntotal: 48,  Nindiv (min/median/max): 1/1/1)
##    Observation events (all)                      | Plasma concentration IQ0815 | Ntotal: 543,  Nindiv (min/median/max): 3/12/12)
##    Observation events (MDV=0)                    | Plasma concentration IQ0815 | Ntotal: 410,  Nindiv (min/median/max): 5/10/11)
##    Doses AMT=0 present                           | ALL dose events             | TRUE (N=1)
##    Placebo subjects present (AMT=0 or no doses)  | ALL dose events             | TRUE (N=1)
##    IGNORED (MDV=1) observation records present   | ALL observation events      | TRUE (N=133)
##    Subjects without observations (MDV=0) present | ALL observation events      | TRUE (N=3)
##    Total BLLOQ information                       | Plasma concentration IQ0815 | N=133 / 24.5%
##    Max % BLLOQ values in a subject               | Plasma concentration IQ0815 | 8.33%
##    BLLOQ handling method                         | All observation events      | M1
##    NLME columns containing NA                    | All events                  | WT0, AGE0, HT0, SEX
##    Issues present in the data                    | Minor                       | YES (see text below the table for more information)
##    Issues present in the data                    | Warnings                    | NONE
##    Issues present in the data                    | Errors                      | NONE
##
##
## IQRoutputTable object
##
## MINOR issues in the dataset that might be addressed
## ===================================================
## SUBJECT LEVEL (IQ00701-0100-0005): Subject has records of NAME "Plasma concentration IQ0815" at same TIME points

The summary and checks help to detect issues in the dataset that should be fixed before an analysis starts. In this example, there is only one placebo subject, but there are 3 individuals without any valid (MDV=0) observation records. Thus, we may want to remove these individuals as well as a duplicated plasma record for subject "IQ00701-0100-0005". Also, the covariate columns contain NA values, so we probably need to impute values for some individuals.
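
Two of the checks reported above can be reproduced by hand. The base R sketch below works on a made-up observation table (hypothetical object names obs, dupSubjects, noObsSubjects) and shows one way to find subjects with several records of the same NAME at the same TIME, and subjects without any valid observation. It only mimics the idea of the checks and is not the code behind summary().

```r
# Hypothetical observation records (all values invented)
obs <- data.frame(
  USUBJID = c("S1", "S1", "S1", "S2", "S2", "S3"),
  NAME    = "Conc X",
  TIME    = c(1, 2, 2, 1, 2, 1),
  MDV     = c(0, 0, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Records sharing subject, event name, and time point
key  <- obs[, c("USUBJID", "NAME", "TIME")]
dupl <- duplicated(key) | duplicated(key, fromLast = TRUE)
dupSubjects <- unique(obs$USUBJID[dupl])

# Subjects with no valid (MDV = 0) observation
nValid <- tapply(obs$MDV == 0, obs$USUBJID, sum)
noObsSubjects <- names(nValid)[nValid == 0]

print(dupSubjects)
print(noObsSubjects)
```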

### 8.1.3 Source data exploration

A recommended step before any data analysis or modeling activity is to explore the data visually. IQR Tools provides a range of standard plotting functions. Here, we make use of two of them: plotSpaghetti_IQRdataGENERAL() produces line plots per treatment group, which is very useful for getting an overview of the observation time courses, and plotIndiv_IQRdataGENERAL() produces much more detailed plots for every individual in a pdf file.

# Line plots stratified by treatment group
out <- plotSpaghetti_IQRdataGENERAL(data1)
# Backticks are required because the element name contains spaces
out$unstratified$`Plasma concentration IQ0815`

# Generation of a pdf file containing detailed plots for each individual
plotIndiv_IQRdataGENERAL(data1, filename = "material/01-01-DataProgAnal/IndivPlots01.pdf")

From the overview plot we can see that one subject who received 5mg of IQ0815 has an implausible PK profile. We can also spot some outlying records. The detailed individual plots in the pdf file help to identify exactly which subject has implausible observations and which records are outliers: the plots display the unique subject identifiers, all observation and dosing records, and the treatment group name. The observation records are labelled with the unique record identifier IXGDF.

### 8.1.4 Cleaning to create an analysis dataset

Based on the data exploration, we want to clean the dataset before finally producing the modeling dataset.

#### Specification of records to be removed

Named lists are created that contain either unique subject identifiers (USUBJID) for removing all data from specific subjects or unique record identifiers (IXGDF) for removing particular records. The list names are used to annotate the reason for removal.

removeSubjects <- list("No PK data" = c("IQ00701-0100-0002","IQ00701-0100-0008"),
"Nonsense profile" = c("IQ00701-0100-0018"))
removeRecords  <- list("Implausible value" = c(59, 98, 301))
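
To show why named lists are a convenient structure here, the following base R sketch expands such lists into a small log table with one row per removed item and its reason, and then filters a made-up dataset. The helper toLog() and the objects dat and cleaned are hypothetical; for simplicity the sketch drops the listed records outright, which is not necessarily how clean_IQRdataGENERAL() treats them.

```r
# Hypothetical dataset with record and subject identifiers (invented)
dat <- data.frame(
  IXGDF   = 1:6,
  USUBJID = rep(c("S1", "S2", "S3"), each = 2),
  stringsAsFactors = FALSE
)
removeSubjects <- list("No PK data" = "S2")
removeRecords  <- list("Implausible value" = c(1, 6))

# Expand a named list into a table: one row per identifier plus the reason
toLog <- function(x, idName) {
  out <- data.frame(
    id     = unlist(x, use.names = FALSE),
    REASON = rep(names(x), lengths(x)),
    stringsAsFactors = FALSE
  )
  names(out)[1] <- idName
  out
}
subjLog <- toLog(removeSubjects, "USUBJID")
recLog  <- toLog(removeRecords, "IXGDF")

# Apply both removals; the log tables document what was dropped and why
cleaned <- dat[!dat$USUBJID %in% subjLog$USUBJID &
               !dat$IXGDF %in% recLog$IXGDF, ]
print(recLog)
print(cleaned)
```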

#### Specification of covariate imputation

To impute covariates, we use named vectors for continuous and categorical covariates, respectively. The values given will be imputed for all missing covariate values. For continuous covariates, it is also possible to provide the name of a suitable function such as "mean" or "median" to calculate the imputation value from the available values.

imputeContinuous = c("HT0" = "median", "WT0" = 70)
imputeCategory   = c("SEX"=1)
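
The following base R sketch mimics how such a specification could be interpreted: an entry that parses as a number is used directly, anything else is treated as the name of a summary function. It operates on a made-up table with one row per subject (hypothetical object covs) and is an illustration only, not the imputation code of IQRtools.

```r
# Hypothetical covariate table, one row per subject (values invented)
covs <- data.frame(
  USUBJID = c("S1", "S2", "S3", "S4"),
  HT0     = c(180, NA, 170, 175),
  WT0     = c(80, 72, NA, NA),
  stringsAsFactors = FALSE
)
# Mixing text and numbers in c() yields a character vector
imputeContinuous <- c(HT0 = "median", WT0 = 70)

for (col in names(imputeContinuous)) {
  rule  <- imputeContinuous[[col]]
  isNum <- !is.na(suppressWarnings(as.numeric(rule)))
  fill  <- if (isNum) {
    as.numeric(rule)                               # fixed imputation value
  } else {
    match.fun(rule)(covs[[col]], na.rm = TRUE)     # e.g. median of known values
  }
  covs[[col]][is.na(covs[[col]])] <- fill
}
print(covs)
```

In this made-up example the missing height is replaced by the median of the three known heights (175) and both missing weights by the fixed value 70.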

#### Perform documented dataset cleaning

To generate a cleaned analysis dataset, the function clean_IQRdataGENERAL() is used. It accepts the information on data removal and covariate imputation and offers further functionality and options. In this example, the M3 method is chosen to handle BLLOQ data (methodBLLOQ = "M3"), ignored records are kept in the dataset (FLAGrmIGNOREDrecords = FALSE), and placebo subjects are removed (FLAGrmPlacebo = TRUE). Importantly, information on the cleaning process is written to the folder specified by the pathname input argument.

data1CleanM3 <- clean_IQRdataGENERAL(data1,
pathname = "material/01-01-DataProgAnal/DataCleaning01",
methodBLLOQ = "M3",
subjects = removeSubjects,
records = removeRecords,
FLAGrmIGNOREDrecords = FALSE,
FLAGrmPlacebo = TRUE,
continuousCovs = imputeContinuous, categoricalCovs = imputeCategory)
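
The difference between ignoring and censoring BLLOQ records can be sketched in base R. The example below uses made-up values and a common coding convention (M1: records below the LLOQ are flagged with MDV = 1; M3: they stay in the analysis with CENS = 1 and DV set to the LLOQ). Treat this convention as an assumption; the exact values written by IQRtools should be checked in the package help.

```r
# Hypothetical observation values with their LLOQ (invented)
obs <- data.frame(DV = c(0.0005, 0.02, 0.15), LLOQ = 0.001)
blq <- obs$DV < obs$LLOQ

# M1: below-LLOQ records are ignored in the analysis
m1 <- transform(obs, MDV = as.integer(blq), CENS = 0L)

# M3: below-LLOQ records are kept and flagged as censored at the LLOQ
m3 <- transform(obs,
                DV   = ifelse(blq, LLOQ, DV),
                MDV  = 0L,
                CENS = as.integer(blq))
print(m1)
print(m3)
```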

### 8.1.5 Export

To make the dataset available for parameter estimation with NONMEM or MONOLIX, it is exported with the function exportNLME_IQRdataGENERAL(). In this step, some adjustments are made to the data, e.g., spaces in character strings are removed, such that the dataset is accepted by these software tools.

# Export the NLME data set with BLQ method M3
exportNLME_IQRdataGENERAL(data1CleanM3,
filename = "material/01-01-DataProgAnal/dataNLME01/data.csv",
FLAGxpt = TRUE,
FLAGdefine = TRUE)

This function also has options that control which files are exported. A csv file is generated in any case. In addition, the option FLAGxpt = TRUE writes an xpt file, and the option FLAGdefine = TRUE produces a define file with the dataset specifications for the analysis dataset.

## 8.2 Workflow customization

### 8.2.1 Dataset handling

To show more examples of importing a dataset as an IQRdataGENERAL object, we load another dataset from file.

dataSource2 <- IQRloadCSVdata("material/01-01-DataProgAnal/dataSource02.csv")

#### Multiple dose or observation types

This dataset contains dosing records of two different drugs and two observation types. Note that the IQRdataGENERAL() function accepts either the path to the source data (see the example workflow above) or the loaded data frame as input.

# Dose records
doseNAMES <- c("Dose Z","Dose X")
# Observation records
obsNAMES <- c("Plasma concentration Z","Efficacy marker")
# Import as IQRdataGENERAL
data2 <- IQRdataGENERAL(dataSource2, doseNAMES = doseNAMES, obsNAMES = obsNAMES)

Table 8.3: Example dataset 2 (selected columns and first 12 rows)

| | USUBJID | TRTNAME | NAME | TIME | VALUE | AMT | EVID | YTYPE | ADM |
|---|---|---|---|---|---|---|---|---|---|
| 1 | ZY1000101066 | SD IV 15 mg/kg | Efficacy marker | -7.1118056 | 21.152 | 0 | 0 | 2 | 0 |
| 3 | ZY1000101066 | SD IV 15 mg/kg | Dose X | 0.0000000 | 918.000 | 918 | 1 | 0 | 2 |
| 4 | ZY1000101066 | SD IV 15 mg/kg | Plasma concentration Z | 0.0826389 | 998.001 | 0 | 0 | 1 | 0 |
| 6 | ZY1000101066 | SD IV 15 mg/kg | Plasma concentration Z | 16.2076389 | 159.399 | 0 | 0 | 1 | 0 |
| 7 | ZY1000101066 | SD IV 15 mg/kg | Plasma concentration Z | 29.2006944 | 119.400 | 0 | 0 | 1 | 0 |
| 8 | ZY1000101066 | SD IV 15 mg/kg | Plasma concentration Z | 42.1701389 | 42.201 | 0 | 0 | 1 | 0 |
| 9 | ZY1000101066 | SD IV 15 mg/kg | Efficacy marker | 112.2256944 | 22.223 | 0 | 0 | 2 | 0 |
| 10 | ZY1000101067 | SD IV 15 mg/kg | Efficacy marker | -13.8138889 | 23.043 | 0 | 0 | 2 | 0 |
| 12 | ZY1000101067 | SD IV 15 mg/kg | Dose X | 0.0000000 | 1311.000 | 1311 | 1 | 0 | 2 |
| 13 | ZY1000101067 | SD IV 15 mg/kg | Plasma concentration Z | 0.0972222 | 1362.000 | 0 | 0 | 1 | 0 |
| 15 | ZY1000101067 | SD IV 15 mg/kg | Plasma concentration Z | 15.0736111 | 176.199 | 0 | 0 | 1 | 0 |
| 16 | ZY1000101067 | SD IV 15 mg/kg | Plasma concentration Z | 29.2506944 | 100.599 | 0 | 0 | 1 | 0 |

#### Time-varying covariates

Time-varying covariates are defined analogously to time-independent covariates. Here, the PD observations are used both as a baseline covariate and as a time-dependent covariate. When used as a time-dependent covariate, all records are used in the covariate column according to the time of the observation records they are mapped onto. When used as a time-independent covariate, the baseline value per individual (defined by the columns BASE, SCREEN, or pre-first dose records; for details see ?IQRdataGENERAL) is mapped to the observations.

# Define the CONTINUOUS covariate columns (time INDEPENDENT)
cov0 <- list(
PDbase  = "Efficacy marker"
)
# Define the CONTINUOUS covariate columns (time DEPENDENT)
covT <- list(
PDcont  = "Efficacy marker"
)
# Define the CATEGORICAL covariate columns (time INDEPENDENT)
cat0 <- list(
SEX  = "Gender"
)
# Define the CATEGORICAL covariate columns (time DEPENDENT)
catT <- list(
HSTAT   = "Health status"
)
# Import to IQRdataGENERAL object
data2 <- IQRdataGENERAL(dataSource2, doseNAMES = doseNAMES, obsNAMES = "Plasma concentration Z",
cov0 = cov0, cat0 = cat0,
covT = covT, catT = catT)

Table 8.4: Example dataset 2 with various covariate columns (selected columns and first 12 rows)

| | USUBJID | TRTNAME | NAME | TIME | VALUE | AMT | EVID | YTYPE | ADM | PDbase | PDcont | SEX | HSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | ZY1000101066 | SD IV 15 mg/kg | Dose X | 0.0000000 | 918.000 | 918 | 1 | 0 | 2 | 21.152 | 21.152 | NA | 3 |
| 4 | ZY1000101066 | SD IV 15 mg/kg | Plasma concentration Z | 0.0826389 | 998.001 | 0 | 0 | 1 | 0 | 21.152 | 21.152 | NA | 3 |
| 6 | ZY1000101066 | SD IV 15 mg/kg | Plasma concentration Z | 16.2076389 | 159.399 | 0 | 0 | 1 | 0 | 21.152 | 21.152 | NA | 3 |
| 7 | ZY1000101066 | SD IV 15 mg/kg | Plasma concentration Z | 29.2006944 | 119.400 | 0 | 0 | 1 | 0 | 21.152 | 21.152 | NA | 3 |
| 8 | ZY1000101066 | SD IV 15 mg/kg | Plasma concentration Z | 42.1701389 | 42.201 | 0 | 0 | 1 | 0 | 21.152 | 21.152 | NA | 3 |
| 12 | ZY1000101067 | SD IV 15 mg/kg | Dose X | 0.0000000 | 1311.000 | 1311 | 1 | 0 | 2 | 23.043 | 23.043 | NA | 2 |
| 13 | ZY1000101067 | SD IV 15 mg/kg | Plasma concentration Z | 0.0972222 | 1362.000 | 0 | 0 | 1 | 0 | 23.043 | 23.043 | NA | 2 |
| 15 | ZY1000101067 | SD IV 15 mg/kg | Plasma concentration Z | 15.0736111 | 176.199 | 0 | 0 | 1 | 0 | 23.043 | 23.043 | NA | 3 |
| 16 | ZY1000101067 | SD IV 15 mg/kg | Plasma concentration Z | 29.2506944 | 100.599 | 0 | 0 | 1 | 0 | 23.043 | 23.043 | NA | 3 |
| 17 | ZY1000101067 | SD IV 15 mg/kg | Plasma concentration Z | 42.1562500 | 57.201 | 0 | 0 | 1 | 0 | 23.043 | 23.043 | NA | 3 |
| 21 | ZY1000101068 | SD IV 15 mg/kg | Dose X | 0.0000000 | 0.000 | 0 | 1 | 0 | 2 | 23.364 | 23.364 | NA | 4 |
| 26 | ZY1000101069 | SD IV 15 mg/kg | Plasma concentration Z | -0.0013889 | 0.933 | 0 | 0 | 1 | 0 | 16.993 | 16.993 | NA | 1 |
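
The difference between the two mappings can be sketched in base R for a single subject. The example assumes that the baseline is the last value before the first dose and that a time-dependent covariate carries the most recent value forward to each observation time; both rules are assumptions made for this illustration, and the rules actually applied are documented in ?IQRdataGENERAL. All values and object names (pd, obsTime, firstDose) are made up.

```r
# Hypothetical covariate records of one subject, sorted by time (invented)
pd <- data.frame(TIME = c(-7, 20, 112), VALUE = c(21.2, 19.8, 22.2))
obsTime   <- c(0.1, 16, 29, 42)   # times of the records to annotate
firstDose <- 0

# Time-independent: last value recorded before the first dose
PDbase <- tail(pd$VALUE[pd$TIME < firstDose], 1)

# Time-dependent: most recent value at or before each record time
idx    <- findInterval(obsTime, pd$TIME)
PDcont <- ifelse(idx == 0, NA, pd$VALUE[pmax(idx, 1)])

print(data.frame(TIME = obsTime, PDbase = PDbase, PDcont = PDcont))
```

With these made-up values, PDbase stays at 21.2 for all records, while PDcont switches to 19.8 for the records after time 20.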

#### Existing covariate columns

Datasets may also contain (numerical) covariate columns instead of row-records for covariates. In this case the user needs to provide the metadata (verbose name, units, mapping of category names and values, ...) such that it is added to the covariate information in the attributes.

# Additional CONTINUOUS covariates
covInfoAdd <- data.frame(
  COLNAME      = c("AST0",                   "BMI0"),
  NAME         = c("Aspartate transaminase", "Body mass index"),
  UNIT         = c("UI/mL",                  "kg/m2"),
  TIME.VARYING = c(FALSE,                    FALSE)
)
# Additional CATEGORICAL covariates
catInfoAdd <- data.frame(
  COLNAME      = c("FOOD"),
  NAME         = c("Food taken"),
  UNIT         = c("y/n"),
  VALUETXT     = c("No,Yes"),
  VALUES       = c("0,1"),
  TIME.VARYING = c(FALSE)
)
# Import to IQRdataGENERAL object
data2 <- IQRdataGENERAL(dataSource2, doseNAMES = doseNAMES, obsNAMES = obsNAMES,
covInfoAdd = covInfoAdd, catInfoAdd = catInfoAdd)
# Extended continuous covariate information
covInfo(data2)

| COLNAME | NAME | UNIT | TIME.VARYING |
|---|---|---|---|
| AST0 | Aspartate transaminase | UI/mL | FALSE |
| BMI0 | Body mass index | kg/m2 | FALSE |

# Extended categorical covariate information
catInfo(data2)

| COLNAME | NAME | UNIT | VALUETXT | VALUES | TIME.VARYING |
|---|---|---|---|---|---|
| FOOD | Food taken | y/n | No,Yes | 0,1 | FALSE |
| STUDYN | Study | | Y10,Y1,Y3,Y8 | 2,1,3,4 | FALSE |
| TRT | TRTNAME | | SD IV 15 mg/kg,SD IV Placebo,SD IV 1.5 mg/kg,MD IV Placebo,MD IV 5mg/kg,SD or MD IV Placebo,MD IV 15mg/kg | 5,6,4,3,2,7,1 | FALSE |
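
The VALUETXT and VALUES entries are comma-separated strings whose elements pair up by position. The base R sketch below decodes such a pair into a lookup table and applies it to a made-up covariate column. The object names (catMeta, lookup, food) are hypothetical, and the sketch only illustrates the metadata layout, not a function of IQRtools.

```r
# Metadata for one categorical covariate, mirroring the layout above
catMeta <- data.frame(
  COLNAME  = "FOOD",
  VALUETXT = "No,Yes",
  VALUES   = "0,1",
  stringsAsFactors = FALSE
)

# Split the comma-separated strings; elements pair up by position
labels <- strsplit(catMeta$VALUETXT, ",")[[1]]
codes  <- as.numeric(strsplit(catMeta$VALUES, ",")[[1]])
stopifnot(length(labels) == length(codes))
lookup <- setNames(labels, codes)

# Decode a made-up numeric covariate column into its category labels
food <- c(0, 1, 1, 0)
foodTxt <- unname(lookup[as.character(food)])
print(foodTxt)
```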

#### Existing NLME columns

The input argument FLAGforceOverwriteNLMEcols defines whether "NLME" columns that might already be in the dataset are overwritten (TRUE, the default) or not (FALSE). These NLME columns are the following numeric, NLME-tool-specific columns: ID, TIMEPOS, TAD, DV, MDV, EVID, CENS, AMT, ADM, TINF, RATE, YTYPE, and DOSE. Overwriting has the advantage that these columns are well-defined and aligned with the data specification of IQRtools. Not overwriting them can be useful if the user wants to ensure certain things manually, but in this case the user should know exactly what they are doing.

If this flag is set to TRUE, all existing NLME columns are overwritten. Columns that are not present are generated based on the default specification in both cases.

# Overwriting the NLME columns
data2 <- IQRdataGENERAL(dataSource2, doseNAMES, obsNAMES,FLAGforceOverwriteNLMEcols=TRUE)

Table 8.5: Example dataset 2 with overwritten existing NLME columns (selected columns and first 12 rows)

| | USUBJID | TIME | EVID | YTYPE | VALUE | ADM | AMT |
|---|---|---|---|---|---|---|---|
| 1 | ZY1000101066 | -7.11 | 0 | 2 | 21.15 | 0 | 0 |
| 3 | ZY1000101066 | 0.00 | 1 | 0 | 918.00 | 2 | 918 |
| 4 | ZY1000101066 | 0.08 | 0 | 1 | 998.00 | 0 | 0 |
| 6 | ZY1000101066 | 16.21 | 0 | 1 | 159.40 | 0 | 0 |
| 7 | ZY1000101066 | 29.20 | 0 | 1 | 119.40 | 0 | 0 |
| 8 | ZY1000101066 | 42.17 | 0 | 1 | 42.20 | 0 | 0 |

# Keeping existing NLME columns
data2 <- IQRdataGENERAL(dataSource2, doseNAMES, obsNAMES,FLAGforceOverwriteNLMEcols=FALSE)

Table 8.6: Example dataset 2 with kept existing NLME columns (selected columns and first 12 rows)

| | USUBJID | TIME | EVID | YTYPE | VALUE | ADM | AMT |
|---|---|---|---|---|---|---|---|
| 1 | ZY1000101066 | -7.11 | 0 | 2 | 21.15 | 0 | 0 |
| 3 | ZY1000101066 | 0.00 | 1 | 0 | 918.00 | 2 | 918 |
| 4 | ZY1000101066 | 0.08 | 0 | 1 | 998.00 | 0 | 0 |
| 6 | ZY1000101066 | 16.21 | 0 | 1 | 159.40 | 0 | 0 |
| 7 | ZY1000101066 | 29.20 | 0 | 1 | 119.40 | 0 | 0 |
| 8 | ZY1000101066 | 42.17 | 0 | 1 | 42.20 | 0 | 0 |

### 8.2.2 Import/export options

#### Export IQRdataGENERAL objects

An IQRdataGENERAL object can be exported with three different functions: export_IQRdataGENERAL(), exportNLME_IQRdataGENERAL(), or exportSYS_IQRdataGENERAL(). The first exports the dataset without further modifications; the other two apply estimation-tool-specific modifications (e.g., removal of whitespace in strings) so that the result can be used as a modeling dataset later on. All export functions generate a .csv file and a .atr file storing the metadata and can additionally generate define files and .xpt files.

With all export functions a zipped file instead of the single files (.csv, .atr, ...) can be generated by setting FLAGzip = TRUE.

export_IQRdataGENERAL(data=data2, filename="material/01-01-DataProgAnal/dataGEN", FLAGzip = TRUE)

The exportNLME_IQRdataGENERAL() and exportSYS_IQRdataGENERAL() provide the possibility to subset the data to specific dosing records and observation records (inputs doseNAMES and obsNAMES) and define columns as regressor variables while exporting the data. Regressor columns are ordered in the exported dataset as given in the input regressorNames which is crucial for matching regressor variables between data and model for some estimation tools.

exportNLME_IQRdataGENERAL(data=data2, filename="material/01-01-DataProgAnal/dataNLME", regressorNames = c("AST0", "FOOD"))

The load_IQRdataGENERAL() function is used to reload a dataset that was generated by the export_IQRdataGENERAL(), exportNLME_IQRdataGENERAL(), or exportSYS_IQRdataGENERAL() function. For the loading to work, the dataex.atr file needs to be present in the same folder as the .csv file. .xpt files are not loaded. Zip files can also be reloaded.

data1reload <- load_IQRdataGENERAL("material/01-01-DataProgAnal/dataNLME01/data.csv")
data2reload <- load_IQRdataGENERAL("material/01-01-DataProgAnal/dataGeneral02.dat.zip")

### 8.2.3 Cleaning options

The cleaning function clean_IQRdataGENERAL() is a wrapper for various functions that perform the individual cleaning steps. These functions can also be called individually. Please refer to the help of each function for detailed information.

| Function | Description | Control in clean_IQRdataGENERAL | Logfile written |
|---|---|---|---|
| blloq_IQRdataGENERAL | Set the BLLOQ handling method | set by methodBLLOQ | No |
| setIGNORErecords_IQRdataGENERAL | Set user defined records to IGNORE | optional by setting records | Yes |
| rmMissingTIMEobsRecords_IQRdataGENERAL | Remove observation records with missing TIME | always applied | Yes |
| setMissingDVobsRecordsIGNORE_IQRdataGENERAL | Set observation records with missing DV to IGNORE | always applied | Yes |
| rmSubjects_IQRdataGENERAL | Remove user defined subjects | optional by setting subjects | Yes |
| rmNonTask_IQRdataGENERAL | Remove non-dose and non-observation records | always applied | Yes |
| rmPLACEBO_IQRdataGENERAL | Remove placebo subjects | optional by setting FLAGrmPlacebo | Yes |
| rmNOobsSUB_IQRdataGENERAL | Remove subjects without observations | always applied | Yes |
| rmAMT0_IQRdataGENERAL | Remove dose records with AMT=0 | always applied | Yes |
| rmIGNOREd_IQRdataGENERAL | Remove ignored records (MDV=1) | optional by setting FLAGrmIGNOREDrecords | Yes |
| covImpute_IQRdataGENERAL | Impute missing covariates | optional by setting continuousCovs and categoricalCovs | Yes |
| rmDosePostLastObs_IQRdataGENERAL | Remove doses after the last observation | optional by setting FLAGrmDosePostLastObs | Yes |

### 8.2.4 Data exploration

In the Example workflow discussed above, some of the IQRtools functions for exploring a dataset were used to visualize the data, detect issues, and get an overview of the contained PK observations. In the following, all available data exploration functions are introduced using the cleaned dataset from the workflow as an example.

#### Summary tables

Summary tables can be generated for observations (summaryObservations_IQRdataGENERAL()) and for categorical or continuous covariates (summaryCat_IQRdataGENERAL() and summaryCov_IQRdataGENERAL()). The tables can be stratified by a suitable dataset column (stratifyColumns, with "STUDY" as default). Besides the actual content, a table title and a table footer are defined, which can be modified by the user (tableTitle and footerAddText). The tables can be customized further by setting the number of significant digits to which values are rounded (SIGNIF) and by choosing whether individuals are termed "subjects" or "patients" (FLAGpatients). If a filename is provided, the table is written to that file; otherwise an IQRoutputTable object is returned.

With this flexibility the summary tables are intended to be readily suitable to use in modeling reports in the data exploration section. The exported tables are prepared to be automatically imported to a report using IQReport (see Reporting in Microsoft Word).

##### Observations

The observation summary table lists the numbers of subjects and observations, broken down by different criteria (e.g., the number of observations below the limit of quantification). The input obsNames is available if only a subset of the contained observations should be summarized. In this example, there is only one observation type, i.e., the "Plasma concentration IQ0815". The first two rows give the numbers per study, while the last row contains the total counts for the entire dataset.

summaryObservations_IQRdataGENERAL(data1CleanM3)
##    Summary of available observations
##    ==================================================================================================================================================================================================================
##
##    Data          | N subjects* | N samples | N BLOQ samples** | N BLOQ samples post first dose* | N missing observations | N missing time information | N total ignored observations | N samples included in analysis
##    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##    FIH           | 32 / 32     | 384       | 117 (30.5%)      | 85 (22.1%)                      | 0 (0%)                 | 0 (0%)                     | 3 (0.781%)                   | 381 (99.2%)
##    FIH extension | 12 / 12     | 144       | 12 (8.33%)       | 0 (0%)                          | 0 (0%)                 | 0 (0%)                     | 0 (0%)                       | 144 (100%)
##    TOTAL         | 44 / 44     | 528       | 129 (24.4%)      | 85 (16.1%)                      | 0 (0%)                 | 0 (0%)                     | 3 (0.568%)                   | 525 (99.4%)
##    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## N: Number of
## * All subjects / subjects with at least one non missing (MDV==0) sample.
## ** These records are not excluded from the analysis but censored (M3 method).
##
##
## IQRoutputTable object
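
The counting logic behind such a table can be sketched in base R. The example below uses a made-up observation table (hypothetical object names obs and summ) and derives the number of subjects, samples, and censored samples per study, formatted in the same "N (percent)" style. It is an illustration of the idea, not the implementation of summaryObservations_IQRdataGENERAL().

```r
# Hypothetical observation records (all values invented)
obs <- data.frame(
  STUDY   = c("A", "A", "A", "B", "B"),
  USUBJID = c("S1", "S1", "S2", "S3", "S3"),
  CENS    = c(1, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Counts per study; tapply() and table() both order by study name
nSubj <- tapply(obs$USUBJID, obs$STUDY, function(x) length(unique(x)))
nSamp <- as.vector(table(obs$STUDY))
nBLOQ <- as.vector(tapply(obs$CENS == 1, obs$STUDY, sum))

summ <- data.frame(
  Data       = names(nSubj),
  N.subjects = as.vector(nSubj),
  N.samples  = nSamp,
  N.BLOQ     = sprintf("%d (%.3g%%)", nBLOQ, 100 * nBLOQ / nSamp),
  stringsAsFactors = FALSE
)
print(summ)
```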

This summary table contains information about samples before and after the first dose with regard to BLQ and zero/non-zero values, which is mainly of interest when summarizing PK samples. These columns can be omitted by setting FLAGpk = FALSE:

summaryObservations_IQRdataGENERAL(data1CleanM3, FLAGpk = FALSE)
##    Summary of available observations
##    ================================================================================================================================================================================
##
##    Data          | N subjects* | N samples | N BLOQ samples** | N missing observations | N missing time information | N total ignored observations | N samples included in analysis
##    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##    FIH           | 32 / 32     | 384       | 117 (30.5%)      | 0 (0%)                 | 0 (0%)                     | 3 (0.781%)                   | 381 (99.2%)
##    FIH extension | 12 / 12     | 144       | 12 (8.33%)       | 0 (0%)                 | 0 (0%)                     | 0 (0%)                       | 144 (100%)
##    TOTAL         | 44 / 44     | 528       | 129 (24.4%)      | 0 (0%)                 | 0 (0%)                     | 3 (0.568%)                   | 525 (99.4%)
##    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## N: Number of
## * All subjects / subjects with at least one non missing (MDV==0) sample.
## ** These records are not excluded from the analysis but censored (M3 method).
##
##
## IQRoutputTable object

##### Covariates

Continuous covariates are summarized by the mean, standard deviation, and range from minimum to maximum value, while for categorical covariates the number of individuals in the respective category and the percentage of the total are given. Summaries are stratified by the columns given in stratifyColumns (e.g., stratifyColumns = "TRTNAME"; default: "STUDY"), and a total column can be requested (FLAGtotal = TRUE).

summaryCov_IQRdataGENERAL(data1CleanM3, stratifyColumns = "TRTNAME")
##    Summary of demographic and baseline characteristics for continuous information
##    ==============================================================================================================================================================================================================================================================
##
##    Characteristic  | 1mg oral single dose [N=4] | 2mg oral single dose [N=5] | 5mg oral single dose [N=5] | 10mg oral single dose [N=6] | 20mg oral single dose [N=6] | 50mg oral single dose [N=6] | 100mg oral single dose [N=6] | 200mg oral single dose [N=6]
##    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##    Bodyweight (kg) | 76 (8.49) [68-86]          | 80.2 (8.04) [68-88]        | 73.2 (10.1) [65-90]        | 81 (5.1) [75-88]            | 77 (6.2) [68-87]            | 81.3 (7.79) [74-94]         | 85.5 (5.58) [78-92]          | 77.2 (7.91) [67-91]
##    Age (Years)     | 29.2 (4.19) [25-35]        | 29.4 (1.67) [28-32]        | 29.6 (5.32) [24-36]        | 26.3 (5.32) [18-34]         | 28.8 (3.54) [23-32]         | 30.3 (4.8) [23-35]          | 28.8 (5.78) [21-34]          | 30.3 (5.24) [22-38]
##    Height (cm)     | 178 (7.77) [166-183]       | 177 (2.74) [174-181]       | 175 (2.65) [171-178]       | 179 (3.5) [175-183]         | 180 (6.31) [168-186]        | 179 (3.22) [174-182]        | 181 (4.1) [178-188]          | 176 (3.78) [172-183]
##    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## N: Number of subjects
## Entries represent: Mean (Standard deviation) [Minimum-Maximum]
##
## IQRoutputTable object
summaryCat_IQRdataGENERAL(data1CleanM3, FLAGtotal = TRUE, catNames = c("STUDYN", "TRT"))
##    Summary of demographic and baseline characteristics for categorical information
##    ==========================================================================================
##
##    Characteristic | Category               | FIH [N=32] | FIH extension [N=12] | TOTAL [N=44]
##    ------------------------------------------------------------------------------------------
##    Study          | FIH                    | 32 (100%)  | 0 (0%)               | 32 (72.7%)
##                   | FIH extension          | 0 (0%)     | 12 (100%)            | 12 (27.3%)
##    TRTNAME        | Placebo                | 0 (0%)     | 0 (0%)               | 0 (0%)
##                   | 1mg oral single dose   | 4 (12.5%)  | 0 (0%)               | 4 (9.09%)
##                   | 2mg oral single dose   | 5 (15.6%)  | 0 (0%)               | 5 (11.4%)
##                   | 5mg oral single dose   | 5 (15.6%)  | 0 (0%)               | 5 (11.4%)
##                   | 10mg oral single dose  | 6 (18.8%)  | 0 (0%)               | 6 (13.6%)
##                   | 20mg oral single dose  | 6 (18.8%)  | 0 (0%)               | 6 (13.6%)
##                   | 50mg oral single dose  | 6 (18.8%)  | 0 (0%)               | 6 (13.6%)
##                   | 100mg oral single dose | 0 (0%)     | 6 (50%)              | 6 (13.6%)
##                   | 200mg oral single dose | 0 (0%)     | 6 (50%)              | 6 (13.6%)
##    ------------------------------------------------------------------------------------------
## N: Number of subjects
## Number of subjects in each category and percentage within this category
##
## IQRoutputTable object
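
To make the table entries concrete, the following base R sketch computes the same kind of statistics on a small hypothetical data frame. The data, the column values, and the helper fmtCont are made up for illustration. The sketch shows the arithmetic behind "Mean (Standard deviation) [Minimum-Maximum]" and "N (%)" entries and is not how IQRtools implements its summary functions.

```r
# Hypothetical toy data: one row per subject (not the example dataset)
covData <- data.frame(
  TRTNAME = c("1mg", "1mg", "1mg", "2mg", "2mg", "2mg"),
  WT0     = c(68, 74, 86, 70, 80, 90),
  STUDY   = c("A", "A", "B", "A", "B", "B")
)

# Continuous covariate: "Mean (SD) [Min-Max]" per stratum
# (fmtCont is an illustrative helper, not an IQRtools function)
fmtCont <- function(x) {
  sprintf("%s (%s) [%s-%s]",
          signif(mean(x), 3), signif(sd(x), 3), min(x), max(x))
}
contSummary <- tapply(covData$WT0, covData$TRTNAME, fmtCont)
print(contSummary)

# Categorical covariate: number of subjects per category and
# percentage within each stratum (column-wise proportions)
counts <- table(covData$STUDY, covData$TRTNAME)
pct    <- signif(100 * prop.table(counts, margin = 2), 3)
print(counts)
print(pct)
```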

#### Standard graphs

##### Details on individuals

One example of an individual detail plot was already shown in the data preparation workflow above. The function plotIndiv_IQRdataGENERAL() creates a PDF with one page per individual and/or a list of graphs (one list element per individual). It displays the observations along with the dose administrations and gives the subject ID and treatment group. Each data point is labeled with its IXGDF number, so the data can be inspected in detail for correctness and problematic data points can be spotted and identified easily.

Please refer to the help of this function to look up possible customization (e.g., log-scale, selection of observations to include).

plotIndiv_IQRdataGENERAL(data1CleanM3, filename = "material/01-01-DataProgAnal/IndivPlotsClean01.pdf")
##### Dosing

The dosing schedule per individual can be inspected using the plotDoseSchedule_IQRdataGENERAL() function. The individual panels are distributed over multiple pages/graphs according to the number of individuals to be plotted on one page (NperPage, defaults to 25).

plotDoseSchedule_IQRdataGENERAL(data1CleanM3, filename = "material/01-01-DataProgAnal/Dosing01.pdf")
##### Observations

The sampling schedule per individual can be inspected using the plotSampleSchedule_IQRdataGENERAL() function. The individual panels are distributed over multiple pages/graphs according to the number of individuals to be plotted on one page (here set by NperPage = 6).

plotSampleSchedule_IQRdataGENERAL(data1CleanM3, filename = "material/01-01-DataProgAnal/Sampling01.pdf", NperPage = 6)
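
As a rough illustration of what NperPage implies, the sketch below splits a vector of subject identifiers into consecutive chunks. The chunking rule, the helper pagesFor, and the identifiers are assumptions made for illustration; the package may order or group individuals differently. The number 44 is the total number of subjects reported in the summary tables above.

```r
# Hypothetical subject identifiers (44 subjects, as in the summary tables)
ids <- sprintf("ID%02d", 1:44)

# Assumed paging rule: consecutive chunks of NperPage individuals
# (pagesFor is an illustrative helper, not an IQRtools function)
pagesFor <- function(ids, NperPage) {
  split(ids, ceiling(seq_along(ids) / NperPage))
}

pagesDose   <- pagesFor(ids, NperPage = 25)  # default of the dosing plot
pagesSample <- pagesFor(ids, NperPage = 6)   # setting used for the sampling plot

print(length(pagesDose))     # number of pages under the assumed rule
print(length(pagesSample))
print(lengths(pagesSample))  # individuals per page
```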

The actual observations can be visualized with plotRange_IQRdataGENERAL() and plotSpaghetti_IQRdataGENERAL(). In both cases, the data are stratified into one panel per treatment group. The first function shows the median and the 90% range (5th to 95th percentile) of the data per nominal time point. The second function plots the data points at the actual times, connected by lines per subject. Besides changing the scale (scale = "log" or "lin", not shown here), the plots can be split by a stratification column, in which case both the non-stratified and the stratified graphs are returned.

# Median and 90% interval per nominal time point
out <- plotRange_IQRdataGENERAL(data1CleanM3)
# List element names contain spaces, so they must be quoted with backticks
out$unstratified$`Plasma concentration IQ0815`

# Lines per individual
out <- plotSpaghetti_IQRdataGENERAL(data1CleanM3, stratify = "AGE0")

# Unstratified
out$unstratified$`Plasma concentration IQ0815`

# Stratified
out$stratified$`Plasma concentration IQ0815.AGE0CAT::1`

out$stratified$`Plasma concentration IQ0815.AGE0CAT::2`
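
The statistics behind a range plot can be sketched in base R. The example below computes the median and the 5th and 95th percentiles per nominal time point for a hypothetical toy dataset. The data and the column names NT and VALUE are chosen for illustration, and the default quantile definition of R is used, which is not necessarily the one used inside plotRange_IQRdataGENERAL().

```r
# Hypothetical toy observations: 5 subjects sampled at 3 nominal times
obs <- data.frame(
  NT    = rep(c(1, 2, 4), each = 5),
  VALUE = c(10, 12, 14, 16, 18,
            20, 22, 24, 26, 28,
             5,  6,  7,  8,  9)
)

# Median and 5th/95th percentiles per nominal time point
rangeStats <- do.call(rbind, lapply(split(obs$VALUE, obs$NT), function(v) {
  q <- quantile(v, probs = c(0.05, 0.5, 0.95), names = FALSE)
  data.frame(P05 = q[1], MEDIAN = q[2], P95 = q[3])
}))

# The row names carry the nominal times used for splitting
rangeStats$NT <- as.numeric(rownames(rangeStats))
print(rangeStats)
```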

##### Covariates

The function plotCovDistribution_IQRdataGENERAL() visualizes the distribution of continuous and categorical covariates and is thus the graphical counterpart to the summaryCov_IQRdataGENERAL() and summaryCat_IQRdataGENERAL() functions. The correlations among continuous covariates (plotCorCov_IQRdataGENERAL()), among categorical covariates (plotCorCat_IQRdataGENERAL()), and between continuous and categorical covariates (plotCorCovCat_IQRdataGENERAL()) can also be explored graphically. In our example, we use the input arguments covNames/catNames to omit the SEX covariate from the plots, as it contains only one category (see summary table above).

# Distribution of continuous and categorical covariates
# (STUDYN and TRT are categorical, so they are passed via catNames)
out <- plotCovDistribution_IQRdataGENERAL(data1CleanM3, catNames = c("STUDYN", "TRT"))
out$categorical

# Correlation of continuous covariates
plotCorCov_IQRdataGENERAL(data1CleanM3, covNames = c("AGE0", "WT0", "HT0"))

# Correlation of categorical covariates
plotCorCat_IQRdataGENERAL(data1CleanM3, catNames = c("STUDYN", "TRT"))

# Correlation of continuous and categorical covariates
plotCorCovCat_IQRdataGENERAL(data1CleanM3, catNames = c("STUDYN", "TRT"))
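
A numerical counterpart to these correlation plots can be obtained with base R. The sketch below uses a hypothetical toy data frame; the values are made up, and only the column names AGE0, WT0, HT0, and TRT follow the example. It shows a Pearson correlation matrix for the continuous covariates and a per-category median as a simple view of a continuous versus categorical relationship. Which statistics the IQRtools plotting functions display is not assumed here.

```r
# Hypothetical toy covariates, one row per subject
covData <- data.frame(
  AGE0 = c(25, 31, 28, 35, 22, 30),
  WT0  = c(68, 80, 75, 90, 65, 82),
  HT0  = c(170, 178, 175, 183, 168, 180),
  TRT  = c(1, 1, 2, 2, 3, 3)
)

# Continuous vs. continuous: Pearson correlation matrix
corMat <- cor(covData[, c("AGE0", "WT0", "HT0")])
print(round(corMat, 2))

# Continuous vs. categorical: median of a continuous covariate per category
wtByTrt <- tapply(covData$WT0, covData$TRT, median)
print(wtByTrt)
```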