Cluster Data in Blocks of Similar x Values and Summarize y Values per Block

clusterX.Rd

clusterX(
  x,
  y = NULL,
  groupsize = 5,
  resolution = 0.1,
  lambda = 1,
  iterlim = 100,
  log = FALSE
)

statXY(x, y = NULL, ..., quantiles = c(0.05, 0.95))

Arguments

x: x values or data.frame of x and y values
y: y values or NULL, if x is a data.frame of x and y values
groupsize: smallest expected group size
resolution: gaps between groups of data points greater than resolution lead to separation of groups.
lambda: penalization of intra-group variance, set to 1 to have more groups and set to 0 to get less but larger groups.
iterlim: maximum number of iterations the algorithm takes.
log: cluster on log(x) or on x. Does not change the value of resolution.
...: arguments going to clusterX().
quantiles: the requested quantiles, usually 0.05 and 0.95. Quantiles are returned in columns named PX.VALUE, where X = round(100*quantiles).

Value

clusterX() returns a data.frame with x, y and group values. Group is returned as a factor with numerically sorted levels.

statXY() returns summary information as a data frame. The output of clusterX() is returned in the attribute "clusterOut".

Details

Data points are sorted by increasing x value and assigned into groups of size groupsize. Next, groups separated by less than resolution are merged. In the following iterative algorithm, the L1-distance of each data point to each of the groups is computed and wheighted by the groups geometric standard deviation. Data points are then reassigned to the closest group. The procedure is repeated until group membership does not change any more.

statXY() computes a data.frame with the following columns

GROUP = group, group identifier
TIME = mean(x), mean group x value, usually time
MEAN.VALUE = mean(y), mean group y value
MEDIAN.VALUE = median(y), median of group y values
SD.VALUE = sd(y), standard deviation of group y values
SE.VALUE = sd(y)/sqrt(length(y)), standard error of group y values
GEOMMEAN.VALUE = exp(mean(log(y))), geometrical mean of y values
GEOMSD.VALUE = exp(sd(log(y))), geometrical standard deviation of y values
PX.VALUE = quantile(y, probs = X/100), X\

Other Auxiliary: IQRloadCSVdata(), IQRsaveCSVdata(), and(), aux_explode(), aux_explodePC(), aux_fileparts(), aux_fileread(), aux_filewrite(), aux_getRelPath(), aux_mkdir(), aux_na_locf(), aux_postFillChar(), aux_preFillChar(), aux_quantilenumber(), aux_rmdir(), aux_simplifypath(), aux_splitVectorEqualPieces(), aux_strFindAll(), aux_strrep(), aux_strtrim(), aux_unlevel(), aux_version(), calcAICBIC(), compare_IQRmodel_IQRsysModel_simulation(), fit_EmaxModel(), format_GUM(), ge(), gen_aux_version(), geocv(), geomean(), geosd(), ginv(), gt(), interp0(), interp1(), interpcs(), inv_logit(), le(), logit(), lt(), mod(), mvrnorm(), norm_M3(), or(), piecewise(), progressBar(), remove_duplicates(), run_silent_IQR(), stopIQR(), tempdirIQR(), tempfileIQR(), warningIQR()

Examples

if (FALSE) { # \dontrun{

library(ggplot2)

# Center timers for simulation and function to produce nice curve
timesD <- c(2, 10, 15, 30, 60, 120)
myfn <- function(x) 100*(1-exp(-.03*x))*exp(-.1*x)
# Randomly sampled times around center times with variable sample size
times <- unlist(lapply(timesD, function(x) stats::rnorm(runif(1, 2, 10), x, 0.2*x)))
# Simulate noisy data
x <- data.frame(
  TIME = times,
  VALUE = stats::rnorm(length(times), myfn(times), 0.1*myfn(times) + 1)
)
x <- subset(x, TIME > 0)

# Run cluster algorithm and get statistics
stat <- statXY(x, groupsize = 5, resolution = .01)
out <- attr(stat, "clusterOut")


# Plot result
P <- ggplot(out, aes(x = TIME, y = VALUE, color = block, pch = block)) + geom_point() +
  annotate("line", x = stat$TIME, y = stat$MEDIAN.VALUE) +
  annotate("line", x = stat$TIME, y = c(stat$P5.VALUE), lty = 2) +
  annotate("line", x = stat$TIME, y = c(stat$P95.VALUE), lty = 2)
print(P)
print(P + scale_x_log10())


} # }

Cluster Data in Blocks of Similar x Values and Summarize y Values per Block

Arguments

Value

Details

See also

Examples