\documentclass[nojss]{jss}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
%\usepackage[backend=bibtex]{biblatex}


\title{boxcoxmix: An R Package for Response Transformations for Random Effect and Variance Component Models}
\Shorttitle{The \pkg{boxcoxmix} package}
\author{Amani Almohaimeed and Jochen Einbeck\\
Qassim University and Durham University}
%\maketitle

%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\SweaveOpts{prefix.string=text}

\Abstract{
Random effect models have become a mainstream statistical technique over the last decades, and the
same can be said for response transformation models such as the Box-Cox transformation.
The latter ensures that the assumptions of normality and of homoscedasticity of the response distribution are fulfilled, which
are essential conditions for the use of a linear model or a linear mixed model.  However, methodology for response transformation and {\it simultaneous} inclusion of 
random effects has been
developed and implemented only scarcely, and is so far restricted to Gaussian random effects.   
 In this vignette, we introduce a new \textbf{\textsf{R}} package, \pkg{boxcoxmix}, that aims to ensure the validity of a 
normal response distribution using the Box-Cox power transformation in the presence of random effects, thereby not requiring 
parametric
assumptions on their distribution.  This is  achieved by extending the  ``Nonparametric Maximum Likelihood"  towards a ``Nonparametric Profile Maximum Likelihood" 
technique.  The implemented techniques allow to deal with overdispersion as well as two--level data scenarios. 
}
\Keywords{Box--Cox transformation, mixed model, nonparametric maximum likelihood, {EM} algorithm}

\Address{
  Amani Almohaimeed\\
  Department of Mathematical Sciences\\
  Qassim University\\
  Qassim, KSA \\
and \\
  Department of Mathematical Sciences\\
  Durham University\\
  Durham, UK \\
  E-mail: \email{ama.almohaimeed@qu.edu.sa}\\
\\\, \\
  Jochen Einbeck\\
  Department of Mathematical Sciences\\
  Durham University\\
  Durham, UK \\
  E-mail: \email{jochen.einbeck@durham.ac.uk}
  }

\begin{document}
\SweaveOpts{concordance=TRUE}
%\SweaveOpts{concordance=TRUE}
\baselineskip 15pt
%\VignetteIndexEntry{boxcoxmix: An R Package for Response Transformations for Random Effect and Variance Component Models}
%\VignetteDepends{stats, graphics, tools, utils}

<<echo=FALSE>>=
options(width=60)
@ 

\section{Introduction}
In regression analysis, the data needs to achieve normality and homoscedasticity of the response distribution in order to enable access to linear model theory and 
associated inferential tools such as confidence intervals and hypothesis tests.
This often requires transforming the response variable. \cite{BoxCox1964} proposed a parametric power transformation technique for transforming the response in univariate linear models. This transformation has been intensively studied by many researchers. \cite{sakia1992box} briefly reviewed the work relating to this transformation. \cite{Solomon1985} studied the application of the Box-Cox transformations to simple variance component models. 
The extension of the transformation to the linear mixed effects
 model was proposed by \cite{Gurka2006}, in the case of a Gaussian random effect distribution. An obvious concern of assuming 
a normal random effect distribution is whether there are any harmful effects of misspecification.
\cite{BockAitkin1981} showed that there is no need to make an assumption about the distribution of the random effects and it can 
be estimated as a discrete mixing distribution. \cite{aitkin1996}, \cite{heckman1984} and \cite{davies1987} showed that the parameter 
estimation is sensitive to the choice of the mixing distribution specification. The problem of estimating the mixing distribution 
using a specific parametric form (e.g. normal) can be overcome by the use of non-parametric maximum likelihood (NPML) estimation; 
the NPML estimate of the mixing distribution is known to be a discrete distribution involving a finite number of mass-points and corresponding 
masses \citep{Laird1978,Lindsay1983}. An Expectation-Maximization (EM) algorithm is used for fitting the finite mixture distribution, 
each iteration of this algorithm is based on two steps: the expectation step (E-step) and the maximization step (M-step); see 
\cite{Aitkin1999, Aitkin2009} and \cite{Einbeck2007} for details. The maximum likelihood (ML) estimate via the EM algorithm is a 
preferable approach due to its generality and simplicity; when the underlying complete data come from an exponential family whose 
ML estimates are easily computed, then each maximization step of an EM algorithm is likewise easily computed \citep{Dempster1977}. 
For both overdispersed and variance component models, the EM algorithm for NPML estimation of the mixing distribution was regarded as ``very stable and converged in every case'' \citep{Aitkin1999}.

A particular appealing aspect of the NPML approach is that the posterior probability that a certain unit belongs to a certain cluster corresponds  to the weights in the final iteration of the EM algorithm \citep{sofroniou2006analyzing}. Another benefit of this approach is that increasing the number of mass points requires little computational effort and that the mass--points are not restricted to lie on a grid \citep{aitkin1996}. Aitkin concluded that 
``the simplicity and generality of the non--parametric model and the EM algorithm for full NPML estimation 
in overdispersed exponential family models make them a powerful modelling tool''. The ability of the EM algorithm to locate the global maximum in fewer iterations can be affected by the choice of initial values; several methods for choosing initial values for the EM algorithm in the case of finite mixtures are discussed by \cite{KarlisXekalaki2003}. 
A grid search for setting the initial values was suggested by \cite{Laird1978}. \cite{Hou2011} found limited difference from subsequent 
test of structural effects if the factors with structural effects were omitted during the estimating process for the Box-Cox power 
transformation parameter. They noted that the Box-Cox transformation works better only if the cluster sizes are very large; and it 
is necessary to run a grid search of the transformation in order to determine the parameter estimate that maximizes the residual (or profile) 
likelihood during the optimization process both under the linear and the mixed model settings. \cite{Nawata1994} proposes a scanning 
Maximum likelihood method. Basically one conducts the entire methodology on a grid of fixed values of the transformation parameter $\lambda$ and then optimizes 
over this grid. \cite{nawata2013} used this method to calculate the maximum likelihood estimator of the Box-Cox transformation model. 
\cite{Gurka2006} noted that it is necessary to discuss how the estimation of $\lambda$ affects inference about the other model 
parameters when one extends the Box-Cox transformation to the linear mixed model. 

This vignette introduces (an implementation of) a transformation approach by extending the Box-Cox transformation to overdispersion and two–level variance 
component models. It aims to ensure the validity of a normal response distribution using the Box--Cox power transformation in the presence of random effects, 
thereby not requiring parametric assumptions on their distribution. This is achieved by extending the  ``Nonparametric Maximum Likelihood''  
towards a ``Nonparametric Profile Maximum Likelihood (NPPML)'' technique. To the best of our knowledge, the approach turns out to be the only one of its 
kind that has implemented the Box-Cox power transformation of the linear mixed effects model with an unspecified random effect distribution. 


For an existing implementation of the Box-Cox transformation 
for the univariate linear model in \textbf{\textsf{R}}, we mention the \code{boxcox()} function in the \pkg{MASS} package \citep{VenablesRipley2002}. 
Essentially, \code{boxcox()} calculates 
and plots the profile log-likelihood for the univariate linear model against a set of $\lambda$ values, in order to locate 
the transformation parameter under which the log-likelihood is maximized (yielding, after transformation, data that follow a normal distribution 
more closely than the untransformed data). In turn, the NPML methodology is  implemented in the \pkg{npmlreg} package \citep{Aitkin2009,Einbeck2014},
which provides functions  \code{alldist()} and \code{allvc()} for simple overdispersion models and variance component models, respectively.  In this article,  we introduce the \pkg{boxcoxmix} package which can be considered as a comnination of the Box--Cox and NPML concepts and which
implements transformation models for random effect and variance component models using the NPPML technique.   The package is available from the Comprehensive \textbf{\textsf{R}} Archive Network (CRAN) at 
\url{https://cran.r-project.org/package=boxcoxmix}. 


The remainder of the article is organized as follows. Section \ref{boxcoxran} begins by providing a general introduction to the Box-Cox transformation 
for the linear model, as well as the theory and methodology underlying random effect models with unspecified random effect distribution.  It proceeds with using the ``Nonparametric Profile Maximum Likelihood'' technique to combine these two methods. It also explains the basic usages of \pkg{boxcoxmix}'s 
main functions with a real data example. In Section \ref{boxcoxvc}, the Box--Cox transformation is extended to the two-level variance component model, along with 
some examples. The article concludes with a discussion in Section \ref{discuss}.


% !TeX root = RJwrapper.tex

\section{Box-Cox transformation in random effect models}\label{boxcoxran}

\subsection{Box-Cox transformation}
The Box--Cox transformation \citep{BoxCox1964} has been widely used in applied data analysis. The objective of the transformation is to 
select an appropriate parameter $\lambda$ which is then used to transform data such that they follow a normal distribution more closely 
than the untransformed data. The transformation of the responses $y_{i}$, $i=1,\ldots,n$, takes the form:

\begin{equation}
y_{i}^{(\lambda)}= \left\{\begin{array}{cc}
        \frac{y_{i}^{\lambda}-1}{\lambda}  & (\lambda\neq 0), \\
        \log  y_{i}  & (\lambda= 0), 
\end{array}\right.
\end{equation}

where the restriction $y_{i}>0$ applies. The response variable transformed by the Box--Cox transformation is assumed to be linearly related to its covariates and the errors normally distributed with constant variance. 

\subsection{Random effects}

In the linear model, it is assumed that a set of explanatory variables $x_{i}$, \, $i=1,\ldots,n$, and a response variable $y_{i}$ 
are linearly related such that $ y_{i}=x_{i}^{T}\beta +  \epsilon_{i}$ where $\epsilon_{i}$ is an error term which is usually assumed to be Gaussian and homescedastic.
%came from an exponential family distribution $f(y_{i}\vert \beta)$. 
If the population from which the data are sampled consists of heterogeneous, unknown subpopulations, then the linear model described above will not fit well.
In such cases, the presence of further unknown variability can be 
accommodated by adding a random effect $z_{i}$ with density $g(z)$ to the linear predictor,
\begin{equation}
y_{i}=x_{i}^{T}\beta + z_{i} + \epsilon_{i}.
\end{equation} 
The responses $y_{i}$ are independently distributed with 
mean function $E(y_{i}\vert z_{i}) = x_{i}^{T}\beta+z_{i}$, conditionally on the random effect $z_{i}$.  
Let $\phi(y;\cdot, \cdot)$ denote
the univariate Gaussian probability density function, with mean and variance specified in the remaining two function arguments. The conditional probability density 
function of $y_{i}$ given $z_{i}$ is given by 
\begin{equation}
f(y_{i}\vert z_{i})= \phi (y_{i}; x_{i}^{T}\beta+z_{i} ,\sigma^{2}) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp \left[ -\frac{1}{2\sigma^{2}} (y_{i} - x_{i}^{T} \beta-z_{i})^{2}\right]. 
\end{equation}
Note that under the presence of a random effect, the parametric intercept term can be omitted from $x_{i}^{T}\beta$. Under the NPML estimation approach, 
the distribution of the random effect will be approximated by a discrete distribution at mass points $z_1, \ldots, z_K$, which can be considered as 
intercepts for the different unknown subgroups. This will be explained in detail in the following subsection, under inclusion of the Box--Cox transformation.

\subsection{Extending the Box-Cox transformation to random effect models}
In this section, the Box-Cox transformation is extended to the random effects model. In this case, it is assumed that there is a value of $\lambda$ for which \\
\begin{equation}
y_{i}^{(\lambda)}\vert z_{i} \sim N(x_{i}^{T}\beta+z_{i},\sigma^{2}),
\end{equation} \\
where $z_{i}$ is a random effect with an unspecified density  $g(z_{i})$. Taking account of the Jacobian of the transformation from $y$ to $y^{(\lambda)}$, 
the conditional probability density function of $y_{i}$ given $z_{i}$ is \\
\begin{equation}
\label{fik}
f(y_{i}\vert z_{i})= \frac{y_{i}^{\lambda -1}}{\sqrt{2\pi\sigma^{2}}}\exp \left[ -\frac{1}{2\sigma^{2}} (y_{i}^{(\lambda)} - x_{i}^{T} \beta-z_{i})^{2}\right] ,
\end{equation} \\
hence, the likelihood in relation to the original observations is
\begin{equation}
\label{lik}
L(\lambda, \beta, \sigma^{2},g)=\prod^{n}_{i=1} \int f(y_{i}\vert z_{i}) g(z_{i})dz_{i}.
\end{equation}  

Under the non-parametric maximum likelihood (NPML) approach,  the integral over the (unspecified) mixing distribution $g(z)$ is approximated by a discrete distribution 
on a finite number $K$ of mass-points $z_{k}$, with masses $\pi_{k}$ \citep{Aitkin2009}. The approximated likelihood is then \\
\begin{equation}
\label{lik_discrete}
L(\lambda, \beta, \sigma^{2},z_{1},....,z_{k},\pi_{1},.....,\pi_{k})=\prod^{n}_{i=1} \sum_{k=1}^{K} \pi_{k}  f_{ik}
\end{equation}
where $f_{ik}= f(y_{i}\vert z_{k})$. Defining indicators
\begin{equation}
  G_{ik}= \left\{\begin{array}{cc}
   1 & \mbox{if observation } y_{i} \mbox{ comes from cluster }k,\\
  0 & \mbox{otherwise,}
  \end{array}\right.
\label{eq: indicator1}
\end{equation}
the complete likelihood would be \\
\begin{equation}
L^{*}=\prod^{n}_{i=1} \prod_{k=1}^{K} (\pi_{k}  f_{ik})^{G_{ik}},
\end{equation} \\
so that the complete log-likelihood takes the shape 
\begin{equation}\label{eq: lr}
\ell^{*}=\log L^{*}=\sum^{n}_{i=1} \sum_{k=1}^{K} \left[ G_{ik} \log \pi_{k} + G_{ik} \log f_{ik}\right]. 
\end{equation}
If $K=1$, the log-likelihood would be the usual log-likelihood of the Box--Cox model without random effects. 

We now apply the expectation-maximization (EM) approach to find the maximum likelihood estimate (MLE) of the model parameters. 
Given some starting values $\beta^{0}, \sigma^{0}$, $z_{k}^{0}$, and $\pi^{0}_{k}$ (discussed in a separate subsection below),
set  $\hat{\beta}=\beta^{0}$, $\hat{\sigma}= \sigma^{0}$, $\hat{z}_{k}=  z_{k}^{0}$, $\hat{\pi}_{k} = \pi^{0}_{k}$,  $k=1,2,\ldots ,K$, and iterate between

\textbf{E-step:} Estimate $G_{ik}$ by its expectation
\begin{equation} 
\label{wik}
w_{ik}= \frac{\hat{\pi}_{k} f_{ik}}{\sum_{\ell}\hat{\pi}_{\ell} f_{i\ell}}
\end{equation} \\
which is the posterior probability that observation $y_{i}$ comes from cluster $k$. Note that $f_{ik}$ depends via equation (\ref{fik}) implicitly on the current values of $\hat{z}_{k}$, $\hat{\beta}$ and $\hat{\sigma}^{2}$.\\
\\
\textbf{M-step:} The estimators $\hat{\beta}, \hat{\sigma}^{2}, \hat{z}_{k}$ and $ \hat{\pi}_{k} $ can be obtained 
 using the current $w_{ik}$, via the following four equations which were obtained through manual derivation of the NPML estimators for fixed $K$:\\
\begin{eqnarray}
\hat{\beta}&=&\left(  \sum^{n}_{i=1}  x_{i} x_{i}^{T} \right)^{-1} \sum^{n}_{i=1} x_{i} \left(  y_{i}^{(\lambda)} -\sum_{k=1}^{K} w_{ik} \hat{z}_{k} \right),\label{eq: betar1}\\
\hat{\sigma}^{2}  &=& \sum^{n}_{i=1} \sum_{k=1}^{K} \frac{w_{ik} (y^{(\lambda)}_{i} - x^{T} \hat{\beta}-\hat{z}_{k})^{2}}{n}, \\
\hat{z}_{k} &=& \frac{\sum^{n}_{i=1} w_{ik}(y_{i}^{(\lambda)} - x_{i}^{T} \hat{\beta})}{\sum^{n}_{i=1} w_{ik}}, \label{eq: z1}\\
 \hat{\pi}_{k}&=&\frac{ \sum^{n}_{i=1} w_{ik}}{n}.\label{eq: p1}
\end{eqnarray}
We see from this that $\hat{\pi}_k$ is the average posterior probability for component $k$.  

Replacing the results into Equation (\ref{lik_discrete}) we get the non-parametric profile likelihood function $L_P (\lambda)$, or its logarithmic version 
\begin{equation}
\label{nppll}
\ell_{P} (\lambda)= \log \Big( \sum_{k=1}^{K} \hat{\pi}^{(\lambda)}_{k}   \hat{f}^{(\lambda)}_{ik} \Big) .
 \end{equation}
The non-parametric profile maximum likelihood (NPPML) estimator is therefore given by \\
\begin{equation} \label{eq: NPPML1}
\hat{\lambda} = arg  \, \max_\lambda \: \ell_{P} (\lambda).
\end{equation} 

In practice, the EM--algorithm needs to be stopped after a certain number of iterations when it has reached its point of convergence. \cite{Polanska2003} defined this convergence criterion as the absolute change in the successive 
log-likelihood function values being less than an arbitrary parameter such as $\delta = 0.0001$. 


In package \pkg{boxcoxmix}, the main function for fitting random effect models with response transformations is \code{optim.boxcox()}, which performs a grid search of (\ref{nppll}) over the parameter $\lambda$ and then optimizes over this grid, in order to calculate the maximum likelihood estimator $\hat{\lambda}$ of the transformation. It produces a plot of the non-parametric profile likelihood function that summarises information concerning $\lambda$, including a vertical line indicating the best value of $\lambda$ that maximizes the non--parametric profile log--likelihood.  
In order to fit models with fixed value of $\lambda$, one can use function \code{np.boxcoxmix()}.
%that is designed to account for `overdispersed' linear models and variance component models.  
When $\lambda$ =1 (no transformation), the results of the proposed approach will be very similar to that of the \pkg{npmlreg} 
function \code{alldist()}. However, the function \code{np.boxcoxmix()} is not a copy or extension of the \code{alldist()} function; 
the implementation is based on directly computing (\ref{eq: betar1})-(\ref{eq: p1}) rather than relying on the output of the \code{glm()} function. 

Beside the parameter estimates, the function produces the standard errors of the estimates and the 
log--likelihood value. Further, the the Akaike's Information Criterion (AIC) and Bayesian Information Criteria (BIC) are calculated to find the best fitting line for the data, using the expressions
\begin{equation} \label{aic}
\mbox{AIC} = -2 \ell_{P}(\lambda) +2\times (p+2K-1+c)
\end{equation}
\begin{equation} \label{bic}
BIC = -2 \ell_{P}(\lambda) +\log(n) \times (p+2K-1+c)
\end{equation}
where $\ell_{P}(\lambda)$ is the profile log-likelihood function given in (\ref{nppll}) which is obtained by substituting the maximum likelihood estimators of the model parameters (i.e. $z=\hat{z}$, $\pi=\hat{\pi}$, $\beta=\hat{\beta}$ and $\sigma=\hat{\sigma}$), and the second part of the AIC and BIC equations computes the number of parameters estimated in the model. $p$ is the number of regression parameters in $\hat{\beta}$, $K$ is the number of mixture classes, $c$ is the value $1$ if the transformation parameter is estimated and zero otherwise, and $n$ is the number of observations.

To support diagnostics and model checking, a plot of the disparity with the iteration number on the x-axis and the mass points on the y-axis,  as well as 
normal Q-Q plots to determine how well a set of values follow a normal distribution, can be obtained. Furthermore, control charts of the 
residuals of the data before and after applying the transformation  can be produced to detect special causes of variation. 
There are many possible causes of an out--of--control point, 
including non-normal data and the number of classes, $K$. 


\subsection{Starting point selection and the first cycle} 

In the first cycle of the algorithm, the model is fitted initially without random effect, given some starting values $\beta^{0}$ and $\sigma^{0}$. It remains to choose the starting mass points $z_{k}^{0}$ and corresponding masses $\pi^{0}_{k}$, for which the implementation 
of \pkg{boxcoxmix} provides two different methods as outlined below: 
\begin{itemize}
\item Gauss-Hermite quadrature points \citep{EinbeckHinde2006}:
\begin{equation}
z_{k}^0= \hat{\beta}_{0}+ \emph{tol} \times s \times g_{k}
\end{equation}
where $\beta_{0}$ is the intercept of the fitted model, \emph{tol} is a scaling parameter restricted to the choice $0 \leq $ \emph{tol} $\leq 2$, $g_{k}$ are Gauss-Hermite quadrature points, 
and $s$ is the standard deviation of residuals defined as, 
\begin{equation}
s=\sqrt{\frac{1}{n-p}\sum_{i=1}^{n} \hat{\varepsilon_{i}}^{2}}
\end{equation}
where $n-p$ is the degrees of freedom for $\hat{\varepsilon}_{i}$, $n$ is the sample size, $p$ represents the number of parameters used 
to fit the model $y_{i}^{(\lambda)}=x_{i}^{T}\beta + \varepsilon_{i}$ and $\hat{\varepsilon}_{i}$ is the difference between the observed 
data of the dependent variable $y_{i}^{(\lambda)}$ and the fitted values $\hat{y_{i}}^{(\lambda)}$(i.e. $\hat{\varepsilon}_{i}=y_{i}^{(\lambda)}-\hat{y_{i}}^{(\lambda)}$).
\item Quantile-based version 
\begin{equation}
z_{k}^0= \bar{y}^{(\lambda)}+\emph{tol}  \times q_{k}^{(\lambda)}
\end{equation}
where $\bar{y}^{(\lambda)}$ is the mean of the responses $y_{i}^{(\lambda)}$ and $q_{k}^{(\lambda)} =\frac{k}{K}-\frac{1}{2K}$ are quantiles 
of the empirical distribution of $y_{i}^{(\lambda)}-\bar{y}^{(\lambda)}$.
\end{itemize}
(For either case, \pkg{boxcoxmix} provides the functions \code{Kfind.boxcox} and \code{tolfind.boxcox()} to identify optimal values of $K$ and \emph{tol}, respectively.) 
 
From this one obtains the extended linear predictor for the $k$-th component $E(y_{i}^{(\lambda)}\vert z_{k}^0)=x_{i}^{T}\beta + z_{k}^0$. Using formula (\ref{wik}) 
with current parameter estimates, one gets an ``initial E-step" and in the subsequent M-step one obtains the parameter estimates by solving the score equations. 
From the resulting estimates of this cycle, one gets an updated value of the weights, and so on.


\sloppy

\subsection{Generic functions}
\pkg{boxcoxmix} supports generic functions such as \code{summary()}, \code{print()} and  \code{plot()}. Specifically, \code{plot()}
can be applied on the output of \code{np.boxcoxmix()}, \code{optim.boxcox()}, \code{Kfind.boxcox} and \code{tolfind.boxcox()}.
% and to produce some useful diagnostic plots of objects generated by the functions \code{np.boxcoxmix()}, 
%\code{optim.boxcox()} and \code{tolfind.boxcox()}, 
%using the generic function \code{plot()}. 
The plots to be printed depend on the choice of the argument \code{plot.opt}, \\
\begin{itemize}
\item 1, the disparities with the iteration number against the mass points;
\item 2, the fitted values against the response of the untransformed and the transformed data; 
\item 3, probability plot of residuals of the untransformed against the transformed data;
\item 4, individual posterior probabilities;
\item 5, control charts of residuals of the untransformed against the transformed data;
\item 6, the histograms of residuals of the untransformed against the transformed data; 
\item 7, plots the specified range of \emph{tol} against the disparities (works only for the \code{tolfind.boxcox()} function);
\item 8, gives the profile likelihood function that summarises information concerning $\lambda$ (works only for the \code{optim.boxcox()} function);
\item 7, plots the specified range of $K$ against the aic or bic values (works only for the \code{Kfind.boxcox} function).
\end{itemize}
%Other generic functions for \pkg{boxcoxmix} are \code{summary()} and \code{print()} that print output summaries of fitted models.\\

\fussy

\subsection{Application to the strength data}
In this section we analyze the \textbf{strength} data from the \textbf{\textsf{R}} library \pkg{mdscore} \citep{daSilva2014} which is a 
subsample of the 5 x 2 factorial experiment given by \cite{ostle1954}. The objective here is to investigate the effects of the 
covariates \verb@lot@ and \verb@cut@ on the impact strength, where \verb@lot@ denotes the lot of the material (I, II, III, IV, V) 
and \verb@cut@ denotes the type of specimen cut (Lengthwise, Crosswise). The model presented is a two-way \verb@lot@ $\times$ \verb@cut@ 
interaction model. For the $i$--th \verb@cut@ and $j$--th \verb@lot@, we have\\
\begin{equation} \label{eq:np1}
y_{ij}=  \gamma_{i} + \beta_{j} + \delta_{ij} + z, \;\qquad i=1,2, \quad\: j= 1,2,..,5, 
\end{equation}
where $\gamma_{1}=0$, $\beta_{1}=0$, $\delta_{1,1}=\delta_{1,2}=\cdots=\delta_{1,5}=\delta_{2,1}=0$, and $z$ is the random effect with 
an unspecified mixing distribution.\\

\cite{shuster1972} considered the Inverse Gaussian distribution as an adequate distribution in modelling strength data. We therefore 
suggest to fit a number of models including the Inverse Gaussian model and compare the results below. 
%We use $K=3$ for all fitted models, which we justify 
%lateron.

For a fixed value of $\lambda$, we fit the model with settings $\lambda=-1$,  $tol=1.8$ and $K=3$ (the latter two choices to be justified below), so the response will be 
transformed as 
\[
y^{(\lambda)}=(y^{-1} -1)/-1 = -y^{-1} +1.
\]
Using \code{np.boxcoxmix()},

<<>>=
library(boxcoxmix)
data(strength, package="mdscore")
test.inv <- np.boxcoxmix(y ~ cut *lot, data = strength, K = 3,  
                         tol = 1.8,  start = "gq", lambda = -1,
                         verbose=FALSE) 
test.inv
@


For comparison, we also fit the same model without transformation, using function \code{np.boxcoxmix()} with setting $\lambda=1$, $tol=1.8$ and $K=3$:

<<>>=
test.gauss <- np.boxcoxmix(y ~ cut *lot, data = strength, K = 3,  
                           tol = 1.8,  start = "gq", lambda = 1,
                           verbose=FALSE) 
test.gauss
@

Using now our grid search method \code{optim.boxcox()} that calculates and plots the profile log-likelihood 
values for the fitted model (\ref{eq:np1}) against a set of $\lambda$ values, and locates the MLE $\hat{\lambda}$ (see Fig. \ref{fig:str3}):


<<eval=FALSE>>=
test.optim <- optim.boxcox(y ~ cut*lot, data = strength,  K = 3,
                           tol = 1.8, start = "gq", find.in.range = c(-3, 3),
                           s = 60) 
plot(test.optim, 8)
@


\begin{figure} 
\centering
 \includegraphics[width=.60\textwidth]{maxnp.png} 
   \caption{A grid search over $\lambda$, using $K=3$ and $tol=1.8$ } \label{fig:str3}
 \end{figure} 

Figure \ref{fig:str3} shows that the best value of $\lambda$ that maximizes the profile log-likelihood is $0.1$ 
which is close to zero, suggesting that some transformation need to be carried out to make the data distribution appear more normal. 


We also fit the model shown in (\ref{eq:np1}) with an Inverse Gaussian distribution using the \pkg{npmlreg} function \code{alldist()}, 
using $tol=0.45$ and $K=3$. 
<<>>=
library(npmlreg)
inv.gauss <- alldist(y ~ cut*lot, data = strength, k = 3, tol = 0.45,
                     verbose=FALSE,  family = "inverse.gaussian")
inv.gauss
@

For the starting point selection, the optimal value of $tol$ can be selected prior to this analysis using a grid search over $tol$ 
using \pkg{boxcoxmix} function \code{tolfind.boxcox()} (see Fig. \ref{fig:tolf}). 

\begin{verbatim}
> tol.find <- tolfind.boxcox(y ~ cut*lot,  data = strength,  K = 3,     
      start = "gq", lambda = 1,  find.in.range = c(0, 2),  s = 20)
\end{verbatim}

\begin{figure}
\centering
 \includegraphics[width=.60\textwidth]{tolf.png} 
   \caption{For the strength data, a grid search over $tol$, using $K=3$ and $\lambda=1$ } \label{fig:tolf}
\end{figure} 

Similarly, the value $tol=0.45$ used by \code{alldist()} has been selected as the optimal value of \emph{tol} using the \pkg{npmlreg} function \code{tolfind()}.

 The Akaike Information Criteria (AIC) defined in (\ref{aic}), is used as a criterion for choosing amongst 
the models. The model with the lowest AIC value is considered as the best model.
Table \ref{table:str3} displays summary statistics for the Inverse Gaussian distribution model (Inv.Gauss), 
transformed models using $\lambda =-1$ and $\hat{\lambda}=0.1$, and the untransformed model ($\lambda =1$). 

\begin{table}
\centering
\begin{tabular}{ ccccc  }
  &Inv.Gauss & $\lambda =-1$  &$\hat{\lambda}=0.1$  &$\lambda=1$ \\
   \hline
   $\gamma_{2}$ & 0.3611 &-0.4174& -0.2943& -0.2555   \\
  $\beta_{2}$ & -0.3280  & -0.1310 & -0.0887& -0.0801 \\
    $\beta_{3}$  &0.4435 &-0.4522& -0.3175& -0.2722  \\
    $\beta_{4}$ &0.0857 &-0.0338& -0.2383&  -0.2203  \\
    $\beta_{5}$  &2.2516&-0.8161&-0.6845&  -0.5401  \\
$\delta_{2,2}$ & -0.5111&0.4965& 0.3715& 0.3323  \\
$\delta_{2,3}$ & 0.5146&0.1813&0.1141& 0.1554   \\
$\delta_{2,4}$& -0.1999&0.3404&0.4604 & 0.4070  \\
$\delta_{2,5}$ & -0.1923&0.2595&0.3378& 0.3536   \\
 $\sigma$ & 0.3966&0.06169&0.0207& 0.0206   \\
 $-2 \log L$ &-68 &-73.70853&-98.02242 & -86.61931\\
   AIC & -40&-45.7085  &-68.02242&-58.6193  \\
 \hline
\end{tabular}
\caption{Comparison of results from original \& transformed data, using $K=3$.}
\label{table:str3}
\end{table}

 The Inverse Gaussian model gives the worst AIC. Better AIC values are given by the transformed model using $\lambda =-1$, the 
Gaussian ($\lambda =1$) and $\hat{\lambda}$. The lowest AIC found was for the transformed model using $\hat{\lambda}$ with -68.0224. 
The parameter estimates of the untransformed and the Box--Cox--transformed model using $\hat{\lambda}$ are broadly in agreement but the latter has better 
disparity and AIC values. However, the results from the other models are quite different and the worst disparity was found for the Inverse 
Gaussian model. Among the four models, the one with  $\hat{\lambda}=0.1$ provides the best fit of the data,  
which does not necessarily support the model choice taken in \cite{shuster1972}.

\begin{table}
\centering
\begin{tabular}{ cccc  }
 $K$ & $\lambda =-1$   &$\hat{\lambda}=0.1$ &$\lambda=1$  \\
   \hline
 1 & -30.01438 &  -33.57915 &  -29.45051\\
   2   &  -50.10725 & -56.71019 &-44.64449 \\
  3 &  -45.70853 &-70.02242&-58.61931 \\
  4 &  -50.42968 &  -59.40018& -52.4271\\
5& -57.4437 &  -60.17015 & -49.17725\\
6&  -64.53892 &  -51.40021 & -44.42724\\
7&  -49.44363 &   -52.17016 & -54.39248\\
 \hline
\end{tabular}
\caption{Comparison of AIC values}
\label{table:strAIC}
\end{table}

The appropriate number of classes $K$ could be obtained by comparing the AIC from fitting several mixture models with 
different numbers of classes $K$, as illustrated in Table \ref{table:strAIC}.

 \section{Box-Cox transformation in variance component models}\label{boxcoxvc}
\subsection{Variance component model}
\setcounter{equation}{0}
We now consider the two-level variance component model. An unobserved random effect $z_{i}$ with upper-level indexed by $i=1 \ldots,r$, 
and lower-level indexed by $j=1,\ldots,n_{i}$, $\sum n_{i} = n$ is added to the linear predictor $x_{ij}^{T}\beta $. The responses $y_{ij}$ 
are independently distributed with conditional mean function
 \begin{equation}
E(y_{ij}\vert z_{i}) = x_{ij}^{T}\beta+z_{i}
\end{equation} 
where the distribution of the $z_{i}$ is again unspecified. The conditional probability density function of $y_{ij}$ given $z_{i}$ is given by 
 \begin{equation}
f(y_{ij}\vert z_{i})= \phi (y_{ij}; x_{ij}^{T}\beta+z_{i} ,\sigma^{2}) =
 \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp \left[ -\frac{1}{2\sigma^{2}} (y_{ij} - x_{ij}^{T} \beta-z_{i})^{2}\right]. 
\end{equation} 
\subsection{Extending the Box-Cox Transformation to variance component models}
For the two-level variance component model with responses $y_{ij}$, the Box-Cox transformation \citep{BoxCox1964} 
can be written as 
\begin{equation}
y_{ij}^{(\lambda)}=
   \left\{\begin{array}{cc}
        \frac{y_{ij}^{\lambda}-1}{\lambda}\, & \lambda\neq 0, \\
         \log  y_{ij} \, & \lambda= 0
         \end{array}\right.
\end{equation}
for $y_{ij}>0$, $i=1,...,r$, $j=1,....,n_{i}$, and $\sum n_{i} = n$. 
It is assumed that there is a value of $\lambda$ for which 
\begin{equation}
y_{ij}^{(\lambda)}\vert z_{i} \sim N(x_{ij}^{T}\beta+z_{i},\sigma^{2})
\end{equation} 
 where $z_{i}$ is a random effect with an unspecified mixing distribution $g(z_{i})$. The likelihood can now be approximated as \citep{Aitkin2009}
\begin{equation} \label{eq: 55}
L(\lambda, \beta, \sigma^{2},g)=\prod^{r}_{i=1} \int \left[\prod^{n_{i}}_{j=1} f(y_{ij} \vert z_{i}) \right] g(z_{i})dz_{i}\approx \prod^{r}_{i=1} \sum_{k=1}^{K}  \pi_{k} m_{ik},
\end{equation}  
where $m_{ik} = \prod^{n_{i}}_{j=1} f(y_{ij}\vert z_{k})$. The complete log-likelihood is thus \\
\begin{equation}\label{eq: 9}
\ell^{*}=\log L^{*} =\sum^{r}_{i=1} \sum_{k=1}^{K} \left[ G_{ik} \log \pi_{k} + G_{ik} \log m_{ik}\right] 
\end{equation}
where $L^{*}=\prod^{r}_{i=1} \prod_{k=1}^{K} (\pi_{k}  m_{ik})^{G_{ik}}$.\\

We apply the expectation-maximization (EM) approach similar as before, with the following adjustments:

\textbf{E-step:} This is identical to (\ref{wik}), but with $f_{ik}$ replaced by $m_{ik}$.\\
\textbf{M-step:} 
Using the current $w_{ik}$, the four estimators are now:
\begin{eqnarray*}
\hat{\beta}&=&\left(  \sum^{r}_{i=1}  \sum_{j=1}^{n_{i}} x_{ij} x_{ij}^{T} \right)^{-1} \sum^{r}_{i=1} \sum_{j=1}^{n_{i}} x_{ij} 
\left(  y_{ij}^{(\lambda)} - \sum_{k=1}^{K} w_{ik} \hat{z}_{k} \right),\label{eq: betav1}\\
\hat{\sigma}^{2} &=& \frac{\sum^{r}_{i=1} \sum_{k=1}^{K} w_{ik} \left[\sum_{j=1}^{n_{i}}(y_{ij}^{(\lambda)} - x_{ij}^{T} \hat{\beta}-\hat{z}_{k})^{2}\right]}{\sum^{r}_{i=1} n_{i}}, \\
\hat{z}_{k} &=& \frac{\sum^{r}_{i=1} w_{ik}\left[ \sum_{j=1}^{n_{i}} (y_{ij}^{(\lambda)} - x_{ij}^{T} \hat{\beta})\right]}{\sum^{r}_{i=1} n_{i} w_{ik}}, \label{eq: zv1}\\
 \hat{\pi}_{k} &=&\frac{\sum^{r}_{i=1} w_{ik}}{r},
\end{eqnarray*}
where $\hat{\pi}_{k}$ is the average posterior probability for component $k$.  Substituting the results into Equation (\ref{eq: 55}) we get the non-parametric profile likelihood function $L_P (\lambda)$, or its logarithmic version $\ell_P(\lambda)=\log(L_P(\lambda))$. The non--parametric profile maximum likelihood (NPPML) estimator is therefore given by
\begin{equation}
\hat{\lambda} = arg  \, \max_\lambda \: \ell_{P} (\lambda).
\end{equation} 

For fixed $\lambda$, such variance component models under response transformations are again estimated using the function \code{np.boxcoxmix()}. When $\lambda$ =1 (no transformation), the results of the proposed 
approach will be similar to that of the \pkg{npmlreg} function \code{allvc()}. 

\subsection{Application to the heights of boys in Oxford data}
In order to demonstrate how the \code{optim.boxcox()} function may be used effectively, we consider 
a data set giving the heights of boys in Oxford. The data set is part of the \textbf{\textsf{R}} package nlme \citep{nlme2016} and consists of measurements of
 \code{age} and \code{height} for 26 boys, yielding a total of 234 observations. 
The response variable \verb@height@ is defined as the height of the boy in (cm), associated with the covariate \verb@age@ 
that is the standardized age (dimensionless). The results were obtained by fitting the variance component model 
\begin{equation} \label{eq: vcm}
E(y_{ij}|z_i)= {\tt age}_{j}+ z_{i}
\end{equation}
where 
$z_{i}$ is boy--specific random effect and ${\tt age}_{j}$ is the $j$-th standardized age measurement, $j=1, \ldots, 9$, 
which is equal for all boys for fixed $j$. A model with $K=6$ mass points without response transformation can be fitted 
using the \code{np.boxcoxmix()} function setting $\lambda=1$, 

<<>>=
data(Oxboys, package="nlme")
Oxboys$boy <- gl(26,9)
Oxboys$boy 
testox <- np.boxcoxmix(height ~  age, groups = Oxboys$boy,
                       data = Oxboys, K = 6, tol = 1, start = "gq",
                       lambda=1, verbose=FALSE)
@
<<eval=FALSE>>=
plot(testox, 1)
@ 
\begin{figure}
\includegraphics[width=.60\textwidth]{testox.png} 
\caption{For the Oxboys data, estimated mass points versus EM iterations.}
\label{fig:massvc}
\end{figure}

The manual specification of \code{Oxboys$boy <- gl(26,9)} is necessary since the second argument of \code{np.boxcoxmix} 
requires a vector of group labels in order to work correctly.

The  function \code{optim.boxcox()} can again  be used to perform a grid search over $\lambda$ to obtain the optimum: 

<<eval=FALSE>>=
testo <- optim.boxcox(height ~ age, groups = Oxboys$boy, data = Oxboys, 
         K = 6, tol =1, start = "gq", find.in.range = c( -1.2, 0.1), s=15)
plot(testo, 8)
@

\begin{figure}
\centering
 \includegraphics[width=.60\textwidth]{oxboys.png} 
   \caption{\label{fig:boyoptim} For the Oxboys data, a grid search over $\lambda$, with $K=6$ and $tol=1$.}
\end{figure} 

From Figure \ref{fig:boyoptim}, it can be seen that the best estimate of $\lambda$ that maximizes the non-parametric profile log-likelihood is $-0.25$, 
suggesting that some transformation need to be carried out to make the data distribution more normal. 
The results before and after applying the response transformation shown in Table  \ref{table:boy} 
prove that the decision of transforming the response is reasonable.
\begin{table}
\centering
\begin{tabular}{ cc cc cc cc c  }
  &\multicolumn{2}{c}{$K=4$   }&\multicolumn{2}{c}{$K=5$   }& \multicolumn{2}{c}{$K=6$   }  &\multicolumn{2}{c}{$K=7$   }  \\
 & $\hat{\lambda}=0.1$   &$\lambda =1$& $\hat{\lambda}=-0.51$   &$\lambda =1$ & $\hat{\lambda}=-0.25$   &$\lambda =1$  & $\hat{\lambda}=-0.25$   &$\lambda =1$\\
   \hline
$\hat{\beta}$ 	&0.0716	&6.5264 & 0.0034 &6.5218 &0.0126 & 6.5245  &0.0082 &6.5218\\
$SE(\hat{\beta})$ &0.0031 &0.2841 & 0.0001 &0.2367 &0.0004 & 0.1918  &0.0002&0.2367\\
 $\hat{\sigma}$ &0.0310 & 2.806 & 0.0012 &2.341  &0.0035 &  1.903   &0.0023 & 2.341\\
 $-2 \log L$ 	&1211.8 &1212.7 & 1119.3 &1132.8 &1026.2 & 1048.3   &1022.3   &1132.8\\
   AIC		&1229.8 &1228.659 & 1141.324  &1152.849   &1052.2 &   1072.27   &1052.302  & 1160.849 \\
 \hline
\end{tabular}
\caption{Comparison of results from original \& transformed data, using $K=4,5,6$ and $7$}
\label{table:boy}
\end{table}

As can be seen from Table \ref{table:boy}, comparing the Akaike Information Criterion (AIC) values of the untransformed model fit 
($\lambda =1$) and our method using $K=4,5,6$ and $7$, respectively, showed a slightly better performance of the NPPML approach.
In other words, using the response data after applying the response transformation leads to a better fitting model than the 
original data. This gives further support to the decision of using the transformation. 

Concerning the choice of $K$, it is transparent from  Table  \ref{table:boy} that there is no gain in going from $K=6$ to $K=7$ as the AIC values in fact increase 
when doing so.  There is a consistent improvement, however, when increasing the number of mass points from $K=4$ over $K=5$ to $K=6$, and it
is also clear from Figure \ref{fig:massvc} that the six estimated mass points are distinct and identifiable. For the untransformed model,
\cite{Aitkin2009} recommend the use of $K=8$ mass points. 

\section{Discussion}\label{discuss}

We have introduced a new \textbf{\textsf{R}} package, \pkg{boxcoxmix}, that identifies the appropriate power transformation 
for achieving normality of the response distribution in random effect models. To the best of our knowledge, there is no other widely available statistical package 
that has implemented the Box-Cox power transformation of the linear mixed effects model with an unspecified random effect distribution. \pkg{boxcoxmix} 
is able to fit random effect and variance component models, and estimates the transformation and regression parameters simultaneously through its main function 
\code{optim.boxcox()}. This function operates similarly to the existing R function \code{boxcox()}, by creating a profile likelihood and carrying out a grid search over
the transformation parameter $\lambda$. It is noted that, just as in \code{boxcox()}, this procedure cannot make use of built--in \textbf{\textsf{R}} optimization routines such as 
\code{optim()} or \code{optimize()} since the profile likelihood itself depends on estimated parameters, estimation of which involves a full EM algorithm. 

In addition, \pkg{boxcoxmix} also can be used to fit models with fixed value of $\lambda$  
using function \code{np.boxcoxmix()}, and to perform a grid search over $tol$ using the 
function \code{tolfind.boxcox()} to identify optimal starting values for the mass points.
Our package provides some further diagnostic tools, such as a QQ--plot and a control chart of residuals, which help validating
the need for transformation.

\sloppy

In this paper we have shown how \pkg{boxcoxmix} can successfully fit models through response transformation rather than adjustment of the response distribution. 
The examples have demonstrated that the \pkg{boxcoxmix} function \code{optim.boxcox()} works well in finding the model with maximum likelihood. All transformed 
models using $\hat{\lambda}$ that were obtained by the \code{optim.boxcox()} function gave substantially better fits than the untransformed model, when 
considering the AIC criterion or the disparity ($-2\log L$).
Also, in all considered scenarios, the estimated value of $\hat{\lambda}$ was quite far away from the value $\lambda=1$. 
However, it should be added that it is not possible to report a simple likelihood--based standard error for $\hat{\lambda}$ as in \textbf{\textsf{R}} 
function \code{boxcox()}, the reason being that the likelihood in the considered model class is highly non--concave, as  visible for instance from 
Fig. \ref{fig:str3}. Hence, when faced with the decision of whether or not needing to transform the response, not only the value of $\hat{\lambda}$ 
but also relevant model selection criteria such as AIC should be taken into account. It is then essential that these are always based on 
likelihoods which are reported on the original response scale, as in models (\ref{lik}) and (\ref{lik_discrete}) --- of course, this is the 
case for the values $-2 \log L$ and $AIC$ provided in our summary  output. The experimental results verify the accuracy and the efficiency of the \pkg{boxcoxmix} package, which is 
available from the Comprehensive \textbf{\textsf{R}} Archive Network (CRAN) at \url{https://cran.r-project.org/package=boxcoxmix}.\\

\fussy

\section*{Acknowledgements}
This \textbf{\textsf{R}} package has made use of R package statmod \citep{giner2016statmod} to obtain the function \code{gqz}. 
\bibliography{boxcoxmix}
 

\end{document}