A Brief Guide To The BPST
Package
in R
Introduction
Splines can be considered piecewise polynomials that possess some overall continuity and smoothness criteria, especially at the divisions or knots (created to fit a local polynomial through the given data points on each division) over the entire domain.
The picture is from Friedman et. al.
Splines have proven to be an extremely powerful tool for smoothing observed data in the areas of mathematics, nonparametric statistics, functional data analysis, computer graphics, etc. Splines are the popular choice in these fields of study because of the simplicity of their construction, ease, accuracy of evaluation, and capacity to approximate the underlying functional component of interest over domains under complicated smoothness constraints.
The usual spline-based approaches to multivariate estimation problems involving tensor product spaces are most useful when the data are observed in a regular (e.g., rectangular) domain. However, the form of tensor products is undesirable when the data locations are spread in a general bounded domain of arbitrary shape (with complex boundaries and holes).
Triangulation is then the most appropriate tool to partition the domain (we will focus only on the 2D spaces in this blog) into segments and considering bivariate splines, the smooth piecewise polynomial functions over triangulation are then the most natural extensions of univariate spline functions over subintervals.
In this blog, we will discuss the application of bivariate penalized
splines over triangulations through different functions in the
BPST
R package for approximating the patterns of observed
data points over the domain.
How to install
The BPST
package is available on GitHub. To install
the package directly from GitHub
, we need another R
package, devtools
. It can be installed as follows:
install.packages("devtools")
Now, the BPST
package can be installed using the
following code:
library(devtools)
::install_github("FIRST-Data-Lab/BPST")
devtools::install_github("FIRST-Data-Lab/Triangulation") devtools
Next, we need to load the BPST
package to use:
library(BPST)
Since our interest lies in the bivariate spline functions over
triangulation for the domain under consideration, it is necessary to
install the Triangulation
package. A brief guide can be
found here. Once installed, we
need to load the Triangulation
package using
library(Triangulation)
Different functions available in BPST
The package contains several functions contributing to the main purpose of fitting bivariate penalized splines over triangulation and can be shortlisted for a brief introduction as follows:
inVT
: decides whether a point is inside a given triangulation.basis
: generates the basis for bivariate spline over triangulation.data.BPST
: generates the testing datasets for bivariate spline smoothing.fit.BPST
: conducts the model fitting via bivariate penalized spline over triangulation.plot.BPST
: produces the contour map for the estimated surface of a fitted “BPST” object.predict.BPST
: provides predictions of a fitted BPST object.
How to use
In this section, we will mainly concentrate on the implementation of
BPST
with some examples which are supposed to mimic
different real-life scenarios that we come across involving both simple
and complex domains.
Let \((Z_{i,1}, Z_{i,2})^{\top}\) represent the location of the \(i\)th observation, \(i = 1, \ldots, n\) ranging over a bounded domain \(\Omega\) of arbitrary shape. At the \(i\)th location, \(Y_i\) is the response variable. We consider the following nonparametric regression model: \[ Y_i=\mu(z_{i,1},z_{i,2})+\sigma\varepsilon_i, \] where \(\mu(\cdot)\) is an unknown but smooth bivariate function to be estimated, and \(\varepsilon_i\)’s are the random noises with mean 0 and variance 1.
1. A simulated example on a rectangular domain
The “BPST” package offers several simulation-based examples, and the
value of the response variable in those examples can be generated using
the data.BPST()
function, in which we can specify the
options using data.BPST(Z, V, Tr, func, sigma, seed)
.
Domain and its triangulation
To generate data, we consider a basic rectangular domain defined on \([0, 2] \times [0, 1]\)
<- rbind(c(0,0), c(2,0), c(2,1), c(0,1), c(0,0))
bb plot(bb, type = "l", col = 2)
The TriMesh()
function from the package
Triangulation
can then be used to triangulate this
rectangular domain.
<- TriMesh(Pt = bb, n = 3, H = NULL) VT
cat("The number of vertices in this triangulation is ", nrow(VT$V), " and the number of triangles is", nrow(VT$Tr), "\n")
## The number of vertices in this triangulation is 28 and the number of triangles is 36
Data generation
Next, we generate \(101\times 101\) grid points over the domain \([0, 2] \times [0, 1]\).
# Number of grid division along X axis
<- 101
n1_grid # Number of grid division along Y axis
<- 101
n2_grid # Total number of grid points over [0, 2] X [0, 1]
<- n1_grid * n2_grid
n_grid
<- seq(0, 2, length.out = n1_grid)
u_grid <- seq(0, 1, length.out = n2_grid)
v_grid <- rep(u_grid, each = n2_grid)
uu_grid <- rep(v_grid, times = n1_grid)
vv_grid <- as.matrix(cbind(uu_grid, vv_grid))
Z_grid #plot(Z.grid)
Given the data points Z_grid
, we can consider 8
different choices of \(\mu(\cdot)\)
(func=1
, …, func=8
) and \(\sigma\) (sigma
). For example,
if we take func = 1
and sigma = 0.1
, then the
value of the response at the \(i\)th
point (\(i = 1, 2, \ldots,\) n_grid)
becomes:
= Z_grid[i, 1]^2 + 3 * Z_grid[i, 2]^2 + 4 * Z_grid[i, 1] * Z_grid[i, 2] +
Y[i] rnorm(1, mean = 0, sd = sigma)
In the following, we generate the response values at the grid points
defined by Z_grid
.
# argument of `data.BPST()`, denoting the choice of test function;
# possible choices are 1, 2, ..., 8.
<- 1
func
# argument of `data.BPST()`, denoting the noise we add for the
# generation of response values
<- 0.1
sigma
<- 2022
seed
<- data.BPST(Z = Z_grid, V = as.matrix(VT$V),
gridpoints Tr = as.matrix(VT$Tr), func, sigma, seed)
# Response value at the grid points over [0, 2] X [0, 1]
<- gridpoints$Y
Y_grid
# value of the true underlying function at the grid points
<- gridpoints$mu
mu_grid
# Identify the points that are outside the triangulated domain
<- gridpoints$ind
ind <- (1:n_grid)[ind == 1]
ind_grid
## Plot of the true underlying function on the domain
<- matrix(mu_grid, length(u_grid), length(v_grid), byrow = T)
true_func image(u_grid, v_grid, true_func, col=heat.colors(100), xlab="x", ylab="y")
lines(bb[ , 1], bb[ , 2], lwd = 1)
contour(u_grid, v_grid, true_func, add=TRUE)
title("Plot of the true function on the domain")
We have generated the population data with coordinates and corresponding response values at those coordinates. Next, we will now consider a random sample of observations from the population.
# sample size
<- 200
n
# sampling from only those population points which are inside the
# triangulated domain
<- sort(sample(ind_grid, n))
ind_sam
## In this case, even though all the grid points will be within the
## rectangular domain, this step is for any generic case where the domain
## may contain some holes in it and we might have some of the grid points
## lying outside the triangulated domain
<- as.matrix(gridpoints$Z[ind_sam, ]) # sampled data coordinates within the domain
Z head(Z)
## [,1] [,2]
## [1,] 0.00 0.27
## [2,] 0.00 0.31
## [3,] 0.00 0.56
## [4,] 0.00 0.69
## [5,] 0.02 0.07
## [6,] 0.02 0.74
<- as.matrix(gridpoints$Y[ind_sam]) # response value at the points `Z`
Y head(Y)
## [,1]
## [1,] 0.25150591
## [2,] 0.37145185
## [3,] 0.94311833
## [4,] 1.47059731
## [5,] 0.04710159
## [6,] 1.72930570
# boundary of the rectangular domain in 2D
plot(bb, type = "l", col = 2)
points(Z, col = 4, pch = 20)
title(main = "Data on the rectangular domain")
Now we have defined the rectangular domain \([0, 2] \times [0, 1]\) and generated the sample containing the coordinates and response value on 200 points on the domain. Let us try to estimate the regression function based on the sampled data using the bivariate smoothing over triangulation (BPST) method in Lai and Wang (2013).
Bivariate spline smoothing
Now to fit a model on the above simulated dataset using BPST
smoothing, we use fit.BPST()
:
<- fit.BPST(Y = Y, Z = Z, V = VT$V, Tr = VT$Tr, d = 4,
mfit r = 1, lambda = 10^seq(-6, 6, by=0.5))
# Y: The response variable associated with the observed data points.
# Z: The coordinates of observed data points (dimension 200 X 2).
# Each row is the coordinates of a data point.
# V: The N by 2 matrix of vertices of a triangulation,
# where N is the number of vertices.
# Each row is the coordinates for a vertex.
# Tr: The triangulation matrix of dimension nT by 3,
# where nT is the number of triangles in the triangulation.
# Each row is the indices of vertices in V.
# d: The degree of piecewise polynomials.
# r: The smoothness parameter
# lambda: The tuning parameter
# For more details of the arguments and outputs of `fit.BPST()`,
# use ?BPST::fit.BPST()
Once we get the model fitted in the above step based on the observed data, we can now use this fitted model to predict the value of the response variable on each of the grid points of that rectangular domain as follows
# prediction of the response values on the
# entire grid over the domain based on the
# model fitted from observed data
<- predict.BPST(mfit, Z_grid)
mpred
# The fitted surface of the response over the entire domain
<- matrix(mpred$Ypred, length(u_grid), length(v_grid), byrow = T)
pred_func image(u_grid, v_grid, pred_func, col = heat.colors(100), xlab = "x", ylab = "y")
lines(bb[ , 1], bb[ , 2], lwd = 1)
contour(u_grid, v_grid, pred_func, add = TRUE)
title("Plot of the estimated test function via BPST")
Now here, since we are showcasing a simulation study, we do have the knowledge regarding the true function dictating the response values over the domain.
We can find the root mean square error (RMSE) for the fitted model as:
<- (mean((Y - mfit$Yhat)^2, na.rm = TRUE))
rmse cat("RMSE =", rmse, "\n")
## RMSE = 0.007467668
Also, we can examine the root mean square prediction error (RMSPE for predicting the response values over the entire domain based on the fitted model can be obtained.
<- sqrt(mean((Y_grid - mpred$Ypred)^2, na.rm = TRUE))
rmspe cat("RMSPE =", rmspe,"\n")
## RMSPE = 0.108686
2. A simulation example on a complex horseshoe domain
In this example, we consider a test function defined on a Horseshoe
domain (Wood et al., 2008) for our demonstration of the
BPST
function involving complex domains. The test function
and domain are well documented in the R package mgcv
. The
horseshoe domain reflects an example that the domain is almost divided
into two parts by a strict concavity with a very narrow gap between the
two segments of the domain containing a very high contrast of low and
high-density regions.
First let us plot the boundary of the horseshoe domain:
<- Triangulation::horseshoe
bb plot(bb, type = "l", col = 2)
The triangulation of the domain can be obtained as:
<- TriMesh(Pt = bb, n = 5) VT
cat("The number of vertices in this triangulation is ", nrow(VT$V), " and the number of triangles is", nrow(VT$Tr), "\n")
## The number of vertices in this triangulation is 90 and the number of triangles is 112
Similar to the previous example, a fine grid over the domain needs to be established.
if(!require('pracma')) install.packages('pracma')
<- seq(-1, 3.5, length = 101)
xm <- seq(-1, 1, length = 101)
yn <- pracma::meshgrid(xm, yn)
xy_grid <- c(xy_grid$X)
xx <- c(xy_grid$Y)
yy <- as.matrix(cbind(xx, yy)) gridpoints
The response value at the grid points can then be obtained using the
test function given in the mgcv
package
if(!require('mgcv')) install.packages('mgcv')
## Put response values = NA for points outside the
## horseshoe domain
<- mgcv::fs.test(xx, yy)
Y_grid
## Plot of the true test function over domain
<- matrix(fs.test(xx,yy), length(xm), length(yn), byrow = T)
true_func image(xm, yn, true_func, col=heat.colors(100), xlab="x", ylab="y")
lines(bb$V1, bb$V2, lwd = 1)
contour(xm, yn, true_func, add=TRUE)
title("Plot of the true test function on the horseshoe domain")
Now to draw a sample of some certain size, say 200, we will first sort out which of the grid points are within the triangulated horseshoe domain; and then we will randomly select 200 sample points from that population.
## Checking which of the grid values are within
## the triangulated horseshoe domain
<- BPST::inVT(V0 = as.matrix(VT$V), Tr0 = as.matrix(VT$Tr),
ind
xx, yy)<- (1:nrow(gridpoints))[ind$ind == 1]
ind_grid
## Drawing a sample from horseshoe domain
<- 200
n <- sort(sample(ind_grid, n))
ind_sam
## sampled data and the response values at the sampled points
<- gridpoints[ind_sam, ]
samp_data <- Y_grid[ind_sam]
samp_res
## Plot of the sampled data on the horseshoe domain
plot(bb, type = "l", col = 2)
points(samp_data, col = 4, pch = 20)
title(main = "Sampled data on the horseshoe domain")
Given the sampled data, our focus is now to fit the model using BPST smoothing and use the fitted model to predict the value of the response on the grid points.
## Model fitted based on the observed data
<- fit.BPST(Y = samp_res, Z = samp_data, V = VT$V, Tr = VT$Tr, d = 3,
mfit r = 1, lambda = 10^seq(-6,6,by=0.5))
# prediction of the response values on the
# grid points over the domain based on the
# fitted model from observed data
<- predict.BPST(mfit, gridpoints)
mpred
## Plot of the estimated test function via BPST
<- matrix(mpred$Ypred, length(xm), length(yn), byrow = T)
pred_func image(xm, yn, pred_func, col=heat.colors(100), xlab="x", ylab="y")
lines(bb$V1, bb$V2, lwd = 1)
contour(xm, yn, pred_func, add=TRUE)
title("Plot of the estimated test function via BPST")
The RMSE and the RMSPE for this model fit can be obtained by:
<- (mean((samp_res - mfit$Yhat)^2, na.rm = TRUE))
rmse cat("RMSE =", rmse, "\n")
## RMSE = 1.370137e-07
<- sqrt(mean((Y_grid - mpred$Ypred)^2, na.rm = TRUE))
rmspe cat("RMSPE =",rmspe,"\n")
## RMSPE = 0.01280794
3. Animage recovering example with missing pixels
Next, we consider smoothing and retrieving of an image, the standard test image Lena Sjööblom, as an example. The full noise free image is a \(155 \times 94\) pixel picture over the domain \([0, 1] \times [0, 1]\).
<- readRDS("lena.full.rds")
lena_full image(lena_full, col = gray.colors(256))
Among the total 14,570 pixels on the image, we randomly selected
\(n = 7000\) (almost \(50 \%\)) pixels and substituted them with
NA
values, depicting them as missing pixels of a distorted
image.
<- c(lena_full)
img_vec <- sample(1:length(img_vec), size = 7000)
mis_pos <- NA
img_vec[mis_pos] <- matrix(img_vec, nrow = nrow(lena_full))
img_mis_pos.mtx image(img_mis_pos.mtx, col = gray.colors(256))
title("Lena Image with 7000 Missing pixels at random")
To retrieve the original image from the given distorted one, we consider a three-step method as follows:
First, we triangulate the domain appropriately (according to some suitable choice for parameter \(n\) in
TriMesh
). The choice of \(n\) was made after considering several constraints like computation time and accuracy.Then, we conduct a BPST smoothing on the triangulated domain with the available observations/pixels (\(14570 - 7000 = 7570\) pixels). For this example, we will use bivariate splines with degree \(d=5\). With a choice of a higher degree, we may get more accurate approximations, but at the expense of computation time.
Once the BPST model is fitted, we can use the model to predict the grayscale value of the rest of the pixels (i.e., over the entire grid of 14,570 pixels) as in the original image.
# Boundary & Grid points
<- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))
bb
<- dim(lena_full)[1]
nx <- dim(lena_full)[2]
ny
<- seq(min(bb[, 1]), max(bb[, 1]), length = nx)
xm <- seq(min(bb[, 2]), max(bb[, 2]), length = ny)
yn <- expand.grid(xm, yn)
xy_grid <- xy_grid$Var1
xx <- xy_grid$Var2
yy <- cbind(xx, yy)
Z
# Triangulation
<- TriMesh(bb, n = 7) VT
<- VT$V; Tr <- VT$Tr
V
# Fitted model
<- fit.BPST(Y = img_vec[-c(mis_pos)], Z = Z[-c(mis_pos), ],
mfit V = VT$V, Tr = VT$Tr, d = 5, r = 1)
# Prediction over the entire pixels
<- predict.BPST(mfit, Zpred = Z)
mpred <- mpred$Ypred
Yhat
# Predicted image over the entire pixels
<- matrix(Yhat, nrow = nrow(lena_full))
Yhat_mtx image(xm, yn, Yhat_mtx, col = gray.colors(256), xlab = "", ylab = "")
lines(bb, col = 4)
<- paste()
ti title("Lena image recovered by BPST with degree = 5 and n = 7")
4. Example of estimating a bivariate function using BPST with 3D visualization
Let us first generate the data from a rectangular domain \([0, 1] \times [0, 1]\) using the regression model \(Y = \mu(x_1, x_2) + \epsilon\).
We consider two different signal function for the regression :
- Linear: \(\mu_1(x_1, x_2) = 10x_1 + x_2 + 19\);
- Sinusoidal: \(\mu_2(x_1, x_2) = 24 + 5\sin\{\pi(x_1^2 + x_2^2)\}\).
The error (\(\epsilon\)) is generated from a standard normal distribution.
# Linear signal function
<- function(x){
mu1 return(10*x[1] + x[2] + 19)
}
# Sinusoidal signal function
<- function(x){
mu2 return(24 + 5*sin(pi*(x[1]^2 + x[2]^2)))
}
Then, we create a \(101 \times 101\)-point grid with values evenly spaced between 0 and 1. We obtained the true signal and noisy observation for each coordinate pair \((x_1, x_2)\) lying on the grid in the unit square.
# Boundary & Grid points
<- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))
bb plot(bb, type = "l")
<- 101
nx <- 101
ny
<- seq(min(bb[, 1]), max(bb[, 1]), length = nx)
xm <- seq(min(bb[, 2]), max(bb[, 2]), length = ny)
yn <- expand.grid(xm, yn)
xy_grid <- xy_grid$Var1
xx <- xy_grid$Var2
yy <- cbind(xx, yy)
Z
# True signal
<- apply(Z, 1, mu1)
tru_sig1 <- apply(Z, 1, mu2)
tru_sig2
# signal + noise
<- tru_sig1 + rnorm(length(tru_sig1))
nois_resp1 <- tru_sig2 + rnorm(length(tru_sig2)) nois_resp2
Next, we take a random sample of size \(n = 2000\) from the grid of \(101 \times 101\) points.
# sample of obs val
<- sample(1:length(nois_resp1), size = 2000)
samp <- nois_resp1[samp]
Y1 <- nois_resp2[samp] Y2
Now we triangulate the domain using TriMesh()
<- TriMesh(bb, n = 3) VT
<- VT$V
V <- VT$Tr
Tr cat("The number of vertices in this triangulation is ", nrow(V),
" and the number of triangles is", nrow(Tr), "\n")
## The number of vertices in this triangulation is 16 and the number of triangles is 18
Applying BPST based on this triangulation, we fit our model (BPST with degree \(d=3\)) with \(n = 2000\) observations, and then, with the help of the fitted model, we predict the value of response \(Y\) over the entire grid on the domain.
library(plotly)
# Model fit
<- 3
deg
<- fit.BPST(Y = Y1, Z = Z[samp, ], V = V, Tr = Tr, d = deg)
mfit1 <- fit.BPST(Y = Y2, Z = Z[samp, ], V = V, Tr = Tr, d = deg)
mfit2
# Predictions
<- predict.BPST(mfit1, Zpred = Z)
pred1 <- predict.BPST(mfit2, Zpred = Z)
pred2
# predicted surface through BPST along with observed sample points
<- matrix(pred1$Ypred, nrow = 101, byrow = T)
z1 <- matrix(pred2$Ypred, nrow = 101, byrow = T)
z2
<- data.frame(Z[samp, 1], Z[samp, 2], Y1)
df1
<- plot_ly(
plotly_figure1 x = unique(Z[, 1]),
y = unique(Z[, 2]),
z = z1,
type = "surface"
%>%
) add_trace(data = df1, x = df1[ , 1], y = df1[ , 2], z = df1[ , 3],
mode = "markers", type = "scatter3d",
marker = list(size = 1, color = "black"))%>%
layout(scene = list(camera = list(
eye = list(
x = -1.25,
y = -1.25,
z = 1.25
),center = list(x = 0,
y = 0,
z = 0)
)))
subplot(plotly_figure1)
<- data.frame(Z[samp, 1], Z[samp, 2], Y2)
df2
<- plot_ly(
plotly_figure2 x = unique(Z[, 1]),
y = unique(Z[, 2]),
z = z2,
type = "surface"
%>%
) add_trace(data = df2, x = df2[ , 1], y = df2[ , 2], z = df2[ , 3],
mode = "markers", type = "scatter3d",
marker = list(size = 1, color = "black"))%>%
layout(scene = list(camera = list(
eye = list(
x = -1.25,
y = -1.25,
z = 1.25
),center = list(x = 0,
y = 0,
z = 0)
)))subplot(plotly_figure2)
Few notes to consider
d
: the degree of the polynomial fitted in the BPST method has to be chosen wisely based on the time constraint. Computation time increases with increase ind
.The parameter
n
in theTriMesh()
function should be chosen in such a way that- The triangles that are generated are more or less uniform in size;
- The angles of the triangles are not too acute.