A Brief Guide To The `BPST` Package in `R`

Introduction

Splines can be considered piecewise polynomials that possess some overall continuity and smoothness criteria, especially at the divisions or knots (created to fit a local polynomial through the given data points on each division) over the entire domain.

The picture is from Friedman et. al.

Splines have proven to be an extremely powerful tool for smoothing observed data in the areas of mathematics, nonparametric statistics, functional data analysis, computer graphics, etc. Splines are the popular choice in these fields of study because of the simplicity of their construction, ease, accuracy of evaluation, and capacity to approximate the underlying functional component of interest over domains under complicated smoothness constraints.

The usual spline-based approaches to multivariate estimation problems involving tensor product spaces are most useful when the data are observed in a regular (e.g., rectangular) domain. However, the form of tensor products is undesirable when the data locations are spread in a general bounded domain of arbitrary shape (with complex boundaries and holes).

Triangulation is then the most appropriate tool to partition the domain (we will focus only on the 2D spaces in this blog) into segments and considering bivariate splines, the smooth piecewise polynomial functions over triangulation are then the most natural extensions of univariate spline functions over subintervals.

In this blog, we will discuss the application of bivariate penalized splines over triangulations through different functions in the BPST R package for approximating the patterns of observed data points over the domain.

How to install

The BPST package is available on GitHub. To install the package directly from GitHub, we need another R package, devtools. It can be installed as follows:

install.packages("devtools")

Now, the BPST package can be installed using the following code:

library(devtools)
devtools::install_github("FIRST-Data-Lab/BPST")
devtools::install_github("FIRST-Data-Lab/Triangulation")

Next, we need to load the BPST package to use:

library(BPST)

Since our interest lies in the bivariate spline functions over triangulation for the domain under consideration, it is necessary to install the Triangulation package. A brief guide can be found here. Once installed, we need to load the Triangulation package using

library(Triangulation)

Different functions available in `BPST`

The package contains several functions contributing to the main purpose of fitting bivariate penalized splines over triangulation and can be shortlisted for a brief introduction as follows:

inVT: decides whether a point is inside a given triangulation.
basis: generates the basis for bivariate spline over triangulation.
data.BPST: generates the testing datasets for bivariate spline smoothing.
fit.BPST: conducts the model fitting via bivariate penalized spline over triangulation.
plot.BPST: produces the contour map for the estimated surface of a fitted “BPST” object.
predict.BPST: provides predictions of a fitted BPST object.

How to use

In this section, we will mainly concentrate on the implementation of BPST with some examples which are supposed to mimic different real-life scenarios that we come across involving both simple and complex domains.

Let \((Z_{i,1}, Z_{i,2})^{\top}\) represent the location of the \(i\)th observation, \(i = 1, \ldots, n\) ranging over a bounded domain \(\Omega\) of arbitrary shape. At the \(i\)th location, \(Y_i\) is the response variable. We consider the following nonparametric regression model: \[ Y_i=\mu(z_{i,1},z_{i,2})+\sigma\varepsilon_i, \] where \(\mu(\cdot)\) is an unknown but smooth bivariate function to be estimated, and \(\varepsilon_i\)’s are the random noises with mean 0 and variance 1.

1. A simulated example on a rectangular domain

The “BPST” package offers several simulation-based examples, and the value of the response variable in those examples can be generated using the data.BPST() function, in which we can specify the options using data.BPST(Z, V, Tr, func, sigma, seed).

Domain and its triangulation

To generate data, we consider a basic rectangular domain defined on \([0, 2] \times [0, 1]\)

bb <- rbind(c(0,0), c(2,0), c(2,1), c(0,1), c(0,0))
plot(bb, type = "l", col = 2)

The TriMesh() function from the package Triangulation can then be used to triangulate this rectangular domain.

VT <- TriMesh(Pt = bb, n = 3, H = NULL)

cat("The number of vertices in this triangulation is ", nrow(VT$V), " and the number of triangles is", nrow(VT$Tr), "\n")

## The number of vertices in this triangulation is  28  and the number of triangles is 36

Data generation

Next, we generate \(101\times 101\) grid points over the domain \([0, 2] \times [0, 1]\).

# Number of grid division along X axis
n1_grid <- 101                        
# Number of grid division along Y axis
n2_grid <- 101                          
# Total number of grid points over [0, 2] X [0, 1]
n_grid <- n1_grid * n2_grid             

u_grid <- seq(0, 2, length.out = n1_grid)
v_grid <- seq(0, 1, length.out = n2_grid)
uu_grid <- rep(u_grid, each = n2_grid)
vv_grid <- rep(v_grid, times = n1_grid)
Z_grid <- as.matrix(cbind(uu_grid, vv_grid))
#plot(Z.grid)

Given the data points Z_grid, we can consider 8 different choices of \(\mu(\cdot)\) (func=1, …, func=8) and \(\sigma\) (sigma). For example, if we take func = 1 and sigma = 0.1, then the value of the response at the \(i\)th point (\(i = 1, 2, \ldots,\) n_grid) becomes:

 Y[i] = Z_grid[i, 1]^2 + 3 * Z_grid[i, 2]^2 + 4 * Z_grid[i, 1] * Z_grid[i, 2] +
        rnorm(1, mean = 0, sd = sigma)

In the following, we generate the response values at the grid points defined by Z_grid.

# argument of `data.BPST()`, denoting the choice of test function; 
# possible choices are 1, 2, ..., 8.
func <- 1  

# argument of `data.BPST()`, denoting the noise we add for the
# generation of response values
sigma <- 0.1  

seed <- 2022

gridpoints <- data.BPST(Z = Z_grid, V = as.matrix(VT$V),
                       Tr = as.matrix(VT$Tr), func, sigma, seed)

# Response value at the grid points over [0, 2] X [0, 1]
Y_grid <- gridpoints$Y  

# value of the true underlying function at the grid points
mu_grid <- gridpoints$mu 

# Identify the points that are outside the triangulated domain
ind <- gridpoints$ind
ind_grid <- (1:n_grid)[ind == 1] 

## Plot of the true underlying function on the  domain
true_func <- matrix(mu_grid, length(u_grid), length(v_grid), byrow = T)
image(u_grid, v_grid, true_func, col=heat.colors(100), xlab="x", ylab="y")
lines(bb[ , 1], bb[ , 2], lwd = 1)
contour(u_grid, v_grid, true_func, add=TRUE)
title("Plot of the true function on the domain")

We have generated the population data with coordinates and corresponding response values at those coordinates. Next, we will now consider a random sample of observations from the population.

# sample size
n <- 200

# sampling from only those population points which are inside the
# triangulated domain
ind_sam <- sort(sample(ind_grid, n)) 

## In this case, even though all the grid points will be within the
## rectangular domain, this step is for any generic case where the domain 
## may contain some holes in it and we might have some of the grid points 
## lying outside the triangulated domain
 
Z <- as.matrix(gridpoints$Z[ind_sam, ]) # sampled data coordinates within the domain
head(Z)

##      [,1] [,2]
## [1,] 0.00 0.27
## [2,] 0.00 0.31
## [3,] 0.00 0.56
## [4,] 0.00 0.69
## [5,] 0.02 0.07
## [6,] 0.02 0.74

Y <- as.matrix(gridpoints$Y[ind_sam])  # response value at the points `Z`
head(Y)

##            [,1]
## [1,] 0.25150591
## [2,] 0.37145185
## [3,] 0.94311833
## [4,] 1.47059731
## [5,] 0.04710159
## [6,] 1.72930570

# boundary of the rectangular domain in 2D
plot(bb, type = "l", col = 2) 
points(Z, col = 4, pch = 20)
title(main = "Data on the rectangular domain")

Now we have defined the rectangular domain \([0, 2] \times [0, 1]\) and generated the sample containing the coordinates and response value on 200 points on the domain. Let us try to estimate the regression function based on the sampled data using the bivariate smoothing over triangulation (BPST) method in Lai and Wang (2013).

Bivariate spline smoothing

Now to fit a model on the above simulated dataset using BPST smoothing, we use fit.BPST():

mfit <- fit.BPST(Y = Y, Z = Z, V = VT$V, Tr = VT$Tr, d = 4,
                r = 1, lambda = 10^seq(-6, 6, by=0.5))

# Y: The response variable associated with the observed data points.
# Z: The coordinates of observed data points (dimension 200 X 2).
#     Each row is the coordinates of a data point.
# V: The N by 2 matrix of vertices of a triangulation, 
#     where N is the number of vertices. 
#     Each row is the coordinates for a vertex.
# Tr: The triangulation matrix of dimension nT by 3,
#     where nT is the number of triangles in the triangulation. 
#     Each row is the indices of vertices in V.
# d: The degree of piecewise polynomials.
# r: The smoothness parameter
# lambda: The tuning parameter

# For more details of the arguments and outputs of `fit.BPST()`,
# use ?BPST::fit.BPST()

Once we get the model fitted in the above step based on the observed data, we can now use this fitted model to predict the value of the response variable on each of the grid points of that rectangular domain as follows

# prediction of the response values on the 
# entire grid over the domain based on the
# model fitted from observed data
mpred <- predict.BPST(mfit, Z_grid) 

# The fitted surface of the response over the entire domain

pred_func <- matrix(mpred$Ypred, length(u_grid), length(v_grid), byrow = T)
image(u_grid, v_grid, pred_func, col = heat.colors(100), xlab = "x", ylab = "y")
lines(bb[ , 1], bb[ , 2], lwd = 1)
contour(u_grid, v_grid, pred_func, add = TRUE)
title("Plot of the estimated test function via BPST")

Now here, since we are showcasing a simulation study, we do have the knowledge regarding the true function dictating the response values over the domain.

We can find the root mean square error (RMSE) for the fitted model as:

rmse <- (mean((Y - mfit$Yhat)^2, na.rm = TRUE))
cat("RMSE =", rmse, "\n")

## RMSE = 0.007467668

Also, we can examine the root mean square prediction error (RMSPE for predicting the response values over the entire domain based on the fitted model can be obtained.

rmspe <- sqrt(mean((Y_grid - mpred$Ypred)^2, na.rm = TRUE))
cat("RMSPE =", rmspe,"\n")

## RMSPE = 0.108686

2. A simulation example on a complex horseshoe domain

In this example, we consider a test function defined on a Horseshoe domain (Wood et al., 2008) for our demonstration of the BPST function involving complex domains. The test function and domain are well documented in the R package mgcv. The horseshoe domain reflects an example that the domain is almost divided into two parts by a strict concavity with a very narrow gap between the two segments of the domain containing a very high contrast of low and high-density regions.

First let us plot the boundary of the horseshoe domain:

bb <- Triangulation::horseshoe
plot(bb, type = "l", col = 2)

The triangulation of the domain can be obtained as:

VT <- TriMesh(Pt = bb, n = 5)

cat("The number of vertices in this triangulation is ", nrow(VT$V), " and the number of triangles is", nrow(VT$Tr), "\n")

## The number of vertices in this triangulation is  90  and the number of triangles is 112

Similar to the previous example, a fine grid over the domain needs to be established.

if(!require('pracma')) install.packages('pracma')
xm <- seq(-1, 3.5, length = 101)
yn <- seq(-1, 1, length = 101)
xy_grid <- pracma::meshgrid(xm, yn)
xx <- c(xy_grid$X)
yy <- c(xy_grid$Y)
gridpoints <- as.matrix(cbind(xx, yy))

The response value at the grid points can then be obtained using the test function given in the mgcv package

if(!require('mgcv')) install.packages('mgcv')
## Put response values = NA for points outside the
## horseshoe domain
Y_grid <- mgcv::fs.test(xx, yy)

## Plot of the true test function over domain
true_func <- matrix(fs.test(xx,yy), length(xm), length(yn), byrow = T)
image(xm, yn, true_func, col=heat.colors(100), xlab="x", ylab="y")
lines(bb$V1, bb$V2, lwd = 1)
contour(xm, yn, true_func, add=TRUE)
title("Plot of the true test function on the horseshoe domain")

Now to draw a sample of some certain size, say 200, we will first sort out which of the grid points are within the triangulated horseshoe domain; and then we will randomly select 200 sample points from that population.

## Checking which of the grid values are within 
## the triangulated horseshoe domain
ind <- BPST::inVT(V0 = as.matrix(VT$V), Tr0 = as.matrix(VT$Tr),
                  xx, yy)
ind_grid <- (1:nrow(gridpoints))[ind$ind == 1]

## Drawing a sample from horseshoe domain
n <- 200
ind_sam <- sort(sample(ind_grid, n))

## sampled data and the response values at the sampled points
samp_data <- gridpoints[ind_sam, ]
samp_res <- Y_grid[ind_sam]

## Plot of the sampled data on the horseshoe domain
plot(bb, type = "l", col = 2)
points(samp_data, col = 4, pch = 20)
title(main = "Sampled data on the horseshoe domain")

Given the sampled data, our focus is now to fit the model using BPST smoothing and use the fitted model to predict the value of the response on the grid points.

## Model fitted based on the observed  data
mfit <- fit.BPST(Y = samp_res, Z = samp_data, V = VT$V, Tr = VT$Tr, d = 3,
                r = 1, lambda = 10^seq(-6,6,by=0.5))

# prediction of the response values on the
# grid points over the domain based on the
# fitted model from observed data
mpred <- predict.BPST(mfit, gridpoints)

## Plot of the estimated test function via BPST
pred_func <- matrix(mpred$Ypred, length(xm), length(yn), byrow = T)
image(xm, yn, pred_func, col=heat.colors(100), xlab="x", ylab="y")
lines(bb$V1, bb$V2, lwd = 1)
contour(xm, yn, pred_func, add=TRUE)
title("Plot of the estimated test function via BPST")

The RMSE and the RMSPE for this model fit can be obtained by:

rmse <- (mean((samp_res - mfit$Yhat)^2, na.rm = TRUE))
cat("RMSE =", rmse, "\n")

## RMSE = 1.370137e-07

rmspe <- sqrt(mean((Y_grid - mpred$Ypred)^2, na.rm = TRUE))
cat("RMSPE =",rmspe,"\n")

## RMSPE = 0.01280794

3. Animage recovering example with missing pixels

Next, we consider smoothing and retrieving of an image, the standard test image Lena Sjööblom, as an example. The full noise free image is a \(155 \times 94\) pixel picture over the domain \([0, 1] \times [0, 1]\).

lena_full <- readRDS("lena.full.rds")
image(lena_full, col = gray.colors(256))

Among the total 14,570 pixels on the image, we randomly selected \(n = 7000\) (almost \(50 \%\)) pixels and substituted them with NA values, depicting them as missing pixels of a distorted image.

img_vec <- c(lena_full)
mis_pos <- sample(1:length(img_vec), size = 7000)
img_vec[mis_pos] <- NA
img_mis_pos.mtx <- matrix(img_vec, nrow = nrow(lena_full))
image(img_mis_pos.mtx, col = gray.colors(256))
title("Lena Image with 7000 Missing pixels at random")

To retrieve the original image from the given distorted one, we consider a three-step method as follows:

First, we triangulate the domain appropriately (according to some suitable choice for parameter \(n\) in TriMesh). The choice of \(n\) was made after considering several constraints like computation time and accuracy.
Then, we conduct a BPST smoothing on the triangulated domain with the available observations/pixels (\(14570 - 7000 = 7570\) pixels). For this example, we will use bivariate splines with degree \(d=5\). With a choice of a higher degree, we may get more accurate approximations, but at the expense of computation time.
Once the BPST model is fitted, we can use the model to predict the grayscale value of the rest of the pixels (i.e., over the entire grid of 14,570 pixels) as in the original image.

# Boundary & Grid points
bb <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))

nx <- dim(lena_full)[1]
ny <- dim(lena_full)[2]

xm <- seq(min(bb[, 1]), max(bb[, 1]), length = nx)
yn <- seq(min(bb[, 2]), max(bb[, 2]), length = ny)
xy_grid <- expand.grid(xm, yn)
xx <- xy_grid$Var1
yy <- xy_grid$Var2
Z <- cbind(xx, yy)

# Triangulation
VT <- TriMesh(bb, n = 7)

V <- VT$V; Tr <- VT$Tr

# Fitted model
mfit <- fit.BPST(Y = img_vec[-c(mis_pos)], Z = Z[-c(mis_pos), ],
                V = VT$V, Tr = VT$Tr, d = 5, r = 1)

# Prediction over the entire pixels
mpred <- predict.BPST(mfit, Zpred = Z)
Yhat <- mpred$Ypred

# Predicted image over the entire pixels
Yhat_mtx <- matrix(Yhat, nrow = nrow(lena_full))
image(xm, yn, Yhat_mtx, col = gray.colors(256), xlab = "", ylab = "")
lines(bb, col = 4)
ti <- paste()
title("Lena image recovered by BPST with degree = 5 and n = 7")

4. Example of estimating a bivariate function using BPST with 3D visualization

Let us first generate the data from a rectangular domain \([0, 1] \times [0, 1]\) using the regression model \(Y = \mu(x_1, x_2) + \epsilon\).

We consider two different signal function for the regression :

Linear: \(\mu_1(x_1, x_2) = 10x_1 + x_2 + 19\);
Sinusoidal: \(\mu_2(x_1, x_2) = 24 + 5\sin\{\pi(x_1^2 + x_2^2)\}\).

The error (\(\epsilon\)) is generated from a standard normal distribution.

# Linear signal function
mu1 <- function(x){
  return(10*x[1] + x[2] + 19)
}

# Sinusoidal signal function
mu2 <- function(x){
  return(24 + 5*sin(pi*(x[1]^2 + x[2]^2)))
}

Then, we create a \(101 \times 101\)-point grid with values evenly spaced between 0 and 1. We obtained the true signal and noisy observation for each coordinate pair \((x_1, x_2)\) lying on the grid in the unit square.

# Boundary & Grid points
bb <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))
plot(bb, type = "l")

nx <- 101
ny <- 101

xm <- seq(min(bb[, 1]), max(bb[, 1]), length = nx)
yn <- seq(min(bb[, 2]), max(bb[, 2]), length = ny)
xy_grid <- expand.grid(xm, yn)
xx <- xy_grid$Var1
yy <- xy_grid$Var2
Z <- cbind(xx, yy)

# True signal
tru_sig1 <- apply(Z, 1, mu1)
tru_sig2 <- apply(Z, 1, mu2)

# signal + noise
nois_resp1 <- tru_sig1 + rnorm(length(tru_sig1))
nois_resp2 <- tru_sig2 + rnorm(length(tru_sig2))

Next, we take a random sample of size \(n = 2000\) from the grid of \(101 \times 101\) points.

# sample of obs val
samp <- sample(1:length(nois_resp1), size = 2000)
Y1 <- nois_resp1[samp]
Y2 <- nois_resp2[samp]

Now we triangulate the domain using TriMesh()

VT <- TriMesh(bb, n = 3)

V <- VT$V
Tr <- VT$Tr
cat("The number of vertices in this triangulation is ", nrow(V),
     " and the number of triangles is", nrow(Tr), "\n")

## The number of vertices in this triangulation is  16  and the number of triangles is 18

Applying BPST based on this triangulation, we fit our model (BPST with degree \(d=3\)) with \(n = 2000\) observations, and then, with the help of the fitted model, we predict the value of response \(Y\) over the entire grid on the domain.

library(plotly)
# Model fit
deg <- 3

mfit1 <- fit.BPST(Y = Y1, Z = Z[samp, ], V = V, Tr = Tr, d = deg)
mfit2 <- fit.BPST(Y = Y2, Z = Z[samp, ], V = V, Tr = Tr, d = deg)

# Predictions
pred1 <- predict.BPST(mfit1, Zpred = Z)
pred2 <- predict.BPST(mfit2, Zpred = Z)

# predicted surface through BPST along with observed sample points
z1 <- matrix(pred1$Ypred, nrow = 101, byrow = T)
z2 <- matrix(pred2$Ypred, nrow = 101, byrow = T)

df1 <- data.frame(Z[samp, 1], Z[samp, 2], Y1)

plotly_figure1 <- plot_ly(
  x = unique(Z[, 1]),
  y = unique(Z[, 2]),
  z = z1,
  type = "surface"
) %>%
  add_trace(data = df1, x = df1[ , 1], y = df1[ , 2], z = df1[ , 3],
            mode = "markers", type = "scatter3d",
            marker = list(size = 1, color = "black"))%>%
  layout(scene = list(camera = list(
    eye = list(
      x = -1.25,
      y = -1.25,
      z = 1.25
    ),
    center = list(x = 0,
                  y = 0,
                  z = 0)
  )))

subplot(plotly_figure1)

df2 <- data.frame(Z[samp, 1], Z[samp, 2], Y2)

plotly_figure2 <- plot_ly(
  x = unique(Z[, 1]),
  y = unique(Z[, 2]),
  z = z2,
  type = "surface"
) %>%
  add_trace(data = df2, x = df2[ , 1], y = df2[ , 2], z = df2[ , 3],
            mode = "markers", type = "scatter3d",
            marker = list(size = 1, color = "black"))%>%
  layout(scene = list(camera = list(
    eye = list(
      x = -1.25,
      y = -1.25,
      z = 1.25
    ),
    center = list(x = 0,
                  y = 0,
                  z = 0)
  )))
subplot(plotly_figure2)

Few notes to consider

d: the degree of the polynomial fitted in the BPST method has to be chosen wisely based on the time constraint. Computation time increases with increase in d.
The parameter n in the TriMesh() function should be chosen in such a way that
- The triangles that are generated are more or less uniform in size;
- The angles of the triangles are not too acute.

References and further readings

A Brief Guide To The BPST Package in R