I used the following criterion to judge whether some quantity measures the roughness penalty level well:

*Increasing the training sample size should influence the optimal roughness penalty level in a monotonic way.*

I compared the performance of the `df` and `spar` parametrizations of `smooth.spline` using a sample function (defined in `gen.data` below) for different training sample sizes. Here is the code for the `df` parametrization:

```r
set.seed(1)

# Generate n observations from the test function with N(0, 1) noise.
gen.data <- function(n) {
  x <- runif(n, -2, 2)
  y <- x^2 / 2 + sin(4 * x) + rnorm(n)
  return(data.frame(x, y))
}

df.levels <- seq(5, 15, length.out = 100)
n.train <- (3^(0:5)) * (2^(6:1))  # training sizes: 64, 96, 144, 216, 324, 486
cols <- 1
reps <- 100
valid <- gen.data(100000)  # large validation set for estimating test MSE

plot(NULL, xlab = "df", ylab = "mse", xlim = c(5, 18), ylim = c(1, 1.3))
for (n in n.train) {
  mse <- rep(0, length(df.levels))
  for (j in 1:reps) {
    train <- gen.data(n)
    for (i in seq(along.with = df.levels)) {
      ss <- smooth.spline(train, df = df.levels[i])
      ss.y <- predict(ss, valid$x)$y
      mse[i] <- mse[i] + mean((ss.y - valid$y)^2)
    }
  }
  mse <- mse / reps  # average validation MSE over replications
  lines(df.levels, mse, col = cols, lwd = 2)
  points(df.levels[which.min(mse)], min(mse), col = cols, pch = 19)
  text(15, mse[length(mse)], paste("n =", n), col = cols, pos = 4)
  cols <- cols + 1
}
```

It produces the following result:

The plot shows the desired property: the optimal `df` shifts monotonically as the training sample size grows. A similar plot can be obtained for the `spar` parameter by a simple modification of the code, sketched below.
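A minimal sketch of that modification, reusing `gen.data`, `n.train`, `reps`, and `valid` from the listing above (the `spar.levels` grid and the axis limits are my assumptions, not values from the original):

```r
# Sweep spar instead of df; everything else stays the same.
spar.levels <- seq(0.2, 1.2, length.out = 100)  # assumed grid
cols <- 1

plot(NULL, xlab = "spar", ylab = "mse", xlim = c(0.2, 1.5), ylim = c(1, 1.3))
for (n in n.train) {
  mse <- rep(0, length(spar.levels))
  for (j in 1:reps) {
    train <- gen.data(n)
    for (i in seq(along.with = spar.levels)) {
      # The only substantive change: parametrize the smoother by spar.
      ss <- smooth.spline(train, spar = spar.levels[i])
      ss.y <- predict(ss, valid$x)$y
      mse[i] <- mse[i] + mean((ss.y - valid$y)^2)
    }
  }
  mse <- mse / reps
  lines(spar.levels, mse, col = cols, lwd = 2)
  points(spar.levels[which.min(mse)], min(mse), col = cols, pch = 19)
  text(1.2, mse[length(mse)], paste("n =", n), col = cols, pos = 4)
  cols <- cols + 1
}
```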

It is easy to notice that the optimal values of `spar` do not change monotonically as the number of observations increases.

This comparison suggests that `df` is a better measure of the regularization level than `spar`.
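As a side note, a fitted `smooth.spline` object reports both scales, which makes it easy to see how `spar` and `df` correspond on a given sample (the sample size 200 and `df = 8` below are arbitrary choices for illustration):

```r
# A fit requested via df also carries the matching spar and lambda,
# so the two parametrizations can be compared on the same data.
fit <- smooth.spline(gen.data(200), df = 8)
c(df = fit$df, spar = fit$spar, lambda = fit$lambda)
```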

Additionally, one can notice that the curves for different training sample sizes *intersect* under the `spar` parametrization, which is unexpected. It might be due solely to the randomness of the data generation process, but I have run the simulation several times and the curves always intersected. Unfortunately, I do not have a proof of what should happen when the validation set size and the `reps` parameter both tend to infinity.
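For reference, in that limit each averaged curve would estimate the expected prediction error at a given penalty setting, with the expectation taken over both the test pair and the training sample (notation mine, not from the original post):

$$\mathrm{MSE}_n(\lambda) \;=\; \mathbb{E}_{T_n}\,\mathbb{E}_{(X,Y)}\Big[\big(\hat f_{T_n,\lambda}(X) - Y\big)^2\Big],$$

where $T_n$ is a training sample of size $n$ drawn from `gen.data` and $\hat f_{T_n,\lambda}$ is the smoothing spline fitted to it. The open question is whether these limiting curves can intersect across different $n$ under the `spar` parametrization.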
