
P1: OTE/OTE/SPH P2: OTE

JWBK419-FM JWBK419/Livingstone September 25, 2009 13:8 Printer Name: Yet to Come

A Practical Guide to

Scientific Data Analysis

David Livingstone

ChemQuest, Sandown, Isle of Wight, UK

A John Wiley and Sons, Ltd., Publication


P1: OTA/XYZ P2: ABC

JWBK419-06 JWBK419/Livingstone September 26, 2009 15:12 Printer Name: Yet to Come

MULTIPLE LINEAR REGRESSION 161

Figure 6.6 Snapshot of the output from the statistics package Systat showing the first step in a forward inclusion regression analysis.

since the model was chosen from a large pool of variables and thus may suffer from ‘selection bias’ as discussed in Section 6.4.4. Finally, this particular program also reports some problems with individual cases, showing that case 8 is an outlier and that case 14 has large leverage. Discussion of these problems is outside the scope of this chapter but can be found in references [3, 4 & 5] and in the help files of most statistics programs.

6.3.1.2 Backward Elimination

This procedure begins by construction of a single linear regression model which contains all of the independent variables and then removes them one at a time. Each term in the equation is examined for its contribution to the model, for example by comparison of F-to-remove (F_remove) values.


Figure 6.7 Snapshot of the last stage in a forward inclusion analysis.

F-to-remove is defined by an equation similar to (6.18):

    F_remove = (RSS_{p-1} − RSS_p) / (RSS_p / (n − p − 1)),    (6.19)

where the notation is the same as that used in Equation (6.18). The variable making the smallest contribution (lowest F_remove) is removed and the regression model is recalculated, now with one term fewer. Any of the usual regression statistics can be used to assess the fit of this new model to the data, and the procedure can be continued until a satisfactory multiple regression equation is obtained. Satisfactory here may mean an equation with a desired correlation coefficient or a particular number of independent variables, etc.
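The loop just described can be sketched in a few lines of Python (a minimal illustration, not code from the book; the stopping threshold of 4 for F_remove and the small orthogonal-design data set are assumptions made for the example):

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an ordinary least-squares fit
    of y on X (an intercept column is added automatically)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

def backward_eliminate(X, y, f_threshold=4.0):
    """Remove the variable with the smallest F-remove, Equation (6.19),
    until every surviving term exceeds the (illustrative) threshold.
    Returns the column indices of the retained variables."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        n, p = len(y), len(keep)
        rss_full = rss(X[:, keep], y)            # RSS_p: all current terms
        f_vals = []
        for j in keep:
            reduced = [k for k in keep if k != j]
            rss_red = rss(X[:, reduced], y)      # RSS_{p-1}: term j removed
            f_vals.append((rss_red - rss_full) / (rss_full / (n - p - 1)))
        worst = int(np.argmin(f_vals))
        if f_vals[worst] >= f_threshold:
            break                                # every remaining term earns its place
        keep.pop(worst)                          # drop the weakest contributor
    return keep

# Tiny orthogonal-design demonstration: y depends on columns 0 and 1 only,
# so column 2 has F_remove near zero and is eliminated first.
H = np.array([[1, 1], [1, -1]])
H8 = np.kron(H, np.kron(H, H))                   # 8 x 8 Hadamard: orthogonal columns
X = H8[:, 1:4].astype(float)
y = 2.0 * H8[:, 1] - H8[:, 2] + 0.1 * H8[:, 4]   # small residual kept off the model columns
print(backward_eliminate(X, y))                  # → [0, 1]
```

Any other stopping rule mentioned in the text (a target correlation coefficient, a fixed number of variables) could be substituted for the F threshold without changing the structure of the loop.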

Backward elimination and forward inclusion might be viewed as means of producing the same result from opposite directions. However, what may be surprising is that application of the two procedures to the same data set does not necessarily yield the same answer. Newcomers to data analysis may find this disturbing and for some this may reinforce the prejudice that ‘statistics will give you any answer that you want’, which of course it can. The explanation of the fact that forward inclusion and backward elimination can lead to different models lies in the presence of collinearity and multicollinearity in the data. A multiple regression equation may be viewed as a set of variables which between them account
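A small synthetic sketch (not an example from the book) shows how collinearity can produce this divergence. When two predictors are nearly collinear and the response depends only on their difference, each variable on its own appears almost worthless, so forward inclusion may never enter either one; backward elimination, starting from the full model, finds an essentially perfect fit and keeps both:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 0.1 * rng.normal(size=100)   # nearly collinear with x1
y = x1 - x2                            # response depends only on the difference

# Screened one at a time (forward inclusion's view), each predictor
# correlates only weakly with y ...
r1 = np.corrcoef(x1, y)[0, 1]
r2 = np.corrcoef(x2, y)[0, 1]
print(round(abs(r1), 2), round(abs(r2), 2))   # both small

# ... but the full two-variable model (backward elimination's starting
# point) reproduces y exactly, so neither term would be eliminated.
A = np.column_stack([np.ones(100), x1, x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef
r2_joint = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
print(round(r2_joint, 4))                     # → 1.0
```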


P1: OTA/XYZ P2: ABC

JWBK419-IND JWBK419/Livingstone September 28, 2009 15:46 Printer Name: Yet to Come

340 INDEX

replicates/replication 28
residual mean square (MSR) 155–6
residual sum of squares (RSS) 150, 160, 208
response data 223
responses 28
retrons 268
retrosynthesis 268
Reversible Nonlinear Dimension Reduction (ReNDeR) network 278–80, 278
  plot 279–80, 279
robustness 174–7
Root Mean Squared Error of Prediction (RMSEP) 240, 287
  plot 242
rosiglitazone 294, 294
rotation 81
  nonorthogonal (oblique) 92
  orthogonal 92
rule induction 297
Saccharomyces cerevisiae 224
scales of measurement 8–10
  BC(DEF) 47
  interval 9
  nominal 8–9
  ordinal 9
  ratio 9–10
  significance of 10
  Z descriptor 47, 48
scaling 60–2
SciFit package 104, 111
scree plots 133–4, 134, 208, 242, 287
selection bias 161, 180–3
self-organising map (SOM) 105–10, 106, 277–8, 287
sensitivity 200
  analysis 292–3
set selection 25–55
significance 14
SIMCA 124, 191, 195–8, 196
  compared to k-nearest-neighbour technique (KNN) 197
  steps of 196
similarity diagram 43
simple linear regression 146–54
  assumptions for 149
SImple Modelling of Class Analogy see SIMCA
Simplified Molecular Input Line Entry System (SMILES) 256–60, 264, 271, 273
skewness 14, 18, 59
Soft Independent Modelling of Class Analogy see SIMCA
Solenopsis
  invicta 191
  richteri 191
specificity 200
spectral map 78
  analysis (SMA) 233–8
spread 14
squared multiple correlation coefficient (r2) 151, 156
standard deviation (s) 12, 58–9
  and autoscaling 61
standard error of prediction 158
standard error (SE) 157–8
standard scores 61
star plot 113, 114
Statistical Isolinear MultiCategory Analysis see SIMCA
statistics 12
  multivariate 19
  univariate 19
stepwise regression 163–4
structure–activity relationships (SAR) 310–13
substituent properties 314
  electronic effect (σ) 315–16
sulphonamides 221, 222
sulphones 221, 222
supermolecule 261, 261
supervised learning 21–2, 187–218
symbol Z 61
SYNLIB database 271
Systat package 112, 161
t statistic 157–8, 157, 179, 205
Tabu search (TS) 164
tabulation, of data sets 4
Taft equation 296
techniques
  nonparametric 10
  parametric 10
test set 22, 57, 289–90, 291
thiopurine methyltransferase 163
thioxanthene 169
THOR database 256, 259, 260
tiotidine 256, 260, 260
Topliss tree 51, 51
total squared distance (TSD) 141


total sum of squares (TSS) 150
toxicity 210–11, 301
  and mixtures 327
  prediction of 261–8, 297
  workflow system 265, 266
Toxicity Prediction by Komputer Assisted Technology (TOPKAT) program 263–4, 265, 266–7
toxicophores 262
trained networks, rule extraction from 294
training, decision to stop 289
training algorithms, selection of 288–9
training iterations 290
training of networks 277
training set 22, 26–7, 289–90, 291
  benzoic acids 87, 88
  classification of 122
  olive oils 284
  strategies for selection 40
transfer functions 276
  choice of 288–9
translation 81
treatments 28
TREPAN 294–5, 295
trial and error 26
trifluoroacetic acid 315
trifluperazine 230
trinitrobenzene 204
UDRIVE program 256, 259, 260, 260
Ultra-High Throughput Screening (Ultra-HTS) 53
ultraviolet spectra 251
unexplained sum of squares see residual sum of squares
univariate statistics 19
Unsupervised Forward Selection (UFS) 69–70
unsupervised learning 21–2, 119–44
validation set 289–90, 291
  results 301
values, missing 65
variables
  continuous 8, 11
  continuous response 284
  dependent 3, 159, 214
  discrete 8, 11
  independent 3–4, 159, 214
  indicator 169–74
  latent (LV) 206–7
  qualitative 8
  quantitative 8
  selection of 67–72, 180, 215–16
variance 241
  residual 126
  of sample (s2) 12–13, 58
  and autoscaling 61
  shared 62–3, 63–4
  of variable (V) 38–9, 49
variance-weighting 99, 100
varimax rotation 91–2, 92, 227, 239
vectors 106
  and correlation coefficients 70–1, 71
  genetic 302
water analysis 124, 136–7, 137, 197
weight vector (Wj) 106
Wiswesser line notation (WLN) 260, 301
xanthene 169
XLS-Biplot program 234
Y scrambling 177–8, 179
Z scores 61
