Title Livingstone, Data Analysis Variance Standard Deviation Mean Level Of Measurement Normal Distribution 3.4 MB 360
```
A Practical Guide to
Scientific Data Analysis
Contents
Preface
Abbreviations
1 Introduction: Data and Its Properties, Analytical Methods and Jargon
1.1 Introduction
1.2 Types of Data
1.3 Sources of Data
1.3.1 Dependent Data
1.3.2 Independent Data
1.4 The Nature of Data
1.4.1 Types of Data and Scales of Measurement
1.4.2 Data Distribution
1.4.3 Deviations in Distribution
1.5 Analytical Methods
1.6 Summary
References
2 Experimental Design – Experiment and Set Selection
2.1 What is Experimental Design?
2.2 Experimental Design Techniques
2.2.1 Single-factor Design Methods
2.2.2 Factorial Design (Multiple-factor Design)
2.2.3 D-optimal Design
2.3 Strategies for Compound Selection
2.4 High Throughput Experiments
2.5 Summary
References
3 Data Pre-treatment and Variable Selection
3.1 Introduction
3.2 Data Distribution
3.3 Scaling
3.4 Correlations
3.5 Data Reduction
3.6 Variable Selection
3.7 Summary
References
4 Data Display
4.1 Introduction
4.2 Linear Methods
4.3 Nonlinear Methods
4.3.1 Nonlinear Mapping
4.3.2 Self-organizing Map
4.4 Faces, Flowerplots and Friends
4.5 Summary
References
5 Unsupervised Learning
5.1 Introduction
5.2 Nearest-neighbour Methods
5.3 Factor Analysis
5.4 Cluster Analysis
5.5 Cluster Significance Analysis
5.6 Summary
References
6 Regression Analysis
6.1 Introduction
6.2 Simple Linear Regression
6.3 Multiple Linear Regression
6.3.1 Creating Multiple Regression Models
6.3.1.1 Forward Inclusion
6.3.1.2 Backward Elimination
6.3.1.3 Stepwise Regression
6.3.1.4 All Subsets
6.3.1.5 Model Selection by Genetic Algorithm
6.3.2 Nonlinear Regression Models
6.3.3 Regression with Indicator Variables
6.4 Multiple Regression: Robustness, Chance Effects, the Comparison of Models and Selection Bias
6.4.1 Robustness (Cross-validation)
6.4.2 Chance Effects
6.4.3 Comparison of Regression Models
6.4.4 Selection Bias
6.5 Summary
References
7 Supervised Learning
7.1 Introduction
7.2 Discriminant Techniques
7.2.1 Discriminant Analysis
7.2.2 SIMCA
7.2.3 Confusion Matrices
7.2.4 Conditions and Cautions for Discriminant Analysis
7.3 Regression on Principal Components and PLS
7.3.1 Regression on Principal Components
7.3.2 Partial Least Squares
7.3.3 Continuum Regression
7.4 Feature Selection
7.5 Summary
References
8 Multivariate Dependent Data
8.1 Introduction
8.2 Principal Components and Factor Analysis
8.3 Cluster Analysis
8.4 Spectral Map Analysis
8.5 Models with Multivariate Dependent and Independent Data
8.6 Summary
References
9 Artificial Intelligence and Friends
9.1 Introduction
9.2 Expert Systems
9.2.1 Log P Prediction
9.2.2 Toxicity Prediction
9.2.3 Reaction and Structure Prediction
9.3 Neural Networks
9.3.1 Data Display Using ANN
9.3.2 Data Analysis Using ANN
9.3.3 Building ANN Models
9.3.4 Interrogating ANN Models
9.4 Miscellaneous AI Techniques
9.5 Genetic Methods
9.6 Consensus Models
9.7 Summary
References
10 Molecular Design
10.1 The Need for Molecular Design
10.2 What is QSAR/QSPR?
10.3 Why Look for Quantitative Relationships?
10.4 Modelling Chemistry
10.5 Molecular Field and Surface Descriptors
10.6 Mixtures
10.7 Summary
References
Index
```
##### Document Text Contents
Page 2

P1: OTE/OTE/SPH P2: OTE
JWBK419-FM JWBK419/Livingstone September 25, 2009 13:8 Printer Name: Yet to Come

A Practical Guide to
Scientific Data Analysis

David Livingstone
ChemQuest, Sandown, Isle of Wight, UK

A John Wiley and Sons, Ltd., Publication

Page 180


Figure 6.6 Snapshot of the output from the statistics package Systat showing the
first step in a forward inclusion regression analysis.

since the model was chosen from a large pool of variables and thus may
suffer from ‘selection bias’ as discussed in Section 6.4.4. Finally, this
particular program also reports some problems with individual cases
showing that case 8 is an outlier and that case 14 has large leverage.
Discussion of these problems is outside the scope of this chapter but can
be found in references [3, 4 & 5] and in the help file of most statistics
programs.

6.3.1.2 Backward Elimination

This procedure begins by constructing a single regression model
which contains all of the independent variables and then removes them
one at a time. Each term in the equation is examined for its contribution
to the model, for example by comparing values of F-to-remove, $F_{\text{remove}}$.

Page 181


Figure 6.7 Snapshot of the last stage in a forward inclusion analysis.

F-to-remove is defined by an equation similar to (6.18):

$$F_{\text{remove}} = \frac{RSS_{p-1} - RSS_{p}}{RSS_{p} / (n - p - 1)}, \tag{6.19}$$

where the notation is the same as that used in Equation (6.18). The
variable making the smallest contribution is removed (lowest $F_{\text{remove}}$)
and the regression model is recalculated, now with one term fewer. Any
of the usual regression statistics can be used to assess the fit of this new
model to the data and the procedure can be continued until a satisfactory
multiple regression equation is obtained. Satisfactory here may mean an
equation with a desired correlation coefficient or a particular number of
independent variables, etc.
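
The backward elimination loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular statistics package. It assumes that F-to-remove takes the usual partial-F form, (RSS with the term removed minus RSS of the current model) divided by (RSS of the current model / (n - p - 1)), and it assumes a fixed F threshold as the stopping rule, which is only one of the possible meanings of "satisfactory". The function names `rss` and `backward_elimination`, the threshold of 4.0 and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares of a least squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def backward_elimination(X, y, f_threshold=4.0):
    """Drop the weakest variable until every remaining F-to-remove reaches f_threshold.

    Returns the retained column indices and a list of (column, F-to-remove)
    pairs for the variables that were removed, in order of removal.
    The threshold-based stopping rule is an assumption of this sketch.
    """
    n = len(y)
    kept = list(range(X.shape[1]))
    removed = []
    while len(kept) > 1:
        p = len(kept)
        if n - p - 1 <= 0:
            raise ValueError("too few cases for the number of variables")
        rss_p = rss(X[:, kept], y)
        f_remove = {}
        for j in kept:
            reduced = [k for k in kept if k != j]
            rss_reduced = rss(X[:, reduced], y)  # model with one term fewer
            f_remove[j] = (rss_reduced - rss_p) / (rss_p / (n - p - 1))
        weakest = min(f_remove, key=f_remove.get)  # smallest contribution
        if f_remove[weakest] >= f_threshold:
            break  # every remaining term contributes enough
        kept.remove(weakest)
        removed.append((weakest, f_remove[weakest]))
    return kept, removed


# Synthetic data: only columns 0 and 1 actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=40)

kept, removed = backward_elimination(X, y)
print("retained columns:", kept)
print("removed (column, F-to-remove):", removed)
```

Replacing the threshold test with a check on the correlation coefficient or on the number of remaining variables would give the other stopping criteria mentioned in the text.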

Backward elimination and forward inclusion might be viewed as
means of producing the same result from opposite directions. However,
what may be surprising is that application of the two procedures to the
same data set does not necessarily yield the same answer. Newcomers to
data analysis may find this disturbing and for some this may reinforce the
prejudice that ‘statistics will give you any answer that you want’, which
of course it can. The explanation of the fact that forward inclusion and
backward elimination can lead to different models lies in the presence of
collinearity and multicollinearity in the data. A multiple regression equation
may be viewed as a set of variables which between them account

Page 359


340 INDEX

replicates/replication 28
residual mean square (MSR) 155–6
residual sum of squares (RSS) 150, 160, 208
response data 223
responses 28
retrons 268
retrosynthesis 268
Reversible Nonlinear Dimension Reduction (ReNDeR) network 278–80, 278
plot 279–80, 279
robustness 174–7
Root Mean Squared Error of Prediction (RMSEP) 240, 287
plot 242

rosiglitazone 294, 294
rotation 81

nonorthogonal (oblique) 92
orthogonal 92

rule induction 297

Saccharomyces cerevisiae 224
scales of measurement 8–10

BC(DEF) 47
interval 9
nominal 8–9
ordinal 9
ratio 9–10
significance of 10
Z descriptor 47, 48

scaling 60–2
SciFit package 104, 111
scree plots 133–4, 134, 208, 242, 287
selection bias 161, 180–3
self-organising map (SOM) 105–10, 106, 277–8, 287

sensitivity 200
analysis 292–3

set selection 25–55
significance 14
SIMCA 124, 191, 195–8, 196

compared to k-nearest-neighbour technique (KNN) 197

steps of 196
similarity diagram 43
simple linear regression 146–54

assumptions for 149
SImple Modelling of Class Analogy see SIMCA

Simplified Molecular Input Line Entry System (SMILES) 256–60, 264, 271, 273

skewness 14, 18, 59
Soft Independent Modelling of Class Analogy see SIMCA
Solenopsis invicta 191
Solenopsis richteri 191

specificity 200
spectral map 78

analysis (SMA) 233–8
squared multiple correlation coefficient (r²) 151, 156
standard deviation (s) 12, 58–9

and autoscaling 61
standard error of prediction 158
standard error (SE) 157–8
standard scores 61
star plot 113, 114
Statistical Isolinear MultiCategory Analysis see SIMCA
statistics 12

multivariate 19
univariate 19

stepwise regression 163–4
structure–activity relationships (SAR) 310–13
substituent properties 314

electronic effect (σ) 315–16
sulphonamides 221, 222
sulphones 221, 222
supermolecule 261, 261
supervised learning 21–2, 187–218
symbol Z 61
SYNLIB database 271
Systat package 112, 161

t statistic 157–8, 157, 179, 205
Tabu search (TS) 164
tabulation, of data sets 4
Taft equation 296
techniques

nonparametric 10
parametric 10

test set 22, 57, 289–90, 291
thiopurine methyltransferase 163
thioxanthene 169
THOR database 256, 259, 260
tiotidine 256, 260, 260
Topliss tree 51, 51
total squared distance (TSD) 141

Page 360


total sum of squares (TSS) 150
toxicity 210–11, 301

and mixtures 327
prediction of 261–8, 297
workflow system 265, 266

Toxicity Prediction by Komputer Assisted Technology (TOPKAT) program 263–4, 265, 266–7

toxicophores 262
trained networks, rule extraction from 294
training, decision to stop 289
training algorithms, selection of 288–9
training iterations 290
training of networks 277
training set 22, 26–7, 289–90, 291

benzoic acids 87, 88
classification of 122
olive oils 284
strategies for selection 40

transfer functions 276
choice of 288–9

translation 81
treatments 28
TREPAN 294–5, 295
trial and error 26
trifluoroacetic acid 315
trifluperazine 230
trinitrobenzene 204

UDRIVE program 256, 259, 260, 260

Ultra-High Throughput Screening (Ultra-HTS) 53

ultraviolet spectra 251
unexplained sum of squares see residual sum of squares
univariate statistics 19
Unsupervised Forward Selection (UFS) 69–70

unsupervised learning 21–2, 119–44

validation set 289–90, 291
results 301

values, missing 65
variables

continuous 8, 11
continuous response 284
dependent 3, 159, 214
discrete 8, 11
independent 3–4, 159, 214
indicator 169–74
latent (LV) 206–7
qualitative 8
quantitative 8
selection of 67–72, 180, 215–16

variance 241
residual 126
of sample (s²) 12–13, 58

and autoscaling 61
shared 62–3, 63–4
of variable (V) 38–9, 49

variance-weighting 99, 100
varimax rotation 91–2, 92, 227, 239
vectors 106

and correlation coefficients 70–1, 71
genetic 302

water analysis 124, 136–7, 137, 197
weight vector (Wj) 106
Wiswesser line notation (WLN) 260, 301

xanthene 169
XLS-Biplot program 234

Y scrambling 177–8, 179

Z scores 61