The data below are the intervals, in hours, between failures of the air conditioning equipment in 10 Boeing 707s. The aim was to describe concisely the variation within and between aircraft.

Plane no. (a dash marks the end of a plane's record)

    1    2    3    4    5    6    7    8    9   10
  413   90   74   55   23   97   50  359  487  102
   14   10   57  320  261   51   44    9   18  209
   58   60   48   56   87   11  102   12  100   14
   37  186   29  104    7    4   72  270    7   57
  100   61  502  220  120  141   22  603   98   54
   65   49   12  239   14   18   39    3    5   32
    9   14   70   47   62  142    3  104   85   67
  169   24   21  246   47   68   15    2   91   59
  447   56   29  176  225   77  197  438   43  134
  184   20  386  182   71   80  188    -  230  152
   36   79   59   33  246    1   79    -    3   27
  201   84   27   15   21   16   88    -  130   14
  118   44  153  104   42  106   46    -    -  230
   34   59   26   35   20  206    5    -    -   66
   31   29  326    -    5   82    5    -    -   61
   18  118    -    -   12   54   36    -    -   34
   18   25    -    -  120   31   22    -    -    -
   67  156    -    -   11  216  139    -    -    -
   57  310    -    -    3   46  210    -    -    -
   62   76    -    -   14  111   97    -    -    -
    7   26    -    -   71   39   30    -    -    -
   22   44    -    -   11   63   23    -    -    -
   34   23    -    -   14   18   13    -    -    -
    -   62    -    -   11  191   14    -    -    -
    -  130    -    -   16   18    -    -    -    -
    -  208    -    -   90  163    -    -    -    -
    -   70    -    -    1   24    -    -    -    -
    -  101    -    -   16    -    -    -    -    -
    -  208    -    -   52    -    -    -    -    -
    -    -    -    -   95    -    -    -    -    -

Minitab output gives

              N     MEAN   MEDIAN   TRMEAN    STDEV   SEMEAN
plane1       23     95.7     57.0     83.2    119.3     24.9
plane2       29     83.5     61.0     77.9     70.8     13.1
plane3       15    121.3     57.0    100.4    154.3     39.8
plane4       14    130.9    104.0    124.7     98.2     26.2
plane5       30     59.6     22.0     49.1     71.9     13.1
plane6       27     76.8     63.0     74.3     63.7     12.3
plane7       24     64.1     41.5     60.3     62.7     12.8
plane8        9    200.0    104.0    200.0    225.9     75.3
plane9       12    108.1     88.0     80.7    136.2     39.3
plane10      16     82.0     60.0     76.3     66.3     16.6

            MIN      MAX       Q1       Q3
plane1      7.0    447.0     22.0    118.0
plane2     10.0    310.0     27.5    109.5
plane3     12.0    502.0     27.0    153.0
plane4     15.0    320.0     44.0    224.8
plane5      1.0    261.0     11.8     87.7
plane6      1.0    216.0     18.0    111.0
plane7      3.0    210.0     16.7     94.8
plane8      2.0    603.0      6.0    398.5
plane9      3.0    487.0      9.8    122.5
plane10    14.0    230.0     32.5    126

After some reflection it seems that plane number 8 is rather different. It may be worth trying to fit a probability distribution of the Gamma type, that is, one with density

$f(x) = \dfrac{x^{\alpha-1} e^{-x/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}, \quad x > 0,$

where $\alpha$ is the shape parameter. The $\alpha$ values for each plane are

Plane      1     2     3     4     5     6     7     8     9    10
$\alpha$  ____  ____  ____  ____  ____  ____  ____  ____  ____  ____

Once again plane eight stands out with a low value.
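As a rough illustration of how such a shape value can be obtained, the sketch below estimates the Gamma shape for two of the planes. It is not necessarily the method used for the values in these notes: it assumes maximum likelihood and uses a standard closed-form approximation to the ML shape estimate. The helper name `gamma_shape_approx` is hypothetical, and the intervals for planes 4 and 8 are read from the data at the start of this example.

```python
import math

def gamma_shape_approx(xs):
    """Approximate ML estimate of the Gamma shape parameter.

    With s = log(mean(x)) - mean(log(x)), a common closed-form
    approximation is (3 - s + sqrt((s - 3)^2 + 24 s)) / (12 s).
    This is an assumption of the sketch, not the notes' own method.
    """
    n = len(xs)
    mean = sum(xs) / n
    mean_log = sum(math.log(x) for x in xs) / n
    s = math.log(mean) - mean_log
    return (3 - s + math.sqrt((s - 3) ** 2 + 24 * s)) / (12 * s)

# Intervals between failures (hours) for planes 4 and 8.
plane4 = [55, 320, 56, 104, 220, 239, 47, 246, 176, 182, 33, 15, 104, 35]
plane8 = [359, 9, 12, 270, 603, 3, 104, 2, 438]

shape4 = gamma_shape_approx(plane4)
shape8 = gamma_shape_approx(plane8)
print(f"plane 4 shape ~ {shape4:.2f}")
print(f"plane 8 shape ~ {shape8:.2f}")
```

On these two planes the estimate for plane 8 comes out well below 1 and below that for plane 4, which is in line with the remark that plane eight has a low value.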
The question then is what distributions one should fit to the data:

(1) separate Gammas
(2) common Gammas
(3) Gammas with shape parameter equal to 1, i.e. exponentials

The exponential fit is quite good and gives a very simple and vivid summary of the data. In an average sense the fit is not bad, but there are some unexplained variations. If we leave out plane number 8, then a Chi-squared goodness-of-fit test would accept an exponential fit. To test the distribution we tried a probability plot of the pooled data (less no. 8).

MTB > plot c12 c11

[Minitab character plot of C12 (vertical axis, 0.0 to 6.0) against C11 (horizontal axis, 0 to 500)]

The plot looks fairly linear.

Example 2

Cycles to failure of worsted yarn

x1: length of test specimen (250, 300, 350 mm)
x2: amplitude of loading cycle (8, 9, 10 mm)
x3: load (40, 45, 50 g)

 x1   x2   x3   obsns
 -1   -1   -1     674
 -1   -1    0     370
 -1   -1    1     292
 -1    0   -1     338
 -1    0    0     266
 -1    0    1     210
 -1    1   -1     170
 -1    1    0     118
 -1    1    1      90
  0   -1   -1    1414
  0   -1    0    1198
  0   -1    1     634
  0    0   -1    1022
  0    0    0     620
  0    0    1     438
  0    1   -1     442
  0    1    0     332
  0    1    1     220
  1   -1   -1    3636
  1   -1    0    3184
  1   -1    1    2000
  1    0   -1    1568
  1    0    0    1070
  1    0    1     556
  1    1   -1    1140
  1    1    0     884
  1    1    1     360

There are some arguments for taking logs of the observations here, and for taking logs of the explanatory variables x1, x2, x3. So, transforming back to our original data scales,

ROW    x1    x2    x3    cyc      lcyc
  1   250     8    40    674   6.51323
  2   250     8    45    370   5.91350
  3   250     8    50    292   5.67675
  4   250     9    40    338   5.82305
  5   250     9    45    266   5.58350
  6   250     9    50    210   5.34711
  7   250    10    40    170   5.13580
  8   250    10    45    118   4.77068
  9   250    10    50     90   4.49981
 10   300     8    40   1414   7.25418
 11   300     8    45   1198   7.08841
 12   300     8    50    634   6.45205
 13   300     9    40   1022   6.92952
 14   300     9    45    620   6.42972
 15   300     9    50    438   6.08222
 16   300    10    40    442   6.09131
 17   300    10    45    332   5.80513
 18   300    10    50    220   5.39363
 19   350     8    40   3636   8.19864
 20   350     8    45   3184   8.06589
 21   350     8    50   2000   7.60090
 22   350     9    40   1568   7.35756
 23   350     9    45   1070   6.97541
 24   350     9    50    556   6.32077
 25   350    10    40   1140   7.03878
 26   350    10    45    884   6.78446
 27   350    10    50    360   5.88610

we can take logs and try to fit the data. I have not plotted all the diagrams you would have made up to this point; we should leave some trees standing. Using the regression in Minitab we have

The regression equation is
lcyc = 3.93 + 4.94 lx1 - 5.65 lx2 - 3.51 lx3

Predictor       Coef      Stdev    t-ratio        p
Constant       3.930      2.254       1.74    0.095
lx1           4.9447     0.2581      19.16    0.000
lx2          -5.6540     0.3894     -14.52    0.000
lx3          -3.5117     0.3894      -9.02    0.000

s = 0.1844     R-sq = 96.6%     R-sq(adj) = 96.2%

Analysis of Variance

SOURCE        DF         SS         MS         F        p
Regression     3    22.4217     7.4739    219.77    0.000
Error         23     0.7822     0.0340
Total         26    23.2039

SOURCE        DF     SEQ SS
lx1            1    12.4852
lx2            1     7.1703
lx3            1     2.7661

Obs.    lx1     lcyc      Fit  Stdev.Fit  Residual  St.Resid
  1    5.52   6.5132   6.5205     0.0847   -0.0073     -0.04
  2    5.52   5.9135   6.1069     0.0722   -0.1934     -1.14
  3    5.52   5.6768   5.7369     0.0838   -0.0602     -0.37
  4    5.52   5.8230   5.8546     0.0722   -0.0315     -0.19
  5    5.52   5.5835   5.4410     0.0571    0.1425      0.81
  6    5.52   5.3471   5.0710     0.0712    0.2761      1.62
  7    5.52   5.1358   5.2589     0.0838   -0.1231     -0.75
  8    5.52   4.7707   4.8453     0.0712   -0.0746     -0.44
  9    5.52   4.4998   4.4753     0.0830    0.0245      0.15
 10    5.70   7.2542   7.4220     0.0720   -0.1679     -0.99
 11    5.70   7.0884   7.0084     0.0568    0.0800      0.46
 12    5.70   6.4520   6.6384     0.0710   -0.1864     -1.10
 13    5.70   6.9295   6.7561     0.0568    0.1734      0.99
 14    5.70   6.4297   6.3425     0.0356    0.0872      0.48
 15    5.70   6.0822   5.9725     0.0556    0.1097      0.62
 16    5.70   6.0913   6.1604     0.0710   -0.0691     -0.41
 17    5.70   5.8051   5.7468     0.0556    0.0584      0.33
 18    5.70   5.3936   5.3768     0.0700    0.0168      0.10
 19    5.86   8.1986   8.1843     0.0834    0.0144      0.09
 20    5.86   8.0659   7.7706     0.0707    0.2952      1.73
 21    5.86   7.6009   7.4007     0.0826    0.2003      1.21
 22    5.86   7.3576   7.5183     0.0707   -0.1608     -0.94
 23    5.86   6.9754   7.1047     0.0552   -0.1293     -0.73
 24    5.86   6.3208   6.7347     0.0697   -0.4139     -2.42R
 25    5.86   7.0388   6.9226     0.0826    0.1162      0.70
 26    5.86   6.7845   6.5090     0.0697    0.2755      1.61
 27    5.86   5.8861   6.1390     0.0817   -0.2529     -1.53

R denotes an obs. with a large st. resid.
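The fitted coefficients can be reproduced outside Minitab. The sketch below is only an illustration and makes one simplifying assumption: because the design is a full 3 x 3 x 3 factorial, the centred log predictors are mutually orthogonal, so each slope can be computed as a simple one-variable least-squares slope rather than by solving the full normal equations. The variable names are my own.

```python
import math
from itertools import product

# Cycles to failure in the row order of the table (x3 varies fastest).
cyc = [674, 370, 292, 338, 266, 210, 170, 118, 90,
       1414, 1198, 634, 1022, 620, 438, 442, 332, 220,
       3636, 3184, 2000, 1568, 1070, 556, 1140, 884, 360]
design = list(product((250, 300, 350), (8, 9, 10), (40, 45, 50)))

y = [math.log(c) for c in cyc]                      # lcyc
X = [[math.log(v) for v in row] for row in design]  # lx1, lx2, lx3

def mean(values):
    return sum(values) / len(values)

y_bar = mean(y)
col_means = [mean([row[j] for row in X]) for j in range(3)]

# Orthogonal design: each slope is Sxy / Sxx for its own column.
slopes = []
for j in range(3):
    sxy = sum((row[j] - col_means[j]) * (yi - y_bar) for row, yi in zip(X, y))
    sxx = sum((row[j] - col_means[j]) ** 2 for row in X)
    slopes.append(sxy / sxx)

intercept = y_bar - sum(b * m for b, m in zip(slopes, col_means))
print("constant:", round(intercept, 3))
print("slopes  :", [round(b, 4) for b in slopes])
```

For a design that is not balanced this shortcut would not apply, and the full least-squares system would have to be solved instead.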
MTB > histo c19

Histogram of C19   N = 27

Midpoint   Count
    -2.5       1  *
    -2.0       0
    -1.5       1  *
    -1.0       4  ****
    -0.5       5  *****
     0.0       5  *****
     0.5       5  *****
     1.0       3  ***
     1.5       3  ***

MTB > plot c20 c5

[Minitab character plot of C20 (vertical axis, 4.5 to 7.5) against lcyc (horizontal axis, 4.90 to 7.70)]

Thus we have

$\text{cycle} \propto x_1^{4.957}\, x_2^{-5.651}\, x_3^{-3.501}$

which, rounding the exponents, can be written as a useful rule

$\text{cycle} \propto (x_2/x_1)^{-5}\, x_3^{-7/2}$

Example 3

A brewer approached an AFRC research worker with a question about the types of yeast in his process. The AFRC unit did a DNA analysis on the yeasts, the result of which was that they could classify the 30 yeast cells that they examined into 5 types. The results were

Type   Frequency
A         14
B          8
C          5
D          2
E          1

The question that then arose was as follows: we have found 5 classes in 30; how many classes are there in the population? Naturally we have probably not seen them all. To answer the question we make a few assumptions, such as that the classes are arranged in a sensible order. We can then try to find a distribution that fits the data.

It is fairly easy to see that if we call A, B, C, D, E 1, ..., 5 then $\sum fx = 58$ and $\sum fx^2 = 148$, and so the mean estimate is 1.93333 while the estimate of the variance is 1.19556. The distribution that might model the data is the Negative Binomial, with parameters p and r. This is not quite right because we cannot observe classes with zero numbers in them, so although the Negative Binomial may be a good fit, what we see is conditional on x ≠ 0, that is $P[X = x \mid X > 0]$. Since $P[X > 0] = 1 - p^r$ and $P[X = x \mid X > 0] = P[X = x]/(1 - p^r)$, we can work out the distribution.

Then all we have to do is estimate the parameters. A simple way is to equate the mean and variance of the new distribution with those of the sample values, 1.93333 and 1.19556, and solve the equations. Solving gives p = 0.63 and r = 2.1.
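The sample values being matched here follow directly from the frequency table. A minimal check, assuming the types A to E are scored 1 to 5 as in the text (the variable names are my own):

```python
# Frequencies of yeast types A..E, scored 1..5.
freq = {1: 14, 2: 8, 3: 5, 4: 2, 5: 1}

n = sum(freq.values())                              # cells examined
sum_fx = sum(f * x for x, f in freq.items())        # sum of f*x
sum_fx2 = sum(f * x * x for x, f in freq.items())   # sum of f*x^2

mean = sum_fx / n
variance = sum_fx2 / n - mean ** 2                  # divisor n, not n - 1

print(n, sum_fx, sum_fx2)
print(round(mean, 5), round(variance, 5))
```

Note that 1.19556 is reproduced only with divisor n = 30; with divisor n - 1 the figure would be slightly larger.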
Now for the distribution that includes zero classes the mean number per class is 1.1597, so if the number of classes is N we have 1.1597 N = 30, or N = 25.9, giving the number of unobserved classes as N - 5, or about 21.
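The final step is plain arithmetic, sketched below with the figures quoted in the text. The value 1.1597 is taken as given rather than recomputed from p and r, and the variable names are my own.

```python
cells_examined = 30       # yeast cells classified
classes_observed = 5      # types A..E
mean_per_class = 1.1597   # quoted mean of the untruncated distribution

n_classes = cells_examined / mean_per_class
unobserved = n_classes - classes_observed

print(round(n_classes, 1), round(unobserved, 1))
```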