
TimberFool,
Just catching up on the boards and saw your work. Makes for some
great reading! Congratulations, and thanks for your efforts!
Thought I'd toss my $0.02 into the conversation:
To summarize before I start this long-winded monologue: I'll be
commenting on some of the "datamining" problems, offering a solution,
and volunteering to do the number crunching (or to help if I can).
Zgriner and Incog (msg #11272) suggested different ways of subsampling
the data to avoid the effects of "datamining." This caught my
attention, because I remember this same problem being discussed in a
time-series analysis class I took while working on my MS in statistics
a few years back **<- shameless attempt to gain credibility 8*0**.
To restate the problem: we wish to find the mean and standard
deviation of the annual yield. The problem is that the estimates
that have been used all assume that the data points are *independent*
samples from the population. Since we are using data from consecutive
years, we are obviously violating this assumption. While many
statistical methods are robust to violations of certain assumptions,
that is *definitely* not the case here.
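Before subsampling, it's worth eyeballing how badly the independence assumption is violated. One quick check is the lag-1 autocorrelation of the yearly series: values near 0 are reassuring, values near 1 mean consecutive years carry a lot of shared information. A minimal sketch (the yield numbers below are invented placeholders, not the actual data set):

```python
# Estimate the lag-1 autocorrelation of the yearly yields.
# The yield values are made-up placeholders for illustration only.
import statistics

yields = [0.08, 0.11, -0.03, 0.05, 0.12, 0.07, 0.09, 0.02,
          -0.01, 0.10, 0.06, 0.04, 0.13, 0.08, 0.03, 0.09,
          0.11, 0.05, 0.07, -0.02, 0.10, 0.06, 0.12, 0.04,
          0.08, 0.09, 0.03, 0.07, 0.11, 0.05, 0.06, 0.10]  # 32 years, '66-'97

m = statistics.mean(yields)
# Sample autocovariance at lag 1, divided by the total sum of squares.
num = sum((yields[t] - m) * (yields[t + 1] - m) for t in range(len(yields) - 1))
den = sum((y - m) ** 2 for y in yields)
r1 = num / den  # near 0 suggests independence; near 1, strong year-to-year carryover
print(f"lag-1 autocorrelation: {r1:.3f}")
```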
One way around this difficulty, since we have 32 years of data
('66 - '97), is to use only the 1st, 6th, 11th, 16th, 21st, 26th,
and 31st data points. That would give us 7 samples to play with.
Use those 7 samples to compute the mean and std. dev. Sure, this is
clearly not a "random" sample from the population, but it is a far
more statistically independent sample, and independence is the more
serious issue when we are concerned with avoiding bias in our mean
and variance estimates.
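The every-5th-point scheme above can be sketched in a few lines. Again, the yield values are invented placeholders standing in for the real data set:

```python
# Take every 5th yearly yield and compute the subsample's mean and
# std. dev. The numbers are made-up placeholders, not the actual data.
import statistics

yields = [0.08, 0.11, -0.03, 0.05, 0.12, 0.07, 0.09, 0.02,
          -0.01, 0.10, 0.06, 0.04, 0.13, 0.08, 0.03, 0.09,
          0.11, 0.05, 0.07, -0.02, 0.10, 0.06, 0.12, 0.04,
          0.08, 0.09, 0.03, 0.07, 0.11, 0.05, 0.06, 0.10]  # 32 years, '66-'97

subsample = yields[::5]              # 1st, 6th, 11th, ..., 31st points
m_hat = statistics.mean(subsample)   # m~, our estimate of the mean
s_hat = statistics.stdev(subsample)  # s~, sample std. dev. (n-1 divisor)
print(f"n = {len(subsample)}, mean = {m_hat:.4f}, std. dev. = {s_hat:.4f}")
```

Note that `yields[::5]` on 32 points yields 7 values, which is where the sample count above comes from.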
I can see people getting concerned about estimating the mean from only
7 samples. My response would be: go ahead and estimate the
variance of the estimate as well. What we are doing here is this:
    m = m~ + em
    s = s~ + es

where
    m  = the true mean
    m~ = our estimate of m (a random variable)
    em = a random error term (a random variable)
    s  = the true std. dev. of em
    s~ = our estimate of s (a random variable)
    es = a random error term (a random variable)
So, what I am saying is: to get a handle on whether we've done
something unacceptable by reducing our data set to 7 samples, look at
the std. dev. of our estimate of the mean, m~. (I'll spare you the
details of how to do that.) If this variance is small enough, then
we've nothing to worry about. If it is too large, maybe we'll want to
pick every 4th point in the set instead of every 5th (and get more and
more nervous as the samples get less and less independent).
BTW, might as well compute the std. dev. of s~ while we're at it.
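For anyone who doesn't want to be spared the details: under the usual textbook assumptions (independent, roughly normal samples), the std. dev. of m~ is s/sqrt(n), and a standard approximation for the std. dev. of s~ is s/sqrt(2(n-1)). A sketch, with invented subsample values:

```python
# Standard errors of the mean estimate m~ and the std. dev. estimate s~,
# assuming independent, roughly normal samples. Values are placeholders.
import math
import statistics

subsample = [0.08, 0.07, 0.06, 0.09, 0.10, 0.09, 0.06]  # 7 yearly yields
n = len(subsample)
m_hat = statistics.mean(subsample)
s_hat = statistics.stdev(subsample)

se_mean = s_hat / math.sqrt(n)           # std. dev. of m~
se_sd = s_hat / math.sqrt(2 * (n - 1))   # approx. std. dev. of s~ (normal theory)

print(f"mean     = {m_hat:.4f} +/- {se_mean:.4f}")
print(f"std.dev. = {s_hat:.4f} +/- {se_sd:.4f}")
```

If `se_mean` is small relative to the mean itself, the 7-sample subsample is probably good enough; if not, that is the cue to step down to every 4th point.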
Now then, having suggested all this number crunching (TimberFool, are
you listening? 8*), I'll be glad to do it myself if I could have
access to the data set. TimberFool, is that allowed? If the data set
is proprietary, I'll understand. Otherwise, I'd be happy to dirty
my fingers in some raw data!
What do you say?
Nordman <ele@khoral.com>