p Value

Trust this as far as you can p?

Back when I was involved with some 10 different research projects every week, I realized that one of the weakest points in my training was statistics.  I needed those skills to discern if what we were doing was actually an improvement or not.  As a techie, I was taught to believe (yes, that’s the verb) that statistical analysis would remove any bias I would bring to the analysis.


It’s several decades later and I now recognize that the reliance on such statistical analysis is one of the bigger causes of the problems that I see routinely in science, in economics (no, folks, I don’t buy that economics is a science- there’s too much bias and BS that is encompassed in what purports to be discovery there), and a slew of other disciplines.

There is a great need for statistical analysis- when it is used properly and without bias.  Our bicarbonate dialysate invention is a great example.  We developed the first liquid formulation that used bicarbonate as the buffer.  We found a way to produce this product so that it had a long shelf life and didn’t decay.  (The bicarbonate ion has a tendency to form carbonates and bubble out of solution.)

When we were preparing our data for submission to the FDA to get our product approved, we began examining all the data collected at the University of California, Christ Hospital, and the Nassau County Medical Center- not just the efficacy of dialysis, but all the patient data.  And, given that our trials were being effected at a research institution, a well-regarded private hospital, and a public hospital that had a slew of indigent patients, we knew we had a great cross-section of patients and a wide range in degrees of care. We also were examining three different conditions- the use of (then) conventional dialysate [this used the non-physiological buffer of acetate], bicarbonate dialysate that was prepared by technicians and nurses right before treatment, and our prepared, off-the-shelf solution.

The obvious result was that our product met or exceeded all the dialysis parameters.  The patients had good removal of toxins, the patients had no untoward effects, and the use of bicarbonate (over acetate) proved to be the better treatment.

What we didn’t expect were the dramatic differences in the patients’ blood pressures.  Oh, sure, we knew the acetate buffer, being non-physiologic, would not yield the best results.  But, our product provided a significant stability to the patients’ mean arterial blood pressures- significantly better than acetate and the powder formulation that was prepared fresh before each treatment.  Moreover, while not fully understood at the time, the data have since demonstrated that lower mean arterial pressures (and the stability of those pressures) are associated with better long-term survival for the patient.   This research analysis involved my first use of two different statistical parameters.

But, I’ve ranted (in a scientific way 😊  )   about the confusion among the public (and many technical folks) between causation and correlation.  We (I am not immune) think that when we have statistical data that demonstrate the results we want, we’ve developed the true answer.  But, given our new predilection for big data, we are finding way more correlations that have nothing to do with causation.

The fact that two (or more) conditions correlate well does not mean we understand the phenomenon- or why there is a correlation in the first place.   A great example is the conventional dumb blonde joke.  Even if we were to find that more blondes make bone-headed decisions or just don’t get the situation, that would not mean this failure is a result of their genetics, that it happens because they were born with blonde tresses.  (Of course, those that BLEACH their hair blonde may have a different reason 😊  – not.)

This is even more true in the financial sector.  So many stock brokers and professional investors allege they produce great results with your investment portfolio.  But, more often than not, a monkey randomly throwing darts at the stock pages could provide the same results.  (Trust me, this study has been done and it’s true.)

Many such folks use terms such as backtesting.  (Basically, this means testing an idea to see how it would have fared over the past decade or two, using historical data.)  And, the proponents claim that they have 95% confidence that their results are perfectly correct.  No, they have excellent correlation- but have no inkling about the underlying facts or what is true.

And, it’s not just the financial types.  Many a techie is proud of his (or her) ability to data-mine- to overfit a hypothesis to the data they believe it explains.   This is also called p-hacking.  Researchers (scientific, engineering, financial – any sort of analysis) select data, effect statistical analysis on the set, and keep refining their analysis until they create “significant” results from non-significant data.  Often, this sort of problem occurs when meta-analysis is performed (in science).  For financial types, the conceptual term is “smart beta”.  The results are the same- non-significant data are deemed to be scientific because some sort of correlation can be discerned, after all these massages.  Absurd results can be manipulated to demonstrate low p values, while important findings can fail the test.
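
To make the "keep refining until something is significant" problem concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names are made up, the data are pure random noise, and the test is a simple two-sided z test with the spread of the noise taken as known. Because both groups are drawn from the same distribution, every "significant" result it finds is false.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def noise_only_p_value(rng, n_per_group=30):
    """p value for comparing two groups drawn from the SAME noise source."""
    # Hypothetical data: both groups are standard normal noise, so no real effect exists.
    a = [rng.gauss(0, 1) for _ in range(n_per_group)]
    b = [rng.gauss(0, 1) for _ in range(n_per_group)]
    # Two-sided z test; the standard deviation is known to be 1 by construction.
    z = (fmean(a) - fmean(b)) / sqrt(2 / n_per_group)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_false_positives(n_tests=1000, alpha=0.05, seed=1):
    """Run many independent noise-only comparisons and count the 'significant' ones."""
    rng = random.Random(seed)
    return sum(noise_only_p_value(rng) < alpha for _ in range(n_tests))


if __name__ == "__main__":
    hits = count_false_positives()
    print(f"{hits} of 1000 noise-only comparisons came out with p < 0.05")
```

With a 0.05 cutoff, roughly one comparison in twenty should clear the bar by chance alone. An analyst who tries enough slices of the data and reports only the winners will always have something "significant" to show.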

p Value definition

The p test- or null hypothesis significance testing-  attempts to separate valid results from background noise.  The concept starts by assuming the null hypothesis: that there is no relationship between variables, or no effect when we manipulate variables in scientific experiments.  Then, we compute the probability (or p) of obtaining results at least as extreme as the ones we observed if that assumption were true.  (By convention, a p value below 0.05 is deemed significant; above that number implies that the results are non-significant.)  Note that p is not the probability that the hypothesis is wrong.
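
One way to see what the p value measures is a permutation test, sketched below in Python. This is an illustration and not the t or F analysis used in the work described here; the function name and the sample numbers are hypothetical. The idea: if there really is no effect, the group labels are arbitrary, so we shuffle them many times and count how often chance alone produces a difference at least as large as the one observed.

```python
import random


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p value for a difference in means by shuffling labels."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis the labels carry no information, so reassign them at random.
        rng.shuffle(pooled)
        a, b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(a) / n_a - sum(b) / len(b))
        if diff >= observed:
            at_least_as_extreme += 1

    # Add one to both counts so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    # Hypothetical measurements, made up purely for illustration.
    treated = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
    control = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    print(permutation_p_value(treated, control))
```

A small result says only that the observed difference would be rare if nothing were going on. It says nothing about why the difference exists.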

Back when I was in grad school (where I was one of the VERY few to have a computer available to me in my lab), t and F analyses were calculated by hand.  Then, one searched statistical tables to discern the p value.  The goal was to find test results where p was below 0.05 (at my grad school, we were pressed to use 0.01, a much more stringent threshold) to claim the finding was a valid one.

Now, computers provide the p values almost on the spot.  And, with large data sets, we have a big problem: a tiny effect can lead to very low p values when the sample size is very large- just as we find with almost all of ‘big data’ nowadays.
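
The sample-size problem can be shown with a few lines of arithmetic. The Python sketch below assumes a two-sided z test comparing two equal-sized groups, with the effect expressed in standard deviations; the function name and the effect size of 0.01 are hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_p_value(effect_size_sd, n_per_group):
    """Two-sided p value for a mean difference of `effect_size_sd` standard deviations.

    Assumes two groups of equal size and a known, common standard deviation.
    """
    z = effect_size_sd * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    tiny_effect = 0.01  # one hundredth of a standard deviation: practically nothing
    for n in (100, 10_000, 1_000_000):
        p = two_sample_z_p_value(tiny_effect, n)
        print(f"n per group = {n:>9,}: p = {p:.3g}")
```

The effect never changes; only the sample size does. At a hundred per group the difference is nowhere near significant, while at a million per group the same trivial difference clears any conventional cutoff with room to spare.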

Which means we probably are NOT finding those statistically significant reasons at all.  The real problem is that the p value can tell us a relationship exists- but not whether that relationship is causal, nor whether it is large enough to matter.  It provides no indication of the scientific validity of that relationship.

Ready for another blonde joke?

Roy A. Ackerman, Ph.D., E.A.
