What can go wrong in a statistical study? Many things; there is even a whole book on the subject: http://www.ucpress.edu/book.php?isbn=9780520274709

Your next assignment is to post an article or a video (it could be anything: peer-reviewed journal articles, newspaper articles, book examples, YouTube videos, TED Talks…) that discusses incorrect or misleading ways of doing statistics or interpreting statistical results. Please include a short review of the “bad statistics” discussed in your post.

Please post your assignments as a reply to this message. Due date: April 16th, 2015.

http://www.ted.com/talks/peter_donnelly_shows_how_stats_fool_juries/transcript?language=en#t-899368

This is a very interesting TED talk that begins by discussing statistics broadly and then delves into a more specific example of the bad use of statistics in criminal trials, starting at approximately the 13:30 mark of the video. During the trial of a woman accused of murdering her two children, an expert witness (a pediatrician, not a statistician) applied the rule of independence incorrectly, claiming that the chance of two children in one family dying of Sudden Infant Death Syndrome is 1/8500 × 1/8500, or about 1 in 73,000,000. Looking at that figure, the jury concluded that the woman must be guilty of murder. However, the calculation was wrong: it did not account for genetic and environmental factors that would greatly increase the chance of a second baby's death after a first, since this family would have been classified in a high-risk group. The misuse of statistics was further exacerbated when a newspaper reported the day after the trial that the expert had said “the chance that Sally Clark was innocent was 73,000,000:1”. This same pediatrician used flawed statistics in two other trials, in which the women were convicted and later appealed. Since statistics is so deeply embedded in everyday life, it needs to be done by experts rather than misrepresented by scientists who do not fully understand it.
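The independence error can be sketched in a few lines of Python. The 1-in-8500 figure comes from the talk; the relative-risk multiplier below is purely an illustrative assumption, not the case's actual figure.

```python
# Probability of one SIDS death in a family like this one (figure quoted in the talk)
p_first = 1 / 8500

# The expert's calculation wrongly assumed the two deaths were independent:
naive = p_first * p_first            # roughly 1 in 72 million

# If shared genetic and environmental factors make a second death, say,
# 10 times more likely in the same family (an illustrative assumption),
# the joint probability is 10 times larger than the naive figure:
relative_risk = 10
dependent = p_first * (relative_risk * p_first)

print(round(1 / naive))      # roughly 72 million
print(round(1 / dependent))  # roughly 7 million
```

Even a modest dependence between the two events shrinks the headline number by an order of magnitude, which is exactly why the independence assumption mattered so much in court.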

http://tedxtalks.ted.com/video/Statistical-errors-in-court-Ric

This TED Talk, “Statistical Errors in Court”, is by Richard Gill, a statistician at Leiden University. He describes the case of Lucia de Berk, a nurse found guilty of several murders largely on the basis of statistical evidence. Gill is well known for exposing the flawed statistical analysis behind the conviction, which ultimately led to it being overturned. The case against Lucia was built on a suspicious pattern: there were nine incidents of death in a medical ward where she worked, and Lucia was present during all of them. In the original conviction, a statistical analysis played an explicit role. After the appeal, the verdict was upheld, but the flawed probability calculation was removed; even so, the “obvious” coincidence between the incidents and Lucia's presence remained a crucial step in the prosecution's case and influenced the medical specialists' evaluation of the medical evidence. Cases like this reinforce the public belief that statistics do little justice to real-world situations. The major problem stems from post-hoc testing of hypotheses, which is widely debated, but most participants do not realize that the same data was being used both to prove that murders had been committed at all and, simultaneously, to prove Lucia's guilt. The talk also sheds light on the apparent ignorance of probability and statistics in the legal and medical professions, a fact that is often forgotten.

The video I watched examines how FOX News distorts statistics with graphs. The first example is a graph displaying federal welfare received in the U.S. It presents the data misleadingly because it is a truncated graph: one whose Y-axis does not start at zero. This makes the increase in welfare spending look large when in reality only a tiny slice of the data is shown. Another way FOX misrepresents data is by changing the magnitude of the units at different points along the X-axis. Changing the magnitude of the X-units skews the line graph and deceives readers without actually altering the data. In conclusion, the way data is presented can ultimately lead readers to read it exactly the way the author wants it to be read.
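The distorting effect of a truncated Y-axis is easy to quantify. The numbers below are hypothetical, chosen only to illustrate the effect, not taken from the graph in the video.

```python
# Two hypothetical data values that differ by 10%
a, b = 6.0, 6.6
true_ratio = b / a  # about 1.1: b is only 10% larger than a

# On a truncated axis starting at 5.9 instead of 0,
# the visible bar heights are measured from that baseline:
baseline = 5.9
apparent_ratio = (b - baseline) / (a - baseline)

print(round(true_ratio, 2))      # about 1.1
print(round(apparent_ratio, 2))  # about 7.0: the second bar looks seven times taller
```

A genuine 10% difference is drawn as a roughly sevenfold difference in bar height, which is precisely the trick a truncated axis plays.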

https://www.youtube.com/watch?v=w7EvBxRYNME

This video was made by a college statistics professor and emphasizes critical thinking in the analysis of graphs. In the first bad graph shown, the vertical axis was not properly scaled and did not start at zero; not starting at zero made the difference in the data look drastic, when in fact, once properly scaled, the two bars were almost identical in height. He described the importance of checking the scaling of an axis to make sure the results are not misrepresented. The next bad graph compared area and volume using Alaska and Texas. The area was not drawn to scale, making Alaska look six times as large as Texas when it is in fact only about 2.5 times as large. He emphasized that the dimensions of graphs showing area and volume must preserve the proper ratios, and warned that representations of area and volume are especially easy to deceive with. The last example involved line graphs: the graph shown had no axis scaling, which made it look like there was a drastic increase in the data when in fact there was not. When comparing two line graphs it is also important to make sure the scaling is similar; otherwise, two graphs can appear to correlate with one another when in fact they do not. The speaker used easy-to-follow examples and showed how to fix misrepresented data as well. This video did a good job of teaching how to think critically about graphs so as not to be misled by data.
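The Alaska/Texas example comes down to the difference between scaling lengths and scaling areas. A quick check, using the video's approximate 2.5:1 area ratio, shows where the misleading factor of roughly six comes from.

```python
# Alaska's area is roughly 2.5 times that of Texas
area_ratio = 2.5

# If a mapmaker instead scales the *linear* dimensions of the Alaska
# drawing by 2.5, the drawn area grows by the square of that factor:
apparent_area_ratio = area_ratio ** 2

print(apparent_area_ratio)  # 6.25: Alaska appears over six times as large
```

The same squaring (or cubing, for volume) effect is why pictorial charts that scale icons by a linear dimension so often exaggerate differences.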

https://www.youtube.com/watch?v=P42LiuhcTrk

http://imcs.dvfu.ru/lib.int/NEW/Math/MV_Probability/MVsa_Statistics%20and%20applications/Good%20P.I.,%20Hardin%20J.W.%20Common%20errors%20in%20statistics,%20and%20how%20to%20avoid%20them%20%28Wiley,%202003%29%28ISBN%200471460680%29%28220s%29.pdf

Errors in statistics can result from a number of bad techniques that often go unnoticed or are assumed to have no effect on the outcome of a study. These errors in sample data are important to avoid because they can cause misinterpretations of population statistics based on false sample statistics. First, an obvious source of error is collecting data from a sample that does not represent the population relevant to the hypothesis being tested; for instance, you would not test the prevalence of heart conditions in a young, healthy population. Similarly, it is important that all of the variables that might affect the outcome are at least considered, if not accounted for. Another common source of error is using the same set of data both to generate a hypothesis and to test it. The best way to produce valid statistics is to use broad data that is the best possible/available representation of the population. Although there are other possibilities for error, making sure your sample is a valid model of the population gives you the greatest chance of accurate results.
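The heart-condition example can be made concrete with a small calculation. The age-group shares and prevalence figures below are hypothetical, chosen only to illustrate how an unrepresentative sample biases the estimate.

```python
# Hypothetical population: age group -> (share of population, disease prevalence)
groups = {
    "18-39": (0.40, 0.02),
    "40-64": (0.35, 0.10),
    "65+":   (0.25, 0.25),
}

# The true population prevalence is the weighted average over all groups
true_prevalence = sum(share * prev for share, prev in groups.values())

# A sample drawn only from the young, healthy group badly underestimates it
young_only_estimate = groups["18-39"][1]

print(round(true_prevalence, 4))  # about 0.11
print(young_only_estimate)        # 0.02
```

Here the convenience sample reports roughly a fifth of the true prevalence, even though every measurement within the sample is perfectly accurate: the bias comes entirely from who was sampled.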

With regard to interpreting statistical results, David McCandless's TED Talk “The Beauty of Data Visualization” reminds me of the article we previously read, “How to Display Data Badly.” In a way, this TED Talk complemented the article perfectly in that it reiterated how imperative it is to display data effectively; ultimately, data that is effectively communicated can be powerful enough to alter the very foundations upon which we build our perspectives. Data visualizations are essential for drawing attention to the raw meaning and significance behind a data set. Without the context provided by visualization, data is meaningless. Visualizations allow us to explore “the patterns and connections between numbers.” Some consider data to be “the new oil,” an exciting resource from which to generate energy, but McCandless prefers to say “data is the new soil,” “a fertile, creative medium” from which data visualizations bloom like flowers. Data is enriched and capable of bearing beautiful, meaningful flowers if interpreted correctly. Visually interpreting statistical results is the most effective way to communicate important data because humans are fundamentally visual beings: each day, massive amounts of visual information enter our eyes and bombard us. McCandless suggests combining “the language of the eye” with “the language of the mind” through visual data representation. In this way, data can be correctly interpreted, and the two languages can be used to fundamentally alter perspectives and change views. McCandless provided an excellent example of displaying data in context so that it may be correctly interpreted: when examining which country has the greatest number of soldiers in absolute terms, China comes first. But considering that China also has the largest population of any country in the world, it becomes obvious that this comparison is skewed.
Visualizing the number of soldiers per 100,000 people instead gives relative values that are more representative of the full picture and are more likely to change perspective. When the same data was displayed this way, China dropped to 24th and Korea took first place. Ultimately, information visualizations are essential for compressing knowledge, but the data must be displayed accurately if viewers are to interpret statistical results correctly in the appropriate context. “Let the data set change your mindset.”
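The normalization McCandless applies is straightforward to reproduce. The country labels and figures below are hypothetical stand-ins, not the talk's actual data; they are chosen only to show how the ranking flips.

```python
# Hypothetical figures: country -> (active soldiers, population)
countries = {
    "A": (2_000_000, 1_400_000_000),
    "B": (1_200_000, 330_000_000),
    "C": (1_000_000, 25_000_000),
}

# Ranking by absolute counts simply rewards having a huge population...
by_absolute = max(countries, key=lambda c: countries[c][0])

# ...while soldiers per 100,000 people tells a different story
per_capita = {c: s / p * 100_000 for c, (s, p) in countries.items()}
by_per_capita = max(per_capita, key=per_capita.get)

print(by_absolute)    # A: largest army in absolute terms
print(by_per_capita)  # C: most militarized relative to population
```

The absolute leader and the per-capita leader are different countries, which is exactly the perspective shift the talk illustrates.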

http://psychcentral.com/blog/archives/2006/03/16/bad-statistics-usa-today/

This is an article from psychcentral.com written by Dr. John Grohol. It discusses the misuse of statistics to misrepresent data. Dr. Grohol wrote about how USA Today used statistics to argue that there has been a growing trend since the 1970s of adults in their mid-20s living with their parents. The USA Today author suggested that there has been a 48% increase since 1970 in the number of people aged 18-34 living with their parents. While that number seems significant, Grohol pointed out that the author did not interpret the data in the right context. For instance, the author did not take into account the military draft, which would explain the lower numbers back in the '70s compared to now. Dr. Grohol also reworked the statistics taking population growth since the 1970s into account: the increase in the share of people aged 18-34 living with their parents is in fact about 16% (as of 2006), not 48% as the author suggested. Additionally, Dr. Grohol pointed out that a 16% increase over the course of 36 years makes the trend look far less dramatic. Thus, it is important for authors to tell the full story of their data: they must interpret it in the right context and account for factors that may make it inaccurate or biased. On the other hand, readers also have to be able to judge how reliable a story is based on how well and fairly its statistical methods were performed and presented.
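Grohol's correction amounts to dividing counts by the size of the age group. The population-growth figure below is an illustrative assumption, picked so the arithmetic roughly matches the article's numbers, not a value taken from the article itself.

```python
# USA Today's headline: the raw count of 18-34-year-olds
# living with their parents grew 48% between 1970 and 2006
count_growth = 1.48

# Assume the 18-34 population itself grew about 28% over the
# same period (an illustrative figure, not from the article)
population_growth = 1.28

# Growth in the *rate*, i.e. the share of the age group living with parents
rate_growth = count_growth / population_growth - 1

print(round(rate_growth * 100, 1))  # about 15.6, close to Grohol's 16%
```

The point generalizes: whenever a population grows, raw counts of almost anything grow with it, so a headline built on counts rather than rates can manufacture a trend out of demographics alone.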

http://www.nature.com/news/scientific-method-statistical-errors-1.14700

This article, published in Nature, discusses errors in interpreting the p-value. The author introduces the topic with a study conducted in 2010 by Motyl's group, which examined whether political extremists and moderates differ in how literally they see the world in black and white. Motyl's group initially obtained a statistically significant p-value, but an attempted replication did not reproduce the result. However, the issue was not the data or the analysis but rather the heavy reliance on the significance of the p-value. At this point, the article shifts to the history of the p-value with Ronald Fisher. Fisher intended the p-value to be used as one indication that data deserved a second look; it was never meant to be definitive and absolute. The author mentions a formula (but does not give its name) by which a p-value can be converted into the probability that the finding is a false alarm rather than a true effect: for example, a p-value of 0.01 corresponds to a false-alarm probability of at least 11%. So the p-value should not be used alone to interpret statistics but needs to be supplemented by at least confidence intervals and the sample size. Overall, the idea that the p-value definitively establishes the significance of a difference between data samples is false; rather, the p-value is an indicator that a significant difference is likely. In addition, the article suggests that multiple statistical analyses should be done in tandem for a more global approach.
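One way to perform the conversion the article alludes to is the Sellke-Berger lower bound on the false-alarm probability, assuming 50-50 prior odds between the null and the alternative. Treating this as the unnamed formula is an assumption on my part; the bound itself is a published result, sketched here.

```python
import math

def false_alarm_lower_bound(p):
    """Sellke-Berger lower bound on the probability that a 'significant'
    result with this p-value is a false alarm, assuming 1:1 prior odds
    between null and alternative. Valid for p < 1/e."""
    bayes_factor = -math.e * p * math.log(p)  # bound on the evidence for the null
    return bayes_factor / (1 + bayes_factor)

print(round(false_alarm_lower_bound(0.01), 2))  # 0.11: at least an 11% false-alarm chance
print(round(false_alarm_lower_bound(0.05), 2))  # 0.29
```

Note that these are lower bounds under favorable assumptions: with more skeptical prior odds, the false-alarm probability for a given p-value is even higher, which is exactly the article's warning against treating p < 0.05 as near-certainty.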

http://www.stat.columbia.edu/~gelman/bag-of-tricks/chap10.pdf

This document lists numerous ways in which statistics can be used improperly, from making up numbers, to misleading readers, to ignoring the baseline, to selection bias. For each of the many types of error, examples are given, most taken from newspaper clippings to provide real illustrations, such as the Pentagon claiming to have shot down 41 Iraqi missiles when the actual number was only 4. Another example showed a map with sections shaded according to the total crimes committed in each area, without accounting for the population of those areas.