Final biostats project




Shane Ragland

Biostatistics research project


Statistical analysis of physical factors of patients who underwent a pulmonary bronchoscopy



Physical characteristics were recorded for 304 healthcare workers who perform pulmonary bronchoscopies and are suspected to have contracted pulmonary tuberculosis (TB) because of improper precautions during the procedures. The first hypothesis tests whether there is any correlation or linear relationship between the workers' Body Mass Index (BMI) and age. The second tests whether TB diagnosis and smoking history are independent of each other. The third tests whether the population mean BMI differs among the three levels of smoking history.

Literature Review:

Healthcare workers who perform pulmonary bronchoscopies, or who are near patients undergoing them, are advised to take precautions during the procedure. Pulmonary tuberculosis (PTB) is a highly contagious disease, and the procedure can leave infectious particles airborne. Face masks and equipment that can filter out these particles should be worn during the procedure, and this precaution is especially important when a patient has pulmonary tuberculosis. However, if neither the patient nor the doctor knows that the patient has TB, the patient may be unexpectedly diagnosed later, which means the healthcare workers who performed the procedure may have been exposed to the disease (Na et al., 2016). The paper that provides the data used in this study is a retrospective study of 1,954 healthcare workers for whom CT and bronchoscopy information was available from the Pusan National University Hospital in Busan, South Korea (Na et al., 2016). South Korea has a particularly high incidence of PTB, so determining the risk of exposure is especially important. Of the people in the study, 304 are thought to have been exposed to PTB because of improper precautions. The paper states that there were no significant differences in age or body mass index in the study population (Na et al., 2016). The smoking history of the patients was recorded as never smoked (0), past smoker (1), or current smoker (2). The future diagnosis of the patient with PTB was determined from hospital records and was recorded as either diagnosed (1) or undiagnosed (0) (Na et al., 2016).

Statistical Analysis:

Descriptive Statistics for Numeric Variables

Variable N N Miss Minimum Mean Median Maximum Std Dev


For both continuous variables, Age and BMI, the mean is close to the median, which suggests that each distribution is approximately symmetric. Symmetry is consistent with, but does not by itself establish, a normal bell curve. The BMI data have a range of 29.3 and the Age data have a range of 69.
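
As a rough illustration of the summary above, the following sketch computes the same descriptive statistics with Python's standard library and adds Pearson's second skewness coefficient, 3(mean - median)/SD, as one simple way to quantify how close the mean is to the median. The function name `describe` and the sample values are hypothetical and are not taken from the study data.

```python
from statistics import mean, median, stdev


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    avg = mean(values)
    mid = median(values)
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    return {
        "n": len(values),
        "minimum": min(values),
        "mean": avg,
        "median": mid,
        "maximum": max(values),
        "std_dev": sd,
        "range": max(values) - min(values),
        # Pearson's second skewness coefficient: near 0 when mean and median agree
        "skewness": 3 * (avg - mid) / sd,
    }


# Hypothetical BMI values, for illustration only (not the study data).
example_bmi = [18.5, 20.1, 21.7, 22.0, 23.4, 24.8, 26.2]
print(describe(example_bmi))
```

A skewness value near zero supports symmetry only; a histogram or normal probability plot would still be needed to judge normality.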

Correlation and Linear Regression model of BMI and Age

To determine whether Age is correlated with, and linearly related to, BMI, a correlation analysis and a linear regression model were used. These methods were chosen because both BMI and Age are continuous variables. As shown in the correlation table below, the Pearson correlation coefficient, r, is only -0.004, which means that Age and BMI are very weakly correlated. A strong positive correlation would be indicated by a coefficient between 0.7 and 1, and a strong negative correlation by a coefficient between -0.7 and -1. The coefficient of -0.004 lies in neither interval and is close to 0, which indicates essentially no linear correlation between BMI and Age.

This can be further seen in the linear regression analysis. The reported R-squared value is -0.003; because an ordinary R-squared cannot be negative, this figure is most likely the adjusted R-squared. The ordinary R-squared is the square of r, about 0.00002, so the linear regression model explains essentially none of the variance in BMI. The slope of the regression line is -0.0007, which suggests that BMI decreases by about 0.0007 for each additional year of age, starting from the y-intercept of 21.89 at age 0. However, the slope is not significantly different from zero (t = -0.07, P = 0.944), so this model does not explain the variance in the data well.
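
To make the two quantities discussed above concrete, here is a minimal sketch of how a Pearson correlation coefficient and a least-squares line are computed from paired observations. The analysis in this report was run in SAS; the function names and the age and BMI values below are hypothetical and serve only as an illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def least_squares_line(x, y):
    """Return (intercept, slope) of the ordinary least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical ages and BMI values, for illustration only (not the study data).
ages = [25, 34, 41, 52, 60, 68]
bmis = [22.1, 21.4, 23.0, 20.8, 22.5, 21.9]
r = pearson_r(ages, bmis)
intercept, slope = least_squares_line(ages, bmis)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
print(f"BMI = {intercept:.2f} + {slope:.4f} * Age")
```

In simple linear regression with one predictor, the ordinary R-squared equals the square of r, which is why it can never be negative.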


Pearson Correlation Coefficients, N = 304
           BMI
Age        -0.00400

Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  1   21.88719            0.59800         36.60    <.0001
Age        1   -0.00072294         0.01040         -0.07    0.944
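
The entries in the tables above can be cross-checked against one another with a few lines of arithmetic. The sketch below assumes a simple linear regression with one predictor in which all N = 304 observations were used. The p-value uses a normal approximation to the t distribution, an assumption made here for simplicity that is reasonable only because the degrees of freedom (302) are large.

```python
from math import erfc, sqrt

n = 304                 # sample size from the correlation table
r = -0.00400            # Pearson correlation between Age and BMI

# Ordinary R-squared is the square of r and cannot be negative.
r_squared = r ** 2

# Adjusted R-squared for one predictor: 1 - (1 - R^2)(n - 1)/(n - 2).
# It can fall slightly below zero when R^2 is near zero.
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - 2)

# Each t value is the parameter estimate divided by its standard error.
t_intercept = 21.88719 / 0.59800
t_age = -0.00072294 / 0.01040

# Two-sided p-value for the Age slope, normal approximation (assumption).
p_age_approx = erfc(abs(t_age) / sqrt(2))

print(f"R squared          = {r_squared:.6f}")
print(f"Adjusted R squared = {adjusted_r_squared:.4f}")
print(f"t (Intercept)      = {t_intercept:.2f}")
print(f"t (Age)            = {t_age:.2f}")
print(f"p (Age, approx.)   = {p_age_approx:.3f}")
```

Under these assumptions the adjusted R-squared rounds to -0.003, while the ordinary R-squared is about 0.00002.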


Chi-Square Test: TB versus Smoking

H0: Tuberculosis diagnosis and smoking history are independent.

HA: Tuberculosis diagnosis and smoking history are not independent.

To see whether the two categorical variables, TB and Smoking, are independent of each other, a chi-square test was conducted. SAS calculated the test statistic χ2 = 0.9990 and the P-value, P(χ2 > 0.9990) = 0.6068. At the 0.05 significance level, one should not reject the null hypothesis (as 0.6068 > 0.05). In conclusion, the chi-square test provides no evidence that tuberculosis diagnosis and smoking history are associated, so the data are consistent with the two variables being independent.



Table of Smoking by TB (frequencies)
Smoking    TB = 0    TB = 1    Total
0          86
1          27
2          32
Total      145
Statistics for Table of Smoking by TB

Statistic DF Value Prob
Chi-Square 2 0.9990 0.6068
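
As a hedged sketch of how this statistic is formed, the code below computes a Pearson chi-square statistic from a table of observed counts and then checks the reported p-value. It relies on the fact that a chi-square distribution with exactly 2 degrees of freedom has the closed-form upper-tail probability exp(-x/2); that shortcut does not hold for other degrees of freedom. The function names and the counts in the example table are hypothetical and are not the study data.

```python
from math import exp


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / n
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


def p_value_df2(statistic):
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return exp(-statistic / 2)


# Hypothetical 3 x 2 table (smoking level by TB status), illustration only.
example_table = [[30, 20], [15, 15], [25, 25]]
stat, df = chi_square_statistic(example_table)
print(f"chi-square = {stat:.4f} with {df} df, p = {p_value_df2(stat):.4f}")

# Check of the reported result: statistic 0.9990 with 2 df.
print(f"p for reported statistic = {p_value_df2(0.9990):.4f}")
```

A 3 x 2 table has (3 - 1)(2 - 1) = 2 degrees of freedom, which matches the DF column in the SAS output.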


ANOVA Test: Smoking Versus BMI

H0: The mean BMI is equal across all levels of smoking history (never, past, and current).

HA: At least two of the mean BMIs for the levels of smoking history (never, past, and current) are not equal.

To test whether the population mean BMI differs among the three levels of smoking history, an ANOVA test was used. SAS calculated a test statistic of F = 0.60 and a P-value of P(F > 0.60) = 0.5492. At the 0.05 significance level, one would not reject the null hypothesis (because 0.5492 > 0.05). Thus, one cannot reject the hypothesis that the mean BMI is equal across all levels of smoking history (never, past, and current). The data do not show a statistically significant difference among the population means.

Source DF Sum of Squares Mean Square F Value Pr > F  
Model 2 10.616569 5.308284 0.60 0.5492
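
The F statistic in the table is the ratio of the between-group mean square to the within-group mean square. The sketch below shows that calculation for a one-way ANOVA and checks the reported p-value. It uses the closed form P(F > f) = (1 + 2f/df2)^(-df2/2), which holds only when the numerator degrees of freedom equal 2. The error degrees of freedom of 301 (304 observations minus 3 groups) are an assumption, since the Error row of the table is not shown. The function names and group values are hypothetical.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


def p_value_f_df1_equals_2(f_value, df_within):
    """Upper-tail probability of F(2, df_within); valid only for 2 numerator df."""
    return (1 + 2 * f_value / df_within) ** (-df_within / 2)


# Hypothetical BMI values for never, past, and current smokers (not study data).
never = [21.0, 22.5, 23.1, 20.4]
past = [22.2, 21.8, 24.0, 20.9]
current = [21.5, 23.3, 22.0, 21.1]
f_value, df1, df2 = one_way_anova([never, past, current])
print(f"F({df1}, {df2}) = {f_value:.2f}, p = {p_value_f_df1_equals_2(f_value, df2):.4f}")

# Check of the reported result, assuming 301 error degrees of freedom.
print(f"p for F = 0.60 with (2, 301) df = {p_value_f_df1_equals_2(0.60, 301):.4f}")
```

Because the reported F value is rounded to two decimals, the recomputed p-value should agree with 0.5492 only approximately.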



By testing for correlation, linear relationships, independence, and differences in means, one can begin to make inferences about this data set. From Pearson’s correlation coefficient, it was concluded that Age and BMI are very weakly correlated. The chi-square test found no evidence that Smoking History and TB are dependent, so these data do not show a significant association between smoking history and a TB diagnosis. The ANOVA found no statistically significant difference in mean BMI among the levels of smoking history (never, past, and current). Thus, one can better understand how the three variables of Age, BMI, and Smoking History relate to TB and to each other.

Data set of exposure to pulmonary TB during bronchoscopy in patients with unexpected PTB

Link to Dryad article:

Link to file:

Current plan is to use:

Continuous variables: BMI and Age

Categorical variables: Smoking history (current smoker, non-smoker, or past smoker) and TB diagnosis (diagnosed or not)


Novel ways of displaying biostatistical data

When I was asked to find an interesting way that biostatistical data is displayed, I first thought about practical examples I have seen. The most common “interesting” displays of public health data I have come across are the PSA posters we have all seen in high school. The first one that came to mind shows how many people you have exposed yourself to if your sexual partners mirrored your own promiscuity, like the one below.


The figure makes unsafe sexual promiscuity scarier by presenting the message as a graphic, which may be a more effective way of getting a point across than stating a number, which may have less impact.

However, the best example I can think of is a video from Penn and Teller (Vegas magicians and comedians) explaining why people should vaccinate their kids. As with the picture above, Penn and Teller present, in a novel way, biostatistical data that may not leave an impact on the viewer in the form of plain numbers and percentages. They use balls as diseases and people as pins to explain how vaccinating children makes them safer.


Post 2: Study of the effect of a novel drug on chorea in Huntington’s

For my journal article I chose an experimental study of the effect of a drug, Deutetrabenazine, on chorea in patients with Huntington’s disease. Huntington’s disease (HD) is a disease that results in the death of brain cells. A common symptom is chorea: involuntary spasms and movements caused by the neurodegeneration characteristic of HD. Treating chorea is important because it has a significant impact on a patient’s safety and quality of life. Deutetrabenazine is a transporter inhibitor that is thought to be more stable, and thus to require less frequent dosing, than currently used treatments. The research question the researchers posed was “Can Deutetrabenazine decrease chorea in patients with Huntington’s disease?” They decided that an experimental study was required because they needed to apply the treatment under controlled conditions in order to measure the difference between a treated and an untreated group. The study used a sample of 90 patients with HD who had a chorea score of 8 or higher (the range is 0-28, with higher scores being worse). The 90 qualifying patients were sorted randomly into two groups: half received Deutetrabenazine and the other half received placebo, in a double-blind manner, for 12 weeks. The study reported in detail the mean differences in scores before and after treatment in both groups. The drug group improved on average from a mean of 12.1 to 7.7, and the placebo group improved from a mean of 13.2 to 11.3. The mean difference in improvement between groups was -2.5 units, with the drug group improving more. More patients also improved on the drug than on placebo: 23 patients improved with Deutetrabenazine compared with only 9 on placebo (51% versus 20%).
No extreme side effects of Deutetrabenazine were seen in the study; the side effects were similar to those of placebo. The study concluded that Deutetrabenazine improved motor signs compared with placebo at 12 weeks, but that further testing is needed to determine how much better it is and whether it remains effective and safe over time. This kind of drug is important to research further because more consistent, stable drugs for HD can reduce the symptoms of chorea more efficiently while requiring less dosing. The study’s double-blind design does raise an ethical question, in that some patients with HD were given a placebo rather than a treatment expected to work.



Geschwind, M. D. and Paras, N. (2016). Deutetrabenazine for Treatment of Chorea in Huntington Disease. JAMA 316, 33.

Why do we care about Biostatistics?

In today’s world, where medical care and research must serve an exploding population, being able to accurately determine risk and reward is exceedingly important. Using biostatistics to guide a physician, a team of researchers, or a savvy consumer in medical care is not only becoming increasingly essential, but it can also help solve many problems that people in the life sciences face. The ability to critically analyze, evaluate, and present data is a skill anyone in the sciences should have, and it becomes especially important that data be well analyzed and presented when it affects the health of others. Overall, we care about biostatistics because it helps us conduct a successful study, properly analyze the results, and make informed decisions from the conclusions.

Improper use, or lack, of statistical methods in medical studies can come at the cost of lives. I was listening to NPR last week when the topic of the lack of diversity in cancer studies came up. The lack of minorities included in cancer studies is making minority groups more likely to succumb to the disease. Some drugs, like one blood thinner mentioned in the talk, do not even work in certain minority groups because those groups simply were not considered in the trial. Although minorities make up a significant share of the population, around 40%, the share included in the studies is much lower than it would be if a proper biostatistical approach were taken in designing the study. An important application of biostatistics is therefore making sure that a medical study accurately represents the populace, so that the research can help more people rather than one homogeneous group. Failing to use biostatistics to conduct a study properly is not only irresponsible but also unethical, and studying biostatistics helps a researcher prevent these kinds of errors.

A good use of biostatistics is determining risk versus reward from a set of data in order to make an educated decision on how to proceed. For example, extensive biomedical research drawing on a large pool of diverse clinical studies has determined that carriers of harmful BRCA1 gene mutations have an extremely high incidence of breast cancer. As a physician, if you know through genetic testing that your patient carries such a mutation, the biostatistical data from research would support suggesting pre-emptive measures against breast cancer. Another good example of using biostatistical data to make an informed decision is in drug design and research. I recently read a paper that tested potential drugs to help patients with p53 mutations; p53 mutation is involved in over 50% of cancers. When deciding which mutations to study, the researchers examined over 20,000 cancer patients to see where in the p53 gene mutations had occurred, and then studied the ones that occurred most frequently, the “p53 hotspots”. The researchers intelligently allocated their limited resources to the most common mutations because, as the biostatistical data suggested, that was the most efficient way to help the most people. Thus, an important application of biostatistics is critically evaluating data to determine the risk and reward of the decisions you will have to make as a researcher or physician.

In my time here at Rollins College I have seen the importance of being able to analyze data for its accuracy and importance and to draw conclusions from it. The ability to construct a good argument for a lab report, or to understand a review article, can hinge solely on the ability to understand the data in a figure. It is also especially important for knowing whether a study is biased or was well conducted. I have seen how the manipulation of statistics, in the same drug paper discussed earlier, can lead the reader to a dishonest conclusion. Being able to tell whether data was manipulated, for example by using different methods of showing error to make the error seem smaller and thus the data more significant, is an important part of judging whether a study was well conducted and whether the data displayed is “honest”. I plan on using biostatistics extensively in my future career as a physician. In my experience as a medical shadow, I have seen doctors make good and bad decisions that hinged solely upon how frequently they used a biostatistical approach to diagnosing a patient. Knowing which diseases are most common in certain demographics, which conditions a set of symptoms most likely arises from, and which treatments are most effective is all information one can derive from proper biostatistical research. Keeping up with new clinical studies that used a proper biostatistical approach to guide and evaluate their data collection and analysis is a way to help ensure your patients’ safety and to give a proper diagnosis and treatment plan. I have seen several successful physicians research the best possible treatments for a disease by consulting biostatistical articles on the success rates of certain treatment plans before suggesting which route to take, and they explained it as simply the most responsible thing to do.
In conclusion, I personally care about biostatistics because it is the most responsible way to make informed decisions, which is important because those decisions will affect the welfare of others.


Works cited:

“BRCA1 & BRCA2: Cancer Risk & Genetic Testing.” National Cancer Institute. N.p., n.d. Web. 02 Sept. 2016.

Bullock, Alex N., and Alan R. Fersht. “Rescuing the Function of Mutant P53.” Nature Reviews Cancer 1.1 (2001): 68-76. Web.

Gore, A. D., P. V. Chavan, Y. R. Kadam, and G. B. Dhumale. “Application of Biostatistics in Research by Teaching Faculty and Final-year Postgraduate Students in Colleges of Modern Medicine: A Cross-sectional Study.” International Journal of Applied and Basic Medical Research 2.1 (2012): 11. Web.

“Lack Of Diversity In Clinical Trials Presents Possible Health Consequences.” NPR. NPR, n.d. Web. 02 Sept. 2016.