The report is attached below.
Report
Literature Review and Data Analysis
“The Effect of Sitagliptin on Carotid Artery Atherosclerosis in Type 2 Diabetes”
The Effects of Anthropogenic Events On Pseudoboletia indiana
A way to better understand biostatistics through the running of statistics on real life experiments.
Here’s my paper about how ocean acidification and ocean temperature increases are affecting a species of sea urchins:
Written Report
The paper is attached below:
Data Analysis Paper
Mixing of specific Arabidopsis thaliana genotypes stabilizes yield in diseased and nondiseased sample populations
Taylor Schilling
BIO 342
27 November 2016
Dr. Zeynep
On my honor, I have not given, nor received, nor witnessed any unauthorized assistance on this work. Taylor Schilling
Objective
Statistical data analysis was performed on the raw data from the study Impact of disease on diversity and productivity of plant populations (Creissen et al. 2016). The purpose of this analysis is to better understand the overall spread of the data from this experiment as well as the relationships between the variables used. The productivity of Arabidopsis thaliana can be shown by seed mass; to find out whether rosette size can also serve as a measure of production, the linear relationship between rosette size and seed mass was estimated using linear regression. Further, ANOVA tests were used to find out whether population means of rosette size differ by genotype and by the number of genotypes per pot. This shows whether the specific genotype and the number of genotypes present actually affect the population means of rosette size and, more importantly, whether the study found significant results regarding its purpose. Last, a chi-square test was done using the number of days to flower (separated into low, medium and high categories) and the type of genotypic mix. This shows whether the genotype mix is associated with the number of days A. thaliana takes to flower and further reinforces the ability of this study to produce results that are applicable to the overall population.
Lit Review
In Impact of disease on diversity and productivity of plant populations, Creissen et al. (2016) studied the effects that diversity in plant genotypes had on stabilizing plant productivity in Arabidopsis thaliana under attack by the pathogen Hyaloperonospora arabidopsidis (Hpa). They focused on plant competition and the effects on plant production when the pathogen is introduced, as well as the effect of biodiversity on the system's ability to buffer against the disease.
The research shows that pathogens promote plant biodiversity and prevent competitive exclusion – at least when a resistant genotype is present. Biodiversity is reduced when less competitive species are diseased. Additionally, species richness lessens the effect of disease and increases plant productivity. Four specific genotypes (Van0, Ga0, NFA10 and NFA8) of A. thaliana were chosen based on their fitness and planted in pots. There were four plants per pot – 20 pots of each of the 11 monocultures and mixtures in each pathogen treatment (220 pots total). The researchers measured the diseased leaf area after six and ten days, rosette leaf size, plant height and flowering time.
The results of this study show that, when diseased, the yield ultimately depended on the number and combination of certain genotypes. Hpa reduced seed production in all mixes with the most susceptible genotypes, NFA8 and NFA10. This is shown in the decrease in rosette diameter. There was an increased competitive ability in the resistant genotypes, Ga0 and Van0, in the 2-way and 4-way mixes. It is also important to note that Ga0 is the most competitive with or without Hpa, while NFA8 is highly competitive without Hpa, though less so in the presence of the pathogen. Without the pathogen, the pots with only highly competitive genotypes have the lowest yield while pots with less competitive genotypes have the highest yield. With the disease, the combination of the somewhat susceptible NFA10 and the fully resistant Van0 had the highest yield in monoculture and 2-way mix. Additionally, the study found that 2-way genotypic mixes had overall higher yields than monoculture and 4-way mixed pots. In fact, 4-way mixes had the lowest yield without Hpa and the same yield as monoculture with Hpa. This shows that not only the combination of genotypes but also the number of different genotypes in the mix matters for plant yield.
This research supports the ability of resistant genotypes to maintain productivity, stability and diversity. Plants show greater resistance to change (known as ecological resistance) and buffer negative effects during events such as the introduction of a pathogen in order to maintain their wellbeing and ensure their survival. With a pathogen, a high yield of a resistant genotype results, especially when there is a mixture of resistant and susceptible genotypes. Disease helps to maintain genotypic diversity, which in turn enhances productivity because disease pressure leads to compensatory actions: in this case, the overyielding of one genotype (for example, Van0) compensates for the loss of another (NFA10). These compensatory interactions are strongest when the genotypes have different competitive abilities. They compensate depending on their specific response to disease, which ultimately leads to more production. Therefore, pathogens promote biodiversity by inhibiting competitive exclusion and supporting complementation. Mixtures, then, may reduce the effect of pathogens as well as the competitiveness between plants, as seen with the 2-way mix between Van0 and NFA10.
This research is applicable to those who work in agriculture because it helps them decide which plants and plant genotypes to plant to get the best yield and yield stability possible. It also highlights the importance of the genotypes and the number of genotypes in a mix for buffering the effect of a pathogen and achieving the highest productivity possible.
Data Analysis
Descriptive Statistics: Days to Flower
Minimum  Quartile 1  Median  Quartile 3  Maximum  
42  49  52  62  90  
Mean  Standard Deviation  Interquartile Range  Variance  
55.4  8.7  13.0  74.8  
The number of days that plants took to flower ranges from 42 to 90 days. The middle 50% of flowering times lies between 49 and 62 days, with the median of the sample being 52 days. The mean is 55.4 days. The interquartile range (quartile 3 minus quartile 1) is 13 days; extending one interquartile range below quartile 1 and above quartile 3 gives 36 to 75 days. The standard deviation, which measures the dispersion of days around the mean, is 8.7 days.
Descriptive Statistics: Rosette Size
Minimum  Quartile 1  Median  Quartile 3  Maximum  
25  55  67  82  138  
Mean  Standard Deviation  Interquartile Range  Variance  
69.4  18.1  27.0  328.8  
The rosette size ranges between 25 and 138 mm. The middle 50% of rosette sizes is between 55 and 82 mm, with the median of the sample being 67 mm. The mean is 69.4 mm. The interquartile range (quartile 3 minus quartile 1) is 27 mm; extending one interquartile range below quartile 1 and above quartile 3 gives 28 to 109 mm. Because it is built from quartiles, the interquartile range is less sensitive to outliers than the full range. The standard deviation, which measures the dispersion of rosette sizes around the mean, is 18.1 mm.
Descriptive Statistics: Seed Mass
Minimum  Quartile 1  Median  Quartile 3  Maximum  
0.01  0.20  0.28  0.35  0.68  
Mean  Standard Deviation  Interquartile Range  Variance  
0.28  0.11  0.15  0.01  
The seed mass ranges from 0.01 g to 0.68 g. The middle 50% of seed mass is between 0.20 and 0.35 g, with the median of the sample being approximately 0.28 g. The mean is also 0.28 g. The interquartile range (quartile 3 minus quartile 1) is 0.15 g; extending one interquartile range below quartile 1 and above quartile 3 gives 0.05 to 0.50 g. Because it is built from quartiles, the interquartile range is less sensitive to outliers than the full range. The standard deviation, which measures the dispersion of seed mass around the mean, is 0.11 g.
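The interquartile ranges quoted in the three tables can be reproduced directly from the reported quartiles. The following is a minimal Python sketch; the dictionary keys and the helper name `iqr_summary` are illustrative and not part of the original analysis. The "one IQR beyond each quartile" span mirrors the intervals quoted in the text and is not the conventional 1.5 x IQR outlier fence.

```python
# Quartiles (Q1, Q3) copied from the descriptive statistics tables above.
quartiles = {
    "days_to_flower": (49, 62),
    "rosette_size_mm": (55, 82),
    "seed_mass_g": (0.20, 0.35),
}

def iqr_summary(q1, q3):
    """Return the IQR and the span one IQR below Q1 and one IQR above Q3."""
    iqr = q3 - q1
    return iqr, q1 - iqr, q3 + iqr

for name, (q1, q3) in quartiles.items():
    iqr, low, high = iqr_summary(q1, q3)
    print(f"{name}: IQR={iqr:.2f}, span={low:.2f} to {high:.2f}")
```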
Research question: Is there a linear relationship between the overall seed mass and rosette size in the plants?
Correlation
Seed Mass (g)  
Rosette Size (mm)  0.49041
<0.0001 
Null hypothesis: There is no relationship between the overall seed mass and rosette size in these plants.
Alternative hypothesis: There is a relationship between the overall seed mass and rosette size in these plants.
The correlation between seed mass and rosette size is approximately 0.49. This is a moderate positive correlation, which means that there is a moderate positive relationship between the two variables: as the seed mass increases, the rosette size tends to increase. Further, the P-value is less than 0.0001, which is below the alpha value of 0.05, so the correlation is significant and the null hypothesis is rejected. Thus, there is statistical evidence to support the relationship between seed mass and rosette size. As such, it is reasonable to fit a linear regression.
Regression
Seed Mass = 0.06989 + 0.00302*Rosette Size
P-value  R-square
<0.0001  0.2405
Term  Parameter Estimate  P-value
Intercept  0.06989  <0.0001
Rosette Size  0.00302  <0.0001
Null hypothesis: There is no linear relationship between the overall seed mass and rosette size in these plants.
Alternative hypothesis: There is a linear relationship between the overall seed mass and rosette size in these plants.
The linear regression P-value (<0.0001) is less than the alpha value (0.05), meaning that the null hypothesis is rejected and there is a statistically significant linear relationship between seed mass and rosette size. The regression line shows that with every millimeter increase in rosette size, the predicted seed mass increases by 0.00302 g. At 0 millimeters, the predicted seed mass is 0.06989 g. The R-square value, however, is 0.2405, which means that only 24.05% of the variation in seed mass is explained by rosette size and 75.95% is unexplained. This regression line is therefore a weak predictive model for seed mass, even though the relationship is statistically significant, because the majority of the variation is unexplained.
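As a rough illustration of how the fitted line and the R-square value relate, the sketch below plugs the reported coefficients into a prediction function and squares the reported correlation. The constant and function names are hypothetical; only the numeric values come from the tables above. In simple linear regression, R-square equals the square of the Pearson correlation, so it gives the share of variation explained.

```python
# Coefficients and correlation copied from the regression and correlation output above.
INTERCEPT = 0.06989  # predicted seed mass (g) at a rosette size of 0 mm
SLOPE = 0.00302      # change in predicted seed mass (g) per mm of rosette size
R = 0.49041          # Pearson correlation between seed mass and rosette size

def predict_seed_mass(rosette_mm):
    """Predicted seed mass in grams for a given rosette size in millimeters."""
    return INTERCEPT + SLOPE * rosette_mm

r_squared = R ** 2           # share of seed mass variation explained
unexplained = 1 - r_squared  # share left unexplained

print(f"R-square = {r_squared:.4f}, unexplained = {unexplained:.4f}")
print(f"Predicted seed mass at the median rosette size (67 mm): {predict_seed_mass(67):.3f} g")
```

The prediction at the median rosette size lands close to the median seed mass of 0.28 g, which is a reasonable sanity check on the reported coefficients.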
Research question: Are the population means of rosette size significantly different for each genotype?
Null hypothesis: The population means of rosette size are the same for each genotype.
Alternative hypothesis: The population means of rosette size are different for each genotype.
Because the P-value (<0.0001) is less than alpha (0.05), the null hypothesis is rejected. There is statistically significant evidence that the population mean rosette size differs among the genotypes (Ga0, NFA10, NFA8, Van0); that is, at least one genotype mean differs from the others.
Research question: Are the population means of rosette size significantly different for each number of genotypes per pot?
Null hypothesis: The population means of rosette size are the same for each number of genotypes per pot.
Alternative hypothesis: The population means of rosette size are different for each number of genotypes per pot.
The P-value (<0.0001) is less than the alpha value (0.05), meaning that the null hypothesis is rejected and that there is statistically significant evidence that the population mean rosette size differs among the numbers of genotypes per pot (1 genotype/pot, 2 genotypes/pot and 4 genotypes/pot); that is, at least one group mean differs from the others.
Research question: Do the plants take different numbers of days to flower between pots that are monocultures, 2-way genotypic mixes, and 4-way genotypic mixes?
Low  42-49 days to flower (lower 33%)
Medium  50-62 days to flower (middle 33%)
High  63-90 days to flower (upper 33%)
(Taken from 5number summary of days to flower)
Observed and expected values  
Category  Mono  2-way mix  4-way mix  Total
Low  86 (91)  263 (274)  97 (82)  446 
Medium  106 (113)  392 (343)  59 (102)  558 
High  63 (51)  116 (155)  74 (46)  252 
Total  255  771  230  1256 
Null hypothesis: The number of days it takes to flower and genotypic mix are independent.
Alternative hypothesis: The number of days it takes to flower depends on the genotypic mix.
By comparing the observed and expected counts of days to flower (categories: low, medium and high number of days to flower) across the types of genotype mix (categories: mono, 2-way mix, 4-way mix), the P-value (5.66×10^{-12}) is found to be less than alpha (0.05). Therefore, the null hypothesis is rejected and there is statistically significant evidence that days to flower and genotype mix are associated.
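The test of independence can be sketched in plain Python from the observed counts in the table above. This is an illustrative reimplementation, not the software output used in the report: the function names are hypothetical, expected counts are recomputed from the observed cells (so row totals may differ by one from the printed totals), and the P-value uses a closed-form tail probability that is valid only for an even number of degrees of freedom.

```python
import math

# Observed counts from the table above
# (rows: low, medium, high; columns: mono, 2-way mix, 4-way mix).
observed = [
    [86, 263, 97],
    [106, 392, 59],
    [63, 116, 74],
]

def chi_square_independence(table):
    """Return the Pearson chi-square statistic and degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

def chi2_sf_even_dof(x, dof):
    """Upper-tail chi-square probability; closed form valid only for even dof."""
    if dof % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(dof // 2))

stat, dof = chi_square_independence(observed)
p_value = chi2_sf_even_dof(stat, dof)
print(f"chi-square = {stat:.2f}, dof = {dof}, P-value = {p_value:.2e}")
```

With these counts the P-value comes out on the order of 10^{-12}, far below the 0.05 threshold, which agrees with the conclusion above.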
Conclusion
In order to better understand the raw data gathered by Creissen et al. (2016) during their study, correlation, regression, ANOVA and chi-square tests were performed. The conclusions made were that there is a statistically significant, though weak, linear relationship between seed mass and rosette size; population means of rosette size differ among genotypes; population means of rosette size differ among the numbers of genotypes per pot; and the number of days plants take to flower is associated with whether they are in a monoculture, 2-way mix or 4-way mix. The study was therefore successful at gathering significant results that can be related to overall populations.
Creissen, H. E., Jorgensen, T. H., and Brown, J. K. M. (2016). Impact of disease on diversity and productivity of plant populations. Functional Ecology 30, 649-657.
Final biostats project
Presentation:
Shane Ragland
Biostatistics research project
Statistical analysis of physical factors of patients who underwent a pulmonary bronchoscopy
Objective:
Physical characteristics were recorded for 304 healthcare workers who perform pulmonary bronchoscopies and are suspected to have contracted pulmonary tuberculosis through improper precaution during the procedures. The first aim is to see whether there is any correlation or linear relationship between Body Mass Index (BMI) and the age of the workers. The second is to ascertain whether TB and Smoking are independent of each other. The third is to determine whether the population mean BMI differs among the three levels of Smoking History.
Literature Review:
Healthcare workers who perform pulmonary bronchoscopies, or who are near patients undergoing one, are advised to take precautions during the procedure. Pulmonary tuberculosis (PTB) is a highly contagious disease, and the procedure can leave contagious particulates airborne. It is recommended that face masks and equipment that can filter out these particulates be worn during the procedure, and such precaution is especially important when a patient has pulmonary tuberculosis. However, if neither the patient nor the doctor knows that the patient has TB, the patient may be unexpectedly diagnosed later, which means the healthcare workers who performed the procedure may have been exposed to the disease (Na et al., 2016). The paper that provides the data used in this study is a retrospective study of 1,954 healthcare workers for whom CT and bronchoscopy information was available from the Pusan National University Hospital in Busan, South Korea (Na et al., 2016). South Korea has a particularly high incidence rate of PTB, so determining risks of exposure is particularly important. Of the people in the study, 304 are thought to have been exposed to PTB through improper precaution. The paper states that there were no significant differences in age or body mass index in the study population (Na et al., 2016). The smoking history of the patients was recorded as never smoked (0), past smoker (1), or current smoker (2). The future diagnosis of the patient with PTB was determined from hospital records and was recorded as either diagnosed (1) or undiagnosed (0) (Na et al., 2016).
Statistical Analysis:
Descriptive Statistics for Numeric Variables
Variable  N  N Miss  Minimum  Mean  Median  Maximum  Std Dev
BMI  304  0  15.0000000  21.8473684  21.6000000  44.3000000  2.9693583
Age  304  0  19.0000000  55.0888158  57.0000000  88.0000000  16.4226964
Smoking  304  0  0  0.6480263  0  2.0000000  0.8430238
TB  304  0  0  0.5230263  1.0000000  1.0000000  0.5002930
The mean is close to the median for both continuous variables, age and BMI, which suggests that each distribution is approximately symmetric. The BMI data have a range of 29.3 and the Age data have a range of 69 years.
Correlation and Linear Regression model of BMI and Age
To determine whether or not Age is correlated with and has a linear relationship to BMI, a correlation and a linear regression model were used. These methods were chosen because both BMI and Age are continuous variables, and these models indicate whether or not they are correlated and have a linear relationship. As shown in the correlation table below, the Pearson correlation coefficient, r, is only -0.004, which means that the two variables Age and BMI are very weakly correlated. A strong positive correlation would be indicated by a coefficient between 0.7 and 1, and a strong negative correlation by a coefficient between -0.7 and -1. The correlation coefficient of -0.004 does not lie in either of these intervals and is close to 0, which represents a very weak correlation between the variables BMI and Age.
This can be further seen with the linear regression analysis. The R-squared value is 0.003, which means that only 0.3% of the variance in the data can be explained by the linear regression model. The slope of the linear regression model is -0.0007, which suggests that for each additional year of age, BMI lowers by 0.0007, starting from the y-intercept of 21.89 at age 0. However, the slope is not significantly different from zero (P = 0.944), and because of the very low correlation, this model does not explain the variance in the data well.
Pearson Correlation Coefficients, N = 304  
BMI  
Age  -0.00400
Parameter Estimates
Variable  DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  1  21.88719  0.59800  36.60  <.0001
Age  1  -0.00072294  0.01040  -0.07  0.944
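The reported regression output can be cross-checked with a few lines of arithmetic. The sketch below assumes the slope and correlation are negative, as the statement that BMI lowers with age implies; the variable names are illustrative. It relies on two standard facts about simple linear regression: the t value is the estimate divided by its standard error (equivalently r·sqrt(n-2)/sqrt(1-r²)), and the fitted line passes through the point of means.

```python
import math

# Values copied from the tables above; minus signs are an assumption
# based on the text stating that BMI lowers as age increases.
n = 304
intercept = 21.88719
slope = -0.00072294
slope_se = 0.01040
r = -0.004
mean_age = 55.0888158
mean_bmi = 21.8473684

# t statistic for the slope, computed two equivalent ways.
t_from_slope = slope / slope_se
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# A least-squares line passes through (mean Age, mean BMI).
fitted_at_mean_age = intercept + slope * mean_age

print(f"t from slope/SE: {t_from_slope:.3f}")
print(f"t from r and n:  {t_from_r:.3f}")
print(f"fitted BMI at mean age: {fitted_at_mean_age:.4f} (reported mean BMI {mean_bmi:.4f})")
```

Both t values round to -0.07, and the fitted BMI at the mean age matches the reported mean BMI only when the slope is negative, which supports that reading of the output.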
Chi Square: TB versus smoking
H_{0}: That Tuberculosis Diagnosis and History of smoking are independent
H_{A}: That Tuberculosis Diagnosis and History of smoking are not independent
In order to see whether the two categorical variables, TB and Smoking, are independent of each other, a chi-square test was conducted. SAS calculated the test statistic χ^{2} = 0.9990 and the P-value, P(χ^{2} > 0.9990) = 0.6068. At the 0.05 significance level, one should not reject the null hypothesis (as 0.6068 > 0.05). In conclusion, the chi-square test provides no evidence that Tuberculosis Diagnosis and Smoking History are associated.


Statistics for Table of Smoking by TB
Statistic  DF  Value  Prob 
Chi-Square  2  0.9990  0.6068
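For a chi-square statistic with 2 degrees of freedom, the upper-tail probability has the simple closed form exp(-x/2), so the reported P-value can be verified by hand. The short sketch below uses a hypothetical helper name and is only a check on the table above, not a replacement for the SAS output.

```python
import math

def chi2_sf_two_dof(x):
    """Upper-tail probability of a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-x / 2)

# Test statistic from the Smoking by TB table above.
p_value = chi2_sf_two_dof(0.9990)
print(f"P-value = {p_value:.4f}")  # prints "P-value = 0.6068"
```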
ANOVA Test: Smoking Versus BMI
H_{0}: The mean BMI is equal across all levels of smoking history (never, past, and current).
H_{A}: At least two of the mean BMIs across the levels of smoking history (never, past, and current) are not equal.
To test whether or not the population mean BMI differs among the three levels of smoking history, an ANOVA test was used. SAS calculated a test statistic of F = 0.60 and a P-value of P(F > 0.60) = 0.5492. At the 0.05 significance level, one would not reject the null hypothesis (because 0.5492 > 0.05). Thus, one cannot reject the hypothesis that the mean BMI is equal across all levels of smoking history (never, past, and current). One can conclude that the population means are not statistically different.
Source  DF  Sum of Squares  Mean Square  F Value  Pr > F  
Model  2  10.616569  5.308284  0.60  0.5492 
Conclusion:
By testing for correlation, linear relationships, independence, and difference in means, one can begin to make inferences about this data set. From observing Pearson's correlation coefficient, it was concluded that Age and BMI were very weakly correlated. From the chi-square hypothesis test, no evidence was found that Smoking History and TB are associated, and thus smoking history shows no significant relationship with contracting TB. From the ANOVA table, no statistically significant difference was found in mean BMI across the levels of smoking history (never, past, and current). Thus, one can better understand how the three variables of Age, BMI, and Smoking History interact with TB and with each other.
Excel Resource
This website is a pretty good comprehensive resource.
www.realstatistics.com
Excel Resources
The following is a series of videos about statistics in Excel.