To get a good answer, you must write a good question. Answering a statistics question without context is like boxing blindfolded. You might knock your opponent out, or you might break your hand on the ring post.

What goes into a good question?

1. Tell us the PROBLEM you are trying to solve. That is, the substantive problem, not the statistical aspects.

2. Tell us what math and statistics you know. If you’ve had one course in Introductory Stat, then it won’t make sense for us to give you an answer full of mixed model theory and matrix algebra. On the other hand, if you’ve got several courses or lots of experience, then we can assume you know some basics.

3. Tell us what data you have, where it came from, what is missing, how many variables, what are the Dependent Variables (DVs) and Independent Variables (IVs) – if any, and anything else we need to know about the data. Also tell us which (if any) statistical software you use.

4. Are you thinking of hiring a consultant, or do you just want pointers in some direction?

5. THEN, and ONLY THEN, tell us what you’ve tried, why you aren’t happy, and so on.

I am doing a project on count data using SPSS. I have analysed the number of calls to a call centre over an 18-month period. It is clear that January shows the highest number of calls. I applied a one-way ANOVA and it showed a significant difference between means. I then carried out a post hoc (LSD) analysis. I looked at the multiple comparisons table, and the I-J column (mean differences) for January showed some negative numbers when compared with some of the other months. How is it possible that January’s mean can be lower than those of other months when it showed a higher call count? Please could someone explain this to me!

If there were other independent variables, then that could account for it. If you only have 18 months of data, then many months will only have one instance. I don’t think ANOVA (analysis of variance) is right in this case, as those months will have no within-month variance and there will be nothing to analyze. And, if the number of calls per month is low, you might need to account for the fact that they are counts, which are all going to be integers greater than or equal to 0. (This probably isn’t a problem here, unless it is a very small call center.)

You need some other kind of model.
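To make the point about single instances concrete, here is a minimal sketch. It assumes the data are one total per calendar month and that the 18 months are consecutive; the starting month (January) and the function name are illustrative assumptions, not details from the question.

```python
from collections import Counter

def observations_per_month(start_month, n_months):
    """Count how often each calendar month (1-12) occurs in a run of consecutive months."""
    return Counter((start_month - 1 + i) % 12 + 1 for i in range(n_months))

# Assumption: the series starts in January and holds one total per month.
counts = observations_per_month(start_month=1, n_months=18)

# Calendar months that occur only once in the 18-month window.
single = sorted(m for m, c in counts.items() if c == 1)

# Within-group degrees of freedom for a one-way ANOVA with month as the factor:
# number of observations minus number of groups.
within_df = sum(counts.values()) - len(counts)

print(counts)
print("months seen once:", single)
print("within-group df:", within_df)
```

Whatever the starting month, six calendar months appear twice and six appear once, so the within-month variation rests on only six degrees of freedom.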

Hi, I am hoping someone can help me with my stats. I am hoping to find that increasing intake of X will increase levels in subjects. I have 4 groups (1 control), and in each group I will have two outcomes (measuring maternal levels of X and baby levels of X). Which statistical test can I use? I think my outcome variable is continuous. Am I correct? And how will I know if I have a normal or skewed distribution? I would appreciate it if someone could get back to me. Thanks.

I don’t understand. Why would increasing intake increase levels? What levels? You ask if your outcome is continuous, but you also say you have two outcomes…

Mr Flom

I am measuring two things in each group.

For example, I am measuring the level of X in the blood of pregnant women and also measuring the same thing in their babies. My hypothesis is that levels of X in the blood will increase with higher doses of Treatment A. What I meant by a continuous outcome is that my results are not fixed values (e.g. they may be 1 or 1.9). Yes, I will have two outcomes in each group, and I want to compare the two outcomes in each group with the control. I just do not know what statistical analysis I can use to come up with a p-value.

Are the mean, mode, and median affected when data is added?

Any of them could be, none of them necessarily are.
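A small made-up example may help show why. The numbers below are purely illustrative; depending on which values are added, none, one, or all three of the summaries move.

```python
import statistics

def summary(data):
    """Return (mean, median, mode) for a list of numbers."""
    return statistics.mean(data), statistics.median(data), statistics.mode(data)

base = [1.0, 2.0, 2.0, 3.0]             # hypothetical starting data

unchanged = summary(base + [2.0])        # add a value equal to all three summaries
mean_only = summary(base + [10.0])       # add one large value
all_three = summary(base + [3.0, 3.0])   # add two values that create a new mode

print("base:        ", summary(base))
print("add 2:       ", unchanged)
print("add 10:      ", mean_only)
print("add 3 and 3: ", all_three)
```

Adding a 2 changes nothing, adding a 10 moves only the mean, and adding two 3s moves the mean, the median, and the mode.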

I need help with choosing a statistical test. I have two groups divided into three categories each. I want to analyze whether the difference in the number of members in each category is statistically significant between groups. The total number of members is not equal between the groups. The experiment was done in three repeats.

Example (one repeat):

Group I: A – 11 members, B – 9, C – 3

Group II: A – 3, B – 17, C – 14

Question: Is the difference between IC and IIC significant?

I can work out the math once I know the test, I just need to be pointed in the right direction. Thanks!

Why is the continuity correction not applied in the normal distribution? Please answer me.

To do this you would need a whole bunch of groups. Right now, in effect, your N is 2

Your question, by itself, makes no sense. Sorry.

Hello, I am doing psychological research regarding happiness and depression. I have over 500 participants who have taken a measure of happiness and a measure of depression. Most of my research is regarding correlations between these 2 measures, but what I really want to do is compare the 90th percentile of both groups. So, those who scored the highest on depression vs. those who scored the highest on happiness (I have no overlapping participants who are high in both). My hypothesis is that happy people have a lower likelihood of scoring higher on the depression measure compared to depressed people scoring higher on the happiness measure. (This is the differential relationship between happiness and depression; I apologize if it is confusing.) However, my difficulty is whether to use 2 different t-tests or 2 different correlations. I am also wondering if this can be done in any regard. I have taken upper-level statistics and am familiar with conducting both of these tests. Thank you.

Hi David

I think you can do this by comparing two proportions with one test.

Thank you for your reply, but which test should I use?

Hi David – a t-test would do.

And you suggested a single t-test? How would I conduct a single t-test in this manner?

You have two proportions: Proportion of depressed people who are happy and proportion of happy people who are depressed. So, you do a t-test to see which proportion is greater.

Thank you very much sir, I appreciate your help!

I would like to use a one sample t-test where accuracy is 50-50..

The data is rated on a three point likert scale of 1-performed task incorrectly, 2 – performed task with minor mistakes, 3-performed task accurately

I have calculated the mean and standard deviation of my data and that is fine, I am just confused about what test value to use.. (newbie)

Do I use .5 as in 50% or do I use 1.5 as this is the center of the possible high score of 3 that can be achieved?

With a 3 point Likert scale you shouldn’t be taking the mean or SD and you shouldn’t be doing a t-test. You should be using chi-square or one of its ordered alternatives.

I don’t know what you mean by “accuracy is 50-50” either.

Thanks so much for taking the time to answer me. I only have one set of 10 participants and I am trying to see if they performed a task correctly with accuracy greater than a hypothetical accuracy of 50-50. In a one sample t-test I would compare the mean of my participants (2.875, with a standard deviation of .35355) against a population mean of 50% accuracy. I am confused about whether I use .5 as the population mean or 1.5, since 3 is the highest score a person can achieve (and 1.5 is half of 3). So it’s just the mean I am comparing against that is confusing me.

I was told to use a one sample t-test. Is it bad to have 1 – performed task incorrectly, 2 – performed task with minor mistakes, 3 – performed task accurately? I could collapse the levels to just 1 and 2 (accurate or inaccurate). I am just unsure of what is statistically correct.

Hi Sarah

You cannot answer that question with the information you have.

I am not sure I understand how to interpret the intercept in quantile regression (median regression, for example).

If we have quantitative and binary independent variables, can we interpret the constant as:

“the constant is the median value of y for individuals coded with 0 (for the binaries)”

or should we say

“the constant is the mean of Q50 for individuals coded with 0”

Thank you for your answer.

Olivier.

Hi Olivier

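For median regression you want to say the former (about the median), keeping in mind that the constant refers to individuals for whom all independent variables, including the quantitative ones, equal 0. The latter is about OLS regression.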

Actually, I would like to know how to interpret the first coefficient (the one which is not linked to an explanatory variable) in two different cases:

1) In OLS case

2) In a quantile regression

Thank you again!

Hi Olivier,

See my answer, above. In quantile regression it’s a quantile. In OLS regression it’s a mean.
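As an illustration only, take the simplest possible case: a model whose only predictor is a single binary variable, so that the constant describes the group coded 0. The data below are made up, and the grid search is a deliberately crude stand-in for a real fitting routine. The constant that minimizes squared error is that group's mean, while the constant that minimizes absolute error (the median-regression criterion) is its median.

```python
import statistics

# Hypothetical outcome values for the individuals coded 0 on the binary variable.
y0 = [1.0, 2.0, 3.0, 4.0, 100.0]

def squared_loss(c):
    """OLS criterion for a constant c fitted to group 0."""
    return sum((v - c) ** 2 for v in y0)

def absolute_loss(c):
    """Median-regression criterion for a constant c fitted to group 0."""
    return sum(abs(v - c) for v in y0)

# Crude grid search over candidate constants 0.0, 0.1, ..., 100.0.
grid = [i / 10 for i in range(0, 1001)]
ols_constant = min(grid, key=squared_loss)
median_constant = min(grid, key=absolute_loss)

print("OLS constant:", ols_constant, "group-0 mean:", statistics.mean(y0))
print("Median-regression constant:", median_constant, "group-0 median:", statistics.median(y0))
```

With quantitative predictors in the model as well, the same reading applies at the point where every predictor equals 0.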

Thanks for your answer,

Then in OLS, we focus on the mean to explain all the distribution and in Median regression, we use the median as a starting point to explain all the distribution. Am I right?

Olivier

Hello, I am writing a research proposal and am confused about how I am supposed to do the data analysis, i.e. am I supposed to use ANOVA, z-scores, etc.? My study is measuring the relationship between high-risk sexual behaviours, AmED consumption and sensation-seeking behaviours. The behaviours are measured on frequency scales, while alcohol consumption follows the Daily Drinking Questionnaire. The questionnaires will be presented on two different occasions. Please help!

If you would like to hire me to help you with data analysis, go to the main page of my website at http://www.statisticalanalysisconsulting.com

Problem statement: In responses collected from an online survey, how do I eliminate multiple responses from the same company?

I intend to get one (and only one) response from each company for analysis. I sent the survey out to a large distribution list, and there are multiple responses from single companies. What is the best process to pick one response before starting the analysis?

(e.g., Company 1 – 10 responses, Company 2 – 15 responses, Companies 3, 4, 5 – 1 each).

The unit of analysis is the company. How do I pick 1 response each from Company 1 and Company 2 so that the analysis is not biased?

Some thoughts

a. Pick a response randomly and ignore the others

b. Follow below strategy to narrow

i. Ignore partial responses

ii. Go through the answers and remove unworthy responses, like “Yes” for all, etc.

[inconsistency score . eg.,

iii. Choose based on responses to priority questions

iv. Finally choose based on designation

c. Apply some logic to group the responses into a single response

Please provide the best way to handle this issue and provide references

It would depend on the purpose of your whole research; I would have to do research after hearing more from you.

If you’d like to hire me, see the main part of my website.
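For what it is worth, option (a) from the question, picking one response per company at random, could be sketched as below. This is only a sketch under assumptions: the field name `company`, the record layout, and the fixed seed are all hypothetical, and whether random selection suits the purpose of the research is a separate question.

```python
import random
from collections import defaultdict

def pick_one_per_company(responses, seed=0):
    """Randomly keep one response per company; a fixed seed makes the pick reproducible."""
    rng = random.Random(seed)
    by_company = defaultdict(list)
    for response in responses:
        by_company[response["company"]].append(response)
    return {company: rng.choice(group) for company, group in by_company.items()}

# Hypothetical data shaped like the example in the question:
# Company 1 has 10 responses, Company 2 has 15, Companies 3-5 have 1 each.
sizes = {"Company 1": 10, "Company 2": 15, "Company 3": 1, "Company 4": 1, "Company 5": 1}
responses = [
    {"company": name, "respondent": i}
    for name, n in sizes.items()
    for i in range(n)
]

selected = pick_one_per_company(responses, seed=0)
for company, response in selected.items():
    print(company, "-> respondent", response["respondent"])
```

Recording the seed alongside the analysis lets someone else reproduce exactly which responses were kept.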

I have been given an assignment to creatively formulate a problem, collect the relevant information and data, perform any relevant statistical analysis (using SPSS if necessary), and write a report. Please can you help?

Hi Ada – I don’t have SPSS, but I can do analysis in R or SAS. If you’d like to hire me, you can contact me from the main part of my website.

I have a question about which model I should use/how to structure my data in R.

1) I’m hoping to run a regression using participation in two social welfare programs as my main independent variables to predict whether or not an individual decides to migrate. The idea is that these programs may address why these people decide to migrate and ultimately may reduce the number of people who decide to move.

2) I’ve had a bit of grad level education in statistics for the social sciences. Right now, I’m thinking I should run a logistic, mixed effects model that allows the intercept and coefficients to vary, but I’m not totally sure if that is right.

3) My data comes from the Mexican Family Life Survey. I am really not sure how to deal with it. There are three levels: individual, household, and community. I also have two time periods (two survey waves, panel), which I think I need to account for. The survey is broken up into multiple booklets, each of which centers on a certain theme of questions. Each booklet has its own weights to make the survey more representative of the population. I don’t know how to incorporate these either. I’ll be using R to do this analysis.

DV: migration at the time of survey wave 2 (yes, no)

IV: participate in social welfare program (yes, no), + other covariates (age, sex, education, income, etc.), each has two observations, one at wave 1 and one at wave 2.

I really just don’t know for sure what model I should be running or how to structure a dataframe in R to incorporate the time aspect of the data and the weights.

4) I think I just need some pointers in the right direction.

Can you please help? I’m feeling quite lost.

Hi Christina

It sounds like you are on the right track regarding the model. It’s not possible to fully answer your questions about how to structure the data set without my looking at your data and then taking the time to figure it all out. This would require hiring me.

However, you can find considerable help online and in books. There is an r-help list: https://stat.ethz.ch/mailman/listinfo/r-help. There is CRAN: http://cran.us.r-project.org/. If you click on “contributed” you will find info about a couple of things re the nlme package. You can also Google nlme.

Peter

I am a genetics student doing a multinomial logistic regression analysis, and I wanted to know whether I am applying the correct statistics.

I have three groups (the categories), one continuous variable, one ordinal variable and two categorical variables for all three groups. I have taken category as the dependent variable, the continuous variable as a covariate, and the other variables as factors.

Can you please help me with this?

Hi Lavanya – if your dependent variable is categorical, then multinomial logistic regression is certainly one good method.

I am an epidemiology graduate student working on a project exploring the association between early life stress and late-life cognitive performance. I have survey data from a large longitudinal panel study (n = ~4000). Each individual took a cognitive performance instrument (the TICS-m), and my main exposure variable of interest is a continuous variable representing the individual’s severity of early life stress (as measured by traumas, abuse, neglect, etc.). So my main hypothesis is early life stress –> poor cognitive performance.

The outcome variable is the score on the cognitive instrument dichotomized ( poor educational attainment –> poor cognitive performance in late life so at first I thought of not controlling for it, since it seems to me like a mediator instead of a confounder. On the other hand, it is well documented that the TICS-m has obvious ceiling and floor effects associated with education. Older folk who achieved advanced degrees earlier in life will always do relatively well on the cognitive battery even when they are cognitively impaired, and those older folk who have very little education will do relatively badly on the battery even when they do not have any cognitive impairments. This is making me want to control for subjects’ education, but I’m afraid of running into endogeneity problems. Education –> cognitive status, but also cognitive status –> education. There are ways to control for the endogeneity, but given the cross-sectional nature of the study and the fact that this is more exploratory, since there is not that much research in this field, I prefer not to use a sophisticated causal inference technique. So I guess my question to you would be, what should I do with education? Any advice would be much appreciated. THANKS!

Sorry the previous comment had some sentences missing

Hi Adan,

First, it is almost certainly wrong to dichotomize the TICS-m.

Second, if it is the outcome it cannot be either a control variable or a mediator.

Third, education sounds like something you certainly have to control for; it might also be a mediator or a moderator.

Peter

I am trying to settle an argument at work. My background is one undergrad statistics class. If the odds of winning the lottery are say 1 in 1,000,000, would buying a second ticket make the odds 2 in 1,000,000, reduced to 1 in 500,000? The question is how can one additional ticket reduce the odds by half? If not, how can I pose the answer in a way my coworker will understand? Thanks for any help, Ryan

Yes, you are right. Think of simpler situations. What is the chance that a coin will land heads? 1/2. That it will land heads or tails? 1/2 + 1/2 = 1. What about a die? That it will land on 1? 1/6. That it will land 1 or 2? 1/6 + 1/6 = 2/6. Same with lottery tickets, as long as the two tickets carry different numbers in the same draw.
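The arithmetic can be checked with exact fractions. This sketch assumes both tickets are for the same draw, carry different numbers, and that every one of the 1,000,000 outcomes is equally likely; the function name is just for illustration.

```python
from fractions import Fraction

def win_probability(distinct_tickets, equally_likely_outcomes):
    """Chance that one of the tickets held matches the single winning outcome."""
    return Fraction(distinct_tickets, equally_likely_outcomes)

one_ticket = win_probability(1, 1_000_000)
two_tickets = win_probability(2, 1_000_000)

print("one ticket: ", one_ticket)    # 1/1000000
print("two tickets:", two_tickets)   # 1/500000
print("ratio:", two_tickets / one_ticket)
print("absolute gain:", float(two_tickets - one_ticket))
```

The second ticket doubles the chance in relative terms, but the absolute gain is only one in a million, which may be the easiest way to put it to a coworker.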

What is the standard deviation for the following:

Group A – Out of 37 persons, 10 persons’ bags were not inspected; Group B – Out of 27 persons, 17 persons’ bags were not inspected; Group C – Out of 10 persons, 6 persons’ bags were not inspected

Hi Sam,

This looks like a homework problem. I don’t do those for people, sorry. But if you want to get answers you could submit it to http://stats.stackexchange.com/questions but they will want you to put in some work before they give an answer.

Peter

Good day,

I have a general question regarding multiple regression. If a variable does not show to have a significant relationship with the outcome variable when checking for bivariate correlations (before running the regression), is it possible to find a significant effect (in the coefficient table) of that IV on the outcome variable when added to the multiple regression?

Hi Monika

Yes, it is. The practice of bivariate screening is common, but not good. Not only can a variable that was not sig bivariately be sig. in a multiple regression, but it can also have effects on other parameters.
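A simulated example can make this concrete. The sketch below is an illustration under made-up assumptions, not anyone's real data: `x2` is built to be unrelated to `y` on its own, yet it gets a coefficient near -1 once `x1` is in the model, because `x1` contains `x2` as a nuisance component. The helper functions are written out by hand so that only the standard library is needed.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def corr(a, b):
    """Pearson correlation."""
    return cov(a, b) / (cov(a, a) * cov(b, b)) ** 0.5

def two_predictor_slopes(y, x1, x2):
    """OLS slopes for y ~ x1 + x2, from the 2x2 normal equations on centered data."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(42)   # fixed seed so the simulation is reproducible
n = 2000
z = [rng.gauss(0, 1) for _ in range(n)]       # the part of x1 that drives y
x2 = [rng.gauss(0, 1) for _ in range(n)]      # unrelated to y on its own
x1 = [a + b for a, b in zip(z, x2)]           # x1 mixes the signal z with x2
y = [a + rng.gauss(0, 0.5) for a in z]        # y depends on z only

r_x2_y = corr(x2, y)
b1, b2 = two_predictor_slopes(y, x1, x2)

print("bivariate correlation of x2 with y:", round(r_x2_y, 3))
print("multiple regression slopes (x1, x2):", round(b1, 3), round(b2, 3))
```

A bivariate screen would drop `x2`, and the model with `x1` alone would then fit noticeably worse than the model with both.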

Peter

Hello,

My name is Jacob Miller and I am having trouble thinking of a model for measuring the probability of a positive ID.

For example, let’s say I have n samples, each with x identifiable traits, who are either control or treatment, and either positive or negative.

Now, let’s assume that there is a recipient for the data. All they see is the average number of controls who were positive, controls who were negative, treatments who were positive, and treatments who were negative.

Let’s set a lower bound for the number of positive observations in a cohort (y >= 2). What is the probability that the recipient of the aggregated data could get a positive ID on one of the positive subjects, assuming the worst possible scenario (that there were only two positive observations)?

What process should I use to establish the probability of identification? Basically, what I want to know is, given a certain number of samples in treatment and control groups, who can read as either positive or negative, what is the probability that a recipient of the aggregated results could identify a single test subject between two time points?

For example:

Company A wants to know if its new ad campaign is working. It has one store location. It has sales data from before the start of the campaign. Firm A offers to monitor the effectiveness of the campaign. They have tools to measure whether an individual sees an ad, and also whether that turns into a purchase. However, they will only send reports with the aggregates of n individuals. What is the probability that, given that Company A also has individual sales data, a change in the aggregates sent by Firm A could result in the positive identification of a customer of Company A?

Thanks!

Sorry, but I don’t know. Peter