To get a good answer, you must write a good question. Answering a statistics question without context is like boxing blindfolded. You might knock your opponent out, or you might break your hand on the ring post.

What goes into a good question?

**1.** Tell us the PROBLEM you are trying to solve. That is, the substantive problem, not the statistical aspects.

**2.** Tell us what math and statistics you know. If you’ve had one course in Introductory Stat, then it won’t make sense for us to give you an answer full of mixed model theory and matrix algebra. On the other hand, if you’ve got several courses or lots of experience, then we can assume you know some basics.

**3.** Tell us what data you have, where it came from, what is missing, how many variables, what are the Dependent Variables (DVs) and Independent Variables (IVs) – if any, and anything else we need to know about the data. Also tell us which (if any) statistical software you use.

**4.** Are you thinking of hiring a consultant, or do you just want pointers in some direction?

**5.** THEN, and ONLY THEN tell us what you’ve tried, why you aren’t happy, and so on.

I specialize in helping graduate students and researchers in psychology, education, economics and the social sciences with all aspects of statistical analysis. Many new and relatively uncommon statistical techniques are available, and these may widen the field of hypotheses you can investigate. Graphical techniques are often misapplied, but, done correctly, they can summarize a great deal of information in a single figure. **I can help with writing papers, writing grant applications, and doing analysis for grants and research.**

**Specialties:** Regression, logistic regression, cluster analysis, statistical graphics, quantile regression.

You can **click here to email** or reach me via phone at 917-488-7176. Or if you want you can follow me on Facebook, **Twitter**, or LinkedIn.

Statistical Methods for Calculating Vending Machine Refill

I am looking into statistics to help support a project I am undertaking. The project scope concerns intelligent replenishment / refill of vending machines.

During an onsite service, a technician must make decisions regarding machine refill to optimise sales for the period leading up to the next refill visit. The vending machines have no call back to base feature.

The technician must…

• maximise sales by specifically stocking and selling those items that demonstrate superior sell through (this may vary over time and seasonally)

• limit the restocking of high-value items like cigarette packets, or perishable items with use-by dates, so that oversupply does not occur, but balance this by…

• ensuring sales are not lost through insufficient restocking of inventory items

The statistical model or programmatic code needs to consider the following real world factors…

• Factors impacting short term sales (seasonal factors, gazetted holiday, localised events, machine inventory levels)

• Factors impacting long term sales trends (long term seasonal factors, machine location)

• Sales performance per bay location within a vending machine

• Item properties including use by dates, purchase price, sales price

I was wondering whether a definitive, highly accurate statistical method exists to achieve these outcomes. I was planning to construct an array of intelligent algorithms to handle these factors, intended to run on PDA-style smart client devices.

A colleague suggested modelling data using Poisson regression but I gather that is just a starting point to overcome all the interwoven variables?
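As a very rough starting point for the Poisson idea mentioned above, here is a minimal sketch. It assumes that demand for one item between two visits is Poisson with a known mean (the hypothetical `mean_demand`), and it picks the smallest stock level that covers demand with a chosen probability (the hypothetical `service_level`). It ignores seasonality, bay position, and use-by dates, all of which a real model would need.

```python
import math

def poisson_cdf(k: int, mean: float) -> float:
    """P(X <= k) for X ~ Poisson(mean)."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))

def stock_level(mean_demand: float, service_level: float = 0.95) -> int:
    """Smallest stock q such that P(demand <= q) >= service_level."""
    q = 0
    while poisson_cdf(q, mean_demand) < service_level:
        q += 1
    return q

# Hypothetical item that sells about 10 units between visits.
print(stock_level(10, 0.95))
```

In practice the mean would itself come from a regression on the factors listed in the question, and a lower service level might be chosen for expensive or perishable items.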

Good Day,

I am Zhanar and I am an MBA student. I am now at the research stage of my master’s thesis and I have several questions regarding factor analysis steps and SPSS interpretation. I have used the Job Satisfaction Survey in my research, and I wonder whether the steps of my factor analysis are correct. I transferred all the survey data to SPSS, ran the factor analysis, and received output for interpretation. The Kaiser-Meyer-Olkin Measure of Sampling Adequacy, Bartlett’s Test of Sphericity, and Cronbach’s Alpha confirmed the reliability of the data. I obtained 5 factors, which is fine. However, I have a variable that loads highly on two factors at the same time, with a positive coefficient on one and a negative coefficient on the other. How can I interpret this?

Secondly, once SPSS has produced 5 factors in the Rotated Component Matrix at the final stage, should I do some further analysis per factor?

Thanks in advance for any recommendation and information about my task,

Regards,

Zhanar

It is common for a variable to load highly on two factors. In an ideal world, each variable would load highly on one and only one factor, but… life is not ideal! It means that the variable is part of two latent variables or factors. The sign of the loading is fairly arbitrary.

As to further analysis, it depends on your goals and what you are trying to do and so on.

Peter
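To illustrate the point about the sign being arbitrary, here is a small sketch with made-up loadings (3 variables, 2 factors; none of these numbers come from the question). Flipping the sign of a whole factor leaves the reproduced correlations unchanged, so the data cannot tell the two versions apart.

```python
def reproduced(loadings):
    """Common-variance part of the correlation matrix: loadings times their transpose."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in loadings]
            for row_i in loadings]

# Hypothetical loadings: 3 variables on 2 factors.
loadings = [[0.8, 0.1], [0.7, -0.6], [0.2, 0.9]]

# Reflect factor 2 by flipping the sign of its whole column.
flipped = [[f1, -f2] for f1, f2 in loadings]

same = all(abs(x - y) < 1e-12
           for r1, r2 in zip(reproduced(loadings), reproduced(flipped))
           for x, y in zip(r1, r2))
print(same)
```

What matters for interpretation is the pattern of signs within a factor, not whether any single loading is positive or negative.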

Is it logical to say that you can increase statistical power by having a highly sensitive outcome measure? I say this because a highly sensitive test has low Type II error, and low Type II error leads to high power. In one of the articles I was reading, the authors’ power analysis revealed that they only needed 6 participants per group, and they believed that it was because their outcome measure was highly sensitive.

I have a moderate level of understanding of statistics (in the context of psychology), and am in my 3rd year of university.

Dear Sir,

I have seen a book example showing that the Friedman test can be used as a nonparametric analogue of ‘Two way ANOVA without replication’. My question is whether the Friedman test can also be used as a nonparametric analogue of ‘Two way ANOVA with replication’.

Thanks in advance and regards,

Alam

I don’t see why not.

We are two graduate students who got involved in an interesting project focused on cultural values and their influence on social media. The study is built around 15 hypotheses that we have worded based on our theoretical framework. Since neither of us can be called a stat genius, though one of us did complete an introductory class on statistics, we’re writing here to ask how best to go about handling the hypothesis testing.

Here is an example of a hypothesis: Low context cultures will use more factual communication than high context cultures

The data is all the textual information across 8 Facebook walls throughout the world. The definition of Low/High context cultures is made clear by our theory, and the same goes for factual communication. From here, we have counted the occurrences of factual communication on the walls. This enables us to calculate a frequency for each wall, but this will hardly be substantial enough.

Our question is this. What approach would you suggest taking using descriptive statistics and what approach using inferential statistics?

I hope it makes sense, and thanks in advance!

Best regards, Denmark 🙂

Sounds like you could use crosstabulations for descriptive statistics and a bunch of logistic regressions for inferential statistics.

Peter

Hi Long, sorry for the late reply, something was wrong with my website. Yes, this is exactly so: A more sensitive outcome measure will increase power, other things being equal.

Peter
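A rough way to see this: under the usual normal approximation for comparing two group means, the sample size per group is about 2(z_{1-α/2} + z_{power})² / d², where d is the standardized effect size. A more sensitive measure has less noise, so the same raw difference gives a larger d. The sketch below uses that approximation (an assumption here, not the method used in the article the question mentions); exact t-based calculations give somewhat larger numbers for small groups.

```python
import math
from statistics import NormalDist

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per group for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

# Hypothetical effect sizes, from moderate to very large.
for d in (0.5, 1.0, 2.0):
    print(d, n_per_group(d))
```

So a handful of participants per group is only plausible when the standardized effect is very large.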

Hi Barry

Sorry I did not get to this, something was wrong with my website. Do you still need help? Peter

Hi Yared

Sorry I did not get to this, something was wrong with my website and I didn’t get notified. Do you still need help?

Peter

I have obtained a very large odds ratio and standard error for my interaction term (interaction of 2 IVs) after running a binary logistic regression. My study has 4 conditions and 3 dependent variables; however, 1 of the dependent variables is categorical, so I was told by my supervisor to conduct a binary logistic regression for that DV separately while conducting a MANOVA for the other variables. The DV is a simple yes or no answer to signing up to a fictional follow-up study. I have 52 participants. Overall, 44 participants responded yes and 8 responded no. I am aware that the odds ratio may have been so large because I have a very small number in some of my cells. How would I report a very large odds ratio according to APA guidelines in a table? Also, should I run a different analysis since my odds ratio is so large? The odds ratio is 6 figures, so it is huge! Any help would be appreciated. This is for my undergraduate psychology dissertation.

If you have only 8 people saying “No” then you will probably be overfitting the model with 2 variables plus an interaction. You should really look at only one variable at a time. You quite possibly have quasi-complete or complete separation. So, sorry, but you can’t fit the model you want to the data you have.

Reporting doesn’t vary with the size of the OR.
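A quick check for separation is to cross-tabulate the outcome against each combination of the two IVs and look for cells where only one answer occurs. The sketch below does that on made-up data: the totals (52 participants, 44 yes, 8 no) match the question, but the split across cells and the names `iv1`, `iv2` are hypothetical.

```python
from collections import defaultdict

def cells_without_both_outcomes(rows):
    """rows: iterable of (iv1, iv2, outcome) with outcome 0 or 1.
    Returns the (iv1, iv2) cells in which only one outcome value occurs."""
    seen = defaultdict(set)
    for iv1, iv2, outcome in rows:
        seen[(iv1, iv2)].add(outcome)
    return sorted(cell for cell, outcomes in seen.items() if len(outcomes) < 2)

# Hypothetical 2 x 2 design with 44 "yes" (1) and 8 "no" (0) answers.
rows = ([("a", "x", 1)] * 12 + [("a", "x", 0)] * 3 +
        [("a", "y", 1)] * 13 +
        [("b", "x", 1)] * 9 + [("b", "x", 0)] * 5 +
        [("b", "y", 1)] * 10)

print(cells_without_both_outcomes(rows))
```

Any cell listed is a warning sign: the model can push the corresponding coefficient toward infinity, which shows up as an enormous odds ratio and standard error.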

Hi,

can you please explain why a set of homogeneous data need no statistical treatment? I need your help. Thanks

If the data are completely homogeneous, there is nothing to analyze.

Hi,

I am trying to perform multinomial logistic regression for upsell modelling.

But the proportion of non-responders in my dataset is very high, ~95%. The rest of the data is divided among 4 responding categories.

After building the model, while assigning each observation to the category with the maximum predicted probability, I observed that all the observations were marked as non-responders, because the predicted probability for the non-responding category is highest for every observation.

This is leading to wrong predictions.

Can you please suggest how I can correct this?

What is probably happening is that your independent variables aren’t doing a very good job. But, depending on which program you are using, you can output the probabilities of being in different categories and then use whatever cutoffs you like.
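As a sketch of the cutoff idea: instead of taking the category with the highest probability, assign a responding category whenever its predicted probability reaches a cutoff you choose, and fall back to non-responder otherwise. The probabilities, category names, and cutoffs below are all hypothetical; in practice the cutoffs would be tuned on a holdout sample.

```python
def classify(probs, baseline="none", cutoffs=None):
    """probs: dict mapping category -> predicted probability.
    Pick the responding category that most exceeds its cutoff (ratio p / cutoff),
    falling back to the baseline category if none reaches its cutoff."""
    cutoffs = cutoffs or {}
    candidates = {c: p / cutoffs[c] for c, p in probs.items()
                  if c != baseline and c in cutoffs and p >= cutoffs[c]}
    return max(candidates, key=candidates.get) if candidates else baseline

# Hypothetical predicted probabilities for one customer.
probs = {"none": 0.88, "A": 0.07, "B": 0.03, "C": 0.01, "D": 0.01}
cutoffs = {"A": 0.05, "B": 0.05, "C": 0.05, "D": 0.05}

print(classify(probs))                   # maximum-probability style result
print(classify(probs, cutoffs=cutoffs))  # result with custom cutoffs
```

This changes who gets flagged, but it does not make weak predictors stronger, so it is worth checking how well the flagged cases actually respond.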

Is there a site I can subscribe to where I can ask random statistics questions that come up as I take an online stats course? For example, I am having problems ensuring my t-statistic is correct in a confidence interval example.

An analyst wishes to formulate a 95% confidence interval for the mean length of stay for MS DRG 291. A sample of size 15 resulted in a sample mean of 4.5 and sample standard deviation of 1.7. I see the two-tailed t-statistic as 2.145, but the example I am using from our text says it should be 1.96 for df 14. So I’d like to know if I can ask someone what I am doing wrong.

There are places like talkstats or CrossValidated.

Or, if you want to hire me, then let me know.
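On the confidence interval example itself: 2.145 is the two-sided 95% critical value of the t distribution with 14 degrees of freedom, while 1.96 is the corresponding value for the standard normal distribution. The sketch below works out both intervals from the numbers in the question, so the two can be compared. Which one the textbook intends is not something I can tell from here.

```python
import math

n, mean, sd = 15, 4.5, 1.7          # values from the question
se = sd / math.sqrt(n)              # standard error of the mean

def interval(critical_value):
    """Confidence interval mean +/- critical_value * se, rounded to 2 decimals."""
    margin = critical_value * se
    return (round(mean - margin, 2), round(mean + margin, 2))

t_interval = interval(2.145)  # t critical value, 14 df, two-sided 95%
z_interval = interval(1.96)   # standard normal critical value
print(t_interval, z_interval)
```

With a sample of 15 and a standard deviation estimated from the sample, the t-based interval is the slightly wider of the two.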

I have 162 values of a discrete random variable. I want to do analysis, validation and, if possible, forecasting.

That’s not nearly enough information for me to be able to tell anything you can do or not do with this.

How large a sample would I need for a 38-item survey?

It is a 38-item questionnaire with a 5-point Likert scale measurement.

The population size is around 430 individuals.

Thank you,

The question is unanswerable as is. You need to specify what you want to do with the survey. What is your hypothesis?

I want to analyze data from a survey, but am not sure which statistical test to use. I am interested in looking at Age vs. Average # of Current Employers. The data I have are as follows:

60 years old: 3.9 employers (30 respondents)

Do you have any recommendations?

You probably want some form of regression.

I would appreciate some advice on a suitable method of analysis for my research project.

I have ratio data, with 4 different and increasing levels of abuse claimed by the suspect (IV) and the level of guilt assigned to the suspect (DV). I also have 2 sets of samples from 2 different regions.

So to test within each region, an ANOVA would be the most suitable option. I do assume this to be true, yes?

To compare the data from the 2 regions, my current idea is simply to run a two-tailed t-test between the 2 regions for each level of abuse claimed by the suspect. Is there another method that would produce more meaningful results? It just feels like an ANOVA and a t-test are too simple.

I would appreciate it if you could help me. I have data points plotted on a graph where the x axis is calendar year and the y axis is mortality rate. I have a trendline for these data points. I want to use the latest data point (probably the 2014 data) for further analysis. But I am still not sure whether I should use the actual value or the y value calculated from the trendline equation.

Thanks

What do you mean by “use the latest data point …. for further analysis?”

Hi Peter

I have a set of data on orders delivered to customers. For every order we predict a commit date on which the order will be delivered. We are measuring performance by comparing the committed date and the actual date.

To know if my predictions are correct, can I consider the commit date I predict as a forecast and measure forecast accuracy (MAD, MAPE, SSE) and bias (mean bias and confidence intervals)?

I have this doubt because the above measures are usually used to check the accuracy of a forecasted demand & in this case I’m forecasting/predicting the date by when a certain order will be delivered.

Please let me know your views on this.

Appreciate your help.

Thanks & Best Regards

N. Chandramouli

I don’t see any reason you couldn’t use MAD and such; the problem would be that number of days is an integer, rather than a continuous variable. This is only likely to cause any problems at all if the difference takes on very few values. In that case, median may be hard to determine exactly. In addition, things relating to the mean might be uninformative if delay time is highly skewed.
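A minimal sketch of these measures on delivery dates, using made-up orders: the error for each order is the number of days between the actual and the committed date. MAD is computed here as the mean absolute error, which is an assumption about which "MAD" is meant, and the median error is included because it is less affected by a few very late orders. MAPE is left out, since it is not obvious what the denominator of a percentage error would be for a date.

```python
from datetime import date
from statistics import mean, median

# Hypothetical (committed, actual) delivery dates.
orders = [
    (date(2015, 3, 2), date(2015, 3, 2)),
    (date(2015, 3, 5), date(2015, 3, 7)),
    (date(2015, 3, 9), date(2015, 3, 8)),
    (date(2015, 3, 12), date(2015, 3, 19)),
]

# Error in whole days; positive means the order arrived late.
errors = [(actual - committed).days for committed, actual in orders]

bias = mean(errors)                          # mean error (average lateness)
mean_abs_error = mean(abs(e) for e in errors)
median_error = median(errors)
print(errors, bias, mean_abs_error, median_error)
```

Here one order that is a week late pulls the mean error well above the median, which is the skewness issue mentioned above.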

Hi Peter. Urgently need your advice. I am doing an optimization study, for which I generated a list of runs using central composite design with ‘rotatable’ alpha. My study has three factors with two levels each, and one response variable. The generated design had two axial runs that gave negative (not doable) values for the factors. I disregarded these runs and conducted the rest. I have only learned now that that was a mistake.

Now I want to know if there is any way the data I obtained can still be valid even without the two negative axial runs. I think that in theory it should still be acceptable since the objective of my study is to maximize the response value, which is achieved by (generally) maximizing the levels of the factors. Therefore the loss of the two negative runs should not have much of an effect. Or, if there is any way to use the data that I already have to fit a model without having to conduct further experiments, because I really have no more time to do so. At this point I can only make do with what I have.

I really hope you will be able to help me on this matter, I am quite desperate. Thank you very much.

Sorry, but I don’t know the answer to this.

I need to list all the cases where the value exceeds the Cook’s distance cutoff. I have already calculated Cook’s distance and saved the new variable. I can identify in the data view which cases exceed the value, but I don’t know how to make a table or list of only the relevant cases (CD > the relevant value) in the output. My apologies for the poor grammar, but I really could use some help.

It depends on your software.
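Whatever the software, the logic is just a filter on the saved variable followed by a sort. As a language-neutral illustration, here is a sketch in Python with made-up case IDs and distances; the cutoff of 0.25 is a placeholder, so substitute whatever value you were given.

```python
# Hypothetical saved Cook's distances, keyed by case ID.
cooks_d = {101: 0.02, 102: 0.31, 103: 0.05, 104: 0.44, 105: 0.01, 106: 0.12}

cutoff = 0.25  # placeholder cutoff, not a recommendation

# Keep only the cases above the cutoff, largest distance first.
flagged = sorted(((case, d) for case, d in cooks_d.items() if d > cutoff),
                 key=lambda item: item[1], reverse=True)

for case, d in flagged:
    print(f"case {case}: Cook's D = {d:.2f}")
```

Most statistics packages have an equivalent "select cases if" or filter step that can be followed by a simple case listing.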

I am currently working on an assignment in SPSS and I’m stuck on which type of analysis to use. I am sorry if I sound ignorant on the topic, as I am quite new to the program and to statistics.

I have a set of data on violence in Uganda, and I have set up variables and values and categorized them as nominal, ordinal, or ratio/interval. Now, I want to see if there is a relationship between age, gender, and level of education and knowing someone who was a victim of violence, knowledge of the acid and burn victims (ABV) concept, and knowledge of the corresponding punishments for such acts. I would like a separate analysis for each variable. Which sort of statistical analysis should I use for such data?

Thank you for the help.

Please let me know if you need further information.

Sounds like you need logistic regression.

Hi, will you be able to advise on this please? I used Tukey’s method to detect outliers, and after detecting them I removed them. But when I performed the test again, I found outliers again. Am I doing something wrong? Many thanks.

There is no simple method to detect outliers and simply removing them because they are outliers is a mistake.

How to deal with outliers depends both on what you are trying to do and on what sort of outliers they are.
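As for why new outliers appear after removal: Tukey’s fences are built from the quartiles, and removing extreme points shrinks the interquartile range, so the fences move inward and can flag points that were previously inside. The sketch below shows this on made-up numbers, assuming the usual 1.5 × IQR rule and one particular quartile definition (`method="inclusive"`); other quartile definitions can give slightly different fences.

```python
from statistics import quantiles

def tukey_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical data with one extreme and one moderately large value.
data = [10, 11, 12, 13, 14, 15, 16, 17, 18, 24, 40]

passes = []
while True:
    flagged = tukey_outliers(data)
    if not flagged:
        break
    passes.append(flagged)
    data = [v for v in data if v not in flagged]

print(passes)  # [[40], [24]]
```

The second pass flags 24 only because 40 was deleted first, which is one reason repeated automatic removal is a poor idea.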

Hi,

I wanted to do a PERMANOVA analysis to find the difference/variance between my sample groups. But my data are not homogeneous (I performed a permutation test for homogeneity of multivariate dispersions in R), and the number of replicates in each group also differs. For example, three regions have different sample sizes: Region A = 36, B = 48, and C = 54. What is the best method to use here? Thanks.

You can use a regression method that does not rely on homogeneity of variance, e.g. quantile regression or robust regression.

Hi,

My question is about the Mann-Kendall test. I need to evaluate precipitation data to see whether the amount of precipitation during the last 50 years at my study site follows a trend, and whether the trend is increasing or decreasing. But I don’t know how to calculate the Mann-Kendall test in SPSS. I have studied different books about SPSS, but there is no such test in them. I was wondering if you could help me.

Sorry, but I don’t use SPSS.

Peter
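The test itself is simple enough to compute outside SPSS. Below is a minimal sketch of the basic Mann-Kendall test in Python, using the normal approximation with a continuity correction. It assumes no tied values and no serial correlation (both would need corrections that are not included here), and the precipitation figures are made up.

```python
import math
from statistics import NormalDist

def mann_kendall(series):
    """Basic Mann-Kendall trend test (no tie or autocorrelation correction).
    Returns (S, z, two-sided p-value); S > 0 suggests an increasing trend."""
    n = len(series)
    s = sum((series[j] > series[i]) - (series[j] < series[i])
            for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return s, z, p

# Hypothetical annual precipitation totals (mm).
precip = [612, 598, 640, 655, 630, 671, 690, 664, 702, 715]
print(mann_kendall(precip))
```

The sign of S (or z) gives the direction of the trend, and the p-value indicates whether it is distinguishable from no trend.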

Discrete Time Survival Analysis

1- Is it necessary for time-dependent covariates to test any violation of the proportional hazards assumptions?

2- What is the difference between panel logistic regression and discrete time survival analysis except including a baseline hazard?

Really, If I include time dummy variables (12 dummy variables) in the model, it would take a long time to implement or it reports not concave.

3- Is it necessary to do Hausman, Breusch-Pagan, Chow tests in order to determine fixed effects or random effects should be used?

4- In Stata, why is the result of implementing random effects parametric survival model different from that of implementing random effects panel logistic? I know that the former reports hazard rate but the latter reports odds ratio. However, the significance of predictors also are different.

a- statistics-longitudinal/panel data-binary outcomes-logistic regression

b- statistics-longitudinal/panel data-parametric survival regression.

Is it necessary for both to include baseline hazard function (ln(time))?

According to the following link, it seems that logit command is used in conjunction with temporal dummy variables, which is referred to as non-parametric because we used temporal dummy variables. If xtlogit is used, there is no need to use temporal dummy variables, so in this way discrete time survival analysis will be the same as unbalanced panel logistic.

Besides, heterogeneity is just applicable to random effects. Also, it is suggested that cloglog is preferred over logit.

http://www.stata.com/statalist/archive/2014-03/msg00027.html

Besides, I read here (http://stats.stackexchange.com/questions/73355/duration-analysis-of-unemployment) that Chamberlain’s estimator (which should be used for discrete time survival analysis) is not currently implemented in R.

Please find the attached file, which is part of my bankruptcy data. In fact, the real number of my covariates could be large, so some kind of variable selection approach should be adopted. I also found this related post (http://www.ibm.com/support/knowledgecenter/en/SSLVMB_22.0.0/com.ibm.spss.statistics.cs/spss/tutorials/genlin_ulcer_howto.htm#genlin_ulcer_howto) for SPSS.

I am so sorry for asking a lot, but really I am too confused. I couldn’t find any material which could provide answers to my questions.

Thanks in advance.

Best regards,

Hi Ebrahami

I don’t use Stata or SPSS, so I can’t say what those programs do.

1. Yes, you need to test proportional hazards.

2. Survival analysis is designed to deal with censored data; logistic regression is not.

3. I think you should determine random vs. fixed based on whether the effects are random or fixed, not on statistical measures like the ones you mention.

Peter
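On point 2, one common way discrete-time survival models handle censoring is by expanding the data into person-period form before fitting a logit or cloglog model: each subject contributes one row for every period it was at risk, and a censored subject simply never gets an event row. The sketch below shows that expansion on made-up firms. It is an assumption about how such an analysis might be set up, not a description of what Stata or SPSS does.

```python
def person_period(subjects):
    """subjects: list of (id, last_period_observed, event_occurred).
    Returns one row per subject per period at risk: (id, period, event)."""
    rows = []
    for sid, last_period, event in subjects:
        for period in range(1, last_period + 1):
            # The event indicator is 1 only in the final period of a subject
            # that actually experienced the event.
            rows.append((sid, period, int(event and period == last_period)))
    return rows

# Hypothetical firms: firm 1 goes bankrupt in period 3,
# firm 2 is censored after period 2.
rows = person_period([(1, 3, True), (2, 2, False)])
for row in rows:
    print(row)
```

The period column is what the time dummies (or a function such as ln(time)) would be built from.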