Statistics help for people writing dissertations

I can help you complete your dissertation by helping you with statistics.

What statistics should I use?

You've decided on a topic and have some idea of your research questions and hypotheses. But what next? How can your hypotheses be properly tested and your research questions answered? What are the right statistical tools to use? Together, we can figure it out. Maybe it's regression (but what kind?). Maybe it's t-tests or ANOVA. Or maybe it's something you've never taken a class in, such as cluster analysis, factor analysis, or a multilevel model. I know about these methods and I can help you figure out which one is right for you. All this can help you complete the "Methods" chapter of your dissertation.

How many subjects do I need?

Before you start gathering data, you need to know how much data to gather. With too few subjects, you are unlikely to find statistical significance even if your theory is correct. But if you try for too many subjects, you waste time and money. Deciding on the right number is called power analysis, and most dissertation committees require it. I can help you do it. [/one_third] [one_third_last]

How can I do the analysis?

Depending on your needs and what your mentor and committee allow, I can provide any level of help in doing the analysis and writing it up. If you do the analysis, I can help you understand the output. If I do the analysis, I can send you the output, I can annotate it or I can help with writing. And that will help with the "Results" chapter.


Let Me Introduce Myself
Peter Flom
I specialize in helping graduate students and researchers in psychology, education, economics and the social sciences with all aspects of statistical analysis. Many new and relatively uncommon statistical techniques are available, and these may widen the field of hypotheses you can investigate. Graphical techniques are often misapplied, but, done correctly, they can summarize a great deal of information in a single figure. I can help with writing papers, writing grant applications, and doing analysis for grants and research.

Specialties: Regression, logistic regression, cluster analysis, statistical graphics, quantile regression.


Client Testimonials

  • Gail was doing research for a District Reading Assessment and needed Peter to help decipher data from bilingual and linguistic text use research.

    On Speed, Efficiency as well as Proactiveness when responding to requests:
    Dr. Peter Flom usually gets back to me within 2 to 6 hours of my email request. Unbelievable, and beyond what I have experienced with other statisticians.
    On Attention to Detail and Thoroughness:
    Dr. Peter Flom looks closely at the data I provide. He asks questions so he can be sure to analyze the results correctly. For example: Dr. Flom is currently looking at what is called Developmental Reading Assessment (DRA)...noting the research data currently being analyzed does not quite "make sense" (meaning children able to read more words, are scoring lower on the DRA). Peter cares about finding through his expert feedback, questions, and dialog that the DRA assessment has issues with reliability and validity. Dr. Flom wants research, our research to be high quality!
    On how Peter has helped Gail learn:
    Okay...I am a dummy at statistics and statistical analysis. Dr. Peter Flom never makes me feel that way....all my questions are welcomed and answered. What I love about Dr. Peter Flom is how he uses analogies. "What is an effect size?" I ask. Dr. Peter Flom explains, "Okay, say your research is on a diet pill...people taking the pill lose .1 of a pound in 3 months." This "effect" size is minimal.
    On Providing Recommendations
    Dr. Peter Flom has not only kept up with emails when reviewing my research. He is available via phone to talk with me about data and research. Dr. Peter Flom has taken the time to read grant proposals, dissertations, and other parts of a huge research.
    I asked Gail what she liked most about working with Peter. Her response:
    Dr. Peter Flom has the rare combination of being a genius at statistical analysis with the ability to explain the statistics to those (like me) who do research as a passion on a subject... who do not really understand quantitative analysis. Dr. Flom cares, cares about the research integrity!
    All in all, Gail said yes, she would recommend Peter to her friends and colleagues.

Frequently Asked Questions


What can I get from the FREE 30 minute consultation?

[box type="shadow"]In your initial consultation, you can ask me questions about my background, explain your ideas as they stand, discuss whether I am the right person to assist you, and, if I am, how we can best work together. If not, I may be able to recommend someone else.[/box]

What should I do before my initial consultation?

[box type="shadow"]It is best to come to see me early in the dissertation process, but I can help you at any stage. All you need do to prepare is have an idea you want to discuss.[/box]

Can I get help with a small project?

[box type="shadow"]No project is too small. Some dissertations involve only relatively simple statistics – if a simple research method answers your questions, then that is the method to use. However, I am not always the right consultant for you. While I know a lot about statistics, no one knows it all. In the initial conversation, we will determine whether I am a good fit.[/box]

What will this cost?

[box type="shadow"]I have an hourly fee. The more work you need from me, the higher the cost. I bill in small increments, so that neither you nor I will ever be surprised, and I consult with you frequently, keeping you updated every step of the way. Although my time is not cheap, you should consider that 1) You are hiring an experienced professional, with more than a decade of experience and 2) You have spent many thousands of dollars and years of time getting to this point. A PhD is much better than an ABD.[/box] [/one_half] [one_half_last]

How will the work be delivered to me?

[box type="shadow"]I can deliver work in almost any format. I can give you raw output, annotated output, or an outline of the methods and results chapter. One thing I do not do is deliver fully written methods and results sections. Not only is that a key part of the dissertation process, but it leaves you with material you may not understand, and that you may not be comfortable defending before your committee. You don’t need to know every detail of every analysis, but you need to understand the key elements. I work as a partner of my clients, not a “hired gun”.[/box]

Will I have to cite Peter Flom in my dissertation?

[box type="shadow"]No, but if you like the work I do, I appreciate referrals. I do require that you inform your committee, early in the process, that you have hired a consultant.[/box] [/one_half_last]

Research Methods

[one_half] [learn_more caption="Cluster Analysis"] Cluster analysis is a set of techniques designed to answer the question “How do my subjects cluster together on the variables that I have?” This is a question that is often interesting but too rarely asked, because many applied researchers have not heard of the methods involved. Suppose, for example, you had data on the votes of members of a congressional or parliamentary body. Cluster analysis would allow you to ask whether there were groups of politicians who voted similarly. In systems with strong party voting, you could do this within particular parties. Or you might look at the whole body and see whether some members of particular parties were outliers within their party. Or suppose you had the answers of a group of subjects to all the questions on a questionnaire, personality measure, or test of ability. You could use cluster analysis to see whether people grouped together on their responses to some items. Perhaps some particular set of items identifies a key group of people. There are a great many methods of cluster analysis, but they all attempt to answer this basic question. [/learn_more]
[learn_more caption="Dichotomous Logistic"] In logistic regression, the goal is the same as in linear regression: we wish to model a dependent variable (DV) in terms of one or more independent variables (IVs). However, ordinary least squares (OLS) regression is for continuous (or nearly continuous) DVs; logistic regression is for DVs that are categorical. When the DV has two categories (e.g., alive/dead; male/female; voted for McCain/Obama), we use dichotomous logistic regression.
WHY LOGISTIC REGRESSION IS NEEDED
One might try to use OLS regression with dichotomous DVs. There are several reasons why this is a bad idea:
1. The residuals cannot be normally distributed (as the OLS model assumes), since they can take on only two values for each combination of levels of the IVs.
2. The OLS model makes nonsensical predictions, since the DV is not continuous; e.g., it may predict that someone does something more than ‘all the time’.
A VERY QUICK INTRODUCTION TO LOGISTIC REGRESSION
Logistic regression deals with these issues by transforming the DV. Rather than using the categorical responses, it uses the log of the odds of being in a particular category for each combination of values of the IVs. The odds are the same as in gambling; e.g., 3-1 indicates that the event is three times more likely to occur than not. We take the ratio of the odds at different values of the IVs in order to consider the effect of the IVs. We work on the log scale so that the final number goes from negative infinity to infinity, so that 0 indicates no effect, and so that the result is symmetric around 0, rather than 1. The log of the odds is known as the logit. [/learn_more]
[learn_more caption="Factor Analysis"] Factor analysis is a set of techniques designed to find latent variables in sets of data. A latent variable is one that cannot be directly measured. For example, height is not latent, because it can be directly measured. But many abilities, traits and beliefs are latent. These include such things as intelligence, political beliefs, depression, and so on. To measure any of these, we would typically ask many questions that we thought were related to the latent trait, and then factor analyze them to determine the best way to score them, whether all the questions were related to the trait, how many traits there were, and related questions. There are many types of factor analysis, but they fall into two large groups: exploratory and confirmatory. In exploratory factor analysis we have no preconceived notions of the factor structure, but in confirmatory factor analysis we test a structure specified in advance, for example to replicate some earlier results. There are two phases to factor analysis, extraction of the factors and rotation, and there are a variety of methods for each.
The rotations can be divided into orthogonal and oblique methods. In orthogonal rotation, we require that each trait (or factor) be orthogonal to (uncorrelated with) the other traits. In oblique rotation, we do not make this assumption. [/learn_more]
[learn_more caption="Graphical Procedures"] Statistical graphics are an essential part of statistical analysis. But graphs are often poorly done, and there are some lesser-known graphical methods that can be highly effective for displaying data. To name just a few of these: mosaic plots, dot plots, strip plots, lattice plots, density plots, quantile plots and various enhancements of the familiar scatterplot. [/learn_more]
[learn_more caption="Linear Regression"] Linear regression refers to the type of regression where we have a continuous or nearly continuous dependent variable. It is sometimes divided into simple linear regression (where there is only one independent variable) and multiple linear regression (where there is more than one). The simplest case is when there is only one IV, and it is continuous. In this case, we can make a scatterplot of the DV and the IV. Here is a scatterplot of the heights and weights of a group of young adults in Brooklyn, New York. (It’s from a project I worked on, long ago). It is traditional to put the IV along the X axis, and the DV along the Y axis. Just by looking, it is clear that there is some relationship between height and weight. There are various ways to model this relationship, and these can be represented by various lines. OLS regression assumes that the relationship is linear, that is, it fits a straight line to represent the relationship. Algebra tells us that any straight line can be represented as an equation like $y = a + bx$. Here, y is height, x is weight, and a and b are parameters which we attempt to estimate (hence, simple linear regression, and regression generally, is a parametric method).
Various lines might be fit to these points; we need a method to choose one of them, or, in other words, to select a and b. Ideally, the points would lie exactly on the line, but there is no such line for these points. Any line will miss most of the points; we need a method to say how badly a line misses the points. The most common way is ordinary least squares (OLS), which chooses the line that minimizes the sum of the squared vertical distances from the points to the line. When there is more than one IV, the method is quite similar, but instead of a scatterplot in two dimensions, we have to imagine a space with as many dimensions as there are variables, and then minimize the distances in that space. Fortunately, the computer takes care of all this, and gives us output. The only difference that need concern us is that now, if there are p IVs, the equation looks like $y = a + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$. That is, each of the IVs has an associated parameter.
How multiple linear regression controls for the effects of other variables
One interesting feature of multiple linear regression is that the effect of each IV is “controlled” for the other IVs. That is, the parameter $b_1$ for variable $x_1$ accounts for the effect of $x_1$ on $y$, assuming that $x_2$, $x_3$, and so on stay the same. If, for example, we were interested in people’s weights as a function of their age, sex, and height, then the resulting equation would show how men and women of a given age and height differ; how age is related to weight, if sex and height are kept constant; and how height is related to weight, if age and sex are kept constant.
Assumptions of multiple linear regression
Multiple linear regression (and simple linear regression as well) makes certain assumptions about the data.
1. Linearity. As discussed above, the model assumes that the relationship between the DV and the IVs can be well estimated by a straight line.
2. Normality of residuals. Residuals are the vertical distances between the points and the line.
Multiple linear regression assumes that these distances are normally distributed with a mean of 0.
3. Homoscedasticity and independence of residuals. Not only must the residuals be normally distributed, they must have equal variance (that’s called homoscedasticity), they must be independent of one another, and they must not be related to the IVs. [/learn_more]
[learn_more caption="Logistic Regression"] Logistic regression is a type of regression which is used when the dependent variable is categorical. The DV can be dichotomous, ordinal, or nominal. There are several reasons why using OLS regression with any categorical DV is a bad idea:
1. It violates the assumptions of the model (residuals will be heteroscedastic and nonnormal).
2. The OLS model makes nonsensical predictions, for example, predicting that people are halfway between dead and alive.
3. For nominal DVs, the coding is completely arbitrary, and for ordinal DVs it is arbitrary up to a monotonic transformation. Yet recoding the DV will give very different results.
Logistic regression deals with these issues by transforming the DV. It uses the log of the odds of being in a particular category for each combination of values of the IVs. The odds are the same as in gambling. The ratio of the odds at different values of the IVs allows us to consider the effect of the IVs. We work on the log scale so that the final number goes from negative infinity to infinity, so that 0 indicates no effect, and so that the result is symmetric around 0, rather than 1. [/learn_more] [/one_half] [one_half_last] [learn_more caption="Multi-Level Models"] In most regression techniques we assume that the data are independent. Often, this is reasonable. For example, if I collect data on the heights, weights, and other variables from a random sample of American adults, the data are independent – what I weigh does not depend on what you weigh. But there are times when it is not a reasonable assumption. Two types of analysis where this is the case are repeated measures and clustered data.
Repeated measures refers to cases where the same subjects are measured repeatedly. For example, if we were interested in the effect of different diets on weight loss, we would likely weigh the same people repeatedly. What I weigh next month does depend on what I weigh today. The dependence isn’t perfect, but these are clearly not independent data. Clustered data refers to subjects that are clustered in space. The classic example is students, who are nested within classes, which are nested within schools. My score on some standardized test is probably related to my classmates’ scores, because students from the same class share many characteristics – most obviously, they have the same teacher, but they are also likely to be similar in other ways. There are multilevel models that correspond to linear regression and to various types of logistic regression. [/learn_more]
[learn_more caption="Multinomial Logistic Regression"] Multinomial logistic regression is a type of logistic regression that deals with dependent variables that are nominal – that is, there are multiple response levels and they have no specific order. For example, you might be interested in relating type of residence (e.g., private house, shared house, apartment, etc.) to demographic and other variables. The main problem with multinomial logistic regression is the enormous amount of output it generates; but there are ways to organize that output, both in tables and in graphs, that can make interpretation easier. Multinomial logistic regression must sometimes be used with ordinal data, if none of the ordinal logistic regression methods can be used because of poor fit or violation of assumptions. [/learn_more]
[learn_more caption="Ordinal Logistic Regression"] Ordinal logistic regression is a type of logistic regression that deals with dependent variables that are ordinal – that is, there are multiple response levels and they have a specific order, but no exact spacing between the levels.
For example, you might be interested in correlating political views (measured as very conservative, conservative, moderate, liberal, very liberal) with demographic and other variables. By far the most commonly used ordinal regression technique is the proportional odds method, but there are others, and there are times when ordinal data should be analyzed using multinomial logistic regression or linear regression.  This is so, in part, because the differences between nominal, ordinal, interval and ratio level data are not exact – many variables do not fit neatly into one category. [/learn_more] [learn_more caption="Principal Component"] Principal component analysis is a technique for reducing the dimensions of a set of data.  Unlike factor analysis it is not designed to uncover latent traits, but simply to reduce the number of variables while retaining as much of the variance in the original data as possible. [/learn_more] [learn_more caption="Quantile Regression"] In ordinary regression, we are interested in modeling the mean of a continuous dependent variable as a linear function of one or more independent variables.  This is often what we do, in fact, want, and this form of regression is extremely common. Sometimes, though, we want something else.  Sometimes the dependent variable isn’t continuous and we turn to logistic regression or some form of count regression.  Sometimes the dependent variable is censored, as a time to event, and we turn to survival analysis. But sometimes even though the dependent variable is continuous, we are not interested in the mean, but in some other statistic about the population.  One such situation is when we want to model some quantile (also known as percentile) of the population.  That is, we might be interested not in what affects the mean, but in what affects (say) the 3rd quartile, or the 95th percentile, or some other percentile. When might we want this? 
Suppose our dependent variable is bimodal or multimodal – that is, it has multiple “humps”. If we knew what caused the bimodality, we could separate on that variable and do stratified analysis, but if we don’t know that, quantile regression might be good. If our DV is highly skewed – as, for example, income is in many countries – we might be interested in what predicts the median (which is the 50th percentile) or some other quantile. One more example is where our substantive interest is in people at the highest or lowest quantiles. For example, if studying the spread of sexually transmitted diseases, we might record the number of sexual partners that a person had in a given time period. And we might be most interested in what predicts people with a great many partners, since they play a key part in spreading the disease. [/learn_more]
[learn_more caption="Survival Analysis"] Survival analysis gets its name from the fact that it is often used to look at how long people will live, and to see what influences that. Do women live longer than men? Do people who take aspirin live longer than those who do not? But the outcome can be the time to any event. We could look at how long prisoners stay in jail, how long patients stay in the hospital, how long couples stay married, or any other variable that is a time. When the dependent variable is continuous, we would ordinarily first think of linear regression. It’s a very good method when you want to look at the relationship between a continuous dependent variable and one or more independent variables. But, like nearly all statistical techniques, it makes assumptions. And one of the assumptions that is so clear as to usually go unstated is that we know the value of the dependent variable; usually, this is not a problem. If we want to model, say, what people weigh, we can weigh them.
But in one common type of analysis, we don’t always know the dependent variable – that’s when the dependent variable is time to an event. In that case, we need survival analysis. The key reason that we need survival analysis is that these data are often censored. If, for example, we were looking at how long couples stay married, we could select some couples, and follow them over time. But some couples won’t get divorced before we finish our study. Similarly, some patients won’t die during our study, and so on.
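
To make censoring concrete, here is a minimal Python sketch of the Kaplan-Meier estimator, a standard nonparametric way to estimate the proportion of subjects who have not yet had the event. The text above does not name this method; I am using it only as an illustration. The function name `kaplan_meier` and the marriage data are hypothetical, made up for this example.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each subject
    observed:  True if the event happened at that time,
               False if the subject was censored (event not yet seen)
    """
    # Count events (not censorings) at each time point.
    events = Counter(t for t, e in zip(durations, observed) if e)
    curve = []
    surv = 1.0
    for t in sorted(events):
        # Everyone still under observation at time t is "at risk",
        # including subjects who are censored later on.
        at_risk = sum(1 for d in durations if d >= t)
        surv *= 1 - events[t] / at_risk
        curve.append((t, surv))
    return curve


# Hypothetical data: years each couple was followed.
# False means the couple was still married when the study ended (censored).
years = [2, 3, 3, 5, 6, 8, 8, 10]
divorced = [True, True, False, True, False, True, False, False]

curve = kaplan_meier(years, divorced)
for t, s in curve:
    print(f"Estimated proportion still married after {t} years: {s:.3f}")
```

Note that the censored couples are not thrown away: each one counts in the "at risk" group for as long as it was observed. Dropping them, or treating the end of the study as a divorce, would both bias the estimate.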

Types of survival analysis

Although there are a wide variety of techniques for doing survival analysis, they fall into three families: parametric, semiparametric, and nonparametric. The difference is in what we wish to assume about the distribution of survival times. In parametric survival analysis, we assume that survival times come from some specific statistical distribution; in semiparametric survival analysis, we do not need to make this assumption, but we do make another assumption – usually the proportional hazards assumption. In nonparametric analysis, we avoid even that assumption. Since the exact nature of the survival function is hard to know, and is critical to the results, semiparametric survival analysis is much more commonly used than parametric. And semiparametric analysis offers more useful output than nonparametric analysis. By far the most common method is known as Cox proportional hazards regression. Semiparametric methods, unlike parametric methods, do not allow you to predict a survival time; rather, they just let you see differences between groups, or differences based on some other measure. For instance, a Cox model would not predict how long couples would stay together, but it could estimate how much more quickly (say) couples with a large age difference got divorced than couples with similar ages. This is often of primary interest. In addition, recent developments allow us to look at multiple events – for instance, we might model repeated patterns of being arrested over time, or getting a particular disease. [/learn_more] [/one_half_last]

Schedule your FREE 30 Minute Consultation

Let’s discuss the details of your project to see if my expertise in statistical data analysis can help you build better dissertations, write more compelling grant submissions and test your hypotheses with solid statistical analysis techniques.
