Stata Video Tutorials
Webpage Contents (navigate to the section you need)
 Accessing Stata
 Entering or Importing Data into Stata
 Data Management and Prep within Stata
 Descriptive Statistics and Graphs in Stata
 Bivariate Analyses
 Cross-Tabulation and Chi-Squared Test of Independence (two categorical variables)
 T-tests (comparing the means of two groups on an interval variable)
 ANOVA (comparing the means of three or more groups on an interval variable)
 Correlation and Simple Regression (interval variables; interval dependent variable)
 Multivariate Analyses
 Multiple OLS Regression (interval dependent variable; multiple independent variables)
 Multiple Logistic Regression (dichotomous dependent variable; multiple independent variables)
Note: Example commands are included in some cases. Simply replace italicized words with your variables.

 catvar = categorical variable
 groupvar = grouping (categorical) variable
 intvar = interval variable
 indvar = independent variable
 depvar = dependent variable
ACCESSING STATA (return to contents)
 Accessing the remote desktop (VLab1)
ENTERING OR IMPORTING DATA (return to contents)
 Entering data directly into Stata
 Importing data from an Excel file
 Using Stat/Transfer (translating data from other formats)
DATA MANAGEMENT AND PREP (return to contents)
 Know your data
 codebook var1 var2
 If you want to add notes to your data set (about the data set itself, about particular variables, etc.), you can do so through the notes command. Below are the commands for general data set notes and for a particular variable.
 notes: text
 notes [this simple command will show you all of the notes attached to your data set]
 notes var: text
 notes var
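 As a minimal sketch of the notes commands above, the following uses Stata's built-in auto data set (an assumption made only for illustration; the note text is made up):

```stata
* Load an example data set shipped with Stata
sysuse auto, clear

* Attach a note to the data set as a whole
notes: Example note about the data set itself

* Attach a note to one variable
notes mpg: Example note about the variable mpg

* Show all notes, then only the notes for mpg
notes
notes mpg
```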
 Do-files
 Changing variable names from upper to lower case (and vice versa)
 varcase
 There may also be times when you have variable names that include both upper and lower case letters, or some variables that are upper case and others lower case. The command varcase, in those instances, will just reverse the casing. Another command that will switch everything to lower case across the board is the following. The * will have Stata make the change for all variables. Alternatively, you could specify a particular variable(s) in its place.
 rename *, lower
 Cloning a variable and renaming a variable
 clonevar newvar = oldvar
 rename oldvarname newvarname
 Recoding categorical variables (e.g., creating dummies; reordering response categories). Keep in mind that in the commands below, the first value you list in the ( ) is the category value in the variable you want to recode; the second number is the value you want that category to be assigned in the new variable you are generating as part of the command.
 recode catvar (# = # "label1")(# = # "label2"), generate(newvar) label(newvar) test
 recode catvar (#/# = # "label1")(# # = # "label2"), generate(newvar) label(newvar) test
 tabulate catvar, gen(catvar) [this generates dummies from a categorical variable]
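 A minimal sketch of the recode and dummy-variable commands above, assuming Stata's built-in auto data set; the new variable name rep3, the label name rep3lbl, and the category labels are hypothetical choices for illustration:

```stata
sysuse auto, clear

* Collapse the five repair-record categories into three labeled groups:
* old values 1-2 become 1, old value 3 becomes 2, old values 4-5 become 3
recode rep78 (1/2 = 1 "Poor")(3 = 2 "Average")(4/5 = 3 "Good"), generate(rep3) label(rep3lbl) test

* Check the new variable against the original
tabulate rep78 rep3, missing

* Create one dummy per category (rep3_1, rep3_2, rep3_3)
tabulate rep3, gen(rep3_)
```

 Generating a new variable rather than overwriting the original keeps the source values available for checking the recode.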
 Reverse coding categorical variables
 revrs var
 If you need to see the values assigned to all of the categories of a categorical variable (and codebook isn't showing you all of them), you can use the following command to see those values:
 fre catvar
 Changing numeric values in other data formats (e.g., 9) to Stata's version of missing values (.)
 mvdecode _all, mv(9)
 mvdecode var, mv(9)
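 To see what mvdecode does, here is a small sketch on a made-up three-row data set (the variable names id and score are hypothetical), where 9 stands in for a missing-value code:

```stata
clear
input id score
1 3
2 9
3 5
end

* Convert the numeric code 9 to Stata's system missing value (.)
mvdecode score, mv(9)

* The second case now shows . for score
list
```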
 Adding variable and value labels
 Generating new variables from existing variables
 Creating a composite variable
 Start by recoding variables as needed (intuitive directions, etc.)
 Decide whether items will need to be standardized due to varying value sets, then:
 alpha var1 var2 var3 var4, item
 or
 alpha var1 var2 var3 var4, std item
 Decide which variables should be included in the final composite by examining the alpha scores. Then you can use the alpha command to generate the new composite variable. Note that, if you base the composite on the mean value of the components, you can set a minimum number of values that must be present before a composite is calculated (e.g., 2 of the set).
 alpha var1 var2 var3, gen(compvar) min(2)
 or
 alpha var1 var2 var3, gen(compvar) min(2) std
 If all of your variables share the same value set, then no standardization is needed. You then have the option of either basing the composite on the mean of those values, or adding them up. If you add them up (rowtotal), be sure to calculate a composite for only those cases that have no missing values for any of the component variables. Instead of using the alpha command to generate the variable, you'll need to first create a variable that counts up the number of missing values for your cases with respect to the component variables. You can then use that variable to set the condition that a composite be calculated only for those cases with no missing values.
 egen float compmiss = rowmiss(var1 var2 var3)
 tab compmiss
 egen float compvar = rowtotal(var1 var2 var3) if compmiss==0
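 The summed-composite steps above can be sketched on a small made-up data set (the names id, item1, item2, and item3 are hypothetical). The condition matters because rowtotal treats missing values as zero, so without it a case with missing items would still receive a (too low) total:

```stata
clear
input id item1 item2 item3
1 4 5 3
2 2 . 1
3 5 5 4
4 . . 2
end

* Count the missing component values for each case
egen float compmiss = rowmiss(item1 item2 item3)
tab compmiss

* Sum the items only for cases with no missing components;
* cases 2 and 4 are left missing on compvar
egen float compvar = rowtotal(item1 item2 item3) if compmiss==0
list
```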
DESCRIPTIVE STATISTICS AND GRAPHS (return to contents)
 Frequency distributions for categorical variables & associated graphing options (bar & pie chart)
 tab catvar
 fre catvar
 Combining crosstabs and descriptive statistics
 Summary statistics for interval variables (central tendency, standard deviation, etc.)
 sum var
 sum var, detail
 by groupvar, sort: summarize var
 Testing for normality (for an interval variable); use the following three commands:
 histogram var, normal
 sum var, detail
 sktest var
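 As an illustration of the three normality checks above, this sketch assumes the built-in auto data set and uses its price variable, which is right-skewed:

```stata
sysuse auto, clear

* Histogram with a normal curve overlaid for visual comparison
histogram price, normal

* Detailed summary, including skewness and kurtosis
summarize price, detail

* Skewness and kurtosis test for normality
sktest price
```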
 Box Plot (Box-and-Whisker Plot) (for interval variables)
 graph box intvar, over(catvar)
 graph hbox intvar, over(catvar)
 Histogram (for both interval and categorical variables)
 histogram var
 histogram var, frequency normal
 histogram intvar, by(catvar, cols(1))
BIVARIATE ANALYSES (return to contents)
Crosstabulations and Chi-Squared (return to contents)
 Crosstabulation and Chi-squared test (including Cramer's V)
 tab indvar depvar, row chi V
 tab indvar depvar, row chi V gamma taub
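 A minimal sketch of the crosstab commands above, assuming the nlsw88 data set that ships with Stata, with collgrad as the independent variable and union as the dependent variable:

```stata
sysuse nlsw88, clear

* Row percentages, chi-squared test, and Cramer's V
tabulate collgrad union, row chi2 V
```

 Putting the independent variable in the rows and requesting row percentages lets you compare the distribution of the dependent variable across categories of the independent variable.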
 Graphing options for crosstab results (catplot and spineplot)
 catplot catvar1 catvar2, percent(rowvar) asyvars legend(pos(6) row(1)) recast(bar)
T-tests (return to contents) (Note: Stata has changed the menu system for t-tests)
 Two group mean comparison t-test
 ttest intvar, by(catvar)
 Interpreting the results of the two group mean comparison test
 Cohen's d (measure of effect size)
 cohend intvar catvar
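 A short sketch of a two-group comparison, assuming the built-in auto data set. Because cohend is a user-written command that has to be installed separately, the sketch uses the built-in esize command instead (this assumes Stata 13 or later); it is a swapped-in alternative, not the command shown above:

```stata
sysuse auto, clear

* Compare mean mpg for domestic and foreign cars
ttest mpg, by(foreign)

* Effect size (reports Cohen's d) for the same comparison
esize twomean mpg, by(foreign)
```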
 Complementary graphing options:
 See the Box Plot video for a complementary graphing option for the t-test
 graph box intervalvar, over(dichotomousvar)
 Or, run an ANOVA and then use the margins and marginsplot commands.
 anova intervalvar dichotomousvar
 margins dichotomousvar
 marginsplot, xdimension(dichotomousvar) recast(bar)
 marginsplot, xdimension(dichotomousvar) recast(dot)
 T-test assumptions
 Wilcoxon rank-sum (Mann-Whitney) test (nonparametric alternative to t-test)
 ranksum intvar, by(catvar)
 Paired sample (aka repeated measures) t-test
 ttest var1 == var2
 One-sample test of proportion
 prtest var == proportion
 Two-sample/group proportion test
 prtest var1 == var2
 prtest var, by(groupvar) [var must be a dichotomous variable coded 0/1]
 One-sample test of means
 ttest var == value
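 The one-sample and two-group tests above can be sketched with the nlsw88 data set that ships with Stata (an assumption for illustration; the comparison values 0.25 and 7 are arbitrary):

```stata
sysuse nlsw88, clear

* One-sample test: is the proportion of union members equal to 0.25?
prtest union == 0.25

* Two-group test: does the proportion of union members differ
* between college graduates and non-graduates?
prtest union, by(collgrad)

* One-sample test of means: is the mean hourly wage equal to 7?
ttest wage == 7
```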
ANOVA (return to contents)
 One-way analysis of variance (including Bonferroni test and effect size measure)
 oneway intvar catvar, tabulate bonferroni
 If looking for a measure of effect size, run the following ANOVA command; it provides an R-squared:
 anova intvar catvar
 Complementary graphing options for ANOVA:
 See Box Plot video to see a complementary graphing option for ANOVA
 Or, run the margins and marginsplot commands following an anova command.
 anova intervalvar catvar
 margins catvar
 marginsplot, xdimension(catvar) recast(bar)
 marginsplot, xdimension(catvar) recast(dot)
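 Putting the ANOVA commands above together, a minimal sketch assuming the built-in auto data set, with mpg as the interval variable and rep78 as the grouping variable:

```stata
sysuse auto, clear

* One-way ANOVA with group means and Bonferroni pairwise comparisons
oneway mpg rep78, tabulate bonferroni

* Same model through anova, which also reports R-squared
anova mpg rep78

* Predicted group means, then a bar chart of those means
margins rep78
marginsplot, xdimension(rep78) recast(bar)
```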
 Kruskal-Wallis rank test (nonparametric alternative to ANOVA)
 kwallis var, by(catvar)
Correlation and Simple Regression (return to contents)
 Pearson's r (correlation) (listwise and pairwise)
 corr var1 var2 var3
 pwcorr var1 var2 var3, sig
 pwcorr var1 var2 var3, listwise sig
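 To see the difference between listwise and pairwise deletion, this sketch assumes the built-in auto data set, where rep78 has a few missing values. The obs option is added here to print the number of cases behind each coefficient:

```stata
sysuse auto, clear

* Listwise: every coefficient uses only cases complete on all three variables
corr price mpg rep78

* Pairwise: each coefficient uses all cases available for that pair
pwcorr price mpg rep78, sig obs
```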
 Scatterplots and fitted line graphs
 twoway (scatter depvar indvar)
 twoway (scatter depvar indvar, mlabel(labelvar))
 twoway (scatter depvar indvar) (lfit depvar indvar)
 twoway (scatter depvar indvar, jitter(3))
 sunflower depvar indvar
 Simple regression
 regress depvar indvar
 regress depvar i.catvar
 regress depvar ib2.catvar
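 A brief sketch of the three regression forms above, assuming the built-in auto data set. The i. prefix tells Stata to treat the variable as categorical and enter it as a set of dummies; the ib#. prefix does the same but sets the reference (base) category to the value given:

```stata
sysuse auto, clear

* Interval independent variable
regress mpg weight

* Categorical independent variable; lowest category is the reference
regress mpg i.rep78

* Same model with category 3 as the reference
regress mpg ib3.rep78
```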
 Spearman's rho or rank correlation coefficient (nonparametric test of association)
 spearman var1 var2 var3, stats(rho p)
 Bivariate analyses: Similarities between the t-test, ANOVA, and simple regression
MULTIVARIATE ANALYSES (return to contents)
Multiple OLS Regression (return to contents)
 Multiple OLS regression
 regress depvar indvar1 indvar2
 regress depvar indvar1 indvar2, beta
 regress depvar indvar1 indvar2, robust [robust standard errors; use if concerned about heteroskedasticity]
 nestreg: regress depvar (1stvar) (2ndvar) (3rdvar 4thvar)
 Graphing options for OLS regression (run after your regression command):
 coefplot, drop(_cons) xline(0) nolabel
 coefplot, drop(_cons) xline(0) msymbol(d) mcolor(white) levels(99 95 90 80 70) ciopts(lwidth(3 ..) lcolor(*.2 *.4 *.6 *.8 *1)) legend(order(1 "99" 2 "95" 3 "90" 4 "80" 5 "70") row(1)) nolabel
 coefplot, drop(_cons) xline(0) msymbol(d) cismooth nolabel
 Or, run the margins and marginsplot commands after your regression:
 margins, dydx(*) post
 marginsplot, horizontal xline(0) yscale(reverse) recast(scatter)
 Semipartial correlations (provides measures of contribution to R-squared for each variable)
 pcorr2 depvar indvar1 indvar2 indvar3
 Regression diagnostics: Checking for Multicollinearity and Outliers
 Calculating Cook's d to identify influential cases (outliers):
 predict cooksd, cooksd
 list id cooksd if cooksd > 4/e(N) & !missing(cooksd) [e(N) is the number of cases in the regression just run]
 Checking for multicollinearity
 estat vif
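 The diagnostics above can be sketched as follows, assuming the built-in auto data set (make serves as the case identifier). The 4/N cutoff is the rule of thumb used above; the missing-value check is added because Stata treats missing values as larger than any number:

```stata
sysuse auto, clear
regress price mpg weight foreign

* Cook's distance for each case (run directly after the regression)
predict cooksd, cooksd

* List cases above the 4/N rule-of-thumb cutoff
list make cooksd if cooksd > 4/e(N) & !missing(cooksd)

* Variance inflation factors for the independent variables
estat vif
```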
 Adjusting for sampling weights
 regress depvar indvar1 indvar2 [pweight = weightvar]
Multiple Logistic Regression (return to contents)
 Logistic regression
 logistic depvar indvar1 indvar2
 listcoef, help percent
 nestreg: logistic depvar (1stvar) (2ndvar) (3rdvar 4thvar)
 Graphing options for logistic regression results (run immediately after your logistic regression). These will graph the predicted probability of Y for different values in your selected categorical and/or interval independent variable(s).
 margins catvar, atmeans
 marginsplot, xdimension(catvar)
 Or
 margins catvar, atmeans
 marginsplot, xdimension(catvar) recast(bar)
 marginsplot, xdimension(catvar) recast(dot)
 Or, to graph odds ratios (immediately after a logistic regression):
 coefplot, drop(_cons) xline(1) eform xtitle(Odds Ratio) nolabel
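 A minimal sketch of the logistic regression and margins workflow, assuming the nlsw88 data set that ships with Stata. For margins to accept a categorical variable by name, that variable generally needs to be entered in the model with the i. prefix:

```stata
sysuse nlsw88, clear

* Odds ratios for union membership by race and years of schooling
logistic union i.race grade

* Predicted probability of union membership for each race category,
* holding the other covariates at their means
margins race, atmeans

* Plot those predicted probabilities
marginsplot, xdimension(race)
```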
 Calculating predicted probabilities
 Ordered logistic regression
 ologit depvar indvar1 indvar2
Last updated: 05/03/2022