Stata for Clinical Research

As a former SAS user converting to Stata, I needed a reference for statistical methods commonly used in clinical research.
These methods may also apply to other fields. This is an early version missing many important functions.

Descriptive Statistics

  • for a first look at the data, try summarize (su) and inspect (ins)
  • summarize (su) – mean, std. dev., min, max; add the detail option (su var1, detail) for percentiles, skewness, and kurtosis
  • tabstat, stat() by() – get summary statistics
    • note fsum is an optional plugin that is a better version of this
  • codebook – an alternative to summarize
  • describe – variable name, label, type (describe using to look at a file)
  • inspect (ins) – graph, number obs, num missing
  • list (li) – print data; add “in 1/20” for first 20 obs
  • swilk – Shapiro-Wilk test for normality (if significant, not normal)
  • pnorm – normal probability plot
  • table – table of summary statistics; use contents() with this in Stata 16 and earlier (Stata 17 and later use statistic() instead).

    • show vitd25 by sex:

      table male, contents(min vitd25 mean vitd25 max vitd25)
      
  • assert – check that a condition holds for every observation; stops with an error if it does not

    assert maxweight >= weight
    
  • tab – frequency table

  • tab2 – two-way tables for every pair of the listed variables (tab var1 var2 gives a single two-way table)
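
A minimal first-look sketch, meant to be run as a do-file. It assumes Stata's bundled auto dataset (loaded with sysuse auto) as a stand-in for clinical data, so price, mpg, foreign, and rep78 are that dataset's variables rather than anything from the examples above:

    sysuse auto, clear
    describe                       /* variable names, types, labels */
    summarize price mpg, detail    /* adds percentiles, skewness, kurtosis */
    tabstat price mpg, stat(n mean sd min max) by(foreign)
    swilk mpg                      /* a small p-value suggests non-normality */
    tab foreign rep78, missing     /* two-way table that also counts missing values */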

Optional plugins

  • I use fsum. To get this plugin, use:

    ssc install fsum
    
  • find the top 10 most popular plugins on SSC

    ssc hot
    
  • distinct: counts distinct values of a variable

  • vallist: lists distinct values of a variable (or can just use tabulate)

Commands

  • display – give the value of something (or calculate)
  • help – give info about a command
  • rename – change a variable’s name
  • use – load a file
  • order – reorder variables (order var1, after(var2))

Variable creation

  • use gen to create new variables (missing values compare as larger than any number, so exclude them explicitly)

    gen obese = weight > 30 if !missing(weight)
    
  • use replace to update the values of an existing variable

    replace var2 = 1 if var1 > 1
    
  • use xtile to create quantiles

    xtile newvar = oldvar, nq(3) /* create tertiles */
    
  • use egen with cut() to group a variable at cutpoints you specify (values below the first or at/above the last cutpoint become missing):

    egen writecat = cut(write), at(30,40,50,60,70)
    
  • number the observations within each ID (here the most recent event gets 1)

    gsort id -eventdate
    by id: gen eventnum = _n
    
  • convert strings to dates

    /* use one of the two gen lines, depending on the year format */
    gen realdate = date(datestring,"MDY")    /* for 4-digit years */
    gen realdate = date(datestring,"MD20Y")  /* for 2-digit years in the 2000s */
    format realdate %td                      /* so that it's readable */
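
The variable-creation commands above can be tried together in a short sketch. It assumes the bundled auto dataset, and the new variable names (heavy, mpg3, pricecat, pricerank) and the cutoffs are made up for illustration:

    sysuse auto, clear
    gen heavy = weight > 3000 if !missing(weight)        /* 1/0 indicator; missing stays missing */
    xtile mpg3 = mpg, nq(3)                               /* tertiles of mpg */
    egen pricecat = cut(price), at(0,5000,10000,20000)    /* groups start at 0, 5000, 10000 */
    bysort foreign (price): gen pricerank = _n            /* 1 = cheapest car within each group */
    tab mpg3 pricecat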
    

Dummy Variables

  • generate a dummy variable for each level of vargroup (named var1, var2, ...)

    tabulate vargroup, gen(var)
    

Missing Values

To copy values from the prior row for a given variable if values are missing:

    replace var = var[_n-1] if var==.

The data must be sorted correctly first, because [_n-1] refers to whichever row is immediately above.

Copy the value only if the id matches:

    replace var = var[_n-1] if var==. & id==id[_n-1]
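
As a small worked sketch of the same idea, the do-file below builds a made-up dataset (id, visit, and weight are hypothetical names and values) and carries the last value forward within each id. It uses bysort instead of the id==id[_n-1] check, since subscripts under by refer to rows within the group:

    clear
    input id visit weight
    1 1 70
    1 2 .
    1 3 .
    2 1 .
    2 2 82
    end
    /* sort by id and visit, then fill each gap from the row above within the same id */
    bysort id (visit): replace weight = weight[_n-1] if missing(weight)
    list, sepby(id)

Both missing visits for id 1 are filled with 70, because replace works down the rows in order. The first visit of id 2 stays missing, since there is no earlier value to copy.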

Math functions

  • round(var, increment)

    round(var1, 0.01)
    
  • Change variable type

    • convert to string:

      tostring zip, generate(zip5) format(%05.0f)
      
    • use “replace” instead of generate to replace the original variable

    • convert from string: destring

      destring notastring, replace
      

String functions

  • strpos(first,second): returns the position of second in first, or 0 if it’s not included

    Example: strpos("example","amp") would equal 3
    
  • subinstr(source,find,with,n): searches the source string and replaces the first n occurrences of the second string with the third string. If n is missing (.), all occurrences are replaced.

    Example: replace var1 = subinstr(var1,"exmpl","example",1)
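
A quick way to check how these functions behave is to call them with display on literal strings (the strings here are arbitrary examples):

    display strpos("example", "amp")          /* 3 */
    display strpos("example", "xyz")          /* 0: not found */
    display subinstr("a-b-c", "-", "+", 1)    /* a+b-c: first occurrence only */
    display subinstr("a-b-c", "-", "+", .)    /* a+b+c: all occurrences */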
    

Tips

  • Use a wildcard to match a bunch of variables (e.g. su vit* to summarize every variable whose name begins with vit)
  • use bysort [variable]: command – to do analysis by group
  • if is the equivalent of SAS’s WHERE

Statistical analysis

Basic statistics

  • ttest – two sample ttest

    ttest vitd25, by(male)
    
  • ranksum – Wilcoxon rank-sum test (nonparametric)

    ranksum vitd25, by(male)
    
  • median – turns continuous variable into binary based on median and performs chi-squared test

    median vitd25, by(male)
    
  • pwcorr – Pearson correlation

    pwcorr var1 var2, obs sig
    
  • spearman – Spearman correlation

    spearman var1 var2, pw stats(p obs rho)
    
  • tab, chi2 – Chi-squared analysis

     tab var1 var2, chi2
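
To try these tests without clinical data at hand, here is a sketch on the bundled auto dataset, where foreign plays the role of the binary grouping variable (an assumption for illustration only):

    sysuse auto, clear
    ttest mpg, by(foreign)                   /* two-sample t test, equal variances assumed */
    ttest mpg, by(foreign) unequal           /* same test without assuming equal variances */
    ranksum mpg, by(foreign)                 /* Wilcoxon rank-sum */
    pwcorr mpg weight, obs sig               /* Pearson correlation with n and p-value */
    spearman mpg weight, stats(rho obs p)    /* Spearman correlation */
    tab foreign rep78, chi2                  /* chi-squared test of association */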
    

Regression

  • Stepwise:

    stepwise, pr(0.10) pe(0.05) forward: regress var1 var2 var3 var4
    
  • store estimates:

    /* store the most recent estimation results under the name model1 */
    estimates store model1 
    
    
    /* create a table comparing models */
    estimates table model1 model2 model3, p b(%9.2f)
    
  • reg – linear regression

  • logit – logistic regression with coefficients
    • add odds ratios using or option
  • logistic – logistic regression with odds ratios
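
A sketch of the fit, store, and compare workflow, assuming the bundled auto dataset; the model names model1 and model2 are arbitrary:

    sysuse auto, clear
    regress price mpg
    estimates store model1
    regress price mpg weight
    estimates store model2
    /* coefficients to 2 decimals, with p-values, side by side */
    estimates table model1 model2, b(%9.2f) p
    /* logistic regression reporting odds ratios */
    logistic foreign mpg weight

estimates store has to be run right after each model, because it saves whatever estimation results are currently active.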

Mixed models

    /* called xtmixed before Stata 13 */
    mixed depvar indepvars || groupvar:

Shortcuts

Foreach loops

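  • loop over a list of variables (the loop macro is referenced with a backtick and a straight single quote; no semicolon is needed):

    foreach variable of varlist var1 var2 var3 {
        spearman `variable' var2
    }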
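
A runnable version of such a loop, assuming the bundled auto dataset; the macro name v is arbitrary:

    sysuse auto, clear
    /* Spearman correlation of mpg with each variable in turn */
    foreach v of varlist price weight length {
        spearman `v' mpg
    }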

Graphical Analysis

Scatterplots

  • scatter (sc) yvar xvar, mcolor(color) msize(size) title(title)

  • With regression line:

    graph twoway (sc var1 var2) (lfit var1 var2)
    
  • use lfitci to add confidence intervals

  • Use third variable as categorical:

    twoway (sc bmi dbpugml), by(gender, total)
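
A sketch combining these options, assuming the bundled auto dataset; the title text is made up:

    sysuse auto, clear
    /* fitted line with confidence band drawn first, points on top */
    twoway (lfitci mpg weight) (scatter mpg weight), title("Mileage by weight")
    /* one panel per group, plus a panel for all cars */
    twoway (scatter mpg weight), by(foreign, total)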
    

Histograms

    histogram bmi, by(male, col(1))

Boxplots

    graph box var1, over(var2)

Bar graphs

    graph bar var1, over(var2)

Formatting

  • replace numeric values with text (e.g. in graphs)

    label define labelname 1 "yes" 2 "no"
    
  • assign this naming scheme to a variable

    label values var1 labelname
    
  • attach a descriptive label to a variable (shown in place of its name in output and graphs)

    label variable var1 "Variable One"
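
Put together, the three labeling steps look like this. The sketch assumes the bundled auto dataset, and the variable heavy and the label name yesno are invented for the example:

    sysuse auto, clear
    gen byte heavy = weight > 3000 if !missing(weight)
    label define yesno 0 "no" 1 "yes"
    label values heavy yesno
    label variable heavy "Weight over 3,000 lb"
    tab heavy                /* rows now show no/yes instead of 0/1 */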
    