Stata for Clinical Research
As a former SAS user converting to Stata, I needed a reference for statistical methods commonly used in clinical research.
These methods may also apply to other fields. This is an early version missing many important functions.
Descriptive Statistics
- for a first look at a dataset, try su and inspect
- summarize (su) – mean, stddev, min, max; add ,detail for more info
- tabstat, stat() by() – get summary statistics
- note fsum is an optional plugin that is a better version of this
- codebook – an alternative to summarize
- describe – variable name, label, type (describe using to look at a file)
- inspect (ins) – graph, number obs, num missing
- list (li) – print data; add “in 1/20” for first 20 obs
- swilk – Shapiro-Wilk test for normality (if significant, not normal)
- pnorm – normal probability plot
- table – table of summary statistics; use contents() with this (Stata 16 and earlier; Stata 17 and later use statistic() instead)
- show vitd25 by sex:
table male, contents(min vitd25 mean vitd25 max vitd25)
- assert – test if something is true; stops with an error if it is not
assert maxweight >= weight
tab – frequency table
- tab2 – two way table
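The commands above can be tried in one short session. This is a minimal sketch that assumes the auto example dataset shipped with Stata (loaded with sysuse auto) stands in for your own data; the variable names price, mpg, rep78, and foreign come from that dataset, not from this page.

```stata
* Load Stata's bundled example data (an assumption; substitute your own file)
sysuse auto, clear

* Variable names, storage types, and labels
describe

* Mean, sd, min, max; detail adds percentiles, skewness, kurtosis
summarize price mpg
summarize price, detail

* Summary statistics by group
tabstat price mpg, stat(n mean sd min max) by(foreign)

* Small histogram plus counts of missing values
inspect rep78

* First five observations
list make price mpg in 1/5

* Shapiro-Wilk test for normality
swilk price

* One-way and two-way frequency tables
tab foreign
tab rep78 foreign, missing

* Stops with an error if any observation fails the condition
assert price > 0
```

The comments use the leading-asterisk style so the lines also work when pasted into the Command window one at a time.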
Optional plugins
- I use fsum. To get this plugin, use:
ssc install fsum
- find the top 10 plugins:
ssc hot
- distinct: counts distinct values of a variable
- vallist: lists distinct values of a variable (or can just use tabulate)
Commands
- display – give the value of something (or calculate)
- help – give info about a command
- rename – change a variable’s name
- use – load a file
- order – reorder variables (order var1, after(var2))
Variable creation
- use gen to create new variables
gen obese = weight > 30 if weight != .
- use replace to update the values of an existing variable (note that a missing var1 also satisfies var1 > 1, since Stata treats missing as larger than any number)
replace var2 = 1 if var1 > 1
- use xtile to create quantiles
xtile newvar = oldvar, nq(3) /* create tertiles */
- use egen with cut() to specify your own cutpoints:
egen writecat = cut(write), at(30,40,50,60,70)
- create a counter that numbers the observations for each ID
gsort id -eventdate
by id: gen eventnum = _n
- convert strings to dates
gen realdate = date(datestring,"MDY") /* for 4-digit years */
gen realdate = date(datestring,"MD20Y") /* or, for 2-digit years */
format realdate %td /* so that it's readable */
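As a runnable illustration of the variable-creation commands above, here is a hedged sketch on Stata's bundled auto dataset. The new variable names (heavy, mpg3, pricecat, pricerank, datestring, realdate) and the cutpoints are made up for the example.

```stata
sysuse auto, clear

* Indicator variable; stays missing wherever weight is missing
gen heavy = weight > 3000 if !missing(weight)

* Tertiles of mpg
xtile mpg3 = mpg, nq(3)

* Explicit cutpoints; values outside [0, 20000) would be set to missing
egen pricecat = cut(price), at(0,5000,10000,20000)

* Counter within each group, cheapest car first
bysort foreign (price): gen pricerank = _n

* String to date, then a readable display format
gen datestring = "03/15/2021"
gen realdate = date(datestring, "MDY")
format realdate %td

list make foreign price pricerank realdate in 1/3
```

The counter here uses bysort with an ascending sort key in parentheses, which sorts and groups in one step; for a descending order, the gsort form shown in the list is the alternative.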
Dummy Variables
- generate dummy variables for each level of vargroup (e.g. var1, var2):
tabulate vargroup, gen(var)
Missing Values
To copy values from the prior row for a given variable if values are missing:
replace var = var[_n-1] if var==.
The data must be sorted correctly first, because _n-1 refers to whichever row comes immediately before the current one.
To copy values only when the id matches:
replace var = var[_n-1] if var==. & id==id[_n-1]
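A small worked example may make the carry-forward behavior easier to see. This sketch types in a toy dataset (the ids, visits, and weights are invented) and uses the by prefix, which is an alternative to the explicit id==id[_n-1] check: under by, [_n-1] never reaches into the previous group.

```stata
clear
input id visit weight
1 1 70
1 2 .
1 3 72
2 1 .
2 2 80
end

* Sort first: the fill depends entirely on row order
sort id visit

* Within each id, copy the previous row's value into missing rows
by id: replace weight = weight[_n-1] if missing(weight)

* id 1, visit 2 is filled with 70; id 2, visit 1 stays missing
list, sepby(id)
```

The first visit for id 2 stays missing because there is no earlier row within that id to copy from.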
Math functions
- round(var, increment): rounds var to the nearest multiple of increment
Example: round(var1, 0.01) rounds var1 to two decimal places
Change variable type
- convert to string:
tostring zip, generate(zip5) format(%05.0f)
- use “replace” instead of generate to replace the original variable
- convert from string: destring
destring notastring, replace
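To see what the format option does during conversion, here is a minimal sketch with two invented zip codes; zipnum is a hypothetical name for the round-trip result.

```stata
clear
input zip
2138
90210
end

* Numeric to string, padded to five digits with leading zeros
tostring zip, generate(zip5) format(%05.0f)

* String back to numeric; the leading zero is dropped again
destring zip5, generate(zipnum)

list
```

After this runs, zip5 holds "02138" and "90210", while zipnum holds 2138 and 90210.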
String functions
- strpos(first,second): returns the position of second in first, or 0 if it’s not included
Example: strpos("example","amp") would equal 3
- subinstr(source,find,with,n): searches the source string and replaces the first n occurrences of the second string with the third string. To replace all occurrences, use . (missing) for n.
Example: replace var1 = subinstr(var1,"exmpl","example",1)
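Both functions can be checked quickly with display before using them in gen or replace. The strings below are invented for illustration.

```stata
* Position of "amp" inside "example": displays 3
display strpos("example", "amp")

* Replace only the first hyphen: displays a_b-c
display subinstr("a-b-c", "-", "_", 1)

* Replace every hyphen by passing . as the count: displays a_b_c
display subinstr("a-b-c", "-", "_", .)
```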
Tips
- Use a wildcard to match a group of variables (e.g. su vit* summarizes every variable whose name begins with vit)
- use bysort [variable]: command – to run an analysis by group
- if is the equivalent of SAS’s WHERE
Statistical analysis
Basic statistics
- ttest – two sample ttest
ttest vitd25, by(male)
- ranksum – Wilcoxon rank-sum test (nonparametric)
ranksum vitd25, by(male)
- median – turns continuous variable into binary based on median and performs chi-squared test
median vitd25, by(male)
- pwcorr – Pearson correlation
pwcorr var1 var2, obs sig
- spearman – Spearman correlation
spearman var1 var2, pw stats(p obs rho)
- tab, chi2 – Chi-squared analysis
tab var1 var2, chi2
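The same tests can be run end to end on Stata's bundled auto dataset. This is a sketch under that assumption; foreign plays the role of the two-level grouping variable (male in the examples above).

```stata
sysuse auto, clear

* Two-sample t test of mpg by origin
ttest mpg, by(foreign)

* Nonparametric alternatives
ranksum mpg, by(foreign)
median mpg, by(foreign)

* Pearson correlation with n and p-value
pwcorr price mpg, obs sig

* Spearman correlation with rho, n, and p-value
spearman price mpg, stats(rho obs p)

* Chi-squared test on a two-way table
tab rep78 foreign, chi2
```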
Regression
- Stepwise:
stepwise, pr(0.10) pe(0.05) forward: regress var1 var2 var3 var4
- store estimates:
/* store the most recent estimates under the name model1 */
estimates store model1
/* create a table comparing models */
estimates table model1 model2 model3, p b(%9.2f)
- reg – linear regression
- logit – logistic regression with coefficients
- add odds ratios using or option
- logistic – logistic regression with odds ratios
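Here is how the regression commands above might fit together in one session. It is a sketch that assumes the bundled auto dataset; the stored-estimate names m1 and m2 are arbitrary.

```stata
sysuse auto, clear

* Fit two nested linear models and store each set of estimates
regress price mpg
estimates store m1

regress price mpg weight foreign
estimates store m2

* Side-by-side coefficients with p-values
estimates table m1 m2, b(%9.2f) p

* Backward stepwise selection, removing terms with p >= 0.10
stepwise, pr(0.10): regress price mpg weight length foreign

* Logistic regression: odds ratios via the or option, or via logistic
logit foreign mpg weight, or
logistic foreign mpg weight
```

Note that estimates store must run right after the model it is meant to capture, since it saves whatever was estimated most recently.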
Mixed models
xtmixed varlist || groupvar:
(in Stata 13 and later, this command is named mixed)
Shortcuts
Foreach loops
- run the same command for each variable in a list:
foreach variable of varlist var1 var2 var3 {
    spearman `variable' var2
}
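A runnable version of the loop, assuming the bundled auto dataset. The loop name v is arbitrary; inside the braces it must be written with a backtick on the left and a straight single quote on the right.

```stata
sysuse auto, clear

* Spearman correlation of price with each variable in turn
foreach v of varlist mpg weight length {
    spearman `v' price
}
```

Curly "smart" quotes pasted from a web page or word processor will not work here, and a trailing semicolon is only valid after #delimit ; has been set.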
Graphical Analysis
Scatterplots
- sc (variables), mcolor(color) msize(size) title(title)
- With regression line:
graph twoway (sc var1 var2) (lfit var1 var2)
- use lfitci to add confidence intervals
- Use third variable as categorical:
twoway (sc bmi dbpugml), by(gender, total)
Histograms
histogram bmi, by(male, col(1))
Boxplots
graph box var1, over(var2)
Bar graphs
graph bar var1, over(var2)
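The graph commands above can be tried on the bundled auto dataset. This is a sketch under that assumption; the title text is invented.

```stata
sysuse auto, clear

* Scatterplot with fitted line and confidence band
* (lfitci is listed first so the points are drawn on top of the band)
twoway (lfitci price mpg) (scatter price mpg), title("Price vs. mileage")

* One panel per group, plus a panel for all observations
twoway (scatter price mpg), by(foreign, total)

* Stacked histograms by group
histogram mpg, by(foreign, col(1))

* Box plot and bar graph (bar height is the mean by default)
graph box mpg, over(foreign)
graph bar mpg, over(foreign)
```

Each command replaces the previous graph in the Graph window, so run them one at a time when comparing.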
Formatting
- replace numeric values with text (e.g. in graphs)
label define labelname 1 "yes" 2 "no"
- assign this naming scheme to a variable
label values var1 labelname
- replace a variable name with a label
label variable var1 "Variable One"
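The three labeling steps fit together as follows. This sketch assumes the bundled auto dataset; highmpg and yesno are hypothetical names created for the example.

```stata
sysuse auto, clear

* A 0/1 variable to label
gen byte highmpg = mpg > 25 if !missing(mpg)

* Define the value labels, attach them, then label the variable itself
label define yesno 0 "no" 1 "yes"
label values highmpg yesno
label variable highmpg "MPG above 25"

* The table now shows "no"/"yes" instead of 0/1
tab highmpg
```

The underlying values stay numeric, so conditions such as if highmpg == 1 continue to work after labeling.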