DMV Assignment 3: Managing Data

Once you have written a successful program that manages your data, create a blog entry where you post your program and the results/output that displays at least 3 of your data managed variables as frequency distributions. Write a few sentences describing these frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc.


In keeping with my original project hypothesis, I was interested in comparing the reported level of trust in the police in the Outlook on Life (OOL) survey between Non-Hispanic Whites and Non-Hispanic Blacks. The question in OOL is presented as part of a group question asking respondents to indicate how much they think they can trust each institution. The response options form a Likert scale with every point labeled as follows:
  • 1 = Just about always
  • 2 = Most of the time
  • 3 = Only some of the time
  • 4 = Never

Refused responses are coded as -1 and represent about 3% of the responses. Means for the two groups I’m interested in would be skewed downward if the refused responses were included in the analysis, so I need to remove them from my analysis.
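To illustrate the size of that downward pull, here is a minimal sketch that rebuilds the response counts from the first frequency table shown later in this post and computes the mean with and without the refused code. The dataset name counts and the variable names response and n are my own placeholders, not part of the OOL data, and this is the overall mean for the two groups combined rather than a per-group comparison.

```sas
/* Sketch only: counts typed in from the first frequency table in this post */
DATA counts;
  INPUT response n;   /* response = coded value, n = number of respondents */
  DATALINES;
-1 60
1 154
2 896
3 788
4 194
;
RUN;

/* Mean with the refused code (-1) treated as if it were a real score */
PROC MEANS DATA=counts MEAN;
  VAR response;
  FREQ n;             /* each row counts n times */
  TITLE "Mean with refused (-1) included";
RUN;

/* Mean with the refused rows left out */
PROC MEANS DATA=counts MEAN;
  WHERE response NE -1;
  VAR response;
  FREQ n;
  TITLE "Mean with refused excluded";
RUN;
```

Working the same sums by hand gives roughly 5026 / 2092 = 2.40 with the refused code included and 5086 / 2032 = 2.50 without it, so leaving the -1 values in would lower the mean by about a tenth of a scale point.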

I feel it necessary to begin this post by stating that I disagree with the practice of overwriting (recoding into the same variable) to account for valid non-responses. I understand the practical need to create data with clearly marked missing responses so that case-wise removal in multivariate analyses can be used to limit the number of missing cases per analysis. However, I would instead recode into new analysis variables so that I do not lose the valid non-responses in the original data. There may be cases where distinguishing a valid non-response from a skipped question would be meaningful, but the method taught in the course lessons, which overwrites valid non-responses with missing values, would prevent such an analysis from being conducted.

For my purposes, I have two options to ensure that non-responses do not unintentionally skew my results. The first would be to recode my variable(s) of interest into new analysis variables. For instance, the variable I’m interested in for this exercise is how much respondents report they feel they can trust the police (variable w1_k1_b in the Outlook on Life data), which has (as shown in the table below) 60 responses coded as “Refused” (coded value of -1):

[The police] How much do you think you can trust the following institutions?

W1_K1_B  Frequency  Percent  Cumulative Frequency  Cumulative Percent
     -1         60     2.87                    60                2.87
      1        154     7.36                   214               10.23
      2        896    42.83                  1110               53.06
      3        788    37.67                  1898               90.73
      4        194     9.27                  2092              100.00

I could recode the variable into a new variable, e.g., w1_k1_b_analysis, that would contain only the non-refused responses (coded values of 1 through 4) and proceed to use my new variable in my analysis. This would preserve the original data, including the valid non-responses, should I later decide to explore questions for which that data may be needed, e.g., whether the pattern of refusing certain types of questions differs by race/ethnicity. If I were conducting multivariate analysis, this would be the most appropriate approach: it keeps the valid non-responses from skewing my results while ensuring that the maximum number of cases is available for all comparisons being conducted.
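A minimal sketch of that first option might look like the following. It is not the code I ran for this assignment; the dataset name recoded and the label text are placeholders of my own, and it assumes the same library path used in my program.

```sas
/* Sketch only: recode into a NEW variable so the original w1_k1_b is untouched */
LIBNAME mydata "/courses/d1406ae5ba27fe300" access=readonly;

DATA recoded;                       /* "recoded" is a placeholder dataset name */
  SET mydata.oll_pds;
  IF ppethm = 1 OR ppethm = 2;      /* keep the two race/ethnicity groups */

  /* Refused (-1) becomes missing in the analysis copy only */
  IF w1_k1_b = -1 THEN w1_k1_b_analysis = .;
  ELSE w1_k1_b_analysis = w1_k1_b;  /* valid responses 1-4 are copied as-is */

  LABEL w1_k1_b_analysis = "Trust in police (refused set to missing)";
RUN;

/* MISSING shows the missing category in the table, so the original and
   the analysis copy can be compared side by side */
PROC FREQ DATA=recoded;
  TABLES w1_k1_b w1_k1_b_analysis / MISSING;
RUN;
```

Because w1_k1_b still holds the -1 codes, a later question about who refuses could be answered from the same dataset without going back to the source file.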

The second option, since at this time I am only interested in responses to this single item, is to filter the data so that only non-refused responses from the two race groups I’m interested in are included. This is the approach I chose. The code for this is shown below:

LIBNAME mydata "/courses/d1406ae5ba27fe300" access=readonly;
/* mydata is the local name for the database */
/* Research question: Race and perception of law enforcement between Blacks and Whites during the beginning of the #BlackLivesMatter movement
SPECIFICALLY H1: Are non-Hispanic Blacks less likely to trust the police than non-Hispanic Whites?
*/
DATA new; set mydata.oll_pds;
LABEL ppethm="Race / Ethnicity"
w1_k1_b="[The police] How much do you think you can trust the following institutions?";

IF ppethm=1 or ppethm = 2;

/* Subsetting IF statements limit the cases included in the analysis; the one above keeps only those who
indicated race/ethnicity of "White, Non-Hispanic" or "Black, Non-Hispanic" */

IF w1_k1_b ne -1;
/* Remove cases where respondents refused the question I'm interested in (coded as -1). Could also set refused values to missing by recoding:
IF w1_k1_b = -1 then w1_k1_b = .; */

/* Use If/Then, Else If/Then to recode into new variables: IF AAA=1 THEN NEW=2; ELSE IF AAA=2 THEN NEW=4 etc.
To create groups, can use IF AAA LE VAL THEN NEW=1; ELSE IF AAA LE VAL+2 THEN NEW=2; etc.
To calculate new variables, just enter expression: NEWCALC=BBB*CCC etc.
To calculate sums over multiple variables: NEWSUM = SUM (of AAA BBB CCC); */

/* All data manipulation or coding code should go before the first PROC statement. */
RUN;

PROC SORT DATA=new; by CASEID;
RUN;

PROC FREQ DATA=new; TABLES ppethm w1_k1_b;
RUN;

The results of running the program above show that only non-refused responses (distribution of w1_k1_b shows no -1 responses) for the two race/ethnicity groups (PPETHM = 1 or 2) I’m interested in remain.

Race / Ethnicity

PPETHM  Frequency  Percent  Cumulative Frequency  Cumulative Percent
     1        797    39.22                   797               39.22
     2       1235    60.78                  2032              100.00

[The police] How much do you think you can trust the following institutions?

W1_K1_B  Frequency  Percent  Cumulative Frequency  Cumulative Percent
      1        154     7.58                   154                7.58
      2        896    44.09                  1050               51.67
      3        788    38.78                  1838               90.45
      4        194     9.55                  2032              100.00

Filtering on both variables reduces the number of cases included in my analysis from the original 2092 to 2032: the 60 refused responses have been removed. Though the frequency of each of the other responses has not changed, each response’s percent of the total has been recalculated against the smaller base of 2032.
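The change in percentages can be checked by hand for any one response. As a small sketch (the variable names pct_before and pct_after are placeholders of my own), the step below takes the 154 “Just about always” responses and divides them by each of the two totals, writing the results to the SAS log:

```sas
/* Sketch only: same count, two different denominators */
DATA _NULL_;
  pct_before = 100 * 154 / 2092;   /* base includes the 60 refused responses */
  pct_after  = 100 * 154 / 2032;   /* base after the refused responses are filtered out */
  PUT pct_before= 6.2 pct_after= 6.2;   /* written to the log, two decimals */
RUN;
```

These work out to about 7.36 and 7.58, matching the Percent column for response 1 in the first and second frequency tables respectively.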
