## Wooldridge dataset: Generate a histogram of each variable and comment on their distributions.

__The following question uses the “bwght” Wooldridge dataset.__

*To import these datasets, you should use the following commands (the first installs the command for opening the datasets from online [you only have to do this once], the second actually implements the new command to open the dataset from online):*

*ssc install bcuse*

*bcuse bwght*

1. A persistent question in health and public policy is the impact of smoking during pregnancy. Suppose we wish to examine this relationship using the child’s birthweight as a general measure of his or her health.

A. Use descriptive statistics to examine the variable *cigs*, the number of cigarettes a pregnant woman smoked per day on average during her pregnancy. What can we say about the distribution of this variable? Would this impact a regression of birthweight on the number of cigarettes smoked per day? Explain.

B. Run a regression of birthweight on the number of cigarettes smoked per day, as well as a regression of birthweight on the number of cigarettes smoked per day and family income. Does family income substantially change your estimated coefficient on *cigs*? Would omitting family income produce any sort of bias? Explain.
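One possible way to set up parts A and B in Stata is sketched below. It assumes `bcuse` is already installed and that the variables are named `bwght`, `cigs`, and `faminc` in the downloaded file. Run `describe` to confirm the names before relying on them. The choice of `tabulate` and a discrete histogram is only a suggestion.

```stata
* Load the data (assumes bcuse was installed once with: ssc install bcuse)
bcuse bwght, clear

* Part A: look at the distribution of cigs
summarize cigs, detail            // mean, percentiles, skewness, kurtosis
tabulate cigs                     // frequency of each reported value
histogram cigs, discrete percent  // one bar per integer value

* Part B: regression without and with family income
regress bwght cigs
regress bwght cigs faminc

* How strongly are cigs and faminc related? (assumed useful for the bias discussion)
correlate cigs faminc
```

Comparing the coefficient on `cigs` across the two `regress` outputs, together with the correlation, gives the raw material for the omitted-variable discussion.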

__The following questions use the “gpa2” Wooldridge dataset.__

Suppose you are interested in estimating how a student’s SAT score (a standardized, high school level exam) affects his or her college GPA.

1. Summary Statistics

A. What kind of dataset is this? How can you tell?

B. Report the mean, standard deviation, minimum, and maximum for each variable.

C. Generate a histogram of each variable and comment on their distributions.
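A minimal Stata sketch for the summary-statistics questions follows. It assumes `bcuse` is installed and that every variable in the file is numeric. The graph names `hist_*` are made up for illustration.

```stata
* Load the data and inspect its structure
bcuse gpa2, clear
describe          // variable names, labels, number of observations

* Mean, standard deviation, minimum, and maximum for each variable
summarize

* One histogram per variable, each stored under its own (hypothetical) graph name
foreach v of varlist _all {
    histogram `v', name(hist_`v', replace) title("`v'")
}
```

Storing each graph under a separate name keeps all of them open for comparison instead of overwriting the previous one.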

2. Generating a model

A. Write down the bivariate model that you will use to estimate the impact of a student’s SAT score on his or her college GPA.

B. Run an OLS regression of a student’s college GPA on his or her SAT score and report the output. What is the y-intercept here telling us?

C. How do we interpret the sign and magnitude of the coefficient on SAT?

D. When a student’s SAT score is equal to 1200, what is his or her predicted value of colGPA? Is it possible for a student with a 1200 SAT score to have an actual college GPA greater than or less than this predicted value? Explain.

E. Can we say that the relationship between colGPA and SAT is causal? Explain why or why not.
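The regression and prediction steps in this section could be carried out along the following lines. This is a sketch that assumes the variables are stored as `colgpa` and `sat` (check with `describe`). The new variable names `colgpa_hat` and `uhat` are hypothetical.

```stata
* Load the data and estimate the bivariate model by OLS
bcuse gpa2, clear
regress colgpa sat

* Predicted college GPA at SAT = 1200, computed from the stored coefficients
display _b[_cons] + 1200*_b[sat]

* Same prediction with a standard error and confidence interval
lincom _cons + 1200*sat

* Fitted values and residuals, to compare actual and predicted GPA
predict colgpa_hat, xb
predict uhat, residuals
summarize uhat if sat == 1200
```

The spread of `uhat` among students with a 1200 SAT score shows directly how far actual GPAs sit above or below the prediction.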

__The following question uses the “rental” Wooldridge dataset.__

Suppose that you are contracted to explore the determinants of rental rates in major cities across the U.S. In addition to rental rate data, you decide to collect data for the year 2015 on what you believe are three key explanatory variables: the population of each city, the average income in each city, and the total student enrollment in each city (collegiate and above).

A. What is the null hypothesis of the impact of population on rental rates, and what is the alternative hypothesis that you are testing?

B. Consider two potential versions of a model:

Model (1): regressing rental rates on population and average income

Model (2): regressing rental rates on population, average income, and the percentage of the city’s total population that are students (here “students” means college and above).

Estimate Models (1) and (2) using an OLS regression and show your output.

*Hint: You will need to create a new variable that is the percentage of the city’s total population that are students.*

C. Why might we want to run Model (2) using the percentage of the city’s total population that are students (as we did), rather than total student enrollment (the raw data we collected)? Does including both a city’s total population and the students’ share of the total population pose any problems for your estimation of Model (2)? Explain in each case.

D. Comment on the statistical significance of your slope coefficients in Models (1) and (2), referring to both the t-statistics and p-values from your output, and commenting on any differences in statistical significance between the models.

E. You decide that a log-log model might be more appropriate here. Re-run Model (2) as a log-log model, show your output, and comment on the statistical significance of your explanatory variables.

*Hint 1: You don’t need to convert your newly-created “student share” variable.*

*Hint 2: Use the “gen” command to create a new variable and “log” to calculate the natural log of a variable; Stata calculates a natural logarithm by default.*

F. Conduct an F-test for the joint significance of all of your explanatory variables in the log-log version of Model (2). What can you say about the joint significance of the included explanatory variables?
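The rental questions above can be approached with a sequence like the one below. It is only a sketch: it assumes the raw variables are named `rent`, `pop`, `avginc`, and `enroll` (confirm with `describe`), and the new names `stushare`, `ln_rent`, `ln_pop`, and `ln_avginc` are hypothetical, chosen to avoid clashing with any variables already in the file.

```stata
* Load the data and check which variables it contains
bcuse rental, clear
describe

* Student share of the city's population, in percent (hypothetical name)
gen stushare = 100*enroll/pop

* Model (1) and Model (2) in levels
regress rent pop avginc
regress rent pop avginc stushare

* Log-log version of Model (2); the share variable is left unlogged, per Hint 1
gen ln_rent   = log(rent)
gen ln_pop    = log(pop)
gen ln_avginc = log(avginc)
regress ln_rent ln_pop ln_avginc stushare

* F-test that all slope coefficients are jointly zero
test ln_pop ln_avginc stushare
```

Because the `test` command here covers every regressor in the model, its F-statistic should match the overall F reported in the header of the `regress` output.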