
Copyright 1995 by the Russell Sage Foundation. Russell Sage Working papers have not been reviewed by the Foundation. Copies of working papers are available from the author, and may not be reproduced without permission. If you have any questions about permissions, please contact The Electronic Policy Network, P.O. Box 383080, Cambridge, MA 02238, query@epn.org, or by phone at (617) 547-2950.

Preferred Citation: Robinson G. Hollister and Jennifer Hill, "Problems in the Evaluation of Community-Wide Initiatives" (New York: Russell Sage Foundation, 1995) [http://epn.org/sage/rsholl.html].


Problems in the Evaluation of Community-Wide Initiatives

Robinson G. Hollister and Jennifer Hill

RUSSELL SAGE FOUNDATION, Working Paper # 70
A Paper prepared for the Roundtable on Comprehensive Community Initiatives
April, 1995

Introduction

In this paper we outline the types of problems which can arise when an attempt is made to evaluate the effects of community-wide programs. We review experience with different methods where such experience is available. In general we find the problems are substantial, so in a concluding section we provide some suggestions for steps which might be taken to improve the evaluation methods that could be used in these situations[1].

We face several definitional problems at the outset. What do we mean by community-wide initiatives? What types of effects are we particularly interested in measuring? Finally, what are the major objectives of the evaluation?

To help with definitions, we turn to papers produced by members of the Roundtable committee.

Community-wide Initiatives

P/PV has defined community as: "...the intersection of place and associational network. Community encompasses both where youth spend their time and whom they spend it with"[2].

Brown and Richman describe "urban change initiatives" as sharing "to some extent the following guiding principles or development assumptions:

  1. Community Building: neighborhood development should 'rebuild the social fabric' of the neighborhood.
  2. Community Empowerment: neighborhood development should involve residents and other neighborhood stakeholders in the process of identifying and prioritizing problems and designing solutions to these problems.
  3. Comprehensive Community Change: neighborhood development should adopt a holistic and integrative perspective on neighborhood dynamics and change." [3]

Under the title of community-wide initiatives, in this paper we intend to focus on programmatic interventions which treat groups of individuals as a whole. That is, all the individuals in a given geographic area or in a given class of people are eligible for the program's intervention or are potential targets of that intervention. It is the inclusiveness of eligibility that is central to the concept of community-wide initiatives. We emphasize this feature at the outset because we will sharply distinguish situations in which random assignment of individuals could be used in the evaluation to create control groups from those in which the character of eligibility for the intervention precludes the use of such methods.

Some brief examples may help. In the late 1970's and early 1980's the federal government funded a program called Youth Incentive Entitlement Pilot Project (YIEPP). This program selected a number of school catchment areas in several states. All the low income persons between the ages of 16 and 19 within that area were eligible to participate in the program. The program provided them with work opportunities during the summer time - guaranteed jobs essentially - and part time work during the school year. If they took the job in the summer, they were to continue in school during the school year. A major objective was to encourage school continuation by making employment possible for the low income population. The key feature is the inclusiveness that dictates against random assignment; since all of the low income youth in a given school catchment area were eligible, random assignment was not possible.

A different type of example is that of community development corporations (CDCs), which confine their efforts for community change to geographically designated areas; at least in theory, all the residents of those areas are potentially eligible for services provided through the community development corporations' efforts.

Effects to be Measured

Turning to the definition of effects which the evaluations assess, we focus for the most part on longer term outcomes which are said to be the concern of the community-wide initiative. We want to separate the longer term outcomes from the more immediate short term changes that are often covered under what is called a process analysis. Thus in the YIEPP example the long term outcomes of interest were school continuation rates of the youth and their employment and earnings. The participation of the youth in the program, while it was of some interest, was not itself considered a long term outcome of central interest. Rather, it was a process effect.

In the case of the community development corporation, long term outcomes might be an improvement in the quality of the housing stock in the designated area or an increase in the number of jobs in the designated area held by people residing in that area, while a process outcome might be participation in community boards which make decisions about how to allocate the program resources.

It should be recognized, of course, that what is considered a "process variable" for some purposes may be considered an outcome variable for others; e.g., participation of community members in decision-making could be regarded as part of a process leading to a program outcome of improved youth school performance in one situation, but could be an "empowerment" outcome valued in its own right in another. A clear delineation of the theory of the intervention process would specify which effects are "process" and which are "outcome" effects.

The Counterfactual

The basic question an evaluation seeks to address is whether the activities consciously undertaken which constitute the community-wide initiative generated a change in the outcomes of interest. To address this central issue, the problem in this case, as in virtually all evaluations, is to establish what would have happened in the absence of the program initiative. This is often referred to as the counterfactual. Indeed, most of our discussion will turn around a review of alternative methods that have been tried in order to establish a counterfactual for a given type of program intervention.

To those who have not steeped themselves in this type of evaluation, it often appears that this is a trivial problem. Simple solutions are proposed. For example, let's look at what the situation was before the initiative and what the situation is after the initiative in the given community. The counterfactual is the situation before the initiative. Or let's look at this community and find another community that initially was very much like it and then see how after the program initiative the two communities compare on the outcome measures. That will tell us the effects of the program. The comparison community will provide the counterfactual - what would have happened in the absence of the program.

As we shall see however, and as most of us know, these simple solutions are not adequate to the problem - primarily because individuals and communities are changing all the time with respect to the measured outcome even in the absence of any intentional intervention. Therefore, measures of the situation before the initiative or with comparison communities are not secure counterfactuals; they may not represent well what the community would have looked like in the absence of the program.

Let's return again to some concrete examples. YIEPP pursued a strategy of pairing communities in order to develop the counterfactual. For example, the Baltimore school district was paired with Cleveland. The Cincinnati school district was paired with a school district in Louisville, etc. In making the pairs the researchers sought to choose communities that had labor market conditions similar to those of the treatment community.

A similar procedure, with a great deal more detailed analysis, was adopted as part of an on-going study of school dropout programs currently being conducted by Mathematica Policy Research. The school districts with the dropout program were matched in statistical detail with school districts in the near neighborhood, that is within the same city or SMSA (Standard Metropolitan Statistical Area).

In both of these examples, even though the initial match seemed to be quite good, circumstances evolved in ways that made the comparison areas doubtful counterfactuals. In the case of YIEPP, for example, Cleveland had an unexpectedly favorable improvement in its labor market compared to Baltimore. Louisville had disruption of its school system because of court-ordered school desegregation and busing. This led the investigators to discount some of the results from using these comparison cities. In the case of the school dropout study, though the districts matched well in terms of detailed school and population demographics at the initial point, a couple of years later, when surveys had been done of the students and teachers in the respective school districts, it was found that in terms of the actual processes of the schools the match was often very bad indeed. The schools simply were operating quite differently in the pre-program period and had different effects on students and teachers.


Random Assignment as the Standard for Judgement

For quantitative evaluators random assignment designs are a bit like the nectar of the Gods: once you've had a taste of the pure stuff it is hard to settle for the flawed alternatives. In what follows, we often use the random assignment design - in which individuals or units which are potential candidates for the intervention are randomly assigned to the treatment group, which is subject to the intervention, or to the control group, which is not subject to any special intervention. (Of course, random assignment does not have to be to a null treatment for the controls; there can be random assignment to different levels of treatment or to alternative modes of treatment.) The key benefit of a random assignment design is that, as soon as the number of subjects gets reasonably large, there is a very low probability that any given characteristic of the subjects will be more concentrated in the treatment group than in the control group. Most important, this holds for unmeasured characteristics as well as measured characteristics. Thus when we compare average outcomes for treatments and controls we can have a high degree of confidence that the difference is related to the treatment and not to some characteristic of the subjects. The control group provides a secure counterfactual because, aside from the treatment, the control group members are subject to the same forces which might affect the outcome as are those in the treatment group: they grow older just as treatment group members do, they face the same changes in the risks of unemployment or increases in the returns to their skills, and they are subject to the same broad social forces that influence marriage and family practices.
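To make the logic concrete, here is a minimal simulation sketch, not drawn from any study discussed in this paper, of why random assignment balances unmeasured characteristics and lets a simple difference in mean outcomes recover the treatment effect; all variable names and numbers are illustrative assumptions.

```python
# Minimal simulation of why random assignment yields a secure counterfactual.
# All numbers here are illustrative, not drawn from any study in the paper.
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # a reasonably large sample of eligible individuals

# An unmeasured characteristic (e.g., "motivation") that also drives the outcome.
motivation = rng.normal(0.0, 1.0, size=n)

# Random assignment: each person has a 50/50 chance of treatment.
treated = rng.random(n) < 0.5

# Hypothetical outcome: baseline + effect of motivation + true treatment effect + noise.
true_effect = 2.0
outcome = 10.0 + 3.0 * motivation + true_effect * treated + rng.normal(0.0, 2.0, size=n)

# Because assignment is random, the unmeasured trait is (in expectation) balanced ...
print("mean motivation, treatment:", motivation[treated].mean())
print("mean motivation, control:  ", motivation[~treated].mean())

# ... so the simple difference in mean outcomes recovers the treatment effect.
impact = outcome[treated].mean() - outcome[~treated].mean()
print("estimated impact:", round(impact, 2), "(true effect is", true_effect, ")")
```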

We realize that this standard is very difficult, often impossible, for evaluations of community-wide initiatives to meet. But we use it in order to obtain reliable indications of the type and magnitude of errors which can occur when this best design is not feasible[4]. Unfortunately, there appear to be no clear guidelines for selecting second-best approaches, but a recognition of the character of the problems may help set us on a path to developing such guidelines.


The Nature of the Unit of Analysis

For most of the programs that have been rigorously analyzed by quantitative methods to date, the principal subject of program intervention has been the individual. When we turn to community-wide initiatives, however, the target of the program and the unit of analysis usually shift away from just individuals to one of several possible alternatives. The first, with which we already have some experience, is where the target of the program is still individuals, but individuals within geographically bounded areas. While the individuals are still the targets of the intervention, the fact that they are defined as being within a geographically bounded unit is intentional, because it is expected that interactions among individuals or changes in the general context will generate different responses to the program intervention than would treatment of isolated individuals.

Another possible unit of analysis is families. We have had some experience with programs in which families are the targets for intervention and where the proper unit of analysis remains the family rather than sets of individuals analyzed independently of their family unit. This would, of course, be the case with, for example, family support programs. These become community-wide initiatives when the set of families to be considered is defined as being within geographically bounded areas and eligibility for the program intervention somehow relates to those geographical boundaries. Many of the recent community-wide interventions seem to have this type of focus: a focus on families within geographically bounded areas.

Another possibility for community initiative is where the target and unit of analysis are institutions rather than individuals. Thus within a geographically bounded area an attempt might be made to have a program which targets particular sets of institutions - the schools, the police, the voluntary agencies, the health providers - and to generate changes in the behavior of those institutions per se. Then the institution becomes the relevant unit of analysis.

The reason for stressing the importance of being clear about the unit of analysis is that it can make considerable differences in the basic requirements for the statistical analysis used in the evaluation. Quantitative analyses focus on the frequency distribution of the outcome and we use our statistical theory in order to make probabilistic statements about the particular outcomes that we observe. The theory is based on the idea that a particular process has generated outcomes that have a random element in them. The process does not generate the same result every time but rather a frequency distribution of outcome values. When we are using these statistical methods to evaluate the impact of programs we are asking whether the frequency distribution of the outcome has shifted because of the effect of the program. Thus a statistically significant difference in an outcome associated with a program is a statement that the outcome we observe from the units subject to the program intervention has a very low probability of coming from a distribution which is the same as the distribution of that outcome for the counterfactual group. So if the community, in some sense, is the unit of analysis and we're looking at, for example, the incidence of low birth weight children in the community, then we need to have information about the frequency distribution across communities of the percentage of low birth weight babies.

The unit of analysis becomes critical because the ability to make these probability statements about effects using statistical theory depends on the size of the samples. So if the community is the unit of analysis then the sample size will be the number of communities in our samples. If the court system is the unit of analysis and we're asking about changes in incarceration rates generated by court systems - changing the courts in one community in some way and not in another - then we want to know about the frequency distribution of incarceration rates across different court systems, and the size of the sample would be the number of such systems observed.
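A small sketch may make the point concrete. In the hypothetical example below, the statistical test is carried out on community-level rates, so the effective sample size is the ten communities, no matter how many births lie behind each rate; the rates themselves are invented for illustration.

```python
# Sketch: when communities are the unit of analysis, the sample size is the
# number of communities, not the number of individuals within them.
# The rates below are hypothetical illustrations only.
import numpy as np
from scipy import stats

# Percentage of low birth weight births in each community (one number per community).
treatment_communities = np.array([7.1, 6.4, 8.0, 6.8, 7.5])   # 5 treated communities
comparison_communities = np.array([8.2, 7.9, 8.8, 7.4, 8.5])  # 5 comparison communities

# The t-test is carried out across communities, so its degrees of freedom come
# from the 10 community-level observations, however many births underlie each rate.
t_stat, p_value = stats.ttest_ind(treatment_communities, comparison_communities)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, based on only "
      f"{len(treatment_communities) + len(comparison_communities)} units")
```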


The Problem of Boundaries

When we're talking about community-wide initiatives we're often talking about cases where geographical boundaries define the unit or units of analysis. Of course the term community need not imply specific geographic boundaries. Rather it might have to do with, for example, social networks. What constitutes the community may vary depending upon what type of program process or what type of outcome we are talking about. The community for commercial transactions may be quite different from the community for social transactions. The boundaries of impact for one set of institutions, let us say the police, may be quite different from the boundaries for impacts of another set of institutions, let us say schools or healthcare networks.

We will not attempt here a full discussion of how boundaries of communities or neighborhoods might be defined[5]. We quote some insights which illustrate the complexity of the issue of community or neighborhood boundaries: "...differentiated subareas of the city are recognized and recognizable...neighborhoods are perhaps best seen as open systems, connected with and subject to the influence of other systems... individuals are members of several of these systems at once...delineation of boundaries is a product of individual cognition, collective perceptions, and organized attempts to codify boundaries to serve political or instrumental aims... local community may be seen as a set of (imperfectly) nested neighborhoods...recognition of a neighborhood identity and the presence of a 'sense of community' seems to have clear value for (1) supporting residents' acknowledgement of collective circumstances and (2) providing a basis and motivation for collective action...neighborhoods are experienced differently by different populations [and]...are used differently by different populations"[6].

It is suggested that a neighborhood or community might be defined for the purposes of a program by reference to some of the following principles:

"[1]Match the place to the intervention. [2] Identify the relevant stake holders. [3] Determine the appropriate change agents. [4] Determine the necessary capacity to foster and sustain change."[7]

Note that many of the recent community-wide initiatives have as one of their principal concerns the "integration of services". These integration efforts run right up against the problems of boundary definitions, since the catchment areas for various types of service units intersect or fail to intersect in complicated ways in any given area.

For the purposes of evaluation, these boundary problems introduce a number of complex issues. First, where the evaluation uses a before-and-after design (we discuss alternative types of designs for evaluations in detail below), i.e., a counterfactual based on measures of the outcome variables in the same area in a period before the intervention to be compared with such measures in a period after the intervention, the problem of changes in boundaries may arise. These could occur either because some major change in the physical landscape occurs, e.g., a new highway bisects the area or a major block of residences is torn down so a trash-to-steam plant can be built, or because the data collection method is based on boundaries that are shifted, e.g., redistricting of schools or changing of police districts. Similar problems would arise where a comparison-community design is used for the evaluation and similar changes occur either in the treatment community or the comparison community.

Second, inflow and outflow of people across the boundaries of the community have to be dealt with in the evaluation. Some of the people who have been exposed to the treatment migrate out of the community, and unless follow-up data are collected on these migrants, some of the treatment effects may be misestimated. Similarly, in-migrants enter the area during the treatment period, have had less exposure to the treatment, and may "dilute" the measured treatment effects (either negatively or positively).

Third, one of the most serious problems which evaluations of community-wide initiatives face is the limited availability of regularly collected small-area data. The decennial Census is the only really complete data source we have which allows us to measure population characteristics at the level of geographically defined small areas. In the inter-censal years, the best we can do in most cases is to extrapolate or interpolate. For the nation as a whole, regions, states and standard metropolitan statistical areas, we can get some regularly reported data series on population and industry characteristics. For smaller areas, we cannot obtain reliable, regularly reported measures of this sort. We will suggest below some steps which might be taken to try to improve our measurements in small geographic areas, but at present this remains one of the most serious handicaps faced in quantitative monitoring of the status of communities.


The Basic Requirements for Statistical Inference in Evaluations

With apologies to those who are well versed in the subject, we thought it would be useful to quickly review the basic elements that go into the design for the statistical analysis involved in a quantitative evaluation.

1. A Theoretical Model

First, there should be some, however primitive, theoretical model that links the intervention elements to a set of outcomes. Such a model can be as rudimentary as: this group got the treatment, that group did not, so the two groups should differ on this outcome measure; the treatment will increase this outcome and decrease that outcome. This type of model is often criticized as a "black box" model: the treatment is only crudely defined as a "black box" that some people are in and some people are not, and what happens inside that box - the process by which the treatment is transformed into the outcome - is not specified. More refined theoretical models will detail the treatment elements, the behavioral processes on which they impinge, and how that impingement could change the behavioral outcomes of interest.

2. The Variance in the Outcome Variable

The second key element in the design and the statistical analysis is some estimate of the normal variance in the outcome measure or measures. As already emphasized, we know that the outcomes will have some frequency distribution. How spread out that frequency distribution typically is in the absence of any intervention is critical information since it tells us, in some sense, how much "noise" there will be in the outcome measures. The larger the variance in the outcome the harder it will be to detect the effect of the treatment. For example, the employment rate in the community tends to have relatively small variance whereas the income of individuals in a community tends to have a very high variance. It is therefore much harder to detect changes in the average income than it is to detect changes in employment rates.

The adequacy of the theoretical model plays a role here as well. To the degree that the theoretical model specifies the whole set of variables which influence the outcome in the absence of the treatment, and those variables typically explain a good deal of the variance in the outcome variable, using these variables in an explanatory model reduces the "noise" in the background. What becomes relevant is how much variance is left in the outcome after we have taken into account the effects of other measured variables. For example, the variation in earnings might be $15,000, but when we include measures of the individual's education, age, gender and ethnicity, the residual variation is $10,000. So the better the model is at explaining the outcome, the lower the residual variance and the easier it will be to detect the effects of any treatment.
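The following sketch, using simulated data and invented coefficients, illustrates the point: once measured characteristics such as education, age and gender are included in the model, the residual spread of earnings against which a treatment effect must be detected is considerably smaller than the raw spread.

```python
# Sketch: covariates that explain part of the outcome shrink the residual
# "noise" against which a treatment effect must be detected.  Data are simulated.
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Simulated background characteristics (education in years, age, a gender indicator).
educ = rng.normal(13, 2.5, n)
age = rng.uniform(20, 60, n)
female = rng.integers(0, 2, n)

# Hypothetical earnings process: characteristics plus idiosyncratic noise.
earnings = 2000 * educ + 300 * age - 4000 * female + rng.normal(0, 10000, n)

print("raw std. dev. of earnings:     ", round(earnings.std(), 0))

# Regress earnings on the measured characteristics and look at the residuals.
X = np.column_stack([np.ones(n), educ, age, female])
coef, *_ = np.linalg.lstsq(X, earnings, rcond=None)
residuals = earnings - X @ coef

print("residual std. dev. after model:", round(residuals.std(), 0))
# The smaller residual spread means a treatment effect of a given size is
# easier to distinguish from background variation.
```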

3. The Desired or Expected Size of Response

The next element in statistical design is the size of the response to the treatment which the evaluation would seek to detect. This could be the minimum desired size of response, or it could be the size of response that, based on previous studies, is expected. This may seem a bit peculiar because it asks for a prior specification of exactly the answer one is looking for through the evaluation analysis. This element is necessary, however, in order to help ensure that the statistical design will be sufficient so that, if the desired size of response does in fact occur, the statistical test will yield the conclusion that the response was statistically significantly different from zero.

For example, suppose that the objective was to lower the incidence of low birth weight children in a given neighborhood, and the treatment would be judged to be successful if it lowered the incidence of low birth weight children by three percentage points. Then one would want to be certain that, if the true effect of the treatment were on average to lower the incidence of low birth weight children by three percentage points, the sample size is big enough that there is a good chance the analysis would conclude that the effect of the treatment was greater than zero.

It is an irony of this particular aspect of statistical design that the smaller the desired or expected size of response, the bigger the sample must be in order to detect it, other things being equal: it takes more resources to detect a small effect than it does to detect a big effect.

Policy relevance can enter into the determination of the desired or expected size of response, since one could say, for example, that if the treatment lowers the incidence of low birth weight children by only half a percentage point, that would not be sufficient to be meaningful in the policy realm, and therefore we don't need a statistical design and sample size large enough to detect an effect that small.
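For readers who want to see how such calculations are typically done, the sketch below applies the standard two-proportion sample-size formula to the low birth weight example; the 10 percent baseline rate, 5 percent significance level and 80 percent power are assumptions chosen purely for illustration, not figures from any study discussed here.

```python
# Sketch: approximate sample size per group needed to detect a drop in the
# incidence of low birth weight from an assumed 10% baseline to 7% (a three
# percentage point effect), using the standard two-proportion formula.
# Baseline rate, significance level and power are illustrative assumptions.
from math import sqrt, ceil
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Two-sided test of the difference between two independent proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

print("births needed per group for a 3 point effect:  ", n_per_group(0.10, 0.07))
# Halving the detectable effect sharply raises the requirement, illustrating
# the point that small effects demand much larger samples:
print("births needed per group for a 1.5 point effect:", n_per_group(0.10, 0.085))
```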

4. Number and Size of Treatments

The next characteristic of statistical design is the number and the size of the treatments. While in the simplest cases we have one type of treatment, in many cases there are multiple dimensions to the potential treatment and it may be useful to vary those dimensions. For example, in the early negative income tax experiments two characteristics of the treatment were systematically varied: the level of the basic income guarantee and the rate at which the negative income tax payments were reduced as the income of the family increased. These two dimensions were varied in size to get a combination of different plans being tested within one experiment. Not surprisingly, depending on exactly how the number of treatments is set up, the requirements for sample size and data collection for the evaluation become considerably larger.

The size of the individual treatments of course has an effect on the expected size of response. In some experiments one is varying the level of the treatment and looking for a "dosage" effect. So, for example, in the simple case one might be trying to lengthen the school year and see what the effects of that are, but added information could be obtained by lengthening it by different amounts for different treatment groups to try to see what the "dosage" effect of lengthening the school year would be.

5. Power Analysis

In developing a statistical design for an evaluation one carries out what is called a statistical power analysis. This provides an estimate of the probability of detecting a statistically significant effect of a given size with a given design and sample. The power analysis is a more conservative procedure than simply indicating the minimum detectable response. It recognizes that, whatever the true average response, samples drawn to test the treatment would, if drawn repeatedly, generate a frequency distribution around a mean response, and one cannot be sure before carrying out the experiment where in that distribution of outcomes the particular sample taken will lie. The power analysis takes this into account in a more conservative fashion.
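A simulation-based version of a power analysis can be sketched as follows; the baseline rate, assumed treatment effect and sample sizes are again purely illustrative.

```python
# Sketch of a simulation-based power analysis: given an assumed true effect,
# how often would a study of a given size declare it statistically significant?
# Baseline rate, effect size and sample sizes are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def simulated_power(p_control=0.10, p_treatment=0.07, n_per_group=1000,
                    alpha=0.05, n_replications=2000):
    rejections = 0
    for _ in range(n_replications):
        control = rng.binomial(1, p_control, n_per_group)
        treated = rng.binomial(1, p_treatment, n_per_group)
        # Simple two-sample test on the observed proportions.
        _, p_value = stats.ttest_ind(treated, control)
        if p_value < alpha:
            rejections += 1
    return rejections / n_replications

print("estimated power with 1,000 births per group:", simulated_power())
print("estimated power with 2,000 births per group:",
      simulated_power(n_per_group=2000))
```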

We have reviewed these basic requirements for statistical analysis design in order to provide a framework against which we will discuss a number of characteristics of alternative designs. We hope this discussion will help the reader understand how particular problems in a given approach relate to this basic design model.


Problems with Outcome Measures

Community-wide initiatives present particular problems in defining what major outcome measures the evaluation should focus on. In many past evaluations in the social policy area the major outcome variables have been relatively straightforward and agreed upon, for example, the level of employment, the rate of earnings, the test scores of children, the incidence of marriage and divorce, the incidence of low birth weight children, arrests and incarcerations, school continuation or dropout rates, and birth outcomes. For community-wide initiatives, these traditional types of outcome measures may not be the primary outcome measures, or may be regarded as ultimate outcome measures but ones which may not show detectable effects in the short term. For example, in the famous Perry Preschool study the long term outcomes now often talked about - employment, earnings, delinquency - obviously could not be directly measured at the start of the evaluation. This may be true for some of the community initiatives as well, where it may be felt that during the period of the short term evaluation it is unlikely that traditional outcome measures will show much change even though it may be hypothesized that in the long run they will. For community initiatives, then, we need to distinguish intermediate outcomes and final outcomes.

In addition, in community initiatives there may be types of outcome measures that have not been used traditionally but are regarded as outcomes of sufficient interest in and of themselves, regardless of whether they eventually link to more traditional outcome measures. Particularly where the object of the community initiative is a change in institutional behavior, it may be that some of the more traditional individual outcome measures are considered of secondary interest. For example, if an institution is open longer hours, disburses more funds or reduces its personnel turnover, these might be outcomes of interest in their own right and not as intermediate outcomes.

Finally, we would want to make a careful distinction between input measures - or process measures - and outcome measures. For instance, an input measure might be the number of people enrolled in a GED program, whereas the outcome measure might be the number of people who passed their GED exam or, going further, their eventual employment and earnings. Process measures might be changes in the organizational structure, such as giving classroom teachers more authority in determining curriculum content rather than having superintendents or school boards determine it. This would be considered a process measure, whereas the effect on student achievement would be the ultimate outcome measure of interest.

One might try to use the set of principles outlined by Brown and Richman at the outset of this paper and translate the principles into the sets of categories we have just talked about. First, starting from the statement of principles and development assumptions, could we define a set of final outcome variables upon which the "success or failure" of a community-wide initiative might be judged? Second, could we derive from these a set of intermediate measures that we think would be related to the ultimate long term outcomes but which would be more measurable in the short-term? Third, could we distinguish from these principles those measures which would be input and process measures rather than outcome measures?

As one seeks to address these questions it becomes clear that it is important to try to determine as best as possible the likely audience for the evaluation results. The criteria for what are important outcomes to be measured and evaluated are likely to vary with that audience. Will the audience in mind, for example, be satisfied if it can be shown that a community-wide initiative did indeed involve the residents in a process of identifying and prioritizing problems through a series of planning meetings, but it could not be shown that this process led to changes in school outcomes or employment outcomes or changes in crime rates in the neighborhood? Academics, foundation staff, policy-makers and administrators are likely to differ greatly in their judgement of what outcomes provide the best indicators of success or failure.

Another dimension of this problem is the degree to which the audience is concerned with the outcomes for individuals vs. the outcomes for place. This of course is an old dilemma in neighborhood change, going back to the time of urban renewal programs. In those programs the geographical place may have been transformed by removing the poor people and replacing them, through a gentrification process, with a different population; place was perhaps improved but people were not. At the other extreme, the Gautreaux process moved low income people from the center city to the suburban fringe, and it is judged that their lives were improved, but of course the places that they left were, if anything, in worse shape after they left.


Important Studies which Demonstrate the Problem of Selection Bias

There are two sets of studies now available which illustrate the seriousness of the problems which can arise when comparison groups are constructed by means other than random assignment. These studies both start with data generated through true random assignment experiments. The results of the random assignment experiments are used as the standard of what the "true" estimates of the effects of the program are. In each of the studies alternative comparison groups are then constructed in a variety of ways and another set of estimates of the effects of the program on the outcome variable are made using the constructed comparison group in place of the control group. A test is then made by looking at the estimated effect of the program using the constructed comparison groups and comparing that with the "true impact estimates" taken from the experiment.

The first set of studies was based on the National Supported Work Demonstration, which ran between 1975 and 1979 in eleven cities across the United States. This was a subsidized employment program for four target groups: ex-addicts, ex-offenders, high school dropouts and women on AFDC. Two sets of investigators working independently used these experimental data and combined them with non-experimental data to construct comparison groups[8]. Both sets of investigators looked at various ways of matching the treatment subjects from the experiment with counterparts taken from other data sources. In constructing the comparison group from other data sources they followed methods that had previously been used by other investigators to study employment and training programs such as those under the Comprehensive Employment and Training Act (CETA).

The major conclusion from this set of important studies was that the constructed comparison groups provided unreliable estimates of the true program impacts on employment and earnings, and that none of the matching techniques used looked particularly superior to the others - there was no clear second-best.

The second major set of studies was carried out more recently using data from MDRC's Work/Welfare studies in several states[9]. This work is even more closely relevant to the problems we address here, and we will discuss some of the results in more detail later in the paper. What the investigators did was to use the treatment groups from the actual Work/Welfare experiments and, for constructed comparison groups, the control groups from other locations or other time periods.

The Work/Welfare experiments were carried out in several states. This made it possible for the researchers to try to construct comparison groups by using the treatment group from one state and the control group from another state. In addition, for several of the programs the experiment was carried out in several different offices within the same state, so they could use a treatment group from one geographic location within the state or city and the control group from another. Finally, because the samples were large enough, they could take the treatment group from one time period at a given site and the control group from another time period at the same site, to get what they call "across-cohort" comparisons.

The investigators tested a variety of ways of trying to improve the construction of the comparison group. They then took the estimates of experimental effects on employment rates using the true control groups and compared them with the results using the constructed comparison groups. They tried not only different combinations of constructed groups but also different ways of matching on measured characteristics, and they tried some sophisticated specification tests, suggested by Heckman[10] and others, to eliminate those constructed comparison groups which the tests suggest are less well matched.

The results showed substantial differences in the magnitude of the estimated impact between the true experimental results and the results based on constructed comparison groups. In many cases not only is the magnitude of the effect different but the statistical inference is different as well, that is, whether the impact is statistically significant or whether its sign is positive or negative. Their results seem to indicate that, at least for these data, comparison groups constructed from different cohorts in the same site perform somewhat better than the other types of comparison groups.

The importance of these two sets of studies is that they indicate that the problem of bias arising when comparison groups are constructed by any method other than random assignment is likely to be quite serious. They show that statistical controls using measured characteristics are in most cases inadequate to overcome that bias. This, then, is not just a theoretical possibility; it can be shown empirically, in these real life experiments, to have been a very real one. Investigators could have been seriously misled in their conclusions about the effectiveness of these programs had they used methods other than random assignment to construct their comparison groups. Even more important, we should remember that these studies look at the effectiveness of alternative methods of creating comparison groups ex post, but when one is developing a design for an evaluation one must make a priori judgments about the extent of bias that the results might in the end show. In most cases, one would not have the luxury of using a specification test afterwards to eliminate certain types of constructed comparison groups. One would have to guess beforehand whether the type chosen is likely to fall in the group that would be rejected by the test or not.
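The logic of the bias these studies document can be illustrated with a small simulation: when selection into the program depends on an unmeasured characteristic that also affects the outcome, a constructed comparison group of non-participants gives a badly biased estimate even though the experimental contrast recovers the truth. The sketch below reproduces only the logic, not the data or specifications of the studies cited.

```python
# Sketch of the selection-bias problem: a comparison group selected on an
# unmeasured trait gives a biased impact estimate even when the true effect
# is known.  Everything below is simulated for illustration.
import numpy as np

rng = np.random.default_rng(3)
n = 20000
true_effect = 1000.0

motivation = rng.normal(0, 1, n)          # unmeasured characteristic
base_earnings = 9000 + 2500 * motivation + rng.normal(0, 3000, n)

# (a) Experiment: treatment assigned at random among eligibles.
treated = rng.random(n) < 0.5
earnings = base_earnings + true_effect * treated
exp_estimate = earnings[treated].mean() - earnings[~treated].mean()

# (b) Constructed comparison group: people who chose not to participate,
# where the choice itself depends on the unmeasured motivation.
chose_program = rng.random(n) < 1 / (1 + np.exp(-motivation))  # the motivated join
earnings_obs = base_earnings + true_effect * chose_program
nonexp_estimate = (earnings_obs[chose_program].mean()
                   - earnings_obs[~chose_program].mean())

print("true effect:                    ", true_effect)
print("experimental estimate:          ", round(exp_estimate))
print("constructed-comparison estimate:", round(nonexp_estimate))
```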


Types of Comparison Groups and Experiences With Them To Date

1. Constructed Groups of Individuals

Constructed comparison groups of individuals were the most often used method of evaluation prior to the development of random assignment in large scale social policy studies in the 1970's and its much wider use in a range of programs in the 1980's. We will only talk briefly about these methods based on individuals, since they are of less relevance to the issues of evaluation of community-wide initiatives.

The earliest type of constructed group was a before and after, or "pre-post", design. Measurements were made on the individuals before they entered the treatment and again after the conclusion of the treatment. Impacts were measured as the change from before the program to after the program. This design has long been recognized as highly vulnerable to changes in individuals that occur naturally through life processes. In some cases these are simply changes associated with aging. For example, with respect to criminal behavior there is a known decline in criminal activity with age, a phenomenon referred to as "aging out". With respect to employment and training, program eligibility is often based on a period of unemployment prior to program entry. In these cases, we know that mistaken inference can easily occur since, for any group of people currently unemployed, the usual processes of job search which go on in the absence of any program would result over time in some percentage of them, usually a very high percentage, becoming employed or re-employed. One cannot untangle the program effects from those of usual job finding processes.

Another type of constructed comparison group that has been used is that of non-participants in the program as comparisons with those who participated. This was used in early evaluations of the Job Corps. A more recent example is the evaluation of the Special Supplemental Food Program for Women, Infants and Children (WIC) (Devaney, 1990). In this study a comparison was made between welfare recipients who participated in the WIC program and those who did not. Another case is the study of the National School Lunch and School Breakfast Programs (Burghardt et al, 1993). An attempt was made to evaluate the dietary impact of the program by using non-participants in the same sites as those from which the data on School Lunch Program participants were gathered.

This type of design has long been recognized as subject to serious bias due to selection on unobserved variables. Usually there is a reason why individuals have participated in the program or not. In some cases it can be the individual's motivation; in some cases it can be subtle selection procedures followed by the program administrators. If that selection, for either reason, is on characteristics that would affect the final outcome, and these characteristics are unmeasured, then estimating the impacts by the difference between the participant and non-participant groups will be subject to bias which could run in either direction.

Several major studies have sought to use existing survey data as a source from which to draw individuals for the comparison group. The most commonly used source of information is the U.S. Census Bureau's Current Population Survey (CPS), which has large national samples of individuals. Comparison groups are usually constructed by matching the characteristics of the individuals in the treatment group to individuals in the CPS. This procedure was used in a number of evaluations of employment training programs reported in the 1980's (Bloom 1987, Ashenfelter and Card 1985, Bassi 1983-84, Bryant & Rupp 1987, Dickinson, Johnson and West 1986), where program enrollment data were often used in combination with the CPS data or data from Social Security records. These data would generally be regarded as desirable, as they would give an opportunity to have a long series of earnings observations on individuals prior to the time of program eligibility as well as during program eligibility. In the studies by Maynard and Fraker the CPS and Social Security data were used in combination.

Different methods of matching were used in these various studies. In some cases cells were defined for characteristics and samples were drawn from the CPS according to the cells of the characteristics matrix in which they fell, to match the proportion of the treatment group that fell in those cells. In some of the studies, a "nearest neighbor" match based on the Mahalanobis distance was used to try to get a better match.
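As an illustration of the kind of procedure being described, the sketch below performs nearest-neighbor matching on the Mahalanobis distance using invented covariates and a hypothetical donor pool; it is not the specification used in any of the studies cited.

```python
# Sketch of nearest-neighbor matching using the Mahalanobis distance.
# Covariates and data are hypothetical.
import numpy as np

rng = np.random.default_rng(4)

# Covariates (age, years of education, prior-year earnings in $1,000s).
treatment_X = rng.normal([30, 11, 8], [6, 2, 5], size=(200, 3))
donor_pool_X = rng.normal([35, 12, 12], [8, 2, 7], size=(5000, 3))  # e.g. survey records

# Mahalanobis distance uses the inverse covariance matrix of the covariates,
# so variables are compared on a scale-free, correlation-adjusted basis.
pooled = np.vstack([treatment_X, donor_pool_X])
cov_inv = np.linalg.inv(np.cov(pooled, rowvar=False))

def nearest_neighbor(x, donors, cov_inv):
    diffs = donors - x
    d2 = np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)  # squared distances
    return int(np.argmin(d2))

matches = [nearest_neighbor(x, donor_pool_X, cov_inv) for x in treatment_X]
comparison_X = donor_pool_X[matches]

print("treatment means:         ", treatment_X.mean(axis=0).round(1))
print("matched comparison means:", comparison_X.mean(axis=0).round(1))
# The matched group looks similar on these measured covariates; the studies
# cited show this does nothing to guarantee balance on unmeasured ones.
```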

Some other existing survey data files have also been used for constructing comparison groups. For example, the Panel Study of Income Dynamics was used (in addition to the CPS) by LaLonde in his study based on the Supported Work data. A good deal more could be said about the details of different methods of matching and attempts to use sample selection corrections of the Heckman type, but we will not go into much of that here.[11]

To try to get around the problem of the influence of unobserved variables, analysts, since the late 1970s, have relied on methods of statistically correcting for potential bias. The methods used are based primarily on those developed by James Heckman, currently at the University of Chicago. The basic approach was to try to model the selection process, that is, to develop a statistical equation which would predict the probability of being selected to be in the treatment group or in the comparison group. While the approach proposed could work in certain situations, it has turned out in experience that it cannot generally be relied upon to deal with the problem of unobserved variables. Understanding the problem of unobserved variables and the weakness of any methodologies other than random assignment for dealing with this problem is central to the appreciation of the difficulties that are to be faced in the evaluation of community wide initiatives. We will touch on this repeatedly in what follows.
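The flavor of the two-step approach can be conveyed with a sketch in the spirit of Heckman's procedure: a probit equation for the probability of participation, followed by an outcome regression that adds the implied "inverse Mills ratio" term. The data, variable names and specification below are invented for illustration and are not drawn from any of the studies cited.

```python
# Sketch of the two-step selection-correction idea: model the probability of
# participation, then add the implied correction term to the outcome equation.
# All data and coefficients are simulated for illustration only.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 5000

educ = rng.normal(12, 2, n)
unobserved = rng.normal(0, 1, n)                      # the troublesome part
participate = (0.3 * (educ - 12) + unobserved + rng.normal(0, 1, n)) > 0
earnings = 8000 + 900 * educ + 2000 * unobserved + rng.normal(0, 3000, n)

# Step 1: probit for participation on observed characteristics only.
Z = sm.add_constant(educ)
probit = sm.Probit(participate.astype(int), Z).fit(disp=0)
xb = Z @ probit.params
mills = norm.pdf(xb) / norm.cdf(xb)                   # inverse Mills ratio

# Step 2: outcome regression for participants, including the correction term.
X = sm.add_constant(np.column_stack([educ[participate], mills[participate]]))
ols = sm.OLS(earnings[participate], X).fit()
print(ols.params)  # constant, education, selection-correction coefficient

# The correction helps only if the selection equation contains variables that
# genuinely predict participation but not the outcome; without such exclusion
# restrictions the method cannot be relied on, which is the point made above.
```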

2. Constructed Comparison Groups of Institutions

In a few cases, where the primary unit of analysis has been an institution, attempts have been made to construct comparison groups on the basis of institutions. These procedures come closer to the problems of community-wide initiative evaluations. The major example we have is the school dropout studies which are currently under way (Dynarski et al 1993). While in parts of this study there is random assignment of individuals to a school dropout prevention program and to a control group, for part of the study it was deemed not possible to carry out random assignment within the school, so an attempt was made to find other schools that could be used as a comparison group in judging the effectiveness of the dropout program. As noted several times above, after the schools had been initially matched, survey data were collected from students, parents and school administrators. A comparison of these data showed that, in spite of being demographically similar, the schools were operationally quite different. Note that in this case, even though the student outcomes are the ultimate subject of the study, i.e., whether the kids drop out or not, the institution had to be the unit of comparison because the context was felt to be important in determining school dropout outcomes. Therefore, to create a counterfactual, one wanted to find an environment which would have been similar to that in the treatment schools.

In one study there were a large enough number of schools to at least attempt a quasi-random assignment of schools to treatment and control groups. Twenty-two schools were first matched on socio-economic characteristics and then randomly assigned within matched pairs to treatment and control (Flay et al 1985).

3. Comparison Communities

There are several examples of attempts to use communities as the unit for building the comparison group. The idea is superficially quite appealing: find a community that is much like the one in which the new treatment is being tested and then use this community to trace how the particular processes of interest or outcomes of interest evolve compared to that in the "treatment community". As we will see, however, in practice this simple procedure has lots of pitfalls.

a. Treatment Site Predetermined

In most cases the treatment site has been predetermined before the constructed comparison site is selected. An example of this type of project is the Youth Incentive Entitlement Pilot Project (YIEPP), which was described at the outset of this paper. The four treatment sites were matched with sites in other cities, matching on weighted characteristics such as labor market conditions, population characteristics, the high school dropout rate, socio-economic conditions and geographic proximity to the treatment site. In this case, however, unforeseen changes in the comparison sites made their validity as counterfactuals for the treatment sites extremely doubtful. For example, Cleveland, which was paired with Baltimore, had an unexpected improvement in its labor market; events such as court-ordered school busing and teacher strikes made the usefulness of other comparison sites extremely questionable.

The Employment Opportunity Pilot Project (EOPP) was a very large scale employment program, begun in the late 1970's and carried on into the early 1980's, focused on chronically unemployed adults and families with children. It also used constructed comparison sites as part of its evaluation strategy. Once again there were problems with unexpected changes in the comparison sites. For example, Toledo, which had major automobile supply manufacturers, was subject to a downturn in that industry. Further, out of 10 sites, one had a major hurricane, a second had a substantial flood and a third had a huge unanticipated volcanic eruption.

A project currently getting under way, the Healthy Start Evaluation, will also use comparison sites. Two comparison sites are being selected for each treatment site (Devaney and Morano 1993). In developing comparison sites investigators have tried to add to the more formal statistical matching by asking local experts whether the comparison sites tentatively selected make sense in terms of population and service environment.

The evaluation of Community Development Corporations, being carried out by the New School for Social Research under Mercer Sullivan's direction, has selected comparison neighborhoods within the same cities as the three CDC sites which they are evaluating.

b. Treatment and Comparison Sites Randomly Assigned

There are a couple of examples where the treatment sites were not predetermined but rather were selected simultaneously with the selection of the comparison sites. The largest such evaluation is that of the State of Washington's Family Independence Program (FIP), an evaluation of a major change in the welfare system of the state (Long and Wissoker 1993). The evaluators, having decided upon a comparison group strategy, created east/west and urban/rural stratifications within the state in order to obtain a geographically representative sample. Within five of these subgroups, pairs of welfare offices, matched on local labor market and welfare caseload characteristics, were chosen and randomly allocated to either treatment (FIP) or control (AFDC) status (p.3). This project produced apparent results that surprised the researchers: increased utilization of welfare and reduced employment, whereas the intent of the reform was to reduce welfare use and increase employment. The researchers themselves do not put much weight on the possibility that these results spring from the failure of a comparison site method, but that possibility certainly is there.

The Alabama Avenues to Self-Sufficiency through Employment and Training Services (ASSETS) Demonstration uses a similar strategy for the selection of demonstration and comparison sites, except that only three pairs were chosen. The primary sampling unit was the county, and counties were matched on caseload characteristics and population size (Davis). Results from this study look questionable when compared to a similar study which was done with random assignment of individuals in San Diego, where the estimated reduction in food consumption following Food Stamp cashout was much smaller.

c. Problems of Spillovers, Crossovers and In and Out Migration

Where comparison communities are used, potential problems arise either because of structural features associated with physical proximity or because of the movement of individuals.

Often investigators have chosen communities in close physical proximity to the treatment community. This is justified on the grounds of helping to equalize regional influences. However, this proximity can cause problems. First, economic, political and social structures often create specialization of function within a given region: one area provides most of the manufacturing activities and the other the services, or one generates mostly single family dwellings while the other features multi-unit structures, one is dominated by Republicans and the other by Democrats, one captures the State Employment Services office and the other gets the State Police barracks. These can be subtle differences which can generate different patterns of evolution in the two communities. Second, spillover of services and people can occur from the treatment community to the comparison community, so the comparison community is "contaminated" - either positively, by obtaining some of the services or governance structure changes generated in the treatment community, or negatively, by the draining away of human and physical resources into the now more attractive treatment community.

Two features of the New School's CDC study make it less susceptible to these types of problems. First, the services being examined relate to housing benefits, which are not easily transferable to nonresidents. Second, the CDCs in the study were not newly established, so to a large extent it can be assumed that people had already made their housing choices based on the available information (though even these prior choices could create a selection bias of unknown and unmeasured degree).

An example where this spillover effect was more troublesome was in the evaluation of The School/Community Program for Sexual Risk Reduction Among Teens (Vincent, 1987). This was an education-oriented initiative targeted at reducing unwanted teen pregnancies. The demonstration area was designated as the western portion of a county in South Carolina, using school districts as its boundaries. One of the four comparison sites was simply the eastern portion of the same county. The other three comparison sites used were matched on socio-demographic characteristics.

The two halves of the county were matched extremely well on factors that might influence the outcome measures as the entire county was considered to be quite homogeneous (Vincent, p.3382). However, a good deal of the information in this initiative was to be disseminated through a media campaign, and the county shared one radio station and one newspaper. Moreover, some of the educational sites, such as certain churches and workplaces, served or employed individuals from both the western and eastern parts of the county (p.3386). Obviously, a comparison of the change in pregnancy rates between these two areas will not provide a pure estimate of program impact.

In-migration and out-migration of individuals are a constant feature in communities. At the treatment site, these might be considered "dilutions of the treatment". In-migration could be due to the increased attraction of services provided or it could just be a natural process which will weaken the homogeneity of community values and experiences. Out-migration means loss of some of the persons subject to the treatment. If one looks only at the stayers in the community there is a selection bias arising from both migration processes. One cannot be sure whether the program treatment itself influenced the extent and character of in and out migration.

d. Dose-Response

Evaluators might choose to analyze a situation like the South Carolina pregnancy prevention project from the perspective of dose-response effects. In other words, these areas could be viewed as three different groups: the western part of the county, the eastern part of the county, and the three noncontiguous comparison counties. Each of these received a different level of treatment: full, moderate, and little to none.

If one examines just the crude absolute changes in numbers, this theory seems to play out in a logical way. The comparison communities' estimated pregnancy rates (EPRs) stayed the same or increased, while the eastern portion's rate declined slightly and the western portion's rate was more than halved (p.3385). Of course these estimates should be accepted with caution given the general lack of statistical rigor (small sample size, failure to control statistically for even observed differences between communities).

Another example of dose-response methodology is an evaluation of a demonstration targeted at the prevention of alcohol problems (Casswell, 1989). Six cities were chosen and split into two groups on the basis of socio-demographic similarity. Within each group, the cities received treatments of varying intensity: one was exposed to both a media campaign and the services of a community organizer, one had only the media campaign, and the third had no treatment. In this way researchers could examine the effect of varying levels of intervention intensity to determine, for instance, whether there was an added benefit to having a community organizer available (or whether the real impact came from the media campaign). It should be noted, however, that random assignment of cities within groups had to be sacrificed in order to avoid possible spillover effects from the media campaign.

Most important, this procedure does not get around the underlying problem of comparison communities - the questionable validity of the assumption that once matched on a set of characteristics the communities would have evolved over time in essentially the same fashion with respect to the outcome variables of interest. If this assumption does not hold then the "dose of treatment" will be confounded in unknown ways with underlying differences among the communities, once again a type of selection bias.
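To make the dose-response logic concrete, the following minimal sketch (in Python, using invented numbers rather than the South Carolina data) shows how one might regress the pre-post change in an outcome on an ordinal dose level across communities. The caveat above applies in full: nothing in such a regression removes confounding between the dose a community received and the underlying differences among communities.

    # Minimal sketch of a dose-response contrast across communities.
    # The numbers below are illustrative only, not the South Carolina data.
    import pandas as pd
    import statsmodels.formula.api as smf

    sites = pd.DataFrame({
        "site":   ["west", "east", "comp1", "comp2", "comp3"],
        "dose":   [2, 1, 0, 0, 0],                # full, moderate, little-to-none
        "change": [-30.1, -4.2, 1.5, 0.0, 3.8],   # pre-post change in the outcome rate
    })

    # Regress the pre-post change on the ordinal dose level.  With so few
    # communities and no controls for underlying differences, the slope is
    # only descriptive; it does not remove the selection problems noted above.
    model = smf.ols("change ~ dose", data=sites).fit()
    print(model.params)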

e. Pre-Post, Using Communities

As was noted with respect to individuals, contrasting measurements before exposure to the treatment with measurements after exposure to the treatment is a method which has often been advocated. This procedure can also be applied with communities as the unit of analysis. The attraction of this approach is that the structural and historical conditions unique to this location which might affect the outcome variables are directly controlled for.

Often a pre-post design simply uses a single pre-period measurement of relevant variables as a baseline to compare with the post-treatment measure of the same variables. However, since it is recognized that communities as well as individuals change over time, it is usually argued that it is best to have multiple measures of the outcome variable in the pre-treatment period so as to allow an estimate of the trajectory of change of the variable. This procedure is often referred to as an "interrupted time-series", with the treatment taken to be the cause of the interruption (see for example McCleary and Riggs 1982).
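A minimal sketch of such an interrupted time-series estimate, assuming annual community-level measures, is given below; the data and variable names are hypothetical, and the specification (a level shift plus a trend shift at the intervention date) is only one of many that could be used.

    # Sketch of an interrupted time-series (segmented regression) estimate.
    # y[t] is a community-level outcome observed annually; the intervention
    # begins at year T0.  Illustrative, simulated data only.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    years = np.arange(1980, 1995)
    T0 = 1988
    rng = np.random.default_rng(0)
    y = 50 + 0.5 * (years - 1980) - 3.0 * (years >= T0) + rng.normal(0, 1, years.size)

    df = pd.DataFrame({
        "y": y,
        "time": years - years.min(),                  # secular trend
        "post": (years >= T0).astype(int),            # level shift at the intervention
        "time_since": np.clip(years - T0, 0, None),   # change in slope afterwards
    })

    # 'post' estimates the immediate level change attributed to the treatment;
    # 'time_since' estimates any change in trend after the interruption.
    fit = smf.ols("y ~ time + post + time_since", data=df).fit()
    print(fit.params)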

The better the researcher's ability to model the process of change in a given community over time, the stronger this approach will be. We discuss the evidence on the ability to model community change below. Note also that this approach depends on having time-series measures of the variables of interest at the community level and therefore runs into the problem, noted above, of the limited availability of small area data measuring variables consistently over many time periods; we are often limited to the decennial censuses for small area measurements.

The major problem with pre-post designs is that events other than the treatment - e.g. a plant closing, collapse of a transportation network, reorganization of health providers - impinge on the community during the post-treatment period and affect the outcome variable; these effects would be attributed to the treatment unless there is a strong statistical model which can take such exogenous events into account.

We have been unable to locate examples of community pre-post designs using time series - though we feel there must be some examples out there. The EOPP evaluation (Brown, et al. 1983) used, as one of its analysis models, a mixture of time series and comparison communities to estimate program impacts. The model had pre and post measures for both sets of communities and estimated the impact as the difference in percentage change, pre to post, between each treatment site and its comparison site.
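As we read the EOPP description, the impact estimate takes roughly the following form; the numbers in this small sketch are hypothetical and are meant only to show the mechanics of the difference-in-percentage-change calculation.

    # Sketch of the EOPP-style estimator: impact measured as the difference
    # in percentage change, pre to post, between a treatment community and
    # its matched comparison community.  Numbers are hypothetical.
    def pct_change(pre, post):
        return 100.0 * (post - pre) / pre

    treat_pre, treat_post = 0.40, 0.46   # e.g. employment rate in the treatment site
    comp_pre, comp_post = 0.42, 0.44     # the same measure in the comparison site

    impact = pct_change(treat_pre, treat_post) - pct_change(comp_pre, comp_post)
    print(f"estimated impact: {impact:.1f} point difference in percent change")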

For the Youth Fair Chance demonstration (Dynarski, et al. 1994), the proposed design will use both pre and post measures and comparison sites.

f. Methods of selecting comparison communities

The most common method for selecting comparison communities is to attempt to match areas on the basis of selected characteristics which are believed, or have been shown, to affect the outcome variables of interest. Usually, a mixture of statistical weighting and judgmental elements enters into the selection.

Often a first criterion is geographic proximity - same city, same metropolitan area, same state, same region - on the grounds that this will minimize differences in economic or social structures and changes in area wide exogenous forces.

Sometimes an attempt is made to match on service structure components in the pre-treatment period, e.g., similarities in health service provision.

Most important, usually, is the statistical matching on demographic characteristics. In carrying out such matching the major data source is the decennial Census, since this provides characteristic information even down to the block group level (a subdivision of Census tracts). Of course, the further the start of the initiative is from the year in which the Census was taken, the weaker this matching information will be. One study used 1970 Census data to match sites when the program was implemented at the very end of the decade, and later found the match to be quite flawed.

Since there are many characteristics on which to match, some method must be found for weighting the various characteristics. If one had a strong statistical model of the process that generates the outcomes of interest, then this estimated model would provide the best way to weight together the various characteristics measures. We are not aware of any case in which this has been done. Different schemes for weighting various characteristics measures have been advocated and used. A currently popular one is the Mahalanobis distance measure, mentioned above in the case of comparison groups constructed for individuals[12].
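For readers unfamiliar with the Mahalanobis measure, the following sketch shows one way nearest-neighbor selection of comparison areas might be carried out on a set of Census characteristics; the data are simulated and the procedure is illustrative rather than a description of any particular study.

    # Sketch of Mahalanobis nearest-neighbor selection of comparison areas.
    # X holds matching characteristics (rows = candidate comparison areas);
    # target holds the same characteristics for the treatment community.
    # All values are simulated for illustration.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 4))       # 50 candidate areas, 4 matching variables
    target = rng.normal(size=4)        # the treatment community's characteristics

    VI = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance of the candidates

    def mahalanobis(u, v, VI):
        d = u - v
        return float(np.sqrt(d @ VI @ d))

    distances = np.array([mahalanobis(x, target, VI) for x in X])
    best = distances.argsort()[:3]     # the three closest candidate areas
    print("closest candidate areas:", best, "distances:", distances[best].round(2))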

In a few cases there are time trend data at the small area level on the outcome variable which cover the pre-intervention period. For example, in recent years birth record data have become more consistently recorded and made publicly available, at least to the zip-code level. In some areas, AFDC and Food Stamp receipt data aggregated to the Census tract level are available. The Healthy Start evaluation proposes to attempt to match sites on the basis of trends in birth data.

g. A Reprise on Friedlander-Robins Findings

Having reviewed a variety of methods for finding and using comparison communities it may be worthwhile to look briefly at some results from the Friedlander-Robins studies which provide at least some idea of the possible relative magnitude of problems with several of the methods.

Recall that this study used data from a group of work-welfare studies. In the base studies themselves, random assignment of individuals was used to create control groups, but Friedlander and Robins drew the treatment group from one segment of the data and the comparison group from a different segment (thereby "undoing" the random assignment). It was then possible to compare the effects estimated by the treatment-comparison group combination with the "true effects" estimated from the random assignment estimates of treatment-control group differences (the same treatment group outcome is used in each difference estimate).

We reproduce here part of one table from their study:

Comparison of Experimental and Nonexperimental Estimates of the Effects of
Employment and Training Programs on Employment Status

                                          Comparison Group Specification

                                    Across Site/   Across Site/   Within Site/    Within Site/
                                     Un-Matched      Matched      Across Cohort   Across Office

All Pairs of Experimental and
Nonexperimental Estimates

Number of Pairs                          96             24              24              16

Mean Experimental Estimate             .056           .056            .045            .069

Mean Absolute Experimental-
Nonexperimental Difference             .090           .076            .034            .044

Percent with Different Inference        47%            38%             29%             13%

Percent with Statistically
Significant Difference (10% Level)      70%            58%              4%             31%


(Friedlander, Daniel and Philip K. Robins, "Estimating the Effect of Employment and Training: An Assessment of Some Nonexperimental Techniques," Table 3, p.13)

The data are drawn from four experiments carried out in the 1980s (Arkansas, Baltimore, San Diego, Virginia). The outcome variable is whether employed (the employment rate ranged from a low of .265 in Arkansas to a high of .517 in Baltimore). Across the top of the table there is a brief description of how the comparison group was constructed using four different schemes for construction.

In the first two columns, the two across-site methods use the treatment group from one site, e.g. Baltimore, with the control group from another site, e.g. San Diego, serving as the comparison group. In the second column the term "matched" indicates that each member of the treatment group was matched with a member of the comparison group using the Mahalanobis "nearest neighbor" method, and the impact estimates were then made as the difference between the treatment group and the matched comparison group. In the first column no such member-by-member match was done; however, the regression equation in which the impact estimates are made includes variables for individual characteristics, and this controls for measured differences in characteristics between the two groups.

The within-site/across-cohort comparison in column three builds on the fact that the samples at each site were enrolled over a fairly long time period; it was therefore possible to split the sample into two parts: those enrolled before a given date - the "early cohort" - and those enrolled after that date - the "late cohort". The treatment group from the "late cohort" is used with the control group from the "early cohort" as its comparison group. This approximates a pre-post design for a study.

Finally, in column four, for two of the sites the work-welfare program was implemented through several local offices. It was possible, therefore, to use the treatment group from one office with the control group from the other office as a comparison group. This procedure approximates a matching of communities in near proximity to each other.

The first row of the table gives the number of pairs tested. This is determined by the number of sites, the number of outcomes (the employment outcomes at two different post-enrollment dates were used), the number of sub-groups (broken down by whether AFDC applicants or AFDC current recipients). The number of pairs gets large because each site can be paired with each of the three other sites. The smaller number of pairs in the within-site/across-office occurs because there were only two sites with multiple offices.

The next row gives the means of the experimental estimates, i.e., the "true impact estimates" from the original studies of randomly assigned treatment-control differentials. Thus, for example, averaged across all four sites, the experimental estimate of the treatment-control difference in employment rates was 5.6 percentage points.

The next row compares the estimates using the constructed comparison groups to the "true impact" (experimental) estimates, averaged across all pairs. For example, the mean absolute difference between the "true impact" estimates and those obtained by the across-site/unmatched constructed comparison groups was .090; that is, the difference between the two sets of estimates was on average more than 1.5 times the size of the "true impact"!

The next row tells the percentage of the pairs in which the constructed comparison group estimates yielded a different statistical inference than the "true impact" estimates. A different statistical inference occurs when only one of the two impact estimates is statistically significant or both are statistically significant but with opposite signs. A 10 percent level of statistical significance was used.

The fifth row indicates the percent of the pairs in which the estimated impacts are statistically significantly different from each other.

For our purposes, we focus on rows three and four. Row three tells us that under every method of constructing comparison groups the constructed comparison group estimates (called non-experimental in the table) differ from the "true impact" estimates, on average, by more than 50 percent of the magnitude of the "true impact".

Row four tells us that in a substantial number of cases the constructed comparison group results led to a different inference: the "true impact" estimates indicated that the program had a statistically significant effect on the employment rate while the constructed comparison group estimates indicated that it had no impact, or vice versa; or one indicated that the impact was to increase the employment rate at a statistically significant level while the other indicated that it decreased the employment rate at a statistically significant level.

Now we focus more closely on columns three and four because these are the types of comparisons that are likely to be most relevant for community-wide initiatives: as already noted, the within-site/across-cohort comparison approximates a pre-post design in a single community and the within-site/across-office comparison approximates a close-neighborhood-as-comparison-group design.

It appears that these designs are better than the across-site designs: as indicated in row three, the absolute difference between the "true impact" and the constructed comparison group estimates is much smaller, and is smaller than the size of the true impact itself. However, the difference is still over 50 percent of the size of the "true impact". The magnitude of the difference matters if one is carrying out a benefit-cost analysis of the program. A 4.5 percentage point difference in employment rates might not be sufficiently large to justify the costs of the program, but a 7.9 percentage point difference might make the benefit-cost ratio look very favorable. In that case a benefit-cost analysis using the average "true impact" would have led to the conclusion that the social benefits of the program do not justify its costs, whereas the average constructed comparison group impact (assuming the error was a positive .034) would have led to the erroneous conclusion that the program did provide social benefits which justify its costs.

When we move to row four we have to be a bit more careful in interpreting the results because the sample sizes for the column three and four estimates are considerably smaller than those for the column one and two cases. For example, the entire treatment group is used in each pair in columns one and two but only half the treatment group is used in columns three and four. Small sample size makes it more likely that both the random assignment estimates and the constructed comparison group estimates will be found to be statistically insignificant. Thus, inherently, the percent with different statistical inference should be smaller in columns three and four. Even so, for the within-site/across-cohort estimates, in nearly 30 percent of the pairs the constructed comparison group would lead to a different - and therefore erroneous - inference about the impact of the program. For the within-site/across-office estimates, 13 percent led to a different statistical inference. Is this a tolerable risk of erroneous inference? We would not think so, but others may feel otherwise.

There are a couple of additional points about the data from this study which should be borne in mind. First, this is just one set of data analyzed for a single, relatively well-understood outcome measure: whether employed or not. There is no guarantee that the conclusions about the relative strength of alternative methods of constructing comparison groups found with these data would hold up for other outcome measures. Second, in the underlying work-welfare studies the population from which both treatment group members and control group members were drawn was very much the same, i.e., applicants for or recipients of AFDC. Therefore, even when constructing comparison groups across sites one is assured that one has already selected persons whose employment situation is so poor that they needed to apply for welfare. In community-wide initiatives, the population being dealt with would be far more heterogeneous. There would be a far wider range of unmeasured characteristics which could affect the outcomes and, therefore, the adequacy of statistical controls (matching or modeling) in assuring comparability of the treatment and constructed comparison groups could be much less.


Statistical Modeling of Community Level Outcomes

One approach that has been tried in the evaluation of the effects of programs measured at the level of communities or larger is statistical modeling of community level outcomes. In these cases, the procedure is to use past data on the outcome variable to estimate a statistical model and then use that model to generate a form of counterfactual - what would have happened to that outcome at the community level had the program not been instituted. The measured outcomes in the program community are then compared to the values predicted from the statistical model in order to assess the impact of the program on the outcome of interest.

1. Time-series Modeling

Time series models of community level outcomes have long been advocated as a means of assessing the effects of program innovations or reforms[13]. In the simplest form, the time-series on the past values of the outcome variable for the community is linearly extrapolated to provide a predicted value for the outcome during and after the period of the program intervention.

In a sense, the pre-post designs discussed above are a simple form of this type of procedure. It has been recognized for a long time that the simple extrapolation design is quite vulnerable to error because even in the absence of any intervention community variables rarely evolve in a simple linear fashion.

Some attempts have been made to improve on the simple linear form by introducing some of the more formal methods of time-series modeling[14]. Introducing non-linearities in the form can allow for more complex reactions to the program intervention (McCleary and Riggs 1982). Another attempt uses pre-program measures of cohorts as a time-series of comparison group values used with the in-program treatment measures for previous cohorts (McConnell, 1982).
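The flavor of these more formal approaches can be conveyed with a minimal sketch in which a simple ARIMA model is fit to the pre-program series and projected forward as the counterfactual for the program period; the series is simulated and the model order is chosen arbitrarily for illustration.

    # Sketch: fit a simple ARIMA model to the pre-program series and project
    # it forward as a counterfactual for the program period.  Simulated data.
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(2)
    pre = 100 + np.cumsum(rng.normal(0.5, 1.0, 40))                 # 40 pre-program observations
    observed_post = pre[-1] + np.cumsum(rng.normal(-0.5, 1.0, 8))   # 8 program-period observations

    model = ARIMA(pre, order=(1, 1, 0)).fit()
    forecast = model.get_forecast(steps=8)
    counterfactual = forecast.predicted_mean
    impact = observed_post - counterfactual     # naive period-by-period impact estimate
    print(impact.round(2))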

The problem with these methods is that they do not explicitly control for variables other than the program intervention which may have influenced the outcome variable.

2. Multi-variate Statistical Modeling

Some attempts have been made to estimate multi-variate models of community level outcome variables in order to generate counterfactuals for program evaluation[15]. We have not been able to find examples of such attempts at the community level, but there are several examples of attempts to estimate caseload models for programs (such as AFDC and Food Stamps) at the national and state level (Grossman 1985, Beebout and Grossman 1985, Garsky 1990, Garsky and Barnow 1992, Mathematica Policy Research..Puerto Rico: Volume II 1985). Most analysts have considered the results of these models to be unreliable for program evaluation purposes. For example, effects of changes in the low wage labor market appear to have swamped the effects controlled for in models of the AFDC caseload in New Jersey, leading to implausible estimates of the effects of an AFDC reform in that state.

Note that these models would have to separate out variables that are likely to affect the outcome variable but which would not themselves be affected by the program intervention, and then to measure those variables in the intervention community during the course of the program and/or post-program period. For example, as noted in the New Jersey case, one would have to obtain good measures of how the demand for low wage labor is affected at the level of the community in order to estimate the statistical model, and then obtain measures of those variables during the period of the program for that community and use them in the statistical model to generate the counterfactual. Recall in the examples discussed above how comparison communities in EOPP were affected by floods, hurricanes and volcanic eruptions, or how in YIEPP court-ordered school desegregation occurred in the comparison community. Adequate statistical modeling would have to attempt to incorporate such factors.
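A skeletal version of the kind of model described here might look as follows; the variables, years and values are hypothetical, and in practice the difficulty lies in obtaining and believing measures of the exogenous covariates, not in the mechanics of the estimation.

    # Sketch of a multivariate counterfactual model: the community-level
    # outcome is regressed on exogenous covariates over the pre-program
    # years, and the fitted model is then applied to program-period
    # covariates to predict what would have happened without the
    # intervention.  Variable names and data are hypothetical.
    import pandas as pd
    import statsmodels.formula.api as smf

    hist = pd.DataFrame({            # pre-program years
        "caseload":     [410, 432, 455, 470, 500, 520],
        "unemp_rate":   [5.1, 5.6, 6.0, 6.2, 6.8, 7.1],
        "low_wage_emp": [21.0, 20.5, 20.1, 19.8, 19.2, 18.9],
    })
    prog = pd.DataFrame({            # program-period covariates, measured on-site
        "unemp_rate":   [7.4, 7.0],
        "low_wage_emp": [18.5, 18.8],
    })

    fit = smf.ols("caseload ~ unemp_rate + low_wage_emp", data=hist).fit()
    counterfactual = fit.predict(prog)   # predicted caseload absent the intervention
    print(counterfactual.round(1))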

Statistical modeling at the community level also runs up against the problem of the limited availability of small area data, particularly provided on a consistent basis over several periods of time or across numerous communities. Such data are necessary both for the estimation of the statistical model of the community level outcome and for the projection of the counterfactual value of the outcome for the program period, e.g. if the model includes local employment levels as affecting the outcome then data on local employment during the program period must be available to use in the model.

A general problem in using this approach, which we will return to in the concluding section, is that there has been so little quantitative study of community level data. The development of good statistical models at the community level will require more extensive efforts to bring together community level data and to understand what factors influence how communities evolve.

Types of Hypotheses Which Could Be Tested

In this section we outline the types of hypotheses with respect to community wide initiatives which could be tested in various types of evaluative situations. We will discuss hypotheses in broad generic terms before considering some more directly tied to possible community-wide initiative concerns.

It is important to be clear at the outset of this section that for the most part we are discussing hypothesis testing where random assignment has been possible. Thus we take as background the fundamental problems of developing community level counterfactuals discussed above; here we discuss the added problems which arise according to the type of hypothesis to be tested. In a few places we remind the reader about the more fundamental community selection bias problems, but we want to avoid continually repeating that this is an underlying problem.

a. Single outcome from a single treatment

The situation which most closely follows the classical experimental design is one in which there is a single outcome variable which is hypothesized to be affected by a single simple treatment. For example, the birth weight of children is hypothesized to be affected by the provision of a guaranteed minimum level of cash income to the pregnant mother. (We use this example because, in fact, that was one of the unforeseen outcomes in one of the Negative Income Tax experiments in the 1970s; see Keherer and Wohlin ...). The outcome, birth-weight, is easy to measure and relatively well monitored. The treatment is about as straightforward as any we can think of, though even in this case there can be numerous complications of definition and implementation. If this is not a community-wide initiative, then one could use random assignment of individual women to receive the guarantee or not, and thereby create a treatment group and a control group. If the guarantee were made community-wide then one would face all the problems of creating an adequate constructed comparison group which have been outlined above.

Note that we have no complex theory here of the mechanisms through which the treatment affects the outcome; this is the "black box" approach. We could hypothesize some simple mechanisms through which the treatment might operate and then seek to monitor related processes, e.g., the mothers use the better income to improve their diet, but we do not test these mechanisms directly. For example, an alternative mechanism might be the reduced stress on the mothers due to the removal of uncertainty about income.

It is usually straightforward to extend this type of situation to the case where there are hypothesized to be multiple outcomes affected by the single treatment. For example, it could be hypothesized that the guaranteed income would improve not only the birthweight of newborns but also the school performance of school age children in the household (another largely ignored outcome in several of the Negative Income Tax experiments). Most of the work-welfare experiments hypothesized, and measured, effects on both employment and receipt of welfare.

b.Single outcomes from multiple treatments

Single outcomes could be affected by different structures of treatments. Multiple treatments can be generated by systematically varying parameters of a single type of treatment. An example of this is the National Health Insurance experiment. Health insurance was the type of treatment and the parameters which were systematically varied were the levels of deductibles and co-payments (as well as assignment to an HMO). The major outcome of interest was expenditures on medical care. Random assignment of individuals to insurance plans with different values for these parameters permitted tests of the independent effects of co-payments and deductibles. Critical here is the systematic structuring and variation of the parameters of interest; sufficient independent variation of deductibles and co-payments among the groups of individuals was necessary in order to estimate the effects of varying one while holding the other constant. There was no "null treatment" in this case; everyone had health insurance, and the groups varied in the level of out-of-pocket cost per unit of utilization of medical services.
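The logic of independently varied plan parameters can be sketched as follows: with random assignment, a single equation containing both parameters estimates the effect of each while holding the other constant. The data and parameter values below are invented for illustration and do not come from the experiment.

    # Sketch: with plan parameters varied independently across randomly
    # assigned groups, a regression containing both parameters estimates
    # the effect of each while holding the other constant.  Simulated data.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    n = 2000
    deductible = rng.choice([0, 150, 450], size=n)       # hypothetical plan parameters
    copay_rate = rng.choice([0.0, 0.25, 0.50], size=n)
    spending = 900 - 0.4 * deductible - 600 * copay_rate + rng.normal(0, 300, n)

    df = pd.DataFrame({"spending": spending, "deductible": deductible,
                       "copay_rate": copay_rate})
    fit = smf.ols("spending ~ deductible + copay_rate", data=df).fit()
    print(fit.params)   # independent effects of each parameter on spending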

Again, it was also possible to test hypotheses with respect to more than one outcome. In this example, researchers could test the effects of the treatment parameter variations on the types of medical services utilized, e.g. hospital or outpatient care, medication, tests, specific procedures. Interestingly, health status was not, originally, a principal outcome hypothesized to be affected, and, indeed, the original design of the experiment had no provision for attempting to measure health status. In the end, however, extensive and innovative research was done to develop better measures of health status and a few effects of the health insurance parameters on health status were detected (see Newhouse, et al. 1993).

c.Internal treatment differences: length of stay, participants and non-participants

One of the aspects of formal evaluations adhering to rigorous standards of random assignment which has been most vexing to operators of programs being evaluated, and to policy-makers who seek guidance from evaluations, is the limitation on testing hypotheses about the effects of differences in exposure to the treatment within the group which was randomly assigned to the treatment category - as opposed to the control group. We call these "internal treatment differences".

Of greatest common concern are differences in outcomes between those in the treatment group who actually participate in the program and receive services and those who do not participate. It appears sensible to most persons to evaluate a hypothesis about the effects of the treatment by seeing what happens to those who actually received services. The rigorous inference standards, however, require that the non-participants be included with the participants as part of the treatment group and compared to the control group in order to estimate the effects of the treatment on the outcome of interest. Naturally, to many it appears that the treatment is "diluted" by the inclusion of non-participants. The reason that rigorous inference standards call for this procedure is the same problem we have been discussing throughout this paper: how does one find an appropriate counterfactual?

The problem in this case is to isolate the appropriate subset of the control group who would have participated had they been offered the treatment. Once again, the difficulty is with selection bias caused by unmeasured variables which affect both participation and the outcome. To repeat from the earlier discussion: the unmeasured effects could go either way. Perhaps those who participated are better motivated; or, on the contrary, perhaps they are those who had fewer alternative opportunities or less "gumption" to get out and "do it on their own". Or the selection may have been generated by aspects of the treatment, such as the decisions of the program operators to make greater efforts to enroll the "better candidates" or, on the contrary, to assure receipt of services by those "most in need".

In certain limited circumstances, it is possible to estimate differences between participants and non-participants, but it requires one very strong assumption: that the process that resulted in the offer of the opportunity to receive the treatment did not itself generate behavior that affects the outcome of interest. If that assumption holds, then it can be assumed that the proportion of non-participants and their average outcome would be the same in both the control and the treatment group; the average outcome observed for the non-participants in the treatment group can then be attributed to the same proportion of the control group and subtracted from the average outcome for the whole control group (with the variance of the estimates appropriately corrected)[16]. As an example of why this strong assumption might not hold, consider some of the work-welfare programs which have been based on mandatory entrance into the program process. After random assignment, those assigned to the treatment group faced the threat of sanctions - reduction or elimination of welfare payments - if they did not participate in given program activities. Those who do not participate undoubtedly make greater efforts to find alternatives to welfare than would their equivalents in the control group, who do not face the threat of such sanctions.
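Under that strong assumption, the adjustment just described can be sketched as follows; the numbers are hypothetical, and the variance correction mentioned in the text is omitted.

    # Sketch of the participant/non-participant adjustment described above,
    # under the strong assumption that non-participants' share and mean
    # outcome are the same in the treatment and control groups.  Numbers
    # are hypothetical; standard errors would also need to be corrected.
    p = 0.70            # participation rate in the treatment group
    y_t_part = 0.52     # mean outcome, treatment-group participants
    y_t_nonpart = 0.40  # mean outcome, treatment-group non-participants
    y_c_all = 0.42      # mean outcome, whole control group

    # Infer the mean outcome of the control-group members who would have
    # participated, by subtracting the assumed non-participant component.
    y_c_part = (y_c_all - (1 - p) * y_t_nonpart) / p

    impact_on_participants = y_t_part - y_c_part
    print(round(impact_on_participants, 3))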

In addition to the problems of rigorous statistical inference, another consideration leads to an argument that one should estimate impacts with both non-participants and participants included as the treatment group. Usually we are interested in how a given program, which is the treatment, will affect a given population, and part of their response to the program is the decision to participate, to take up the services offered. Unless we can somehow force people to utilize the services, the estimates of the program's impact should take into account the likely nonparticipation. However, as some have emphasized[17], it may be important to try to understand better the factors that affect program participation decisions. These decisions are undoubtedly affected to some degree by the way the program is presented, by the way it is implemented, and by the "reputation" it develops. To the degree that these aspects can be controlled through policy, modifications of the program could induce different levels of participation in different groups. Better understanding of participation decisions could also lead to more effective quantitative modeling of participation decisions. Such models, if highly effective, might permit use of better "selection bias correction" methods similar to those outlined above.

The types of problems created by the participation phenomenon carry over into the issue of the effects of length of stay or length of exposure. Here also program operators and policy-makers argue for estimates of differences in impacts associated with a longer exposure to the treatment: "don't those who stay longer get more services and therefore do better?" Once again, standards of rigorous inference preclude sound estimates of the effects of length of stay; how do we separate out those in the control group who would have stayed longer so they may be compared to the long stayers in the treatment group? Why do some members of the treatment group stay longer?

We can use their measured characteristics to "match" them with control group members but once again it is the unmeasured variables which may bias the estimates: motivation operating either for or against long staying, program operator actions to encourage particular individuals to stay or to leave early. Once again, one can try to model quantitatively the determinants of length of stay and correct for bias but the chances for success in this are quite small. Also, from the policy perspective, estimates of the effects of length of stay would only be useful to the extent that one had the policy instruments which would cause individuals to stay longer[18].

These same problems would arise with other forms of internal treatment differences such as choice of different training streams within a given training program. Since the choices occur after random assignment, unmeasured variables may influence how the choices are made and would bias any estimates derived from differences between treatments and controls[19].

d.Interaction Effects

In this section we will discuss several types of interaction effects: those between characteristics of participants and the treatment, those among various dimensions of one treatment or multiple types of treatments, and those among different types of participants (or institutions). It seems evident that arguments for community-wide interventions are based on assumptions about the importance of several of these types of interaction effects. Obtaining estimates of some of these interactions is relatively straightforward within a context in which random assignment is possible[20]. We will discuss these first.

i.Subgroups

The groups involved in the intervention study can be divided in a variety of ways into sub-groups based on their pre-program characteristics, e.g., ethnicity, level of education, gender. Hypotheses regarding differences in the effects of the treatment can be tested by estimating separately for each group the difference in the outcome variable between those in the treatment group and those in the control group. The important differentiation between this and the just previously discussed situation is that the characteristics defining the subgroups were determined prior to entrance into the treatment or control group (they are exogenous to the determination of treatment status), so that there is no opportunity for selection on unmeasured variables. The only problem in this case is the reduction in sample size as the total study sample is broken down into smaller subgroups. The smaller the sample size, the bigger must be the impact on the outcome variable to pass the test for statistical significance; there is a greater chance that even though there is a sizeable difference in the outcome between treatment group members and control group members it will be judged to be statistically insignificant.
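The sample-size point can be made concrete with a rough minimum detectable effect calculation; the sketch below assumes a two-sided 10 percent significance level, 80 percent power, equal treatment and control group sizes, and a binary outcome with a rate near .4 - all assumptions of ours, for illustration only.

    # Sketch: how the minimum detectable effect (MDE) on a binary outcome
    # grows as the sample is split into smaller subgroups.  Assumes a
    # two-sided 10% test, 80% power, equal arms, outcome rate near 0.4.
    import numpy as np
    from scipy.stats import norm

    def mde(n_per_arm, p=0.4, alpha=0.10, power=0.80):
        se = np.sqrt(2 * p * (1 - p) / n_per_arm)           # SE of a difference in proportions
        return (norm.ppf(1 - alpha / 2) + norm.ppf(power)) * se

    for n in [2000, 500, 125]:    # full sample, then successively smaller subgroups
        print(f"n per arm = {n:5d}  MDE = {mde(n):.3f}")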

ii.Treatment interactions

Often there are multiple dimensions to a treatment, or there is more than one type of treatment being administered. For example, a training program might also provide a sex education program; the outcome of interest is the number of births, and the interaction to be tested is whether the combination of training and sex education has a greater effect in reducing births than either program taken alone. What is required to estimate these types of effects is random assignment to different treatment combinations, e.g., training alone, training plus sex education, and sex education alone[21]. Estimates of the interaction effects can be made by comparing effects in each group separately (or equivalently in an estimating equation including both linear and interaction terms). Once again, the major problem is assuring that there is sufficient sample size in the various groups to have statistical power adequate to find statistically significant interaction effects of the size relevant for the investigator's interest (policy or scientific).
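A minimal sketch of such an interaction test follows; it assumes random assignment to the four cells of a simple factorial design (including a no-treatment cell, which is only one of several possible configurations), and the data are simulated.

    # Sketch of testing a treatment interaction: individuals are randomly
    # assigned to training only, sex education only, both, or neither, and
    # the outcome is regressed on both treatment indicators and their
    # interaction.  Simulated data, illustrative coefficients.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(4)
    n = 4000
    training = rng.integers(0, 2, n)
    sexed = rng.integers(0, 2, n)
    births = (0.30 - 0.02 * training - 0.03 * sexed
              - 0.04 * training * sexed + rng.normal(0, 0.1, n))

    df = pd.DataFrame({"births": births, "training": training, "sexed": sexed})
    fit = smf.ols("births ~ training * sexed", data=df).fit()
    print(fit.params)   # 'training:sexed' is the estimated interaction effect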

iii. Group interactions

Now we consider the type of interaction most relevant to community-wide initiatives: those involving interactions among individuals, and between individuals and institutions, which modify the impact of the treatment. Brown and Richman illustrate this concern: "Too often in the past, narrowly defined interventions have not produced long-term change because they have failed to recognize the interaction among physical, economic and social factors that create the context in which the intervention may thrive or flounder"[22].

Commentators have classified such interactions in a variety of ways: contagion or epidemic effects, social capital, neighborhood effects, externalities and social comparison effects (some of these sometimes treated as sub-categories of others). We have not taken the time to carefully catalog and reorder these classifications (though such an analysis might help with an orderly development of evaluation research). We simply give examples of some broad categories of group interactions which might be of concern to evaluators of community initiatives.

iv. Networks and group learning

The importance of associational networks has been increasingly emphasized in the literature on communities and families. The general idea that response to an intervention can be conditioned by the associational networks stems most simply from the idea that information about the form of the intervention, how it treats individuals in various circumstances, is likely to be passed from individual to individual and as a result the group learning about the intervention is likely to be faster and greater than would be the learning of the isolated individual. This in turn may condition the individual response to the intervention. Different network structures would induce different degrees of group learning and therefore different responses.

Stronger forms of interaction from networks would fall under what some have called "norm formation"[23]. Here one might mean both the way pre-existing norms either impede or facilitate response to the intervention or the way group learning in response to the intervention reshapes group norms. For example, the existence of "gang cultures" may impede interventions or some interventions may seek to reshape the norms of the "gang culture" to cause them to facilitate other aspects of the intervention.

Finally, some interventions may seek to operate directly on networks, having social network change as either an intermediate or final outcome of interest.

The evaluation problems will differ depending on how these associational networks are considered. For example, suppose the objective is to test how different associational networks affect response to a given intervention. If networks are measured and classified prior to the intervention, then individuals could be broken into different sub-groups according to network type, and subgroup effects could be analyzed in the usual manner described above (the network category is exogenous to the intervention).

To the extent that the character of associational networks is an outcome variable (intermediate or final) of interest, it can be measured and the impact of the intervention upon it analyzed in the same fashion as for other outcome variables; measurements of the associational networks of those in the treatment group can be compared to the networks of those in the control group. Here the problems are primarily those associated with the reliability and consistency of measures of associational networks and with the properties of those measures (what is their normal variance, how sensitive are they likely to be to impact from the intervention) as they relate to the adequacy of the sample design for evaluation.

Notice that the previous paragraphs take the network to be something that can be treated as a characteristic of the individual and the individual the unit of analysis. These analyses could be carried out even when there is not a community-wide intervention. Most would argue, however, that the group learning effects are really most important when groups of people, all subject to the intervention, interact. When this is the case, we are immediately thrown into the problems covered above under the discussion of using communities for constructed comparison groups; since random assignment of individuals to the treatment or control group is precluded when one wishes to have groups of individuals potentially in the same network treated, testing for this form of interaction effect will be subject to the same problems of selection bias outlined above.

v.Interactions through formal and informal institutions

Most interventions take the form of alteration of some type of formal institution that affects the individuals: a day care center, a welfare payment, an education course. A given broad type of intervention can be delivered through different types of formal structures - e.g. income support can come through a cash payment or an in-kind (food stamps) payment. The interactions of formal institutions with broad treatments of this type have been evaluated - e.g. in studies of food stamp cash-outs. However, most of those concerned with community-wide initiatives appear to be more interested either in the way the formal institutional structure in a given community conditions individuals' responses or in the access to, or behavior of, the formal institutions themselves as outcomes of the intervention.

With respect to the former concern, some studies seek to have the formal institutional structure as one of the criterion variables by which communities are matched and thus seek to neutralize the impact of interactions of formal institutions and the treatment. Both the Healthy Start and the School Dropout studies have already been mentioned as examples in which matching formal institutional structures is a concern in the selection of comparison sites, and we have already noted the problems of measurement and the limits of the statistical gains from such attempted matches.

With respect to access to or behavior of formal institutions as outcomes there are different problems. First, access as an outcome variable might be relatively straightforward to measure, and it may be easy to estimate the effects of the intervention on it, e.g., the number of doctor visits by pregnant women, or participation in bilingual education programs. With respect to the behavior of institutions, the question arises as to whether the institution itself is the primary unit of analysis. When the institution itself is the unit of analysis then we must face anew all the aspects of sample design if we wish to use formal statistical inference concerning the behavior of an institution: what is the measure of behavior of interest; what is its normal variance; how many units can be subject to the intervention treatment; can we do random assignment or must we rely on constructed comparison groups; can sufficient sample size be attained; and, more deeply, is the underlying behavior of the institution generated by a common stable process?

Informal institutions are also subjects of interest. The associational networks discussed above are surely examples, as are gangs. But there are informal economic structures which also fall in this category. The labor market is an informal institution whose operations interact with the intervention and condition its impact. This can be most concretely illustrated by reference to a problem sometimes discussed in the literature on employment and training programs: "displacement". The basic idea is that workers trained by a program may enter the labor market and become employed, but if there is already involuntary unemployment in the relevant labor market, total employment may not be increased because the trained worker simply "displaces" a worker who would have been employed in that job had the newly trained worker not shown up[24]. An evaluation with a number of randomly assigned treatment and control group members which is small relative to the size of the relevant labor market would be unable to detect these "displacement" effects if they did occur, because their numbers are too small relative to the size of the market; the trained treatment group member is not likely to show up at exactly the same employer as the control group member would have. It has been argued by some that use of community-wide interventions in employment and training would provide an opportunity to measure the extent of such "displacement effects" because the size of the intervention would be large relative to the size of the local labor market; indeed, one of the hopes for the YIEPP was that it would provide such an opportunity. But as the experience with YIEPP, described above, illustrates, the use of comparison communities called for in this approach is subject to a number of serious pitfalls[25].

vi. Interactions with external conditions

Some attempts have been made to see how changes in conditions external to an intervention, experienced commonly by the treatment and control group members, have conditioned the response to the treatment. For example, in the National Supported Work Demonstration attempts were made to see if the response to the treatment (supported work) varied systematically with the level of local unemployment. In this case there were no statistically significant differences in response, but researchers felt this may well have been due to the weakness of statistics on the city-by-city unemployment rate.

e.Dynamics

Evaluations based on formal statistical methods have, to our knowledge, attempted to deal directly with issues of dynamics in very partial and limited ways. Usually we use the term dynamics to apply to the time dimension of either the treatment or the response.

The classical experimental paradigm calls for a well identified treatment applied consistently to all the members of the treatment group. We recognize that there are dynamic aspects of most treatment implementations and often suggest that evaluations not begin their observational measurement during the initial period of program build-up because it is felt that the treatment regime is not yet stabilized. More realistically, however, we recognize that, for nearly all social interventions, treatment regimes are really not stable and consistent. Perhaps some of the cash transfers (the negative income tax) approximated this condition since the rules for transfer amount determination remained constant over time, but most interventions are delivered through some administrative structure, and these administrative structures evolve and change over time for a whole host of reasons of their own. Thus the best we can do in most cases is to say: where there is random assignment there is a control subject for each treatment subject, and whatever was happening on average to the treatment subjects under the broad general conceptual treatment, e.g. training, we can measure its effects relative to the controls. Discrete sequential experiments could be planned a priori where sequences are dependent on prior stage outcomes[26]. How to evaluate the dynamics of changes in treatment in response to learning about the implementation of the treatment, where the alterations in treatment are largely a matter of local response, remains, to our knowledge, a largely ignored problem[27].

There have been a few attempts to measure dynamics in the response to treatments. Many employment and training programs have carried out post-program measurements at several points in time in order to attempt to measure the time path of treatment effects. These time paths are important for the overall cost-benefit analyses of these programs because the length of time over which benefits are in fact realized can greatly influence the balance of benefits and costs; the rate of decay of benefits has been an important issue in cost-benefit analyses of training and employment programs. Studies have shown both cases in which impacts appear in the early post-program period and then fade out quickly thereafter (as is often claimed about the effects of Headstart) and cases in which no impacts are found immediately post-program but emerge many months later (e.g. in the evaluation of the Job Corps).

At the other end, some of the attempts to improve upon evaluation of education, training and employment programs have tried to use estimates of the pre-program time path of the variable to be used as an outcome and to attempt to assure that comparison groups adequately "match" in terms of such time paths[28]. Of course, the interrupted time series design discussed above deals with dynamics in this sense.

We do not know of any attempts to trace with rigorous statistical methods dynamic patterns of response and feedback effects operating over time through interactions of individuals with each other or with institutions, nor of any suggestions of what methods might be employed to do so.


Steps in Development of Better Methods

There are no strong recommendations which we can make concerning how best to approach the problem of evaluation of community-wide initiatives. In situations where random assignment of individuals to treatment and control groups is precluded, there is no surefire method for assuring that the evaluation will not be subject to problems of selection bias when constructed comparison groups - whether individuals or communities - must be used to create the counterfactual, and pre-post designs remain vulnerable to exogenous shifts in the context which may affect outcome variables in unpredictable (and often undetectable) directions. As of now, we do not see clear indications of what second-best methods might be recommended, nor have we identified particular situations which make a given method particularly vulnerable.

It is important to stress, once again, that the vulnerability to bias in estimation of the impacts of interventions should not be taken lightly. First, the few existing studies of the problem show that the magnitude of errors in inference can be quite substantial even when the most sophisticated methods are used. Second, the bias can be in either direction: we may be led to conclude that an intervention has had what we consider to be positive impacts when in fact it had none, or we may be confronted with impact estimates which indicate, due to bias, that the intervention was actually harmful. We may thus be misled either to promote policies which in fact use up resources and provide few benefits, or to discard types of interventions as unsuccessful which actually have underlying merit. Once such biased quantitative findings are in the public domain, it is very hard to get them dismissed, or to prevent them from influencing policy decisions, even when we have strong intuition that they are biased.

Beyond these rather dismal conclusions and admonitions, the best we can suggest at this time are some steps which might be taken to improve our potential for understanding how communities evolve over time and hope that that better understanding will help us to create methods of evaluation which are less vulnerable to the types of bias we have pointed out.

a.Improve Small Area Data

We have stressed at several points that detailed small area demographic data are very hard to come by except at the time of the decennial census. Increasingly, however, records data are being developed by a wide variety of entities which can be tied to specific geographic areas (geo-coded data). One type of work which might be fruitfully pursued is to combine various types of records data with two or more Censuses to try to develop models in which the trends in the Census data can be related to the time-series of the records data[29]. Cross-section correlations of base period records data with Census variables could be combined with the time-series of the records data to see how well they could predict the end period Census demographics for given small geographic areas.
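A stylized version of this exercise might look as follows; the small-area variables and records-based trend measures are entirely simulated and hypothetical, and the point is only the form of the prediction test, not any particular specification.

    # Sketch of the exercise suggested above: relate end-period census values
    # for small areas to base-period census values plus trends computed from
    # geo-coded records data.  Fully simulated, hypothetical variables.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(5)
    n_areas = 300
    poverty_1980 = rng.uniform(5, 40, n_areas)
    births_trend = rng.normal(0, 1, n_areas)     # trend from geo-coded birth records
    ui_trend = rng.normal(0, 1, n_areas)         # trend from UI wage records
    poverty_1990 = (0.8 * poverty_1980 + 2.0 * births_trend
                    - 1.5 * ui_trend + rng.normal(0, 3, n_areas))

    areas = pd.DataFrame({"poverty_1980": poverty_1980, "poverty_1990": poverty_1990,
                          "births_trend": births_trend, "ui_trend": ui_trend})
    fit = smf.ols("poverty_1990 ~ poverty_1980 + births_trend + ui_trend",
                  data=areas).fit()
    print(round(fit.rsquared, 2))   # how much end-period variation the records help capture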

Our experience with availability of records data at the state-level (when working on the design of the evaluation of Pew's Children's Initiative) convinced us that there are far more systems-wide records being collected - in many cases with individual and geographic area level information - than we would have thought. Much of the impetus for the development of these data systems comes from the Federal government in the form of program requirements (both for delivery of services and for accountability) and, more importantly, from the Federal financial support for systems development.

Evaluations of employment and training programs have already made wide use of Unemployment Insurance records and these records have broad coverage of the working population. More limited use has been made of Social Security records. In a few cases, it has been possible to merge Social Security and Internal Revenue Service records. Birth records collection has been increasingly standardized and some investigators have been able to use time series of these records tied to geographic location. The systems records, beyond these four, cover much more restricted populations, e.g., welfare and food stamps, Medicaid and Medicare, WIC.

More localized record systems which present greater problems of developing comparability are education records and criminal justice records. However, in some states statewide systems have been, or are being, developed to draw together the local records.

We are currently investigating other types of geo-coded data that might be relevant to community-wide measures. Data from the banking system have become increasingly available as a result of the Community Reinvestment Act (HMDA data). Local real estate transaction data can sometimes be obtained, but information from tax assessments seems harder to come by.

In all of these cases, whenever it is desired to obtain individualized data, problems of confidentiality present substantial barriers to general data acquisition by anyone other than public authorities. Even with the Census data, for many variables, one cannot get data at a level of aggregation below block group level.

b.Enhance Community Capability to Do Systematic Data Collection

We believe that it is possible to pull together records data of the types just outlined to create community databases which could be continuously maintained and updated. These data would provide communities with some means to keep monitoring, in a relatively comprehensive way, what is happening in their areas. This would make it possible to get better time-series data with which to look at the evolution of communities. To the degree that communities could be convinced to maintain their records within relatively common formats, an effort could be made to pull together many different communities to create a larger data base which would have time-series, cross-section structure and would provide a basis for understanding community processes.

Going a step beyond this aggregation of records, attempts could be made to enhance the capability of communities to gather new data of their own. These could be anything from simple surveys of physical structures based on externally observed characteristics (type of structure, occupied, business or organization, public facility, etc.) carried out by volunteers within a framework provided by the community organization to full-scale household surveys on a sample or on a census basis.

c. Create a Panel Study of Communities

As already noted above, if many communities used common formats to put together local records data, one would have a potential time-series, cross-section database. In the absence of that, admittedly unlikely, development, it might be possible to imitate the several nationally representative panel studies of individuals (the Panel Study of Income Dynamics, the National Longitudinal Study of Youth, High School and Beyond, to name the most prominent) which have been created and maintained, in some cases since the late 1960s. Here the unit of analysis would be communities - somehow defined. The objective would be to provide the means to study the dynamics of communities. Such a panel would provide us with important information on what the cross-section and time-series frequency distributions of community level variables look like - important ingredients, we have argued above, for an evaluation sample design effort with communities as units of observation. This would provide the best basis for our next suggestion, work on modeling community level variables.

Short of creating such a panel study, some steps might be taken to at least get federally funded research to pull together, across projects, the information developed on various community level measures. There are increasing numbers of studies in which community level data are gathered for evaluating or monitoring programs or for comparison communities. We noted above several national studies which were using a comparison site methodology (Healthy Start, Youth Fair Chance, the School Dropout Study), and some gains might be made if coordination resulted in pooling some of these data.

d. Modeling Community Level Variables

As we mentioned above, statistical modeling might provide the basis for generating more reliable counterfactuals for community initiatives; a good model would generate predicted values for endogenous outcome variables for a given community in the absence of the intervention by using historical time series for that community and such contemporaneous variables as are judged to be exogenous to the intervention. At a minimum, such models would provide a better basis for matching communities if a comparison community strategy is attempted.
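
A minimal sketch of the kind of model we have in mind appears below, with entirely hypothetical numbers: the outcome is fit on its historical trend and on a covariate judged exogenous to the intervention, using pre-intervention years only, and the fitted model is then used to predict post-intervention values as the counterfactual.

```python
import numpy as np

# Hypothetical community-level series: 12 pre-intervention years and
# 3 post-intervention years of an outcome (say, a teen birth rate), plus
# one contemporaneous variable judged exogenous (say, county unemployment).
years        = np.arange(1980, 1995)
outcome      = np.array([42, 41, 43, 44, 46, 45, 47, 49, 48, 50, 51, 52,  # pre
                         47, 45, 44], dtype=float)                        # post
unemployment = np.array([6.1, 6.3, 7.0, 7.2, 6.8, 6.5, 6.4, 6.9, 7.1, 7.4,
                         7.6, 7.5, 7.3, 7.0, 6.8])
pre = years < 1992   # intervention assumed to begin in 1992

# Fit outcome ~ constant + linear trend + exogenous covariate on the
# pre-intervention years only (ordinary least squares via numpy).
X = np.column_stack([np.ones(pre.sum()), years[pre], unemployment[pre]])
beta, *_ = np.linalg.lstsq(X, outcome[pre], rcond=None)

# Predict the counterfactual for the post-intervention years and compare
# with what was actually observed; the gap is the estimated impact.
X_post = np.column_stack([np.ones((~pre).sum()), years[~pre], unemployment[~pre]])
counterfactual = X_post @ beta
for yr, actual, cf in zip(years[~pre], outcome[~pre], counterfactual):
    print(f"{yr}: observed {actual:.1f}, predicted without program {cf:.1f}, "
          f"estimated impact {actual - cf:+.1f}")
```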

e. Develop Better Measures of Social Networks and Community Formal and Informal Institutions.

We have not studied the literature on associational networks in any depth, so our characterization of the state of knowledge in this area may be incorrect. However, it seems to us that considerably more information on and experience with different measures of associational networks is needed, given their central role in most theories relating to community-wide processes.

Measures of the density and character of formal institutions appear to us to have been little developed - though, again, we have not searched the literature in any depth. There are industrial censuses for some subsectors. We know of private sector sources that purport to provide reasonably comprehensive listings of employers. Some Child Care Resource and Referral Networks have tried to create and maintain comprehensive listings of child care facilities. There must be comprehensive listings of licensed health care providers. Public schools should be comprehensively listed. However, when for recent projects we have discussed how one would comprehensively survey formal institutions, it was not at all clear what to draw on for a sampling frame.

Informal institutions present even greater problems. Clubs, leagues, volunteer groups, etc. are what we have in mind. Strategies for measuring such phenomena on a basis which would provide consistent measures over time and across sites need to be developed.

f. Tighten Relationships between Short-term (intermediate) Outcome Measures and Long-term Outcome Measures.

Inability or unwillingness to wait for the measurement of long-term outcomes is a problem which many studies, particularly those of children and youth, face. Increasingly we talk about "youth trajectories." Good comprehensive information linking the many short-term, often softer, outcome measures to long-term outcomes further along the trajectory may exist, but we are not aware of it. We find ourselves time and again asking what we know about how a given short-term measure, participation in some activity, say, Boy Scouts, correlates with a long-term outcome, say employment and earnings. Even more rare is information on how program-induced changes in a short-term outcome are related to changes in long-term outcomes. We may know that the level of a short-term variable is highly correlated with a long-term variable but not know to what degree a change in that short-term variable is associated with a change in the long-term variable. Thus we believe systematic compilations of information about correlations between short-term and long-term outcome variables would be very helpful and could set an agenda for more data gathering on these relationships where necessary.
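
The distinction can be made concrete with a small simulated calculation (all numbers hypothetical): the correlation between the levels of a short-term and a long-term measure can be substantial even when the correlation between changes in the two is weak, and it is the latter relationship that matters for interpreting program-induced movement in intermediate outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical individuals: a stable underlying factor drives the *levels*
# of both a short-term measure (e.g., activity participation) and a
# long-term outcome (e.g., later earnings), so levels are well correlated.
ability  = rng.normal(size=n)
short_t0 = ability + 0.3 * rng.normal(size=n)
long_t0  = ability + 0.3 * rng.normal(size=n)

# Program-induced *changes* in the short-term measure, assumed here to feed
# through only weakly (coefficient 0.1) into changes in the long-term outcome.
delta_short = 0.5 * rng.normal(size=n)
delta_long  = 0.1 * delta_short + 0.5 * rng.normal(size=n)
short_t1, long_t1 = short_t0 + delta_short, long_t0 + delta_long

print("correlation of levels: ",
      np.corrcoef(short_t1, long_t1)[0, 1].round(2))
print("correlation of changes:",
      np.corrcoef(short_t1 - short_t0, long_t1 - long_t0)[0, 1].round(2))
```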

g. More Studies to Determine the Reliability of Constructed Comparison Group Designs.

We have stressed the importance of the information provided by the two sets of studies (Fraker and Maynard, and LaLonde; and Friedlander and Robins) which used random assignment data as a base and then constructed comparison groups to test the degree of error in the comparison group estimates. It should be possible to find more situations in which this type of study could be carried out. First, replications of such studies should look at outcome variables other than employment or earnings, to see if the degree of vulnerability differs by type of outcome variable and/or type of intervention. Second, more such studies would give us a far better sense of whether the degree of vulnerability of the non-experimental methods is indeed persistent and widely found in a variety of data sets and settings.
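
In skeletal form, such a reliability study looks like the following sketch (simulated data): estimate the impact against the randomized control group, re-estimate it against a constructed comparison group drawn from a different population, and treat the gap between the two estimates as the error of the non-experimental method.

```python
import numpy as np

rng = np.random.default_rng(1)
true_impact = 1000.0

# Hypothetical annual earnings from a randomized experiment.
treat   = rng.normal(9000 + true_impact, 3000, size=400)
control = rng.normal(9000, 3000, size=400)

# A constructed comparison group drawn from a survey population that is
# systematically different from program applicants (here, higher earnings).
constructed = rng.normal(9800, 3000, size=400)

experimental_estimate    = treat.mean() - control.mean()
nonexperimental_estimate = treat.mean() - constructed.mean()

print(f"experimental benchmark:     {experimental_estimate:8.0f}")
print(f"constructed-comparison est: {nonexperimental_estimate:8.0f}")
print(f"error of non-experimental:  "
      f"{nonexperimental_estimate - experimental_estimate:8.0f}")
```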


Appendix

Annotated Examples of Studies Using Various Evaluation Strategies

Counterfactual from Statistical Modeling

Bhattacharyya, M.N., and Layton, Allan P. "Effectiveness of Seat Belt Legislation on the Queensland Road Toll -- An Australian Case Study in Intervention Analysis." Journal of the American Statistical Association 74 (1979):596-603.

This is an evaluation of the effect of three separate laws enacted between 1969 and 1972 in Australia which made compulsory both the installation of seatbelts in new cars and the wearing of seatbelts. The variable used to measure the effectiveness of the legislation was the quarterly number of (relevant) road deaths from 1950 to 1976.

The significance of the impact of each law was determined by comparing the forecasted levels of deaths in the post-fit period (which, for the first two laws, included only the time up until the next law) to the actual observations. The first two laws showed no significant lack of fit; however, the 1972 law (mandatory seat belt wearing) produced results significantly different from those predicted by the model, so it was judged to be effective.

The study assumes the constancy of accident-related variables and of the noise structure, the absence of other major interventions occurring simultaneously, and that the first two laws caused an exponential decline in the number of cars without seatbelts.

The model (later augmented to account for the effect of the changing nature of the population of vehicles -- a growing percentage has seat belts) was judged to be inadequate because it predicted a permanent steady decline in the number of deaths, hypothetically until it reached zero, which is not feasible.

Researchers then tried a causal model, in which they used the volume of driving as an independent variable. Gasoline consumption was used as a proxy for the volume of driving. The model accounts for the transitional effect after the first law of the conversion of all cars to those with seat belts. The noise structure in this model accounts for the autocorrelations between the observations. The qualitative results were the same, but with greater significance.
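
For readers unfamiliar with this style of intervention analysis, the sketch below (a simulated quarterly series, not the Queensland data; it uses the statsmodels library) illustrates the basic forecast-versus-actual comparison: fit a time-series model to the pre-intervention observations, forecast the post-intervention period, and ask how much of the actual series falls outside the forecast band.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)

# Simulated quarterly death counts: a slowly drifting pre-law series,
# followed by a post-law period shifted down by 15 deaths per quarter.
pre  = 100 + np.cumsum(rng.normal(0, 3, size=60))
post = pre[-1] - 15 + np.cumsum(rng.normal(0, 3, size=12))

# Fit an ARIMA(1,1,0) model to the pre-intervention quarters only.
model = ARIMA(pre, order=(1, 1, 0)).fit()

# Forecast the post-intervention quarters and compare with what occurred.
forecast = model.get_forecast(steps=len(post))
predicted = forecast.predicted_mean
lower, upper = forecast.conf_int(alpha=0.05).T

outside = (post < lower) | (post > upper)
print(f"quarters outside the 95% forecast band: {outside.sum()} of {len(post)}")
```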


Grossman, Jean Baldwin. "The Technical Report for the AFDC Forecasting Project for the Social Security Administration/ Office of Family Assistance." Mimeographed. Princeton: Mathematica Policy Research, February 1985.

Beebout, Harold, and Grossman, Jean Baldwin. "A Forecasting System for AFDC caseloads and Costs: Executive Summary." Mimeographed. Princeton: Mathematica Policy Research, February 1985.

These studies used data from 1975 to 1984 to develop caseload forecasting models which the government uses to predict future AFDC caseloads and expenditures. The judgmental method and the analytical method, which differ in the amount of reliance on the knowledge and experience of the forecaster, were compared.

Both national and state-by-state models were created, with the national model performing slightly better (less variance).


Garasky, Steven. "Analyzing the Effect of Massachusetts' ET Choices Program on the State's AFDC-Basic Caseload." Evaluation Review 14 (1990):701-710.

An evaluation of the Massachusetts Employment and Training (ET) Choices program. The author modeled what the caseload would have been in the absence of the program using data from the first quarter of 1976 through the third quarter of 1983, when ET Choices was implemented.


Garasky, Steven, and Barnow, Burt S. "Demonstration Evaluations and Cost Neutrality: Using Caseload Models to Determine the Federal Cost Neutrality of New Jersey's REACH Demonstration." Journal of Policy Analysis and Management 11(1992):624-636.

Evaluation of the change in costs incurred by a switch to the NJ REACH (Realizing Economic Achievement) program. Cost neutrality was required to keep the program in operation.

Developed AFDC caseload projection models to ensure that, within a tolerance band, the new demonstration did not exceed the costs of the prior AFDC program. Pre-intervention data used to derive the model was from 1978-87.

The program was studied from the time it was phased in in the first three counties in October of 1987 until the following October when the cost neutrality negotiations were opened up again. The models calculated caseload savings that the federal government considered to be too large to be plausible.


Kaitz, Hyman B. "Potential Use of Markov Process Models to Determine Program Impact." In Research in Labor Economics, edited by Farrell E. Bloch, pp. 259-283. Greenwich: JAI Press, 1979.

Attempts to measure impact on labor force participation. The study describes use of longitudinal data (pre- and post-program) on labor force mobility of participants to model equilibrium labor force patterns with Markov processes (which assume that labor force participation in one period depends only on behavior in the previous period). Adjusts for aging and economic conditions.

To the extent that labor force data do not exist for a particular subgroup (e.g., those who were young, in school, in the military, or in prison) or cannot be considered representative (given the volatility of youth behavior), this approach will not be useful. The study gives some empirical examples from the Public Employment Program, 1971-1973.

The author shows effects of loosening some of the model's strict assumptions and offers alternative techniques and extensions of the basic model.


Mathematica Policy Research. "Evaluation of the Nutrition Assistance Program in Puerto Rico: Volume II, Effects on Food Expenditures and Diet Quality." Mimeographed. Princeton: Mathematica Policy Research, 1985 (see also Fraker, Thomas; Devaney, Barbara; and Cavin, Edward. "An Evaluation of the Effect of Cashing Out Food Stamps on Food Expenditures." American Economic Review 76 (1986):230-239.)

An evaluation assessing the effect of a food stamp cashout program (the Nutrition Assistance Program) in Puerto Rico, which in 1982 replaced the food stamp program that had been in effect since 1974. Data are taken from two food intake surveys, 1977 and 1984.

The authors modeled the food stamp caseload before the program was implemented. Then after the program went into effect, impacts were measured by comparing the current estimates of food expenditures to the expenditures that would have occurred in the absence of the cashout, as predicted by the model. The authors also modeled the participation decision in order to adjust for selection bias.


McCleary, Richard, and Riggs, James E. "The 1975 Australian Family Law Act: A Model for Assessing Legal Impacts." In New Directions for Program Analysis: Applications of Time Series Analysis to Evaluation, edited by Garlie A. Forehand. New Directions for Program Evaluation, number 16, a publication of the Evaluation Research Society, Scarvia B. Anderson, Editor-in-Chief, San Francisco: Jossey-Bass, Inc., December 1982.

An evaluation of the impact of the Australian Family Law Act (which provided for a form of no-fault divorce) on divorce rates.

Annual data from 1946 to 1979. Used a compound impact model of the form developed by Box and Tiao (JASA, 1975) which estimates both temporary and permanent shifts in the series. Statistically significant impacts on divorce rates were found for both temporary and permanent components.


McConnell, Beverly B. "Evaluating Bilingual Education Using a Time Series Design." In Applications of Time Series Analysis to Evaluation, edited by Garlie A. Forehand. New Directions for Program Evaluation, number 16, a publication of the Evaluation Research Society, Scarvia B. Anderson, Editor-in-Chief, San Francisco: Jossey-Bass, Inc., December 1982.

Evaluation of the impact of Individualized Bilingual Instruction (IBI) on standardized test scores. The program targeted preschool to 3rd grade children of migrant farm workers. Children were pretested upon entering the program and then retested after every 100 days of attendance. The test scores were standardized by the child's age.

Accumulation of pre-test scores of children entering at various ages was collected for 4 years and then used as a representation of how children in this sample would have fared without the bilingual program. The results indicate that three years of this program should, on average, improve the skills of these children to a level competitive with children whose first language is English.


Comparison Group derived from survey data

Ashenfelter, Orley "Estimating the Effect of Training Programs on Earnings." Review of Economics and Statistics 60 (1978):47-57.

An evaluation of the impact of the Manpower Development and Training Act (MDTA) program on participant earnings. Studied trainees who began participation in the first quarter of 1964, using data from 1961 to 1969 (data from before this period were available; 1961, 1962, and 1963 were tried as alternate base years for the model). Data for the treatment group were drawn from program records augmented with Social Security (SS) earnings data; data for the comparison group came from the Continuous Work History Sample (CWHS).

Pre-program earnings trends were different for treatment and comparison groups (p.51). Attempts were made to adjust for observed differences in earnings functions between groups using regression analysis. No attempt was made to adjust for selection bias. The author points out the problem of truncation of SS records at the maximum taxable amount (p.56).


Ashenfelter, Orley, and Card, David. "Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs." Review of Economics and Statistics 67(1985):648-60.

An evaluation of the impact of the Comprehensive Employment and Training Act (CETA) on participant earnings. The authors studied the 1976 cohort of enrollees using data from 1970 to 1978. Data for the treatment group were from the Continuous Longitudinal Manpower Survey (CLMS); data for the comparison group were from the Current Population Survey (CPS). A stratified random sample was taken from those screened for eligibility criteria (p.649). The comparison group was resampled to reflect the age distribution of CETA participants. Pre-program earnings trends differed for treatment and comparison groups (p.51).

The authors used a components of variance model with a random growth component and a selection rule for the participants in an attempt to control for observed differences in earnings functions as well as possible selection bias. Estimates are highly sensitive to changes in model specifications.


Bassi, Laurie J. "Estimating the Effect of Training Programs with Nonrandom Selection." Review of Economics and Statistics 66(1984):36-43 (see also, Bassi, Laurie. "The Effect of CETA on the Post-Program Earnings of Participants." The Journal of Human Resources 18 (1983):539-56.)

An evaluation of the impact of CETA on trainee earnings using CLMS and CPS. Fiscal year 1976 enrollees were studied. The author uses 1973 and 1974 as base years and follows through 1978.

Screening and stratified (or cell) matching were used to construct a comparable comparison group. Attempts were made to control for both selection bias and the "creaming" problem through fixed effects models.
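
The logic of the fixed-effects approach used in these CETA studies can be illustrated with a simulated example (hypothetical earnings figures): differencing each person's pre- and post-program earnings sweeps out any permanent individual component, so selection on such permanent characteristics no longer biases the estimate, though selection on transitory shocks still would.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
true_impact = 800.0

# Hypothetical permanent earnings component; lower-earning people are more
# likely to enroll (selection on a permanent, unobserved characteristic).
permanent = rng.normal(10000, 2500, size=n)
enrolled  = rng.random(n) < 1 / (1 + np.exp((permanent - 10000) / 1500))

pre  = permanent + rng.normal(0, 1000, size=n)
post = permanent + rng.normal(0, 1000, size=n) + true_impact * enrolled

# A naive cross-section comparison of post-program earnings is biased
# downward because enrollees have lower permanent earnings.
naive = post[enrolled].mean() - post[~enrolled].mean()

# Fixed-effects (gain-score) estimate: compare *changes*, which sweeps
# out the permanent component.
gain = post - pre
fixed_effects = gain[enrolled].mean() - gain[~enrolled].mean()

print(f"true impact:            {true_impact:7.0f}")
print(f"naive post comparison:  {naive:7.0f}")
print(f"fixed-effects estimate: {fixed_effects:7.0f}")
```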

The author discusses problems of meager specification of CLMS data, truncation of SS earnings, contamination of CPS with CETA participants and other data problems.


Barnow, Burt S. "The Impact of CETA Programs on Earnings." The Journal of Human Resources 22 (1987):157-193

Reviews 6 CETA studies: Westat (x2), Bassi, Bloom/McLaughlin, Dickinson/Johnson/West, and Geraci.


Bloom, Howard S. "What Works for Whom? CETA Impacts for Adult Participants." Evaluation Review 11 (1987):510-527.

An evaluation of the impact of CETA on trainee earnings using CLMS and CPS. Studied participants who entered the program between 1/75 and 6/76 and followed them through 1978 (it is unclear what was used as the base year, although some graphs use data as far back as 1964, and since the CLMS was used, SS data back to 1951 should have been available). The comparison group was randomly selected from the CPS subject to certain eligibility criteria, and a time-varying fixed effects model was used.


Bryant, Edward C., and Rupp, Kalman. "Evaluating the Impact of CETA on Participant Earnings," Evaluation Review 11(1987):473-92.

An evaluation of the impact of CETA on trainee earnings using CLMS and CPS. The authors examined FY 1976 and FY 1977 cohorts using data through 1979 and base years of 1972 and 1973 respectively. They screened for eligibility and then used stratified matching to construct comparison group.

The authors claim the matching strategy is robust to model specification; they used an autoregressive earnings function.


Dickinson, Katherine P.; Johnson, Terry R.; and West, Richard W. "An Analysis of the Sensitivity of Quasi-Experimental Net Impact Estimates of CETA Programs." Evaluation Review 11 (1987):452-472.

An evaluation of the impact of CETA on trainee earnings using CLMS and CPS. The authors examined 1976 enrollees through 1978 considering use of 1972, 1973, and 1974 as base years.

They screened for eligibility and then used statistical, or "nearest neighbor," matching.

The authors used a basic OLS model to control for measurable differences and then compared the results with estimates from a model which uses a symmetric-difference estimator -- which should control for the selection decision and for individual-specific fixed and random effects.

They found that impact estimates were sensitive to the employment status of sample members prior to the treatment period (p.469) and to the analytical model used, but robust to the matching procedure used. They admit that it is hard to determine which of the alternate estimates found through sensitivity analysis is the "correct" one.


Finifter, David H. "An Approach to Estimating Net Earnings Impact of Federally Subsidized Employment and Training Programs." Evaluation Review 11 (1987):528-47.

An evaluation of the impact of CETA on trainee earnings using CLMS and CPS. The author studied the cohort from FY 1976.

He used stratified matching to construct comparison group.

The author pooled data for 9 years (p.530), from 1970 to 1978 (p.535). Separate regressions were run for the comparison and treatment groups, using a "pooled cross-section time-series model that controls for year-specific and individual-specific (fixed) effects" (p.536).


Within Site Comparison Groups

Burghardt, John; Gordon, Anne; Chapman, Nancy; Gleason, Philip; and Fraker, Thomas. "The School Nutrition Dietary Assessment Study: Dietary Intakes of Program Participants and Nonparticipants." Mimeographed. Princeton: Mathematica Policy Research, October 1993. (see also the other reports from this study including: Data Collection and Sampling; School Food Service, Meals Offered, and Dietary Intakes; Summary of Findings)

An evaluation of the impact on dietary intakes of the National School Lunch Program (NSLP) and the School Breakfast Program (SBP), both of which are voluntary programs. Data was collected from surveys and interviews conducted during the period of February to May of 1992 (p.xi).

Participants were contrasted with non-participants.

The study used a multi-stage stratified sample weighted by probability of selection of schools and individuals within schools - 626 schools, 350 districts, 45 states; random selection of districts, 3 schools per district, 10 students per school. One day of interviews.

The analytic model adjusts for measured differences in the individual, the individual's family, and the characteristics of the school and community. An attempt to adjust for selection bias was made by accounting for the participation decision (a joint model) using an instrumental variables approach. The authors tried alternative models with separate equations for participants and nonparticipants, each of which was adjusted for selection bias (using a 2-stage approach), and got similar results. They tested the assumption that the identifying variable in the participation equation does not influence the outcome of interest. They examined the variance of estimates with different combinations of identifying variables and compared estimates from selection-bias-adjusted models with those from non-adjusted models. The results showed that it is hard to get rid of this kind of selection bias; the models were sensitive to slight variations in assumptions, and consequently the authors caution against using the results as a true measure of impact.
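
The two-stage selection adjustment described above can be sketched as follows (simulated data; the variable names are ours, and the sketch uses statsmodels and scipy rather than the authors' software): a probit participation equation with an identifying variable supplies a correction term that is then added to the outcome regression.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(4)
n = 5000
true_impact = 2.0

# Hypothetical data: an unobserved factor raises both the chance of
# participating and the outcome, so a raw participant/non-participant
# difference overstates the impact.  z is the identifying variable,
# assumed to shift participation but not the outcome directly.
unobserved = rng.normal(size=n)
z = rng.normal(size=n)
participate = (0.8 * z + unobserved + rng.normal(size=n) > 0).astype(float)
outcome = 10 + true_impact * participate + 1.5 * unobserved + rng.normal(size=n)

# Stage 1: probit of participation on the identifying variable.
probit = sm.Probit(participate, sm.add_constant(z)).fit(disp=0)
index = sm.add_constant(z) @ probit.params

# Generalized residual (inverse Mills ratio term) used as a control for
# the unobserved factor in the outcome equation.
lam = np.where(participate == 1,
               norm.pdf(index) / norm.cdf(index),
               -norm.pdf(index) / (1 - norm.cdf(index)))

# Stage 2: outcome regression with and without the selection correction.
X_naive = sm.add_constant(participate)
X_corr  = sm.add_constant(np.column_stack([participate, lam]))
naive    = sm.OLS(outcome, X_naive).fit().params[1]
adjusted = sm.OLS(outcome, X_corr).fit().params[1]

print(f"true impact {true_impact:.2f}, naive {naive:.2f}, "
      f"selection-adjusted {adjusted:.2f}")
```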


Devaney, Barbara; Bilheimer, Linda; and Schore, Jennifer. "The Savings in Medicaid Costs for Newborns and their Mothers from Prenatal Participation in the WIC Program." Vols. 1 and 2, Mimeographed. Princeton: Mathematica Policy Research, April 1991.

An evaluation of The Special Supplemental Food Program for Women, Infants, and Children (WIC), which offered prenatal services and food benefits to pregnant Medicaid beneficiaries.

The study examined all Medicaid-covered births (during 1987 for four states, and the first half of 1988 for a fifth state) in terms of the costs of the prenatal program relative to the savings in post-partum care for 60 days after birth. The savings was measured as the regression-adjusted difference between post-partum costs for the voluntary participants in the program versus these costs for nonparticipants.

The evaluators attempted to account for selection bias through the use of maximum likelihood estimates of a joint model of costs and participation. They had difficulty isolating a variable which influenced the participation decision but did not affect Medicaid costs (partially because of the limited nature of their data). Consequently, the difference in participation propensity for each group was quite small, and the model was not robust to even minor specification changes.

The authors wanted to examine the effect of length of participation in the program on the outcome measures; however, this effect was confounded with gestational age.


Jimenez, Emmanuel, and Kugler, Bernardo. "The Earnings Impact of Training Duration in a Developing Country: An Ordered Probit Selection Model of Colombia's Servicio Nacional de Aprendizaje (SENA)." The Journal of Human Resources 22 (1987):228-247.

An evaluation of a Colombian training program, SENA. The authors first modeled the decision to participate in long courses, short courses, or not at all (a trichotomous variable). They then used the results of that model in an OLS earnings model. They found that impact estimates not accounting for these decisions would have overestimated program effects.

Data were derived from a survey conducted between 1979 and 1981 which has information both on SENA trainees and on nonparticipants who were similar to the trainees in terms of the types of firms in which they worked.

The earnings model was robust to certain specification changes.


Kiefer, Nicholas M. "Federally Subsidized Occupational Training and the Employment and Earnings of Male Trainees." Journal of Econometrics 8 (1978):111-25.

The study provides evaluations of the MDTA program for a two and a half year period beginning in 1969. The sample was taken from ten major Standard Metropolitan Statistical Areas (SMSAs) for both trainees and eligible non-participants. Groups were matched on age, race, and gender. Separate estimates were derived for blacks and whites.

The author modeled earnings as a function of estimated earnings plus a quadratic equation of weeks of participation in the program. He used a Heckman (1976) technique to control for correlation between the probability of employment and earnings (to help correct for the problem of zero earnings for those unemployed). He tested for correlation between the selection into the program and the error term from the earnings equation and found it not to be significant.

Results show small negative effects on earnings for blacks and insignificant effects for whites.


Kiefer, Nicholas. "Population Heterogeneity and Inference from Panel Data on the Effects of Vocational Training." Journal of Political Economy 87 (1979):p213-26.

Same sample as the Journal of Econometrics study. The author here used a model which includes both individual and time effects and does not assume that they are orthogonal to the regressors. He finds substantial cross-sectional bias.


Cooley, Thomas M.; McGuire, Timothy W.; and Prescott, Edward C. "Earnings and Employment Dynamics of Manpower Trainees: An Exploratory Econometric Analysis." In Research in Labor Economics, edited by Farrell E. Bloch, pp. 119-148. Greenwich: JAI Press, 1979.

An evaluation of MDTA participants in 1969, 1970, and 1971 cohorts.

The authors claim the use of no-shows for the comparison group is superior to survey data: first, on theoretical grounds, since having enrolled in the program indicates substantial similarity to the treatment group; second, because the autocorrelation function of their earnings is very similar to that of trainees; and, third, because unobservable differences can be better controlled for with this comparison group. The authors argue for simpler, more robust models which require fewer assumptions.


Matched site comparison groups -- no modeling

Buckner, John C., and Chesney-Lind, Meda. "Dramatic Cures for Juvenile Crime: An Evaluation of a Prisoner-Run Delinquency Prevention Program." Criminal Justice and Behavior 10 (1983):227-247.

An evaluation of a delinquency deterrent project based on Rahway's Scared Straight program. The sample is a one year follow-up of the first 100 male and 50 female participants (and comparison group members) in the program which began in August 1979. The authors used matched comparison groups: they manually went through prison records to match individual by individual on certain key characteristics (gender, age, arrest record, etc.). They found higher rates of post-program arrests leading to charges among males who participated in the program.


Duncan, Burris; Boyce, W. Thomas; Itami, Robert; and Puffenbarger, Nancy. "A Controlled Trial of a Physical Fitness Program for Fifth Grade Students." Journal of School Health 53 (1983):467-471.

An evaluation of a nine-month (lasting the school year) physical fitness program implemented in the fall of 1979. Tests were taken by subjects prior to the implementation, at the end of the school year, and at the beginning of the following school year (after the summer break with no treatment).

Two fifth grade classes were studied, one from each of two neighboring schools: one received the program, one did not. The authors found no significant differences between the groups in the distributions of some key characteristics (age, gender, height), but significant differences in others (ethnicity, weight, and weight for height). There were significant pretest differences on only one measure.

Significantly higher improvement was found for the treatment group on four of the nine tests. Differences had shrunk slightly by the time of the second post-test.


Evans, Richard; Rozelle, Richard; Mittelmark, Maurice; Hansen, William; Bane, Alice; and Havis, Janet. "Deterring the Onset of Smoking in Children: Knowledge of Immediate Physiological Effects and Coping with Peer Pressure, Media Pressure, and Parent Modeling." Journal of Applied Social Psychology 8 (1978):126-135.

An evaluation of a 10 week program to prevent youths from starting smoking (smokers excluded from study sample). Sample consisted of 759 students entering 7th grade from 10 schools.

There were four treatment levels: 1) full treatment -- 4 educational videotapes, feedback (updates on classes' smoking behaviors), and testing (attitudes, behaviors); 2) just feedback and testing; 3) just testing; 4) control -- only pre- and post-testing (as opposed to the 4 post-tests of the other treatments).

Two schools: combined populations and randomly assigned each student to one of the four levels (possible spillover/contamination)

Eight schools: 2 schools assigned to each treatment level

No good discussion of differences in estimates using randomly assigned individuals as opposed to assigned schools.


Flay, Brian; Ryan, Katherine B.; Best, J. Allen; Brown, K. Stephen; Kersell, Mary W.; d'Avernas, Josie R.; and Zanna, Mark P. "Are Social-Psychological Smoking Prevention Programs Effective? The Waterloo Study." Journal of Behavioral Medicine 8 (1985):37-59.

An evaluation of a smoking prevention program. Children participated primarily over the first 3 months of their sixth grade school year (1979-80) and then received additional sessions in their 7th and 8th grade years.

Twenty matched schools were randomly allocated to treatment or control status. Schools were matched on size, socioeconomic characteristics, and urban/rural designation. (Discusses one matched pair but it is unclear whether all were matched as pairs or some as groups with similar characteristics).

Pretest showed no significant differences in gender or individual, peer, parental, or sibling smoking behaviors between groups.

Separate estimates were made for subgroups defined by smoking behavior at pretest. The greatest effects were found for those classified at the outset as "experimenting." The authors tried to model smoking behavior at the school level using a binomial regression model, but the fit was bad. Separate estimates for subgroups defined by risk level showed significant favorable impacts for treatment students at high risk.

Discusses issue of unit of analysis (p.41).


Freda, Margaret Comerford; Damus, Karla; and Merkatz, Irwin R. "The Urban Community as the Client in Preterm Birth Prevention: Evaluation of a Program Component." Social Science & Medicine 27 (1988):1439-1446.

An evaluation of a pre-term birth videotape intervention aimed at increasing community awareness about the problem of preterm births.

A sample of 10 Community Boards was randomly allocated, 5 to treatment and 5 to control. Studied from June 1986 to August 1986.


Hurd, Peter D.; Johnson, C. Anderson; Pechacek, Terry; Bast, L. Peter; Jacobs, David R.; and Luepker, Russell V. "Prevention of Cigarette Smoking in Seventh Grade Students." Journal of Behavioral Medicine 3 (1980):15-28.

Evaluation of a smoking prevention program. (The report mentions three monitoring points of October, December, and May, and notes that October was the baseline, but does not specify a year.)

The study sample consisted of the seventh grade classes of four schools in a district. Two schools were assigned to treatment, two to control status. Assignment was not random: a high-income and a low-income school, and a high-smoking-rate and a low-smoking-rate school, were picked for each group.

Statistically significant baseline differences in smoking behavior between treatment and comparison groups.


McAlister, Alfred; Perry, Cheryl; Killen, Joel; Slinkard, Lee Ann; Maccoby, Nathan. "Pilot Study of Smoking, Alcohol and Drug Abuse Prevention." American Journal of Public Health 70 (1980):719-721.

An evaluation of a program to prevent drug and alcohol abuse in adolescents. Observations of the participants took place over 21 months, from 1977-1979. The treatment group was in a junior high school which was targeted as a problem school. The comparison school was defined as a demographic match, plus it was close by and the administrators were willing to cooperate with the program. There were similar rates of parental smoking and preprogram student smoking between groups. Favorable statistically significant differences in trends between groups were found for several outcomes.


Perry, Cheryl L.; Killen, Joel; and Slinkard, Lee Ann. "Peer Teaching and Smoking Prevention Among Junior High Students." Adolescence 15 (1980):277-281.

An evaluation of Project CLASP (Counseling Leadership About Smoking Pressures).

The treatment group was an entire 7th grade class who received instruction through the end of their 8th grade year (the 77/78 and 78/79 school years). Self-reports of smoking behavior were used at three points: 9/77, 6/78, and 12/78. The comparison group was composed of the 7th grade classes from two schools in a neighboring community. There is no discussion of the matching strategy or of preprogram comparisons between groups.

Significant differences were found between treatment and combined comparison groups. However, one of the schools, when examined separately, was not significantly different in terms of smoking behavior for the week prior to testing.


Perry, Cheryl L.; Telch, Michael J.; Killen, Joel; Burke, Adam; and Maccoby, Nathan. "High School Smoking Prevention: The Relative Efficacy of Varied Treatments and Instructors." Adolescence 18 (1983):561-566.

An evaluation of a program to prevent high school smoking. Five classes in each of four schools were randomly assigned to one of six treatment combinations (two kinds of instruction, three kinds of curricula). Treatment took place during the first two weeks of 3/80. Assessments were in 2/80 and 5/80.

No significant differences between instruction means or curricula means.

Possible interaction between curricula and instructor.

Smoking behaviors appear to have changed between pre- and post-tests; however, there is no control (no-treatment) group against which to measure this shift.


Perry, Cheryl L.; Mullis, Rebecca M.; and Maile, Marla C. "Modifying the Eating Behavior of Young Children." Journal of School Health 55 (1985):399-402.

An evaluation of a nutritional education program conducted in the fall of 1982. Food recalls taken in 9/82 and 12/82. The study sampled from four elementary schools: 8 third and fourth grade classrooms from two of the schools assigned to treatment status, 8 matched classrooms in the other two schools acted as comparisons.

Groups were matched on school size, socioeconomic status, etc., and no significant differences were found between these preprogram characteristics for the two groups.

Results showed significant differences between groups.

(Mentions adjusting for age and gender (p.401), but unclear how exactly this was done.)


Vincent, Murray L.; Clearie, Andrew F.; and Schluchter, Mark D. "Reducing Adolescent Pregnancy Through School and Community-Based Education." Journal of the American Medical Association 257 (1987):3382-3386.

An evaluation of a teen pregnancy prevention program in a rural site which took place between 9/82 and 9/87. Treatment site is one half of a county. Comparison sites are the other half of the same county as well as three other communities in the state matched on sociodemographic similarity.

Spillover effects in the contiguous comparison community led to unintended dosage effects.

The program appears to have been very successful. The study measures the change in average estimated pregnancy rates from the pre-program period (1981-1982) to two post-implementation periods (1983-1985, 1984-1985) and compares this against the same change in non-intervention sites. When using the change between pre-program and the average 1984-1985 rate, there is a 35.5% drop (statistically significant) in estimated pregnancies. The other half of the county had a non-significant drop, one of the other three counties had a non-significant gain, and the other two comparison counties had statistically significant gains.


Zabin, Laurie S.; Hirsch, Marilyn; Smith, Edward A.; Streett, Rosalie; and Hardy, Janet B. "Evaluation of a Pregnancy Prevention Program for Urban Teenagers." Family Planning Perspectives 18 (1986):119-123.

An evaluation of a pregnancy prevention program for urban teens. The program began 11/81 and the clinic opened in 1/82. Services were available through 6/84. The study sample consisted of two junior high schools and two high schools in the Baltimore school district. The treatment schools served a more highly disadvantaged, all-black population. The study only examines the black students in the more racially diverse comparison schools. Substantial (the report does not say if statistically significant) differences between groups at baseline were found. The authors estimate impacts at the school-wide level and find some favorable, statistically significant results.


Matched site comparison groups -- with modeling

Casswell, Sally, and Gilmore, Lynnette. "An Evaluated Community Action Project on Alcohol." Journal of Studies on Alcohol 50 (1989):339-346.

An evaluation of an alcohol problems prevention program which was conducted between 1982 and 1985 in New Zealand. Six communities were divided into two groups of three based on socio-demographic characteristics. Each community in a group was allocated (not randomly for fear of spillover effects from the media campaign) to a treatment status: control, media campaign, media campaign and community organizer.

The authors used principal components analysis and a three-way ANOVA to test for the effects of age and gender on the main outcome measures. Used contrasts to test for significant differences 1) in city characteristics at baseline, 2) in characteristics over the course of the evaluation, and 3) in the change over time between treatment pairs (p. 342). Small, but statistically significant, favorable impacts found for the intensive treatment level.


Guyer, Bernard; Gallagher, Susan S.; Chang, Bei-Hung; Azzara, Carey V.; Cupples, L. Adrienne; and Colton, Theodore. "Prevention of Childhood Injuries: Evaluation of the Statewide Childhood Injury Prevention Program (SCIPP)." American Journal of Public Health 79 (1989):1521-1527.

An evaluation of an injury prevention program implemented between 9/80 and 6/82. A hospital-based surveillance system monitoring the incidence of specific injuries was in place between 9/79 and 8/82. Telephone surveys were also conducted in 8/80 and 8/82. Five treatment and five comparison groups were matched on sociodemographic characteristics. The study used 1970 Census data to match, but by 1980 a number of changes (in rates of minorities and low-income residents) had taken place in the communities, making them less comparable. Also, baseline surveys indicate that high levels of injury prevention behaviors already existed, making effects harder to show. The design made it difficult to untangle the effects of different components.

The authors used a 2-factor (community effect, time effect) ANCOVA model with socio-economic status (SES) as the covariate.


Farkas, George; Olsen, Randall; Stromsdorfer, Ernst W.; Sharpe, Linda C.; Skidmore, Felicity; Smith, D. Alton; and Merrill, Sally (ABT). "Post-Program Impacts of the Youth Incentive Entitlement Pilot Projects." New York, NY: Manpower Demonstration Research Corporation, June 1984 (see also Gueron, Judith. "Lessons from a Job Guarantee: The Youth Incentive Entitlement Pilot Projects." New York, NY: Manpower Demonstration Research Corporation, June 1984. And, Farkas, George; Smith, D. Alton; and Stromsdorfer Ernst W. "The Youth Entitlement Demonstration: Subsidized Employment with a Schooling Requirement." The Journal of Human Resources 18 (1983):557-573.)

An evaluation of the Youth Incentive Entitlement Pilot Project (YIEPP), which operated between 1978 and 1980. It was an employment entitlement program with required high school attendance targeting low-income youths (ages 16-19) in order to improve their long-term employment opportunities and earnings potential through education and guaranteed employment.

Each of the four large-scale pilot sites chosen to be evaluated was matched with a comparable comparison site. Pilot site determination attempted to create a representative sample -- e.g., ethnic and geographic diversity -- while maintaining an adequate balance between costs and sample size. Matching was based on weighted variables which were thought to have potential influence on the outcomes, such as characteristics of the labor market, population, high school drop-out rate, socio-economic conditions, and geographic proximity. Variables were weighted based on the strength of their predictive power. Regression analysis was used to control for the remaining differences between groups using three sets of variables: demographic characteristics, individual-specific characteristics (e.g., prior earnings history), and a constant plus a treatment dummy variable.
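
The weighted-variable matching described above can be illustrated with a small sketch (hypothetical site characteristics and weights, not the actual YIEPP data): each candidate comparison site's standardized characteristics are compared to the pilot site's, weighted by assumed predictive power, and the closest candidate is selected.

```python
import numpy as np

# Hypothetical site characteristics (rows: candidate comparison sites;
# columns: unemployment rate, poverty rate, high school dropout rate).
candidates = np.array([
    [7.2, 22.0, 18.5],
    [9.1, 30.5, 24.0],
    [8.8, 28.0, 22.5],
    [5.5, 15.0, 12.0],
])
pilot = np.array([9.0, 29.0, 23.0])

# Weights reflecting each variable's assumed predictive power for the
# outcomes of interest (picked arbitrarily here for illustration).
weights = np.array([0.5, 0.3, 0.2])

# Standardize each characteristic, then compute a weighted distance from
# the pilot site to every candidate and choose the closest.
mean, std = candidates.mean(axis=0), candidates.std(axis=0)
z_cand  = (candidates - mean) / std
z_pilot = (pilot - mean) / std
distance = np.sqrt(((z_cand - z_pilot) ** 2 * weights).sum(axis=1))

print("weighted distances:", distance.round(2))
print("best comparison site index:", distance.argmin())
```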

The authors attempted to control for selection bias through the collection of longitudinal data on the eligible population in both the demonstration and comparison sites. There were four waves of surveys, including surveys for nonrespondents and remote movers, and school records. This sampling scheme also allowed a closer examination of the participation decision.

Unforeseen changes in labor market conditions, institutional structures (busing, teachers' strikes), and other political problems seriously hampered the strength of impact estimates. One of the pilot sites, Denver, was reduced to a limited-slot program due to implementation problems and was consequently dropped from the evaluation.

The lack of long-term post-program data and the interaction between site and ethnic effects are noted as other major analytical problems.

Data from the nonrespondent surveys (collected during waves three and four) was used to test for attrition bias; no effect was found on the substantive results of the study.


Dynarski, Mark, and Corson, Walter. "Technical Approach for the Evaluation of Youth Fair Chance." Proposal -- has been accepted by DOL. Princeton: Mathematica Policy Research, June 1994.

A proposed design for the evaluation of Youth Fair Chance (YFC), a collection of saturation programs broadly aimed at increasing the employment opportunities of youths in high-poverty communities.

The proposal is to examine pre-program trends in demonstration communities and to use 1980 and 1990 Census data for matching communities (p.39) via cluster analysis (p.60). One-to-one matching is based on poverty level, geographic proximity, characteristics linked to outcome measures, baseline values of outcome measures, and representativeness (demographic/service environment) (p.55). The researchers will also examine the face validity of the match through discussions with site experts (p.39).

Baseline data will be collected for all participants when they enter.

Power analysis will be done to determine adequate sample sizes for reasonable statistical power (also for subgroups). For the outcomes analysis, multiple regression with OLS for continuous dependent variables, and probit or logit with maximum likelihood estimation for dichotomous dependent variables, is proposed to control for measurable differences.

No discussion about how to deal with selection bias.

(note: The following is not a community-wide study.)

Mallar, Charles; Kerachsky, Stuart; Thornton, Craig; and Long, David. "Evaluation of the Economic Impact of the Job Corps Program: Third Follow-Up Report." Mimeographed. Princeton: Mathematica Policy Research, September 1982 (see also Long, David A.; Mallar, Charles D.; and Thornton, Craig V. D. "Evaluating the Benefits and Costs of the Job Corps." Journal of Policy Analysis and Management 1 (1981):55-76.)

An evaluation of Job Corps, a voluntary residential job training program aimed at positively influencing outcomes of disadvantaged youth. Evaluation period was 1977-1981. Original study sample (participants during the spring of 1977) was followed for approximately 4 years. Comparison groups consisted of matched youths in areas with limited knowledge of program.

Multiple regression was used to attempt to control for both observed and unobserved differences. The comparison group was chosen based on a sequential matching procedure -- first matching sites, then matching individuals within sites. The analysts found areas with minimal Job Corps participation and then assigned "selection probabilities" based on similarities to heavily saturated Job Corps areas in terms of socio-economic characteristics such as race and income level (sites in close geographic proximity were eliminated). Youths within these designated comparison sites were then assigned selection probabilities in a similar manner. Sampling units for comparison groups were zip code areas (3-digit for rural, 5-digit for urban). Samples were large enough to ensure a 90% chance of detecting statistically significant changes.

Census data were used for choosing comparison sites; for individuals within comparison sites, data were obtained from high school drop-out lists and records from the local employment agencies.

An error-components model was used in the analysis to account for the correlation of individual-specific error terms over time. The analysts used only variables that would not be affected by Job Corps participation (if a variable possibly could be affected, they used a lagged value). The 2-stage model first estimates individual error components using OLS and then substitutes them into a generalized least squares model; it controls for varying lengths of follow-up and missing data. The analysts also modeled participation to adjust for selection bias; most of the variables in this equation are the same as in the main outcome equations, but it has a slightly different functional form and includes two proxy variables for knowledge of the Job Corps program. Separate estimates are calculated for relevant subgroups.

Since participants were sampled at a point in time (rather than following a baseline sample of enrollees), the sample over-represents participants who stayed in the program longer.

Results showed effects on employment and earnings and on criminal behavior.

A new national evaluation of Job Corps using random assignment is currently underway.


*Ketron. "Final Report of the Second Set of Food Stamp Workfare Demonstration Projects." Mimeographed.Wayne, Penn.: Ketron, September 1987.

An evaluation of Food Stamp Workfare, a mandatory employment training program for food stamp recipients in demonstration areas. Mandatory nature of program should reduce most selection bias. The analysts compared outcomes of participants to food stamp recipients in other sites who, while subject to work registration rules, didn't have the stringent requirements of Workfare.

Sampled first-time referrals (to avoid over representing long-term AFDC dependent individuals) -- referred during the period March to April 1981 (p.15).

Maximum amount of follow-up time was 9 months, minimum of 3 months.

Comparison group members in matched sites consisted of those referred to food stamp work registration during early 1981 (p.18). They were subject to work registration rules which were less stringently implemented -- sanctions less consistently applied --than in Workfare sites.

Sites were matched on variables shown through regression analysis to be strong predictors of the outcome measures (p.14), including characteristics of the local areas, population, and food stamp caseload. Individuals were chosen within these sites who were referred during the same period as Workfare participants and matched again on characteristics expected to influence outcome measures (p.18). The comparison sample was then weighted to make it representative of individuals in demonstration sites (p.19).

The analysts calculated separate regression-adjusted estimates for each subgroup, and then formed a weighted average of these estimates (p.79). They used as independent variables only those characteristics which were used to stratify the sample (those used to match comparison sites and level of participation) (p.81, p.C.2).

A variance (or error) components model with generalized least squares estimation methods (p.C.2) was used for the analysis, and alternative specifications yielded similar results. The analysts performed a variety of sensitivity tests (pp.96-104) based on varying assumptions and definitions of measures. The range of estimates was narrow for some measures but not for others, and generally narrower for men than for women.

The study examines the pre-implementation behavior of both groups in terms of the outcome measures (p.24) (e.g., the path of food stamp receipt for a year prior -- p.102). It examines time patterns of food stamp receipt and employment prior to referral (p.101): these were similar for males, not as similar for females.

With respect to possible non-respondent bias, the analysts argue that: 1) differences in response rates by subgroup were generally fairly small, so potentially not a problem; 2) the regression model should control for some potential biases (p.98); 3) however, the evaluators performed simulations based on different hypotheses as to the experience of non-respondents and found the estimates to be "quite sensitive" (p.100).

The authors performed cost/benefit analysis from both government and social perspectives and analyzed incentive vs. training effects.

The San Diego site seemed to be a problem: its inclusion/exclusion creates statistically significant changes in estimates for females. There were differences in program implementation there (a 10-day job search period vs. a 30-day period everywhere else), but it was also the "only site located in a very large urban center" (p.112), and it had participated in the first set of demonstrations, so the staff had experience.


Polit, Denise; Kahn, Janet; and Stevens, David. "Final Impacts from Project Redirection." Mimeographed. New York, NY: Manpower Demonstration Research Corporation, April 1985

An evaluation of Project Redirection, a program targeted at low income adolescents who were either pregnant or had children. The original demonstration was implemented in 4 sites in 1980 and ended in 1983. The first sample enrolled between 8/80 and 3/81 and was followed for two years. The sample was later expanded to include those who enrolled between 3/81 and 1/82 (no baseline data on this sample were available, as the decision to include them in the evaluation was made after they had already enrolled). Sample II was also followed for 24 months.

Comparison groups came from cities matched on socio-economic and geographic characteristics. Teens within those cities were matched based on eligibility, and these comparison group members were recruited in a fashion similar to the way program participants were recruited. Stratified matching was used to balance age, ethnicity, baseline similarity, and receipt of services from teen parenting programs.

The analysts used an ANCOVA model to adjust for measured differences between treatment and comparison groups. Separate estimates were made for various subgroups (however, site and ethnicity were confounded to a large degree).

Problems arose due to an increase in the availability of competing services for comparison group members. Examinations of attrition bias were made.


Steinberg, Dan. "Induced Work Participation and the Returns to Experience for Welfare Women: Evidence from a Social Experiment." Journal of Econometrics 41 (1989):321-340.

An evaluation of Work Equity (which ran between 7/78 and 3/81), a mandatory training/job search program for new AFDC clients and for current clients with changes in exemption status between 7/78 and 8/80. Work experience data were available from 4 years prior to baseline. Participants were followed for two years after baseline.

Work Equity replaced WIN in St. Paul and 7 neighboring communities. The comparison group was Minneapolis and communities in close proximity which were still operating under WIN.

Generalized analysis of covariance was used, accounting for attrition and endogenously missing data. Five simultaneous equations were estimated, modeling 1) attrition between periods, 2) 1st period selection, 3) 2nd period selection, 4) 1st period log wage, and 5) 2nd period log wage.

Small sample size seriously hinders the power of significance tests (especially in terms of employment probabilities). The author discusses attrition bias.


Devaney, Barbara; McCormick, Marie; and Howell, Embry. Design Reports for Healthy Start Evaluation: Evaluation Design, Comparison Site Selection Criteria, Site Visit Protocol, Interview Guides. Mimeographed. Princeton: Mathematica Policy Research, 1994.

This is a design for an upcoming evaluation of Healthy Start.

The proposal is to use comparison sites (two per treatment site), matched on infant mortality rates and trends, location, socio-demographic characteristics, and access to prenatal services (p.22). The study will also contrast participants and non-participants within sites.

Program participation is voluntary.


Brown, Randall; Burghardt, John; Cavin, Edward; Long, David; Mallar, Charles; Maynard, Rebecca; Metcalf, Charles; Thornton, Craig; and Whitebread, Christine. "The Employment Opportunity Pilot Projects: Analysis of Program Impacts." Mimeographed. Princeton: Mathematica Policy Research, February 1983.

An evaluation of the Employment Opportunities Pilot Project (EOPP), which operated from mid-1979 to mid-1981. This was a voluntary program for the most part (in some locations AFDC recipients who had been required to participate in WIN were now required to participate in EOPP, p.184).

The study sampled all adults in low-income households in 1979 to avoid selection bias. It also sampled those who enrolled in EOPP between 2/1/80 and 2/28/81. Data were from the 1st quarter of (12/78-2/)1979 through the last quarter of (9-11/)1981. Matched comparison sites were developed for use in one of the three analysis models. The three models were (p.138): 1) percent change in outcomes for treatment vs. comparison sites; 2) percent change in outcomes for enrollees (participants and non-participants) vs. non-enrollees; 3) relative employment probabilities between unemployed EOPP enrollees (participants and non-participants) and the general unemployed low-income population (non-enrollees).

Pitfalls include selection bias in the last two models and difficulty in untangling who benefits or loses in the first model. There were general difficulties caused by small sample sizes, such as in the measurement of subgroup effects. This was aggravated by low enrollment (approx. 10%) and participation (approx. 6%) rates among eligibles. The study proposes alternative approaches which were not used because they were difficult and expensive and required better data.


Matched Pair Comparison Sites

Long, Sharon K., and Wissoker, Douglas A. "Final Impact Analysis Report: The Washington State Family Independence Program." Draft. Washington, D.C.: Urban Institute, April 1993

An evaluation of the Washington State Family Independence Program (FIP), an initiative implemented in July 1988 which sought to decrease welfare dependence and improve employment potential.

The evaluation used a comparison group strategy: 1) east/west and urban/rural stratifications were created within the state in order to obtain a geographically representative sample; 2) within five of these subgroups, pairs of welfare offices, matched on local labor market and welfare caseload characteristics, were chosen and randomly allocated to either treatment (FIP) or control (AFDC) status (p.3). This strategy could reduce some of the systematic differences between treatment and control groups if sample sizes are large enough.
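
A minimal sketch of this matched-pair allocation (hypothetical office names, strata, and matching scores) is given below: within each stratum, offices are ordered by a matching score, adjacent offices are paired, and one member of each pair is randomly assigned to treatment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Hypothetical welfare offices with a stratum label and a caseload-based
# matching score; the names and numbers are illustrative only.
offices = pd.DataFrame({
    "office":  list("ABCDEFGH"),
    "stratum": ["urban"] * 4 + ["rural"] * 4,
    "score":   [0.82, 0.79, 0.55, 0.57, 0.31, 0.64, 0.35, 0.60],
})

# Within each stratum, sort by the matching score and pair adjacent offices,
# then randomly allocate one member of each pair to treatment.
assignments = []
for _, group in offices.groupby("stratum"):
    ordered = group.sort_values("score").reset_index(drop=True)
    for i in range(0, len(ordered), 2):
        pair = ordered.iloc[i:i + 2].copy()
        flip = rng.integers(2)            # which member of the pair is treated
        pair["status"] = ["treatment" if j == flip else "control"
                          for j in range(len(pair))]
        assignments.append(pair)

print(pd.concat(assignments)[["office", "stratum", "score", "status"]])
```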


Davis, Elizabeth. "The Impact of Food Stamp Cashout on Household Expenditures: The Alabama ASSETS Demonstration." In New Directions in Food Stamp Policy Research, edited by Nancy Fasciano, Daryl Hall, and Harold Beebout. Draft Copy. Princeton: Mathematica Policy Research, 1993.

An evaluation of The Alabama Avenues to Self-Sufficiency through Employment and Training Services (ASSETS) Demonstration.

(This design was used for the entire demonstration although the specific report we have is on the evaluation of the food stamp cashout component of the program. Also, this demonstration should not be confused with the Alabama Food Stamp Cash-Out Demonstration which used random assignment and took place during 5/90-12/90.)

Food stamp cashout was implemented in 1990; the evaluation was conducted from 8/91 to 11/91 (p.51). Selection of demonstration and comparison sites was done through identification of three key strata (rural/north, rural/south, and urban), choice of a pair of counties within each stratum, and random allocation of treatment or control status to each member of the pair. Counties were matched on caseload characteristics and population size.

Unit of analysis was households.


Institutional Comparison

Dynarski, Mark; Hershey, Alan; Maynard, Rebecca; and Adelman, Nancy. "The Evaluation of the School Dropout Demonstration Assistance Program -- Design Report: Volume I." Mimeographed. Princeton: Mathematica Policy Research, October 12, 1992.

An evaluation design for the School Dropout Demonstration Assistance Program (note: "targeted projects" had random assignment; "restructuring projects" had comparison institutions). The evaluation was to be conducted over the period 1992-1995.

The design called for matched schools in "clusters" (elementary schools which fed into middle schools which fed into a high school). First, the analysts identified several candidate comparison school clusters based on characteristics correlated with dropout rates: "attendance rates, dropout rates, minority populations, limited English proficiency, free or reduced-price lunches, and standardized test scores" (p.61). They then assessed face validity by talking to local staff. Students within schools were sampled randomly (p.73).


NOTES


1. In an appendix we provide annotated examples of studies using various evaluation strategies. In the text we refer to these examples by author and date.


2. Public/Private Ventures, "Community Ecology and Youth Resilience," April 1994, p. 8.


3. P. Brown and H. Richman, "Communities and Neighborhoods: How Can Existing Research Inform and Shape Current Urban Change Initiatives?" Background memorandum prepared for the Social Science Research Council Policy Conference on Persistent Poverty, November 1993.


4. We recognize that even with random assignment, problems remain which can only be addressed with non-experimental methods, in particular attrition from research measurement in follow-up periods.


5. The most thorough discussion of the problems of defining neighborhood or community of which we are aware is Chaskin, Robert J., "Defining Neighborhoods," a background paper of the Neighborhood Mapping Project of the Annie E. Casey Foundation, The Chapin Hall Center for Children, University of Chicago, June 1994. We draw heavily on this piece in this section.


6. Ibid., p. 30.


7. Ibid., p. 47.


8. T. Fraker and R. Maynard, "Evaluating Comparison Group Designs with Employment-Related Programs," Journal of Human Resources, Winter 1987; R. LaLonde, "Evaluating the Econometric Evaluations of Training Programs with Experimental Data," American Economic Review, September 1986; R. LaLonde and R. Maynard, "How Precise are Evaluations of Employment and Training Programs: Evidence from a Field Experiment," Evaluation Review, August 1987.


9. D. Friedlander and P. Robins, Manpower Demonstration Research Corporation Working Paper, February 1994.


10. See J. Heckman and J. Hotz, "Choosing Among Alternative Nonexperimental Methods for Estimating the Impact of Social Programs: The Case of Manpower Training," Journal of the American Statistical Association, December 1989.


11. See Fraker and Maynard, op. cit., for a discussion.


12. The Friedlander and Robins study found little difference between controlling for measured differences in characteristics through a common linear regression model and using pairs matched on the Mahalanobis measure. Fraker and Maynard also compare Mahalanobis matches with other matching methods and find no clear indication of superiority.
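For illustration only, the following sketch matches each treatment unit to its nearest comparison unit by Mahalanobis distance on observed characteristics; the data are simulated, and this is not the procedure used in either study cited.

    # Mahalanobis-distance matching on simulated characteristics.
    import numpy as np

    rng = np.random.default_rng(0)
    X_treat = rng.normal(size=(5, 3))    # characteristics of treatment units
    X_comp = rng.normal(size=(20, 3))    # characteristics of the comparison pool

    # The Mahalanobis distance weights characteristics by the inverse covariance
    # of the pooled data, so variables are compared on a common scale.
    pooled = np.vstack([X_treat, X_comp])
    cov_inv = np.linalg.inv(np.cov(pooled, rowvar=False))

    def mahalanobis(x, y):
        d = x - y
        return float(d @ cov_inv @ d)

    for i, x in enumerate(X_treat):
        distances = [mahalanobis(x, y) for y in X_comp]
        match = int(np.argmin(distances))
        print(f"Treatment unit {i} matched to comparison unit {match}")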


13. Campbell, D. T., and Stanley, J. C., Experimental and Quasi-Experimental Designs for Research, Chicago: Rand McNally, 1966.


14. A classic reference for these methods is Box, G. E. P., and Jenkins, G. M., Time Series Analysis: Forecasting and Control, San Francisco: Holden-Day, 1976. Several applications of time-series modeling to program evaluation are presented in New Directions for Program Analysis: Applications of Time Series Analysis to Evaluation, edited by Garlie A. Forehand, New Directions for Program Evaluation, number 16, a publication of the Evaluation Research Society, Scarvia B. Anderson, Editor-in-Chief, San Francisco: Jossey-Bass, Inc., December 1982.
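As a rough illustration of how such a time-series (interrupted-series) model might be implemented today, the sketch below fits an AR(1) model with an intervention dummy to simulated monthly data using the statsmodels library; the data, the AR coefficient, and the +2.0 level shift are assumptions made for the example, not results from any study cited here.

    # Interrupted time-series sketch on simulated data.
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(1)
    n_pre, n_post = 48, 24

    # Simulated monthly outcome: an AR(1) series with a level shift of +2.0
    # after the program starts.
    y = np.zeros(n_pre + n_post)
    for t in range(1, len(y)):
        y[t] = 0.6 * y[t - 1] + rng.normal(scale=1.0)
    intervention = np.concatenate([np.zeros(n_pre), np.ones(n_post)])
    y = y + 2.0 * intervention

    # Fit ARIMA(1,0,0) with the intervention dummy as a regressor; its
    # coefficient estimates the post-program shift relative to the
    # counterfactual projected from the pre-program series.
    model = ARIMA(y, exog=intervention, order=(1, 0, 0))
    results = model.fit()
    print(results.summary())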


15. There is a rich literature on the closely related development of simulation models used to estimate the likely effects of proposed program reforms in taxes and expenditures. See, for example, Citro, Constance, and Hanushek, E. A., eds., 1991, Improving Information for Social Policy Decisions: The Uses of Microsimulation Modeling, Volume 1, Washington, D.C.: National Academy Press.


16. Bloom, Howard, 1984, "Accounting for No-Shows in Experimental Evaluation Designs," Evaluation Review, April.
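The no-show adjustment Bloom discusses can be sketched with made-up numbers: the experimental (intent-to-treat) impact is divided by the participation rate to estimate the impact on those who actually participated, under the assumption that the program offer has no effect on no-shows.

    # Bloom-style no-show adjustment, with hypothetical values.
    mean_outcome_treatment_group = 0.46   # treatment group mean, including no-shows
    mean_outcome_control_group = 0.40     # control group mean
    participation_rate = 0.60             # share of the treatment group that participated

    intent_to_treat_impact = mean_outcome_treatment_group - mean_outcome_control_group
    impact_on_participants = intent_to_treat_impact / participation_rate
    print(f"ITT impact: {intent_to_treat_impact:.3f}")
    print(f"Estimated impact on participants: {impact_on_participants:.3f}")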


17. Moffitt, Robert, and Barbara Wolfe, 1992, "The Effect of the Medicaid Program on Welfare Participation and Labor Supply," Review of Economics and Statistics 74 (4).


18. In the Supported Work experiment we attempted to test for the effects of length of stay by systematically varying across sites the maximum length of stay individuals were allowed on the Supported Work job. Ironically, in the sites where longer stays were allowed, the program operators decided not to use the option, so this treatment variation was never implemented.


19. In the National Job Training Partnership Act evaluation, random assignment to treatment and control was carried out after individuals had been assigned to different training streams; therefore it was possible to make unbiased estimates of differences between, for example, classroom training and on-the-job training. Even here, however, "dilution" of treatment occurred, since some of the treatment group members ended up in different streams than those to which they had initially been assigned. See Bloom, Howard, Larry Orr, George Cave, Stephen Bell, and Fred Doolittle, 1993, "The National JTPA Study," Report to the U.S. Department of Labor, Bethesda, Md.: Abt Associates.


20. Many of these remarks would apply equally to situations in which constructed comparison groups are used, in the sense that the interaction effects themselves do not add further problems of bias beyond those associated with the basic problems of constructed comparison groups.


21. One need not have one of the treatment combinations alone if there is no interest in its singular effect. For example, there may be no interest in (or expectation of) an effect of training alone on births, in which case one would have only the training-plus-sex-education and the sex-education-alone groups.


22. P. Brown and H. Richman, op. cit., p. 8.


23. See Garfinkel, Irwin, C. Manski, and C. Michalopoulos, 1992, "Micro Experiments and Macro Effects," in Manski, Charles, and Irwin Garfinkel, eds., Evaluating Welfare and Training Programs, Cambridge, Mass.: Harvard University Press.


24. See Hollister, Robinson, and Robert Haveman, 1991, "Direct Job Creation...," in Bjorklund, A., Haveman, R., Hollister, R., and Holmlund, B., eds., Labour Market Policy and Unemployment Insurance, Oxford: Clarendon Press, for a full discussion of the problems of displacement and attempts to measure it.


25. The best attempt to measure displacement we know of is J. Crane and D. Ellwood, "The Summer Youth Employment Program: Private Job Supplement or Substitute," Harvard Working Paper, 1984, but even this has serious problems. It used not comparison sites but data on national enrollments in the Summer Youth Employment Program and data on SMSA labor markets from the Current Population Survey. The national program was large enough to have impacts on local youth labor markets, and the time-series data from the CPS made it possible to attempt to create a counterfactual with an elaborate statistical model.


26. In the medical experimentation literature there is also some discussion about optimal stopping rules, which introduce time considerations into decisions about when to terminate clinical trials as information accumulates.


27. There has been some work on dynamic sample allocation. Here learning is introduced sequentially: as information flows back about the variances of outcome variables, and to some degree about initial estimates of response, the sequentially enrolled sample can be reallocated among treatments or among subgroups so as to maximize the information obtained from a given level of expenditure on the study. The National Supported Work Demonstration used such a sequential design to a limited degree.
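One simplified version of such a reallocation rule is sketched below with hypothetical treatment arms and variance estimates; it uses a Neyman-style proportional-to-standard-deviation rule and is not the procedure used in Supported Work.

    # Allocate remaining sample across arms in proportion to estimated standard
    # deviations, to reduce the variance of impact estimates for a fixed total
    # sample. Arm names and standard deviations are hypothetical.
    estimated_std_dev = {"treatment A": 8.0, "treatment B": 12.0, "control": 10.0}
    remaining_sample = 600

    total_sd = sum(estimated_std_dev.values())
    allocation = {
        arm: round(remaining_sample * sd / total_sd)
        for arm, sd in estimated_std_dev.items()
    }
    print(allocation)  # {'treatment A': 160, 'treatment B': 240, 'control': 200}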


28. See J. Heckman and J. Hotz, "Choosing Among Alternative Nonexperimental Methods for Estimating the Impact of Social Programs: The Case of Manpower Training," Journal of the American Statistical Association, December 1989.


29. Michael Wiseman of the University of Wisconsin made some partial steps in this direction in work he did for Urban Strategies in Oakland.

