Project Diary

  • 500 Cities dataset: census tract shapes and census tracts shared by cities

    The 500 Cities data is from the 497 largest cities in the United States, plus Burlington, Vermont; Charleston, West Virginia; and Cheyenne, Wyoming so that a city from each state is included in the dataset. The number of cities in each state included in this dataset ranges from 1 to 121. Within these cities, estimates are calculated for 27 chronic disease measures for each census tract in the city. 

    Two discoveries we made during exploration of the census tract data were 1) 500 Cities census tract GIS data did not cover the entire designated census tract (Figure 1), and 2)  some census tracts overlap with more than one city geography (Figure 2). 

    The goal of our study was to connect environmental exposure data on natural disasters to the health outcome data in the 500 Cities dataset. Because some of our environmental data were continuous across an entire census tract, we needed to decide how to link the continuous environmental data with the small-area estimates of health outcomes from the 500 Cities project. Missing areas in the 500 Cities GIS census tract data were due to the presence of non-populated land such as parklands. Our solution was to use the continuous US Census GIS data for the census tract areas: we took the geography of the complete census tract and linked it with the 500 Cities health data using GIS software.

    As cities have grown in population and incorporated and developed land, some cities that were once separated by agricultural buffers are now adjacent. We found that some census tracts included in the 500 Cities dataset cover more than one city. In Figure 2 you can see a census tract that overlaps with the cities of Anaheim and Fullerton, California. Within the 500 Cities dataset, this particular census tract has two rows and two different small-area estimates of health outcomes based on the different demographics of the two cities in which it lies.

    In the 500 Cities dataset, 149 cities share at least one census tract with another city. As a result, 232 census tracts appear twice in the dataset, and two tracts appear three times. Figure 3 plots the number of shared census tracts for each of these 149 cities. For example, Anaheim, California shares 21 census tracts with other cities, which means that 21 of the census tracts in Anaheim are repeated in the dataset. The dataset assumes that researchers will analyze the 500 Cities one city at a time at the census tract level; in an analysis of census tracts with a national scope, these duplicate rows are problematic when matching environmental data.
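
    A minimal Python sketch of how repeated tracts like these could be tallied. It assumes the data can be reduced to one (city, tract ID) pair per row; the `rows` list is hypothetical, and every tract ID except the Anaheim/Fullerton tract from Figure 2 is made up. IDs are stored as strings, with the leading zero of California's state code restored, so that they are not mangled by numeric parsing.

```python
from collections import Counter, defaultdict

# Hypothetical (city, tract ID) rows. Only 06059086701 corresponds to the
# Anaheim/Fullerton tract in Figure 2; the other IDs are invented.
rows = [
    ("Anaheim", "06059086701"),
    ("Fullerton", "06059086701"),
    ("Anaheim", "06059086702"),
    ("Fullerton", "06059001100"),
]

# Count how many rows each tract has across the whole dataset.
tract_counts = Counter(tract for _, tract in rows)

# A tract is "shared" when it appears in more than one row.
shared_tracts = {tract for tract, n in tract_counts.items() if n > 1}

# For each city, count how many of its tracts are shared with another city.
shared_per_city = defaultdict(int)
for city, tract in rows:
    if tract in shared_tracts:
        shared_per_city[city] += 1

print(sorted(shared_tracts))
print(dict(shared_per_city))
```

    The per-city counts are the kind of quantity plotted in Figure 3, and the values of `tract_counts` separate tracts that appear twice from those that appear three times.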

    In addition, census tracts that span more than one city are not necessarily evenly distributed in those cities. In Figure 2 you can see that the majority of the census tract covers land in the city of Anaheim, and a small fraction of the census tract is in Fullerton. How should researchers assign environmental impact estimates based on census tract geographies to the row in the 500 Cities dataset for the Fullerton portion of that census tract? The population in that area is being impacted by only a fraction of the environment of the entire census tract. 

    When addressing the complexity of overlapping census tracts, we reviewed three options: 1) assign a shared census tract to one city, 2) delete census tracts that are shared between cities, or 3) assign the same environmental exposure to the listed census tract in both cities.

    We chose the third solution to preserve the original dimensions of the 500 Cities dataset. Because our environmental exposure of interest was insurance claims associated with natural disasters, we made the assumption that natural disasters that occurred in a shared census tract could reasonably impact the reported mental and physical health of residents in the adjacent city.
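
    The third option amounts to a left join of tract-level exposure onto the 500 Cities rows, so that every row is kept and each copy of a shared tract receives the same value. The sketch below illustrates this under assumptions: the field names (`city`, `tract`, `exposure`), the tract IDs other than the Figure 2 tract, and the dollar amounts are all hypothetical.

```python
# Hypothetical tract-level exposure, e.g. total verified loss in dollars.
exposure_by_tract = {
    "06059086701": 125000.0,
    "06059086702": 0.0,
}

# Hypothetical 500 Cities rows; the first tract is listed under two cities.
city_rows = [
    {"city": "Anaheim", "tract": "06059086701"},
    {"city": "Fullerton", "tract": "06059086701"},
    {"city": "Anaheim", "tract": "06059086702"},
]

# Left join: keep every row, look up exposure by tract ID.
# Rows whose tract has no exposure record get None rather than being dropped.
linked = [
    {**row, "exposure": exposure_by_tract.get(row["tract"])}
    for row in city_rows
]

for row in linked:
    print(row["city"], row["tract"], row["exposure"])
```

    Because the lookup is keyed on tract ID alone, the row count of the original dataset is preserved, which is the property that motivated this choice.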

    Figure 1 500 Cities census tract GIS data (yellow on left), and entire census tract on right.

    Figure 2 Census tract 6059086701 in California is in both the cities of Anaheim and Fullerton. The boundary between these two cities, which intersects the orange census tract, is highlighted by the green arrow.

    Figure 3 149 cities in the 500 Cities dataset share at least one census tract with another city. For example, Anaheim, California shares 21 census tracts with other cities. This means that 21 of the census tracts in Anaheim are repeated in the dataset. Forty-three cities in the 500 Cities dataset share only one census tract with another city in the dataset.

  • Storm event category creation

    To answer our research question about how natural disasters are associated with mental and physical health, we needed to identify a data source for quantifying exposure to natural disasters. This dataset needed to cover the contiguous United States at the census-tract level so that we could combine it with the 500 Cities health dataset.

    One dataset we looked at was the National Oceanic and Atmospheric Administration (NOAA) National Weather Service (NWS) storm database (Murphy 2018). The storms recorded are considered “significant weather phenomena,” “rare, unusual, weather phenomena,” or “other significant meteorological events.” Storm events are considered significant if they “met local/regional/national threshold criteria, or generated impact, or was newsworthy.” Even storms that affected a small area should be entered into this database if deemed significant.

    Most storms are geolocated by the nearest center of a village, city, airport, or inland lake, and some have an associated latitude and longitude. This dataset fulfilled our requirement for being of national scope, but the reporting of storms is through local agencies so the NWS does not guarantee the accuracy or validity of the events gathered from outside sources.

    We downloaded NWS storm record files from 1996 through 2015. The NWS storm event database had 51 storm event categories. Our challenge was to reduce the storm event types (e.g., Flood, Tornado, Heat) to more general groups for our analysis of human health impacts.

    Our first step was to group marine-related events together, which left 42 storm event types for further categorization. Below is a table of the first attempt at grouping the NWS events into three general categories.

    Marine Thunderstorm Wind    Drought
    High Surf                   Excessive Heat
    Marine High Wind            Dense Smoke
    Seiche                      Wildfire
    Rip Current
    Tropical Storm
    Marine Strong Wind
    Marine Hail
    Tropical Depression
    Storm Surge/Tide
    Marine Tropical Storm
    Marine Dense Fog
    Marine Hurricane/Typhoon
    Coastal Flood
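
    A grouping like this one can be expressed as a lookup from NWS event type to general category. The sketch below is illustrative only: the event names are taken from the two columns listed above, while the group labels `marine` and `heat_drought_fire`, the `ungrouped` fallback, and the function name `group_for` are hypothetical names chosen for the example.

```python
# Event types per group, copied from the two columns of the table.
# The group labels themselves are hypothetical.
EVENT_GROUPS = {
    "marine": [
        "Marine Thunderstorm Wind", "High Surf", "Marine High Wind",
        "Seiche", "Rip Current", "Tropical Storm", "Marine Strong Wind",
        "Marine Hail", "Tropical Depression", "Storm Surge/Tide",
        "Marine Tropical Storm", "Marine Dense Fog",
        "Marine Hurricane/Typhoon", "Coastal Flood",
    ],
    "heat_drought_fire": [
        "Drought", "Excessive Heat", "Dense Smoke", "Wildfire",
    ],
}

# Invert to a flat event -> group lookup.
EVENT_TO_GROUP = {
    event: group
    for group, events in EVENT_GROUPS.items()
    for event in events
}


def group_for(event_type):
    """Return the general group for an event type, or 'ungrouped'."""
    return EVENT_TO_GROUP.get(event_type.strip(), "ungrouped")


print(group_for("Seiche"))
print(group_for("Wildfire"))
print(group_for("Tornado"))
```

    An explicit lookup is used rather than matching on the word "Marine" because several events in the first column, such as Coastal Flood and Tropical Storm, do not carry that prefix.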

    We decided not to use the NWS data because 1) the data are not uniformly guaranteed to be accurate and valid and 2) the dataset was most reliable at the county level, which made it difficult to match with the 500 Cities census tract data (see the Small Business Administration data diary entry).

    The NWS Storm Data also includes information about direct and indirect fatalities and injuries, property damage estimates, crop damage, and other related costs. This data set is ideal for a first check on the reported storm events in most US localities. The primary purpose of this database is to inform and support the NWS mission to issue accurate forecasts and warnings for hazardous weather events.


    Murphy, J. D. (2018). NWSI 10-16, Storm Data Preparation. Retrieved from

  • Selecting control variables within census data

    After selecting our health outcomes from the 500 Cities data and exploring several environmental datasets to derive data on exposure to natural disasters, we turned our attention to demographic and social variables that could be confounders in our statistical analysis. Confounders are variables that are associated with both the exposure and the outcome and that could distort the statistical relationship, potentially producing a spurious association in our analysis. We asked ourselves, “What are potential variables that could be associated geographically with where natural disasters occur and are associated with the health outcomes we are assessing?” A good example of where this might cause a problem in a national analysis in the U.S. is along the Gulf Coast, where hurricanes and tropical storms are common and where levels of poverty are historically higher than in other parts of the U.S. (NYTimes 2014).

    Within the field of environmental health, and particularly climate change and health, the degree to which people or communities are affected by climate variability and change is referred to as sensitivity (Gamble et al. 2016). A related concept is adaptive capacity, or the ability of communities or people to adjust to potential hazards, to take advantage of opportunities, or to respond to consequences of environmental exposures (Gamble et al. 2016). Age, sex, and income are three socioeconomic and demographic factors that are strongly and consistently found to be important determinants of sensitivity and adaptive capacity (see Gamble et al. 2016 for a full list of citations). Children and older adults, communities of color, and low-income communities are particularly vulnerable to the health impacts of climate change, including natural disasters.

    To include these variables in our analysis, we used the 2010 U.S. Census dataset to create several aggregated variables at the census tract level. For example, we combined the total population with the populations < 18 years of age and > 65 years of age to create a “percent vulnerable population” variable by census tract. We also created variables to assess the percent of single-parent households and the percent minority population in the census tracts in our study area.
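
    As a worked illustration, the “percent vulnerable population” variable is the share of a tract's residents who are under 18 or over 65. The Python sketch below assumes the three counts are already available per tract; the function name, argument names, and the example counts are hypothetical.

```python
def percent_vulnerable(total_pop, pop_under_18, pop_over_65):
    """Percent of a tract's population that is under 18 or over 65.

    Returns None for tracts with no recorded population, so that empty
    tracts are not reported as 0 percent vulnerable.
    """
    if total_pop <= 0:
        return None
    return 100.0 * (pop_under_18 + pop_over_65) / total_pop


# Hypothetical tract: 4,000 residents, 900 under 18, 500 over 65.
print(percent_vulnerable(4000, 900, 500))
```

    The same pattern (a count divided by the tract total) applies to the single-parent household and minority population percentages, with the appropriate denominator for each.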

    Gamble, Janet L, John Balbus, Martha Berger, Karen Bouye, Vince Campbell, Karletta Chief, Kathryn Conlon, et al. 2016. “Ch. 9: Populations of Concern.” In The Impacts of Climate Change on Human Health in the United States: A Scientific Assessment, 247–286. Washington, DC: U.S. Global Change Research Program. doi:10.7930/J0Q81B0T.

    NYTimes. 2014. “Mapping Poverty in America.”

  • Small Business Administration data: matching zip codes with census tracts

    While searching for data that 1) had a national scope at the census tract level and 2) would allow for an assessment of exposure to natural disasters, we found an article by Sahil Chinoy in the New York Times (2018) titled “The Places in the U.S. Where Disaster Strikes Again and Again.” Chinoy used U.S. Small Business Administration (SBA) Disaster Loan Database reimbursement data to quantify the damage of storm events in the United States.

    The SBA provides low-interest, long-term disaster loans to businesses and homeowners following a declared disaster, many of which are due to extreme storm events such as hurricanes. The SBA records are useful for quantifying the economic and physical loss due to these disasters, have a national scope, and are categorized by zip code. We used the HUD Zip Code Crosswalk Files to map a summation of total verified loss in individual zip codes to census tracts. We spent several team meetings working through the challenges of matching zip code and census tract geometries.

    Zip codes were designed for the efficient delivery of mail and not as geographies for spatial analysis. Because of the varying size and shape of zip codes, the effect of the Modifiable Areal Unit Problem (MAUP) is exacerbated when looking at the patterns of characteristic data drawn from datasets with other geographies such as census tracts. The U.S. Department of Housing and Urban Development (HUD) develops and maintains accurate cross-mapping files for the translation of zip code areas into other geographies. Within the HUD crosswalk file, each zip code is matched with overlapping census tracts. The dataset also includes variables giving the ratio of residential, business, and other addresses in the zip code that fall within each overlapping census tract.

    For our study, the multiple SBA data files for years 2000 to 2015 were joined into one file and cleaned of any non-existent zip codes such as NAN or 99999 values. The SBA dataset was merged with a census tract file using the HUD crosswalk files. This file was then grouped by census tract, and total annual verified monetary loss values at the zip code level were summed for each census tract. We also calculated the total verified monetary loss over the entire study period for each census tract. These new variables were joined to the 500 Cities data for further analysis.
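
    The clean, merge, and sum steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than a record of the processing we performed: the record layout, the zip-to-tract pairings, the ratios, and the dollar amounts are hypothetical. The sketch also weights each zip code's loss by a crosswalk ratio before summing, which is one common way to apply the crosswalk; the text above describes a sum by tract and does not state whether a weight was applied.

```python
from collections import defaultdict

# Hypothetical SBA records: (zip code, year, verified loss in dollars).
sba_records = [
    ("92801", 2005, 100000.0),
    ("92801", 2006, 50000.0),
    ("99999", 2005, 7000.0),  # placeholder zip code, dropped below
    ("NAN", 2006, 3000.0),    # missing zip code, dropped below
]

# Hypothetical crosswalk: zip code -> [(tract ID, address ratio)].
# Ratios for one zip code are assumed to sum to 1.
crosswalk = {
    "92801": [("06059086701", 0.75), ("06059086702", 0.25)],
}

INVALID_ZIPS = {"NAN", "99999"}

annual_loss = defaultdict(float)  # keyed by (tract ID, year)
total_loss = defaultdict(float)   # keyed by tract ID, whole study period

for zip_code, year, loss in sba_records:
    # Clean: skip invalid zip codes and zip codes absent from the crosswalk.
    if zip_code in INVALID_ZIPS or zip_code not in crosswalk:
        continue
    # Merge and sum: allocate the loss to each overlapping tract.
    for tract, ratio in crosswalk[zip_code]:
        annual_loss[(tract, year)] += loss * ratio
        total_loss[tract] += loss * ratio

print(dict(annual_loss))
print(dict(total_loss))
```

    Setting every ratio to 1 turns this into an unweighted sum, in which a zip code's full loss is counted in each tract it overlaps.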

    Figure: US Small Business Administration Total Verified Monetary Loss by Zipcode.

    Wilson, Ron and Din, Alexander. 2018. “Understanding and Enhancing the U.S. Department of Housing and Urban Development’s ZIP Code Crosswalk Files.” Cityscape: A Journal of Policy Development and Research, Volume 20 Number 2, 277–294.