Dependence-Preserving Approach to Synthesizing Household Characteristics

Shih-Chieh Kao, Hoe Kyoung Kim, Cheng Liu, Xiaohui Cui, and Budhendra L. Bhaduri

S.-C. Kao, C. Liu, and B. L. Bhaduri, Oak Ridge National Laboratory, One Bethel Valley Road, Oak Ridge, TN 37831. H. K. Kim, Dong-A University, 840, Hadan2-Dong, Saha-Gu, Busan 604-714, South Korea. X. Cui, New York Institute of Technology, 1855 Broadway, New York, NY 10023. Corresponding author: H. K. Kim, hoekim@dau.ac.kr.

Transportation Research Record: Journal of the Transportation Research Board, No. 2302, Transportation Research Board of the National Academies, Washington, D.C., 2012, pp. 192–200. DOI: 10.3141/2302-21

One effective approach to study day-to-day traveler behavior is through the activity-based traffic demand model, in which all travelers are treated as individual agents and interact under a computation-intensive framework. Nevertheless, because of high survey costs, low response rate, and privacy concerns, detailed household and personal characteristics are usually unavailable. Various population synthesizers were therefore proposed to reconstruct a methodologically rigorous estimate of household characteristics from different surveys. For instance, the iterative proportional fitting (IPF) algorithm is used to synthesize the full population from the public use microdata sample (PUMS) and Census Summary File 3 (SF3) in the popular activity-based traffic demand model, TRANSIMS. However, some fundamental limitations of IPF (e.g., zero cells in the contingency table as a result of small sample size) have drawn considerable attention and resulted in the development of enhanced IPF algorithms and other strategies. This paper proposes a copula-based method to synthesize household characteristics that preserves marginal distributions and dependence structure between variables. The proposed method is tested for the state of Iowa, and the results are compared with the IPF approach of TRANSIMS. The synthesized households reproduce the local SF3 statistics at each block group while retaining intervariable correlations similar to those in the PUMS, which suggests the applicability of the copula-based approach. Because marginal distributions and dependence structure can be faithfully preserved, the proposed method could be a suitable alternative to synthesize realistic agent characteristics for further activity-based traffic demand modeling.

Transportation planners and policy makers rely on reasonable estimates of dynamic traffic demands to support their decision making. For the most part, the dynamic traffic demand was estimated through conventional trip-based traffic demand models, in which four major components are included: trip generation, trip distribution, modal choice, and trip assignment. Nevertheless, a valid representation of the underlying travel behaviors is lacking. To address this drawback, researchers started to study activity-based models (i.e., models based on the activities of individual agents). The most critical characteristic of the activity-based approach is that travel demands are derived from simulated activities such as travel initiation; spatial choice; scheduling linkages between activities, locations, and times; and allowing alternate decision paradigms (1, 2). Accordingly, the activity-based models require realistic agent characteristics and detailed day-to-day activities, at home or elsewhere (3).

Because activity-based models allow interaction between individuals throughout the entire network instead of simply between traffic analysis zones, the demographic attributes of all individuals need to be collected. Although many transportation planning agencies, such as metropolitan planning organizations, are obligated to provide the most detailed and up-to-date traveler characteristics, applicable data are scarce mostly because of high survey costs, low response rate, long data processing time, and privacy concerns (4, 5). Therefore, many researchers focus on synthesizing household and personal demographic data from surveyed samples with behaviorally realistic and theoretically sound approaches. Ideally, such a data set should have a sufficiently large sample size and contain multiple household characteristics with correct intervariable correlations between key demographic variables. However, there is a common trade-off between data availability and the extent of spatial coverage. For example, the public use microdata sample (PUMS) provides detailed and useful demographic features with almost complete census records (6). However, PUMS is available only at a coarse geographic unit called a public use micro area (PUMA) with a 5% sampling rate; therefore, applications at a finer scale are not feasible. In contrast to the PUMS, the Census Summary File 3 (SF3), a collection of summary tables of demographic variables, is available at a much finer block group (BG) scale and has a higher sampling rate (16.7%) (7). However, the SF3 is released only in the form of low-dimensional histograms (usually less than three) for selected variables. The correlation between variables of interest is not always available.

One of the most typical examples of a population synthesizer used in the traffic demand model is the iterative proportional fitting (IPF) algorithm implemented in the Transportation Analysis and Simulation System (TRANSIMS), developed by Los Alamos National Laboratory during the mid-1990s for providing accurate and complete information on traffic effects, congestion, and pollution (8). The IPF approach was proposed by Deming and Stephan and was applied and modified in a number of transportation studies, including TRANSIMS (9). However, its explicit computational issues have been underlined by several research efforts, such as the existence of empty cells in the contingency table and no simultaneous control of household-level and person-level variables (10–13). Some research efforts have been made accordingly to reinforce the performance of the conventional IPF algorithm, and new methodologies have also been introduced and evaluated.

When complete household characteristics are synthesized from limited samples, one common approach is to first determine how important each sample is by evaluating the individual sample weights. These weights are then used as a basis to randomly select samples, with repetition allowed, to re-form the population space. For example, if the sampling rate is 5% and all samples have equal weights, each sample will be repeated about 20 times in the synthesized population, which carries a high risk of reidentification of the same statistical units. Although all surveyed demographic variables (e.g., income, household size, and race) can possibly be preserved, that will result in a great many identical members in the population, which is not preferred for activity-based traffic simulation (i.e., disaggregate microscopic simulation) because of nondifferentiable agents. In addition, because the weights are usually computed focusing on a few underlying variables and constraints, the confidence should not be extended to all variables. Furthermore, the paired correlation structure between variables is hard to capture. It would be overly ambitious to reconstruct all demographic variables simultaneously, given the curse of dimensionality. Depending on the nature of the problem, the relevant variables should be identified and the synthesizing procedures should be adjusted accordingly.

Therefore, a dependence-preserving approach is proposed to synthesize household characteristics through copulas.
Multiple research efforts have been made in the transportation field using copulas on traffic safety (14, 15) and vehicle type choice and utilization model development (16), but to the best of the authors’ knowledge this paper is the first attempt to generate synthetic household characteristics with copulas. The multivariate joint distributions are constructed to generate virtual households, in which surveyed marginal distributions and dependence structure are fully preserved. The synthesized households will result in the same SF3 statistics at each BG while having similar intervariable correlations as observed in the PUMS. The proposed method is applied for the state of Iowa, and its performance (i.e., marginal distributions and dependence between variables) is evaluated by comparing the mean, median, and correlation matrices calculated from the copula-based approach and the IPF approach in TRANSIMS so as to understand the potential strengths and limitations. The review of existing population synthesizers is presented next, followed by a description of the methodology and case study.

LITERATURE REVIEW

This section gives an overview of the existing research efforts on the synthetic population representation methods centered on IPF and enhanced IPF algorithms attempting to correct the aforementioned limitations of the IPF algorithm. The newly introduced population synthesizing strategies are enumerated as well.

IPF and Enhanced IPF Algorithms

The IPF algorithm is a popular population synthesizer stemming from the research of Deming and Stephan in 1940 (9). They proposed an iterative least squares adjustment method to fill each cell in the contingency table with constraining row and column totals obtained from alternative sources under various cases such as when one set or multiple sets of marginal totals are known. The resultant table of data is a joint probability distribution obtained when the probabilities are convergent within an acceptable (predefined) limit (17).
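The row-and-column adjustment described above can be sketched for the two-way case. The following Python example is a minimal illustration of classical IPF on a two-dimensional table, not the TRANSIMS implementation; the function name `ipf_2d`, the seed table, and the marginal totals are hypothetical and chosen only for illustration.

```python
def ipf_2d(seed, row_totals, col_totals, tol=1e-9, max_iter=1000):
    """Scale a seed table until its row and column sums match the targets.

    Illustrative sketch only: names and defaults are assumptions, and the
    targets are assumed to be consistent (equal grand totals).
    """
    table = [[float(v) for v in row] for row in seed]
    for _ in range(max_iter):
        # Row step: rescale each row to its target total.
        for i, target in enumerate(row_totals):
            current = sum(table[i])
            if current > 0:
                table[i] = [v * target / current for v in table[i]]
        # Column step: rescale each column to its target total.
        for j, target in enumerate(col_totals):
            current = sum(row[j] for row in table)
            if current > 0:
                for row in table:
                    row[j] *= target / current
        # Columns were just rescaled; stop once the rows match as well.
        worst = max(abs(sum(table[i]) - t) for i, t in enumerate(row_totals))
        if worst < tol:
            break
    return table


# Hypothetical 2 x 2 example with one empty seed cell.
seed = [[0, 2], [3, 4]]
fitted = ipf_2d(seed, row_totals=[40, 60], col_totals=[30, 70])
```

In this sketch a seed cell that starts at zero remains zero in every iteration, because each step only multiplies existing entries. This is the empty-cell behavior that the enhanced IPF algorithms reviewed in this section attempt to address.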
As the activity-based traffic demand model has been gaining more attention from transportation planners and researchers, Beckman et al. outlined a methodology for creating a synthetic baseline population of individuals and households to be used in an activity-based travel demand model (i.e., TRANSIMS) by expanding the original IPF algorithm on the basis of the aggregate (SF3) and disaggregate data (PUMS) (18). That is, although the traditional IPF procedure from Deming and Stephan fits only one BG at a time, Beckman’s IPF can simultaneously consider all BGs making up the PUMA. Also, Duguay et al. introduced an IPF algorithm for synthesizing household transportation survey data in the San Francisco Bay Area in California at the level of traffic analysis zone (19). They found that synthetic households generated with seminal data sources did not appropriately represent the actual demographic characteristics and were strongly biased, so they relied just on the census data, which means that the population synthesizer method can be transferred to other areas.

Guo and Bhat discussed the aforementioned two system behavioral issues underlying Beckman’s population synthesizer and proposed modifications and enhancements (10). Ye et al. underlined the same issues of the conventional IPF algorithm and developed a heuristic approach called the iterative proportional updating algorithm for generating a synthetic population while matching household-level and person-level joint distributions of control variables of interest (11). Auld and Mohammadian reinforced the conventional IPF algorithm by accounting for multiple levels of analysis units and control variables (household level and person level), concluding that their methodology improved the fit to the person controls at no cost to the fit against the household-level controls (12). Furthermore, Evers and Santapaola modified the conventional IPF algorithm for combining contingency tables with missing dimensions from a variety of data sources (e.g., traffic data, census data) with an example of traffic count data on German motorways (20).

However, some research efforts modified and revised the conventional IPF algorithm as one component of their comprehensive models. Mohammadian et al. proposed a process for developing disaggregate household travel time data for local areas for which actual travel survey data are not available; the process uses the IPF method for population synthesizing with PUMS and SF3 and the Monte Carlo simulation method for modeling household travel survey data from the National Household Travel Survey (NHTS) 2001 (4). Wheaton et al. developed a U.S. synthesized human agent database providing a realistic agent population for use in agent-based models (21). That is, they implemented the conventional IPF algorithm to synthesize the households and individuals in the 50 states and the District of Columbia on a county-by-county basis and assigned them to agents. Each household agent has been randomly located on a GIS map. Arentze et al. applied the conventional IPF as a population synthesizer for their own rule-based and activity-based travel demand model (Albatross) (13). They modified the IPF to fit the model input requirements; that is, known marginal distributions of individuals are converted to marginal distributions of households on relevant attributes, and the derived marginal household distributions are used as constraints of a multiway table of household counts.

Other Population Synthesizers

Besides the conventional and enhanced IPF algorithms, seminal research efforts have been made and are under development depending on data sources and data structure.
The typical synthetic population generation methods reviewed in this paper are the Monte Carlo simulation method, genetic algorithm, small area estimation (SAE) method, and combinatorial optimization method; concise descriptions and their findings are given.

Miyamoto et al. developed an agent-based approach for synthesizing household data with the Monte Carlo sampling method by limiting the possibility of determining reasonable relationships between the selected attributes (22). Moeckel et al. also took advantage of the Monte Carlo simulation method for synthetic population generation by using the multidimensional contingency table produced by the IPF method to control the probability that the specific attributes are selected (23).

Synthetic population representation is significantly subject to data contents and format. For instance, Birkin et al. deployed the genetic algorithm as an alternative to the IPF algorithm to synthesize the UK population because the spatial coding in the UK sample of anonymised records is even less discriminating than the PUMS and because the key income variable is not captured in the UK census (24).

Long et al. first introduced the SAE method to the synthetic reconstruction of household travel survey data (25). They implemented their method across 107 census tracts in Des Moines, Iowa, with data from the Census Transportation Planning Package and NHTS data sets, finding that the SAE method is a plausible statistical approach to provide reliable and unbiased travel information, although with several limitations, including that the SAE framework cannot preserve the distributions of target variables at the disaggregate level.

Ryan et al.
(26) and Voas and Williamson (27) compared the performance of two dominant population synthesis techniques: the IPF and the combinatorial optimization methods, the latter of which is far less common in the literature than the IPF method. They concluded that the combinatorial optimization method produces more accurate populations than the IPF method, with the experimental limitations of a relatively small population and a small number of selected attributes.

VIRTUAL HOUSEHOLD SYNTHESIZER

This section describes an approach to synthesize households via virtual household generation instead of resampling. Focusing on several preselected variables, the multivariate joint distribution is constructed from samples by using copulas. The joint distribution is then used to synthesize virtual households for the input data in the activity-based simulation. Although the virtual households do not have the full attributes of the samples, the marginal distributions and the entire dependence structure of selected variables can be precisely modeled. The method includes two major steps: household synthesizing and local household fitting.

Household Synthesizing

On the basis of survey sample data, the first step aims to synthesize virtual households that would result in the same statistics as the samples. Focusing on $d$ preselected variables of interest $\{X_1, \ldots, X_d\}$, the joint cumulative distribution function (CDF) $H_{X_1,\ldots,X_d}$ is constructed from $n$ available samples $(x_1, \ldots, x_d)_i$, $i = 1, \ldots, n$. The applicable variables should be continuous (e.g., household income) or discrete (e.g., number of vehicles in a household) or can be represented in cumulative numerical indices (e.g., highest education in a household). Categorical variables that have no numerical order (e.g., gender) cannot be included directly. In such a case, the conditional CDFs $H_{X_1,\ldots,X_d \mid X_{d+1}}$ should be constructed and combined for all categorical members.

Copulas are used to construct the joint distribution $H_{X_1,\ldots,X_d}$. Copulas, also known as joint probability distributions with uniform marginals, are a relatively new statistical tool used to model the complete dependence structure. The first use of the term “copula” is attributed to Sklar in a theorem describing how one-dimensional distribution functions can be combined to form multivariate distributions (28). For $d$-dimensional continuous random variables $\{X_1, \ldots, X_d\}$ with joint CDF $H_{X_1,\ldots,X_d}$ and marginal CDFs $u_j = F_{X_j}(x_j)$, $j = 1, \ldots, d$, Sklar showed that there exists one unique $d$-copula $C_{U_1,\ldots,U_d}$ such that (28)

$$C_{U_1,\ldots,U_d}(u_1, \ldots, u_d) = H_{X_1,\ldots,X_d}(x_1, \ldots, x_d) \tag{1}$$

Because $u_j$ can also be interpreted as a transformation of $x_j$ from $[-\infty, \infty]$ to $[0, 1]$, copula $C_{U_1,\ldots,U_d}$ is a mapping of $H_{X_1,\ldots,X_d}$ from $[-\infty, \infty]^d$ to $[0, 1]^d$. The consequence of this transformation is that the marginal CDFs are segregated from $H_{X_1,\ldots,X_d}$, and therefore $C_{U_1,\ldots,U_d}$ becomes relevant only to the association between variables and gives a complete mathematical characterization of the entire dependence structure. When copulas are used to construct joint distributions, after a most appropriate copula function $C_{U_1,\ldots,U_d}$ has been determined, the function is then combined with suitable marginals $u_j = F_{X_j}(x_j)$ to form the joint distribution $H_{X_1,\ldots,X_d}$. Because there is no restriction on the type of applicable marginal distributions, the form of $H_{X_1,\ldots,X_d}$ can be fairly flexible. The flexibility is desirable in household synthesizing because most of the demographic variables cannot be modeled by regular parametric probability distributions (e.g., Gaussian or exponential).

Figure 1 depicts the virtual household generation, taking household income ($X_1$) and household size ($X_2$) as an example. The marginal distributions $u_1 = F_{X_1}$ and $u_2 = F_{X_2}$ and copula $C(u_1, u_2)$ are estimated from samples $(x_1, x_2)_i$, $i = 1, \ldots, n$. With $C(u_1, u_2)$, $m$ correlated marginals $(u_1, u_2)_q$, $q = 1, \ldots, m$, can be generated (shown in Figure 1a). These $m$ correlated marginals $(u_1, u_2)_q$ are then transformed to variables $(x_1, x_2)_q$ by computing the inverse functions $x_{1q} = F_{X_1}^{-1}(u_{1q})$ and $x_{2q} = F_{X_2}^{-1}(u_{2q})$ (shown in Figure 1, b and c). In other words, the joint distribution $H_{X_1,X_2}(x_1, x_2) = C(F_{X_1}(x_1), F_{X_2}(x_2))$ can be easily derived and adopted for generation. Furthermore, because all joint probability distributions can be expressed in relation to copulas, this approach does not replace but adds to existing methods. More mathematical details about copulas can be found in Nelsen (29).

FIGURE 1 Virtual household generation: (a) copulas used to generate correlated marginals $u_1$ and $u_2$, (b) compute $x_1 = F_{X_1}^{-1}(u_1)$, and (c) compute $x_2 = F_{X_2}^{-1}(u_2)$.

Although most of the current copula applications are bivariate, they can be extended to higher dimensions as well. However, the copulas applicable to higher orders (>2) become limited as a result of the compatibility problem between lower-level marginals (30). For the dimension of interest in this paper (5 to 10 variables), the Gaussian copulas are chosen. Gaussian copulas belong to the family of meta-elliptical copulas adopted by Genest et al. and are an extension of the well-known multivariate normal (MVN) distribution (31). The most important feature of Gaussian (or more generally meta-elliptical) copulas is that they are one of the few applicable parametric copula families when the number of variables becomes large (>4). Gaussian copulas can be expressed as (32)

$$C_{U_1,\ldots,U_d}(u_1, \ldots, u_d) = \Phi_d\bigl(\phi^{-1}(u_1), \ldots, \phi^{-1}(u_d)\bigr) = \Phi_d\bigl(\phi^{-1}(F_{X_1}(x_1)), \ldots, \phi^{-1}(F_{X_d}(x_d))\bigr) \tag{2}$$

where $\phi$ is the CDF of the standard normal $N(0, 1)$ and $\Phi_d$ is the joint CDF of MVN with mean zero and covariance matrix $\Sigma$.
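The generation steps of Figure 1 can be sketched for the bivariate case with the Python standard library alone. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the sample values, and the parameter `rho` are hypothetical, and an inverse empirical CDF is used as a simple stand-in for the inverse of a fitted marginal distribution.

```python
import math
import random
from statistics import NormalDist


def sample_gaussian_copula_2d(rho, m, rng):
    """Draw m pairs (u1, u2) from a bivariate Gaussian copula with parameter rho."""
    std_normal = NormalDist()
    pairs = []
    for _ in range(m):
        z1 = rng.gauss(0.0, 1.0)
        # Correlate the second normal score with the first (2 x 2 Cholesky factor).
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # The standard normal CDF maps each score to a uniform marginal.
        pairs.append((std_normal.cdf(z1), std_normal.cdf(z2)))
    return pairs


def empirical_quantile(sorted_sample, u):
    """Inverse empirical CDF; an assumed stand-in for the inverse of a fitted marginal."""
    index = min(int(u * len(sorted_sample)), len(sorted_sample) - 1)
    return sorted_sample[index]


# Hypothetical survey samples: household income (dollars) and household size.
income = sorted([18000, 25000, 31000, 42000, 47000, 55000, 63000, 80000, 120000])
size = sorted([1, 1, 2, 2, 2, 3, 3, 4, 5])

rng = random.Random(42)
u_pairs = sample_gaussian_copula_2d(rho=0.7, m=2000, rng=rng)           # Figure 1a
households = [(empirical_quantile(income, u1), empirical_quantile(size, u2))
              for u1, u2 in u_pairs]                                     # Figure 1, b and c
```

Because the dependence is carried entirely by the pairs `(u1, u2)`, any other marginal model could replace `empirical_quantile` without touching the copula step.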
Given that $\phi^{-1}(u_j)$ is already normally distributed, the covariance matrix equals the correlation matrix and can be estimated from a rank-based correlation coefficient such as Spearman’s r (31, 32). Because there is no analytical solution for $\Phi_d$, a numerical integration must be performed:

$$\Phi_d\bigl(\phi^{-1}(u_1), \ldots, \phi^{-1}(u_d)\bigr) = \int_{-\infty}^{\phi^{-1}(u_1)} \cdots \int_{-\infty}^{\phi^{-1}(u_d)} \frac{1}{(2\pi)^{d/2} \lvert\Sigma\rvert^{1/2}} \exp\Bigl(-\tfrac{1}{2}\, \mathbf{z}^{T} \Sigma^{-1} \mathbf{z}\Bigr)\, dz_1 \cdots dz_d \tag{3}$$

where $\mathbf{z} = [z_1, \ldots, z_d]$. Although this numerical integration can be achieved relatively easily for smaller dimensions less than or equal to three as in Genest et al., the computation becomes exceedingly time consuming for higher dimensions and a Monte Carlo approach is more efficient (31). Alternatives for computing joint CDFs of MVNs (or more generally for multivariate Student t distributions) can be found in Renard and Lang (32). Existing MVN generators can be found in software such as Matlab or R. Through the specification of correlation matrix $\Sigma$, all bivariate mutual dependencies are preserved in Gaussian copulas, and therefore the complicated dependence structure can be modeled parametrically. Although some limitations exist for Gaussian copulas, such as discussed in Genz and Bretz, they are nevertheless of great value in constructing higher-order copulas (33).

When estimating marginal distributions $F_{X_j}(x_j)$ from samples, one can either find a most appropriate parametric probability distribution (e.g., normal, Pearson Type III) or use an empirical distribution instead. In this paper the kernel density estimator is adopted to compute the nonparametric probability distribution. Spearman’s rank correlation coefficient $r_{i,j}$ between variables $X_i$ and $X_j$ is computed to form correlation matrix $\Sigma$. The MVN generator is used to compute correlated marginals $(u_1, \ldots, u_d)_q$. These marginals are then converted to variables $(x_1, \ldots, x_d)_q$.

If a variable $X_k$ is in discrete format with $s$ possible realizations $[a_1, \ldots, a_i, a_j, \ldots, a_s]$, $a_i < a_j$ for all $i < j$, $X_k$ is treated as a continuous variable to derive $F_{X_k}(x_k)$ and $C_{U_1,\ldots,U_d}$. However, because MVN can generate only continuous variables, $X_k$ needs to be transformed back to discrete. The midpoints between the $s$ realizations are treated as bounds for variable reassignment (i.e., if $(a_{i-1} + a_i)/2 \le x_k < (a_i + a_{i+1})/2$, $x_k$ is assigned to be $a_i$). This continuous-discrete transformation is needed if the aim is to use the convenient random generation capability of copulas. However, the correlation matrix $\Sigma$ will be biased as a result of the ties between discrete samples. The $\Sigma_{syn}$ computed from synthesized data tends to have weaker correlations than the input $\Sigma$. Therefore, a numerical iteration procedure is introduced to resolve the issue:

1. The correlation matrix $\Sigma_0$ is computed from samples $(x_1, \ldots, x_d)_i$. It is the initial input of random number generation and also the objective of the iteration. Set $i = 1$, $\Sigma_i = \Sigma_0$.
2. With $\Sigma_i$ as parameters, copula $C_{U_1,\ldots,U_d}$ is adopted to generate $m$ continuous marginals $(u_1, \ldots, u_d)_q$, $q = 1, \ldots, m$. In this paper, $m = 1{,}000{,}000$ is used.
3. Marginals $(u_1, \ldots, u_d)_q$ are converted to $(x_1, \ldots, x_d)_q$. If a variable should be discrete, the proper transformation is performed.
4. $\Sigma_{syn}$ is computed from $(x_1, \ldots, x_d)_q$. Because of the format transformation, $\Sigma_{syn}$ tends to have weaker correlation coefficients than $\Sigma_i$.
5. Compute $\Delta\Sigma = \Sigma_0 - \Sigma_{syn}$. If the maximum absolute element of $\Delta\Sigma$ is less than a given tolerance $\varepsilon$, the iteration stops. Otherwise, set $\Sigma_{i+1} = \Sigma_i + \Delta\Sigma$, $i = i + 1$, and repeat Steps 2 through 5.

Following this approach, the synthesized household characteristics $(x_1, \ldots, x_d)_q$ can be generated. Although each household is virtual and does not actually exist, together the households will reproduce the same marginal distributions and dependence structure as the survey data.
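The midpoint rule and the correction loop of Steps 1 through 5 can be sketched for two discrete variables, where the correlation matrix reduces to a single coefficient. This Python example is a simplified illustration under several assumptions that are not in the paper: the function names are hypothetical, each continuous marginal is replaced by a normal distribution with an assumed location and scale, the same base random draws are reused in every iteration so that the loop is deterministic, and a much smaller $m$ is used.

```python
import bisect
import math
import random


def snap_to_levels(x, levels):
    """Midpoint rule: x maps to a_i when (a_{i-1}+a_i)/2 <= x < (a_i+a_{i+1})/2."""
    mids = [(lo + hi) / 2 for lo, hi in zip(levels, levels[1:])]
    return levels[bisect.bisect_right(mids, x)]


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def corrected_rho(rho_0, marginal_1, marginal_2, m=10000, eps=0.002, seed=0, max_iter=50):
    """Iterate Steps 2-5 for two variables until the synthesized rho matches rho_0.

    Each marginal is (loc, scale, levels): an assumed normal stand-in for the
    continuous marginal, followed by the discrete levels to snap to.
    """
    rng = random.Random(seed)
    # Assumption: the same base draws are reused in every iteration.
    e1 = [rng.gauss(0.0, 1.0) for _ in range(m)]
    e2 = [rng.gauss(0.0, 1.0) for _ in range(m)]
    (loc1, scale1, levels1), (loc2, scale2, levels2) = marginal_1, marginal_2
    x1 = [snap_to_levels(loc1 + scale1 * a, levels1) for a in e1]
    rho_i = rho_0                                            # Step 1
    for _ in range(max_iter):
        z2 = [rho_i * a + math.sqrt(1.0 - rho_i ** 2) * b for a, b in zip(e1, e2)]
        x2 = [snap_to_levels(loc2 + scale2 * z, levels2) for z in z2]   # Steps 2-3
        rho_syn = spearman(x1, x2)                           # Step 4
        delta = rho_0 - rho_syn                              # Step 5
        if abs(delta) < eps:
            return rho_i, rho_syn
        rho_i = max(-0.999, min(0.999, rho_i + delta))
    raise RuntimeError("correlation correction did not converge")


# Hypothetical marginals for household size and number of vehicles.
rho_fit, rho_syn = corrected_rho(0.5, (2.5, 1.3, [1, 2, 3, 4, 5, 6]), (1.9, 0.9, [0, 1, 2, 3]))
```

In this sketch the returned `rho_fit` is the copula parameter that reproduces the target rank correlation after discretization; for more than two variables the same update would be applied element by element to the correlation matrix.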
These synthesized households can be further allocated into specific subregions when localized ancillary information is available.

Local Household Fitting

Typically, the survey samples are available only in coarse extent with lower sampling rate because of privacy considerations. The detailed samples cannot be released for a specific region because that release may reveal the true identity of one of those surveyed. However, geographically specific information is highly desirable for various studies, especially for activity-based transportation microsimulation. Therefore, more localized but less detailed ancillary information is consulted for placing synthesized households into appropriate locations.

For finer demographic units (e.g., census tracts or BGs), the categorical summaries (histograms) instead of individual samples are usually provided. Although the marginal distributions can be estimated from summary tables, the complete dependence structure between variables is not available. Statistically speaking, it is not possible to know how two variables correlate to each other only from their marginals. A measure of correlation or a dependence structure is needed to link the marginals. In some cases, the local multiway summary tables are provided (e.g., gender by age), so they can be used as bases to derive local joint distributions and synthesize the population (multiway tables are the main drivers of IPF-based synthesizers). However, the local multiway tables are provided mostly in lower dimensions (≤4). Smaller sample size at local scale will result in a number of empty cells in the multiway table, which is the main limitation of the IPF approach. Furthermore, the empty cells will increase exponentially with dimensions. As a result, the complete correlation structure between variables is usually sacrificed.
Most efforts focused on synthesizing the individual variables (marginals) correctly, but the correlation structure is seldom examined.

Table 1 illustrates a two-dimensional example. Summaries of demographic variables $A$ and $B$ are drawn from local tables, in which $a_i$ represents the number of households belonging to $A_i$ and $b_j$ represents the number of households belonging to $B_j$. The total number of households $t$ can be computed by $t = \sum_{i=1}^{n} a_i = \sum_{j=1}^{m} b_j$. To create realistic agents for activity-based simulation, the household counts $c_{ij}$ that belong to both $A_i$ and $B_j$ need to be estimated. Although it is clear that $a_i = \sum_{j=1}^{m} c_{ij}$ and $b_j = \sum_{i=1}^{n} c_{ij}$, $c_{ij}$ cannot be solved without additional information on the correlation between $A$ and $B$. When IPF is used, a similar two-way table is derived from the samples and then applied to estimate $c_{ij}$. In this paper, the synthesized virtual households $(x_1, \ldots, x_d)_q$ are used to fill the table. Following the setup in Table 1, assuming $A$ and $B$ are the local variables corresponding to $X_1$ and $X_2$, the fitting procedures are as follows:

1. Set all $c_{ij}$ to zero.
2. A virtual household $(x_1, \ldots, x_d)_q$ is synthesized.
3. Identify $i$ and $j$ so that $x_1$ belongs to $A_i$ and $x_2$ belongs to $B_j$.
4. If $\sum_{j=1}^{m} c_{ij} < a_i$ and $\sum_{i=1}^{n} c_{ij} < b_j$, then $c_{ij} = c_{ij} + 1$ and $(x_1, \ldots, x_d)_q$ is kept. Otherwise, $(x_1, \ldots, x_d)_q$ is discarded.
5. Repeat Steps 2 through 4 until all $a_i = \sum_{j=1}^{m} c_{ij}$ and $b_j = \sum_{i=1}^{n} c_{ij}$.

From the above conceptual example, the synthesized households can be assigned to more specific locations, in which $X_1$ and $X_2$ will match the local summary statistics of $A$ and $B$ and the correlations between variables can be preserved. However, given the different sampling rates, the marginal distribution of all samples could be different from the marginal distribution from all corresponding summary tables. This potential problem was traditionally resolved by introducing a weighting factor to each sample, in which the weighted sum of all samples will equal the total of summary tables. In this paper, because the main purpose of weights is covered through local household fitting, no weight adjustment was considered when the virtual household generator was derived. Although only two local constraints are considered in the example, more constraints of interest can be imposed as well; however, the computation time may increase with high dimensions.

CASE STUDY

The state of Iowa was chosen as the test bed for virtual household synthesizing. The detailed household demographic samples are drawn from PUMS (6); the samples are based on a 5% sampling rate and are grouped in geographical units called PUMAs. The PUMA is determined in such a way that it must contain approximately 10,000 household samples from 200,000 populations, so the privacy of each respondent is well protected. However, the PUMA also results in coarse spatial extent and therefore is a disadvantage for region-specific studies. Meanwhile, the local summary tables are obtained from SF3, which are in the geographical units of BGs (7). The SF3 information is based on the census long forms (16.7% sampling rate) and further adjusted with short-form data (100% sampling rate). Therefore, the summary information is deemed to be the most accurate public demographic statistics.

In this paper, the copula-based virtual household synthesizer is derived from PUMS and then locally fitted to SF3 summaries. The study area is shown in Figure 2, in which 19 PUMAs and 2,634 BGs are analyzed. Overall, 57,464 households (141,179 members) are used to synthesize 1,150,197 virtual households (2,825,716 members).
Several potentially relevant household demographic variables are drawn, including the following:

$X_1$ = household total income in 1999 (HH_Income, units in dollars),
$X_2$ = number of household members (HH_Size),
$X_3$ = number of workers (Workers),
$X_4$ = number of vehicles (Vehicles),
$X_5$ = household highest educational attainment (MAX_Education, unit in census education attainment index), derived from individual records, and
$X_6$ = household total travel time to work (SUM_Travel_Time, unit in minutes), derived from individual records.

For each PUMA, a unique copula-based synthesizer is constructed. The marginal distributions $u_j = F_{X_j}(x_j)$, $j = 1, \ldots, 6$, are derived by nonparametric kernel density functions, in which the discrete-continuous transformation is considered for HH_Size, Workers, Vehicles, and MAX_Education. The correlation matrix $\Sigma$ is computed by Spearman’s r and then corrected for formatting issues (tolerance $\varepsilon$ set to be 0.002). The Gaussian copulas $C_{U_1,\ldots,U_d}$ are then used to synthesize virtual households.

At the local level, SF3 summaries from each BG are collected and treated as constraints. However, not every variable has a corresponding local summary and some variables have different universes (HH_Income and HH_Size: total households, Workers: total families, and Vehicles: total occupied housing units). Without making extra assumptions, only HH_Income and HH_Size summaries are taken as the two local constraints in this case study. Following the local fitting procedures, the virtual households are assigned for each BG. Under a regular desktop computer environment, it takes about 1 to 2 h to complete the full assessment for a PUMA. The computing time will increase if more local constraints are considered.

The performance of copula-based virtual household synthesizing is reported in Tables 2, 3, and 4.
TABLE 1 Illustration of Local Household Fitting Problem

Variable   b1    ...   bj    ...   bm
a1         c11   ...   c1j   ...   c1m
...        ...         ...         ...
ai         ci1   ...   cij   ...   cim
...        ...         ...         ...
an         cn1   ...   cnj   ...   cnm

FIGURE 2 Study area (state of Iowa) (19 PUMAs and 2,634 BGs).

TABLE 2 Marginal Variables for Performance of Household Synthesizing

Data Set                 Synthesizer  HH_Income  HH_Size  Workers  Vehicles   MAX_Education  SUM_Travel_Time

Mean
Iowa State               PUMS         47,131     2.5      1.1      1.9        10.5           23.3
PUMA 01500 (Suburban)    PUMS         71,054     2.6      1.3      2.0        11.6           27.8
                         Copula       71,581     2.5      1.3      2.0        11.6           27.7
                         IPF          82,348     3.4      2.5      2.2        11.7           30.5
PUMA 01000 (Rural)       PUMS         42,671     2.4      1.1      2.0        10.1           21.4
                         Copula       43,780     2.4      1.2      2.0        10.2           22.1
                         IPF          47,868     3.3      2.3      2.3        10.5           26.1
PUMA 01500, 2 vehicles   PUMS         79,079     2.8      1.4      2 (fixed)  11.9           29.3
                         Copula       70,966     2.6      1.4      2 (fixed)  11.7           27.6
                         IPF          87,909     3.6      2.6      2 (fixed)  12.0           30.7
PUMA 01000, 2 vehicles   PUMS         46,272     2.5      1.2      2 (fixed)  10.4           22.8
                         Copula       42,615     2.4      1.2      2 (fixed)  10.2           20.4
                         IPF          45,013     3.3      2.4      2 (fixed)  10.4           24.0

Median
Iowa State               PUMS         38,200     2        1        2          11             15.0
PUMA 01500 (Suburban)    PUMS         57,200     2        2        2          12             23
                         Copula       57,627     2        1        2          12             23
                         IPF          71,000     3        3        2          12             25
PUMA 01000 (Rural)       PUMS         34,100     2        1        2          10             10
                         Copula       35,836     2        1        2          10             11
                         IPF          35,000     3        3        2          11             15
PUMA 01500, 2 vehicles   PUMS         64,000     2        2        2 (fixed)  13             25
                         Copula       59,866     2        2        2 (fixed)  12             24
                         IPF          77,200     3        3        2 (fixed)  13             25
PUMA 01000, 2 vehicles   PUMS         38,200     2        2        2 (fixed)  10             12
                         Copula       36,730     2        1        2 (fixed)  10             11
                         IPF          33,900     3        3        2 (fixed)  10             15

Note: Boldface indicates values closer to PUMS values.

Table 2 shows the mean and median of PUMS and synthesized results for the entire state of Iowa, PUMA 01500 (suburban Des Moines, Iowa), and PUMA 01000 (rural area between Des Moines and Sioux City, Iowa). For comparison, the IPF algorithm of TRANSIMS is also adopted to synthesize households for PUMAs 01500 and 01000. The copula-based approach reproduces the PUMS means and medians closely for the two selected PUMAs and does so better than the IPF algorithm of TRANSIMS. Although not reported, all the copula-based quantiles are closely modeled (because the copula-based joint distribution can contain different types of marginal distributions). The suburban and rural areas have different demographics in regard to income and travel time. To get a closer look into the synthesized data set, the statistics of those households having exactly two vehicles are also summarized in Table 2. The results also suggest that the proposed copula-based approach can generally produce equal or better performance in the fine resolution intervals.

Spearman's rank correlations between the six selected variables of PUMAs 01500 and 01000 are reported in Table 3. The results from copulas (C) and the IPF algorithm of TRANSIMS (I) are again compared with PUMS (P). Because each correlation matrix is symmetric, only one half of it is shown. The rank correlations of PUMA 01500 are shown in the upper right of the table and those of PUMA 01000 in the lower left.
All correlations are positive and not close to zero, suggesting that these variables are intercorrelated and should not be analyzed independently. The dependence structure between variables must be accounted for properly; otherwise the analysis will be biased. To get a closer look into the synthesized data set, the correlation coefficients of those households having exactly two vehicles are also shown in Table 4. Overall, the copula-based approach better synthesizes the correlation matrix.

By combining the results from Tables 2, 3, and 4, it is clear that the copula-based approach provides better results than the IPF algorithm of TRANSIMS and therefore could be a potential alternative for household synthesizing. The unique capability of copulas helps in preserving marginal distributions and a higher-dimension correlation structure and can be used to generate correlated virtual households in a convenient fashion.

TABLE 3 Rank Correlation Between Variables for Performance of Household Synthesizing
(upper right: PUMA 01500, Suburban; lower left: PUMA 01000, Rural)

Variable          Source  HH_Income  HH_Size  Workers  Vehicles  MAX_Education  SUM_Travel_Time
HH_Income         P       —          0.437    0.524    0.456     0.437          0.386
                  C       —          0.358    0.483    0.417     0.437          0.361
                  I       —          0.389    0.530    0.446     0.516          0.397
HH_Size           P       0.503      —        0.681    0.519     0.172          0.431
                  C       0.445      —        0.660    0.495     0.137          0.408
                  I       0.405      —        0.500    0.408     0.206          0.368
Workers           P       0.568      0.759    —        0.565     0.193          0.566
                  C       0.535      0.749    —        0.552     0.172          0.555
                  I       0.583      0.645    —        0.601     0.289          0.607
Vehicles          P       0.479      0.555    0.575    —         0.140          0.389
                  C       0.453      0.541    0.568    —         0.123          0.376
                  I       0.510      0.495    0.590    —         0.172          0.429
MAX_Education     P       0.401      0.339    0.360    0.281     —              0.184
                  C       0.381      0.318    0.343    0.269     —              0.171
                  I       0.397      0.306    0.370    0.338     —              0.208
SUM_Travel_Time   P       0.431      0.510    0.616    0.332     0.332          —
                  C       0.408      0.496    0.612    0.321     0.321          —
                  I       0.432      0.471    0.643    0.323     0.323          —

Note: Rank correlation coefficients of PUMA 01500 are shown at upper right of table; PUMA 01000 are at lower left; boldface indicates values closer to PUMS values. — = not shown because variables are identical; P = PUMS; C = copula; I = IPF.

TABLE 4 Rank Correlation for Two Vehicles for Performance of Household Synthesizing
(upper right: PUMA 01500, Suburban; lower left: PUMA 01000, Rural)

Variable          Source  HH_Income  HH_Size  Workers  MAX_Education  SUM_Travel_Time
HH_Income         P       —          0.214    0.317    0.446          0.213
                  C       —          0.199    0.347    0.436          0.249
                  I       —          0.209    0.064    0.402          0.209
HH_Size           P       0.270      —        0.426    0.121          0.240
                  C       0.275      —        0.550    0.095          0.287
                  I       0.175      —        0.271    0.120          0.262
Workers           P       0.430      0.588    —        0.111          0.463
                  C       0.398      0.652    —        0.134          0.466
                  I       0.211      0.562    —        0.040          0.358
MAX_Education     P       0.310      0.270    0.310    —              0.099
                  C       0.304      0.221    0.254    —              0.140
                  I       0.297      0.337    0.336    —              0.081
SUM_Travel_Time   P       0.297      0.364    0.539    0.284          —
                  C       0.263      0.362    0.517    0.231          —
                  I       0.344      0.490    0.499    0.350          —

Note: Rank correlation coefficients of PUMA 01500 are shown at upper right of table; PUMA 01000 are at lower left; boldface indicates values closer to PUMS values. — = not shown because variables are identical; P = PUMS; C = copula; I = IPF.
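One way to condense the comparison in Table 3 into a single number is the mean absolute deviation of each synthesizer's rank correlations from the PUMS values. The sketch below does this for the 15 upper-right entries (PUMA 01500) of Table 3, using the values exactly as tabulated. The summary statistic itself is an illustrative choice and is not one reported in the paper.

```python
# Upper-right entries of Table 3 (PUMA 01500), read row by row.
pums = [0.437, 0.524, 0.456, 0.437, 0.386,
        0.681, 0.519, 0.172, 0.431,
        0.565, 0.193, 0.566,
        0.140, 0.389,
        0.184]
copula = [0.358, 0.483, 0.417, 0.437, 0.361,
          0.660, 0.495, 0.137, 0.408,
          0.552, 0.172, 0.555,
          0.123, 0.376,
          0.171]
ipf = [0.389, 0.530, 0.446, 0.516, 0.397,
       0.500, 0.408, 0.206, 0.368,
       0.601, 0.289, 0.607,
       0.172, 0.429,
       0.208]


def mean_abs_deviation(reference, candidate):
    """Average absolute difference between paired correlation values."""
    return sum(abs(r - c) for r, c in zip(reference, candidate)) / len(reference)


mad_copula = mean_abs_deviation(pums, copula)
mad_ipf = mean_abs_deviation(pums, ipf)
print(f"copula: {mad_copula:.3f}, IPF: {mad_ipf:.3f}")
```

For these entries the deviation is about 0.025 for the copula and about 0.054 for IPF, consistent with the statement that the copula-based approach better reproduces the correlation matrix.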
With more realistic marginal distributions and dependence structure, the interaction between simulation agents will be more credible. If the local fitting step is skipped, the statistics reported in Tables 2, 3, and 4 will be even closer to the sample statistics. However, because the local fitting provides another way to account for the different weights between samples, it is still needed so that the synthesized households will be more meaningful locally. More local constraints can be included when appropriate.

CONCLUSIONS AND DISCUSSION

This paper introduces a copula-based household synthesizer designed to preserve marginal distributions and the intervariable dependence structure among survey samples. As evidenced by the good performance in the case study, the virtual Iowa households match the known local summaries (SF3 block group statistics) while having intervariable correlations similar to those observed in PUMS. By modeling the sociodemographic household attributes with reliable correlations, the proposed copula-based method could be a promising alternative to some existing methods. Compared with the common IPF approach, the new method does not suffer from the empty cell problem in the contingency table and will not result in overly repeated synthesized households for activity-based traffic demand simulation. The new method may be computationally more efficient than the cellular automata method, which deals with a dense and uniform dissection of the same traffic network used in TRANSIMS. Therefore, the proposed method could be a suitable alternative to synthesize realistic agent characteristics for further activity-based traffic demand modeling.

Although the effectiveness of the proposed population synthesizing method has been validated with the state of Iowa case in this paper, the method can be tested further by comparing it with other household synthesizing approaches and by applying it more extensively in other study areas.
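The empty cell problem mentioned above follows directly from the multiplicative form of IPF: a seed cell that is zero in the sample stays zero no matter what the target margins are. The sketch below shows this on a made-up 3 × 3 seed table. It is a generic two-dimensional IPF written for illustration, not the TRANSIMS implementation, and the seed counts and margins are hypothetical.

```python
import numpy as np


def ipf(seed, row_totals, col_totals, iterations=100):
    """Two-dimensional iterative proportional fitting (generic sketch)."""
    table = seed.astype(float).copy()
    for _ in range(iterations):
        # Scale rows to match the row margins, then columns likewise.
        table *= row_totals[:, None] / table.sum(axis=1, keepdims=True)
        table *= col_totals[None, :] / table.sum(axis=0, keepdims=True)
    return table


# Hypothetical seed counts from a small sample; two cells are empty.
seed = np.array([[4, 2, 0],
                 [1, 3, 2],
                 [0, 1, 5]])
row_totals = np.array([300.0, 400.0, 300.0])  # hypothetical zonal margins
col_totals = np.array([250.0, 350.0, 400.0])

fitted = ipf(seed, row_totals, col_totals)
print(np.round(fitted, 1))
```

The fitted table matches both sets of margins, yet the two household types absent from the seed never appear in the result, which is the limitation the copula-based synthesizer is said to avoid.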
In addition, if more travel behavior–related sample data such as the National Household Travel Survey and Census Transportation Planning Products could be incorporated into the proposed population synthesizing method, it would be possible to provide a more dynamic and precise forecast of traffic demands to traffic planners and policy decision makers. Furthermore, before household activities are generated, a methodology is required to distribute the virtual households spatially within the study area. TRANSIMS executes a household locating program to randomly place synthetic households at the block group level on the traffic network depending on various land use characteristics. Future research efforts will develop a novel algorithm to place synthetic households and persons on a much finer spatial data set, namely LandScan USA, which provides population distribution at 90 m × 90 m resolution for the entire continental United States (34).

ACKNOWLEDGMENTS

This research was sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U.S. Department of Energy. This paper has been written by employees of UT-Battelle, LLC, under contract with the U.S. Department of Energy.

REFERENCES

1. McNally, M. G. An Activity-Based Microsimulation Model for Travel Demand Forecasting. Presented at the Conference on Activity-Based Approaches, Eindhoven, Netherlands, 1995.
2. Pas, E. I. State of the Art and Research Opportunities in Travel Demand: Another Perspective. Transportation Research Part A, Vol. 19, No. 5, 1985, pp. 460–464.
3. Cambridge Systematics, Inc. Scan of Recent Travel Surveys. Publication DOT-T-97-08. FHWA, U.S. Department of Transportation, 1996.
4. Mohammadian, A., M. Javanmardi, and Y. Zhang. Synthetic Household Travel Survey Data Simulation. Transportation Research Part C, Vol. 18, No. 6, 2010, pp. 869–878.
5. Greaves, S. P., and P. R. Stopher. Creating a Synthetic Household Travel and Activity Survey: Rationale and Feasibility Analysis. In Transportation Research Record: Journal of the Transportation Research Board, No. 1706, TRB, National Research Council, Washington, D.C., 2000, pp. 82–91.
6. Public Use Microdata Sample (PUMS) Files. American Community Survey (ACS). March 2010. http://www.census.gov/acs/www/Products/PUMS/index.htm. Accessed Nov. 1, 2011.
7. U.S. Census Bureau. Summary File 3 (SF 3). June 2003. http://www.census.gov/census2000/sumfile3.html. Accessed Nov. 1, 2011.
8. Hobeika, A. TRANSIMS Fundamentals. July 2005. http://gis.uml.edu/abrown2/epi/popsim/transim/TMIP_TRANSIMS_TRANSIMS%20Fundamentals.htm. Accessed Nov. 1, 2011.
9. Deming, W. E., and F. F. Stephan. On a Least Squares Adjustment of a Sampled Frequency Table When the Expected Marginal Totals Are Known. The Annals of Mathematical Statistics, Vol. 11, No. 4, 1940, pp. 427–444.
10. Guo, J. Y., and C. R. Bhat. Population Synthesis for Microsimulating Travel Behavior. In Transportation Research Record: Journal of the Transportation Research Board, No. 2014, Transportation Research Board of the National Academies, Washington, D.C., 2007, pp. 92–101.
11. Ye, X., K. C. Konduri, R. M. Pendyala, B. Sana, and P. Waddell. A Methodology to Match Distributions of Both Household and Person Attributes in the Generation of Synthetic Populations. Presented at 88th Annual Meeting of the Transportation Research Board, Washington, D.C., 2009.
12. Auld, J., and A. Mohammadian. Efficient Methodology for Generating Synthetic Populations with Multiple Control Levels. In Transportation Research Record: Journal of the Transportation Research Board, No. 2175, Transportation Research Board of the National Academies, Washington, D.C., 2010, pp. 138–147.
13. Arentz, T., H. J. Timmermans, and F. Hofman. Creating Synthetic Household Populations: Problems and Approach. In Transportation Research Record: Journal of the Transportation Research Board, No. 2014, Transportation Research Board of the National Academies, Washington, D.C., 2007, pp. 85–91.
14. Eluru, N., R. Paleti, R. M. Pendyala, and C. R. Bhat. Modeling Injury Severity of Multiple Occupants of Vehicles: A Copula-Based Multivariate Approach. In Transportation Research Record: Journal of the Transportation Research Board, No. 2165, Transportation Research Board of the National Academies, Washington, D.C., 2010, pp. 1–11.
15. Rana, T. A., S. Sikder, and A. R. Pinjari. Copula-Based Method for Addressing Endogeneity in Models of Severity of Traffic Crash Injuries—Application to Two-Vehicle Crashes. In Transportation Research Record: Journal of the Transportation Research Board, No. 2147, Transportation Research Board of the National Academies, Washington, D.C., 2010, pp. 75–87.
16. Spissu, E., A. R. Pinjari, R. M. Pendyala, and C. R. Bhat. A Copula-Based Joint Multinomial Discrete-Continuous Model of Vehicle Type Choice and Miles of Travel. Transportation, Vol. 36, No. 4, 2009, pp. 403–422.
17. Iterative Proportional Fitting (IPF): Theory, Method and Examples (Computer Manual 26). School of Geography, University of Leeds, United Kingdom, 1987.
18. Beckman, R. J., K. A. Baggerly, and M. D. McKay. Creating Synthetic Baseline Populations. Transportation Research Part A, Vol. 30, No. 6, 1996, pp. 415–429.
19. Duguay, G., W. Jung, and D. McFadden. SYNSAM: A Methodology for Synthesizing Household Transportation Survey Data. Working Paper 7618. Travel Demand Forecasting Project. Institute of Transportation Studies, University of California, Berkeley, 1976.
20. Evers, L., and D. Santapaola. Use of the Iterative Proportional Fitting Algorithm for Combining Traffic Count Data with Missing Dimensions. In Transportation Research Record: Journal of the Transportation Research Board, No. 1993, Transportation Research Board of the National Academies, Washington, D.C., 2007, pp. 95–100.
21. Wheaton, W. D., J. C. Cajka, B. M. Chasteen, D. K. Wagener, P. C. Cooley, and L. Ganapathi. Synthesized Population Databases: A U.S. Geospatial Database for Agent-Based Models. No. MR-0010-0905, RTI Press Method Report. 2009.
22. Miyamoto, K., N. Sugiki, N. Otani, and V. Vichiensan. Agent-Based Estimation Method of Household Micro-Data for Base Year in Land-Use Microsimulation. Presented at 89th Annual Meeting of the Transportation Research Board, Washington, D.C., 2010.
23. Moeckel, R., K. Spiekermann, and M. Wegener. Creating a Synthetic Population. Presented at 8th International Conference on Computers in Urban Planning and Urban Management (CUPUM), Sendai, Japan, 2003.
24. Birkin, M. H., A. G. Turner, and B. Wu. A Synthetic Demographic Model of the UK Population: Methods, Progress and Problems. Presented at 2nd International Conference on e-Social Science, Manchester, United Kingdom, 2006.
25. Long, L., J. Lin, and W. Pu. Model-Based Synthesis of Household Travel Survey Data in Small and Midsize Metropolitan Areas. In Transportation Research Record: Journal of the Transportation Research Board, No. 2105, Transportation Research Board of the National Academies, Washington, D.C., 2009, pp. 64–70.
26. Ryan, J., H. Maoh, and P. Kanaroglou. Population Synthesis: Comparing the Major Techniques Using a Small, Complete Population of Firms. In Geographical Analysis, Vol. 41, 2009, pp. 181–203.
27. Voas, D., and P. Williamson. An Evaluation of the Combinatorial Optimization Approach to the Creation of Synthetic Microdata. International Journal of Population Geography, Vol. 6, No. 5, 2000, pp. 349–366.
28. Sklar, A. Fonctions de répartition à n dimensions et leurs marges. Publications de l’Institut de Statistique de l’Université de Paris, Vol. 8, 1959, pp. 229–231.
29. Nelsen, R. B. An Introduction to Copulas. Springer, New York, 2006.
30. Kao, S. C., and R. S. Govindaraju. Trivariate Statistical Analysis of Extreme Rainfall Events via Plackett Family of Copulas. Water Resources Research, Vol. 44, 2008.
31. Genest, C., A. C. Favre, J. Béliveau, and C. Jacques. Metaelliptical Copulas and Their Use in Frequency Analysis of Multivariate Hydrological Data. Water Resources Research, Vol. 43, 2007.
32. Renard, B., and M. Lang. Use of a Gaussian Copula for Multivariate Extreme Value Analysis: Some Case Studies in Hydrology. Advances in Water Resources, Vol. 30, 2007, pp. 897–912.
33. Genz, A., and F. Bretz. Comparison of Methods for the Computation of Multivariate t-Probabilities. Journal of Computational and Graphical Statistics, Vol. 11, No. 4, 2002, pp. 950–971.
34. Bhaduri, B., E. Bright, P. Coleman, and M. Urban. LandScan USA: A High-Resolution Geospatial and Temporal Modeling Approach for Population Distribution and Dynamics. GeoJournal, Vol. 69, 2007, pp. 103–117.

The Transportation Demand Forecasting Committee peer-reviewed this paper.
