Title: | Small Area Estimation Using Model-Assisted Projection Method |
---|---|
Description: | Combines information from two independent surveys using a model-assisted projection method. Designed for survey sampling scenarios where a large sample collects only auxiliary information (Survey 1) and a smaller sample provides data on both variables of interest and auxiliary variables (Survey 2). Implements a working model to generate synthetic values of the variable of interest by fitting the model to Survey 2 data and predicting values for Survey 1 based on its auxiliary variables (Kim & Rao, 2012) <doi:10.1093/biomet/asr063>. |
Authors: | Ridson Al Farizal P [aut, cre, cph] |
Maintainer: | Ridson Al Farizal P <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.1 |
Built: | 2025-03-17 06:06:00 UTC |
Source: | https://github.com/alfrzlp/sae.projection |
A dataset from the March 2020 National Socio-Economic Survey (Susenas) KOR Module, conducted in DKI Jakarta, Indonesia. The survey is held annually, and the data are presented at the regency level.
df_susenas_mar2020
A data frame with 18842 rows and 38 variables with 6 domains.
Year the survey was conducted
Primary Sampling Unit (PSU)
Secondary Sampling Unit (SSU)
Strata used for sampling
Unique identifier for each respondent
Sample number
Household number
Household member number
Weight from survey
Province code
Regency or municipality code
Urban or rural classification (1: Urban, 2: Rural)
Marital status (1: Married, 0: Other)
Sex (1: Male, 2: Female)
Age of the respondent
Currently attending school (0: No, 1: Yes)
Highest education completed (0: Did not complete elementary school, 1: Elementary school, 2: Junior high school, 3: Senior high school, 4: University/College)
Employment status (1: Employed, 0: Not employed)
Type of employment sector (1: Agriculture, 0: Non-agriculture)
Job position or role
Ownership status of residence (1: Owned, 0: Other)
Floor area of residence (in square meters)
Has pension insurance (0: No, 1: Yes)
Has old-age insurance (0: No, 1: Yes)
Has work insurance (0: No, 1: Yes)
Has life insurance (0: No, 1: Yes)
Receives severance pay (0: No, 1: Yes)
Has a KKS (Kartu Keluarga Sejahtera) card (0: No, 1: Yes)
Is the respondent a recipient of PKH (Program Keluarga Harapan) assistance? (0: No, 1: Yes)
Location where PKH funds are disbursed
PKH funds used for food assistance (0: No, 1: Yes)
PKH funds used for housing assistance (0: No, 1: Yes)
PKH funds used for healthcare assistance (0: No, 1: Yes)
PKH funds used for maternity assistance (0: No, 1: Yes)
PKH funds used for school assistance (0: No, 1: Yes)
PKH funds used for other types of assistance (0: No, 1: Yes)
Receives BPNT (Bantuan Pangan Non-Tunai) program assistance (0: No, 1: Yes)
A dataset from the September 2020 National Socio-Economic Survey (Susenas) Social Resilience Module, conducted in DKI Jakarta, Indonesia. The module is held every three years, and the data are presented at the provincial level.
df_susenas_sep2020
A data frame with 3655 rows and 33 variables with 6 domains.
Unique identifier for each respondent
Sample number
Household number
Household member number
Weight from survey
Province code
Urban or rural classification (1: Urban, 2: Rural)
Marital status (1: Married, 0: Other)
Sex (1: Male, 2: Female)
Age of the respondent
Currently attending school (0: No, 1: Yes)
Highest education completed (0: Did not complete elementary school, 1: Elementary school, 2: Junior high school, 3: Senior high school, 4: University/College)
Employment status (1: Employed, 0: Not employed)
Type of employment sector (1: Agriculture, 0: Non-agriculture)
Job position or role
Ownership status of residence (1: Owned, 0: Other)
Floor area of residence (in square meters)
Has pension insurance (0: No, 1: Yes)
Has old-age insurance (0: No, 1: Yes)
Has work insurance (0: No, 1: Yes)
Has life insurance (0: No, 1: Yes)
Receives severance pay (0: No, 1: Yes)
Has a KKS (Kartu Keluarga Sejahtera) card (0: No, 1: Yes)
Is the respondent a recipient of PKH (Program Keluarga Harapan) assistance? (0: No, 1: Yes)
Location where PKH funds are disbursed
PKH funds used for food assistance (0: No, 1: Yes)
PKH funds used for housing assistance (0: No, 1: Yes)
PKH funds used for healthcare assistance (0: No, 1: Yes)
PKH funds used for maternity assistance (0: No, 1: Yes)
PKH funds used for school assistance (0: No, 1: Yes)
PKH funds used for other types of assistance (0: No, 1: Yes)
Receives BPNT (Bantuan Pangan Non-Tunai) program assistance (0: No, 1: Yes)
Using public transportation (0: No, 1: Yes), which includes motorized vehicles with specific routes
A dataset from the August 2022 National Labor Force Survey (Sakernas) conducted in East Java, Indonesia.
df_svy22
A data frame with 74,070 rows and 11 variables with 38 domains.
Primary Sampling Unit
Weight from survey
Province code
Regency/municipality code
Strata
Income
Not in education, employment, or training (NEET) status
Sex (1: Male, 2: Female)
Age
Disability status (0: False, 1: True)
Last completed education
A dataset from the August 2023 National Labor Force Survey (Sakernas) conducted in East Java, Indonesia.
df_svy23
A data frame with 66,245 rows and 11 variables with 38 domains.
Primary Sampling Unit
Weight from survey
Province code
Regency/municipality code
Strata
Income
Not in education, employment, or training (NEET) status
Sex (1: Male, 2: Female)
Age
Disability status (0: False, 1: True)
Last completed education
The function addresses the problem of combining information from two or more independent surveys, a common challenge in survey sampling. It focuses on cases where:
Survey 1: A large sample collects only auxiliary information.
Survey 2: A much smaller sample collects both the variables of interest and the auxiliary variables.
The function implements a model-assisted projection estimation method based on a working model. Several machine learning models can serve as the working model; they are listed in the Details section.
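The idea described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the toy data, the column names (`dom`, `x`, `y`, `w`), and the choice of `lm()` as the working model are all hypothetical, and the sketch ignores strata, clusters, and variance estimation.

```r
# Hypothetical toy data; all names here are illustrative only.
set.seed(1)

# Survey 1: large sample with the auxiliary variable x and survey weights w.
survey1 <- data.frame(
  dom = rep(c("A", "B"), each = 500),
  x   = runif(1000, 0, 10),
  w   = rep(20, 1000)
)

# Survey 2: much smaller sample observing both y and x.
survey2 <- data.frame(x = runif(100, 0, 10))
survey2$y <- 2 + 3 * survey2$x + rnorm(100)

# Step 1: fit the working model on Survey 2.
fit <- lm(y ~ x, data = survey2)

# Step 2: predict synthetic values of y for every unit in Survey 1.
survey1$y_hat <- predict(fit, newdata = survey1)

# Step 3: weighted domain means of the synthetic values.
ypr <- sapply(
  split(survey1, survey1$dom),
  function(d) weighted.mean(d$y_hat, d$w)
)
print(ypr)
```

In this sketch the domain estimates depend on Survey 2 only through the fitted model, which is why a small Survey 2 can still support estimates for every domain covered by Survey 1.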
projection(
  formula,
  id,
  weight,
  strata = NULL,
  domain,
  fun = "mean",
  model,
  data_model,
  data_proj,
  model_metric,
  kfold = 3,
  grid = 10,
  parallel_over = "resamples",
  seed = 1,
  est_y = FALSE,
  ...
)
formula |
An object of class formula that contains a description of the model to be fitted. The variables included in the formula must be contained in the |
id |
Column name specifying cluster ids from the largest level to the smallest level, where ~0 or ~1 represents a formula indicating the absence of clusters. |
weight |
Column name in data_proj representing the survey weight. |
strata |
Column name specifying strata; use NULL for no strata. |
domain |
Column names in data_model and data_proj representing specific domains for which disaggregated data needs to be produced. |
fun |
A function taking a formula and survey design object as its first two arguments, given by name: "mean" (default), "total", or "varians". |
model |
The working model to be used in the projection estimator. Refer to the details for the available working models. |
data_model |
A data frame or a data frame extension (e.g., a tibble) representing the second survey, which has a much smaller sample and provides information on both the variable of interest and the auxiliary variables. |
data_proj |
A data frame or a data frame extension (e.g., a tibble) representing the first survey, characterized by a large sample that collects only auxiliary information or general-purpose variables. |
model_metric |
A yardstick::metric_set(), or NULL to compute a standard set of metrics (rmse for regression and f1-score for classification). |
kfold |
The number of partitions of the data set (k-fold cross validation). |
grid |
A data frame of tuning combinations or a positive integer. The data frame should have columns for each parameter being tuned and rows for tuning parameter candidates. An integer denotes the number of candidate parameter sets to be created automatically. |
parallel_over |
A single string containing either "resamples" or "everything" describing how to use parallel processing. Alternatively, NULL is allowed, which chooses between "resamples" and "everything" automatically. If "resamples", then tuning will be performed in parallel over resamples alone. Within each resample, the preprocessor (i.e. recipe or formula) is processed once, and is then reused across all models that need to be fit. If "everything", then tuning will be performed in parallel at two levels. An outer parallel loop will iterate over resamples. Additionally, an inner parallel loop will iterate over all unique combinations of preprocessor and model tuning parameters for that specific resample. This will result in the preprocessor being re-processed multiple times, but can be faster if that processing is extremely fast. |
seed |
A single value, interpreted as an integer |
est_y |
A logical value indicating whether to return the estimation of |
... |
Further argument to the |
The available working models include:
Linear Regression linear_reg()
Logistic Regression logistic_reg()
Poisson Regression poisson_reg()
Decision Tree decision_tree()
KNN nearest_neighbor()
Naive Bayes naive_bayes()
Multi Layer Perceptron mlp()
Random Forest rand_forest()
Accelerated Oblique Random Forests (Jaeger et al. 2022, Jaeger et al. 2024) rand_forest(engine = 'aorsf')
XGBoost boost_tree(engine = 'xgboost')
LightGBM boost_tree(engine = 'lightgbm')
A complete list of models is available in Tidy Modeling with R.
The function returns a list with the following objects (model, prediction and df_result):
model
The working model used in the projection.
prediction
A vector containing the prediction results from the working model.
df_result
A data frame with the following columns:
domain
The name of the domain.
ypr
The estimation results of the projection for each domain.
var_ypr
The sample variance of the projection estimator for each domain.
rse_ypr
The Relative Standard Error (RSE) in percentage (%).
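To show how the last column relates to the other two, here is a small base R sketch using the standard definition of the relative standard error (standard error divided by the estimate, times 100). The numbers are made up, and the sketch assumes `rse_ypr` follows this usual definition; the package source is the authority on the exact computation.

```r
# Hypothetical domain estimates and variances (illustrative numbers only).
df_result <- data.frame(
  domain  = c("A", "B"),
  ypr     = c(50, 200),
  var_ypr = c(4, 100)
)

# RSE (%) = 100 * sqrt(variance) / estimate.
df_result$rse_ypr <- 100 * sqrt(df_result$var_ypr) / df_result$ypr

print(df_result)  # rse_ypr is 4 for domain A and 5 for domain B
```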
Kim, J. K., & Rao, J. N. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika, 99(1), 85-100.
## Not run:
library(sae.projection)
library(dplyr)
library(bonsai)

df_svy22_income <- df_svy22 %>% filter(!is.na(income))
df_svy23_income <- df_svy23 %>% filter(!is.na(income))

# Linear regression
lm_proj <- projection(
  income ~ age + sex + edu + disability,
  id = "PSU",
  weight = "WEIGHT",
  strata = "STRATA",
  domain = c("PROV", "REGENCY"),
  model = linear_reg(),
  data_model = df_svy22_income,
  data_proj = df_svy23_income,
  nest = TRUE
)

# Random forest regression with hyperparameter tuning
rf_proj <- projection(
  income ~ age + sex + edu + disability,
  id = "PSU",
  weight = "WEIGHT",
  strata = "STRATA",
  domain = c("PROV", "REGENCY"),
  model = rand_forest(mtry = tune(), trees = tune(), min_n = tune()),
  data_model = df_svy22_income,
  data_proj = df_svy23_income,
  kfold = 3,
  grid = 10,
  nest = TRUE
)

df_svy22_neet <- df_svy22 %>% filter(between(age, 15, 24))
df_svy23_neet <- df_svy23 %>% filter(between(age, 15, 24))

# Logistic regression
lr_proj <- projection(
  formula = neet ~ sex + edu + disability,
  id = ~ PSU,
  weight = ~ WEIGHT,
  strata = ~ STRATA,
  domain = ~ PROV + REGENCY,
  model = logistic_reg(),
  data_model = df_svy22_neet,
  data_proj = df_svy23_neet,
  nest = TRUE
)

# LightGBM regression with hyperparameter tuning
show_engines("boost_tree")
lgbm_model <- boost_tree(
  mtry = tune(),
  trees = tune(),
  min_n = tune(),
  tree_depth = tune(),
  learn_rate = tune(),
  engine = "lightgbm"
)

lgbm_proj <- projection(
  formula = neet ~ sex + edu + disability,
  id = "PSU",
  weight = "WEIGHT",
  strata = "STRATA",
  domain = c("PROV", "REGENCY"),
  model = lgbm_model,
  data_model = df_svy22_neet,
  data_proj = df_svy23_neet,
  kfold = 3,
  grid = 10,
  nest = TRUE
)
## End(Not run)
According to Kim and Rao (2012), the synthetic data obtained through the model-assisted projection method can provide a useful tool for efficient domain estimation when the size of the sample in survey 1 is much larger than the size of the sample in survey 2.
This function projects estimated values from a small survey onto an independent large survey using the random forest algorithm. Although the two surveys are statistically independent, the projection relies on shared auxiliary variables. The process includes data preprocessing, feature selection, model training, and domain-specific estimation based on survey design principles.
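The projection step can be written compactly. The notation below is an assumption introduced for illustration and follows the general setup of Kim and Rao (2012): $A_1$ is the large sample with weights $w_{i1}$, $m(\cdot)$ is the working model fitted on the small sample (here, the random forest), and $\delta_i(d)$ equals 1 when unit $i$ belongs to domain $d$ and 0 otherwise.

```latex
\hat{y}_i = m(x_i; \hat{\beta}), \quad i \in A_1,
\qquad
\hat{Y}_{d,p} = \sum_{i \in A_1} w_{i1}\, \delta_i(d)\, \hat{y}_i
```

The synthetic values $\hat{y}_i$ are computed once for the whole large sample, so an estimate for any domain is just a weighted sum over the units that fall in it.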
Projection_rf(
  data_model,
  target_column,
  data_proj,
  domain1,
  domain2,
  psu,
  ssu,
  strata,
  weights,
  split_ratio = 0.8,
  metric = "Accuracy"
)
data_model |
The training dataset, consisting of auxiliary variables and the target variable. |
target_column |
The name of the target column in the |
data_proj |
The data for projection (prediction), which needs to be projected using the trained model. It must contain the same auxiliary variables as the |
domain1 |
Domain variables for survey estimation (e.g., "province") |
domain2 |
Domain variables for survey estimation (e.g., "regency") |
psu |
Primary sampling units, representing the structure of the sampling frame from |
ssu |
Secondary sampling units, representing the structure of the sampling frame from |
strata |
Stratification variable in the |
weights |
Weights used in the |
split_ratio |
Proportion of data used for training (default is 0.8, meaning 80% for training and 20% for validation). |
metric |
The metric used for model evaluation (default is Accuracy, other options include "AUC", "F1", etc.). |
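As a rough illustration of what `split_ratio = 0.8` means, the base R sketch below draws a plain random 80/20 split of row indices. The row count `n` is hypothetical, and this is not necessarily how `Projection_rf` partitions the data internally (it may, for instance, stratify on the target).

```r
set.seed(1)

split_ratio <- 0.8
n <- 50  # hypothetical number of rows in data_model

# Randomly assign 80% of the row indices to training, the rest to validation.
train_idx <- sample(seq_len(n), size = floor(split_ratio * n))
valid_idx <- setdiff(seq_len(n), train_idx)

length(train_idx)  # 40
length(valid_idx)  # 10
```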
A list containing the following elements:
model
The trained Random Forest model.
importance
Feature importance showing which features contributed most to the model’s predictions.
train_accuracy
Accuracy of the model on the training set.
validation_accuracy
Accuracy of the model on the validation set.
validation_performance
Confusion matrix for the validation set, showing performance metrics like accuracy, precision, recall, etc.
data_proj
The projection data with predicted values.
Domain1
Estimations for Domain 1, including estimated values, variance, and relative standard error.
Domain2
Estimations for Domain 2, including estimated values, variance, and relative standard error.
Kim, J. K., & Rao, J. N. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika, 99(1), 85-100.
library(survey)
library(caret)
library(dplyr)
library(themis)
library(randomForest)

df_sep20 <- df_susenas_sep2020 %>% select(-c(1:6))
df_mar20 <- df_susenas_mar2020

proj_rf <- Projection_rf(
  data_model = df_sep20,
  target_column = "uses_public_transport",
  data_proj = df_mar20,
  domain1 = "province",
  domain2 = "regency",
  psu = "psu",
  ssu = "ssu",
  strata = "strata",
  weights = "weight",
  metric = "Accuracy"
)
According to Kim and Rao (2012), the synthetic data obtained through the model-assisted projection method can provide a useful tool for efficient domain estimation when the size of the sample in survey 1 is much larger than the size of the sample in survey 2.
This function projects estimated values from a small survey onto an independent large survey using the random forest algorithm. Although the two surveys are statistically independent, the projection relies on shared auxiliary variables. The process includes data preprocessing, feature selection, model training, and domain-specific estimation based on survey design principles. Additionally, the estimation results incorporate bias correction techniques.
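For orientation, a generic bias-corrected projection estimator of a population total, in the spirit of Kim and Rao (2012), adds a weighted sum of residuals from the small sample to the projection estimator. The notation is an assumption introduced for illustration: $A_1$ and $A_2$ are the large and small samples with weights $w_{i1}$ and $w_{i2}$, and $\hat{y}_i$ is the working-model prediction. Whether `Projection_rf_CorrectedBias` computes exactly this form is not confirmed here.

```latex
\hat{Y}_{p,\mathrm{bc}}
  = \sum_{i \in A_1} w_{i1}\, \hat{y}_i
  + \sum_{i \in A_2} w_{i2}\, \left( y_i - \hat{y}_i \right)
```

The second term estimates the bias of the synthetic values using units for which both $y_i$ and $\hat{y}_i$ are available; it vanishes when the weighted residuals in the small sample sum to zero.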
Projection_rf_CorrectedBias(
  metadata,
  data_model,
  target_column,
  data_proj,
  domain1,
  domain2,
  psu,
  ssu,
  strata,
  weight_proj,
  weight_model,
  split_ratio = 0.8,
  metric = "Accuracy"
)
metadata |
Metadata for the small survey dataset; it must contain the psu, ssu, and strata columns. |
data_model |
The training dataset, consisting of auxiliary variables and the target variable. |
target_column |
The name of the target column in the |
data_proj |
The data for projection (prediction), which needs to be projected using the trained model. It must contain the same auxiliary variables as the |
domain1 |
Domain variables for survey estimation (e.g., "province") |
domain2 |
Domain variables for survey estimation (e.g., "regency") |
psu |
Primary sampling units, representing the structure of the sampling frame in both the small and large survey datasets. |
ssu |
Secondary sampling units, representing the structure of the sampling frame in both the small and large survey datasets. |
strata |
Stratification variable in both the small and large survey datasets, ensuring that specific subgroups are represented. |
weight_proj |
Weights used in the |
weight_model |
Weights used in the |
split_ratio |
Proportion of data used for training (default is 0.8, meaning 80% for training and 20% for validation). |
metric |
The metric used for model evaluation (default is Accuracy, other options include "AUC", "F1", etc.). |
A list containing the following elements:
model
The trained Random Forest model.
importance
Feature importance showing which features contributed most to the model’s predictions.
train_accuracy
Accuracy of the model on the training set.
validation_accuracy
Accuracy of the model on the validation set.
validation_performance
Confusion matrix for the validation set, showing performance metrics like accuracy, precision, recall, etc.
data_proj
The projection data with predicted values.
Direct
Direct estimations for Domain 1, including estimated values, variance, and relative standard error.
Domain1_corrected_bias
Bias-corrected estimations for Domain 1, including estimated values, variance, and relative standard error.
Domain2_corrected_bias
Bias-corrected estimations for Domain 2, including estimated values, variance, and relative standard error.
Kim, J. K., & Rao, J. N. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika, 99(1), 85-100.
library(survey)
library(caret)
library(dplyr)
library(themis)
library(randomForest)

df_susenas_sep2020 <- df_susenas_sep2020 %>%
  left_join(
    df_susenas_mar2020 %>% select(psu, ssu, strata, no_sample, no_household),
    by = c('no_sample', 'no_household'),
    multiple = 'any'
  )

metadata <- df_susenas_sep2020 %>% select(1:6, 34:ncol(.))
df_sep20 <- df_susenas_sep2020 %>% select(7:33)
df_mar20 <- df_susenas_mar2020

# Example usage of the function:
proj_rf_nonbiased <- Projection_rf_CorrectedBias(
  metadata = metadata,
  data_model = df_sep20,
  target_column = "uses_public_transport",
  data_proj = df_mar20,
  domain1 = "province",
  domain2 = "regency",
  psu = "psu",
  ssu = "ssu",
  strata = "strata",
  weight_proj = "weight",
  weight_model = "weight_pnl",
  metric = "Accuracy"
)