Sample Cleaning Workflow
This document provides an overview of a sample workflow.
Start by importing data from Kobo or another web directory. You may skip this step if you have already downloaded the data.
Rename variables
The first important step is to rename the variables to follow the LCAS convention. For this, we provide a dictionary. If you use the LCAS templates, all variable names should already align with the standard variable names. If you specified your own variable names for data collection, you can map these to the LCAS standard inside the dictionary CSV file. Afterward, you can use this script to rename the variables.
#load dictionary
f1 <- "data/dictionary_with_questions.csv"
dict <- read.csv(f1)
#load survey data
f2 <- "data/india_rice_17_18.csv"
df <- read.csv(f2)
#rename variables
source("code/rename_lcas.R")
df <- rename_lcas(dict, df)
#write the data with standard variable names
write.csv(df,"outputs/lcas_renamed.csv")
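The internals of rename_lcas() are not shown here. As a rough illustration of dictionary-based renaming, the sketch below maps column names through a two-column lookup table. The function name rename_with_dict, the dictionary columns original_name and lcas_name, and the demo data are all assumptions made for this example; the real dictionary file may be laid out differently.

```r
# Hypothetical sketch: rename survey columns using a two-column dictionary.
# `original_name` and `lcas_name` are assumed column names, not necessarily
# those used in the real dictionary file.
rename_with_dict <- function(dict, df) {
  # position of each data column in the dictionary (NA if not listed)
  idx <- match(names(df), dict$original_name)
  listed <- !is.na(idx)
  # replace only the names that have a dictionary entry
  names(df)[listed] <- dict$lcas_name[idx[listed]]
  df
}

# made-up demo data
dict_demo <- data.frame(
  original_name = c("q1_dist", "q2_area"),
  lcas_name     = c("district", "plot_area"),
  stringsAsFactors = FALSE
)
df_demo <- data.frame(q1_dist = "Patna", q2_area = 2.5, notes = "ok")
df_demo <- rename_with_dict(dict_demo, df_demo)
names(df_demo)
```

Columns without a dictionary entry (here notes) keep their original names, so a partial dictionary does not drop or scramble data.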
Adding secondary data (e.g. climate, soil) requiring specific geo-locations
For many analyses it is useful to add secondary data, including socio-economic and bio-physical variables such as climate, population density, distance to markets, and many more. Since this requires precise GPS locations, it is best to run this script before anonymizing the data. However, because many variables vary little over distances as small as the anonymization offset, it may also be run afterwards.
Other household surveys routinely do this, and several R packages exist to download secondary data and add it to vector data. Here we primarily rely on Robert Hijmans’s ‘geodata’ package in R as well as the World Bank’s Living Standards Measurement Study (LSMS). The functions for adding these additional features are described in the R script file add_secondary_lcas.R.
Anonymization
Raw LCAS data are not safe to share because sharing them endangers the privacy of the respondents. To anonymize the data, we (i) remove the uniquely identifying columns, including name, father’s name, mobile number, and national ID number, and (ii) offset the locations of the GPS data points. Offsetting (instead of dropping) the GPS coordinates has the benefit that the data can still be used for spatial analytics without identifying specific farmers or fields.
Importantly, the variable names have to be standardized for the functions to work. The R code for anonymizing raw LCAS data can be found in anonymize_lcas.R.
#load LCAS data with standard variable names and secondary data
f <- "outputs/lcas_secondary.csv"
df <- read.csv(f)
#anonymize the data
source("code/anonymize_lcas.R")
df <- anonymize_lcas(df)
#write anonymized data to csv.
write.csv(df,"outputs/lcas_anonymized.csv")
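To make the two anonymization steps concrete, here is a minimal sketch that drops identifying columns and shifts each GPS point by a random amount. It is not the code in anonymize_lcas.R: the function name anonymize_sketch, the column names, the uniform offset, and its size of 0.02 degrees are assumptions chosen for illustration.

```r
# Hypothetical sketch of the two anonymization steps.
# Column names and the offset size are assumptions for illustration.
anonymize_sketch <- function(df, id_cols, lat = "latitude", lon = "longitude",
                             max_offset_deg = 0.02) {
  # (i) remove direct identifiers that are present in the data
  df <- df[, setdiff(names(df), id_cols), drop = FALSE]
  # (ii) add an independent random offset to each coordinate
  n <- nrow(df)
  df[[lat]] <- df[[lat]] + runif(n, -max_offset_deg, max_offset_deg)
  df[[lon]] <- df[[lon]] + runif(n, -max_offset_deg, max_offset_deg)
  df
}

set.seed(1)  # fixed seed for a reproducible demo only
# made-up demo data
raw <- data.frame(
  name = c("A", "B"), mobile_number = c("000", "111"),
  latitude = c(25.60, 25.70), longitude = c(85.10, 85.20),
  yield = c(4.2, 3.8)
)
anon <- anonymize_sketch(raw, id_cols = c("name", "mobile_number"))
names(anon)
```

In real use the seed should not be fixed and published, since anyone who knows it could reverse the offsets.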
Local land unit (LLU) conversions
Since most farmers use LLUs, the raw data are collected in these units together with an LLU conversion factor. Since most applications require land area in hectares (ha), we provide a function that converts all variables collected in LLU to ha.
The script can be found at code/calc_llu_to_ha.R
#read anonymized LCAS dataset
f <- "outputs/lcas_anonymized.csv"
df <- read.csv(f)
#convert variables collected in LLU to ha
source("code/calc_llu_to_ha.R")
df <- calc_llu_to_ha(df)
write.csv(df,"outputs/lcas_ha.csv")
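The conversion itself is a single multiplication per variable. The sketch below assumes that the conversion factor is stored as the number of hectares in one LLU; if the survey stores it the other way round (LLU per ha), the multiplication becomes a division. The column names plot_area_llu and llu_to_ha and the values are hypothetical.

```r
# Hypothetical columns: `plot_area_llu` is the plot area in local land units,
# `llu_to_ha` is the assumed number of hectares in one LLU.
plots <- data.frame(
  plot_area_llu = c(10, 4),
  llu_to_ha     = c(0.125, 0.25)
)
# area in ha = area in LLU * hectares per LLU
plots$plot_area_ha <- plots$plot_area_llu * plots$llu_to_ha
plots$plot_area_ha
```

Because the factor is stored per row, households that use different local units can be converted in the same step.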
Yield per ha
Most applications are interested in yield outcomes. Yields are generally calculated by dividing a farm's total production of a crop in a season by the area on which the crop was cultivated.
The script can be found at code/calc_yield.R
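As a minimal sketch of this division, assuming hypothetical columns production_kg and crop_area_ha, the example below also guards against zero or missing areas, which would otherwise produce Inf or NaN yields. Whether calc_yield() handles such cases in the same way is not stated here.

```r
# Hypothetical columns: total harvest in kg and cropped area in ha.
harvest <- data.frame(
  production_kg = c(5000, 1800),
  crop_area_ha  = c(1.25, 0)
)
# treat zero or missing area as NA instead of producing Inf
valid_area <- !is.na(harvest$crop_area_ha) & harvest$crop_area_ha > 0
harvest$yield_kg_ha <- ifelse(valid_area,
                              harvest$production_kg / harvest$crop_area_ha,
                              NA_real_)
harvest$yield_kg_ha
```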
#read LCAS dataset with land units converted to ha
f <- "outputs/lcas_ha.csv"
df <- read.csv(f)
#calculate yield per ha
source("code/calc_yield.R")
df <- calc_yield(df)
write.csv(df,"outputs/lcas_yield.csv")
Fertilizer Rates
Fertilizer application rates are calculated by multiplying each fertilizer input (basal application + top dressings) by the percentage of N, P, or K contained in that fertilizer and summing the results across fertilizers. The total nutrient inputs are then divided by the area of the field to express them on a per-ha basis.
Specifically, we use the following nutrient concentrations (N-P-K):
- Urea: (46-0-0)
- NPK: (12-32-16)
- DAP: (18-46-0)
- TSP: (0-45-0)
The script can be found at code/calc_fert_rate.R
#read LCAS dataset with yields
f <- "outputs/lcas_yield.csv"
df <- read.csv(f)
#calculate fertilizer rates
source("code/calc_fert_rate.R")
df <- calc_fert_rate(df)
write.csv(df,"outputs/lcas_fert.csv")
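The nutrient arithmetic described in this section can be sketched for a single plot as follows. The grades are the N-P-K concentrations listed above; the applied amounts, the plot area, and all object names are made up for illustration and do not reflect how calc_fert_rate() is written.

```r
# Nutrient concentrations (percent N, P, K) as listed in this section
grades <- rbind(
  urea = c(N = 46, P = 0,  K = 0),
  npk  = c(N = 12, P = 32, K = 16),
  dap  = c(N = 18, P = 46, K = 0),
  tsp  = c(N = 0,  P = 45, K = 0)
)
# Hypothetical input: kg of each product applied to one plot
# (basal application and top dressings already added up), plot area in ha
applied_kg <- c(urea = 100, npk = 0, dap = 50, tsp = 0)
area_ha <- 0.5

# kg of nutrient = sum over products of (kg product * percent / 100);
# the vector is recycled down the rows, so row i is scaled by applied_kg[i]
nutrient_kg <- colSums(grades[names(applied_kg), ] * applied_kg) / 100
# normalize by area to get kg of nutrient per ha
rate_kg_ha <- nutrient_kg / area_ha
rate_kg_ha
```

For this plot, 100 kg of urea and 50 kg of DAP supply 46 + 9 = 55 kg N and 23 kg P on 0.5 ha, that is 110 kg N and 46 kg P per ha.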
Convert dates to day of year and days since 1980-01-01
Most applications require that dates are saved in a numeric format. Although some functions and packages can handle variables formatted as dates, simple numeric formats are often required. The most straightforward way is to convert dates into the day of the year. However, when an analysis stretches across more than one calendar year, the counter restarting at 1 can cause issues. For this purpose we provide both the day of the year and the number of days since 1980-01-01, a standard practice.
Most importantly, we are looking for planting and harvesting dates.
The script can be found at code/calc_dates.R
#read LCAS dataset with fertilizer rates
f <- "outputs/lcas_fert.csv"
df <- read.csv(f)
#calculate day of year and days since 1980-01-01
#(sourcing this script attaches the lubridate package, which masks
#date, intersect, setdiff, and union from package:base)
source("code/calc_dates.R")
df <- calc_dates(df)
#write full data to csv.
write.csv(df,"outputs/lcas_full_vars.csv")
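The two numeric date formats can be illustrated with base R alone. calc_dates.R itself uses lubridate, so this is an equivalent sketch, not the script's code. The planting and harvest dates are invented to show why the day of the year becomes awkward when a season spans two calendar years.

```r
# Hypothetical planting and harvest dates of one crop season
planting <- as.Date("2017-11-20")
harvest  <- as.Date("2018-04-10")
origin   <- as.Date("1980-01-01")

# day of the year (1 to 366)
doy <- as.integer(format(c(planting, harvest), "%j"))
# days elapsed since 1980-01-01
days_1980 <- as.numeric(difftime(c(planting, harvest), origin, units = "days"))

# season length from each representation
doy[2] - doy[1]              # negative: the counter restarted at 1 in January
days_1980[2] - days_1980[1]  # true number of days between the two dates
```

The day-of-year difference is negative because the harvest falls in the next calendar year, while the days-since-1980 difference gives the real season length of 141 days.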
Reshape wide dataframe to long
Many surveys, such as LSMS and DHS, store their datasets in long format, split into modules. One of the advantages of the LCAS is that it is relatively simple: it normally surveys only one plot (the largest one) per household. This makes it easy to use in analysis workflows that normally require data in wide format. It is also easier to handle for researchers with less experience with complicated relational databases.
The standard data format for LCAS is therefore the wide format. For convenience, we provide reshaping scripts that convert the wide-format LCAS data into separate modules in long format. This might be helpful for compatibility with other surveys and if researchers seek to collect data for multiple plots per household.
The code is stored in data_shaping.R.
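As a small illustration of a wide-to-long conversion, the sketch below uses base R's reshape() on a made-up fertilizer module with two top-dressing columns per household. The column names (hhid, topdress_kg_1, topdress_kg_2) are assumptions, and data_shaping.R may use a different approach or package.

```r
# Hypothetical wide data: one row per household, one column per top dressing
wide <- data.frame(
  hhid          = c(101, 102),
  topdress_kg_1 = c(50, 40),
  topdress_kg_2 = c(25, 0)
)

# stack the two top-dressing columns into one row per household and application
long <- reshape(
  wide,
  direction = "long",
  varying   = c("topdress_kg_1", "topdress_kg_2"),
  v.names   = "topdress_kg",
  timevar   = "application",
  times     = 1:2,
  idvar     = "hhid"
)

# sort by household and application, and drop the generated row names
long <- long[order(long$hhid, long$application), ]
rownames(long) <- NULL
long
```

The result has one row per household and application, which is the shape typically expected when combining the data with module-based surveys.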