dx extract_dataset (UKB) for R

R, UKB

Published

October 27, 2023

“dx extract_dataset” in R


As-Is Software Disclaimer

The content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.


This notebook demonstrates usage of the dx command extract_dataset for:

  • Retrieval of Apollo-stored data, as referenced within entities and fields of a Dataset or Cohort object on the platform
  • Retrieval of the underlying data dictionary files used to generate a Dataset object on the platform

MIT License applies to this notebook.

Preparing your environment

Launch spec:

  • App name: JupyterLab with Python, R, Stata, ML
  • Kernel: R
  • Instance type: Spark Cluster - mem1_ssd1_v2_x2, 3 nodes
  • Snapshot: /.Notebook_snapshots/jupyter_snapshot.gz
  • Cost: < $0.20
  • Runtime: ~10 min
  • Data description: Input for this notebook is a v3.0 Dataset or Cohort object ID

Install dxpy

extract_dataset requires dxpy version 0.329.0 or later. If you are running the command from your local environment (i.e., off of the DNAnexus platform), you may also need to install pandas, for example: pip3 install -U dxpy[pandas]

#| trusted: true
# Check which version of dxpy is installed
system("pip3 show dxpy", intern = TRUE)

Install tidyverse for data processing

Quick note - you will need to read the licenses for the tidyverse packages to confirm that you and your group are comfortable with their licensing terms.

#| trusted: true
#| eval: false
install.packages(c("readr", "stringr", "dplyr", "tidyr", "ggplot2", "glue", "reactable", "janitor", "gt", "remotes"))
remotes::install_github("laderast/xvhelper")

Import packages

#| trusted: true
library(dplyr)
library(readr)
library(stringr)
library(glue)
library(reactable)
library(xvhelper)

1. Assign environment variables

#| trusted: true
# The referenced Dataset is private and provided only to demonstrate an example input. You will need to supply a valid record ID for a dataset you have permission to access.

# Assign project-id of dataset
# In general, you should use the project id of your UKB project

# Assign dataset record-id
# record id can be either from a dataset or from a cohort
projectid <- "project-XXXXXX"
rid <- "record-G406j8j0x8kzxv3G08k64gVV"
# Assign joint dataset project-id:record-id
dataset <- glue::glue("{projectid}:{rid}")

2. Call “dx extract_dataset” using a supplied dataset

We’ll use the {glue} package to assemble our bash commands for dx extract_dataset, and system() to execute them.

glue::glue() saves us from paste()-ing strings together: variables are substituted directly into the template, which reads more cleanly.
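
For comparison, here’s the same command built both ways (illustrative only):

#| trusted: true
#| eval: false
# paste0() requires managing separators and spacing by hand:
paste0("dx extract_dataset ", projectid, ":", rid, " -ddd")
# glue() interpolates variables directly into the template:
glue::glue("dx extract_dataset {projectid}:{rid} -ddd")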

#| trusted: true
cmd <- glue::glue("dx extract_dataset {dataset} -ddd")

cmd

Let’s execute our command using system() and then list the resulting files using list.files(). The command generates three files in the current directory in JupyterLab storage:

  • dataset_name.codings.csv
  • dataset_name.data_dictionary.csv
  • dataset_name.entity_dictionary.csv

#| trusted: true
system(cmd)
list.files()

Preview data in the three dictionary (*.csv) files

#| trusted: true
codings_file <- list.files(pattern = "codings\\.csv$")
codings_df <- read_csv(codings_file, show_col_types = FALSE)
head(codings_df)
#| trusted: true
entity_dict_file <- list.files(pattern = "entity_dictionary\\.csv$")
entity_dict_df <- read_csv(entity_dict_file, show_col_types = FALSE)
head(entity_dict_df)

Understanding the Data Dictionary File

The data dictionary is the glue for the entire dataset. It maps:

  • Entity to Fields
  • Fields to Codings
  • Entity to Entity

We’ll use the data dictionary to understand how to build our list of fields, and later we’ll join it to the codings file to build a list of fields and their coded values.

There are more columns to the data dictionary, but let’s first see the entity, name, title, and type columns:

#| trusted: true
data_dict_file <- list.files(pattern = "data_dictionary\\.csv$")
data_dict_df <- read_csv(data_dict_file, show_col_types = FALSE)

data_dict_df %>%
        select(entity, name, title, type) %>%
        head()
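
Later, xvhelper::merge_coding_data_dict() will join the data dictionary to the codings file for us, but here’s a minimal sketch of the idea, assuming the codings file carries code and meaning columns alongside the shared coding_name column:

#| trusted: true
#| eval: false
# Join fields to their coded values via coding_name (illustrative sketch)
data_dict_df %>%
    inner_join(codings_df, by = "coding_name") %>%
    select(entity, name, title, code, meaning) %>%
    head()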

3. Parse returned metadata and extract entity/field names

Let’s search for some fields. We want the following fields:

  • Coffee type | Instance 0
  • Sex
  • Smoked cigarette or pipe within last hour | Instance 0
  • Age at recruitment
  • Diagnoses - main ICD10
  • Types of transport used

We can use the {reactable} package to make a searchable table of the data dictionary. This will help in finding fields.

Note the search box in the top right of the table - when we have many fields, we can use the search box to find fields of interest. Try searching for Coffee type and see what fields pop up.

#| trusted: true
data_dict_df <- data_dict_df %>%
    relocate(name, title) %>%
    mutate(ent_field = glue::glue("{entity}.{name}"))

basic_data_dict <- data_dict_df |>
                    select(title, name, entity, ent_field, coding_name, is_multi_select, is_sparse_coding)

reactable::reactable(basic_data_dict, searchable = TRUE)

Another strategy for searching fields: we can use grepl() within dplyr::filter() to find fields that match our criteria.

Note that we’re chaining the grepl() statements with OR (|).

We’ve also concatenated entity and name into a new variable, ent_field, which we’ll use when we specify our list of fields.

#| trusted: true
filtered_dict <- data_dict_df %>%
    filter(grepl("Coffee type", title) | 
           grepl("Sex", title) | 
           grepl("Smoked", title) | 
           grepl("Age at recruitment", title) |
           grepl("main ICD10", title) |
           grepl("Types of transport", title)
          ) %>%
    arrange(title) 

filtered_dict %>%
    select(name, title, ent_field)
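
An equivalent approach is a single regular expression with alternation, here via stringr::str_detect() (a sketch):

#| trusted: true
#| eval: false
# One regex with alternation instead of chained grepl() calls
data_dict_df %>%
    filter(str_detect(title, "Coffee type|Sex|Smoked|Age at recruitment|main ICD10|Types of transport")) %>%
    arrange(title)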

Let’s use this subset of fields - we’ll pull the ent_field column and paste it together into a single comma-delimited string using paste():

#| trusted: true
field_list <- filtered_dict %>%
    pull(ent_field)

field_list <- paste(field_list, collapse = ",")
# prepend the participant ID field so every row in the output is identifiable
field_list <- paste0("participant.eid,", field_list)
field_list

4. Use the extracted entity and field names as input to “dx extract_dataset” and extract the data

Again, we’ll use glue() here for cleaner string substitution.

We’ll extract the cohort information to a file called cohort_data.csv and work with this file for the rest of the notebook.

#| trusted: true
cohort_template <- "dx extract_dataset {dataset} --fields {field_list} -o cohort_data.csv"
cmd <- glue::glue(cohort_template)

cmd

system(cmd)

Preview data in the retrieved data file

We’ll see that the retrieved data contains integer and character codes rather than human-readable values. These must be decoded (see below):

#| trusted: true
data_df <- read_csv("cohort_data.csv", show_col_types = FALSE)
head(data_df)

Decoding columns with xvhelper

xvhelper is a small R package that decodes the raw coded values in the returned data into human-readable values.

To use it, you build a coded_col_df using merge_coding_data_dict(), translate the categorical columns to values using decode_single() and decode_multi_purrr(), and then change the column names to readable ones using decode_column_names().

Note that we need to decode the values before we run decode_column_names(), since the decoding functions match columns by their original names.

#| trusted: true
#| eval: false

#install via remotes::install_github()
#install.packages("remotes")
remotes::install_github("laderast/xvhelper")

library(xvhelper)

#| trusted: true
coded_col_df <- xvhelper::merge_coding_data_dict(coding_dict = codings_df, data_dict = data_dict_df)

decoded <- data_df %>%
    xvhelper::decode_single(coded_col_df) |>
    xvhelper::decode_multi_purrr(coded_col_df) |>
    xvhelper::decode_column_names(coded_col_df, r_clean_names = FALSE)
    
head(decoded)

library(ggplot2)

ggplot(decoded) + 
  aes(x=`Coffee type | Instance 0`) +
  geom_bar()

# Split the pipe-delimited ICD10 diagnoses into one row per diagnosis,
# keep the chapter-level entries, and count them
decoded |>
  dplyr::select(`Participant ID`, `Diagnoses - main ICD10`) |>
  tidyr::separate_rows(`Diagnoses - main ICD10`, sep="\\|") |>
  filter(grepl("Chapter", `Diagnoses - main ICD10`)) |>
  count(`Diagnoses - main ICD10`) |>
  arrange(desc(n)) |>
  gt::gt()

# The same chapter counts as a bar chart
decoded |>
  dplyr::select(`Participant ID`, `Diagnoses - main ICD10`) |>
  tidyr::separate_rows(`Diagnoses - main ICD10`, sep="\\|") |>
  filter(grepl("Chapter", `Diagnoses - main ICD10`)) |>
  ggplot() + aes(x=`Diagnoses - main ICD10`) + geom_bar() + 
  theme(axis.text.x = element_text(angle = 90))

#| trusted: true
#| eval: false
write.csv(decoded, file="cohort_decoded.csv")

Save Output to Project

#| trusted: true
#| eval: false
system("dx upload *.csv --destination /users/tladeras/")

Source: GitHub