Title: | An Interpretable Machine Learning-Based Automatic Clinical Score Generator |
---|---|
Description: | A novel interpretable machine learning-based framework to automate the development of a clinical scoring model for predefined outcomes. Our novel framework consists of six modules: variable ranking with machine learning, variable transformation, score derivation, model selection, domain knowledge-based score fine-tuning, and performance evaluation. The details are described in our research paper <doi:10.2196/21798>. Users or clinicians can seamlessly generate parsimonious sparse-score risk models (i.e., risk scores), which can be easily implemented and validated in clinical practice. We hope to see its application in various medical case studies. |
Authors: | Feng Xie [aut, cre] |
Maintainer: | Feng Xie <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.0 |
Built: | 2025-03-02 03:34:57 UTC |
Source: | https://github.com/nliulab/autoscore |
Internal Function: Add baselines after second-step logistic regression (part of AutoScore Module 3)
add_baseline(df, coef_vec)
df |
A data.frame used for the logistic regression |
coef_vec |
Generated from logistic regression |
Processed vector for generating the scoring table
Internal Function: Automatically assign scores to each subject given a new dataset and a scoring table (used for intermediate and final evaluation)
assign_score(df, score_table)
df |
A data.frame containing the data to be scored (same format as the training set) |
score_table |
A scoring table generated from the training set |
Processed data.frame with assigned scores for each variable
AutoScore STEP(iv): Fine-tune the initial score by revising cut_vec with domain knowledge (AutoScore Module 5)
Domain knowledge is essential in guiding risk model development. For continuous variables, variable transformation is a data-driven process (based on "quantile" or "kmeans"). In this step, the automatically generated cut-off values for each continuous variable can be fine-tuned by combining, rounding, and adjusting them according to standard clinical norms. The revised cut_vec, informed by domain knowledge, is then used to update the scoring table. Users can choose any cut-off values and any number of categories; the final scoring table is then generated. Run vignette("Guide_book", package = "AutoScore")
to see the guidebook or vignette.
AutoScore_fine_tuning( train_set, validation_set, final_variables, cut_vec, max_score = 100, metrics_ci = FALSE )
train_set |
A processed data.frame that contains data to be analyzed, for training |
validation_set |
A processed data.frame that contains data for validation purposes |
final_variables |
A vector containing the list of selected variables, selected from STEP(ii) |
cut_vec |
Generated from STEP(iii) |
max_score |
Maximum total score (Default: 100). |
metrics_ci |
Whether to calculate confidence intervals for metrics such as sensitivity and specificity. |
Generated final table of scoring model for downstream testing
Xie F, Chakraborty B, Ong MEH, Goldstein BA, Liu N. AutoScore: A Machine Learning-Based Automatic Clinical Score Generator and Its Application to Mortality Prediction Using Electronic Health Records. JMIR Medical Informatics 2020;8(10):e21798
AutoScore_rank
, AutoScore_parsimony
, AutoScore_weighting
, AutoScore_testing
,Run vignette("Guide_book", package = "AutoScore")
to see the guidebook or vignette.
## Please see the guidebook or vignettes
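A minimal sketch of STEPS (iii)-(iv), assuming train_set, validation_set, and ranking were created as in the AutoScore_parsimony example elsewhere in this reference; the choice of 6 variables and the revised cut-offs for Age are illustrative assumptions only, not recommended values:
## Not run: 
num_var <- 6
final_variables <- names(ranking[1:num_var])
# STEP(iii): obtain the data-driven cut_vec
cut_vec <- AutoScore_weighting(train_set, validation_set, final_variables,
                               max_score = 100)
# Fine-tune cut-offs with domain knowledge (hypothetical clinical cut-offs)
cut_vec$Age <- c(50, 75, 90)
# STEP(iv): update the scoring table with the revised cut_vec
scoring_table <- AutoScore_fine_tuning(train_set, validation_set,
                                       final_variables, cut_vec,
                                       max_score = 100)
## End(Not run)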
AutoScore STEP(iv) for ordinal outcomes: Fine-tune the initial score by revising cut_vec with domain knowledge (AutoScore Module 5)
Domain knowledge is essential in guiding risk model development. For continuous variables, variable transformation is a data-driven process (based on "quantile" or "kmeans"). In this step, the automatically generated cut-off values for each continuous variable can be fine-tuned by combining, rounding, and adjusting them according to standard clinical norms. The revised cut_vec, informed by domain knowledge, is then used to update the scoring table. Users can choose any cut-off values and any number of categories; the final scoring table is then generated. Run vignette("Guide_book", package = "AutoScore")
to see the guidebook or vignette.
AutoScore_fine_tuning_Ordinal( train_set, validation_set, final_variables, link = "logit", cut_vec, max_score = 100, n_boot = 100, report_cindex = FALSE )
train_set |
A processed data.frame that contains data to be analyzed, for training |
validation_set |
A processed data.frame that contains data for validation purposes |
final_variables |
A vector containing the list of selected variables, selected from STEP(ii) |
link |
The link function used to model ordinal outcomes. Default is "logit" for the proportional odds model; other options are "cloglog" and "probit". |
cut_vec |
Generated from STEP(iii) |
max_score |
Maximum total score (Default: 100). |
n_boot |
Number of bootstrap cycles to compute 95% CI for performance metrics. |
report_cindex |
Whether to report the generalized c-index for model evaluation (Default: FALSE for faster evaluation). |
Generated final table of scoring model for downstream testing
Saffari SE, Ning Y, Feng X, Chakraborty B, Volovici V, Vaughan R, Ong ME, Liu N, AutoScore-Ordinal: An interpretable machine learning framework for generating scoring models for ordinal outcomes, arXiv:2202.08407
AutoScore_rank_Ordinal
,
AutoScore_parsimony_Ordinal
,
AutoScore_weighting_Ordinal
,
AutoScore_testing_Ordinal
.
## Please see the guidebook or vignettes
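A minimal sketch of the same fine-tuning step for ordinal outcomes, assuming train_set, validation_set, and ranking follow the AutoScore_parsimony_Ordinal example; the number of variables and the revised cut-offs for a hypothetical continuous variable Age are illustrative only:
## Not run: 
num_var <- 6
final_variables <- names(ranking[1:num_var])
cut_vec <- AutoScore_weighting_Ordinal(train_set, validation_set,
                                       final_variables, link = "logit",
                                       max_score = 100)
# Illustrative revision of the data-driven cut-offs for one variable
cut_vec$Age <- c(50, 75, 90)
scoring_table <- AutoScore_fine_tuning_Ordinal(train_set, validation_set,
                                               final_variables,
                                               link = "logit",
                                               cut_vec = cut_vec,
                                               max_score = 100)
## End(Not run)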
AutoScore STEP(iv) for survival outcomes: Fine-tune the initial score by revising cut_vec with domain knowledge (AutoScore Module 5)
Domain knowledge is essential in guiding risk model development. For continuous variables, variable transformation is a data-driven process (based on "quantile" or "kmeans"). In this step, the automatically generated cut-off values for each continuous variable can be fine-tuned by combining, rounding, and adjusting them according to standard clinical norms. The revised cut_vec, informed by domain knowledge, is then used to update the scoring table. Users can choose any cut-off values and any number of categories; the final scoring table is then generated. Run vignette("Guide_book", package = "AutoScore")
to see the guidebook or vignette.
AutoScore_fine_tuning_Survival( train_set, validation_set, final_variables, cut_vec, max_score = 100, time_point = c(1, 3, 7, 14, 30, 60, 90) )
train_set |
A processed data.frame that contains data to be analyzed, for training |
validation_set |
A processed data.frame that contains data for validation purposes |
final_variables |
A vector containing the list of selected variables, selected from STEP(ii) |
cut_vec |
Generated from STEP(iii) |
max_score |
Maximum total score (Default: 100). |
time_point |
The time points to be evaluated using time-dependent AUC(t). |
Generated final table of scoring model for downstream testing
Xie F, Ning Y, Yuan H, et al. AutoScore-Survival: Developing interpretable machine learning-based time-to-event scores with right-censored survival data. J Biomed Inform. 2022;125:103959. doi:10.1016/j.jbi.2021.103959
AutoScore_rank_Survival
,
AutoScore_parsimony_Survival
,
AutoScore_weighting_Survival
,
AutoScore_testing_Survival
.
## Please see the guidebook or vignettes
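A minimal sketch of the survival variant, assuming train_set, validation_set, and ranking follow the AutoScore_parsimony_Survival example; the variable count and revised cut-offs are illustrative only:
## Not run: 
num_var <- 6
final_variables <- names(ranking[1:num_var])
cut_vec <- AutoScore_weighting_Survival(train_set, validation_set,
                                        final_variables, max_score = 100,
                                        time_point = c(1, 3, 7, 14, 30, 60, 90))
cut_vec$Age <- c(50, 75, 90)  # hypothetical variable and cut-offs
scoring_table <- AutoScore_fine_tuning_Survival(train_set, validation_set,
                                                final_variables, cut_vec,
                                                max_score = 100,
                                                time_point = c(1, 3, 7, 14, 30, 60, 90))
## End(Not run)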
AutoScore STEP(ii): Select the best model with parsimony plot (AutoScore Modules 2+3+4)
AutoScore_parsimony( train_set, validation_set, rank, max_score = 100, n_min = 1, n_max = 20, cross_validation = FALSE, fold = 10, categorize = "quantile", quantiles = c(0, 0.05, 0.2, 0.8, 0.95, 1), max_cluster = 5, do_trace = FALSE, auc_lim_min = 0.5, auc_lim_max = "adaptive" )
train_set |
A processed data.frame that contains data to be analyzed, for training |
validation_set |
A processed data.frame that contains data for validation purposes |
rank |
The ranking result generated from AutoScore STEP(i) (AutoScore_rank) |
max_score |
Maximum total score (Default: 100). |
n_min |
Minimum number of selected variables (Default: 1). |
n_max |
Maximum number of selected variables (Default: 20). |
cross_validation |
If set to TRUE, cross-validation is used to generate the parsimony plot, which is suitable for small datasets (Default: FALSE). |
fold |
The number of folds used in cross-validation (Default: 10). Available if cross_validation = TRUE. |
categorize |
Method for categorizing continuous variables. Options include "quantile" or "kmeans" (Default: "quantile"). |
quantiles |
Predefined quantiles to convert continuous variables to categorical ones (Default: c(0, 0.05, 0.2, 0.8, 0.95, 1)). Available if categorize = "quantile". |
max_cluster |
The maximum number of clusters (Default: 5). Available if categorize = "kmeans". |
do_trace |
If set to TRUE, results based on each fold of cross-validation are printed out and plotted (Default: FALSE). Available if cross_validation = TRUE. |
auc_lim_min |
Minimum y-axis limit in the parsimony plot (Default: 0.5). |
auc_lim_max |
Maximum y-axis limit in the parsimony plot (Default: "adaptive"). |
This is the second step of the general AutoScore workflow. It generates the parsimony plot to help select a parsimonious model. In this step, AutoScore Modules 2, 3, and 4 are run multiple times to evaluate performance under different variable lists. The generated parsimony plot gives researchers an intuitive figure for choosing the best model. If the dataset is small (e.g., <5000 samples), an independent validation set may not be a wise choice; in that case, we suggest using cross-validation to maximize the utility of the data by setting cross_validation = TRUE. Run vignette("Guide_book", package = "AutoScore")
to see the guidebook or vignette.
List of AUC values for different numbers of variables
Xie F, Chakraborty B, Ong MEH, Goldstein BA, Liu N, AutoScore: A Machine Learning-Based Automatic Clinical Score Generator and Its Application to Mortality Prediction Using Electronic Health Records, JMIR Med Inform 2020;8(10):e21798, doi: 10.2196/21798
AutoScore_rank
, AutoScore_weighting
, AutoScore_fine_tuning
, AutoScore_testing
, Run vignette("Guide_book", package = "AutoScore")
to see the guidebook or vignette.
# see AutoScore Guidebook for the whole 5-step workflow
data("sample_data")
names(sample_data)[names(sample_data) == "Mortality_inpatient"] <- "label"
out_split <- split_data(data = sample_data, ratio = c(0.7, 0.1, 0.2))
train_set <- out_split$train_set
validation_set <- out_split$validation_set
ranking <- AutoScore_rank(train_set, ntree = 100)
AUC <- AutoScore_parsimony(
  train_set, validation_set, rank = ranking,
  max_score = 100, n_min = 1, n_max = 20,
  categorize = "quantile",
  quantiles = c(0, 0.05, 0.2, 0.8, 0.95, 1)
)
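For small datasets, the cross-validation variant described above can be used instead of relying on an independent validation set; a minimal sketch, assuming the objects from the example above (the fold count is illustrative):
## Not run: 
AUC_cv <- AutoScore_parsimony(
  train_set, validation_set, rank = ranking,
  max_score = 100, n_min = 1, n_max = 20,
  cross_validation = TRUE, fold = 10, do_trace = FALSE
)
## End(Not run)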
AutoScore STEP(ii) for ordinal outcomes: Select the best model with parsimony plot (AutoScore Modules 2+3+4)
AutoScore_parsimony_Ordinal( train_set, validation_set, rank, link = "logit", max_score = 100, n_min = 1, n_max = 20, cross_validation = FALSE, fold = 10, categorize = "quantile", quantiles = c(0, 0.05, 0.2, 0.8, 0.95, 1), max_cluster = 5, do_trace = FALSE, auc_lim_min = 0.5, auc_lim_max = "adaptive" )
train_set |
A processed data.frame that contains data to be analyzed, for training |
validation_set |
A processed data.frame that contains data for validation purposes |
rank |
The ranking result generated from AutoScore STEP(i) for ordinal outcomes (AutoScore_rank_Ordinal) |
link |
The link function used to model ordinal outcomes. Default is "logit" for the proportional odds model; other options are "cloglog" and "probit". |
max_score |
Maximum total score (Default: 100). |
n_min |
Minimum number of selected variables (Default: 1). |
n_max |
Maximum number of selected variables (Default: 20). |
cross_validation |
If set to TRUE, cross-validation is used to generate the parsimony plot, which is suitable for small datasets (Default: FALSE). |
fold |
The number of folds used in cross-validation (Default: 10). Available if cross_validation = TRUE. |
categorize |
Method for categorizing continuous variables. Options include "quantile" or "kmeans" (Default: "quantile"). |
quantiles |
Predefined quantiles to convert continuous variables to categorical ones (Default: c(0, 0.05, 0.2, 0.8, 0.95, 1)). Available if categorize = "quantile". |
max_cluster |
The maximum number of clusters (Default: 5). Available if categorize = "kmeans". |
do_trace |
If set to TRUE, results based on each fold of cross-validation are printed out and plotted (Default: FALSE). Available if cross_validation = TRUE. |
auc_lim_min |
Minimum y-axis limit in the parsimony plot (Default: 0.5). |
auc_lim_max |
Maximum y-axis limit in the parsimony plot (Default: "adaptive"). |
This is the second step of the general AutoScore workflow for ordinal outcomes. It generates the parsimony plot to help select a parsimonious model. In this step, AutoScore Modules 2, 3, and 4 are run multiple times to evaluate performance under different variable lists. The generated parsimony plot gives researchers an intuitive figure for choosing the best model. If the dataset is small (e.g., <5000 samples), an independent validation set may not be a wise choice; in that case, we suggest using cross-validation to maximize the utility of the data by setting cross_validation = TRUE.
List of mAUC values (i.e., the average AUC of dichotomous classifications) for different numbers of variables
Saffari SE, Ning Y, Feng X, Chakraborty B, Volovici V, Vaughan R, Ong ME, Liu N, AutoScore-Ordinal: An interpretable machine learning framework for generating scoring models for ordinal outcomes, arXiv:2202.08407
AutoScore_rank_Ordinal
,
AutoScore_weighting_Ordinal
,
AutoScore_fine_tuning_Ordinal
,
AutoScore_testing_Ordinal
.
## Not run: 
# see AutoScore-Ordinal Guidebook for the whole 5-step workflow
data("sample_data_ordinal") # Output is named `label`
out_split <- split_data(data = sample_data_ordinal, ratio = c(0.7, 0.1, 0.2))
train_set <- out_split$train_set
validation_set <- out_split$validation_set
ranking <- AutoScore_rank_Ordinal(train_set, ntree = 100)
mAUC <- AutoScore_parsimony_Ordinal(
  train_set = train_set, validation_set = validation_set, rank = ranking,
  max_score = 100, n_min = 1, n_max = 20,
  categorize = "quantile",
  quantiles = c(0, 0.05, 0.2, 0.8, 0.95, 1)
)
## End(Not run)
AutoScore STEP(ii) for survival outcomes: Select the best model with parsimony plot (AutoScore Modules 2+3+4)
AutoScore_parsimony_Survival( train_set, validation_set, rank, max_score = 100, n_min = 1, n_max = 20, cross_validation = FALSE, fold = 10, categorize = "quantile", quantiles = c(0, 0.05, 0.2, 0.8, 0.95, 1), max_cluster = 5, do_trace = FALSE, auc_lim_min = 0.5, auc_lim_max = "adaptive" )
train_set |
A processed data.frame that contains data to be analyzed, for training |
validation_set |
A processed data.frame that contains data for validation purposes |
rank |
The ranking result generated from AutoScore STEP(i) for survival outcomes (AutoScore_rank_Survival) |
max_score |
Maximum total score (Default: 100). |
n_min |
Minimum number of selected variables (Default: 1). |
n_max |
Maximum number of selected variables (Default: 20). |
cross_validation |
If set to TRUE, cross-validation is used to generate the parsimony plot, which is suitable for small datasets (Default: FALSE). |
fold |
The number of folds used in cross-validation (Default: 10). Available if cross_validation = TRUE. |
categorize |
Method for categorizing continuous variables. Options include "quantile" or "kmeans" (Default: "quantile"). |
quantiles |
Predefined quantiles to convert continuous variables to categorical ones (Default: c(0, 0.05, 0.2, 0.8, 0.95, 1)). Available if categorize = "quantile". |
max_cluster |
The maximum number of clusters (Default: 5). Available if categorize = "kmeans". |
do_trace |
If set to TRUE, results based on each fold of cross-validation are printed out and plotted (Default: FALSE). Available if cross_validation = TRUE. |
auc_lim_min |
Minimum y-axis limit in the parsimony plot (Default: 0.5). |
auc_lim_max |
Maximum y-axis limit in the parsimony plot (Default: "adaptive"). |
This is the second step of the general AutoScore-Survival workflow for survival outcomes. It generates the parsimony plot to help select a parsimonious model. In this step, AutoScore-Survival Modules 2, 3, and 4 are run multiple times to evaluate performance under different variable lists. The generated parsimony plot gives researchers an intuitive figure for choosing the best model. If the dataset is small (e.g., <5000 samples), an independent validation set may not be a wise choice; in that case, we suggest using cross-validation to maximize the utility of the data by setting cross_validation = TRUE.
List of iAUC values (i.e., the integrated AUC under a time-dependent AUC curve) for different numbers of variables
Xie F, Ning Y, Yuan H, et al. AutoScore-Survival: Developing interpretable machine learning-based time-to-event scores with right-censored survival data. J Biomed Inform. 2022;125:103959. doi:10.1016/j.jbi.2021.103959
AutoScore_rank_Survival
,
AutoScore_weighting_Survival
,
AutoScore_fine_tuning_Survival
,
AutoScore_testing_Survival
.
## Not run: 
# see AutoScore-Survival Guidebook for the whole 5-step workflow
data("sample_data_survival")
out_split <- split_data(data = sample_data_survival, ratio = c(0.7, 0.1, 0.2))
train_set <- out_split$train_set
validation_set <- out_split$validation_set
ranking <- AutoScore_rank_Survival(train_set, ntree = 10)
iAUC <- AutoScore_parsimony_Survival(
  train_set = train_set, validation_set = validation_set, rank = ranking,
  max_score = 100, n_min = 1, n_max = 20,
  categorize = "quantile",
  quantiles = c(0, 0.05, 0.2, 0.8, 0.95, 1)
)
## End(Not run)
AutoScore STEP(i): Rank variables with machine learning (AutoScore Module 1)
AutoScore_rank(train_set, validation_set = NULL, method = "rf", ntree = 100)
train_set |
A processed data.frame that contains data to be analyzed, for training |
validation_set |
A processed data.frame used for validation; required when method = "auc" and ignored when method = "rf" (Default: NULL) |
method |
Method for ranking. Options: 1. 'rf' - random forest (default); 2. 'auc' - AUC-based (requires a validation set). For "auc", univariable models are built on the training set, and the variable ranking is constructed from the AUC performance of the corresponding univariable models on the validation set ('validation_set'). |
ntree |
Number of trees in the random forest (Default: 100). |
The first step in the AutoScore framework is variable ranking. We use random forest (RF), an ensemble machine learning algorithm, to identify the top-ranking predictors for subsequent score generation. This step corresponds to Module 1 in the AutoScore paper.
Returns a vector containing the list of variables and their rankings generated by machine learning (random forest)
Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32
Xie F, Chakraborty B, Ong MEH, Goldstein BA, Liu N. AutoScore: A Machine Learning-Based Automatic Clinical Score Generator and Its Application to Mortality Prediction Using Electronic Health Records. JMIR Medical Informatics 2020;8(10):e21798
AutoScore_parsimony
, AutoScore_weighting
, AutoScore_fine_tuning
, AutoScore_testing
, Run vignette("Guide_book", package = "AutoScore")
to see the guidebook or vignette.
# see AutoScore Guidebook for the whole 5-step workflow
data("sample_data")
names(sample_data)[names(sample_data) == "Mortality_inpatient"] <- "label"
ranking <- AutoScore_rank(sample_data, ntree = 50)
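A sketch of the AUC-based ranking option described above, which requires a validation set; it assumes sample_data has been prepared as in the example above, and the split ratio is illustrative:
## Not run: 
out_split <- split_data(data = sample_data, ratio = c(0.7, 0.1, 0.2))
ranking_auc <- AutoScore_rank(train_set = out_split$train_set,
                              validation_set = out_split$validation_set,
                              method = "auc")
## End(Not run)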
AutoScore STEP (i) for ordinal outcomes: Generate variable ranking list by machine learning (AutoScore Module 1)
AutoScore_rank_Ordinal(train_set, ntree = 100)
train_set |
A processed data.frame that contains data to be analyzed, for training |
ntree |
Number of trees in the random forest (Default: 100). |
The first step in the AutoScore framework is variable ranking. We use random forest (RF) for multiclass classification to identify the top-ranking predictors for subsequent score generation. This step corresponds to Module 1 in the AutoScore-Ordinal paper.
Returns a vector containing the list of variables and their rankings generated by machine learning (random forest)
Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32
Saffari SE, Ning Y, Feng X, Chakraborty B, Volovici V, Vaughan R, Ong ME, Liu N, AutoScore-Ordinal: An interpretable machine learning framework for generating scoring models for ordinal outcomes, arXiv:2202.08407
AutoScore_parsimony_Ordinal
,
AutoScore_weighting_Ordinal
,
AutoScore_fine_tuning_Ordinal
,
AutoScore_testing_Ordinal
.
## Not run: 
# see AutoScore-Ordinal Guidebook for the whole 5-step workflow
data("sample_data_ordinal") # Output is named `label`
ranking <- AutoScore_rank_Ordinal(sample_data_ordinal, ntree = 50)
## End(Not run)
AutoScore STEP(i) for survival outcomes: Generate variable ranking list by machine learning (random survival forest) (AutoScore Module 1)
AutoScore_rank_Survival(train_set, ntree = 50)
train_set |
A processed data.frame that contains data to be analyzed, for training |
ntree |
Number of trees in the random survival forest (Default: 50). |
The first step in the AutoScore framework is variable ranking. We use Random Survival Forest (RSF) for survival outcomes to identify the top-ranking predictors for subsequent score generation. This step corresponds to Module 1 in the AutoScore-Survival paper.
Returns a vector containing the list of variables and their rankings generated by machine learning (random survival forest)
Ishwaran, H., Kogalur, U. B., Blackstone, E. H., & Lauer, M. S. (2008). Random survival forests. The annals of applied statistics, 2(3), 841-860.
Xie F, Ning Y, Yuan H, et al. AutoScore-Survival: Developing interpretable machine learning-based time-to-event scores with right-censored survival data. J Biomed Inform. 2022;125:103959. doi:10.1016/j.jbi.2021.103959
AutoScore_parsimony_Survival
,
AutoScore_weighting_Survival
,
AutoScore_fine_tuning_Survival
,
AutoScore_testing_Survival
.
## Not run: 
# see AutoScore-Survival Guidebook for the whole 5-step workflow
data("sample_data_survival") # Output is named `label_time` and `label_status`
ranking <- AutoScore_rank_Survival(sample_data_survival, ntree = 50)
## End(Not run)
AutoScore STEP(v): Evaluate the final score with ROC analysis (AutoScore Module 6)
AutoScore_testing( test_set, final_variables, cut_vec, scoring_table, threshold = "best", with_label = TRUE, metrics_ci = TRUE )
test_set |
A processed data.frame that contains data for testing purposes. This data.frame should have the same format as train_set (same variable names and outcomes) |
final_variables |
A vector containing the list of selected variables, selected from STEP(ii) |
cut_vec |
Generated from STEP(iii) |
scoring_table |
The final scoring table after fine-tuning, generated from STEP(iv) |
threshold |
Score threshold for the ROC analysis to generate sensitivity, specificity, etc. If set to "best", the optimal threshold will be calculated (Default: "best"). |
with_label |
Set to TRUE if there are labels in the test_set, and performance will be evaluated accordingly (Default: TRUE). Set to FALSE if there is no "label" column in the test_set; the final predicted scores will then be output without performance evaluation. |
metrics_ci |
Whether to calculate confidence intervals for metrics such as sensitivity and specificity. |
A data frame with predicted score and the outcome for downstream visualization.
Xie F, Chakraborty B, Ong MEH, Goldstein BA, Liu N. AutoScore: A Machine Learning-Based Automatic Clinical Score Generator and Its Application to Mortality Prediction Using Electronic Health Records. JMIR Medical Informatics 2020;8(10):e21798
AutoScore_rank
, AutoScore_parsimony
, AutoScore_weighting
, AutoScore_fine_tuning
, print_roc_performance
, Run vignette("Guide_book", package = "AutoScore")
to see the guidebook or vignette.
## Please see the guidebook or vignettes
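A minimal sketch of STEP(v), assuming final_variables, cut_vec, and scoring_table were produced in STEPS (ii)-(iv) and that test_set comes from split_data as in the earlier examples:
## Not run: 
pred_score <- AutoScore_testing(test_set = out_split$test_set,
                                final_variables = final_variables,
                                cut_vec = cut_vec,
                                scoring_table = scoring_table,
                                threshold = "best", with_label = TRUE)
head(pred_score)
## End(Not run)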
AutoScore STEP(v) for ordinal outcomes: Evaluate the final score (AutoScore Module 6)
AutoScore_testing_Ordinal( test_set, final_variables, link = "logit", cut_vec, scoring_table, with_label = TRUE, n_boot = 100 )
test_set |
A processed data.frame that contains data for testing purposes. This data.frame should have the same format as train_set (same variable names and outcomes) |
final_variables |
A vector containing the list of selected variables, selected from STEP(ii) |
link |
The link function used to model ordinal outcomes. Default is "logit" for the proportional odds model; other options are "cloglog" and "probit". |
cut_vec |
Generated from STEP(iii) |
scoring_table |
The final scoring table after fine-tuning, generated from STEP(iv) |
with_label |
Set to TRUE if there are labels in the test_set, and performance will be evaluated accordingly (Default: TRUE). |
n_boot |
Number of bootstrap cycles to compute 95% CI for performance metrics. |
A data frame with predicted score and the outcome for downstream visualization.
Saffari SE, Ning Y, Feng X, Chakraborty B, Volovici V, Vaughan R, Ong ME, Liu N, AutoScore-Ordinal: An interpretable machine learning framework for generating scoring models for ordinal outcomes, arXiv:2202.08407
AutoScore_rank_Ordinal
,
AutoScore_parsimony_Ordinal
,
AutoScore_weighting_Ordinal
,
AutoScore_fine_tuning_Ordinal
.
## Please see the guidebook or vignettes
AutoScore STEP(v) for survival outcomes: Evaluate the final score with ROC analysis (AutoScore Module 6)
AutoScore_testing_Survival( test_set, final_variables, cut_vec, scoring_table, threshold = "best", with_label = TRUE, time_point = c(1, 3, 7, 14, 30, 60, 90) )
test_set |
A processed data.frame that contains data for testing purposes. This data.frame should have the same format as train_set (same variable names and outcomes) |
final_variables |
A vector containing the list of selected variables, selected from STEP(ii) |
cut_vec |
Generated from STEP(iii) |
scoring_table |
The final scoring table after fine-tuning, generated from STEP(iv) |
threshold |
Score threshold for the ROC analysis to generate sensitivity, specificity, etc. If set to "best", the optimal threshold will be calculated (Default: "best"). |
with_label |
Set to TRUE if there are labels ('label_time' and 'label_status') in the test_set, and performance will be evaluated accordingly (Default: TRUE). |
time_point |
The time points to be evaluated using time-dependent AUC(t). |
A data frame with predicted score and the outcome for downstream visualization.
Xie F, Ning Y, Yuan H, et al. AutoScore-Survival: Developing interpretable machine learning-based time-to-event scores with right-censored survival data. J Biomed Inform. 2022;125:103959. doi:10.1016/j.jbi.2021.103959
AutoScore_rank_Survival
,
AutoScore_parsimony_Survival
,
AutoScore_weighting_Survival
,
AutoScore_fine_tuning_Survival
.
## Please see the guidebook or vignettes
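A minimal sketch of STEP(v) for survival outcomes, assuming final_variables, cut_vec, and scoring_table were produced in the preceding survival steps and test_set comes from split_data; the time points mirror the function default:
## Not run: 
pred_score <- AutoScore_testing_Survival(test_set = out_split$test_set,
                                         final_variables = final_variables,
                                         cut_vec = cut_vec,
                                         scoring_table = scoring_table,
                                         threshold = "best", with_label = TRUE,
                                         time_point = c(1, 3, 7, 14, 30, 60, 90))
## End(Not run)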
AutoScore STEP(iii): Generate the initial score with the final list of variables (Re-run AutoScore Modules 2+3)
AutoScore_weighting( train_set, validation_set, final_variables, max_score = 100, categorize = "quantile", max_cluster = 5, quantiles = c(0, 0.05, 0.2, 0.8, 0.95, 1), metrics_ci = FALSE )
train_set |
A processed data.frame that contains data to be analyzed, for training |
validation_set |
A processed data.frame that contains data for validation purposes |
final_variables |
A vector containing the list of selected variables, selected from STEP(ii) |
max_score |
Maximum total score (Default: 100). |
categorize |
Method for categorizing continuous variables. Options include "quantile" or "kmeans" (Default: "quantile"). |
max_cluster |
The maximum number of clusters (Default: 5). Available if categorize = "kmeans". |
quantiles |
Predefined quantiles to convert continuous variables to categorical ones (Default: c(0, 0.05, 0.2, 0.8, 0.95, 1)). Available if categorize = "quantile". |
metrics_ci |
Whether to calculate confidence intervals for metrics such as sensitivity and specificity. |
Generated cut_vec for the downstream fine-tuning process in STEP(iv) (AutoScore_fine_tuning).
Xie F, Chakraborty B, Ong MEH, Goldstein BA, Liu N. AutoScore: A Machine Learning-Based Automatic Clinical Score Generator and Its Application to Mortality Prediction Using Electronic Health Records. JMIR Medical Informatics 2020;8(10):e21798
AutoScore_rank
, AutoScore_parsimony
, AutoScore_fine_tuning
, AutoScore_testing
, Run vignette("Guide_book", package = "AutoScore")
to see the guidebook or vignette.
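A minimal sketch of STEP(iii), assuming train_set, validation_set, and ranking follow the AutoScore_parsimony example, and that choosing 6 variables from the parsimony plot is an illustrative decision:
## Not run: 
num_var <- 6
final_variables <- names(ranking[1:num_var])
cut_vec <- AutoScore_weighting(
  train_set = train_set, validation_set = validation_set,
  final_variables = final_variables, max_score = 100,
  categorize = "quantile", quantiles = c(0, 0.05, 0.2, 0.8, 0.95, 1)
)
## End(Not run)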
AutoScore STEP(iii) for ordinal outcomes: Generate the initial score with the final list of variables (Re-run AutoScore Modules 2+3)
AutoScore_weighting_Ordinal( train_set, validation_set, final_variables, link = "logit", max_score = 100, categorize = "quantile", quantiles = c(0, 0.05, 0.2, 0.8, 0.95, 1), max_cluster = 5, n_boot = 100 )
train_set |
A processed data.frame that contains data to be analyzed, for training |
validation_set |
A processed data.frame that contains data for validation purposes |
final_variables |
A vector containing the list of selected variables, selected from STEP(ii) |
link |
The link function used to model ordinal outcomes. Default is "logit" for the proportional odds model; other options are "cloglog" and "probit". |
max_score |
Maximum total score (Default: 100). |
categorize |
Method for categorizing continuous variables. Options include "quantile" or "kmeans" (Default: "quantile"). |
quantiles |
Predefined quantiles to convert continuous variables to categorical ones (Default: c(0, 0.05, 0.2, 0.8, 0.95, 1)). Available if categorize = "quantile". |
max_cluster |
The maximum number of clusters (Default: 5). Available if categorize = "kmeans". |
n_boot |
Number of bootstrap cycles to compute 95% CI for performance metrics. |
Generated cut_vec for the downstream fine-tuning process in STEP(iv) (AutoScore_fine_tuning_Ordinal).
Saffari SE, Ning Y, Feng X, Chakraborty B, Volovici V, Vaughan R, Ong ME, Liu N, AutoScore-Ordinal: An interpretable machine learning framework for generating scoring models for ordinal outcomes, arXiv:2202.08407
AutoScore_rank_Ordinal
,
AutoScore_parsimony_Ordinal
,
AutoScore_fine_tuning_Ordinal
,
AutoScore_testing_Ordinal
.
## Not run: data("sample_data_ordinal") # Output is named `label` out_split <- split_data(data = sample_data_ordinal, ratio = c(0.7, 0.1, 0.2)) train_set <- out_split$train_set validation_set <- out_split$validation_set ranking <- AutoScore_rank_Ordinal(train_set, ntree=100) num_var <- 6 final_variables <- names(ranking[1:num_var]) cut_vec <- AutoScore_weighting_Ordinal( train_set = train_set, validation_set = validation_set, final_variables = final_variables, max_score = 100, categorize = "quantile", quantiles = c(0, 0.05, 0.2, 0.8, 0.95, 1) ) ## End(Not run)
## Not run: data("sample_data_ordinal") # Output is named `label` out_split <- split_data(data = sample_data_ordinal, ratio = c(0.7, 0.1, 0.2)) train_set <- out_split$train_set validation_set <- out_split$validation_set ranking <- AutoScore_rank_Ordinal(train_set, ntree=100) num_var <- 6 final_variables <- names(ranking[1:num_var]) cut_vec <- AutoScore_weighting_Ordinal( train_set = train_set, validation_set = validation_set, final_variables = final_variables, max_score = 100, categorize = "quantile", quantiles = c(0, 0.05, 0.2, 0.8, 0.95, 1) ) ## End(Not run)
AutoScore STEP(iii) for survival outcomes: Generate the initial score with the final list of variables (Re-run AutoScore Modules 2+3)
AutoScore_weighting_Survival( train_set, validation_set, final_variables, max_score = 100, categorize = "quantile", max_cluster = 5, quantiles = c(0, 0.05, 0.2, 0.8, 0.95, 1), time_point = c(1, 3, 7, 14, 30, 60, 90) )
train_set |
A processed data.frame that contains data to be analyzed, for training |
validation_set |
A processed data.frame that contains data for validation purposes |
final_variables |
A vector containing the list of selected variables, selected from STEP(ii) |
max_score |
Maximum total score (Default: 100). |
categorize |
Method for categorizing continuous variables. Options include "quantile" or "kmeans" (Default: "quantile"). |
max_cluster |
The maximum number of clusters (Default: 5). Available if categorize = "kmeans". |
quantiles |
Predefined quantiles to convert continuous variables to categorical ones (Default: c(0, 0.05, 0.2, 0.8, 0.95, 1)). Available if categorize = "quantile". |
time_point |
The time points to be evaluated using time-dependent AUC(t). |
Generated cut_vec for the downstream fine-tuning process in STEP(iv) (AutoScore_fine_tuning_Survival).
Xie F, Ning Y, Yuan H, et al. AutoScore-Survival: Developing interpretable machine learning-based time-to-event scores with right-censored survival data. J Biomed Inform. 2022;125:103959. doi:10.1016/j.jbi.2021.103959
AutoScore_rank_Survival
,
AutoScore_parsimony_Survival
,
AutoScore_fine_tuning_Survival
,
AutoScore_testing_Survival
.
## Not run: data("sample_data_survival") # out_split <- split_data(data = sample_data_survival, ratio = c(0.7, 0.1, 0.2)) train_set <- out_split$train_set validation_set <- out_split$validation_set ranking <- AutoScore_rank_Survival(train_set, ntree=5) num_var <- 6 final_variables <- names(ranking[1:num_var]) cut_vec <- AutoScore_weighting_Survival( train_set = train_set, validation_set = validation_set, final_variables = final_variables, max_score = 100, categorize = "quantile", quantiles = c(0, 0.05, 0.2, 0.8, 0.95, 1), time_point = c(1,3,7,14,30,60,90) ) ## End(Not run)
## Not run: data("sample_data_survival") # out_split <- split_data(data = sample_data_survival, ratio = c(0.7, 0.1, 0.2)) train_set <- out_split$train_set validation_set <- out_split$validation_set ranking <- AutoScore_rank_Survival(train_set, ntree=5) num_var <- 6 final_variables <- names(ranking[1:num_var]) cut_vec <- AutoScore_weighting_Survival( train_set = train_set, validation_set = validation_set, final_variables = final_variables, max_score = 100, categorize = "quantile", quantiles = c(0, 0.05, 0.2, 0.8, 0.95, 1), time_point = c(1,3,7,14,30,60,90) ) ## End(Not run)
Internal Function: Change Reference category after first-step logistic regression (part of AutoScore Module 3)
change_reference(df, coef_vec)
df |
A data.frame used for the logistic regression |
coef_vec |
Generated from logistic regression |
Processed data.frame after changing the reference category
AutoScore function for datasets with binary outcomes: Check whether the input dataset fulfills the requirements of AutoScore
check_data(data)
data |
The data to be checked |
No return value, the result of the checking will be printed out.
data("sample_data") names(sample_data)[names(sample_data) == "Mortality_inpatient"] <- "label" check_data(sample_data)
data("sample_data") names(sample_data)[names(sample_data) == "Mortality_inpatient"] <- "label" check_data(sample_data)
AutoScore function for ordinal outcomes: Check whether the input dataset fulfills the requirements of AutoScore
check_data_ordinal(data)
data |
The data to be checked |
No return value, the result of the checking will be printed out.
data("sample_data_ordinal") check_data_ordinal(sample_data_ordinal)
data("sample_data_ordinal") check_data_ordinal(sample_data_ordinal)
AutoScore function for survival data: Check whether the input dataset fulfills the requirements of AutoScore
check_data_survival(data)
data |
The data to be checked |
No return value, the result of the checking will be printed out.
data("sample_data_survival") check_data_survival(sample_data_survival)
data("sample_data_survival") check_data_survival(sample_data_survival)
Internal function: Check link function
check_link(link)
link |
The link function used to model ordinal outcomes. Default is "logit" for the proportional odds model; other options are "cloglog" and "probit". |
Internal function: Check predictors
check_predictor(data_predictor)
data_predictor |
Predictors to be checked |
No return value, the result of the checking will be printed out.
Compute AUC based on validation set for plotting parsimony
compute_auc_val( train_set_1, validation_set_1, variable_list, categorize, quantiles, max_cluster, max_score )
train_set_1 |
Processed training set |
validation_set_1 |
Processed validation set |
variable_list |
List of included variables |
categorize |
Method for categorizing continuous variables. Options include "quantile" or "kmeans" |
quantiles |
Predefined quantiles to convert continuous variables to categorical ones. Available if categorize = "quantile" |
max_cluster |
The maximum number of clusters (Default: 5). Available if categorize = "kmeans" |
max_score |
Maximum total score |
A List of AUC for parsimony plot
Compute mean AUC based on validation set for plotting parsimony
compute_auc_val_ord( train_set_1, validation_set_1, variable_list, link, categorize, quantiles, max_cluster, max_score )
train_set_1 |
Processed training set |
validation_set_1 |
Processed validation set |
variable_list |
List of included variables |
link |
The link function used to model ordinal outcomes. Default is "logit" for the proportional odds model; other options are "cloglog" and "probit". |
categorize |
Method for categorizing continuous variables. Options include "quantile" or "kmeans" |
quantiles |
Predefined quantiles to convert continuous variables to categorical ones. Available if categorize = "quantile" |
max_cluster |
The maximum number of clusters (Default: 5). Available if categorize = "kmeans" |
max_score |
Maximum total score |
A list of mAUC for parsimony plot
Compute AUC based on validation set for plotting parsimony (survival outcomes)
compute_auc_val_survival( train_set_1, validation_set_1, variable_list, categorize, quantiles, max_cluster, max_score )
train_set_1 |
Processed training set |
validation_set_1 |
Processed validation set |
variable_list |
List of included variables |
categorize |
Method for categorizing continuous variables. Options include "quantile" or "kmeans" |
quantiles |
Predefined quantiles to convert continuous variables to categorical ones. Available if categorize = "quantile" |
max_cluster |
The maximum number of clusters (Default: 5). Available if categorize = "kmeans" |
max_score |
Maximum total score |
A List of AUC for parsimony plot
Compute descriptive table (usually Table 1 in the medical literature) for the dataset.
compute_descriptive_table(df, ...)
df |
data frame after checking and fulfilling the requirement of AutoScore |
... |
Additional parameters to pass to the underlying table-generation and printing functions (e.g., nonnormal or caption, as in the example). |
No return value and the result of the descriptive analysis will be printed out.
data("sample_data") names(sample_data)[names(sample_data) == "Mortality_inpatient"] <- "label" compute_descriptive_table(sample_data) # Report median and IQR (instead of default mean and SD) for Age, and add a # caption to printed table: compute_descriptive_table(sample_data, nonnormal = "Age", caption = "Table 1. Patient characteristics")
data("sample_data") names(sample_data)[names(sample_data) == "Mortality_inpatient"] <- "label" compute_descriptive_table(sample_data) # Report median and IQR (instead of default mean and SD) for Age, and add a # caption to printed table: compute_descriptive_table(sample_data, nonnormal = "Age", caption = "Table 1. Patient characteristics")
Internal function: Compute risk scores for ordinal data given variables selected, cut-off values and scoring table
compute_final_score_ord(data, final_variables, cut_vec, scoring_table)
data |
A processed data.frame containing the data for which risk scores are computed |
final_variables |
A vector containing the list of selected variables, selected from STEP(ii) |
cut_vec |
Generated from STEP(iii) |
scoring_table |
The final scoring table after fine-tuning, generated from STEP(iv) |
Internal function: Compute mAUC for ordinal predictions
compute_mauc_ord(y, fx)
y |
An ordered factor representing the ordinal outcome, with length n and J categories. |
fx |
Either (i) a numeric vector of predictor (e.g., predicted scores) of length n or (ii) a numeric matrix of predicted cumulative probabilities with n rows and (J-1) columns. |
The mean AUC of J-1 cumulative AUCs (i.e., when evaluating the prediction of Y<=j, j=1,...,J-1).
Generate tables for multivariate analysis
compute_multi_variable_table(df)
df |
data frame after checking |
result of the multivariate analysis
data("sample_data") names(sample_data)[names(sample_data) == "Mortality_inpatient"] <- "label" multi_table<-compute_multi_variable_table(sample_data)
data("sample_data") names(sample_data)[names(sample_data) == "Mortality_inpatient"] <- "label" multi_table<-compute_multi_variable_table(sample_data)
Generate tables for multivariate analysis
compute_multi_variable_table_ordinal(df, link = "logit", n_digits = 3)
df |
data frame after checking |
link |
The link function used to model ordinal outcomes. Default is "logit" for the proportional odds model; other options are "cloglog" and "probit". |
n_digits |
Number of digits to print for OR or exponentiated coefficients (Default: 3). |
result of the multivariate analysis
data("sample_data_ordinal") # Using just a few variables to demonstrate usage: multi_table<-compute_multi_variable_table_ordinal(sample_data_ordinal[, 1:3])
data("sample_data_ordinal") # Using just a few variables to demonstrate usage: multi_table<-compute_multi_variable_table_ordinal(sample_data_ordinal[, 1:3])
Generate tables for multivariate analysis for survival outcomes
compute_multi_variable_table_survival(df)
df |
data frame after checking |
result of the multivariate analysis for survival outcomes
data("sample_data_survival") multi_table<-compute_multi_variable_table_survival(sample_data_survival)
data("sample_data_survival") multi_table<-compute_multi_variable_table_survival(sample_data_survival)
Internal function: Based on given labels and scores, compute proportion of subjects observed in each outcome category in given score intervals.
compute_prob_observed( pred_score, link = "logit", max_score = 100, score_breaks = seq(from = 5, to = 70, by = 5) )
pred_score |
A data.frame with outcomes and final scores generated from AutoScore_testing_Ordinal |
link |
The link function used to model ordinal outcomes. Default is "logit" for the proportional odds model; other options are "cloglog" and "probit". |
max_score |
Maximum attainable value of final scores. |
score_breaks |
A vector of score breaks to group scores. The average predicted risk will be reported for each score interval in the lookup table. Users are advised to first visualise the predicted risk for all attainable scores to determine an appropriate score_breaks. |
Internal function: Based on given labels and scores, compute average predicted risks in given score intervals.
compute_prob_predicted( pred_score, link = "logit", max_score = 100, score_breaks = seq(from = 5, to = 70, by = 5) )
pred_score |
A data.frame with outcomes and final scores generated from AutoScore_testing_Ordinal |
link |
The link function used to model ordinal outcomes. Default is "logit" for the proportional odds model; other options are "cloglog" and "probit". |
max_score |
Maximum attainable value of final scores. |
score_breaks |
A vector of score breaks to group scores. The average predicted risk will be reported for each score interval in the lookup table. Users are advised to first visualise the predicted risk for all attainable scores to determine an appropriate score_breaks. |
Compute scoring table based on training dataset
compute_score_table(train_set_2, max_score, variable_list)
train_set_2 |
Processed training set after variable transformation (AutoScore Module 2) |
max_score |
Maximum total score |
variable_list |
List of included variables |
A scoring table
Compute scoring table based on training dataset
compute_score_table_ord(train_set_2, max_score, variable_list, link)
train_set_2 |
Processed training set after variable transformation |
max_score |
Maximum total score |
variable_list |
List of included variables |
link |
The link function used to model ordinal outcomes. Default is "logit" for the proportional odds model; other options are "cloglog" and "probit". |
A scoring table
Compute scoring table for survival outcomes based on training dataset
compute_score_table_survival(train_set_2, max_score, variable_list)
train_set_2 |
Processed training set after variable transformation (AutoScore Module 2) |
max_score |
Maximum total score |
variable_list |
List of included variables |
A scoring table
Perform univariable analysis and generate the result table with odds ratios.
compute_uni_variable_table(df)
df |
data frame after checking |
result of univariate analysis
data("sample_data") names(sample_data)[names(sample_data) == "Mortality_inpatient"] <- "label" uni_table<-compute_uni_variable_table(sample_data)
data("sample_data") names(sample_data)[names(sample_data) == "Mortality_inpatient"] <- "label" uni_table<-compute_uni_variable_table(sample_data)
Perform univariable analysis and generate the result table with odds ratios from proportional odds models.
compute_uni_variable_table_ordinal(df, link = "logit", n_digits = 3)
df |
data frame after checking |
link |
The link function used to model ordinal outcomes. Default is "logit" for the proportional odds model; other options are "cloglog" and "probit". |
n_digits |
Number of digits to print for OR or exponentiated coefficients (Default: 3). |
result of univariate analysis
data("sample_data_ordinal") # Using just a few variables to demonstrate usage: uni_table<-compute_uni_variable_table_ordinal(sample_data_ordinal[, 1:3])
data("sample_data_ordinal") # Using just a few variables to demonstrate usage: uni_table<-compute_uni_variable_table_ordinal(sample_data_ordinal[, 1:3])
Generate tables for univariable analysis for survival outcomes
compute_uni_variable_table_survival(df)
df |
data frame after checking |
Result of the univariable analysis for survival outcomes
data("sample_data_survival") uni_table<-compute_uni_variable_table_survival(sample_data_survival)
data("sample_data_survival") uni_table<-compute_uni_variable_table_survival(sample_data_survival)
Print conversion table based on final performance evaluation
conversion_table( pred_score, by = "risk", values = c(0.01, 0.05, 0.1, 0.2, 0.5) )
pred_score |
A vector with outcomes and final scores generated from AutoScore_testing |
by |
Specify the method for categorizing the threshold: by "risk" or "score" (Default: "risk") |
values |
A vector of thresholds for analyzing sensitivity, specificity, and other metrics (Default: c(0.01, 0.05, 0.1, 0.2, 0.5)) |
No return value and the conversion will be printed out directly.
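A minimal sketch, assuming pred_score was generated by AutoScore_testing in STEP(v); the threshold values below are illustrative:
## Not run: 
conversion_table(pred_score, by = "risk",
                 values = c(0.01, 0.05, 0.1, 0.2, 0.5))
conversion_table(pred_score, by = "score", values = c(20, 40, 60))
## End(Not run)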
AutoScore function: Print conversion table for ordinal outcomes to map score to risk
conversion_table_ordinal( pred_score, link = "logit", max_score = 100, score_breaks = seq(from = 5, to = 70, by = 5), ... )
pred_score |
A data.frame with outcomes and final scores generated from AutoScore_testing_Ordinal |
link |
The link function used to model ordinal outcomes. Default is "logit" for the proportional odds model; other options are "cloglog" and "probit". |
max_score |
Maximum attainable value of final scores. |
score_breaks |
A vector of score breaks to group scores. The average predicted risk will be reported for each score interval in the lookup table. Users are advised to first visualise the predicted risk for all attainable scores to determine an appropriate score_breaks. |
... |
Additional parameters to pass to the underlying table-printing function. |
No return value and the conversion will be printed out directly.
Print conversion table for survival outcomes
conversion_table_survival( pred_score, score_cut = c(40, 50, 60), time_point = c(7, 14, 30, 60, 90) )
pred_score |
A data.frame with outcomes and final scores generated from AutoScore_testing_Survival |
score_cut |
Score cut-offs to be used for generating conversion table |
time_point |
The time points to be evaluated using time-dependent AUC(t). |
The conversion table, which will also be printed out directly.
Internal function: generate probability matrix for ordinal outcomes given thresholds, linear predictor and link function
estimate_p_mat(theta, z, link)
theta |
numeric vector of thresholds |
z |
numeric vector of linear predictor |
link |
The link function used to model ordinal outcomes. Default is "logit" for the proportional odds model; other options are "cloglog" and "probit". |
Internal function for survival outcomes: Calculate iAUC on the validation set
eva_performance_iauc(score, validation_set, print = TRUE)
score |
Predicted score |
validation_set |
Dataset for generating performance |
print |
Whether to print out the final iAUC result |
Internal function: Evaluate model performance on ordinal data
evaluate_model_ord(label, score, n_boot, report_cindex = TRUE)
label |
outcome variable |
score |
predicted score |
n_boot |
Number of bootstrap cycles to compute 95% CI for performance metrics. |
report_cindex |
Whether the generalized c-index should be reported alongside mAUC (Default: FALSE). |
Returns a list of the mAUC (mauc) and generalized c-index (cindex, if requested) and their 95% CIs
Extract OR, CI and p-value from a proportional odds model
extract_or_ci_ord(model, n_digits = 3)
model |
A proportional odds model fitted in the AutoScore-Ordinal workflow |
n_digits |
Number of digits to print for OR or exponentiated coefficients (Default: 3). |
Internal function: Find column indices in design matrix that should be 1
find_one_inds(x_inds)
x_inds |
A list of column indices corresponding to each final variable. |
Internal function: Compute all scores attainable.
find_possible_scores(final_variables, scoring_table)
final_variables |
A vector containing the list of selected variables. |
scoring_table |
The final scoring table after fine-tuning. |
Returns a numeric vector of all scores attainable.
Internal function: Calculate cut_vec from the training set (AutoScore Module 2)
get_cut_vec( df, quantiles = c(0, 0.05, 0.2, 0.8, 0.95, 1), max_cluster = 5, categorize = "quantile" )
df |
Training set used to calculate the cut vector |
quantiles |
Predefined quantiles to convert continuous variables to categorical ones (Default: c(0, 0.05, 0.2, 0.8, 0.95, 1)). Available if categorize = "quantile". |
max_cluster |
The maximum number of clusters (Default: 5). Available if categorize = "kmeans". |
categorize |
Method for categorizing continuous variables. Options include "quantile" or "kmeans" (Default: "quantile"). |
cut_vec for transform_df_fixed
Internal function: Group scores based on given score breaks, and use friendly names for first and last intervals.
group_score(score, max_score, score_breaks)
score |
numeric vector of scores. |
max_score |
Maximum attainable value of final scores. |
score_breaks |
A vector of score breaks to group scores. The average predicted risk will be reported for each score interval in the lookup table. Users are advised to first visualise the predicted risk for all attainable scores to determine an appropriate score_breaks. |
Internal function: Induce informative missingness in the sample data included in the package, to demonstrate how AutoScore handles missingness as a separate category
induce_informative_missing( df, vars_to_induce = c("Lab_A", "Vital_A"), prop_missing = 0.4 )
df |
A data.frame of sample data. |
vars_to_induce |
Names of variables to induce informative missing in. Default is c("Lab_A", "Vital_A"). |
prop_missing |
Proportion of missing values to induce for each variable in vars_to_induce (Default: 0.4). |
Assume subjects with normal values (i.e., values close to the median) are more likely to not have measurements.
Returns df with selected columns modified to contain missing values.
Internal function: induce informative missing in a single variable
induce_median_missing(x, prop_missing)
x |
Variable to induce missing in. |
prop_missing |
Proportion of missing values to induce in x. |
Internal function: Inverse cloglog link
inv_cloglog(x)
x |
A numeric vector. |
Internal function: Inverse logit link
inv_logit(x)
x |
A numeric vector. |
Internal function: Inverse probit link
inv_probit(x)
x |
A numeric vector. |
Internal function: Based on find_one_inds, make a design matrix to compute all attainable scores.
make_design_mat(one_inds)
one_inds |
Output from find_one_inds |
Internal function: Make parsimony plot
plot_auc( AUC, variables, num = seq_along(variables), auc_lim_min, auc_lim_max, ylab = "Mean Area Under the Curve", title = "Parsimony plot on the validation set" )
AUC |
A vector of AUC values (or mAUC for ordinal outcomes). |
variables |
A vector of variable names |
num |
A vector of indices for AUC values to plot. Default is to plot all. |
auc_lim_min |
Min y_axis limit in the parsimony plot (Default: 0.5). |
auc_lim_max |
Max y_axis limit in the parsimony plot (Default: "adaptive"). |
ylab |
Title of y-axis |
title |
Plot title |
Internal Function: Print plotted variable importance
plot_importance(ranking)
ranking |
Ranking vector generated by AutoScore_rank, AutoScore_rank_Survival, or AutoScore_rank_Ordinal |
AutoScore_rank
, AutoScore_rank_Survival
, AutoScore_rank_Ordinal
AutoScore function for binary and ordinal outcomes: Plot predicted risk
plot_predicted_risk( pred_score, link = "logit", max_score = 100, final_variables, scoring_table, point_size = 0.5 )
pred_score |
Output from AutoScore_testing (for binary outcomes) or AutoScore_testing_Ordinal (for ordinal outcomes) |
link |
(For ordinal outcomes only) The link function used in the ordinal regression, which must be the same as the value used to build the risk score. Default is "logit". |
max_score |
Maximum total score (Default: 100). |
final_variables |
A vector containing the list of selected variables, selected from STEP(ii) |
scoring_table |
The final scoring table after fine-tuning, generated from STEP(iv) |
point_size |
Size of points in the plot. Default is 0.5. |
Internal Function: Plotting ROC curve
plot_roc_curve(prob, labels, quiet = TRUE)
prob |
Predicted probability |
labels |
Actual outcome (binary) |
quiet |
If set to TRUE, no tracing information is printed |
No return value and the ROC curve will be plotted.
Print scoring performance (KM curve) for survival outcome
plot_survival_km( pred_score, score_cut = c(40, 50, 60), risk.table = TRUE, title = NULL, legend.title = "Score", xlim = c(0, 90), break.x.by = 30, ... )
plot_survival_km( pred_score, score_cut = c(40, 50, 60), risk.table = TRUE, title = NULL, legend.title = "Score", xlim = c(0, 90), break.x.by = 30, ... )
pred_score |
Generated from STEP(v) |
score_cut |
Score cut-offs to be used for the analysis |
risk.table |
Logical value (TRUE or FALSE) specifying whether to show the risk table. Default is TRUE. |
title |
Title displayed in the KM curve |
legend.title |
Legend title displayed in the KM curve |
xlim |
Limits for the x-axis (time), e.g., c(0, 90) |
break.x.by |
Interval between breaks on the x-axis (time). Default is 30. |
... |
Additional parameters to pass to the underlying Kaplan-Meier plotting function. |
No return value; the Kaplan-Meier curve is plotted directly.
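A not-run sketch of a typical call; pred_score is a placeholder for the prediction object from STEP(v) of the survival workflow, and the cut-offs are illustrative.
## Not run:
plot_survival_km(
  pred_score = pred_score,    # placeholder: STEP(v) output
  score_cut = c(40, 50, 60),  # illustrative risk-strata cut-offs
  risk.table = TRUE,
  xlim = c(0, 90),
  break.x.by = 30
)
## End(Not run)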
Print iAUC, c-index and time-dependent AUC as predictive performance metrics, with bootstrap confidence intervals
print_performance_ci_survival(score, validation_set, time_point, n_boot = 100)
print_performance_ci_survival(score, validation_set, time_point, n_boot = 100)
score |
Predicted score |
validation_set |
Dataset for generating performance |
time_point |
The time points to be evaluated using time-dependent AUC(t). |
n_boot |
Number of bootstrap cycles to compute 95% CI for performance metrics. |
No return value; the performance metrics are printed directly.
Print mean area under the curve (mAUC) and generalised c-index (if requested)
print_performance_ordinal(label, score, n_boot = 100, report_cindex = FALSE)
print_performance_ordinal(label, score, n_boot = 100, report_cindex = FALSE)
label |
outcome variable |
score |
predicted score |
n_boot |
Number of bootstrap cycles to compute 95% CI for performance metrics. |
report_cindex |
Whether to report the generalised c-index for model evaluation (Default: FALSE, for faster evaluation). |
No return value; the performance metrics are printed directly.
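A not-run sketch; label and score are placeholders for the ordinal outcome (e.g., an ordered factor) and the predicted scores on the validation or test set.
## Not run:
print_performance_ordinal(label = label, score = score,
                          n_boot = 100, report_cindex = FALSE)
## End(Not run)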
Print iAUC, c-index and time-dependent AUC as predictive performance metrics for survival outcomes
print_performance_survival(score, validation_set, time_point)
print_performance_survival(score, validation_set, time_point)
score |
Predicted score |
validation_set |
Dataset for generating performance |
time_point |
The time points to be evaluated using time-dependent AUC(t). |
No return value; the performance metrics are printed directly.
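A not-run sketch; score and validation_set are placeholders for the predicted scores and the corresponding survival dataset, and the time points are illustrative.
## Not run:
print_performance_survival(score = score,
                           validation_set = validation_set,
                           time_point = c(7, 14, 30, 60, 90))
## End(Not run)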
Print receiver operating characteristic (ROC) performance
print_roc_performance(label, score, threshold = "best", metrics_ci = FALSE)
print_roc_performance(label, score, threshold = "best", metrics_ci = FALSE)
label |
outcome variable |
score |
predicted score |
threshold |
Threshold for analysing sensitivity, specificity and other metrics. Default is "best". |
metrics_ci |
whether to calculate confidence interval for the metrics of sensitivity, specificity, etc. |
No return value; the ROC performance is printed directly.
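The sketch below uses simulated labels and scores so that it is self-contained; depending on the implementation, the outcome may need to be coded as 0/1 values or as a factor.
# Simulated data: positive outcomes tend to receive higher scores
set.seed(1)
label <- rbinom(500, 1, 0.3)
score <- round(50 + 15 * label + rnorm(500, sd = 10))
print_roc_performance(label = label, score = score,
                      threshold = "best", metrics_ci = FALSE)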
AutoScore Function: Print scoring tables for visualization
print_scoring_table(scoring_table, final_variable)
print_scoring_table(scoring_table, final_variable)
scoring_table |
Raw scoring table generated by AutoScore step(iv) |
final_variable |
Final included variables |
Data frame of formatted scoring table
AutoScore_fine_tuning
, AutoScore_weighting
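A not-run sketch; score_table and final_vars are placeholders for the fine-tuned scoring table and the final variable list from earlier steps.
## Not run:
print_scoring_table(scoring_table = score_table,  # placeholder
                    final_variable = final_vars)  # placeholder
## End(Not run)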
20,000 simulated samples, with the same distribution as the data in the MIMIC-III ICU database. It is used for demonstration only in the Guidebook. Run vignette("Guide_book", package = "AutoScore")
to see the guidebook or vignette.
Johnson, A., Pollard, T., Shen, L. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016).
sample_data
sample_data
An object of class data.frame
with 20000 rows and 22 columns.
Simulated data for 20,000 inpatient visits with demographic information, healthcare resource utilisation and associated laboratory tests and vital signs measured in the emergency department (ED). Data were simulated based on the dataset analysed in the AutoScore-Ordinal paper, and only includes a subset of variables (with masked variable names) for the purpose of demonstrating the AutoScore framework for ordinal outcomes.
sample_data_ordinal
sample_data_ordinal
An object of class data.frame
with 20000 rows and 21 columns.
Saffari SE, Ning Y, Feng X, Chakraborty B, Volovici V, Vaughan R, Ong ME, Liu N, AutoScore-Ordinal: An interpretable machine learning framework for generating scoring models for ordinal outcomes, arXiv:2202.08407
5,000 observations randomly sampled from
sample_data_ordinal
. It is used for demonstration only in the
Guidebook.
sample_data_ordinal_small
sample_data_ordinal_small
An object of class data.frame
with 5000 rows and 21 columns.
1,000 simulated samples, with the same distribution as the data in the MIMIC-III ICU database. It is used for demonstration only in the Guidebook. Run vignette("Guide_book", package = "AutoScore")
to see the guidebook or vignette.
Johnson, A., Pollard, T., Shen, L. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016).
sample_data_small
sample_data_small
An object of class data.frame
with 1000 rows and 22 columns.
20,000 simulated samples, with the same distribution
as the data in the MIMIC-III ICU database. Data were simulated based on the dataset
analysed in the AutoScore-Survival paper. It is used for demonstration
only in the Guidebook. Run vignette("Guide_book", package = "AutoScore")
to see the guidebook or vignette.
Johnson, A., Pollard, T., Shen, L. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016).
sample_data_survival
sample_data_survival
An object of class data.frame
with 20000 rows and 23 columns.
1,000 simulated samples, with the same distribution
as the data in the MIMIC-III ICU database. Data were simulated based on the dataset
analysed in the AutoScore-Survival paper. It is used for demonstration
only in the Guidebook. Run vignette("Guide_book", package = "AutoScore")
to see the guidebook or vignette.
Johnson, A., Pollard, T., Shen, L. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016).
sample_data_survival_small
sample_data_survival_small
An object of class data.frame
with 1000 rows and 23 columns.
20,000 simulated samples with missing values, which can be used to demonstrate the AutoScore workflow for handling missing values.
Johnson, A., Pollard, T., Shen, L. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016).
sample_data_with_missing
sample_data_with_missing
An object of class data.frame
with 20000 rows and 23 columns.
AutoScore Function: Automatically split the dataset into training, validation and test sets, possibly stratified by label
split_data(data, ratio, cross_validation = FALSE, strat_by_label = FALSE)
split_data(data, ratio, cross_validation = FALSE, strat_by_label = FALSE)
data |
The dataset to be split |
ratio |
The ratios for dividing the dataset into training, validation and testing sets (Default: c(0.7, 0.1, 0.2)). |
cross_validation |
If set to TRUE, the data split is prepared for cross-validation, which is typically used for small datasets (Default: FALSE). |
strat_by_label |
If set to TRUE, the data split is stratified by the outcome label (Default: FALSE). |
Returns a list containing the training, validation and testing sets
data("sample_data") names(sample_data)[names(sample_data) == "Mortality_inpatient"] <- "label" set.seed(4) #large sample size out_split <- split_data(data = sample_data, ratio = c(0.7, 0.1, 0.2)) #small sample size out_split <- split_data(data = sample_data, ratio = c(0.7, 0, 0.3), cross_validation = TRUE) #large sample size, stratified out_split <- split_data(data = sample_data, ratio = c(0.7, 0.1, 0.2), strat_by_label = TRUE)
data("sample_data") names(sample_data)[names(sample_data) == "Mortality_inpatient"] <- "label" set.seed(4) #large sample size out_split <- split_data(data = sample_data, ratio = c(0.7, 0.1, 0.2)) #small sample size out_split <- split_data(data = sample_data, ratio = c(0.7, 0, 0.3), cross_validation = TRUE) #large sample size, stratified out_split <- split_data(data = sample_data, ratio = c(0.7, 0.1, 0.2), strat_by_label = TRUE)
Internal function: Categorizing continuous variables based on cut_vec (AutoScore Module 2)
transform_df_fixed(df, cut_vec)
transform_df_fixed(df, cut_vec)
df |
Dataset (training, validation or testing) to be processed |
cut_vec |
fixed cut vector |
Processed data.frame
after categorizing based on fixed cut_vec
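As an illustrative, not-run sketch, assuming cut_vec is supplied as a named list of fixed cut-off values for each continuous variable (the variable names and values below are hypothetical):
## Not run:
cut_vec <- list(Age = c(35, 50, 75), Lab_A = c(0.5, 1.2, 2.0))
train_set_transformed <- transform_df_fixed(df = train_set, cut_vec = cut_vec)
## End(Not run)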