Whelan, R.¹, Jollans, L., and the IMAGEN consortium²
¹School of Psychology, Trinity College Dublin, Dublin, Ireland
²IMAGEN consortium (www.imagen-europe.com)
Early substance use is a strong risk factor for adult substance dependence; identifying predictors of substance use in adolescence is therefore an important goal. Longitudinal population neuroscience studies, though logistically challenging, offer a promising approach to detecting predictors of substance misuse phenotypes, because causes and effects of substance misuse can be separated to some extent. However, neuroimaging datasets include a large number of features (e.g., voxels or regions of interest, ROIs) and have relatively small sample sizes (i.e., n << p), which can result in overfitting and a consequent lack of generalizability. Furthermore, neuroimaging data consist of correlated features, and effect sizes are typically weak. Here, we report a machine learning method that employs both filter and embedded feature selection to address the problem of dimensionality. Nested cross-validation is used to optimize hyperparameters, and model performance is quantified on out-of-sample data. The approach is validated using simulated neuroimaging data with known properties across a range of input features and sample sizes. Next, we report results from an analysis of imaging data obtained as part of the IMAGEN study, in which we aimed to create a predictive model of binge drinking. All participants (n = 272) had zero-to-low drinking at age 14 years (fewer than 2 lifetime drinks). At follow-up, at age 16 years, 151 participants remained at baseline levels of drinking, whereas 121 participants had at least three lifetime binge drinking episodes. The predictive model incorporated structural and functional brain data, psychometric data including personality measures, and family history data. This model was moderately successful in predicting future binge drinkers (area under the receiver operating characteristic curve, AROC = .75). A similar method was applied to adolescent smokers, with all participants (n = 420) having smoked fewer than 10 cigarettes at age 14 years. At age 16 years, 297 participants remained at baseline levels, whereas 123 participants had smoked at least 40 cigarettes. The classification method was again moderately successful (AROC = .78) in predicting those adolescents who would transition to regular smoking. Applying machine learning methods to high-dimensional data has great potential for identifying predictors of alcohol use. Preliminary results from EEG data will also be presented.
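The general shape of the analysis described above (filter selection, then embedded selection, with hyperparameters tuned by nested cross-validation and AROC measured on held-out data) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not name the filter statistic or the embedded method, so this sketch assumes a standardized mean-difference filter and an L1-penalized logistic regression; the function names, the hyperparameter grid, and the simulated Gaussian data (60 samples, 100 features, 5 informative) are all hypothetical.

```python
import math
import random

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def filter_top_k(X, y, k):
    """Filter step (assumed): keep the k features with the largest standardized group difference."""
    def score(j):
        a = [r[j] for r, t in zip(X, y) if t == 1]
        b = [r[j] for r, t in zip(X, y) if t == 0]
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
        return abs(ma - mb) / math.sqrt(ss / (len(a) + len(b) - 2) + 1e-12)
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def fit_l1_logistic(X, y, lam, steps=100, lr=0.1):
    """Embedded step (assumed): L1-penalized logistic regression by proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * p, 0.0
        for row, t in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            z = max(-30.0, min(30.0, z))          # guard against overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - t
            gb += err
            for j in range(p):
                gw[j] += err * row[j]
        b -= lr * gb / n
        for j in range(p):
            v = w[j] - lr * gw[j] / n
            # Soft-thresholding sets weak coefficients exactly to zero.
            w[j] = math.copysign(max(abs(v) - lr * lam, 0.0), v)
    return w, b

def fit_pipeline(X, y, k, lam):
    cols = filter_top_k(X, y, k)
    return cols, fit_l1_logistic([[r[j] for j in cols] for r in X], y, lam)

def score_pipeline(pipe, X, y):
    cols, (w, b) = pipe
    scores = [b + sum(wj * r[j] for wj, j in zip(w, cols)) for r in X]
    return auc(scores, y)

def folds(idx, n_folds):
    # Strided slices; idx is ordered positives-then-negatives, so folds stay balanced.
    return [idx[i::n_folds] for i in range(n_folds)]

def nested_cv(X, y, grid, outer=5, inner=3, seed=0):
    """Inner loop picks (k, lam); outer loop estimates out-of-sample AROC."""
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(y) if t == 1]
    neg = [i for i, t in enumerate(y) if t == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    idx = pos + neg
    sub = lambda rows: ([X[i] for i in rows], [y[i] for i in rows])
    outer_aucs = []
    for test in folds(idx, outer):
        train = [i for i in idx if i not in test]
        def inner_score(params):
            vals = []
            for val in folds(train, inner):
                fit = [i for i in train if i not in val]
                vals.append(score_pipeline(fit_pipeline(*sub(fit), *params), *sub(val)))
            return sum(vals) / len(vals)
        best = max(grid, key=inner_score)
        outer_aucs.append(score_pipeline(fit_pipeline(*sub(train), *best), *sub(test)))
    return sum(outer_aucs) / len(outer_aucs)

def simulate(n=60, p=100, informative=5, shift=1.5, seed=1):
    """Toy n << p data: only the first `informative` features differ between groups."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(shift * t if j < informative else 0.0, 1.0) for j in range(p)] for t in y]
    return X, y

X, y = simulate()
grid = [(k, lam) for k in (5, 20) for lam in (0.01, 0.1)]  # hypothetical grid
mean_auc = nested_cv(X, y, grid)
print(f"mean out-of-sample AROC: {mean_auc:.2f}")
```

In this sketch both selection steps are refit inside every training fold, so the held-out fold never influences which features are kept; selecting features on the full dataset first would inflate the reported AROC.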