Attention is not all you need: SSM-based classification of ADHD genotypes

Intro

Genomic language models, large neural networks trained on raw DNA sequence rather than English text, have demonstrated that the same architectures powering modern NLP can be repurposed to operate on biological sequence data. Recent benchmarks show that these models can reliably distinguish between species from sequence alone, classifying human from worm or human from other primates with high accuracy. The natural extension of this capability is to classify within a species: given that all human samples share essentially the same genome at the species level, can a model be trained to recognize the variation between individuals that correlates with a particular phenotype?

This project investigates that question in the context of ADHD severity prediction. ADHD is a heritable condition, and standard GWAS analyses have identified many loci associated with its presentation; in addition, a growing body of work suggests that mitochondrial dysfunction is one of the underlying mechanistic threads. However, the majority of this evidence comes from SNP-by-SNP statistical analysis, which treats each locus independently. A foundation model, by contrast, processes the entire input sequence at once, and so in principle is capable of capturing combinations and contextual relationships between loci that single-locus methods cannot. Whether it does so in practice is the open question that motivates this work.

The remainder of this post is a writeup of the methods, since the model is still in training and results are not yet available. We describe the choice of a state-space model architecture over a transformer, the use of the Caduceus model in particular, our procedure for reconstructing per-subject sequences from PLINK files via an hg19-to-hg38 liftover, the current scoping of the work to specific regions of the genome (starting with mtDNA), and the class-imbalance problem that does not have a clean augmentation-based analogue when the input “language” is DNA.

Background

Genomic language modeling

Large language models such as BERT and the GPT family are commonly used for natural language processing tasks, where the input is a sequence of word or subword tokens and the model is trained to predict missing or following tokens given some context. Architecturally, these models are not specific to English text: the input is treated as an opaque sequence of token IDs, and the model learns the statistical relationships between tokens during pretraining on a large corpus. Any sequential data for which a sufficiently large corpus exists can in principle be modeled in the same way, since the architecture itself makes no assumption about what the tokens represent.

Genomic sequence data is one such input. The four nucleotide bases that comprise DNA are naturally analogous to characters in a written language, and the human reference genome together with the genomes of other organisms provides a corpus on the order of billions of bases. The pretraining objective is typically either masked-language modeling, in the style of BERT, where a fraction of the input bases are hidden from the model and it is trained to recover them from the surrounding context, or next-token prediction, in the style of GPT. There are now several families of genomic language models (DNABERT, DNAGPT, HyenaDNA, and Caduceus among others) that apply this recipe with various architectural choices, and most of them are pretrained primarily on the human reference genome. Fine-tuning on a task-specific dataset, as we do in this project, is therefore the standard way to specialize one of these pretrained models to a particular downstream problem.

One design choice worth mentioning at this stage is tokenization. In natural language, tokens are typically words or subword units learned from the training corpus, since operating at the character level would force the model to learn word-level structure from scratch. For DNA, the analogous choice is between single-character tokenization (one token per nucleotide) and k-mer tokenization (one token per fixed-length window of bases). Triplet tokenization is a particularly tempting option, since the cellular machinery itself reads DNA in groups of three when translating to amino acids during protein synthesis. However, k-mer tokenization is fragile to insertions and deletions: adding or removing a single base shifts the reading frame, so every subsequent token in the sequence changes even though the underlying biology has only changed in one place. Single-character tokenization avoids this problem at the cost of longer input sequences, and is the choice we make later for our specific task.
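The frame-shift fragility described above can be seen on a toy sequence. This is a minimal sketch, not the tokenizer of any particular model: the helper names and the example sequence are made up for illustration, and the k-mers are non-overlapping windows of length 3.

```python
def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Split seq into non-overlapping k-mers, dropping any trailing partial window."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def char_tokens(seq: str) -> list[str]:
    """One token per nucleotide."""
    return list(seq)


# Hypothetical toy sequence and variant: a single 'A' inserted after the first codon.
reference = "ATGGCCTTAGGC"
inserted = reference[:3] + "A" + reference[3:]

ref_kmers = kmer_tokens(reference)   # ['ATG', 'GCC', 'TTA', 'GGC']
ins_kmers = kmer_tokens(inserted)    # ['ATG', 'AGC', 'CTT', 'AGG']

# Every k-mer downstream of the insertion is a different token.
changed_kmers = sum(a != b for a, b in zip(ref_kmers, ins_kmers))

# With single-character tokens the downstream tokens are unchanged,
# only shifted by one position.
ref_chars = char_tokens(reference)
ins_chars = char_tokens(inserted)
downstream_chars_preserved = ins_chars[4:] == ref_chars[3:]
```

Here three of the four k-mers change after a one-base insertion, while the character-level view differs only by the inserted token itself.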

The capabilities of these models are evaluated on benchmarks adapted from the same principle as their NLP counterparts. The Nucleotide Transformer benchmark suite is the most commonly cited collection, and includes a range of tasks beyond the species-classification example mentioned above: promoter prediction, where the model is given a sequence and asked whether it contains a transcription-initiation site; splice-site prediction, where it is asked to identify the boundaries between coding and non-coding regions of a gene; and variant effect prediction, where the model is asked to estimate whether a given mutation is likely to be functionally consequential. The last of these is particularly relevant to the present work, since predicting ADHD severity from sequence is in essence a variant-effect problem at the level of an entire phenotype rather than a single locus: we are asking the model to integrate the effects of many variants across a region of the genome and produce a single behavioral readout. Results across the Nucleotide Transformer tasks suggest that genomic language models are picking up on real biological structure rather than only surface-level sequence statistics, which is what makes them a plausible starting point for the fine-tuning task described in the rest of this post.

ABCD data

The Adolescent Brain Cognitive Development study is a long-term longitudinal study, funded by the NIH and launched in 2015, that tracks brain development in roughly 12,000 subjects between the ages of 9 and 20. The scientific motivation behind the study is to characterize how the brain matures through adolescence and into early adulthood, with particular attention to how that maturation interacts with the emergence of mental health conditions, substance use, and other behavioral outcomes during a developmentally sensitive window. To support this goal, the dataset is multimodal, collecting structural and functional MRI scans, behavioral and diagnostic questionnaires, environmental and demographic information, and cheek swabs from which DNA is sampled and genotyped on the Smokescreen SNP array.

The work described in this post is part of a larger multimodal project aimed at identifying signal for mental health conditions from these various data sources, with a sister subproject focused on the fMRI side. The immediate focus here is on what can be done with genomics alone, both because the question of whether a genomic foundation model can pick up phenotype-level signal is interesting on its own terms and because the interpretability story is cleaner when only one modality is in play. Of the data points collected by ABCD, we therefore focus on two in particular: each subject’s genotype from the Smokescreen array, and their score on the KSADS questionnaire.

KSADS, short for the Kiddie Schedule for Affective Disorders and Schizophrenia, is a semi-structured diagnostic interview administered by a clinician, with separate parent-reported and child-reported components that are then combined into a final diagnostic measurement. It is specifically designed to assess clinical conditions such as depression, anxiety, and ADHD, which is the reason it is the appropriate target for our work. ABCD also collects the Child Behavior Checklist (CBCL), but the CBCL is a more general parent-reported checklist of behavioral and emotional problems and is less targeted at specific diagnoses. The 0-to-4 scale we use as the model target corresponds to the standard KSADS severity classification, ranging from no symptoms through subthreshold presentation to a confirmed diagnosis at the upper end of the scale.

A note on cohort size: 12,000 subjects is a substantial cohort by clinical-study standards, but it is small relative to the pretraining corpora used by genomic foundation models, which typically include the full human reference genome on the order of billions of bases. We therefore use all available subjects in the fine-tuning set in order to maximize the amount of phenotype-labeled data available for training; the data-augmentation and class-balance complications that arise from this choice are discussed later in this post.

Hypothesis

Mental health conditions such as ADHD are known to be heritable, with twin studies placing the heritability of ADHD at roughly 70–80% [CITATION TODO]. The genetic basis of the condition is therefore well-established at a high level, even though the specific mechanisms by which that genetic component translates into the observed phenotype remain an active area of investigation. Genome-wide association studies (GWAS) are the standard statistical tool used to probe this question: a large cohort of subjects is genotyped at millions of SNP positions across the genome, and for each position a regression is performed against the phenotype of interest, with the positions whose association p-value falls below a genome-wide significance threshold (conventionally 5 × 10⁻⁸) reported as associated loci. GWAS has produced a substantial catalog of SNPs associated with ADHD and related conditions, and that catalog forms the prior on which much of the current biological understanding of the condition rests.

However, the structure of a GWAS imposes some limitations on the kind of signal it can capture. Each locus is tested independently against the phenotype, which means that interactions between loci — where two variants are individually unremarkable but jointly predictive — are not captured by the default analysis. Such interaction effects, known as epistasis, can in principle be tested for explicitly, but the combinatorial explosion of pairwise and higher-order tests makes a brute-force search statistically and computationally impractical at genome scale. The independence assumption is also at odds with the broadly polygenic nature of conditions like ADHD, where the phenotype is shaped by the cumulative contribution of many variants of small individual effect, most of which do not clear genome-wide significance on their own. Polygenic risk scores partially address this by summing weighted effects across many SNPs, but the weights themselves still come from independent single-locus regressions and so do not capture interactions between the contributing variants.
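To put a number on the combinatorial explosion, the sketch below counts the tests an exhaustive interaction search would need. The figure of 500,000 SNPs is an illustrative assumption (roughly the scale of a genotyping array), and the Bonferroni correction is used only as a simple stand-in for multiple-testing control.

```python
import math

# Assumption for illustration: an array-scale panel of 500,000 SNPs.
n_snps = 500_000

single_locus_tests = n_snps
pairwise_tests = math.comb(n_snps, 2)   # every unordered pair of loci
triple_tests = math.comb(n_snps, 3)     # every unordered triple of loci

# Bonferroni-style per-test thresholds at a family-wise error rate of 0.05.
single_locus_alpha = 0.05 / single_locus_tests
pairwise_alpha = 0.05 / pairwise_tests

print(f"pairwise tests: {pairwise_tests:,}")
print(f"triple tests:   {triple_tests:.3e}")
print(f"per-test alpha, single locus: {single_locus_alpha:.1e}")
print(f"per-test alpha, pairwise:     {pairwise_alpha:.1e}")
```

Under these assumptions the pairwise search alone needs about 1.25 × 10¹¹ tests, and each test must clear a far stricter threshold than a single-locus test does, which is the statistical half of the problem.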

The hypothesis driving this project is that a deep learning model trained on large contiguous regions of the genome should be able to recover the signal that GWAS captures within those regions, and ideally pick up additional signal that single-locus methods cannot. Because the model processes the entire sequence at once, it has the structural capacity to learn from combinations of loci and from the sequence context surrounding each variant, both of which are precisely the kinds of effects that the independence assumption in GWAS forces it to miss.

In addition to the question of whether such a model classifies correctly, there is a follow-up question of mechanistic interpretability: when the model makes a classification decision, what parts of the input is it actually attending to? If the regions it identifies correspond to genes or regulatory elements already known to be involved in the condition, that provides some validation of the approach; if the model attends to regions that are not currently part of the biological picture, those regions become candidates for further investigation. The specifics of how we plan to extract that information from the trained model are discussed in the Next Steps section.

Biological basis

The specific biological mechanism we are interested in is mitochondrial dysfunction. The working hypothesis is that reduced energy production in neurons impairs the ability of brain cells to self-regulate, and that the regions of the brain responsible for sustained attention are particularly sensitive to this effect [CITATION TODO]. Accordingly, our region of interest within the genome is focused on mtDNA, as well as on the genes and regulatory regions on the nuclear chromosomes that are responsible for mitochondrial function.

Implementation

Model selection

The size of a single human genome (on the order of 3 billion base pairs) makes the standard transformer architecture poorly suited as a starting point. The self-attention operation has O(n²) time and memory complexity in the input length, which in practice limits transformer-based models to input windows on the order of 10,000 to 100,000 tokens. That is several orders of magnitude shorter than the genome itself, and even at the level of an individual chromosome it forces the model to operate on small windows of sequence at a time.

The length constraint matters for genomic data in particular because the biologically meaningful relationships between positions can span large distances. Regulatory elements such as enhancers can act on genes that are tens or hundreds of kilobases away on the same chromosome, and a model that can only see a narrow window at a time has no way to capture those interactions directly. As a concrete reference point, the mitochondrial chromosome is roughly 16.5 kilobases in length, which is small enough to fit comfortably inside the context window of any of the architectures we considered; this is convenient given that mtDNA is one of our primary regions of interest, but the larger nuclear chromosomes are still well out of reach without a long-context architecture.

State-space models (SSMs) are an alternative architecture that can be evaluated either as a linear recurrence, where each output is computed from the previous hidden state and the current input in the style of a classical RNN, or, equivalently for the linear time-invariant case, as a long convolution over the entire input sequence. The recurrent view is intuitive and matches the way the model processes data sequentially at inference time; the convolutional view is what enables efficient parallel training, since a convolution over a long sequence can be computed in subquadratic time using fast algorithms like the FFT. Selective SSMs such as Mamba, whose parameters depend on the input, give up the convolutional form and are trained with a parallel scan instead, which keeps the cost linear in sequence length. The net result is that SSM architectures can handle input windows orders of magnitude longer than what a transformer can practically support, while remaining tractable to train.
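The equivalence between the two views can be checked on a scalar toy model. The sketch below assumes a one-dimensional, time-invariant SSM with made-up parameters a, b, c; real models use vector states and learned parameters, and would compute the convolution with an FFT rather than the direct sum used here.

```python
def ssm_recurrent(x: list[float], a: float, b: float, c: float) -> list[float]:
    """Recurrent view: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h = 0.0
    ys = []
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
    return ys


def ssm_convolutional(x: list[float], a: float, b: float, c: float) -> list[float]:
    """Convolutional view: y_t = sum_j K_j * x_{t-j} with kernel K_j = c * a**j * b.

    The direct sum below is quadratic in length; it is written this way only
    to keep the sketch dependency-free.
    """
    kernel = [c * (a ** j) * b for j in range(len(x))]
    return [
        sum(kernel[j] * x[t - j] for j in range(t + 1))
        for t in range(len(x))
    ]


# Hypothetical inputs and parameters, chosen only for illustration.
x = [1.0, 0.5, -0.25, 2.0, 0.0, 1.5]
y_rec = ssm_recurrent(x, a=0.9, b=0.5, c=2.0)
y_conv = ssm_convolutional(x, a=0.9, b=0.5, c=2.0)
max_gap = max(abs(r - k) for r, k in zip(y_rec, y_conv))
```

The two outputs agree up to floating-point error, which is the property that lets a time-invariant SSM train as a convolution and run as a recurrence.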

Two foundation models built on this lineage are publicly available and pretrained on genomic data. HyenaDNA is built on the Hyena operator, which combines long convolutions with gating to produce a transformer-like architecture without self-attention, and was the first to demonstrate that a genomic language model could operate on a context window of up to one million tokens. HyenaDNA treats DNA as a single-stranded sequence, in the sense that the architecture itself has no built-in awareness that the two strands of the double helix encode the same biological information. Caduceus, by contrast, is built on top of the Mamba selective-scan SSM and explicitly introduces reverse-complement equivariance: a guarantee that the model produces consistent outputs regardless of which strand of DNA is fed in as input. This matters because DNA is intrinsically double-stranded, and the two strands are reverse complements of each other, so a model without this property can in principle give two different answers for what is biologically the same sequence. Building the symmetry into the architecture both removes that correctness gap and improves sample efficiency, since the model no longer has to spend training data learning the symmetry.
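For readers less familiar with the symmetry in question, the sketch below computes a reverse complement and shows one simple way to make any scoring function strand-invariant by averaging over both strands. This post-hoc averaging is a stand-in chosen for illustration and is not how Caduceus achieves the property, which is built into the architecture; `toy_score` is a made-up scorer.

```python
from typing import Callable

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse, to read the opposite strand 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def strand_symmetrized(score_fn: Callable[[str], float]) -> Callable[[str], float]:
    """Wrap score_fn so that a sequence and its reverse complement score identically."""
    def wrapped(seq: str) -> float:
        return 0.5 * (score_fn(seq) + score_fn(reverse_complement(seq)))
    return wrapped


def toy_score(seq: str) -> float:
    """Hypothetical strand-sensitive scorer: counts occurrences of the motif 'GA'."""
    return float(seq.count("GA"))


seq = "GATTACA"
rc = reverse_complement(seq)            # 'TGTAATC'
symmetric_score = strand_symmetrized(toy_score)
```

The raw scorer gives 1.0 for the sequence and 0.0 for its reverse complement, while the symmetrized version gives 0.5 for both.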

We selected Caduceus for this project on the strength of the reverse-complement equivariance argument, and used the published 131k-token checkpoint as the starting point for fine-tuning. The 131k context window is well above what is required to contain the mitochondrial chromosome and is large enough to capture meaningful blocks of the nuclear chromosomes as well; we are also actively experimenting with shorter and longer context windows to find a workable trade-off between sequence coverage and the computational cost of training, since the cost of running the model scales with the length of the input window.

Model setup

We follow the recommended setup from the Caduceus authors for the most part. In particular, we use single-character tokenization rather than k-mer tokenization, for the reasons described earlier in the Genomic language modeling section: under k-mer tokenization a single-base insertion or deletion shifts the reading frame and changes every downstream token, whereas under single-character tokenization the change stays confined to the affected position.

The more substantive change we make is to the classification head. KSADS severity is an ordinal target rather than a categorical one — the labels 0 through 4 are ordered, with each step corresponding to an increase in severity rather than to an unrelated class — and treating it as a standard 5-way softmax classification problem would discard that structure. A model trained with categorical cross-entropy is penalized equally for predicting class 1 instead of 0 as it is for predicting class 4 instead of 0, even though the latter is a much larger error on the underlying severity scale. To preserve the ordinal structure, we replace the pretrained head of the model with a proportional-odds ordinal regression head: instead of producing logits for each of the five classes directly, the model produces logits for the four thresholds between adjacent severity levels, and the cumulative probability of being at or below each threshold is computed via the sigmoid of the corresponding logit. The training loss is then a binary cross-entropy on these cumulative probabilities against the corresponding cumulative targets. The pooled representation that feeds into this head is taken from the first position of the final hidden state.
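The cumulative-threshold idea can be sketched in a few lines. This is a dependency-free illustration under stated assumptions, not the project's training code: it uses the textbook proportional-odds form in which a single scalar score is compared against four ordered cutpoints, and the cutpoint values and function names are made up.

```python
import math

NUM_CLASSES = 5  # severity labels 0..4, hence 4 thresholds


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def cumulative_targets(label: int) -> list[float]:
    """Target for threshold k is 1 if label <= k, else 0."""
    return [1.0 if label <= k else 0.0 for k in range(NUM_CLASSES - 1)]


def cumulative_probs(score: float, cutpoints: list[float]) -> list[float]:
    """P(label <= k) = sigmoid(cutpoint_k - score); a higher score means more severe."""
    return [sigmoid(c - score) for c in cutpoints]


def ordinal_bce(score: float, cutpoints: list[float], label: int) -> float:
    """Mean binary cross-entropy over the four threshold decisions."""
    probs = cumulative_probs(score, cutpoints)
    targets = cumulative_targets(label)
    terms = [
        t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
        for p, t in zip(probs, targets)
    ]
    return -sum(terms) / len(terms)


def class_probs(score: float, cutpoints: list[float]) -> list[float]:
    """Recover per-class probabilities as differences of cumulative ones."""
    cum = cumulative_probs(score, cutpoints) + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, NUM_CLASSES)]


# Hypothetical ordered cutpoints, chosen only for illustration.
cutpoints = [-1.5, -0.5, 0.5, 1.5]
losses_for_true_label_0 = [ordinal_bce(s, cutpoints, label=0) for s in (-3.0, 0.0, 3.0)]
```

For a subject whose true label is 0, the loss grows steadily as the predicted score moves toward the severe end, so a far miss costs more than a near miss, which is the behavior categorical cross-entropy lacks.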

Data preprocessing

The ABCD dataset does not distribute per-subject genome sequences directly. Instead, each subject’s genotype is provided as a PLINK file, which specifies, for each variant on the Smokescreen array, the nucleotide observed at that subject’s pair of alleles together with the chromosome and base-pair offset of the variant. The PLINK coordinates are given relative to the hg19 reference genome, while the reference sequence we load is hg38, so the first step in preparing each subject’s input is a coordinate liftover from hg19 to hg38 (implemented via pyliftover). Variants whose hg19 coordinates do not have a clean unique mapping into hg38 are dropped at this stage.

The per-subject sequence is then constructed by starting from the relevant region of the hg38 reference and substituting in the subject’s called base at each lifted-over position. The current implementation operates on a single chromosomal region at a time, with mtDNA — the roughly 16.5 kb mitochondrial chromosome — as the starting target given the mitochondrial-dysfunction hypothesis. Expanding the pipeline to operate across multiple regions of the nuclear genome in parallel is the obvious follow-on direction, and is discussed further in the Next Steps section. The Smokescreen array covers roughly half a million SNPs across the genome, which is a tiny fraction of the roughly three billion bases in the reference. The vast majority of each subject’s input sequence is therefore identical to the reference and identical across all subjects; the inter-subject variation lives in the relatively sparse set of positions that the array actually genotypes. This is an important framing for the interpretability story later in the project, since the model can only detect signal in positions where there is variation to detect.
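The substitution step can be sketched as follows. This is a minimal illustration, assuming positions have already been lifted over to the coordinate system of the reference slice, that positions are 1-based, and that each call is a single base; the function name, the toy reference, and the coordinates are all hypothetical.

```python
def apply_variants(reference: str, region_start: int, calls: dict[int, str]) -> str:
    """Return a copy of the reference slice with called bases substituted in.

    reference    -- reference bases for the region of interest
    region_start -- 1-based genomic position of reference[0]
    calls        -- mapping from 1-based genomic position to the called base
    """
    seq = list(reference)
    for pos, base in calls.items():
        idx = pos - region_start
        if 0 <= idx < len(seq):   # ignore calls that fall outside the region
            seq[idx] = base
    return "".join(seq)


# Toy reference slice starting at hypothetical position 101.
reference = "ACGTACGTAC"
subject_calls = {103: "T", 108: "A", 500: "G"}   # the last call is out of region

subject_seq = apply_variants(reference, region_start=101, calls=subject_calls)
n_differences = sum(r != s for r, s in zip(reference, subject_seq))
```

Only the two in-region positions differ from the reference, which mirrors the point made above: nearly all of each subject's input is reference sequence, and the signal lives in the sparse set of substituted positions.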

Two practical wrinkles are worth flagging at this point. The first is that real reference genomes contain a small number of N positions, used to indicate bases whose identity is uncertain or unresolved in the assembly, and the Caduceus tokenizer handles these natively as additional vocabulary entries rather than requiring us to mask or drop them. The second is the question of how to encode heterozygous calls, where the two alleles at a position differ; the current implementation simplifies this by writing a single base per position, which trades faithfulness to the diploid biology for tokenization simplicity. A more faithful encoding would require either an explicit second sequence for the other haplotype or the use of IUPAC ambiguity codes (R for A/G, Y for C/T, and so on), both of which are options we may revisit.
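The IUPAC option could look like the sketch below, which maps an unordered pair of alleles to a single character. It covers only the six two-base ambiguity codes; the function name is hypothetical, and whether a given pretrained tokenizer has vocabulary entries for these codes is something that would need to be checked before adopting this encoding.

```python
# Standard IUPAC two-base ambiguity codes, keyed by the unordered allele pair.
IUPAC_HET = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def encode_genotype(allele1: str, allele2: str) -> str:
    """Encode a diploid call as one character: the base itself if homozygous,
    otherwise the IUPAC code for the pair of bases."""
    if allele1 == allele2:
        return allele1
    return IUPAC_HET[frozenset((allele1, allele2))]


encoded = [encode_genotype(a, b) for a, b in [("A", "G"), ("T", "C"), ("G", "G")]]
```

The encoding is order-independent, so an unphased A/G call and a G/A call produce the same token.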

On the phenotype side, the KSADS data is structured as one row per subject per assessment timepoint, with two timepoints (baseline and a two-year follow-up) available for most subjects. We take the maximum of the two scores as each subject’s label, on the reasoning that we are interested in the severity of the diagnosis at any point during the study rather than at a particular timepoint, and that taking the maximum is the most natural way to collapse the longitudinal axis into a single ordinal target. Subjects with missing KSADS scores at both timepoints are coded as 0, on the working assumption that a missing score on this item reflects skip-logic from a negative screener question rather than truly missing data; this is a known modeling assumption rather than a directly protocol-mandated coding, and is one of the items on the list to revisit. The set of subjects in the final training set is then the intersection of the subjects present in the PLINK genotype file and the subjects present in the KSADS file, since neither modality fully covers every subject in the broader ABCD cohort.
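The label-collapsing rules above can be written down compactly. The sketch uses plain dictionaries rather than the actual ABCD tables, and the subject IDs, scores, and function name are invented for illustration.

```python
from typing import Optional


def collapse_label(scores: list[Optional[int]]) -> int:
    """Maximum score across timepoints; 0 if every timepoint is missing.

    Coding all-missing subjects as 0 encodes the skip-logic assumption
    described above, and is a modeling choice rather than a protocol rule.
    """
    observed = [s for s in scores if s is not None]
    return max(observed) if observed else 0


# Hypothetical phenotype rows: subject ID -> [baseline, two-year follow-up].
ksads_scores = {
    "SUBJ_A": [1, 3],
    "SUBJ_B": [None, 2],
    "SUBJ_C": [None, None],
    "SUBJ_D": [0, 0],
}
# Hypothetical set of subjects present in the genotype file.
genotyped_subjects = {"SUBJ_A", "SUBJ_B", "SUBJ_C", "SUBJ_E"}

# Keep only subjects present in both modalities.
labels = {
    subject: collapse_label(scores)
    for subject, scores in ksads_scores.items()
    if subject in genotyped_subjects
}
```

In this toy example `SUBJ_D` drops out for lack of a genotype, `SUBJ_E` for lack of a phenotype row, and `SUBJ_C` is kept with a label of 0.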

Once the per-subject sequence is constructed and paired with its KSADS label, both are passed through the Caduceus tokenizer to produce the input IDs that feed into the model.

Training loop

The training loop itself is a fairly standard fine-tuning setup: the data is partitioned into an 80/20 train/test split, the model is trained on the training set, and accuracy on the held-out test set is used to validate the results. Because the base Caduceus model is large relative to the size of our fine-tuning corpus, we use LoRA — low-rank adaptation — rather than fine-tuning the full set of weights. LoRA freezes the pretrained weights and injects a small set of trainable low-rank update matrices into the linear projections of each Mamba block, which substantially reduces both the memory footprint of training and the number of parameters that need to be regularized against the relatively small ABCD cohort.
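The low-rank update at the heart of LoRA can be sketched without any framework. This is an illustration of the general technique on a single linear map, not the project's configuration: the matrix sizes, rank, and scaling value are made up, and a real setup would use a library rather than hand-written matrix products.

```python
Matrix = list[list[float]]


def matvec(m: Matrix, v: list[float]) -> list[float]:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: list[float],
                 alpha: float, r: int) -> list[float]:
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable.

    W is d_out x d_in, A is r x d_in, B is d_out x r.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the low-rank factors A and B."""
    return r * (d_in + d_out)


# Toy 3x3 frozen weight with a rank-1 adapter; values are illustrative only.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
A = [[0.5, -0.5, 1.0]]            # r x d_in
B_init = [[0.0], [0.0], [0.0]]    # d_out x r, zero-initialized
x = [1.0, 2.0, 3.0]

y_frozen = matvec(W, x)
y_at_init = lora_forward(W, A, B_init, x, alpha=2.0, r=1)
```

With B initialized to zero the adapted layer reproduces the frozen layer exactly at the start of training, and for a hypothetical 256 × 256 projection a rank-8 adapter would train 4,096 parameters instead of 65,536.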

Learnings

Class imbalance

The primary practical difficulty encountered so far has been class imbalance. The distribution of KSADS scores across the cohort is heavily skewed toward the lower end of the severity scale, with the concrete counts coming out to roughly 5,700 subjects at score 0, 3,500 at score 1, 1,400 at score 2, 550 at score 3, and 730 at score 4. A model trained with uniform sample weighting on this distribution would receive most of its gradient signal from the majority classes, and the easiest way to get a low average loss would be to simply predict 0 for every subject and accept the cost on the long tail.
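The degenerate always-predict-0 solution is easy to quantify from the approximate counts above. The sketch below computes its plain accuracy and balanced accuracy, and for comparison derives inverse-frequency class weights; the weights are shown only as a common alternative reweighting scheme, not as something this project uses.

```python
# Approximate class counts quoted above (KSADS score -> number of subjects).
counts = {0: 5700, 1: 3500, 2: 1400, 3: 550, 4: 730}
total = sum(counts.values())

# A classifier that always predicts the majority class.
majority_class = max(counts, key=counts.get)
majority_accuracy = counts[majority_class] / total

# Balanced accuracy is the mean per-class recall: the constant predictor
# has recall 1 on the majority class and 0 on every other class.
balanced_accuracy = sum(
    1.0 if c == majority_class else 0.0 for c in counts
) / len(counts)

# Inverse-frequency weights, normalized so a perfectly balanced dataset gives 1.0.
inverse_freq_weights = {c: total / (len(counts) * n) for c, n in counts.items()}

print(f"majority-class accuracy: {majority_accuracy:.3f}")
print(f"balanced accuracy:       {balanced_accuracy:.3f}")
```

Predicting 0 for everyone scores about 48% plain accuracy but only 20% balanced accuracy, which is why a metric that accounts for per-class performance is a useful companion to raw accuracy on this distribution.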

Our current approach is to use focal loss in place of standard cross-entropy. Focal loss scales the per-sample cross-entropy by a modulating factor (1 - p_t)^γ and a weighting factor α, giving FL(p_t) = -α (1 - p_t)^γ log(p_t), where p_t is the model’s predicted probability for the correct class. The modulating factor downweights easy, confidently correct examples and concentrates the gradient on samples the model is currently getting wrong. With α = 0.25 and γ = 2 as the default hyperparameters, this gives the underrepresented severe-presentation classes enough leverage on the loss to actually contribute to learning rather than being washed out by the bulk of the score-0 cohort.
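A small numerical example makes the downweighting concrete. The sketch below implements the multi-class form of focal loss with a single scalar α, which is a simplification (the original formulation allows a per-class α); the example probabilities are made up, and how the loss composes with the cumulative-threshold head described earlier is left aside here.

```python
import math


def cross_entropy(probs: list[float], target: int) -> float:
    return -math.log(probs[target])


def focal_loss(probs: list[float], target: int,
               alpha: float = 0.25, gamma: float = 2.0) -> float:
    """FL = -alpha * (1 - p_t)**gamma * log(p_t), with p_t = probs[target]."""
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical predictions over the five severity classes.
easy = [0.90, 0.04, 0.03, 0.02, 0.01]   # true class 0, confidently correct
hard = [0.60, 0.15, 0.10, 0.05, 0.10]   # true class 4, mostly missed

ce_ratio = cross_entropy(hard, 4) / cross_entropy(easy, 0)
focal_ratio = focal_loss(hard, 4) / focal_loss(easy, 0)

print(f"hard/easy loss ratio under cross-entropy: {ce_ratio:.1f}")
print(f"hard/easy loss ratio under focal loss:    {focal_ratio:.1f}")
```

With γ = 2 the hard example outweighs the easy one by a factor of (0.9 / 0.1)² = 81 more than it does under plain cross-entropy, and setting γ = 0 and α = 1 recovers cross-entropy exactly.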

The more conceptually interesting question is whether we could augment the data synthetically rather than just reweighting the loss, and this is where the genomic setting diverges sharply from the natural-language setting. In natural language, common augmentation strategies include replacing words with synonyms or rearranging clauses, both of which approximately preserve the meaning of the sentence. The genomic analogue of these operations (substituting nucleotides at random or shuffling subsequences around) introduces mutations whose biological effect is unknown and which, at scale, may not even be compatible with life, and so cannot be assumed to preserve the phenotype label associated with the original sample. Synthetic augmentation in this setting is therefore an open problem rather than a solved one, and reweighting via focal loss is the practical workaround we are using in the meantime.

Next steps

Getting the model to classify correctly is one part of the goal; understanding why it classifies the way it does is the other. The latter is a question of mechanistic interpretability, and concerns what the model attends to when making a decision. A model that classifies correctly without yielding interpretable internals is of limited scientific use in this context, since the value of this approach over traditional GWAS lies precisely in the ability to surface combinations and contexts that single-locus methods miss.

A few interpretability approaches are on the table, none of them committed to in particular yet. The most direct route is in silico mutagenesis: take a trained model, perturb specific positions in the input sequence (either by introducing variants at sites of interest or by reverting the subject’s allele back to the reference), and observe the resulting change in the predicted severity score. This translates the question “what is the model looking at” into a concrete sensitivity analysis at the level of individual positions, and is a natural fit for genomic data since the perturbations have a clear biological interpretation. Gradient-based attribution methods such as integrated gradients are a related option, which produce a per-position importance map without requiring the model to be re-run for each perturbation. Probing classifiers, where a small linear model is trained on the intermediate hidden states to predict a biological property of interest, are a further option for testing whether specific kinds of information are linearly decodable from the representation the model has learned. The closest internal-routing analogue of attention-map inspection, given that Caduceus is built on Mamba rather than self-attention, is to look at the selective-scan gating values that govern how each layer decides which positions to incorporate into its hidden state and which to discard; visualizing these gates as a function of position offers a way to see which regions of the sequence the model is choosing to write into memory at each layer.
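Of these options, in silico mutagenesis is the simplest to sketch. The version below works against any function that maps a sequence to a score, so the trained model is replaced here by a made-up toy scorer; the function names are hypothetical, and a real run would batch the perturbed sequences rather than score them one at a time.

```python
from typing import Callable


def in_silico_mutagenesis(
    seq: str,
    score_fn: Callable[[str], float],
    positions: list[int],
    alphabet: str = "ACGT",
) -> dict[tuple[int, str], float]:
    """Substitute each alternative base at each position (0-based) and record
    the change in score relative to the unperturbed sequence."""
    baseline = score_fn(seq)
    effects = {}
    for pos in positions:
        for base in alphabet:
            if base == seq[pos]:
                continue
            mutated = seq[:pos] + base + seq[pos + 1:]
            effects[(pos, base)] = score_fn(mutated) - baseline
    return effects


def toy_score(seq: str) -> float:
    """Stand-in for a trained model: counts 'G' in the first four positions."""
    return float(seq[:4].count("G"))


seq = "ACGTACGT"
effects = in_silico_mutagenesis(seq, toy_score, positions=[0, 2, 6])

# Per-position sensitivity: the largest absolute effect of any substitution.
sensitivity = {
    pos: max(abs(delta) for (p, _), delta in effects.items() if p == pos)
    for pos in (0, 2, 6)
}
```

For the toy scorer, positions 0 and 2 are sensitive and position 6 is not, which is the kind of per-position map one would then compare against known annotations.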

Whichever method we end up reaching for, the comparison against known biology is what gives the result its scientific weight. The natural reference points are resources such as the GWAS catalog of SNPs already associated with ADHD, curated mitochondrial gene lists for variants in the mitochondrial-function pathway, and regulatory annotations from large-scale projects on the non-coding genome. Positions that the model attends to which overlap with these known annotations would provide a sanity check on the approach; positions it attends to that fall outside these annotations are the more interesting case, since they become candidates for further investigation.

A second direction is to scale beyond the regions of the genome we have been working with so far. The current implementation operates on hardcoded chromosomal regions, with mtDNA as the starting point given the mitochondrial-dysfunction hypothesis, but the larger nuclear chromosomes contain many of the loci that GWAS has surfaced as associated with ADHD, and our broader interest is in the nuclear regions associated with mitochondrial function as well. Extending the pipeline to cover those regions in a more principled way — and eventually to operate over multiple regions of interest in parallel — is the obvious next direction once the mtDNA-only setup is yielding stable results.

Finally, although ADHD is the specific diagnosis we have been targeting in this writeup, the KSADS questionnaire covers a wide range of other conditions, including depression, anxiety, and other clinically assessed mental health diagnoses. Because the pipeline is parameterized by the choice of KSADS subscale as the prediction target, repointing it at a different diagnosis is, in principle, a matter of swapping out the phenotype column and re-training. The methods described in this post are therefore not specific to ADHD; they are a general recipe for fine-tuning a genomic foundation model on a clinically assessed phenotype, and ADHD is simply the first instance of that recipe we have been working through.