Pool-packaged AAV libraries exhibit extensive length-dependent and homology-dependent chimerism – Nature Biotechnology
Cloning complex libraries of doubly barcoded AAV constructs with various inserts
To clone a set of doubly barcoded AAV constructs, we digested plasmid AiP11839 (Addgene, 163509) with PacI and BbsI (New England Biolabs) and size-selected the 2.9 kb backbone containing the AAV2 ITRs on agarose gel. A GFP stuffer constructs with complex barcodes was created by two steps of PCRs (step 1: primers o924 + o925 amplifying GFP from p027, step 2: o926 + o927_v2) using Kappa Robust and standard cycling conditions. Random nucleotides were included in the primers to append random DNA barcodes (15 nt in o926, left BC1 and 16 nt in o927_v2, right BC2; Extended Data Fig. 1d) The second PCR was tracked by qPCR and stopped at the inflection point to maintain complexity and limit jackpotting. The barcoded GFP stuffer was then size-selected on agarose and inserted by Gibson assembly (4-μl reaction) in the AiP11839 PacI + BbsI-digested backbone above. The resulting library was cleaned up (reaction taken to 50 μl with 10 mM Tris 8 buffer and Zymo Clean and Concentrator with 3:1 binding buffer, eluted in 6 μl of water) and 3 μl was electroporated in 25 μl of C3020 cells (New England Biolabs). The resulting complex library (>5 million barcode transformants estimated by plating 0.03%) was grown overnight at 37 °C and plasmid-purified (Qiagen miniprep), generating the parental barcoded plasmid p146 (Extended Data Fig. 1b). This barcoded ‘dock’ contained two SapI sites internal to the BC1 and BC2 to allow insertion of various libraries of inserts. Just outside the SapI sites, a Nextera read1 adaptor served as homology for Gibson assembly on the left BC1 side (Extended Data Fig. 1d) and another constant-homology arm was used on the right BC2 side. Before replacing GFP with different classes of inserts, we generated additional dock plasmids to accommodate final constructs of fixed lengths by adding constant inserts outside of the barcoded stuffer. Briefly, p146 was sequentially digested with BglII and PmlI (New England Biolabs), the resulting linear product size-selected on agarose. Filler sequences of 1,291 bp and 1,898 bp were generated by PCR amplification from plasmid AiP11839 (constant_1291bp with primers o929 + o930, constant_1898bp with primers o928 + o930) using standard conditions. Following size selection on agarose, these were inserted by Gibson assembly and electroporated as described above, maintaining a high complexity of represented barcodes in the libraries, resulting in barcoded plasmids p147 (with constant_1291bp) and p148 (with constant_1898bp). All parental libraries were confirmed by whole-plasmid sequencing (Plasmidsaurus).
Parental barcoded AAV libraries with GFP stuffer flanked by SapI sites: p146, no additional insert; p147, with additional constant 1,291-bp insert outside of barcodes; p148, with additional constant 1,898-bp insert outside of barcodes.
Six libraries were then constructed to vary the insert length and class (homologous, meaning all components of the library are identical, or nonhomologous, meaning that all members of the library are different). To maintain a roughly constant length of 2.3 kb between the AAV ITRs, short inserts were integrated in p148, mid-sized inserts were integrated in p147 and long inserts were integrated in p146 (Extended Data Fig. 1b,c). All parental plasmids were digested with SapI (New England Biolabs) to release the GFP stuffer and the resulting barcoded backbones were size-selected on agarose for downstream steps, described below.
Homologous (fixed) inserts were taken from sections of the AiP11839 cargo (Extended Data Fig. 1a) and generated by two steps of PCR with primers. The first step appended handles (Nextera R1 on left, partial Nextera R2 on right) to the constant region: homologous_127bp with primers o931 + o934, homologous_739bp with o931 + o933 and homologous_2034bp with o931 + o932. These handles then served to prime a secondary PCR, which also appended a unique library index inside the construct for later demultiplexing. This secondary PCR was the same as that used for the construction of the nonhomologous libraries, corresponding to primers Nextera R1 (o759) in the forward direction and an indexed Nextera R2 with the constant right homology arm in the reverse direction (homologous_127bp with o937, homologous_739bp with o938 and homologous_2034bp with o939). The indexed constant inserts were then size-selected on agarose and inserted in their respective (for fixed ITR-to-ITR length) SapI-digested barcoded parental backbone (Extended Data Fig. 1b) with Gibson assembly. Following cleanup and electroporation, libraries were bottlenecked by serial dilution before outgrowth to a target complexity of ~20,000 transformants.
To generate nonhomologous insert libraries, we relied on tagmentation and PCR amplification of bacterial genomic DNA. Briefly, 1 μl at 10 ng μl−1 of gDNA extracted from E. coli cells (also containing plasmid p146, as described below) was tagmented with dually loaded Tn5 (Illumina, Nextera Tagment DNA enzyme, 15027916) at two doses: 0.4 μl of Tn5 enzyme 1 and 0.4 μl of a 20-fold dilution of the Tn5 enzyme together with 3.6 μl of water and 5 μl of 2× tagmentation buffer (Illumina, 15027866). Following cleanup (Zymo Clean and Concentrator, 3:1 binding buffer), 1 μl of the 10-µl Tris 8 10 mM elution (1 ng) was taken as input for 12 cycles of PCR (Kappa Robust) from the Nextera handles with indexed primer (same primers series as homologous inserts above, forward o759, reverse: short with o941, mid-sized with o942 and long with o943) to mark the libraries with internal insert indices for downstream demultiplexing. The resulting smear was size-selected to a narrow range in size on polyacrylamide gel for the short insert and on agarose for the mid-sized and long inserts. In all cases, to size-match nonhomologous fragments as carefully as possible, the homologous inserts of corresponding length were run on side lanes on the gels and the small corresponding range of the amplified tagmented gDNA was cut out and purified. The size-selected fragments were secondarily amplified with the same primers to generate more material for cloning, size-selected again and inserted by Gibson assembly in their respective SapI-digested barcoded parental backbone as for the homologous fragments. Following cleanup and electroporation, libraries were again bottlenecked to a target complexity of ~20,000 transformants.
Thus, all in all, we obtained the following six libraries fixed ITR-to-ITR length with the following insert characteristics and estimated complexity from transformant counts. All homologous libraries were confirmed by whole-plasmid sequencing (Plasmidsaurus) and nonhomologous libraries were spot-checked with Sanger sequencing of colonies (Genewiz).
Final dual-barcoded AAV libraries with various inserts (listed insert lengths do not include the Tn5 Nextera handles, included between all barcode pairs: 33 + 34 bp total): p149, short (127 bp) homologous insert; p150, short (~100 to 150 bp) nonhomologous inserts; p151, mid-sized (739 bp) homologous insert; p152, mid-sized (~650 to 850 bp) nonhomologous inserts; p153, long (2,034 bp) homologous insert; p154, long (~1,900 to 2,100 bp) nonhomologous inserts.
We note that the bacterial pellet used for genomic DNA extraction to tagment for nonhomologous library insert generation was outgrown from a colony on a p146 transformation plate (but grown on LB without ampicillin). As such, inserts from these libraries contained at substantial proportion sequences from plasmid p146 (proportion of mapped fragments: 65% for p150, 19% for p152 and 15% for p154; inserts mapping to p146 also mapped to AiP11839 given similarity). While inserts from these nonhomologous libraries are still very diverse, given the limited size of the plasmid and substantial proportions of the libraries, there are still nonzero pockets of homology for certain members of the libraries. To quantify this, we performed local pairwise alignment (pairwiseAlignment from R package Biostrings, version 2.62.0, option type = ‘local’) of randomly selected insert sequences from the size-selected digested plasmid long reads (pairs of reads with distinct barcodes). Setting a threshold score per library as the maximum of 1,000 alignments from a pair of insert sequences and another one-shuffled pair (FDR < 0.001), we quantified that about 1% of inserts had detectable homology (0.8% p150, 1% p152 and 1.8% p154), indeed close with theoretical expectation, assuming even random fragmentation with the proportion of reads within each library coming from the plasmid (size 5 kb) and the size of inserts (p150: 0.65 × 0.65 × (100 bp/5,000 bp) = 0.8%, p152 = 0.19 × 0.19 × (750 bp/5,000 bp) = 0.5%, p154 = 0.15 × 0.15 × (4,000 bp/5,000 bp) = 1.8%). Hence, despite the tagmentation material in the starting library being a mixture of genome and plasmid, the final libraries were effectively nearly completely nonhomologous.
Notably, a significantly enriched proportion of the rare swapped-barcode reads from the nonhomologous libraries originated from members of the library with regions of homology. For instance, among the 22/26 full BC-to-BC length barcode-swapped reads from library p150 for which the two corresponding preswap inserts could be mapped in the size-selected digested plasmid data, 3/22 had extensive homology (pairwise local alignment score > 50), which was over tenfold higher than randomly selected pairs of inserts from the same library (Fisher’s exact P < 0.005).
Generation of a valid BC1–BC2 pair dictionary from parental plasmid library p146
To generate the dictionary of valid barcode 1 and barcode 2 pairs, we amplified by PCR (primers o945 + o946v2 containing P5 and P7 Illumina adaptors) the GFP stuffer insert flanked by the two barcodes using 5 ng of starting template (50-μl reaction, Kapa Robust, standard conditions, ten cycles) and followed by 1× AMPure beads cleanup. The resulting library was sequenced using custom primers as a fraction of a NextSeq2000 P2 100-cycle run (read1: 42 cycles with o947, index1: 20 cycles with o761 Nextera_read1 (into SapI restriction site and GFP, not used), index2: 20 cycles with o948 (into GFP, not used), read2: 16 cycles with o762 Nextera_index2). Sequencing data were demultiplexed from other samples on the basis of the first ten indices of read1 (GATCCGTCGA) using bcl2fastq with base mask i10y*,y*,y*,y*, yielding 195.6 million reads to associate barcodes from p146. BC1 (5′/left of insert) was on cycles 1–16 within read2, whereas BC2 (3′/right of insert) was on cycles 21 to 36 within read1.
We then applied stringent representation and uniqueness criteria to identify unique valid pairs for our downstream swapping assessment. Read counts corresponding to identical barcodes 1 and 2 were first piled up. First, representation of barcodes (separately, summing reads from pairs with three or more counts) was inspected (Extended Data Fig. 2c), revealing a trimodal distribution: low-count barcodes (≤5 reads) corresponding to likely sequencing or PCR errors (BC1, n = 1,010,349, 2.5% of reads; BC2: n = 1,162,035, 2.9% of reads), intermediate-count barcodes making up the bulk of the coverage (BC1, n = 6,345,673 barcodes, 92.9% of reads; BC2, n = 6,596,695 barcodes, 93.5% of reads) and high-count barcodes (BC1, n = 827, 4.6% of reads; BC2, n = 530 barcodes, 3.7% of reads), possibly emerging from of clonal expansion during transformation outgrowth and/or PCR jackpotting. In the case of BC2, a single sequence (ATAACGACTTGTGAGC) was drastically overrepresented (2.5% of reads). This barcode and all other barcodes within a Levenshtein distance of ≤2 to it (n = 402 barcode, 0.25% of reads) were not considered in downstream analysis to avoid spurious nonunique pairs. We note that our barcode space was not saturated at that level of tolerance to mismatches (average of three BCs within our BC2 reads within Levenshtein distance of 2 to repeated one-shufflings of the ATAACGACTTGTGAGC sequence). Furthermore, barcodes with ten or more consecutive Gs or containing truncated BC (with the detected post-BC sequences: ATTAAAC for BC1, TAGCGCG for BC2) were removed, corresponding to a minute proportion of the library (BC1, n = 9,533, 1.1% of reads; BC2, n = 3,910, 0.06% of reads). To avoid mismatches from high-representation barcodes being retained as spurious mid-count barcodes, the list of mid-count barcodes was pruned by removing those within a Levenshtein distance of 2 from the high-count barcodes (number of mid-count barcodes removed in pruning process: BC1, n = 16,386; BC2, n = 4,747). All remaining pruned mid-count barcodes were retained downstream (BC1, n = 6,321,002; BC2, n = 6,588,391). Lastly, the high-count barcodes were further error-corrected by generating an undirected graph connecting barcodes within a Levenshtein distance of 2 or less. The most highly represented barcodes from each connected component were considered valid. Rare clusters composed of many well-represented barcodes (fold change between maximum and minimum < 10 within connected component) were discarded as possibly ambiguous, leading to n = 609 and n = 420 error-corrected high-count BC1 and BC2 sets, respectively. All in all, these filtering steps generated a list of well-represented high-quality barcodes (BC1, n = 6,321,611, BC2: n = 6,588,811).
From these well-represented BC1 and BC2 sets, we finally filtered unique pairings between the two. Specifically, all paired barcodes with ≥2 reads were filtered for members present in both separately valid sets. Then, the proportion of reads to each barcode within a pair (for example, the number of reads to BC1 from a given pair over number of reads containing the same BC1 across all pairs in the library) was computed. Only pairs for which >99% of reads mapped to a unique pair were retained. This led to a final set of n = 5,586,772 valid barcode pairs used for downstream assessment. Only exact matches to BC1 and BC2 constituents of these final pairs were used for filtering long-read data and setting the denominator in our barcode swap quantification. The retention proportion at filtering steps is presented in Extended Data Fig. 2b.
We note that the distinction between mid-count and high-count representation above was largely immaterial, as the overwhelming majority of detected BCs in our long-read libraries were from the mid-count set (n = 12 of 4,811 full BC-to-BC reads from AAV used to quantify swaps in Fig. 1 from the high-count barcode set), in line with their dominant representation, with no significant correlation with concordant or discordant pairing and barcode representation class (for the two libraries with detected high-count barcodes in the final read list, Fisher’s exact test: p149, P = 0.71; p151, P = 0.14).
AAV packaging
Complex libraries were packaged into PHP.eB or AAV2 capsids using the crude prep method previously described68. Maxiprep libraries cloned between AAV2 ITRs were transfected with PEI Max 40K (Polysciences, 24765-1) into one 15-cm plate of HEK-293T cells (American Type Culture Collection, CRL-11268), along with helper plasmid pHelper (Cell BioLabs) and either pUCmini-iCAP-PHP.eB69 (Addgene, 103005) or pAAV2/2 (Addgene, 104963). The final transfection mix contained 150 μg of PEI Max 40K, 30 μg of pHelper DNA, 15 μg of rep/cap plasmid DNA and 15 μg of library DNA per plate (‘standard conditions’). In the case of ‘sparse conditions’, we transfected with fewer library molecules per cell by using 10% (1.5 μg) library DNA along with 90% (13.5 μg) non-ITR-bearing empty expression vector plasmid DNA as a carrier (AiP12481, pCDNA3.1-CMV-empty-IRES2-mTFP1-BGHpA). Following transfection at 24 h, the medium was changed to low-serum conditions (1% FBS) and then, after 5 days, cells and supernatant were harvested into 50-ml conical tubes and AAV particles were released by three freeze–thaw cycles. The cell lysates were then treated with Benzonase to degrade free DNA (2 μl of Benzonase, 30 min at 37 °C, MilliporeSigma, E8263-25KU) and then cell debris was cleared with a low-speed spin (1,500g 10 min). The supernatant containing virus was concentrated over a Centricon column (100-kDa molecular weight cutoff; MilliporeSigma, Z648043) to a final volume of ∼100 μl, containing ~1 × 1012–3 × 1012 vector genomes. Crude AAVs were used for direct sequencing (Plasmidsaurus AAV sequencing service).
Sample preparation and sample submission for long-read sequencing
We long-read sequenced both direct and PCR-based libraries. For PCR-free sequencing of plasmid DNA, we digested the starting dually barcoded AAV plasmid libraries p146 and p149–p154 with NotI-HF and MluI-HF (New England Biolabs, using 2 μg of starting material, 37 °C, 1 h). The released the barcoded inserts (881 bp for p146, ~2.3 kb for p149–p154) were then size-selected on agarose (Zymoclean gel purification), pooled and submitted to Plasmidsaurus for a custom long-read project (project 8Y6YSQ.1; target: 3 million reads, recovery: 2.6 million reads). PCR-free libraries of AAV-packaged DNA were sequenced using the AAV service from Plasmidsaurus (project RP88L6).
To generate libraries of barcoded insert by PCR from the AAV-packaged DNA, we first extracted DNA by performing proteinase K treatment (3 μl of crude AAV prep, 6 μl of 10 mM Tris 8, 1 μl of proteinase K (Thermo Scientific, EO0491), 60 min at 50 °C, 5 min at 70 °C and then placed on ice), followed by phenol–chloroform extraction (adding 190 μl of 10 mM Tris 8, adding 200 μl of phenol–chloroform–isoamyl alcohol (Invitrogen, 15593-031), vortexing for 30 s, spinning at 16,000g at room temperature for 5 min and taking aqueous layer) and isopropanol precipitation (adding 1 μl of glycoblue, 50 μl of sodium acetate 3 M and 250 μl of isopropanol 100%, vortexing for 45 min at −80 °C and 45 min at 21,000g at 4 °C, washing with 80% ethanol, air-drying the pellet and resuspending in 10 μl of 10 mM Tris 8). Next, 2 μl of the precipitated DNA was taken as template for PCR with primers oJBL949 + oJBL950 (Kapa Robust HotStart Ready mix (Roche), annealing temperature: 60 °C, elongation time: 2 min 30 s) and tracked by qPCR (SYBr green). PCR libraries were generated from all samples packaged with the PHP.eB serotype in standard (nonsparse) packaging conditions (that is, samples 1–6; Supplementary Fig. 2f). Two separate reactions per sample (reaction 1: volume 20 μl, 16 cycles; reaction 2: volume 50 μl, 17–20 cycles) were pooled before submission. Of note, we tested the importance of adding a DNase treatment step by performing qPCR with primers targeting the insert between the AAV ITRs (oJBL905 + oJBL906) versus a sequence on the ampicillin cassette in the backbone (oJBL097 + oJBL098) comparing with and without DNase treatment before proteinase K treatment but saw no difference (>30-fold lower backbone material relative to cargo), likely because of the Benzonase treatment already having degraded the nonencapsidated DNA. PCR libraries from plasmids were prepared from 1 ng of template (with quantification possibly confounded by genomic DNA carryover) and reactions were stopped at 18 cycles (starting input material calibrated to still be in exponential phase before the inflection point). Plasmid-templated samples also included the parental p146 (which was pooled at 0.1-fold stoichiometry of other larger inserts to mitigate ONT size biases). In both plasmid-templated and AAV-templated conditions, PCR reactions were cleaned up with 0.4× AMPure clean, leaving predominantly >2-kb products (that is, largely excluding p153×-short and p153×-mid samples; Supplementary Fig. 4g). Two samples corresponding to pooled products (AAV pool sample 1, plasmid pool sample 2) were submitted to Plasmidsaurus for long-read sequencing (custom project DXYB6C, target reads: 1 million per sample).
Detailed methods on the computational pipeline to process the long-read data are provided in the Supplementary Methods.
PacBio AAV library preparation and sequencing
To generate libraries derived from AAV encapsidated DNA for PacBio sequencing, we adapted the protocol from Zhang et al.21 to reduce the number of gel extractions to improve yield and possibly reduce biases in subgenomic fragments. First, encapsidated DNA was purified using the Purelink Viral RNA/DNA mini kit (Invitrogen). Briefly, 50 μl of crude viral preparation (approximately 7 × 1011 viral genomes) was treated with 20 U of DNase 1 (Thermo, PI89836) at 37 °C for 30 min in a 200-μl total reaction. The DNase-treated encapsidated DNA was then treated for 15 min at 56 °C with 25 μl of proteinase K (Thermo, 4333793) in a total reaction volume of 425 μl (reaction including 5.6 μg of carrier RNA). The DNA was then purified using the column following the instructions and eluted in 50 μl. Polymerase Bst2 (New England Biolabs) was then used to perform double-stranding primed by the 3′ end of the AAV ITR. Specifically, 20 μl of purified DNA was mixed with 10 μl of 10x buffer, 6 μl of 100 mM MgSO4 and 45 μl of water, heated for 5 min at 95 °C and then placed on ice for 10 min. Then, 14 μl of 10 mM dNTPs were added, together with 4 μl of Bst2 polymerase and 1 μl of BSA (10 mg ml−1), and the mixture was incubated at 50 °C for 60 min. For PacBio library preparation, the SMRTbell prep kit 3.0 was used. Specifically, the Bst2 reaction was cleaned up with SMRTbell cleanup beads at 1.3× and eluted in 46 μl. After extraction and double-stranding, concentration was assessed by Qbit and integrity spot-checked with the Agilent TapeStation. End repair and poly(A) tailing was performed by adding 14 μl of the repair master mix (8 μl of repair buffer, 4 μl of end repair mix, 2 μl of DNA repair mix) and treated at 37 °C for 30 min and 65 °C for 5 min, followed by a hold at 4 °C. Next, 4 μl of SMRTbell barcoded adaptprs 3.0 were then added to the reaction from the previous step, together with 31 μl of ligation master mix (30 μl of ligation mix, 1 μl of ligation enhancer) and incubated for 30 min at 20 °C, followed by a hold at 4 °C. The resulting ligated samples were cleaned with 1.3× SMRTbell cleanup beads and eluted in 40 μl. The ligated DNA was then treated with nuclease by adding 10 μl of master mix (5 μl of nuclease buffer, 5 μl of nuclease mix) and incubated for 15 min at 37 °C, followed by a hold at 4 °C. The sample was then purified with 1.3× SMRTbell cleanup beads and eluted in 12 μl. Libraries were pooled and loaded on the PacBio Vega at 0.25 ng μl−1 (loading at a higher concentration would have increased the number of reads). The sequencing was run with application type ‘viral sequencing/AAV’, leading to modified adaptor calling (to allow for the capture of scAAV-like molecules).
Detailed methods on the computational pipeline to process the long-read data are provided in the Supplementary Methods. The associated pseudocode is provided in Supplementary Note 2.
Input plasmid dosage experiment
To assess the importance of input rAAV plasmid dosage on the chimerism phenomenon, we packaged the library p151:mid-hom as described above, but at four different doses per 15-cm plate of producing cells: 15 μg, 1.5 μg, 150 ng and 15 ng. The amount of DNA transfected was kept constant by compensating with the same ITR-free carrier plasmid (AiP12481, pCDNA3.1-CMV-empty-IRES2-mTFP1-BGHpA) as before. As a control, p152:mid-nonhom was also packaged a second time at the 15-μg dose. Analysis of the ONT data proceeded as described above with the same quality control filters. Metrics for these data are presented in Supplementary Fig. 3.
Statistical testing (bootstrap FDR)
To provide estimates of significance from counting noise (some AAV samples had relatively few reads passing quality control filters; Supplementary Fig. 2f), bootstrap resampling was performed to generate ensemble estimates of concordant barcode pairs. To compare two samples (for example, p149:AAV versus p150:AAV in Fig. 1c), n = 105 bootstrap resamplings were performed. In this case, the bootstrap FDR was taken as the fraction of resamplings in which sample p149:AAV had a higher bootstrap concordant pair fraction than the p150:AAV resampling, etc.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.