This page is for users of the Python API. For the command line application, see the Command line interface page. The package provides convenient access to binding prediction methods and tools for analysis of the results. There are multiple potential applications.
MHC binding and other prediction methods are implemented by inheriting from a Predictor object. All such classes should, at minimum, override the predict method, which scores a single sequence. This method may wrap functions from other Python packages or call command line predictors. For example, the TepitopePredictor uses the epitopepredict.tepitope module provided with this package.
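The inheritance pattern can be sketched without the package installed. The following is a minimal illustration only: the Predictor class here is a hypothetical stand-in, not the package's own base class, and the hydrophobic residue count is a toy score chosen just to make the example runnable. It returns one plain dictionary per n-mer; such rows could be passed to pandas.DataFrame.

```python
class Predictor:
    """Hypothetical stand-in base class (not the package's own Predictor)."""

    name = "base"

    def predict(self, sequence, allele, length=11, name="temp"):
        """Score every n-mer of one sequence; subclasses must override."""
        raise NotImplementedError


class HydrophobicCountPredictor(Predictor):
    """Toy subclass: scores each n-mer by its count of hydrophobic residues."""

    name = "hydrophobic_count"
    hydrophobic = frozenset("AVILMFWY")

    def predict(self, sequence, allele, length=11, name="temp"):
        rows = []
        # slide a window of the requested length along the sequence
        for pos in range(len(sequence) - length + 1):
            peptide = sequence[pos:pos + length]
            score = sum(aa in self.hydrophobic for aa in peptide)
            rows.append({"peptide": peptide, "pos": pos, "score": score,
                         "name": name, "allele": allele})
        # highest score first
        rows.sort(key=lambda r: r["score"], reverse=True)
        return rows


p = HydrophobicCountPredictor()
rows = p.predict("KAVILKKKKKKK", allele="HLA-DRB1*01:01", length=4, name="myprot")
print(rows[0])
```

Only predict is overridden; everything else a real predictor offers would come from the base class.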
The predict method should return a pandas DataFrame. The predict_sequences method handles multiple protein sequences held in a dataframe of sequences in a standard format, which is created from a GenBank or FASTA file (see examples below). For large numbers of sequences, call predict_sequences with save=True so that results are written to disk as each protein is completed. This avoids memory issues, since many alleles may be predicted for each protein. Results are saved in CSV format, one file per protein/sequence.
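The idea of saving results as each protein is completed can be illustrated with the standard library alone. This is a sketch of the pattern, not the package's implementation: save_per_protein and toy_predict are hypothetical names, and the CSV columns are chosen for the example.

```python
import csv
import os
import tempfile


def save_per_protein(proteins, predict_func, path):
    """Write one CSV per protein as soon as its predictions are ready.

    proteins: iterable of (name, sequence) pairs
    predict_func: hypothetical callable returning a list of row dicts
    """
    os.makedirs(path, exist_ok=True)
    written = []
    for name, seq in proteins:
        rows = predict_func(seq, name)
        filename = os.path.join(path, name + ".csv")
        with open(filename, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=["peptide", "pos", "name"])
            writer.writeheader()
            writer.writerows(rows)
        written.append(filename)
        # rows is replaced on the next iteration, so only one protein's
        # results are held in memory at a time
    return written


def toy_predict(seq, name, length=3):
    """Placeholder predictor: lists the n-mers without scoring them."""
    return [{"peptide": seq[i:i + length], "pos": i, "name": name}
            for i in range(len(seq) - length + 1)]


savepath = tempfile.mkdtemp()
files = save_per_protein([("protA", "MKVLA"), ("protB", "GGSTW")], toy_predict, savepath)
print(sorted(os.path.basename(f) for f in files))
```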
The results are of the following form and are returned sorted by the score column:
         peptide       core  pos  score      name         allele  rank
198  VIFRLMRTNFL  FRLMRTNFL  198    3.4  ZEBOVgp1  HLA-DRB1*0101     1
199  IFRLMRTNFLI  FRLMRTNFL  199    3.4  ZEBOVgp1  HLA-DRB1*0101     1
200  FRLMRTNFLIK  FRLMRTNFL  200    3.4  ZEBOVgp1  HLA-DRB1*0101     1
709  NRFVTLDGQQF  FVTLDGQQF  709    2.5  ZEBOVgp1  HLA-DRB1*0101     4
710  RFVTLDGQQFY  FVTLDGQQF  710    2.5  ZEBOVgp1  HLA-DRB1*0101     4
711  FVTLDGQQFYW  FVTLDGQQF  711    2.5  ZEBOVgp1  HLA-DRB1*0101     4
 70  DSFLLMLCLHH  FLLMLCLHH   70    2.0  ZEBOVgp1  HLA-DRB1*0101     7
 71  SFLLMLCLHHA  FLLMLCLHH   71    2.0  ZEBOVgp1  HLA-DRB1*0101     7
 72  FLLMLCLHHAY  FLLMLCLHH   72    2.0  ZEBOVgp1  HLA-DRB1*0101     7
 32  QGIVRQRVIPV  IVRQRVIPV   32    1.7  ZEBOVgp1  HLA-DRB1*0101    10
Here name is the protein identifier from the input file (a locus tag, for example), and score is the prediction score, whose scale differs between methods. MHC-II methods can be run for varying peptide lengths; the reported core is usually, but not always, the highest-scoring one in that peptide/n-mer.
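The rank values in the table repeat for tied scores and then skip ahead (1, 1, 1, 4, ...). This is consistent with competition ranking, where tied scores share the smallest rank. That reading is inferred from the table and is not confirmed from the package source. A minimal sketch that reproduces the rank column from the scores shown:

```python
def competition_rank(scores):
    """Rank scores from highest to lowest; ties share the smallest rank."""
    ordered = sorted(scores, reverse=True)
    # the rank of a score is the 1-based position of its first occurrence
    first_position = {}
    for i, s in enumerate(ordered):
        first_position.setdefault(s, i + 1)
    return [first_position[s] for s in scores]


scores = [3.4, 3.4, 3.4, 2.5, 2.5, 2.5, 2.0, 2.0, 2.0, 1.7]
ranks = competition_rank(scores)
print(ranks)
```

With pandas, Series.rank(method='min', ascending=False) produces the same numbering.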
import epitopepredict as ep
from epitopepredict import base, sequtils, analysis, plotting
create a Predictor object:
#get list of predictors
print(base.predictors)
#['tepitope', 'netmhciipan', 'iedbmhc1', 'iedbmhc2', 'mhcflurry', 'mhcnuggets', 'iedbbcell']
p = base.get_predictor('tepitope')
get sequence data:
#get data in genbank format into a dataframe
df = sequtils.genbank2Dataframe(genbankfile, cds=True)
#get sequences from fasta file
df = sequtils.fasta2Dataframe(fastafile)
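To show what kind of table such a conversion yields (one record per protein, with an identifier and its sequence), here is a small standard-library FASTA reader. It is an illustration only and not the package's parser; the keys name and sequence are chosen for the example and are not the column names the package uses.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into a list of {'name', 'sequence'} records."""
    records = []
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append({"name": name, "sequence": "".join(chunks)})
            # use the first word of the header as the identifier
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        else:
            # sequence lines may be wrapped, so collect and join them
            chunks.append(line)
    if name is not None:
        records.append({"name": name, "sequence": "".join(chunks)})
    return records


fasta_text = ">protA example protein\nMKVLA\nGGST\n>protB\nFRLMRTNFL\n"
records = read_fasta(io.StringIO(fasta_text))
print(records)
```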
run predictions for a protein sequence:
seq = ep.testsequence
label = 'myprot' #optional label for your sequence
p = base.get_predictor('tepitope')
p.predict(sequence=seq, allele='HLA-DRB1*01:01', length=11, name=label)
run predictions for multiple proteins:
#run for 2 alleles and save results to savepath
alleles = ["HLA-DRB1*01:01", "HLA-DRB1*03:05"]
p = base.get_predictor('tepitope')
p.predict_sequences(df, length=11, alleles=alleles, save=True, path=savepath)
run predictions for a list of peptides:
from epitopepredict import peptutils
seqs = peptutils.create_random_sequences(5000)
p = base.get_predictor('tepitope')
x = p.predict_peptides(seqs, alleles=alleles)
run with multiple threads:
x = p.predict_peptides(seqs, alleles=alleles, threads=4)
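How a list of peptides can be divided among workers is sketched below with the standard library. This is not the package's implementation: score_peptide is a toy score, predict_parallel is a hypothetical name, and a thread pool is used only to show the chunking pattern. For CPU-bound scoring in pure Python, a process pool would be the more typical choice.

```python
from concurrent.futures import ThreadPoolExecutor


def score_peptide(peptide):
    """Toy score (hypothetical): number of hydrophobic residues."""
    return sum(aa in "AVILMFWY" for aa in peptide)


def predict_chunk(chunk):
    """Score every peptide in one chunk."""
    return [(pep, score_peptide(pep)) for pep in chunk]


def predict_parallel(peptides, threads=4):
    """Split the peptide list into chunks and score them concurrently."""
    size = max(1, -(-len(peptides) // threads))  # ceiling division
    chunks = [peptides[i:i + size] for i in range(0, len(peptides), size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map preserves chunk order, so results line up with the input
        parts = list(pool.map(predict_chunk, chunks))
    return [row for part in parts for row in part]


seqs_demo = ["AVILKK", "KKKKKK", "FWYMAA", "GGGGGG", "LLLLLL"]
results = predict_parallel(seqs_demo, threads=2)
print(results)
```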
load previous results into a predictor:
p.load(path=path) #where path stores csv files for multiple proteins
p.load(filename=file) #where file is a csv formatted file of prediction results (can be 1 or more proteins)
get all the binders using the current data loaded into the predictor:
#default is to use percentile cutoff per allele, returns a dataframe
p.get_binders(cutoff=.95)
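A percentile cutoff per allele means the threshold is computed separately from each allele's own score distribution, so alleles with different score scales are treated alike. The sketch below shows the idea with plain dictionaries. It assumes that a higher score means a stronger binder, and it uses a simple nearest-rank quantile. Both are assumptions made for illustration; the package's exact quantile definition is not confirmed here. The name binders_by_percentile is hypothetical.

```python
from collections import defaultdict


def binders_by_percentile(rows, cutoff=0.95):
    """Keep rows whose score is at or above the per-allele cutoff quantile."""
    by_allele = defaultdict(list)
    for row in rows:
        by_allele[row["allele"]].append(row["score"])
    thresholds = {}
    for allele, scores in by_allele.items():
        ordered = sorted(scores)
        # nearest-rank style index; one of several possible quantile definitions
        idx = min(len(ordered) - 1, int(cutoff * len(ordered)))
        thresholds[allele] = ordered[idx]
    return [r for r in rows if r["score"] >= thresholds[r["allele"]]]


# two alleles whose scores are on different scales
rows_demo = []
for allele, scale in (("A", 1), ("B", 10)):
    for i, pep in enumerate(["PEPA", "PEPB", "PEPC", "PEPD"], start=1):
        rows_demo.append({"peptide": pep, "allele": allele, "score": i * scale})

selected = binders_by_percentile(rows_demo, cutoff=0.5)
print([(r["peptide"], r["allele"]) for r in selected])
```

A single global threshold would have kept every row for allele B and none for allele A; the per-allele threshold keeps the top half of each.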
get binders for only one protein by top median rank:
p.get_binders(name=name, cutoff=10, cutoff_method='rank')
get all promiscuous binders, returns a dataframe:
pb = p.promiscuous_binders(n=2, cutoff=.95)
#same using score cutoff
pb = p.promiscuous_binders(n=2, cutoff_method='score', cutoff=500)
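Promiscuous binders are peptides that pass the cutoff in several alleles; n sets the minimum number of alleles. Counting alleles per peptide can be sketched as follows. The function name and the returned tuple format are hypothetical and do not reflect the dataframe the package returns.

```python
from collections import defaultdict


def promiscuous(binders, n=2):
    """Return (peptide, allele count) for peptides bound in at least n alleles."""
    alleles_per_peptide = defaultdict(set)
    for row in binders:
        alleles_per_peptide[row["peptide"]].add(row["allele"])
    result = [(pep, len(als)) for pep, als in alleles_per_peptide.items()
              if len(als) >= n]
    # most widely bound peptides first, then alphabetical for a stable order
    result.sort(key=lambda item: (-item[1], item[0]))
    return result


binders_demo = [
    {"peptide": "FRLMRTNFL", "allele": "HLA-DRB1*01:01"},
    {"peptide": "FRLMRTNFL", "allele": "HLA-DRB1*03:05"},
    {"peptide": "FVTLDGQQF", "allele": "HLA-DRB1*01:01"},
]
pb_demo = promiscuous(binders_demo, n=2)
print(pb_demo)
```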
find clusters of binders in these results:
cl = analysis.find_clusters(b, method, dist=9, minsize=3)
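One simple way to think about clustering binders is to group start positions that lie close together along the protein. The sketch below treats dist as the largest allowed gap between neighbouring binder positions and minsize as the minimum number of binders in a cluster. These interpretations are assumptions made for illustration; the package's own definitions and algorithm may differ. The positions are taken from the example results table.

```python
def find_position_clusters(positions, dist=9, minsize=3):
    """Group sorted positions whose neighbours are at most dist apart."""
    clusters = []
    current = []
    for pos in sorted(positions):
        # a gap larger than dist closes the current group
        if current and pos - current[-1] > dist:
            if len(current) >= minsize:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= minsize:
        clusters.append(current)
    # report each cluster as (start, end, number of binders)
    return [(c[0], c[-1], len(c)) for c in clusters]


positions = [198, 199, 200, 709, 710, 711, 70, 71, 72, 32]
clusters_demo = find_position_clusters(positions, dist=9, minsize=3)
print(clusters_demo)
```

The isolated binder at position 32 is dropped because it does not reach the minimum cluster size.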