Code Examples

This page covers the Python API. To use the command line application instead, see the Command line interface page. The package provides convenient access to binding prediction methods and tools for analysing their results, so it can be applied in a number of ways.

Methodology

MHC binding and other prediction methods are implemented by inheriting from a Predictor object. All such classes should at minimum override the predict method for scoring a single sequence. This method may wrap functions from other Python packages or call command line predictors. For example, the TepitopePredictor uses the epitopepredict.tepitope module provided with this package.
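The subclassing pattern described above can be sketched as follows. This is a minimal, dependency-free illustration and not the package's own code: the Predictor class here is a hypothetical stand-in, predict_all and HydrophobicPredictor are invented names, the scoring rule is a toy, and rows are returned as a list of dicts rather than a dataframe.

```python
class Predictor:
    """Hypothetical stand-in for a predictor base class (not the package's own)."""

    name = "base"

    def predict(self, sequence, allele, length=9, name=""):
        # Subclasses must supply the scoring of a single sequence.
        raise NotImplementedError("subclasses override predict")

    def predict_all(self, sequences, alleles, length=9):
        # Hypothetical helper: call predict for every sequence and allele.
        rows = []
        for seq_name, seq in sequences.items():
            for allele in alleles:
                rows.extend(self.predict(seq, allele, length=length, name=seq_name))
        return rows


class HydrophobicPredictor(Predictor):
    """Toy subclass: the score is the number of hydrophobic residues."""

    name = "hydrophobic"
    HYDROPHOBIC = set("AVILMFWY")

    def predict(self, sequence, allele, length=9, name=""):
        rows = []
        # Slide a window of the requested length along the sequence.
        for pos in range(len(sequence) - length + 1):
            peptide = sequence[pos:pos + length]
            score = sum(1 for aa in peptide if aa in self.HYDROPHOBIC)
            rows.append({"peptide": peptide, "pos": pos, "score": score,
                         "name": name, "allele": allele})
        # Return the rows sorted by score, highest first.
        rows.sort(key=lambda r: r["score"], reverse=True)
        return rows


toy = HydrophobicPredictor()
rows = toy.predict("MKTAYIAKQRQISFVK", allele="HLA-DRB1*01:01", length=9, name="demo")
for row in rows[:3]:
    print(row)
```

The base class only fixes the interface, so any scoring backend (another package or an external program) can be slotted in by overriding predict alone.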

The predict method should return a Pandas DataFrame. The predict_sequences method is used for multiple protein sequences contained in a dataframe of sequences in a standard format. This dataframe is created from a genbank or fasta file (see examples below). For large numbers of sequences, predict_sequences should be called with save=True so that the results are saved as each protein is completed. This avoids memory issues, since predictions may be made for many alleles per protein. Results are saved with one file per protein/sequence in csv format.

The results are of the following form and are returned sorted by the score column:

     peptide       core      pos  score      name         allele  rank
198  VIFRLMRTNFL  FRLMRTNFL  198    3.4  ZEBOVgp1  HLA-DRB1*0101     1
199  IFRLMRTNFLI  FRLMRTNFL  199    3.4  ZEBOVgp1  HLA-DRB1*0101     1
200  FRLMRTNFLIK  FRLMRTNFL  200    3.4  ZEBOVgp1  HLA-DRB1*0101     1
709  NRFVTLDGQQF  FVTLDGQQF  709    2.5  ZEBOVgp1  HLA-DRB1*0101     4
710  RFVTLDGQQFY  FVTLDGQQF  710    2.5  ZEBOVgp1  HLA-DRB1*0101     4
711  FVTLDGQQFYW  FVTLDGQQF  711    2.5  ZEBOVgp1  HLA-DRB1*0101     4
70   DSFLLMLCLHH  FLLMLCLHH   70    2.0  ZEBOVgp1  HLA-DRB1*0101     7
71   SFLLMLCLHHA  FLLMLCLHH   71    2.0  ZEBOVgp1  HLA-DRB1*0101     7
72   FLLMLCLHHAY  FLLMLCLHH   72    2.0  ZEBOVgp1  HLA-DRB1*0101     7
32   QGIVRQRVIPV  IVRQRVIPV   32    1.7  ZEBOVgp1  HLA-DRB1*0101    10

Here name is the protein identifier from the input file (a locus tag, for example). The meaning of the score column differs between methods. MHC-II methods can be run for varying peptide lengths; the core is usually, but not always, the highest scoring one in that peptide/n-mer.
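Two features of the sample output can be checked with a short sketch. The rank values shown are consistent with ranking scores in descending order where tied scores share the lowest rank; whether the package computes rank this way is an assumption. The sketch also shows that overlapping 11-mers contain the same core at different offsets, which is why neighbouring rows share a score. The helper name min_rank_descending is invented for this example.

```python
def min_rank_descending(scores):
    """Rank scores so the highest gets 1; tied scores share the lowest rank."""
    ordered = sorted(scores, reverse=True)
    first_index = {}
    for i, s in enumerate(ordered):
        # Keep only the first (best) position at which each score appears.
        first_index.setdefault(s, i + 1)
    return [first_index[s] for s in scores]


# Scores copied from the sample output above.
scores = [3.4, 3.4, 3.4, 2.5, 2.5, 2.5, 2.0, 2.0, 2.0, 1.7]
ranks = min_rank_descending(scores)
print(ranks)  # [1, 1, 1, 4, 4, 4, 7, 7, 7, 10]

# The first three peptides overlap and contain the same core.
core = "FRLMRTNFL"
peptides = ["VIFRLMRTNFL", "IFRLMRTNFLI", "FRLMRTNFLIK"]
offsets = [p.find(core) for p in peptides]
print(offsets)  # [2, 1, 0]
```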

Basics

imports:

import epitopepredict as ep
from epitopepredict import base, sequtils, analysis, plotting

create a Predictor object:

#get list of predictors
print(base.predictors)
#['tepitope', 'netmhciipan', 'iedbmhc1', 'iedbmhc2', 'mhcflurry', 'mhcnuggets', 'iedbbcell']
p = base.get_predictor('tepitope')

get sequence data:

#get data in genbank format into a dataframe
df = sequtils.genbank2Dataframe(genbankfile, cds=True)
#get sequences from fasta file
df = sequtils.fasta2Dataframe(fastafile)

run predictions for a protein sequence:

seq = ep.testsequence
label = 'myprot' #optional label for your sequence
p = base.get_predictor('tepitope')
p.predict(sequence=seq, allele='HLA-DRB1*01:01', length=11, name=label)

run predictions for multiple proteins:

#run for 2 alleles and save results to savepath
alleles = ["HLA-DRB1*01:01", "HLA-DRB1*03:05"]
p = base.get_predictor('tepitope')
p.predict_proteins(df, length=11, alleles=alleles, save=True, path=savepath)
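The idea behind saving results per protein can be sketched without the package. This is only an illustration under stated assumptions: toy_predict and predict_and_save are hypothetical helpers, the scoring is a toy, and the file naming (protein name plus .csv) is assumed rather than taken from the package.

```python
import csv
import os
import tempfile

FIELDS = ["peptide", "pos", "score", "name", "allele"]


def toy_predict(sequence, allele, length, name):
    """Hypothetical scorer: counts hydrophobic residues in each n-mer."""
    hydrophobic = set("AVILMFWY")
    return [
        {"peptide": sequence[i:i + length], "pos": i,
         "score": sum(aa in hydrophobic for aa in sequence[i:i + length]),
         "name": name, "allele": allele}
        for i in range(len(sequence) - length + 1)
    ]


def predict_and_save(proteins, alleles, length, path):
    """Write one csv file per protein so results do not accumulate in memory."""
    os.makedirs(path, exist_ok=True)
    written = []
    for name, seq in proteins.items():
        filename = os.path.join(path, name + ".csv")
        with open(filename, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=FIELDS)
            writer.writeheader()
            for allele in alleles:
                # Rows for this allele are written immediately and then discarded.
                writer.writerows(toy_predict(seq, allele, length, name))
        written.append(filename)
    return written


proteins = {"protA": "MKTAYIAKQRQISFVKSHF", "protB": "MSDNGPQNQRNAPRITFGG"}
outdir = tempfile.mkdtemp()
files = predict_and_save(proteins, ["HLA-DRB1*01:01", "HLA-DRB1*03:05"], 11, outdir)
print(sorted(os.listdir(outdir)))
```

Only the rows for one protein and one allele are held at a time, which is what keeps memory use flat when many alleles are run over a whole genome.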

run predictions for a list of peptides:

from epitopepredict import peptutils
seqs = peptutils.create_random_sequences(5000)
p = base.get_predictor('tepitope')
x = p.predict_peptides(seqs, alleles=alleles)

run with multiple threads:

x = p.predict_peptides(seqs, alleles=alleles, threads=4)
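How a peptide list might be divided between workers can be sketched as below. This is not the package's implementation: split, score_chunk and predict_parallel are hypothetical names, the scorer is a toy, and a thread pool is used only to keep the example simple. For CPU-bound scoring in pure Python, a process pool would normally be the better choice.

```python
from concurrent.futures import ThreadPoolExecutor


def score_chunk(peptides):
    """Hypothetical scorer applied to one chunk of peptides."""
    hydrophobic = set("AVILMFWY")
    return [(p, sum(aa in hydrophobic for aa in p)) for p in peptides]


def split(items, n):
    """Split items into n nearly equal contiguous chunks."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def predict_parallel(peptides, threads=4):
    """Score chunks concurrently and join the results in the original order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(score_chunk, split(peptides, threads)))
    return [row for chunk in results for row in chunk]


peptides = ["FRLMRTNFL", "FVTLDGQQF", "FLLMLCLHH", "IVRQRVIPV", "GGGSGGGSG"]
parallel = predict_parallel(peptides, threads=4)
print(parallel)
```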

load previous results into a predictor:

p.load(path=path) #where path stores csv files for multiple proteins
p.load(filename=file) # where file is a csv formatted file of prediction results (can be 1 or more proteins)

Analysis

get all the binders using the current data loaded into the predictor:

#default is to use percentile cutoff per allele, returns a dataframe
p.get_binders(cutoff=.95)

get binders for only one protein by top median rank:

p.get_binders(name=name, cutoff=10, cutoff_method='rank')
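A per-allele percentile cutoff can be sketched as follows. This assumes that higher scores mean stronger binding and uses linear interpolation for the quantile; the package's exact calculation may differ. The helpers percentile and binders_by_percentile and the data are made up for illustration.

```python
from collections import defaultdict


def percentile(values, q):
    """Linearly interpolated quantile of values, for 0 <= q <= 1."""
    ordered = sorted(values)
    k = (len(ordered) - 1) * q
    lo = int(k)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (k - lo)


def binders_by_percentile(rows, cutoff=0.95):
    """Keep rows scoring at or above their own allele's percentile threshold."""
    by_allele = defaultdict(list)
    for r in rows:
        by_allele[r["allele"]].append(r["score"])
    thresholds = {a: percentile(s, cutoff) for a, s in by_allele.items()}
    return [r for r in rows if r["score"] >= thresholds[r["allele"]]]


# Made-up data: allele B scores are on a scale ten times larger than allele A.
rows = [{"peptide": "A%d" % i, "allele": "A", "score": float(i)} for i in range(1, 21)]
rows += [{"peptide": "B%d" % i, "allele": "B", "score": float(10 * i)} for i in range(1, 21)]

binders = binders_by_percentile(rows, cutoff=0.95)
print([r["peptide"] for r in binders])
```

Computing the threshold separately for each allele means an allele with systematically lower scores still contributes its own top binders, which a single global threshold would not guarantee.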

get all promiscuous binders, returns a dataframe:

pb = p.promiscuous_binders(n=2, cutoff=.95)
#same using score cutoff
pb = p.promiscuous_binders(n=2, cutoff_method='score', cutoff=500)
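The notion of a promiscuous binder (a peptide that passes the cutoff in at least n alleles) can be sketched as below. The function name promiscuous and the input rows are illustrative only; the package returns a dataframe and may report additional columns.

```python
from collections import defaultdict


def promiscuous(binders, n=2):
    """Return (peptide, allele count) pairs for peptides bound by >= n alleles."""
    alleles_per_peptide = defaultdict(set)
    for r in binders:
        alleles_per_peptide[r["peptide"]].add(r["allele"])
    counts = {p: len(a) for p, a in alleles_per_peptide.items() if len(a) >= n}
    # Most widely bound peptides first, ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up binder rows, as if already filtered by a cutoff.
binders = [
    {"peptide": "FRLMRTNFL", "allele": "HLA-DRB1*01:01"},
    {"peptide": "FRLMRTNFL", "allele": "HLA-DRB1*03:05"},
    {"peptide": "FRLMRTNFL", "allele": "HLA-DRB1*04:01"},
    {"peptide": "FVTLDGQQF", "allele": "HLA-DRB1*01:01"},
    {"peptide": "FVTLDGQQF", "allele": "HLA-DRB1*03:05"},
    {"peptide": "IVRQRVIPV", "allele": "HLA-DRB1*01:01"},
]

pb = promiscuous(binders, n=2)
print(pb)  # [('FRLMRTNFL', 3), ('FVTLDGQQF', 2)]
```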

find clusters of binders in these results:

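#b is a dataframe of binders, e.g. from p.get_binders(); method is assumed to be the prediction method name
cl = analysis.find_clusters(b, method, dist=9, minsize=3)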
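One simple way to think about clustering is to group binder start positions that lie close together along the protein. The sketch below assumes dist is the largest allowed gap between neighbouring positions and minsize is the minimum number of binders in a cluster; both interpretations are assumptions, and this is not the algorithm used by analysis.find_clusters. The function name find_position_clusters is invented.

```python
def find_position_clusters(positions, dist=9, minsize=3):
    """Group sorted positions into runs whose neighbouring gaps are <= dist."""
    clusters, current = [], []
    for pos in sorted(positions):
        # A gap larger than dist closes the current run.
        if current and pos - current[-1] > dist:
            if len(current) >= minsize:
                clusters.append(current)
            current = []
        current.append(pos)
    # Do not forget the final run.
    if len(current) >= minsize:
        clusters.append(current)
    return clusters


# Positions taken from the pos column of the sample output above.
positions = [198, 199, 200, 709, 710, 711, 70, 71, 72, 32]
clusters = find_position_clusters(positions, dist=9, minsize=3)
print(clusters)  # [[70, 71, 72], [198, 199, 200], [709, 710, 711]]
```

The isolated binder at position 32 is dropped because it forms a run of one, below minsize.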