Need help? Contact us

OpenProt Help

Getting Started - Search, click here.

Getting Started - Browse, click here.

Getting Started - Data Submission Platform, click here.

FAQ Contents, click here.

Downloads guidelines, click here.


Getting Started - Search

If you want to know if a specific gene contains AltORFs (all predicted and those with evidence of expression), click Search.

You will then be redirected towards a query page:

You can input your search criteria as follows:

  1. Select a species (default is Homo sapiens).
  2. Select an assembly (default is the most recent in each species).
  3. Select an annotation (default is Ensembl+RefSeq). Several annotations are used by OpenProt to predict AltProts. Ensembl, NCBI RefSeq and combined Ensembl+RefSeq annotations are available for all species. If you want to know why OpenProt supports multiple annotations, you can click here. 
  4. Enter the name of your gene of interest.

Alternatively, you can also search by transcript or protein accessions (5 and 6 respectively). Both Ensembl and RefSeq accession IDs are accepted. Proteins may be searched on one or more specific transcripts. Similarly, several proteins can be searched for simultaneously.

Below is an example for the COL1A1 gene. Once you have entered your gene name and launched the search, your results will appear below. The number of found proteins respecting your search criteria is indicated at the top (1) of your results table (2).

You can then refine your search results by playing with the options in the dropdown menu or by selecting the Advanced Search option.

  1. Tick to search for or display only proteins (RefProts, AltProts and Isoforms) that have been detected by mass spectrometry (MS) and/or for which translation events (TE) have been identified in ribosome profiling studies.
  2. Tick to search for or display only proteins that have been detected by MS. For a list of MS studies reanalysed by OpenProt, click here.
  3. Tick to search for or display only proteins that have been detected by ribosome profiling. For a list of ribosome profiling studies reanalysed by OpenProt, click here.
  4. Tick to search for or display only proteins with predicted domains by InterProScan.
  5. Tick to search for or display only AltProts.
  6. Tick to search for or display only Isoforms.

Any of the above can be combined as you wish. An advanced search is also available by clicking Advanced Search.

  1. Filter by a specific amino acid sequence.
  2. Filter according to the transcript type (mRNA or ncRNA) or the localization of AltORFs in transcripts. Within mRNAs, the localization of an AltORF is defined by the position of its predicted start codon with respect to the annotated CDS start codon. The localization of AltORFs within non-coding RNAs is labeled “-”. There are three possible localizations of AltORFs within mRNAs: “5’UTR”, “CDS”, “3’UTR”. Thus, the dropdown menu offers 5 choices: “5’UTR”, “3’UTR”, “CDS”, “ncRNA”, or “mRNA”.
  3. Filter AltORFs in a specific reading frame (+1, +2 or +3). The reading frame is determined with respect to the first nucleotide of each transcript (+1 reading frame).
  4. Filter by dataset identifier. This is a dropdown menu containing all datasets currently in OpenProt. Select one to mine proteins detected in this dataset. Please note that the filter supports only one study at a time.
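The localization and reading-frame definitions used by the Advanced Search filters can be illustrated with a minimal Python sketch. It is not OpenProt's code: the function names, the 1-based coordinate convention and the exact boundary rules (a start codon at or after the CDS start and up to the CDS end counts as “CDS”) are assumptions made for illustration.

```python
def reading_frame(orf_start):
    """Return the reading frame (1, 2 or 3) of an ORF.

    orf_start is the 1-based transcript position of the first nucleotide
    of the start codon; the first nucleotide of the transcript is in frame +1.
    """
    if orf_start < 1:
        raise ValueError("orf_start must be a 1-based position")
    return (orf_start - 1) % 3 + 1


def altorf_localization(orf_start, cds_start=None, cds_end=None):
    """Classify an AltORF by where its predicted start codon falls.

    cds_start and cds_end are the 1-based, inclusive bounds of the annotated
    CDS (assumed convention). For a non-coding RNA there is no annotated CDS,
    so both are None and the localization is "-".
    """
    if cds_start is None or cds_end is None:
        return "-"
    if orf_start < cds_start:
        return "5'UTR"
    if orf_start <= cds_end:
        return "CDS"
    return "3'UTR"
```

For example, with an annotated CDS spanning positions 50 to 200, `altorf_localization(10, 50, 200)` gives `"5'UTR"`, and an ORF starting at position 4 is in reading frame +1.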

You can further sort your results by clicking on any option of the Order by dropdown menu (1).

  1. The following sorting options are available: “MS score (desc) / TE (desc) / Domains (desc)” (by default);
    “Domains (desc) / MS score (desc) / TE (desc)”;
    “TE (desc) / MS score (desc) / Domains (desc)”;
    “Molecular Weight (asc) / MS score (desc) / TE (desc) / Domains (desc)”;
    “Molecular Weight (desc) / MS score (desc) / TE (desc) / Domains (desc)”;
    “Protein Length (asc) / MS score (desc) / TE (desc) / Domains (desc)”;
    “Protein Length (desc) / MS score (desc) / TE (desc) / Domains (desc)”.
  2. Control which columns you want to see in the results table by clicking on Column Settings and deselecting any you don’t want to see.
  3. You can download your results table by clicking on Download as TSV. For more options and information on available downloads, click on Downloads Guidelines.
  4. You can download protein sequences from your results table by clicking on Download as FASTA. For more options and information on available downloads, click on Downloads Guidelines.
  5. You can also share your search by clicking on Share. A pop-up window will display a shareable link.
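If you prefer to re-sort a downloaded results table offline, the "Order by" options above can be reproduced with a multi-key sort. The sketch below is only an illustration: the column names (`accession`, `length`, `ms`, `te`, `domains`) and the sample rows are hypothetical, so check the header of your own TSV file and adapt the keys.

```python
import csv
import io


def load_rows(tsv_text):
    """Parse tab-separated text into a list of dictionaries (one per row)."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def sort_rows(rows, leading=None, ascending=True):
    """Sort rows like the 'Order by' menu.

    With leading=None the order is MS score (desc) / TE (desc) / Domains (desc).
    With leading="length" (hypothetical column name) that column is used first,
    ascending or descending, followed by the same three descending keys.
    """
    def key(row):
        tail = (-int(row["ms"]), -int(row["te"]), -int(row["domains"]))
        if leading is None:
            return tail
        value = float(row[leading])
        return ((value if ascending else -value),) + tail

    return sorted(rows, key=key)


# Made-up rows, for illustration only.
SAMPLE = (
    "accession\tlength\tms\tte\tdomains\n"
    "A\t50\t2\t0\t1\n"
    "B\t120\t5\t1\t0\n"
    "C\t30\t2\t3\t0\n"
)

rows = load_rows(SAMPLE)
default_order = [r["accession"] for r in sort_rows(rows)]
length_order = [r["accession"] for r in sort_rows(rows, leading="length")]
```

Negating the numeric keys gives a descending order while keeping a single stable call to `sorted`.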

Main Menu


Getting Started - Browse

If you want to browse the genome of a specific species for AltORFs (all predicted and those with evidence of expression), click Browse.

You will then be directed towards a query page.

You can input your search criteria as follows:

  1. Select a species (default is Homo sapiens).
  2. Select an assembly (default is the most recent in each species).
  3. Select an annotation (default is Ensembl). Both Ensembl and NCBI RefSeq annotations are used by OpenProt to predict AltProts, and the browser is available for both. If you want to know why OpenProt supports multiple annotations, you can click here. For more information on how to display both annotations on the browser, click here.
  4. Enter the name of your gene of interest.

Alternatively, you can also search by transcript or protein accessions (5 and 6 respectively). Both Ensembl and RefSeq accession IDs are accepted (depending on the chosen annotation). You can also directly enter genomic coordinates of interest (7).

Below is an example for the COL1A1 gene. Once you have entered your gene name (1) and launched the search (2), your results will appear centered in the browser window.

You can visualize the genomic coordinates (1) and the different tracks. The first track contains transcripts for the chosen annotation (2 - here, Ensembl). The second contains predicted proteins (3). The colour code is indicated below the browser, with transcripts in blue, RefProts in green, AltProts in red and Novel Isoforms in yellow. You can widen or narrow the browser window (4) and customize your display by adding or removing a track from the registry (5). By default, the registry includes the genome, transcript, protein and peptide detection tracks.

If you scroll down on the genome browser (1), the last track will appear; it contains the peptides detected by MS (1).

Furthermore, you can click on a peptide to display the details associated with this peptide in a pop-up window.

The pop-up window displays the peptide sequence (1), its genomic coordinates (2) and the proteins assigned to that peptide (3). All proteins this peptide has been assigned to are listed, across both annotations (Ensembl + RefSeq). For more information on peptide assignation rules, click here. The details page of the assigned proteins can be consulted directly by clicking on the goto details link (4).

Such pop-up windows are also displayed when clicking on a protein or a transcript (as shown below).

The transcript-associated pop-up window contains the transcript genomic coordinates (1) and a list of all proteins associated with this transcript (2). Each protein can then be accessed by clicking on the goto details link (3).

Main Menu


Getting Started - Data Submission Platform

From any OpenProt page, including the home page, click Submit study.

Once you have clicked on Submit study, you should first select the type of file you are submitting: mass spectrometry or ribosome profiling.

***

For mass spectrometry studies, your dataset has to be available in the PRIDE Archive with a public PXD accession.

After entering the PXD accession number, the OpenProt submission platform will retrieve information from the PRIDE repository (here, we use the PXD015644 as an example). Thus, the PMID (1) and citation (2) are automatically filled, as well as the available samples in the dataset (3).

First, enter a contact email that will serve for all future correspondence (1). For example, we will send you the results of the analysis at this email address.

Once you have entered your email address, you can start selecting samples (2).

In order to select a sample, click on its name. The blue color indicates the sample is selected (1), the white background indicates the sample is not selected (2).

Once samples are selected, you can click on “Group selected” (1) to add each of them to one group with identical parameters. To correct an erroneous selection, you can click on “clear selection” (2). If the dataset contains some samples that you don’t want to include in the analysis, you can click on “Exclude selection” (3). If you want to select all samples at once, you can click on “select all” (4). Once samples are grouped, they will be removed from the selection panel. If you forgot one sample, you can add it to a pre-formed group by clicking on “add to selected group” (5). Please note that all samples must be grouped in order to submit.

Your grouped samples will then appear in the parameters editing box (1). Please note that your samples should be grouped by parameter settings.

For each sample within a group, you can edit its fraction and replicate number by clicking on “Edit” (1). Then, for each group you have to indicate the enzyme (2) used for protein digestion, and the variable (3) and fixed modifications (4) to include in the analysis. These are drop-down menus with all enzymes and modifications available in our pipeline. For custom enzymes or modifications, please contact us.

The parameters entered can always be changed by clicking the cross next to the selected enzyme or modification.

At the bottom of the page, the next parameters to enter are the species (1), the type of biological sample (2), the fragmentation protocol (3) and the mass spectrometer used (4).

The species, fragmentation and MS instrument are compulsory for submission. The MS instrument is retrieved from the PRIDE directory. The species is a dropdown menu containing all species currently supported by OpenProt. The fragmentation protocol is a dropdown menu with the protocols currently supported by OpenProt. (For more information on the fragmentation protocol, click here).

Once all compulsory parameters have been filled, you can click on submit. You will receive an email (at the email address indicated at the top of the form) to confirm the submission (please check your spam folder if you don’t receive any email).

 ***

For ribosome profiling studies, your dataset has to be available in the Gene Expression Omnibus (GEO) archive with a public GSE accession.

After entering the GSE accession number, the OpenProt submission platform will retrieve information from the Gene Expression Omnibus repository (here, we use GSE144682 as an example). Thus, the PMID (1) and citation (2) are automatically filled, as well as the available samples in the dataset (3).
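Before submitting, it can help to check that your accession has the expected shape. The sketch below is a hypothetical pre-check, not part of the OpenProt platform; it assumes that PRIDE/ProteomeXchange accessions are "PXD" followed by six digits (as in PXD015644) and that GEO series accessions are "GSE" followed by digits (as in GSE144682).

```python
import re

# Assumed accession shapes, inferred from the examples used on this page.
PXD_RE = re.compile(r"^PXD\d{6}$")
GSE_RE = re.compile(r"^GSE\d+$")


def accession_type(accession):
    """Return the study type matching an accession, or None if unrecognised."""
    accession = accession.strip().upper()
    if PXD_RE.match(accession):
        return "mass spectrometry"
    if GSE_RE.match(accession):
        return "ribosome profiling"
    return None
```

A well-formed accession is necessary but not sufficient: the dataset must also be public in the corresponding repository.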

First, enter a contact email that will serve for all future correspondence (1). For example, we will send you the results of the analysis at this email address.

Once you have entered your email address, you can start selecting samples.

In order to select a sample, click on its name. The blue color indicates the sample is selected (4), the white background indicates the sample is not selected.

Once samples are selected, you can click on “Group selected” (2) to add each of them to one group with identical parameters. To correct an erroneous selection, you can click on “clear selection” (3). If you want to select all samples at once, you can click on “select all” (4). If the dataset contains some samples that you don’t want to include in the analysis, you can click on “Exclude selection” (5). Once samples are grouped, they will be removed from the selection panel. If you forgot one sample, you can add it to a pre-formed group by clicking on “add to selected group” (6). Please note that all samples must be grouped in order to submit.

Your grouped samples will then appear in the parameters editing box (1). Please note that your samples should be grouped by parameter settings.

A sample can always be removed from the group by clicking the cross next to its name.

At the bottom of the page, the next parameters to enter are the species (1), the time of treatment (2), the drug used (3) and the biological type of the sample (4).

The species, time of treatment and drug used are compulsory for submission. The species is a dropdown menu containing all species currently supported by OpenProt. The time of treatment should correspond to when the drug was added during the protocol (if the drug was part of the lysis buffer, select n/a). The drug used is a dropdown menu containing all drugs currently supported by OpenProt.

Once all compulsory parameters have been filled, you can click on submit. You will receive an email (at the email address indicated at the top of the form) to confirm the submission (please check your spam folder if you don’t receive any email).

Main Menu


FAQ contents

Mass Spectrometry analyses related questions:

Ribosome Profiling related questions:

Conservation related questions:

Main menu


What are AltProts (alternative proteins), novel predicted Isoforms and RefProts (reference proteins)?

Current genome annotations in eukaryotes rely partly on ORF prediction algorithms, which are reliable only for sequences above a certain length. Consequently, three main criteria are enforced to distinguish true ORFs from random ones: (1) a minimum length of 100 codons; (2) a single CDS per transcript; and (3) the use of an ATG start codon. However, these assumptions lead to a substantial underestimation of the proteomic information encoded within a gene, and hamper the discovery of proteins translated from unannotated ORFs (PMID: 29083303, 28627015, 26578573, 29626080).

Here, in OpenProt, we use different terms to identify proteins based upon their genome annotation status.
- Proteins currently annotated in databases, such as UniProtKB, are translated from canonical CDSs (annotated coding sequences) and are termed reference proteins (or RefProts).

- Alternative ORFs (or AltORFs) are defined as potential protein-coding ORFs, located either in non-coding RNAs (e.g. long non-coding RNAs, pseudogene RNAs), or in UTRs or alternative reading frames overlapping the CDS in mRNAs. Predicted proteins translated from AltORFs are termed alternative proteins (or AltProts; IP_). An AltProt and a RefProt from the same gene are not isoforms: they are coded by different ORFs and their amino acid sequences are completely different.

- Predicted proteins translated from an alternative ORF (as defined above), but that display either (1) a close homology with a reference protein from the same gene, or (2) the same start and/or stop codon as the reference protein, and an alignment score above the threshold, are considered novel isoforms of the reference proteins (or Isoforms; II_).

Genome annotations are ever-changing and also rely on manual curation. Thus, once there is enough evidence of expression and function for an AltProt, it will be annotated in UniProtKB (via manual curation) and becomes a new RefProt in OpenProt. For example, the human MIEF1 gene encodes two RefProts: the originally annotated MiD51 protein (UniProtKB Q9NQG6), and the recently annotated UniProtKB L0R8F8.

Expression of AltProts demonstrates that an unknown fraction of eukaryotic genes are polycistronic, and that an unknown fraction of RNAs originally annotated as non-coding RNAs are actually encoding small proteins.

Back

How does OpenProt differ from other small ORFs databases or UniProt?

OpenProt is the first and only protein database to enforce a polycistronic model of mammalian genome annotation. Most genome annotations rely on prior knowledge or ab initio algorithms which apply 3 arbitrary criteria: an ATG start codon, a minimum length of 100 codons, and a single coding sequence per transcript. OpenProt still holds the ATG start codon criterion (although admittedly other codons might initiate translation), but allows for multiple coding sequences per transcript.

Furthermore, OpenProt lowers the minimum length threshold to 30 codons. Contrary to small ORFs (smORFs) databases, we do not hold a maximum length threshold, which allows for the detection of novel proteins longer than 100 amino acids in “non-coding” RNAs. Thus, OpenProt differs from other smORFs databases as it annotates (a) novel proteins on mRNAs (polycistronic model) and (b) novel proteins longer than 100 codons (no maximum length threshold). Furthermore, OpenProt also predicts Novel Isoforms of known proteins.

Finally, OpenProt distinguishes itself from UniProtKB as a new tool, an expanded database for discoveries in functional proteomics. Indeed, OpenProt predicts novel ORFs but also collects evidence at 3 different levels (conservation, translation and expression) using adapted pipelines. Novel proteins discovered using OpenProt and further characterized may then be added to the UniProtKB database (and will change to RefProts in OpenProt).
Overall, OpenProt is a platform that offers a novel expanded view of the proteome, allowing less serendipitous discoveries in proteomics (see PMID 29083303 and 29626081).

Back

How can I find an AltProt encoded in a specific gene?

Please go to Getting started.

Back

How do I get the protein sequence of an AltProt?

From the results table on the Search page, click on details to get additional information on the selected protein.

You can then click on show in the Protein column to obtain the selected protein sequence.

Ultimately, if you want to access protein sequences for a table of results, you can also click on download as FASTA from the results table.

Back

How do I get the DNA sequence of an AltORF encoding a specific AltProt?

From the results table on the Search page, click on details to get additional information on the selected protein.

You can then click on show in the DNA column to obtain the DNA sequence encoding the selected protein.

Back

I want to detect predicted (or already detected) AltProts and/or Isoforms in MS-based proteomic analyses. Can I download databases in FASTA format?

Protein sequence databases can be downloaded by clicking Downloads in the upper right corner of any OpenProt.org page and by following the Guidelines to download protein sequence databases (FASTA format) for MS users (several files including RefProts, and/or AltProts, and/or Isoforms). Alternatively, you can click on Download FASTA file for MaxQuant on the home page to download a Human protein sequence database containing AltProts and Isoforms for which at least two unique peptides have already been detected by MS. This file also contains RefProts.

Back

How to download a FASTA file for Proteomics analyses?

Databases for each species can be downloaded from the Downloads page. For a full tutorial on how to download data from OpenProt, go here.

Back

How do I download the full database (tsv format) of predicted AltProts, or smaller databases with subset of data based on specific criteria (e.g. experimental evidence)?

Databases are available in .tsv format (includes protein accession, protein length, molecular weight, isoelectric point, reading frame, gene symbol, genomic coordinates, transcript accession, type, localization, transcript coordinates, MS score (Mass Spectrometry), TE score (Translation Event), orthology across 10 species, prediction for the presence or not of Kozak motif or High-efficiency TIS (Translation Initiation Site) motif, and domain score). Please follow the Guidelines to download the full database or smaller databases with subset data based on specific criteria in .tsv format. You can also download the results obtained in the search page by selecting Download as TSV.

Back

Which database should I download?

Ultimately, that will depend on your experimental question, but here are some guidelines.

For MS-based proteomics analyses, we recommend using a database containing all RefProts and AltProts and Isoforms with evidence of detection (minimum of 1 or 2 peptides seen in MS datasets, depending on the level of stringency you would like).

If you wish to use the whole set of AltProts and Isoforms (including the predicted ones with no evidence of detection yet), we do not recommend using a classic MaxQuant analysis. The substantially increased size of the target database will have an impact on protein identification (for more information on how database size can alter protein identification, we recommend Jeong, et al.). The RefProts database and the AltProts and Isoforms database can be downloaded separately as FASTA files.

Back

How can I share with colleagues a specific search on OpenProt?

You can share a specific search on OpenProt with colleagues by clicking the Share button at the top of the search results table (1).

A shareable link will be created and appear in a pop-up (2).

Back

Why does OpenProt support several annotations (Ensembl and NCBI RefSeq)?

OpenProt supports both Ensembl and NCBI RefSeq annotations for all species. We chose to support both annotations, firstly, because both are widely used and people might find it easier to search using the annotation they are used to. Secondly, one should be aware that annotations are not perfect and that Ensembl and NCBI RefSeq annotations only partially overlap. In order to be as exhaustive as possible, we included both annotations; more information on the comparison of annotations can be found here.

Back

Why does OpenProt use a 3-frame in silico translation instead of a 6-frame one?

Since the OpenProt pipeline uses transcriptome annotations and not genome annotations, the strand of each transcript is known and we can restrict the in silico translation to the 3 forward frames.
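As an illustration of what a 3-frame scan of a transcript looks like, here is a minimal Python sketch. It is not the OpenProt pipeline: it applies the two criteria stated in this FAQ (ATG start codon, minimum length of 30 codons) and adds assumptions of its own, namely that the 30 codons are counted from the ATG and exclude the stop codon, that the first ATG after the previous stop is used, and that ORFs without an in-frame stop codon are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_CODONS = 30  # assumption: counted from the ATG, stop codon excluded


def find_orfs(transcript, min_codons=MIN_CODONS):
    """Return (frame, start, end) for ATG-initiated ORFs in the 3 forward frames.

    frame is 1, 2 or 3 (relative to the first nucleotide of the transcript),
    start is the 0-based index of the ATG, and end is exclusive and includes
    the stop codon.
    """
    seq = transcript.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # open an ORF at the first ATG seen
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start, pos + 3))
                start = None  # close the ORF and look for the next ATG
    return orfs
```

Because the input is a transcript, only the three forward frames are scanned; a genome-based scan would also need the three frames of the reverse strand.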

Back

How can I visualize both annotations in the OpenProt genome browser?

OpenProt supports both Ensembl and NCBI RefSeq annotations for all species and you can visualize both in the genome browser. To do so, click on the + sign above the genomic track display.

Under the Defaults tab (1), you will find a list of tracks available on OpenProt. You can then tick the box of the annotations you want to display (2). Annotations will be displayed in separate tracks.

Back

What is the FASTA header?

The general format is as below:

>Identifier|TX=TaxonomyIdentifier OS=OrganismName GN=GeneName TA=TranscriptAccession PA=ProteinAccession


Here is a description of the fields:

  • Identifier Proteins annotated in OpenProt have an accession number starting either with IP (for Inferred Protein, i.e. predicted AltProt) or with II (Inferred Isoform, i.e. novel predicted isoform of RefProt). Protein accession numbers for RefProts originate from Ensembl, NCBI RefSeq or UniProtKB annotations. If the "RefProts Included" option is selected, the FASTA file (protein) will contain all RefProt sequences (non-redundant protein sequences from UniProtKB / SwissProt, Ensembl and NCBI RefSeq annotations). When more than one protein accession number is available for one sequence, UniProtKB annotation is used as the Identifier and others are included in the PA field.
  • TX = Taxonomy Identifier.
  • OS = Organism Name.
  • GN = Gene Name.
  • TA = Transcript accession number(s) (from Ensembl and/or NCBI RefSeq annotations).
  • PA = Protein accession number(s).
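The header fields described above can be extracted with a small parser. The following Python sketch assumes the general format shown in this entry; the example header and its accessions are made up for illustration, and the way multiple accessions are separated within TA or PA is not specified here, so the values are returned as raw strings.

```python
import re

# Field tags taken from the header description above.
FIELD_RE = re.compile(r"\b(TX|OS|GN|TA|PA)=")


def parse_header(header):
    """Split an OpenProt-style FASTA header into a dictionary of fields."""
    line = header.lstrip(">").strip()
    identifier, _, rest = line.partition("|")
    fields = {"Identifier": identifier}
    matches = list(FIELD_RE.finditer(rest))
    for i, match in enumerate(matches):
        # A value runs from the end of its tag to the start of the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(rest)
        fields[match.group(1)] = rest[match.end():end].strip()
    return fields


# Hypothetical header, for illustration only.
EXAMPLE = ">IP_000001|TX=9606 OS=Homo sapiens GN=GENE1 TA=ENST00000000001 PA=ENSP00000000001"
parsed = parse_header(EXAMPLE)
```

Splitting on the field tags rather than on spaces keeps organism names such as "Homo sapiens" intact.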

A selection of files that are available for download is available under the Downloads tab or here.

Back

Why are FASTA files of AltProts and Isoforms detected by MS not available for download in my species of interest?

It might happen that no file of AltProts and/or Isoforms detected by MS is available for download in some species yet. This does not mean that no AltProts or Isoforms exist in this species (the file containing all predicted proteins is available for download - 1). This absence can simply reflect the number of MS studies that have been reanalysed by OpenProt in this particular species.

As exemplified here for Caenorhabditis elegans, no file of detected AltProts and/or Isoforms is available for download since MS datasets in this species have yet to be reanalysed using the OpenProt pipeline. An exhaustive list of MS studies reanalysed by OpenProt can be found here. Alternatively, if you would like to suggest an MS dataset to analyse using OpenProt, please contact us.

Back

Why do I see the same protein sequence annotated either as II_ (Isoform) or IP_ (AltProt)?

This situation can happen if two nucleotide sequences coding for the same amino acid sequence (predicted protein) are found at two different genomic loci.

Back

I am not sure I understand what a novel predicted isoform (II_) is.

If the initial definition of an Isoform (II_) left you hanging, let’s take a closer look using an example. A novel predicted Isoform, strongly detected using the OpenProt MS pipeline, is that of the MUC2 human gene. The protein (II_781286) is predicted on the ENST00000361558 transcript, annotated as non-coding in Ensembl. However, it shares a 96.6 % homology with the canonical protein MUC2, as shown below with a Clustal Omega alignment.

The dark blue colour indicates identical amino acids. This novel protein (II_781286) is currently not annotated (ncRNA in Ensembl), yet it shares a high homology with the canonical MUC2 protein, indicating that it is not an AltProt but a novel Isoform (II_).  

Back

How can I know why a novel protein is annotated as a novel predicted Isoform (II_)?

OpenProt now also displays, for all novel predicted Isoforms (II_), the RefProt(s) responsible for the annotation as isoform instead of AltProt. This information is displayed at the top of the details page for each novel predicted Isoform (circled in red) and in detail in the isoforms tab (red arrow).

The isoform tab contains a table (1) that lists for each RefProt (listed by their accession, 2) the reason for the annotation as isoform (3) for the looked up protein (here II_794710).

Back

How can I see if a novel protein shares sequence similarities with others from the same gene?

OpenProt reports the relationship between proteins from the same gene (isoform prediction). Briefly, OpenProt evaluates the protein sequence identity between proteins from the same gene using an all-vs-all BLAST search. OpenProt reports as isoforms proteins with a bitscore over 40 for an overlap over 50 % of the queried sequence. Each identified isoform is listed in the isoforms tab on each protein page (red arrow). The number indicated at the top (1) corresponds to the number of identified isoforms. The tab contains a protein tree (2) where nodes correspond to unique protein sequences present in OpenProt (3) and are coloured based on their identity (1) to the looked-up protein (here, IP_191523). Each node is accompanied by the corresponding protein accession. By clicking on the protein accession, a pop-up window (represented in dotted boxes) appears with the results of the BLAST search (4).
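The reporting rule above (bitscore over 40 and overlap over 50 % of the queried sequence) can be written as a simple predicate. This Python sketch is only an illustration: the function name is hypothetical, and it assumes that the overlap is measured as the aligned length divided by the query length, which this page does not specify.

```python
def is_reported_isoform(bitscore, overlap_length, query_length,
                        min_bitscore=40.0, min_overlap=0.5):
    """Apply the two thresholds stated above to one BLAST hit.

    overlap_length / query_length is an assumed definition of the
    "overlap over 50 % of the queried sequence".
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    overlap_fraction = overlap_length / query_length
    return bitscore > min_bitscore and overlap_fraction > min_overlap
```

For instance, a hit with a bitscore of 85 covering 60 of 100 query residues passes both thresholds, whereas the same bitscore over 40 residues does not.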

Back

Future features and directions for OpenProt?

OpenProt is constantly evolving.

  • Mass spectrometry and ribosome profiling studies are continuously being added to the database, thus increasing our number of reliable detections for all species. You can help us by submitting your own studies (click here for more information)!
  • OpenProt is developed in accordance with the FAIR guiding principles for scientific data management and stewardship (see Wilkinson et al., Sci Data, 2016). This means that generated accession numbers are persistent in time; and that the code, data and files have persistent release numbers.
  • Future releases plan to include a mass spectra viewer, a sequence similarity search (BLAST), protein-protein interaction networks and functional information gained from genomic variants.

If you have any suggestions for features, species or datasets for OpenProt, do not hesitate to contact us. We are always looking for opportunities to help, collaborate and make AltORFs more accessible to the scientific community!

Back

How does OpenProt follow the FAIR guidelines for database stewardship and management?

OpenProt follows all FAIR guidelines (PMID 26978244), when applicable, as described here in more detail. The first guideline is to be findable: OpenProt does that via a web-based indexed searchable platform with unique and persistent identifiers for each sequence data (OpenProt thus meets points F1, F2, F3 and F4). The second guideline is to be accessible: OpenProt does that via an open, free, release-based web platform so that data access is persistent through time (OpenProt thus meets points A1, A1.1 and A2). The third guideline is to be interoperable: OpenProt does that via downloadable formal file formats and vocabularies which ease the use of OpenProt data with other software (OpenProt thus meets points I1, I2 and I3). The fourth and final guideline is to be reusable: OpenProt does that via links to outsourced data, richly described metadata with easy access and usage (OpenProt thus meets points R1, R1.1 and R1.3).

Back

Who are the people behind OpenProt?

OpenProt is the result of interdisciplinary collaborations between experts in the fields of proteogenomics, bioinformatics, functional proteomics, molecular evolution and cellular biology. All current and past members are listed on our About page.

Back

Mass Spectrometry analyses related questions

How is the MS score calculated?

The Mass spectrometry score (MS score) is the sum of unique peptides identified in MS-studies reanalyzed by OpenProt.

In order to be annotated as detected by MS, several criteria have to be met:

(1) Only unassigned peptides from MS data can be matched to AltProts. If a peptide matches both a RefProt and an AltProt, the corresponding Peptide Spectrum Match (PSM) will be assigned to the RefProt only. For more information on the MS annotation pipeline enforced by OpenProt, see below. For more information on peptide assignation rules, click here.

(2) An AltProt or Isoform must have been seen with at least one unique peptide.

As such, if an AltProt is detected by one unique peptide in three different MS datasets, and two different unique peptides in one other MS study, the corresponding MS score is 5 (1+1+1+2).
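The worked example above can be reproduced with a few lines of Python. This is a sketch of the arithmetic only, not OpenProt's code; the dataset names and peptide sequences are made up, and the function assumes its input already contains only the peptides that passed the assignment rules.

```python
def ms_score(unique_peptides_by_dataset):
    """Sum, over datasets, the number of distinct unique peptides detected."""
    return sum(len(set(peptides))
               for peptides in unique_peptides_by_dataset.values())


# Made-up data mirroring the example: one unique peptide in three datasets,
# and two different unique peptides in a fourth study.
detections = {
    "dataset_1": ["PEPTIDEA"],
    "dataset_2": ["PEPTIDEA"],
    "dataset_3": ["PEPTIDEA"],
    "dataset_4": ["PEPTIDEB", "PEPTIDEC"],
}
score = ms_score(detections)  # 1 + 1 + 1 + 2
```

Note that, following the example in the text, a peptide seen in several datasets contributes once per dataset.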

Back

What are the MS coverage statistics?

OpenProt now reports MS coverage statistics for each protein. The table (1) is on the MS tab above the MS detection summary (2). The table contains the number of unique datasets in which the protein was detected (1), alongside the number of unique peptides across all datasets (2) and the total number of PSMs (peptide-spectrum matches) supporting the annotation across all datasets (3). Furthermore, OpenProt also reports the theoretically detectable sequence coverage (possible sequence coverage, 4) and the detected sequence coverage (5).

Back

I am interested in a protein annotated in OpenProt, but it has an MS score of 0

With the particular peptide assignment rules enforced by OpenProt to ensure the detection of novel proteins is only supported by unique peptides, one must realise some novel proteins (AltProts or Novel Isoforms) may not be detectable by MS. As a matter of fact, in Human 44,350 proteins annotated in OpenProt would not be detectable by MS. These would then have a possible sequence coverage of 0. We encourage the users to interpret the MS score in the light of the possible sequence coverage for each protein.

Back

How does OpenProt identify AltProts in MS-based proteomic analyses?

OpenProt retrieves MS raw data files from ProteomeXchange and collaborators. Each MS data file is then analysed using the same pipeline, and a stringent FDR of 0.001% is enforced. Traditionally, a 1% FDR is used; we chose a more stringent FDR to focus on high quality AltProt identifications only. The decoy database contains the reversed sequences of proteins from the target database. All peptide sequences identified for a protein (RefProt, Isoform or AltProt) can be seen under the details tab.
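To make the target-decoy idea concrete, the Python sketch below builds reversed-sequence decoys and applies a simple FDR estimate. It is an illustration under stated assumptions, not the OpenProt pipeline: the `REV_` prefix is hypothetical, and the decoy/target ratio is a common way of estimating FDR that may differ from what the search software actually computes.

```python
MAX_FDR = 0.001 / 100  # 0.001 % expressed as a fraction


def make_decoys(targets, prefix="REV_"):
    """Return a decoy database: each target sequence reversed.

    targets maps accession -> amino acid sequence; the prefix used to tag
    decoy accessions is an assumption for this sketch.
    """
    return {prefix + accession: sequence[::-1]
            for accession, sequence in targets.items()}


def estimated_fdr(decoy_hits, target_hits):
    """Estimate the FDR as decoy hits divided by target hits."""
    if target_hits <= 0:
        raise ValueError("target_hits must be positive")
    return decoy_hits / target_hits
```

With this estimate, one decoy hit among 200,000 target hits stays under the 0.001 % threshold, while one decoy hit among 100 target hits corresponds to the traditional 1 %.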

Then you can click on the Mass Spectrometry tab to display details about evidence of expression from MS datasets.

In the first column, you will find the name of the study analysed (1), along with a link and the reference. For each study, you will find in the second column the list of identified peptides (2), and the third column indicates the number of peptide spectrum matches (3). Finally, next to the tab title (4), the number between brackets refers to the overall MS score, as explained here.
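To make the reversed-sequence decoy database mentioned in this answer more concrete, here is a minimal target-decoy sketch in Python. It is not the OpenProt pipeline: the `DECOY_` prefix, the score scale, the accession and the simple decoys/targets ratio are assumptions chosen for illustration, and real search engines and validation tools estimate the FDR in more elaborate ways.

```python
def reversed_decoys(targets, prefix="DECOY_"):
    """Build a decoy database by reversing every target sequence.

    `targets` maps accession -> protein sequence. The prefix is a
    hypothetical naming convention, not OpenProt's.
    """
    return {prefix + accession: sequence[::-1]
            for accession, sequence in targets.items()}


def estimated_fdr(psms, threshold):
    """Estimate the FDR among PSMs scoring at or above `threshold`.

    `psms` is an iterable of (score, is_decoy) pairs. The estimate is the
    number of decoy matches divided by the number of target matches.
    """
    n_targets = 0
    n_decoys = 0
    for score, is_decoy in psms:
        if score >= threshold:
            if is_decoy:
                n_decoys += 1
            else:
                n_targets += 1
    return n_decoys / n_targets if n_targets else 0.0


print(reversed_decoys({"IP_000001": "MKTAY"}))

psms = [(95, False), (90, False), (88, True), (70, False), (60, True)]
print(estimated_fdr(psms, threshold=80))  # 1 decoy / 2 targets = 0.5
print(estimated_fdr(psms, threshold=92))  # 0 decoys / 1 target = 0.0
```

Raising the score threshold lowers the estimated FDR at the cost of fewer accepted identifications, which is the trade-off behind choosing a very stringent FDR.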

Back

How is the increase in the database search space accounted for?

Adding all predicted AltProts and Novel Isoforms to the database corresponds to an additional 518,957 entries in human alone, leading to a substantial increase in the search space. Thus, OpenProt enforces a stringent FDR of 0.001 % (traditionally set at 1 %), in order to focus on highly confident identifications of AltProts and Novel Isoforms, and to account for the increased size of the database. This strategy was designed with the kind help of the PeptideShaker developers. Initial validations included: (a) a minimum of 80% overlap of RefProt identifications with the original MS study; and (b) a manual validation of randomly selected spectra.

Back

How does OpenProt deal with comparability and reliability across MS datasets?

The OpenProt pipeline retrieves publicly available bottom-up MS/MS datasets, mostly from PRIDE and ProteomeXchange. This ensures that datasets have been run through the PRIDE Inspector to assess data quality (PMID 22318026) and that they follow the ProteomeXchange consortium guidelines (PMID 27924013). Moreover, for each retrieved dataset, parameters are validated through text-mining and manual curation. OpenProt thus makes sure that it analyzes data from high quality mass spectrometers, and that it does so with the appropriate parameters. Although this step is time consuming, it guarantees higher data quality and reliability on OpenProt.

Back

 

I have some MS datasets I would like to re-analyze using OpenProt, what should I do?

All data on OpenProt is freely available and downloadable. You can thus download custom fasta files to analyze your MS datasets. For more information on downloads, click here. For more information on which database to download, click here.

Back

I have some MS datasets I would like to analyze using OpenProt, but I don’t know how or don’t have the computational resources. What can I do?

OpenProt now offers a data submission service! Your dataset has to be publicly available on the PRIDE Archive with a unique PXD accession. Enter your PXD accession in the data submission form and fill it with adequate parameters. OpenProt will review the parameter settings and contact you with the results within 7 to 10 days. Please note that the results will be incorporated in the next OpenProt release. For more information on how to submit a dataset, click here. If you have additional questions or requests regarding data submission, you can contact us here.

Back

I want to submit an MS dataset, but OpenProt tells me the study has already been submitted

Thank you for using the data submission platform of OpenProt! If after entering your PXD accession, OpenProt displays this pop-up (here, PXD001874 is used as an example):

It means your dataset has already been analyzed with OpenProt, and you can see all the proteins (RefProts, Novel Isoforms and AltProts) detected in your dataset by clicking on the displayed link. This link will redirect you to the Search page of OpenProt, filtered for your dataset (red arrow).

The total number of proteins identified across the entire dataset is indicated at the top (3,073 proteins here). As usual, the results can be ordered differently (1), the columns displayed can be customized (2), and the results can be downloaded as a TSV file (3) or a FASTA file (4). Finally, this output can be shared using the Share button (5).

However, if the total number of proteins displayed is 0 (circled red below - example of PXD011929 here):

This means this dataset has recently been analyzed by OpenProt but has not yet been released on the website. If you wish to access the results from such a dataset without waiting for the release, you can contact us here.

Back

I have MS datasets that I would like to share with OpenProt, how can I do so?

We are constantly adding datasets to the OpenProt database. If you have some you would be willing to share with us, you can contact us here or you can use our novel data submission platform (click here for a tutorial)!

Back

I have an RNA-seq dataset and would like to download a custom fasta with OpenProt, how can I do so?

OpenProt allows the download of custom databases, for example to couple MS studies with RNA-seq experiments. On the Search page, enter your list of transcripts (1), or your list of genes (1), and click on search (2). This will display your table of results, which you can then download as a fasta file (3).

For more information on the fasta header, click here.

Nota bene: For the moment (and for computational reasons), there is a limit of 2,000 gene / transcript entries at a time. However, you can download your results as several fasta files and then concatenate them, or download all sequences in the Downloads section and filter by your genes / transcripts of interest.
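As a rough sketch of the workaround described in this note, the Python snippet below splits a long identifier list into batches of at most 2,000 entries and concatenates the fasta files downloaded for each batch. The function names and file names are hypothetical; the only assumptions made about the files are that header lines start with ">" and that a record present in two batches carries the same header line in both.

```python
MAX_ENTRIES = 2000  # limit per query, as stated in the note


def chunk(identifiers, size=MAX_ENTRIES):
    """Split a list of gene or transcript identifiers into batches."""
    return [identifiers[i:i + size] for i in range(0, len(identifiers), size)]


def concatenate_fasta(part_paths, out_path):
    """Concatenate fasta files, writing each header (and its sequence) once."""
    seen_headers = set()
    with open(out_path, "w") as out:
        for part_path in part_paths:
            keep = False  # nothing is written before the first header line
            with open(part_path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        header = line.rstrip("\n")
                        keep = header not in seen_headers
                        seen_headers.add(header)
                    if keep:
                        out.write(line if line.endswith("\n") else line + "\n")


# 4,500 identifiers need three separate searches.
batches = chunk([f"GENE{i}" for i in range(4500)])
print([len(batch) for batch in batches])  # [2000, 2000, 500]
```

After downloading one fasta file per batch, a call such as `concatenate_fasta(["batch_1.fasta", "batch_2.fasta", "batch_3.fasta"], "custom_db.fasta")` would produce a single database (file names are placeholders).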

Back

Where can I find a list of MS experiments re-analyzed by OpenProt?

A full list of mass spectrometry studies added to the OpenProt database can be found here.

Back

Where can I find a link to a specific study my protein of interest was detected in?

On the Details page of a specific protein, you can click on the Mass Spectrometry tab (1). The first column (2) contains the details about each specific study. It notably contains a link to the data (3) and the PubMed ID of the related publication (4).

Back

How can I find all proteins detected in one MS dataset on OpenProt?

OpenProt offers an advanced search filter on its Search page (click here for a tutorial). Enter the name of the study and click on update search. Please note that for computational reasons, only one dataset at a time can be queried at the moment. For more information on MS detection for novel proteins, click here.

Back

Peptide assignment across isoforms: what are the rules?

In the case of two possible assignations on different genes, the peptide is unassigned.

In the case of two possible assignations from the same gene, the rules are detailed below for every encountered combination:

  1. If a RefProt is amongst the possible assignations, the peptide will always be assigned to the RefProt.
  2. Assignation possible to two RefProts, the peptide is assigned to both.
  3. Assignation possible to two Novel Isoforms (II_), the peptide is assigned to both.
  4. Assignation possible to two AltProts (IP_), the peptide is assigned to both.
  5. Assignation possible to a Novel Isoform (II_) and an AltProt (IP_), the peptide is assigned to both.

Therefore, when a Novel Isoform (II_) or AltProt (IP_) is detected by MS, it is necessarily with a specific peptide that doesn’t match the associated RefProt or any other RefProt. Nota bene: rules 1 to 5 only apply when assignations refer to different proteins from the same gene. If it is a different gene, the peptide is unassigned.
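The assignation rules listed in this answer can be summarised as a small decision function. The Python sketch below is an interpretation of those rules, not OpenProt code: the tuple layout, the gene names and the accessions are hypothetical, and applying the same logic to more than two candidate proteins is an assumption.

```python
def assign_peptide(candidates):
    """Return the accessions a peptide is assigned to.

    `candidates` is a list of (gene, accession, protein_type) tuples for
    every protein the peptide could match, with protein_type one of
    "RefProt", "Isoform" or "AltProt".
    """
    genes = {gene for gene, _, _ in candidates}
    if len(genes) != 1:
        return []  # candidates on different genes: the peptide is unassigned

    refprots = [acc for _, acc, ptype in candidates if ptype == "RefProt"]
    if refprots:
        return sorted(refprots)  # RefProts always win (rules 1 and 2)

    # Only Novel Isoforms and/or AltProts remain: assigned to all (rules 3-5).
    return sorted(acc for _, acc, _ in candidates)


# RefProt versus AltProt from the same gene: the RefProt takes the peptide.
print(assign_peptide([("GENE1", "NP_000001", "RefProt"),
                      ("GENE1", "IP_000001", "AltProt")]))
# Novel Isoform and AltProt from the same gene: assigned to both.
print(assign_peptide([("GENE1", "II_000001", "Isoform"),
                      ("GENE1", "IP_000001", "AltProt")]))
# Two AltProts from different genes: unassigned.
print(assign_peptide([("GENE1", "IP_000001", "AltProt"),
                      ("GENE2", "IP_000002", "AltProt")]))
```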

Back

Can proteins coded by RNAs currently annotated as non-coding (pseudogenes RNAs and lncRNAs) be detected?

OpenProt annotates all ORFs (starting with an ATG and longer than 30 codons) and the corresponding AltProts in coding and non-coding RNAs. If non-coding derived AltProts have characteristics that allow detection by MS, then they can be detected. For more information on MS detection for novel proteins, click here.
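For readers who want to see what such an ORF definition looks like in practice, here is a minimal Python sketch that scans the three forward frames of a transcript for ATG-initiated ORFs. It is an illustration only: the function name is hypothetical, the threshold is a parameter because whether the limit is inclusive (and whether the stop codon counts) is an assumption here, and reporting only the most upstream ATG for each stop codon is also an assumption rather than a description of the OpenProt pipeline.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(transcript, min_codons=30):
    """Return (start, end, frame) for ATG-initiated ORFs on the given strand.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    An ORF is kept when it has at least `min_codons` sense codons
    (ATG included, stop excluded).
    """
    seq = transcript.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG of the currently open ORF
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                n_codons = (pos - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # close the ORF and look for the next ATG
    return orfs


# A toy transcript: 2 nt leader, ATG, 30 alanine codons, stop, 2 nt trailer.
toy = "CC" + "ATG" + "GCT" * 30 + "TAA" + "GG"
print(find_orfs(toy))  # [(2, 98, 2)]
```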

Back

MS detection truths: recommendations

One should always remember that the vast majority of MS datasets re-analyzed by OpenProt use trypsin digestion. Thus, some proteins may not be detectable under these conditions. An absence of detection does not mean the protein does not exist.

Moreover, AltProts are mostly small proteins (median length of 45 amino acids), which decreases the likelihood of detectable, sufficiently long, unique peptides. Furthermore, some protocols may favor large and abundant proteins, whereas the detection of small proteins might require specific protocols (see Ma et al., Anal Chem, 2016).

Finally, Novel Isoforms or AltProts might not be identifiable using the OpenProt pipeline if no tryptic peptide permits a unique assignation. Indeed, when a peptide can also be assigned to a RefProt, it will always be assigned to the RefProt (see peptide assignation rules).
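Whether a short protein has any tryptic peptide that could support a unique assignation can be checked with a simple in silico digestion. The Python sketch below uses the common trypsin rule (cleavage after K or R, except before P) with no missed cleavages; the minimum peptide length of 7 residues, the function names and the toy sequences are assumptions made for illustration, not OpenProt parameters.

```python
import re


def tryptic_peptides(protein, min_length=7):
    """Digest after K or R (not before P), no missed cleavages.

    Peptides shorter than `min_length` are dropped, as they are rarely
    identifiable (the cutoff is an assumption).
    """
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return {piece for piece in pieces if len(piece) >= min_length}


def unique_peptides(novel, references, min_length=7):
    """Tryptic peptides of `novel` that no reference protein also yields."""
    shared = set()
    for reference in references:
        shared |= tryptic_peptides(reference, min_length)
    return tryptic_peptides(novel, min_length) - shared


refprot = "MKAAAAAAAKGGGGGGGR"
altprot = "MKAAAAAAAKTTTTTTTR"
print(unique_peptides(altprot, [refprot]))  # {'TTTTTTTR'}
# No unique peptide left: such a protein could not be detected by MS.
print(unique_peptides("MKAAAAAAAK", [refprot]))  # set()
```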

Back

Ribosome Profiling related questions

What do TE and TIS mean?

TE stands for Translation Event, whereas TIS stands for Translation Initiation Site.

Back

What is the TE score?

The TE score displayed on the search results page and on the Translation tab corresponds to the number of studies in which a significant identification was made. For more information on translation events identification, see below.

Back

What does PRICE do?

PRICE is an entropy-based model for the identification of translated Open Reading Frames (ORFs) from ribosome profiling datasets. It stands for PRobabilistic Inference of Codon activities by an EM algorithm (see PMID 29529017). PRICE uses parameters inferred from well-translated, annotated ORFs to model the stochastic events in ribosome profiling. In brief, a given codon in a ribosomal P site can produce several footprints; PRICE uses maximum likelihood algorithms to reconstitute the set of codons most likely to give the observed reads. This set of codons is then assembled into ORF candidates, where a machine-learning algorithm predicts the start codon. Detected ORFs are then filtered according to a stringent FDR of 1% (traditionally set at 10%) to focus on highly confident translation events. For more information on the PRICE algorithm, see PMID 29529017.

Back

How are translation event detections displayed on OpenProt?

PRICE results are cross-referenced against the OpenProt database. A summary of the results can be visualized in the TE (Translation Event) column of the main results table (1). The detailed results can be seen by clicking on the details tab.

From the details page, you can click on the Translation evidence tab.

Under the Translation tab, you will find a table with all the studies in which this ORF has been detected (1). The detections will be separated based on the annotations (1).

The third column corresponds to the genomic coordinates of the detected ORF (2), followed by the start codon and type of ORF (3). The p-value corresponds to the ORF identification confidence (4, for more information on the p-value, click here). The samples column (5) lists all samples within an experiment with the associated read count for this ORF. The last column (6) corresponds to the transcript and associated protein accessions, followed by the overlap of the PRICE predicted ORF sequence with the OpenProt predicted ORF sequence.

Back

How should I interpret the p-value?

The p-value associated with an ORF detection corresponds to the significance of a generalized binomial test (not corrected for multiple comparisons). In brief, it indicates the confidence that the ORF is not attributable to noise. In ribosome profiling experiments, noise can arise from (1) ribosomal scanning, (2) abortive translation events in the leader region, (3) non-ribosome mediated mRNA protection from RNases, or (4) overlapping ORFs. Nota bene: this p-value indicates the confidence of the ORF identification, not the confidence of its detection, which would be represented by the enforced 1% FDR (see above).

Back

How are multi-mapped reads accounted for?

We run PRICE using the “rescue” mode. This means that if a footprint maps to several places in the genome, it is discarded unless uniquely mapped reads are found near one of the possible genomic loci, in which case it is rescued.

Back

Where can I find a list of ribosome profiling studies analyzed by OpenProt?

A list of ribosome profiling studies analyzed by OpenProt can be found here.

Back

Where can I find a link to a specific study my protein of interest was detected in?

Under the Details page of a specific protein, you can click on the Translation tab (1). The first column (2) contains the details about each specific study. The name of the study (3) is a link to the original study.

Back

How can I find all proteins detected in one Ribo-seq dataset on OpenProt?

OpenProt offers an advanced search filter on its Search page (click here for a tutorial). Enter the name of the study you wish to query and click on update search. Please note that for computational reasons, only one dataset at a time can be queried at the moment. For more information on detection of novel proteins by ribosome profiling, click here.

Back

Can the translation of pseudogenes be detected?

A pseudogene is by definition an imperfect copy of a functional gene. Thus, pseudogenes share a high degree of homology with their related genes, and this may hinder their detection by ribosome profiling. Indeed, since footprints are short fragments, most of them will multimap to both the gene and the pseudogene. OpenProt enforces a pipeline that may hinder pseudogene detection but that favours highly confident annotations. Thus, a pseudogene may be detected with our pipeline only if unique footprints can be seen.

Back

I have some ribosome profiling data I would like to analyze using OpenProt, but I don’t know how or don’t have the computational resources. What can I do?

OpenProt now offers a data submission service! Your dataset has to be publicly available on the Gene Expression Omnibus (GEO) repository with a unique GSE accession. Enter your GSE accession in the data submission form and fill it with adequate parameters. OpenProt will review the parameter settings and contact you with the results within 7 to 10 days. Please note that the results will be incorporated in the next OpenProt release. For more information on how to submit a dataset, click here. If you have additional questions or requests regarding data submission, you can contact us here.

Back

I have ribosome profiling datasets that I would like to share with OpenProt, how can I do so?

We are constantly adding datasets to the OpenProt database. If you have some you would be willing to share with us, you can contact us here or you can use our novel data submission platform (click here for a tutorial)!

Back

I want to submit a Ribo-seq dataset, but OpenProt tells me the study has already been submitted

Thank you for using the data submission platform of OpenProt! If after entering your GSE accession, OpenProt displays this pop-up (here, GSE131112 is used as an example):

It means your dataset has already been analyzed with OpenProt, and you can see all the proteins (RefProts, Novel Isoforms and AltProts) detected in your dataset by clicking on the displayed link. This link will redirect you to the Search page of OpenProt, filtered for your dataset (red arrow).

The total number of proteins identified across the entire dataset is indicated at the top (15,112 proteins here). As usual, the results can be ordered differently (1), the columns displayed can be customized (2), and the results can be downloaded as a TSV file (3) or a FASTA file (4). Finally, this output can be shared using the Share button (5).

However, if the total number of proteins displayed is 0 (circled red below - example of GSE144682 here):

This means this dataset has recently been analyzed by OpenProt but has not yet been released on the website. If you wish to access the results from such a dataset without waiting for the release, you can contact us here.

Back

TE (Translation Events) detection truths: recommendations

The OpenProt pipeline for ribosome profiling dataset analyses uses the PRICE algorithm. As PRICE fits a model, it may not always fully converge or settle on the same parameters. Therefore, we encourage seeking detections across multiple datasets. As with mass spectrometry data, the more ribosome profiling datasets an ORF has been identified in, the more confident the detection.

In an effort to focus on highly confident translation events, we use a stringent 1 % FDR and a pipeline that may hinder detection of pseudogenes (see above). Furthermore, identifications are dependent on the quality of the study analyzed (signal to noise ratio, and sequencing depth). Some transcripts may not be seen at all in an experiment. Thus, it is important to remember that an absence of detection does not mean an absence of translation.

Back

Conservation related questions

What is an ortholog and a paralog?

An ortholog is a protein sequence from one species that shares a high degree of homology with a protein sequence from another species. Two orthologous proteins are two similar proteins from different species. Thus, orthologs derive from a common ancestral gene and diverged through a speciation event.

A paralog is a protein sequence from a species that shares a high degree of homology with a protein sequence from a different gene within the same species. Two paralogous proteins are two similar proteins from different genes within one species. Thus, paralogs originate from a duplication event, which creates a “copy” of an existing gene.

Back

What is the InParanoid approach?

The InParanoid algorithm (PMID 25429972) aims to identify ortholog and paralog groups. The algorithm consists of an all-vs-all Basic Local Alignment Search Tool (BLAST) comparison of all protein sequences in two species. For example, all proteins from Homo sapiens are BLAST searched against all proteins from Pan troglodytes. Several types of orthology can be identified, all included in OpenProt: one-to-one corresponds to a pairwise best reciprocal hit (BRH); one-to-many corresponds to all orthologs to one query protein; many-to-one corresponds to all queries matching to one ortholog; and many-to-many corresponds to all orthologs to all queries. Secondly, the same can be done within one species to identify paralogs. OpenProt uses a significance filter of a bitscore of 40 with an overlap of over 50 % of the query sequence, as previously published (Samandi et al., eLife, 2017). For more information on the InParanoid algorithm, see PMID 25429972.
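The significance filter and the one-to-one case (best reciprocal hit) can be illustrated with a short Python sketch. It is not InParanoid, which does considerably more (notably the clustering of paralogs into ortholog groups), and it is not the OpenProt implementation. The field names follow the BLAST tabular specifiers (qseqid, sseqid, bitscore, qcovs); the accessions and scores are made up, and treating the bitscore threshold as inclusive is an assumption.

```python
MIN_BITSCORE = 40.0        # threshold quoted in the text (inclusive: assumption)
MIN_QUERY_COVERAGE = 50.0  # percent of the query sequence, strictly exceeded


def passes_filter(hit):
    """Apply the bitscore and query coverage filter to one BLAST hit."""
    return hit["bitscore"] >= MIN_BITSCORE and hit["qcovs"] > MIN_QUERY_COVERAGE


def best_hits(hits):
    """Map each query to its highest-scoring subject among filtered hits."""
    best = {}
    for hit in filter(passes_filter, hits):
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return {query: hit["sseqid"] for query, hit in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Pairs (a, b) where b is the best hit of a AND a is the best hit of b."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Hypothetical hits: species 1 proteins against species 2, then the reverse.
forward = [
    {"qseqid": "IP_H1", "sseqid": "IP_R1", "bitscore": 85.0, "qcovs": 90.0},
    {"qseqid": "IP_H1", "sseqid": "IP_R2", "bitscore": 42.0, "qcovs": 60.0},
    {"qseqid": "IP_H2", "sseqid": "IP_R1", "bitscore": 38.0, "qcovs": 95.0},
]
reverse = [
    {"qseqid": "IP_R1", "sseqid": "IP_H1", "bitscore": 83.0, "qcovs": 88.0},
    {"qseqid": "IP_R2", "sseqid": "IP_H2", "bitscore": 45.0, "qcovs": 40.0},
]
print(reciprocal_best_hits(forward, reverse))  # [('IP_H1', 'IP_R1')]
```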

Back

How can I see if a protein is conserved?

In your search results, you can see which of the 10 species currently supported by OpenProt contain orthologs (1). Species are abbreviated using a two-letter code made of the first letters of the genus and species names (for example, Rattus norvegicus is abbreviated RN). The darker the blue colour, the more similar the protein sequence of the ortholog in that species. All details for identified orthologs and paralogs can be found under the details tab.

Then, you can click on the Conservation tab, which displays orthologs and paralogs. The number on the Conservation tab corresponds to the number of species with at least one identified ortholog, out of the 10 species currently supported by OpenProt.

The tree of orthologs and paralogs is then displayed. Orthologs and paralogs are separated into two trees (yellow node, 1). Orthologs for each species can be displayed or hidden by clicking on the species node (2). The size of the nodes relates to the number of orthologs, whereas the colour relates to the homology (identity percentage, 3).

Details for each identified ortholog and paralog can be displayed by clicking on the accession key.

Both the BLAST (1) and reciprocal BLAST (inverted query species, 2) results are displayed. Of particular interest, the bitscore (3) and the query sequence coverage (qcovs, 4) are displayed. The identity percentage is indicated as well (pident, 5). Finally, by clicking on the blue marked accession of the identified ortholog in the pop-up window, one can directly access the details tab of that specific protein (in the above example, IP_1296106 in Rattus norvegicus).

Back

I found a novel protein but it is weakly conserved, is it a random ORF?

A lack of conservation does not necessarily mean the ORF is random. Several studies showed that de novo genes arise from short ORFs (PMID 29556078) that are not conserved across species. Furthermore, transcriptome annotation is more thorough in human than in other species. A sequence can therefore appear weakly conserved simply because transcriptome annotations are poorer in other species.

Back

Conservation analysis truths: recommendations

AltProts have a median length of 45 amino acids, much shorter than the 460 amino acids of RefProts. This should be kept in mind when looking at orthology, and it directed our choice of threshold (see above). An ortholog may pass the filter when the homology relies only on a conserved functional domain. However, such cases would always have a bitscore close to the threshold of 40. That is why we always encourage users to look at the protein sequences, their alignment and scores. These can be found by clicking on the accession key of identified orthologs and/or paralogs (see above).

OpenProt currently does not account for outparalogs. For example, should a gene undergo a duplication event in a distant species, all protein sequences derived from the original and the “copy” genes will be identified as orthologs from another species when they actually make up separate ortholog groups. That is why we encourage careful exploration of OpenProt conservation data for each candidate of interest.

Back


Downloads guidelines

How to download protein sequence databases (FASTA format) for MS users.

From any OpenProt page, including the home page, click Downloads.

Once you have clicked on Downloads, you should first select an OpenProt release. The most recent is the default option.

You can then tune the database you would like to download:

  1. Select a species. Available species so far are: Homo sapiens, Mus musculus, Rattus norvegicus, Pan troglodytes, Danio rerio, Drosophila melanogaster, Caenorhabditis elegans, Bos taurus, Saccharomyces cerevisiae S288c, and Ovis aries.
  2. Select an assembly. The most recent for each species is input by default upon species selection.
  3. Select the desired protein type. You can choose whether you would like to download RefProts only, or AltProts and Isoforms only, or if you would like to download all, RefProts, AltProts and Isoforms.

Once you have selected the protein type you desire, a result table will already appear, but you can refine it further.

  1. Select an annotation. OpenProt supports both Ensembl and NCBI RefSeq annotations for all species. If you would like to have more information on supported annotations, please click here.
  2. Select the desired level of supporting evidence. You will be given this choice if you chose as protein type one that includes AltProts and Isoforms. You have 3 options: “all predicted”, “detected with at least one unique peptide”, or “detected with at least two unique peptides”. This choice refers to the level of supporting MS evidence annotated in OpenProt database. If you are unsure which database would suit you best, you can read more here for recommendations on which one to choose.

The results of your download query are grouped in a table. The first column (1) indicates the annotation used, and the next two refer to your search criteria regarding supporting evidence and protein type. Several file types are available for download (2 - TSV, FASTA (protein), FASTA (DNA), or BED). Finally, each file is accompanied by a readme that gathers all the information needed to understand it (3): headers, parse rules and file naming scheme.

Once you have selected the file you wish to download, a pop-up table containing the downloadable file becomes visible. You can download the database by clicking on its name, or from the readme pop-up.
