Open Proteome Database

OpenProt Help

Getting Started - Search, click here.

Getting Started - Browse, click here.

Getting Started - Data Submission Platform, click here.

FAQ Contents, click here.

Downloads guidelines, click here.

Getting Started - Search

If you want to know if a specific gene contains AltORFs (all predicted and those with evidence of expression), click Search.

You will then be redirected towards a query page:

You can input your search criteria as follows:

Select a species (default is Homo sapiens).
Select an assembly (default is the most recent in each species).
Select an annotation (default is Ensembl+RefSeq). Several annotations are used by OpenProt to predict AltProts. Ensembl, NCBI RefSeq and combined Ensembl+RefSeq annotations are available for all species. If you want to know why OpenProt supports multiple annotations, you can click here.
Enter the name of your gene of interest.

Alternatively, you can search by transcript or protein accessions (5 and 6 respectively). Both Ensembl and RefSeq accession IDs are accepted. Proteins may be searched on one or more specific transcripts. Similarly, one or more proteins can be searched for simultaneously.

Below is an example for the COL1A1 gene. Once you have entered your gene name and launched the search, your results will appear below. The number of proteins matching your search criteria is indicated at the top (1) of your results table (2).

You can then refine your search results by playing with the options in the dropdown menu or by selecting the Advanced Search option.

Tick to search for or display only proteins (RefProts, AltProts and Isoforms) that have been detected by mass spectrometry (MS) and/or for which translation events (TE) have been identified in ribosome profiling studies.
Tick to search for or display only proteins that have been detected by MS. For a list of MS studies reanalysed by OpenProt, click here.
Tick to search for or display only proteins that have been detected by ribosome profiling. For a list of ribosome profiling studies reanalysed by OpenProt, click here.
Tick to search for or display only proteins with predicted domains by InterProScan.
Tick to search for or display only AltProts.
Tick to search for or display only Isoforms.

Any of the above can be combined as you wish. An advanced search is also available by clicking Advanced Search.

Filter by a specific amino acid sequence.
Filter according to the transcript type (mRNA or ncRNA) or the localization of AltORFs in transcripts. Within mRNAs, the localization of an AltORF is defined by the position of its predicted start codon with respect to the annotated CDS start codon. The localization of AltORFs within non-coding RNAs is labeled “-”. There are three possible localizations of AltORFs within mRNAs: “5’UTR”, “CDS” and “3’UTR”. Thus, the dropdown menu offers five choices: “5’UTR”, “3’UTR”, “CDS”, “ncRNA”, or “mRNA”.
Filter AltORFs in a specific reading frame (+1, +2 or +3). The reading frame is determined with respect to the first nucleotide of each transcript (+1 reading frame).
Filter by dataset identifier. This is a dropdown menu containing all datasets currently in OpenProt. Select one to mine proteins detected in this dataset. Please note that the filter supports only one study at a time.
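
The localization and reading-frame rules described above can be sketched in a few lines of Python. This is an illustration only, not OpenProt's implementation: the function names, the 0-based coordinates and the boundary handling (a start codon located at or after the end of the CDS is called “3'UTR”) are assumptions made for the example.

```python
from typing import Optional


def reading_frame(orf_start: int) -> int:
    """Return the frame (+1, +2 or +3) of an ORF whose start codon begins at
    the 0-based transcript position `orf_start`. The first nucleotide of the
    transcript (position 0) defines the +1 frame."""
    return orf_start % 3 + 1


def altorf_localization(orf_start: int,
                        cds_start: Optional[int],
                        cds_end: Optional[int]) -> str:
    """Classify an AltORF by where its start codon falls relative to the
    annotated CDS.

    `cds_start` is the 0-based position of the annotated CDS start codon and
    `cds_end` the position just past the CDS stop codon (assumed convention).
    Both are None for non-coding RNAs, which are labeled "-".
    """
    if cds_start is None or cds_end is None:
        return "-"          # ncRNA: no annotated CDS to compare against
    if orf_start < cds_start:
        return "5'UTR"      # start codon upstream of the CDS start codon
    if orf_start < cds_end:
        return "CDS"        # start codon inside the annotated CDS
    return "3'UTR"          # start codon downstream of the CDS
```

For instance, with a CDS spanning positions 50 to 350, an AltORF starting at position 100 would be reported as “CDS” in the +2 reading frame.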

You can further sort your results by clicking on any option of the Order by dropdown menu (1).

The following sorting options are available: “MS score (desc) / TE (desc) / Domains (desc)” (by default);
“Domains (desc) / MS score (desc) / TE (desc)”;
“TE (desc) / MS score (desc) / Domains (desc)”;
“Molecular Weight (asc) / MS score (desc) / TE (desc) / Domains (desc)”;
“Molecular Weight (desc) / MS score (desc) / TE (desc) / Domains (desc)”;
“Protein Length (asc) / MS score (desc) / TE (desc) / Domains (desc)”;
“Protein Length (desc) / MS score (desc) / TE (desc) / Domains (desc)”.
Control which columns appear in the results table by clicking on Column Settings and deselecting any you don’t want to see.
You can download your results table by clicking on Download as TSV. For more options and information on available downloads, click on Downloads Guidelines.
You can download protein sequences from your results table by clicking on Download as FASTA. For more options and information on available downloads, click on Downloads Guidelines.
You can also share your search by clicking on Share. A pop-up window will display a shareable link.
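
If you prefer to re-sort a downloaded results table yourself, the multi-key orderings listed above can be reproduced with a tuple sort key. The sketch below is a minimal example under stated assumptions: the rows and their keys (`ms_score`, `te`, `domains`, `length`) are hypothetical and do not correspond to the actual column names of the OpenProt TSV export.

```python
# Hypothetical rows; the keys are illustrative and are NOT the column names
# of the OpenProt TSV export.
rows = [
    {"accession": "A", "ms_score": 2, "te": 0, "domains": 1, "length": 120},
    {"accession": "B", "ms_score": 5, "te": 1, "domains": 0, "length": 45},
    {"accession": "C", "ms_score": 2, "te": 3, "domains": 0, "length": 88},
    {"accession": "D", "ms_score": 0, "te": 0, "domains": 2, "length": 45},
]


def sort_default(records):
    """MS score (desc) / TE (desc) / Domains (desc)."""
    # Negating a numeric key sorts it in descending order.
    return sorted(records,
                  key=lambda r: (-r["ms_score"], -r["te"], -r["domains"]))


def sort_by_length_asc(records):
    """Protein Length (asc) / MS score (desc) / TE (desc) / Domains (desc)."""
    return sorted(records,
                  key=lambda r: (r["length"], -r["ms_score"], -r["te"],
                                 -r["domains"]))


print([r["accession"] for r in sort_default(rows)])
print([r["accession"] for r in sort_by_length_asc(rows)])
```

With these example rows, the default ordering yields B, C, A, D and the ascending-length ordering yields B, D, C, A: ties on the first key are broken by the following keys, in the order given in the option name.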

Getting Started - Browse

If you want to browse the genome of a specific species for AltORFs (all predicted and those with evidence of expression), click Browse.

You will then be directed towards a query page.

You can input your search criteria as follows:

Select a species (default is Homo sapiens).
Select an assembly (default is the most recent in each species).
Select an annotation (default is Ensembl). Both Ensembl and NCBI RefSeq annotations are used by OpenProt to predict AltProts, and the browser is available for both. If you want to know why OpenProt supports multiple annotations, you can click here. For more information on how to display both annotations on the browser, click here.
Enter the name of your gene of interest.

Alternatively, you can also search by transcript or protein accessions (5 and 6 respectively). Both Ensembl and RefSeq accession IDs are accepted (depending on the chosen annotation). You can also directly enter genomic coordinates of interest (7).

Below is an example for the COL1A1 gene. Once you have entered your gene name (1) and launched the search (2), your results will appear centered in the browser window.

You can visualize the genomic coordinates (1) and the different tracks. The first track contains transcripts for the chosen annotation (2 - here, Ensembl). The second contains predicted proteins (3). The colour code is indicated below the browser: transcripts are shown in blue, RefProts in green, AltProts in red and novel Isoforms in yellow. You can widen or narrow the browser window (4) and customize your display by adding or removing a track from the registry (5). By default, the registry includes the genome, transcript, protein and peptide detection tracks.

If you scroll down on the genome browser (1), the last track will appear. It contains the peptides detected by MS (1).

Furthermore, you can click on a peptide to display the details associated with this peptide in a pop-up window.

The pop-up window displays the peptide sequence (1), its genomic coordinates (2) and the proteins assigned to that peptide (3). All proteins to which this peptide has been assigned are listed, across both annotations (Ensembl + RefSeq). For more information on peptide assignment rules, click here. The details page of the assigned proteins can be consulted directly by clicking on the goto details link (4).

Such pop-up windows are also displayed when clicking on a protein or a transcript (as shown below).

The transcript-associated pop-up window contains the transcript genomic coordinates (1) and a list of all proteins associated with this transcript (2). Each protein can then be accessed by clicking on the goto details link (3).

Getting Started - Data Submission Platform

From any OpenProt page, including the home page, click Submit study.

Once you have clicked on Submit study, first select the type of data you are submitting: mass spectrometry or ribosome profiling.

***

For mass spectrometry studies, your dataset has to be available in the PRIDE Archive with a public PXD accession.

After entering the PXD accession number, the OpenProt submission platform will retrieve information from the PRIDE repository (here, we use the PXD015644 as an example). Thus, the PMID (1) and citation (2) are automatically filled, as well as the available samples in the dataset (3).

First, enter a contact email that will serve for all future correspondence (1). For example, we will send the results of the analysis to this email address.

Once you have entered your email address, you can start selecting samples (2).

In order to select a sample, click on its name. The blue color indicates the sample is selected (1), the white background indicates the sample is not selected (2).

Once samples are selected, you can click on “Group selected” (1) to add them to one group with identical parameters. To correct an erroneous selection, you can click on “clear selection” (2). If the dataset contains samples that you don’t want to include in the analysis, you can click on “Exclude selection” (3). If you want to select all samples at once, you can click on “select all” (4). Once samples are grouped, they will be removed from the selection panel. If you forgot one sample, you can add it to a pre-formed group by clicking on “add to selected group” (5). Please note that all samples must be grouped in order to submit.

Your grouped samples will then appear in the parameters editing box (1). Please note that your samples should be grouped by parameter settings.
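
The rule that every sample must be grouped before submission can be pictured as a simple set check. The sketch below is illustrative only and does not reflect how the submission platform is implemented; the function names are hypothetical, and it assumes that samples removed with “Exclude selection” no longer need to be grouped.

```python
def ungrouped_samples(all_samples, groups, excluded=()):
    """Return the samples that are neither in a group nor excluded,
    preserving their original order.

    `groups` maps a group name to the list of sample names it contains.
    Assumption: excluded samples do not need to belong to a group.
    """
    assigned = set(excluded)
    for members in groups.values():
        assigned.update(members)
    return [s for s in all_samples if s not in assigned]


def ready_to_submit(all_samples, groups, excluded=()):
    """True when no sample is left waiting in the selection panel."""
    return not ungrouped_samples(all_samples, groups, excluded)


samples = ["s1", "s2", "s3", "s4"]
groups = {"group 1": ["s1", "s2"]}
print(ungrouped_samples(samples, groups, excluded=["s4"]))
```

In this example, sample `s3` is still ungrouped, so the dataset would not yet be ready for submission.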

For each sample within a group, you can edit its fraction and replicate number by clicking on “Edit” (1). Then, for each group, you have to indicate the enzyme (2) used for protein digestion, and the variable (3) and fixed modifications (4) to include in the analysis. These are drop-down menus listing all enzymes and modifications available in our pipeline. For custom enzymes or modifications, please contact us.

The parameters entered can always be changed by clicking the cross next to the selected enzyme or modification.

At the bottom of the page, the next parameters to enter are the species (1), the type of biological sample (2), the fragmentation protocol (3) and the mass spectrometer used (4).

The species, fragmentation and MS instrument are compulsory for submission. The MS instrument is retrieved from the PRIDE directory. The species is a dropdown menu containing all species currently supported by OpenProt. The fragmentation protocol is a dropdown menu with the protocols currently supported by OpenProt. (For more information on the fragmentation protocol, click here).

Once all compulsory parameters have been filled in, you can click on submit. You will receive an email (at the email address indicated at the top of the form) to confirm the submission (please check your spam folder if you don’t receive any email).

***

For ribosome profiling studies, your dataset has to be available in the Gene Expression Omnibus (GEO) archive with a public GSE accession.

After entering the GSE accession number, the OpenProt submission platform will retrieve information from the Gene Expression Omnibus repository (here, we use GSE144682 as an example). Thus, the PMID (1) and citation (2) are automatically filled, as well as the available samples in the dataset (3).
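
Both submission routes start from a public accession: PXD for mass spectrometry and GSE for ribosome profiling. If you handle many accessions, a quick local format check before submitting might look like the sketch below. The patterns are assumptions inferred from the example accessions PXD015644 and GSE144682, the function name is hypothetical, and the check says nothing about whether the dataset is actually public.

```python
import re

# Assumed accession formats, inferred from the examples in this guide.
PXD_PATTERN = re.compile(r"PXD\d{6}")   # e.g. PXD015644
GSE_PATTERN = re.compile(r"GSE\d+")     # e.g. GSE144682


def study_type(accession: str) -> str:
    """Guess the submission type from the accession format.

    Returns "mass spectrometry" or "ribosome profiling", and raises
    ValueError for anything that matches neither pattern.
    """
    acc = accession.strip().upper()
    if PXD_PATTERN.fullmatch(acc):
        return "mass spectrometry"
    if GSE_PATTERN.fullmatch(acc):
        return "ribosome profiling"
    raise ValueError(f"Unrecognized accession: {accession!r}")
```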

First, enter a contact email that will serve for all future correspondence (1). For example, we will send the results of the analysis to this email address.

Once you have entered your email address, you can start selecting samples.

In order to select a sample, click on its name. The blue color indicates the sample is selected (4), the white background indicates the sample is not selected.

Once samples are selected, you can click on “Group selected” (2) to add them to one group with identical parameters. To correct an erroneous selection, you can click on “clear selection” (3). If you want to select all samples at once, you can click on “select all” (4). If the dataset contains samples that you don’t want to include in the analysis, you can click on “Exclude selection” (5). Once samples are grouped, they will be removed from the selection panel. If you forgot one sample, you can add it to a pre-formed group by clicking on “add to selected group” (6). Please note that all samples must be grouped in order to submit.

Your grouped samples will then appear in the parameters editing box (1). Please note that your samples should be grouped by parameter settings.

A sample can always be removed from the group by clicking the cross next to its name.

At the bottom of the page, the next parameters to enter are the species (1), the time of treatment (2), the drug used (3) and the biological type of the sample (4).

The species, time of treatment and drug used are compulsory for submission. The species is a dropdown menu containing all species currently supported by OpenProt. The time of treatment should correspond to when the drug was added during the protocol (if the drug was part of the lysis buffer, select n/a). The drug used is a dropdown menu containing all drugs currently supported by OpenProt.

FAQ contents

What are AltProts (alternative proteins), novel predicted Isoforms and RefProts (reference proteins)?
How does OpenProt differ from other small ORFs databases or UniProt?
How can I find an AltProt encoded in a specific gene?
How do I get the protein sequence of an AltProt?
How do I get the DNA sequence of an AltProt?
I want to detect predicted (or already detected) AltProts and/or Isoforms in MS-based proteomic analyses. Can I download databases in FASTA format?
How to download a FASTA file for MaxQuant?
How do I download the full database of predicted AltProts, or smaller databases with subsets of data based on specific criteria (e.g. experimental evidence)?
Which database should I download?
How can I share with colleagues a specific search on OpenProt?
Why does OpenProt support several annotations (Ensembl and NCBI RefSeq)?
Why does OpenProt use a 3-frames in silico translation instead of a 6-frames?
How can I visualize both annotations in the OpenProt genome browser?
What is the FASTA header?
Why are FASTA files of AltProts and Isoforms not available for download in my species of interest?
Why do I see the same protein sequence annotated either as II_ (Isoform) or IP_ (AltProt)?
I am not sure I understand what a novel predicted Isoform (II_) is.
How can I know why a novel protein is annotated as a novel predicted Isoform (II_)?
How can I see if a novel protein shares sequence similarities with others from the same gene?
Future features and directions for OpenProt?
How does OpenProt follow the FAIR guidelines for database stewardship and management?
Who are the people behind OpenProt?

Mass Spectrometry analyses related questions:

How is the MS score calculated?
What are the MS coverage statistics?
I am interested in a protein annotated in OpenProt, but it has an MS score of 0
How does OpenProt identify AltProts in MS-based proteomic analyses?
How is the increase in the database search space accounted for?
How does OpenProt deal with comparability and reliability across MS datasets?
I have some MS datasets I would like to re-analyze using OpenProt, what should I do?
I have some MS datasets I would like to analyze using OpenProt, but I don’t know how or don’t have the computational resources for, what can I do?
I want to submit a MS dataset, but OpenProt tells me the study has already been submitted.
I have MS datasets that I would like to share with OpenProt, how can I do so?
I have an RNA-seq dataset and would like to download a custom fasta with OpenProt, how can I do so?
Where can I find a list of MS experiments re-analyzed by OpenProt?
Where can I find a link to a specific study my protein of interest was detected in?
How can I find all proteins detected in one MS dataset on OpenProt?
Peptide assignment across isoforms: what are the rules?
Can proteins coded by RNAs currently annotated as non-coding (pseudogenes RNAs and lncRNAs) be detected?
MS detection truths: recommendations

Ribosome Profiling related questions:

What do TE and TIS mean?
What is the TE score?
What does PRICE do?
How are translation event detections displayed on OpenProt?
How should I interpret the p-value?
How are multi-mapped reads accounted for?
Where can I find a list of ribosome profiling studies analyzed by OpenProt?
Where can I find a link to a specific study my protein of interest was detected in?
How can I find all proteins detected in one Ribo-seq dataset on OpenProt?
Can the translation of pseudogenes be detected?
I have some ribosome profiling data I would like to analyze using OpenProt, but I don’t know how or don’t have the computational resources for, what can I do?
I have ribosome profiling datasets that I would like to share with OpenProt, how can I do so?
I want to submit a Ribo-seq dataset, but OpenProt tells me the study has already been submitted
TE (Translation Events) detection truths: recommendations

Conservation related questions:

What is an ortholog and a paralog?
What is the InParanoid approach?
How can I see if a protein is conserved?
I found a novel protein but it is weakly conserved, is it a random ORF?
Conservation analysis truths: recommendations

What are AltProts (alternative proteins), novel predicted Isoforms and RefProts (reference proteins)?

Current genome annotations in eukaryotes rely partly on ORF prediction algorithms, which are reliable only for sequences above a certain length. Consequently, three main criteria are enforced to distinguish true ORFs from random ones: (1) a minimum length of 100 codons; (2) a single CDS per transcript; and (3) the use of an ATG start codon. However, these assumptions lead to a substantial underestimation of the proteomic information encoded within a gene, and hamper the discovery of proteins translated from unannotated ORFs (PMID: 29083303, 28627015, 26578573, 29626080).
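
To make the effect of these criteria concrete, the sketch below scans the three forward reading frames of a transcript for ATG-initiated ORFs and flags which ones would pass a minimum-length cut-off. It is a simplified illustration and not OpenProt's prediction pipeline: the function name, the coordinate convention and the choice to report only the first ATG after each stop codon are assumptions, and the default of 100 codons is simply the conventional threshold mentioned above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(transcript: str, min_codons: int = 100):
    """Scan the three forward frames of `transcript` for ATG...stop ORFs.

    Returns a sorted list of tuples
    (start, end, frame, n_codons, passes_min_length) with 0-based,
    end-exclusive coordinates. `n_codons` counts the ATG but not the stop
    codon. Only the first ATG after each stop codon is reported, and ORFs
    without an in-frame stop codon are ignored.
    """
    seq = transcript.upper()
    orfs = []
    for offset in range(3):                     # frames +1, +2, +3
        start = None
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a new ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                orfs.append((start, i + 3, offset + 1, n_codons,
                             n_codons >= min_codons))
                start = None                    # close the ORF
    return sorted(orfs)
```

Called on a short sequence such as `"CATGCCCTAGG"`, the function finds a single two-codon ORF in the +2 frame, which fails the default 100-codon threshold. Short ORFs of this kind are exactly what a length-based criterion discards.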

Here, in OpenProt, we use different terms to identify proteins based upon their genome annotation status.
- Proteins currently annotated in databases, such as UniProtKB, are translated from canonical CDSs (annotated coding sequence) and are termed reference proteins (or RefProts).

- Alternative ORFs (or AltORFs) are defined as potential protein-coding ORFs located either in non-coding RNAs (e.g. long non-coding RNAs, pseudogene RNAs), or in UTRs or alternative reading frames overlapping the CDS in mRNAs. Predicted proteins translated from AltORFs are termed alternative proteins (or AltProts; IP_). AltProts and RefProts from the same gene are not isoforms: they are encoded by different ORFs and their amino acid sequences are completely different.

- Predicted proteins translated from an alternative ORF (as defined above), but that display either (1) close homology with a reference protein from the same gene or (2) the same start and/or stop codon as the reference protein, and an alignment score above the threshold, are considered novel isoforms of the reference proteins (or Isoforms; II_).

Genome annotations are ever-changing and also rely on manual curation. Once there is enough evidence of expression and function for an AltProt, it is annotated in UniProtKB (via manual curation) and thereby becomes a new RefProt in OpenProt. For example, the human MIEF1 gene encodes two RefProts: the originally annotated MiD51 protein (UniProtKB Q9NQG6), and the recently annotated UniProtKB L0R8F8.

Expression of AltProts demonstrates that an unknown fraction of eukaryotic genes are polycistronic, and that an unknown fraction of RNAs originally annotated as non-coding actually encode small proteins.