Many genes exert their functions as components of protein complexes. It has been observed that direct and indirect protein-protein interactions often lead to similar phenotypes.
Online Mendelian Inheritance in Man (OMIM) [1] is a comprehensive database of human genetic diseases and related genes that has become an authoritative source of disease phenotypes. Text mining of OMIM can be used to analyze human disease phenotypes. Computational approaches that integrate phenotype similarity data with protein-protein interaction data have been developed to prioritize candidate genes [2-4].
However, OMIM is manually curated by disease experts, and phenotypic descriptions are given as free text. Therefore, a standard vocabulary for phenotypic annotation is necessary for extracting clinical manifestations from OMIM records. Currently, three standard vocabularies are used for human phenotypic annotation in text mining of OMIM, i. Similarity scores are calculated between pairs of disease phenotypes, and then disease phenotype networks (DPNs) are created.
The former, called mimMiner [5], is the most widely used for identifying disease genes, and the latter, called resnikHPO in this paper, is the first DPN annotated with HPO, a structured, comprehensive and well-defined set of terms dedicated to describing human phenotypic abnormalities [14].
By contrast, mimMiner is a numerical matrix of similarity scores between human diseases from OMIM and is easy to use. Which DPN is the better phenotype similarity network for prioritizing candidate genes? In this study, we aim to comprehensively compare the performance of mimMiner and resnikHPO on the same dataset, which, to our knowledge, has not been done before.
MimMiner was derived from the work of van Driel et al., who reasoned that the full text and clinical synopsis fields of OMIM records describe genetic disorders, and who used the anatomy and disease sections of MeSH to automatically extract terms from the OMIM records.
Thus, every record generated a feature vector. The similarity between the OMIM records was quantified by calculating the cosine of the angle between normalized vector pairs (Eq 1).
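The cosine measure described above can be illustrated with a minimal sketch. The term weights below are hypothetical, and the zero-vector convention is an assumption made only for this example, not something stated in the text.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: a record with no extracted terms is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term weights for three records over four concepts.
record_a = [1.0, 0.0, 2.0, 0.0]
record_b = [2.0, 0.0, 4.0, 0.0]  # same direction as record_a
record_c = [0.0, 3.0, 0.0, 1.0]  # shares no concept with record_a
```

Because the weights are non-negative, the score stays within [0, 1], which matches the range of the arc weights in mimMiner.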
By performing pairwise comparison of OMIM records, a disease phenotype network containing diseases was obtained, called mimMiner, whose arcs were weighted with a value in the range [0, 1]. ResnikHPO is based on HPO, which uses ontological concepts to represent the clinical attributes of human diseases in the form of a directed acyclic graph. The similarity between disease pairs was calculated using Eq 2. Here, MICA is the most informative common ancestor of two terms t1 and t2 in the ontology.
IC denotes the information content, which is defined as the negative natural logarithm of the frequency with which a term is used to annotate diseases. To compare resnikHPO with mimMiner, four methods were adopted to normalize the ranges, i. Sim_min and Sim_max are the minimum and maximum values in a phenotype similarity matrix. The normalization is implemented using the following formulae: Two suffixes, -R and -C, were used to distinguish the row and column vectors of the asymmetric matrix of resnikHPO.
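To make the MICA and IC definitions concrete, here is a small sketch of term-level Resnik similarity. The toy ontology, the term names and the annotation frequencies are all hypothetical, and the sketch covers only the similarity of two terms, not the disease-level aggregation of Eq 2.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms (a directed acyclic graph).
PARENTS = {
    "root": set(),
    "abnormal_heart": {"root"},
    "abnormal_eye": {"root"},
    "septal_defect": {"abnormal_heart"},
    "valve_defect": {"abnormal_heart"},
}

# Hypothetical fraction of diseases annotated with each term or one of its descendants.
FREQUENCY = {
    "root": 1.0,
    "abnormal_heart": 0.2,
    "abnormal_eye": 0.3,
    "septal_defect": 0.05,
    "valve_defect": 0.1,
}

def ancestors(term):
    """Return the term itself plus every ancestor reachable in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(term):
    """IC = negative natural logarithm of the annotation frequency."""
    return -math.log(FREQUENCY[term])

def mica(t1, t2):
    """Most informative common ancestor: the shared ancestor with the highest IC."""
    common = ancestors(t1) & ancestors(t2)
    return max(common, key=information_content)

def resnik(t1, t2):
    """Term-level Resnik similarity: the IC of the MICA."""
    return information_content(mica(t1, t2))
```

In this toy graph the two heart defects share the ancestor `abnormal_heart`, whereas a heart term and an eye term share only the root, whose IC is zero.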
For example, after Lin normalization, three forms are available, i. The former two forms mean that the disease phenotype similarity data are extracted from the row and column vectors of the asymmetric matrix of resnikHPO, and the last form means that the disease phenotype similarity data are extracted from the symmetric matrix i.
As reported in an earlier study [ 2 ], logistic regression can strengthen the correlation between the similarity of two diseases and their causal genes. Known associations between diseases and genes were extracted from the OMIM database. For each disease gene, the following experiment was conducted. A gene was removed from the set of genes causing the disease. This removed gene was called the target gene, and the remaining genes in the set of genes composed the seed set. The target gene was predicted by prioritizing candidate genes.
Because the same gene or correlated genes can lead to overlapping disease phenotypes, protein-protein interactions (PPIs) can be used to discover new genes [18]. All protein-coding genes in the PPI were also selected as candidate genes for prioritization, and the target gene was predicted according to the ranking. For a polygenic disease, the seed set and phenotypic similarity data were used together to recover the association between the target gene and the disease.
For a monogenic disease, there was no seed set after the unique gene was removed. As a result, the disease-gene relationship was reconstructed mainly from phenotypic similarity. The disease-gene associations were extracted from the OMIM database, while the protein gene networks were obtained from the Human Protein Reference Database (HPRD) [19], which contains unique interactions between proteins.
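The leave-one-out procedure described above can be sketched as follows. The disease and gene names are hypothetical, and the function only produces the target/seed splits; the ranking step itself is not shown.

```python
def leave_one_out_splits(disease_genes):
    """Yield (disease, target_gene, seed_set) for every known association.

    disease_genes maps a disease to the set of genes known to cause it.
    """
    for disease, genes in disease_genes.items():
        for target in sorted(genes):
            # Remove the target gene; the remaining genes form the seed set.
            yield disease, target, genes - {target}

# Hypothetical associations: one polygenic and one monogenic disease.
associations = {
    "disease_A": {"GENE1", "GENE2", "GENE3"},
    "disease_B": {"GENE4"},
}

splits = list(leave_one_out_splits(associations))
```

For the monogenic `disease_B` the seed set is empty, which is why the prediction in that case has to rely on phenotypic similarity.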
Recently, a number of gene prioritization methods have been proposed, and their predictive performance has been assessed. In this study, the PPI consists of proteins. Y represents a prior knowledge function. If a protein is related to the disease under consideration, Y assigns it a positive value, and zero otherwise.
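One plausible reading of the prior knowledge function Y is sketched below. It assumes that a protein's prior value is the largest phenotype similarity between the query disease and any disease the protein is already associated with; this rule, the function name and the data are assumptions for illustration and are not taken from the text.

```python
def prior_knowledge(query, similarity, disease_genes, proteins):
    """Build Y as a dict: protein -> prior value for the query disease.

    similarity maps (query, other_disease) to a score in [0, 1].
    Proteins with no known link to a similar disease keep the value zero.
    """
    y = {p: 0.0 for p in proteins}
    for disease, genes in disease_genes.items():
        score = similarity.get((query, disease), 0.0)
        for gene in genes:
            if gene in y:
                # Assumed rule: keep the strongest supporting similarity.
                y[gene] = max(y[gene], score)
    return y

# Hypothetical similarity scores and associations.
similarity = {("Q", "D1"): 0.8, ("Q", "D2"): 0.3}
disease_genes = {"D1": {"P1"}, "D2": {"P1", "P2"}}
proteins = ["P1", "P2", "P3"]

y = prior_knowledge("Q", similarity, disease_genes, proteins)
```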
Disease similarity information is incorporated into Y in proportion. To systematically compare the performance of different DPNs, the following evaluation criteria were used. A lower MRR indicates better performance. In LOOCV, the proportion of true disease genes ranked first is defined as the precision of prioritizing candidate genes [23, 24].
A high precision means that the DPN has high predictive power. However, the difference between DPN forms is often minimal; therefore, the number of true disease genes ranked first is also presented to describe the prioritization performance accurately. In a real situation, computer-aided gene screening usually yields a certain number of candidate genes. However, because validating disease-causing genes in vivo is expensive and time-consuming, the number of candidate genes should be limited.
The genes ranked within the top 5, 10 and 30 are sufficient for discovering the vast majority of genes; therefore, the TPRs in the top 5, 10 and 30 were selected as performance measures to estimate the efficiencies of the DPNs. These three thresholds represent reasonable biological hypotheses in a genetic screen [25], and they were adopted to evaluate gene prioritization tools or algorithms in previous studies [26]. The corresponding TPRs are the ratios of the number of true disease genes ranked in the top 5, 10 and 30 to the number of all validated disease-causing genes.
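The evaluation criteria above can be computed from the rank of each target gene across the LOOCV runs. The sketch below assumes that MRR denotes the mean rank ratio (rank divided by the number of candidates), which is consistent with lower values being better; the ranks and the candidate count are hypothetical.

```python
def mean_rank_ratio(ranks, n_candidates):
    """Average of rank / number of candidates over all LOOCV runs (lower is better)."""
    return sum(r / n_candidates for r in ranks) / len(ranks)

def top_k_rate(ranks, k):
    """Fraction of target genes ranked within the top k.

    With k = 1 this is the precision; with k = 5, 10 or 30 it is the TPR.
    """
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks of five target genes among 100 candidates.
ranks = [1, 3, 7, 12, 40]
n_candidates = 100

precision = top_k_rate(ranks, 1)
tpr = {k: top_k_rate(ranks, k) for k in (5, 10, 30)}
```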
The common disease records were extracted from the two DPNs to ensure that the diseases in the two datasets were consistent. For the asymmetric and symmetric matrices extracted from resnikHPO, four methods, i. Then, these data were processed with a logistic regression function.
In each run of LOOCV, the candidate genes were ranked, and the results were compared according to the evaluation criteria. Two cases, i. The first step consists of extracting the common disease records from mimMiner and resnikHPO to construct the corresponding subnetworks for comparative analysis. ResnikHPO has asymmetric and symmetric versions, so two subnetworks are extracted. Every subnetwork can be represented by an adjacency matrix. Step 2 consists of normalization and logistic regression.
The asymmetric and symmetric subnetworks of resnikHPO need normalization to adjust all values into the range [0, 1].
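As an illustration of rescaling a similarity matrix into [0, 1], the sketch below uses min-max scaling based on the smallest and largest entries of the matrix. Treating this as one of the four normalization methods is an assumption, and the matrix values are hypothetical.

```python
def min_max_normalize(matrix):
    """Rescale every entry to [0, 1] using the matrix-wide minimum and maximum."""
    values = [v for row in matrix for v in row]
    sim_min, sim_max = min(values), max(values)
    if sim_max == sim_min:
        # Degenerate case: all entries equal, so there is no spread to rescale.
        return [[0.0 for _ in row] for row in matrix]
    span = sim_max - sim_min
    return [[(v - sim_min) / span for v in row] for row in matrix]

# Hypothetical raw similarity scores.
raw = [[2.0, 4.0],
       [6.0, 10.0]]
normalized = min_max_normalize(raw)
```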
Four methods, shown in italics, are adopted to process the corresponding adjacency matrix. Taking the Lin method as an example, the normalized asymmetric matrix yields Lin-R and Lin-C, while the normalized symmetric matrix is named Lin. In essence, Lin-R and Lin-C come from the same matrix; the difference is that the similarity scores between a given disease and the others are drawn from the row and column vectors of the matrix, respectively. In addition, mimMiner and resnikHPO can be integrated, and the combination of mimMiner and Tanimoto outperforms each network alone (see the main text).
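The difference between the -R, -C and symmetric forms can be shown on a small asymmetric matrix. The values are hypothetical, and the symmetric form is taken here to be the element-wise average of the two directional scores.

```python
def row_form(matrix, i):
    """-R form: similarities of disease i read along row i."""
    return list(matrix[i])

def column_form(matrix, i):
    """-C form: similarities of disease i read down column i."""
    return [row[i] for row in matrix]

def symmetrize(matrix):
    """Symmetric form: average of the two directional scores for each pair."""
    n = len(matrix)
    return [[(matrix[i][j] + matrix[j][i]) / 2 for j in range(n)] for i in range(n)]

# Hypothetical asymmetric similarity matrix for three diseases.
asym = [[1.0, 0.2, 0.6],
        [0.4, 1.0, 0.1],
        [0.8, 0.3, 1.0]]

sym = symmetrize(asym)
```

For disease 0 the row form gives `[1.0, 0.2, 0.6]` and the column form gives `[1.0, 0.4, 0.8]`, so the two forms can rank the same neighbours differently even though they come from one matrix.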
Logistic regression is an optional process, but it substantially improves performance of the networks. Step 3 consists of validation and evaluation. For a fair comparison, the shared diseases were extracted from the two disease phenotype networks. The protein-coding genes in the PPI network and the diseases formed a disease-gene network, containing associations between diseases and genes. The diseases consist of monogenic diseases and polygenic diseases. Then, different DPNs were statistically compared, and the C value in the logistic regression function was analyzed.
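The excerpt does not give the exact logistic regression function, so the following is only a sketch of a generic logistic transform applied to similarity scores. The functional form, the parameter names `c` and `d`, and their default values are assumptions chosen for illustration; `c` is meant to play the role of the C value analyzed in the text.

```python
import math

def logistic_transform(x, c=-15.0, d=math.log(9999)):
    """Map a similarity score x in [0, 1] through a logistic curve.

    c controls the steepness (negative values make the output increase with x)
    and d sets the output at x = 0. Both defaults are placeholders.
    """
    return 1.0 / (1.0 + math.exp(c * x + d))

# Hypothetical similarity scores before and after the transform.
scores = [0.0, 0.3, 0.6, 0.9]
transformed = [logistic_transform(s) for s in scores]
```

With a negative `c`, low similarity scores are pushed toward zero while high scores are pushed toward one, which sharpens the contrast between weakly and strongly similar disease pairs.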
Finally, two real examples demonstrate the ability of mimMiner, resnikHPO and their integrated network to prioritize genes. For each of the diseases that were associated with only one gene, the disease-gene association was removed before the LOOCV. Then, the target gene was predicted using the previously known disease phenotype similarity information. First, the original data, untreated by logistic regression, were validated.
For the resnikHPO, each of the four transformations was validated individually, and each transformation included three different forms. The experimental results are listed in Table 1. The results show that for each transformation, the performance of the -C form was better than the corresponding -R and the symmetric form.
Additionally, the symmetric form is slightly better than the -R form because it is the average of the two asymmetric forms. Notably, Tanimoto-C correctly ranked 20 target genes at the top, far more than the other transformations.
Frequently, the term phenotype is used to relate a difference in DNA sequence among individuals to a difference in a trait, be it height, hair color, disease, or what have you. But it's important to remember that phenotypes are equally, or sometimes even more greatly, influenced by environmental effects than by genetic effects.
So a phenotype can be directly related to a genotype, but not necessarily. Nature Education 1(1). Why can you possess traits neither of your parents have? The relationship of genotype to phenotype is rarely as simple as the dominant and recessive patterns described by Mendel.
Among the breast and ovarian cancer patients who were identified as the mutation c. The difference in the breast:ovarian cancer relative risk associated with the c. In this population-based series the median age at diagnosis of breast cancer patients among the cases with no mutations was The median age at diagnosis of ovarian cancer cases among the cases without mutations was The linear trends of age-related cumulative incidence of breast cancer cases are shown in Figure 1.
We observed a significant difference in cumulative incidence of breast cancer among the c. The difference in cumulative incidence of patients without mutations in comparison with the c. The difference in age at diagnosis of ovarian cancer among the c. Age-related cumulative incidence of breast cancer cases among c. The breast cancer survival analysis includes 25 mutation c. There were 36 women who died from breast cancer 7 in the c. The last 4 cases were included in survival analysis, but were counted as the end of the follow-up period and not as death events.
The cumulative survival plot is shown in Figure 2. Analysis of the Kaplan-Meier curves showed that the clinical outcome of breast cancer patients who were the c. We did not observe any significant difference in tumour staging and lymph node status in both groups of mutation carriers Table 1. Overall survival of breast cancer patients - c. We also performed Cox regression analysis among breast cancer patients where cancer related mortality was used as the end point.
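For readers unfamiliar with how censored cases enter a Kaplan-Meier curve, here is a minimal sketch of the estimator. The follow-up times and event flags are hypothetical and are not the study's data; cases counted as the end of follow-up rather than as deaths are treated as censored.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time with at least one death.

    events[i] is True for a death and False for a censored observation,
    i.e. a case followed up to that time without a death event.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Multiply by the conditional probability of surviving past time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored cases both leave the risk set after time t.
        at_risk -= leaving
    return curve

# Hypothetical follow-up times in years; False marks a censored case.
times = [2, 3, 3, 5, 8]
events = [True, True, False, True, False]

curve = kaplan_meier(times, events)
```

A censored case lowers the number at risk for later time points but does not itself produce a step in the curve.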
The presence of the c. The presence of any BRCA1 founder mutation was not significantly associated with unfavourable prognosis in multivariable analysis among all hereditary and sporadic breast cancer patients (Table 3). The Hereditary Cancer Institute database contains information about families of BRCA1 founder mutation carriers including 79 families of c.
Overall, the c. The number of "breast cancer families", with a history of only breast cancer cases and no ovarian cancer cases among at least two 1st- or 2nd-degree relatives (the latter related through a man), was higher among the c. Probands, c.
Due to the polygenic inheritance of breast and ovarian cancers, clustering of specific cancer localizations within families can be considered an important source of unspecific bias in results obtained by analysing family histories. To overcome this problem we also investigated the prevalence of breast, ovarian and other cancer localizations among all the 1st- and 2nd-degree relatives of the carriers of both BRCA1 founder mutations, irrespective of family composition.
The prevalence of different cancer localizations among all the 1st- and 2nd-degree relatives of the c. In this study we investigated the prevalence of the most common BRCA1 founder mutations in a population-based series of breast and ovarian cancer cases in Latvia. Several population-based studies performed in countries neighbouring Latvia (Lithuania, Belarus and Poland) have demonstrated that 3 founder mutations - c.
Previous studies in Latvia have shown that the c. Nevertheless, according to our estimation the prevalence of the c. The exact prevalence of BRCA1 founder mutations in Latvia is difficult to investigate due to relatively high heterogeneity of the Latvian population; however, in the analysis of a population screening of hereditary cancer syndromes in the Valka district of Latvia, the prevalence of BRCA1 founder mutations in the Latvian population was estimated at approximately 0.
In our study we also found some phenotypic variations of hereditary breast and ovarian cancer syndromes among the c. First of all, we observed a significant difference in the prevalence of c. The breast:ovarian cancer ratio was higher among the c. Despite the fact that some previous reports have shown an almost equal breast:ovarian cancer ratio among the c.
BRCA1 c. Several other reports have demonstrated an increased prevalence of ovarian cancer cases among the c. In the analysis of hospital-based series of breast and ovarian cancer cases in Belarus the breast:ovarian cancer ratio among the c. Several studies have shown some other phenotypic variations besides breast:ovarian cancer ratios associated with mutations, located in different parts of the BRCA1 gene.
Satagopan et al. found that the estimated lifetime risk of ovarian cancer development was two times as high for the c. Al-Mulla et al. showed that age-related expressivity and penetrance of breast and ovarian cancers depended on the mutation position in the BRCA1 gene and differed among the carriers of various mutations located in exons 2, 11 and 13 [17].
The genotype-phenotype correlation effect among the c. The median age of onset of breast and ovarian cancers among the c. A similar trend in the age of onset of breast cancer cases among the c.