The database of Genotypes and Phenotypes (dbGaP) was developed by the National Center for Biotechnology Information (NCBI) to archive and distribute the results of studies that have examined the interaction of genotype and phenotype. It is a public repository for individual-level phenotype, exposure, genotype, and sequence data and the associations between them. Finding the studies relevant to a particular interest accurately and completely is a challenging task because the dbGaP Entrez system relies on keyword-based search. Text mining is an emerging research field that enables users to extract useful information from text documents and deals with retrieval, classification, clustering, and machine learning ...
The database contains specific phenotype variables and statistical summaries of genetic information. It allows access to individual-level data if the request is approved by an NIH Data Access Committee. The database is growing quickly: on 24 October 2013 dbGaP contained 402 studies, and by 5 May 2014 there were 468 top-level studies.
1.2 Challenges in dbGaP study text retrieval
As of 5 May 2014, dbGaP contained 468 studies with more than 144,716 phenotype variables. However, retrieving relevant studies accurately and completely is a challenging issue, because the phenotypic information related to studies is often stored in a non-standardized format. For particular queries, the dbGaP Entrez system returns several irrelevant studies, and it does not make clear how studies are selected or why they appear in a particular order. Thus, users have to review each study description carefully to determine which studies are relevant, which becomes laborious and time-consuming when many studies are retrieved.
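The retrieval problem described above can be illustrated with a minimal sketch. It contrasts plain keyword matching, which returns every study containing any query term in no meaningful order, with a simple TF-IDF score that ranks studies by how specific the matched terms are. The study descriptions, identifiers, and function names below are hypothetical and chosen only for illustration; this is not a description of how the dbGaP Entrez system works internally.

```python
import math
import re
from collections import Counter

# Hypothetical study descriptions (illustrative only, not real dbGaP records).
STUDIES = {
    "study_A": "Genome-wide association study of type 2 diabetes in adults",
    "study_B": "Cohort study of heart disease; diabetes status recorded as a covariate",
    "study_C": "Sequencing study of early-onset breast cancer",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_match(query, studies):
    """Return every study that shares at least one term with the query (unranked)."""
    terms = set(tokenize(query))
    return [sid for sid, text in studies.items() if terms & set(tokenize(text))]


def tfidf_rank(query, studies):
    """Rank studies by a simple TF-IDF score; studies with a zero score are dropped."""
    docs = {sid: Counter(tokenize(text)) for sid, text in studies.items()}
    n_docs = len(docs)
    scores = {}
    for sid, counts in docs.items():
        total = sum(counts.values())
        score = 0.0
        for term in set(tokenize(query)):
            # Document frequency: number of studies containing the term.
            df = sum(1 for c in docs.values() if term in c)
            if df == 0:
                continue
            # Smoothed inverse document frequency: rarer terms weigh more.
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0
            score += (counts[term] / total) * idf
        scores[sid] = score
    hits = [sid for sid in scores if scores[sid] > 0]
    return sorted(hits, key=lambda sid: scores[sid], reverse=True)


if __name__ == "__main__":
    query = "heart disease study"
    print("keyword match:", keyword_match(query, STUDIES))
    print("tf-idf ranking:", tfidf_rank(query, STUDIES))
```

In this toy example the word "study" occurs in every description, so keyword matching returns all three studies, while the TF-IDF score places the heart disease cohort first because "heart" and "disease" are rarer terms. The scoring formula here is one common smoothed variant and is an assumption of this sketch.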
1.3 Text Mining
The information age has made it easy to store huge amounts of text documents. These are available on the internet, on corporate intranets, and elsewhere. However, while the amount of information increases day by day, our ability to process and absorb it remains constant.
Text mining is the...