Latent Semantic Indexing
Latent Semantic Indexing (or LSI) is a concept-based information retrieval model. Terms and documents are both encoded as vectors in a common vector space, so that documents may be clustered (semantically) near each other even when they share no common terms. LSI addresses the two fundamental problems that plague traditional lexical-matching indexing schemes: synonymy and polysemy. Content Analyst Company, LLC owns the original patent to LSI: "Computer information retrieval using latent semantic structure," U.S. Patent No. 4,839,853, June 13, 1989.
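As a rough illustration of how a reduced vector space can place documents near each other without shared terms, the following minimal sketch builds a toy term-by-document matrix and projects the documents onto the two dominant singular directions. The vocabulary, the documents, and the helper names (`top_k_svd`, `document_vectors`) are hypothetical and chosen only for this example; the power iteration with deflation used here is a simple stand-in for a production SVD routine, not the method used in any LSI implementation referenced on this page.

```python
import math

def matvec(A, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def norm(x):
    return math.sqrt(sum(a * a for a in x))

def cosine(x, y):
    return sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y))

def top_k_svd(A, k, iters=500):
    """Approximate the k largest singular triplets (sigma, u, v) of A
    using power iteration on A^T A with deflation."""
    At = [list(col) for col in zip(*A)]
    n = len(At)  # number of columns (documents)
    triplets = []
    for _ in range(k):
        v = [1.0 / math.sqrt(n)] * n
        for _step in range(iters):
            # Deflation: remove components along vectors already found.
            for _sigma, _u, v_prev in triplets:
                dot = sum(a * b for a, b in zip(v, v_prev))
                v = [a - dot * b for a, b in zip(v, v_prev)]
            w = matvec(At, matvec(A, v))  # one power step on A^T A
            w_norm = norm(w)
            v = [a / w_norm for a in w]
        u = matvec(A, v)
        sigma = norm(u)
        triplets.append((sigma, [a / sigma for a in u], v))
    return triplets

def document_vectors(A, k):
    """Coordinates of each document (column of A) in the k-dimensional space."""
    triplets = top_k_svd(A, k)
    return [[sigma * v[j] for sigma, _u, v in triplets] for j in range(len(A[0]))]

# Hypothetical toy data: rows are terms, columns are documents d0..d5.
TERMS = ["car", "automobile", "engine", "flower", "petal"]
A = [
    [1, 0, 0, 1, 0, 0],  # car
    [0, 1, 0, 0, 1, 0],  # automobile
    [1, 1, 0, 0, 0, 0],  # engine
    [0, 0, 1, 0, 0, 1],  # flower
    [0, 0, 1, 0, 0, 1],  # petal
]

raw = [list(col) for col in zip(*A)]
docs = document_vectors(A, k=2)
# d3 contains only "car" and d4 only "automobile": no shared terms.
print("raw cosine(d3, d4):   ", cosine(raw[3], raw[4]))
print("latent cosine(d3, d4):", cosine(docs[3], docs[4]))
print("latent cosine(d3, d2):", cosine(docs[3], docs[2]))
```

In this toy collection the documents containing only "car" and only "automobile" are linked through their co-occurrence with "engine" in other documents, which is the kind of synonymy effect the model is meant to capture. The choice of k = 2 is arbitrary here; real collections typically use far more dimensions.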
The Integrated Modeling Project (IMP), sponsored by the Environmental Impacts Program of the USDA Forest Service, is an integrated forest health and productivity assessment of southern and southeastern forests in relation to changes in climate, air quality, and land use. The primary research focus of Prof. Michael W. Berry and Research Associate Karen S. Minser (Dept. of Computer Science) is the development of a problem-solving environment (PSE) that facilitates the horizontal integration of forest responses to environmental stresses and disturbances through the use of micro-scale cellular automata.
The Interactive Cluster Analysis Toolkit (or ICAT) utilizes the Enhanced Hoshen-Kopelman algorithm to provide a highly adaptable method for cluster analysis. Within the context of diabetic retinopathy, different neighborhood rules implemented within ICAT provide better approaches for classifying retinal features such as neovascularization and exudates. The flexible design of ICAT allows new metrics for characterizing cluster geometry or new neighborhood rules for cluster identification to be easily incorporated.
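To make the role of the neighborhood rule concrete, here is a minimal sketch of the classic Hoshen-Kopelman idea: a single raster scan that assigns provisional labels and merges them with a union-find structure. It is not ICAT's Enhanced Hoshen-Kopelman implementation; the function name `label_clusters` and the `neighborhood` parameter (4 or 8) are assumptions made for illustration.

```python
def label_clusters(grid, neighborhood=4):
    """Label connected occupied cells of a 2-D grid (truthy = occupied).

    Returns (labels, cluster_count), where labels is a grid of integers
    with 0 for background and 1..cluster_count for clusters.
    """
    if neighborhood == 4:
        # Neighbors already visited by a row-major scan: left, up.
        offsets = [(0, -1), (-1, 0)]
    elif neighborhood == 8:
        # Left, up-left, up, up-right.
        offsets = [(0, -1), (-1, -1), (-1, 0), (-1, 1)]
    else:
        raise ValueError("neighborhood must be 4 or 8")

    rows, cols = len(grid), len(grid[0])
    parent = [0]  # parent[i] is the union-find parent of provisional label i

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    labels = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c]:
                continue
            seen = []
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and labels[rr][cc]:
                    seen.append(labels[rr][cc])
            if not seen:
                parent.append(len(parent))  # start a new provisional label
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = seen[0]
                for other in seen[1:]:
                    union(seen[0], other)

    # Second pass: replace provisional labels by consecutive cluster ids.
    canonical = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = canonical.setdefault(root, len(canonical) + 1)
    return labels, len(canonical)

# Hypothetical binary image: diagonal pixels touch only at corners.
image = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
print(label_clusters(image, neighborhood=4)[1])
print(label_clusters(image, neighborhood=8)[1])
```

On the small image above, the two rules disagree about whether corner-touching pixels belong to one cluster, which illustrates why the choice of neighborhood rule matters when classifying features in an image.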
The Regional Simulation model (RSim) is designed to integrate the environmental effects of on-base military training and testing with those of off-base development. Effects considered include air and water quality, noise, and habitats for endangered and game species. A risk assessment approach is being used to determine the impacts of single and integrated risks. The RSim simulation will eventually be available on the Web and will be used in a gaming mode so that users can explore the repercussions of military and land-use decisions. RSim is currently being developed for the region around Fort Benning, Georgia, but is broadly applicable. This project is sponsored by the Strategic Environmental Research & Development Program (SERDP), an initiative funded by the U.S. Departments of Energy and Defense and the U.S. Environmental Protection Agency (EPA).
The Land-Use Change Analysis System (LUCAS) simulates land-cover changes in a heterogeneous (distributed) computing environment. LUCAS generates new maps of land cover representing the amount of land-cover change so that issues such as biodiversity conservation, the importance of landscape elements in meeting conservation goals, and long-term landscape integrity can be addressed.
Encyclopedia of Computer Science and Engineering
Michael W. Berry served as the Applications area editor of the Encyclopedia of Computer Science and Engineering (Wiley Interscience) that was edited by Prof. Benjamin Wah at the University of Illinois at Urbana-Champaign and published in 2004.
Information Retrieval Seminar
Three-Day Seminar Course on Information Retrieval, Facultad de Matemáticas, Universidad Autónoma de Yucatán (UADY), Mérida, México, March 10-12, 2004.
SVDPACK comprises four numerical (iterative) methods for computing the singular value decomposition (SVD) of large sparse matrices using double precision ANSI Fortran-77. A compatible ANSI-C version (SVDPACKC) is also available. SVDPACK and SVDPACKC implement Lanczos and subspace iteration-based methods for determining several of the largest singular triplets for large sparse matrices. The development of SVDPACK was motivated by the need to compute large-rank approximations to sparse term-document matrices from information retrieval applications such as Latent Semantic Indexing (described above). SVDPACKC was used in the InfoMap project developed in the Computational Semantics Laboratory at Stanford University.
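The appeal of iterative methods for sparse matrices is that they only need products with the matrix and its transpose, so the matrix never has to be stored densely. The sketch below shows this with a single-vector power iteration, the simplest relative of subspace iteration, for the largest singular triplet. It is not SVDPACK code; the coordinate-list storage and the function name `largest_singular_triplet` are assumptions for illustration, and a practical solver would add a convergence test and block or Lanczos acceleration.

```python
import math

def largest_singular_triplet(entries, n_rows, n_cols, iters=200):
    """Approximate the largest singular triplet (sigma, u, v) of a sparse matrix.

    entries: list of (row, col, value) tuples for the nonzero elements.
    Only sparse products A*v and A^T*u are used.
    """
    # Uniform start vector; it must not be orthogonal to the dominant
    # right singular vector for the iteration to find it.
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    u = [0.0] * n_rows
    sigma = 0.0
    for _ in range(iters):
        # u = A v, touching only the nonzeros
        u = [0.0] * n_rows
        for i, j, a in entries:
            u[i] += a * v[j]
        sigma = math.sqrt(sum(x * x for x in u))
        if sigma == 0.0:
            break  # v lies in the null space of A
        u = [x / sigma for x in u]
        # v = A^T u, normalized
        w = [0.0] * n_cols
        for i, j, a in entries:
            w[j] += a * u[i]
        w_norm = math.sqrt(sum(x * x for x in w))
        v = [x / w_norm for x in w]
    return sigma, u, v

# Hypothetical 3 x 2 sparse matrix [[3, 0], [4, 0], [0, 1]] in coordinate form.
nonzeros = [(0, 0, 3.0), (1, 0, 4.0), (2, 1, 1.0)]
sigma, u, v = largest_singular_triplet(nonzeros, n_rows=3, n_cols=2)
print(sigma, u, v)
```

Each iteration costs time proportional to the number of nonzeros, which is what makes this family of methods practical for large term-document matrices.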
Whole Genome Phylogeny
As whole genome sequences continue to expand in number and complexity, effective methods for comparing and categorizing both genes and species represented within extremely large datasets are required. Current methods have generally utilized incomplete (and likely insufficient) subsets of the available data even as additional data becomes available at a rapid rate. In collaboration with Prof. Gary Stuart at Indiana State University, an accurate and efficient method for producing robust gene and species phylogenies using very large whole genome protein datasets has been developed. This method relies on multidimensional protein vector definitions supplied by the singular value decomposition (SVD) of large sparse data matrices in which each protein is uniquely represented as a vector of overlapping tetrapeptide frequencies. The link above is to presentation slides shown on March 23 at the UT-ORNL Bioinformatics Summit 2002; an updated presentation was made at an Indiana Univ. School of Informatics Colloquium on Nov. 14, 2003 (audio/slides).
Understanding the functional relationships between genes remains a major challenge in the interpretation of genomic data. Bioinformatics tools to automate the extraction and utilization of gene information from biological databases and the scientific literature are being developed. We present a new software environment called Semantic Gene Organizer © (SGO), which utilizes Latent Semantic Indexing (LSI), a concept-based vector space model, to automatically extract gene relationships from titles and abstracts in MEDLINE citations.
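The representation of a protein as overlapping tetrapeptide frequencies can be sketched as follows. This is only an illustration of the counting step and of assembling a sparse peptide-by-protein matrix in coordinate form; the function names and the example sequences are hypothetical, and any weighting or normalization applied before the SVD in the actual method is not shown because it is not described here.

```python
from collections import Counter

def tetrapeptide_counts(sequence, k=4):
    """Count overlapping k-residue windows (tetrapeptides for k=4)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def build_sparse_matrix(proteins, k=4):
    """Build a peptide-by-protein count matrix in coordinate form.

    proteins: dict mapping protein name -> amino acid sequence.
    Returns (peptide_index, entries), where peptide_index maps each
    observed peptide to a row number and entries is a list of
    (row, col, count) tuples, one column per protein in dict order.
    """
    peptide_index = {}
    entries = []
    for col, sequence in enumerate(proteins.values()):
        for peptide, count in sorted(tetrapeptide_counts(sequence, k).items()):
            row = peptide_index.setdefault(peptide, len(peptide_index))
            entries.append((row, col, count))
    return peptide_index, entries

# Hypothetical short sequences, for illustration only.
proteins = {"protA": "MKVLMKVL", "protB": "KVLMAG"}
index, entries = build_sparse_matrix(proteins)
print(tetrapeptide_counts(proteins["protA"]))
print(len(index), "distinct tetrapeptides,", len(entries), "nonzeros")
```

With 20 amino acids there are 160,000 possible tetrapeptides, and any single protein contains only a small fraction of them, which is why the resulting matrix is large and sparse.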
We have developed a Web-based bioinformatics tool called Feature Annotation Using Nonnegative matrix factorization (FAUN) to facilitate both the discovery and classification of functional relationships among genes. Both the computational complexity and parameterization of nonnegative matrix factorization (NMF) for processing gene sets are currently being investigated. FAUN has been tested on several manually constructed gene collections (ranging in size from 50 to 800 genes) and has been specifically engineered to analyze several microarray-derived gene sets obtained from studies of the developing cerebellum in normal and mutant mice. FAUN provides utilities for collaborative knowledge discovery and identification of new gene relationships from text streams and repositories (e.g., MEDLINE). It is particularly useful for the validation and analysis of gene associations suggested by microarray experimentation. A video about NIMBioS, in which Elina Tjioe demonstrates FAUN, is also available. This project is supported by the Gene Regulation in Time & Space project (funded by the NIH).
GST Retreat Poster (March 14, 2008, 4.7MB ppt); UT-ORNL-KBRIN Poster (March 28-30, 2008); published in BMC Bioinformatics, July 8, 2008.
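For readers unfamiliar with NMF, the following minimal sketch factors a nonnegative matrix V into nonnegative factors W and H using the well-known multiplicative update rules of Lee and Seung. It is not FAUN's implementation; the function names, the random initialization, the fixed iteration count, and the toy matrix are all assumptions for illustration. In a text-mining setting such as the one described above, V would plausibly be a term-by-gene-document matrix, with the columns of W acting as features.

```python
import random

def matmul(A, B):
    """Dense matrix product for lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Frobenius norm of V - W*H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx)) ** 0.5

def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Approximate V (m x n, nonnegative) by W (m x rank) times H (rank x n)."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    # Strictly positive random start; the result depends on this choice.
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H

# Hypothetical nonnegative matrix with two obvious groups of columns.
V = [
    [2.0, 2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 1.0, 1.0],
]
W, H = nmf(V, rank=2)
print("reconstruction error:", frobenius_error(V, W, H))
```

The rank, the initialization, and the number of iterations are exactly the kind of parameters whose effect has to be studied in practice, since NMF solutions are not unique and the updates only reach a local optimum.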
The Grid Computing for Ecological Modeling and Spatial Control of Wildfires project was a National Science Foundation (NSF) funded research project that began in 2005 and concluded in 2008. The project involved several students and postdoctoral fellows, who developed several fire spread models and several methods for evaluating how spatial control might be used to limit the spread of a wildfire. The software simulated a fire starting at any of a variety of burnable locations on a map. In the simplest case, the fire would then spread according to the burnable/non-burnable (green/black) areas of the map; a local fire load could also be included, which would affect the magnitude of local burns as well as the probability of spread. The unique aspect of this project was the computation of the optimal placement of a fire break, with the objective of enclosing the fire and sparing as much of the region as possible from burning. The overall goal of the project was to improve the accuracy of responses to fire spread, to develop effective control strategies, and to produce a method that might be useful in training fire suppression personnel.
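The simplest case described above can be sketched as a spread process on a grid of burnable and non-burnable cells. The code below is a hypothetical illustration and not one of the project's fire spread models: the function name `simulate_fire`, the 4-neighbor spread rule, the single `spread_prob` parameter, and the representation of a fire break as a set of protected cells are all assumptions.

```python
import random
from collections import deque

def simulate_fire(burnable, start, spread_prob=1.0, firebreak=(), seed=0):
    """Return the set of (row, col) cells burned by a fire ignited at start.

    burnable: grid of truthy (burnable) / falsy (non-burnable) cells.
    firebreak: cells treated as non-burnable regardless of the map.
    spread_prob: chance that fire crosses from a burning cell to each
    unburned neighbor; a cell may be tried once per burning neighbor.
    """
    rng = random.Random(seed)
    rows, cols = len(burnable), len(burnable[0])
    blocked = set(firebreak)
    if start in blocked or not burnable[start[0]][start[1]]:
        return set()
    burned = {start}
    frontier = deque([start])
    while frontier:
        r, c = frontier.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            cell = (nr, nc)
            if cell in burned or cell in blocked or not burnable[nr][nc]:
                continue
            if rng.random() < spread_prob:
                burned.add(cell)
                frontier.append(cell)
    return burned

# Hypothetical 3 x 5 map in which every cell is burnable.
landscape = [[1] * 5 for _ in range(3)]
ignition = (1, 0)
unchecked = simulate_fire(landscape, ignition)
# A fire break along the third column encloses the fire on the left.
break_cells = [(0, 2), (1, 2), (2, 2)]
contained = simulate_fire(landscape, ignition, firebreak=break_cells)
print(len(unchecked), "cells burned without a break,", len(contained), "with it")
```

Comparing the burned area for different candidate breaks, over many ignition points, gives a crude way to score a placement; searching for the best placement is the optimization problem the project addressed and is not attempted here.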
Python for Biologists
The intent of this tutorial, created from a COSC 670 course project during Spring Semester 2012, is to introduce computational biologists to some of the novel features of the Python programming language for problem solving. This material is intended to accompany a one-day, in-person, hands-on workshop and to serve as a post-workshop resource for attendees.