Sunday, September 7, 2008

Collaborative Analytics and Environment for Linking Early Event Detection to an Effective Response

A system for early detection, situational awareness and coordinated response is essential to effectively mitigate the threat (morbidity and mortality) of a health-related event and to improve health. The progress made to date in biosurveillance worldwide is significant and should continue to evolve to meet existing and emerging needs. Existing processes, relationships, technologies, policies, infrastructures, and advances in science and technology provide a solid foundation for a truly integrated biosurveillance solution. Of paramount importance is the need to strengthen the capacity of public health services, and to enable their data-driven decision-making, from the local to the district, national and global levels, both strategically and tactically.

Looking at the current landscape, however, we found that the majority of the designs, analyses and evaluations of early detection (or biosurveillance) systems have been geared towards specific data sources and detection algorithms. Much less effort has been focused on how these systems will "interact" with humans. For example, consider multiple domain experts working at different levels across different organizations in an environment where numerous biosurveillance algorithms may provide contradictory interpretations of ongoing events.

Nico and I have been working on a project codenamed RNA (or Event Evolution) to provide the public health, disaster and humanitarian communities with 1) a "collaborative" virtual environment supporting the entire life cycle of an event, and 2) a ubiquitous biosurveillance capability. Our objective is to connect early event indications to a coordinated and timely response, thereby shortening the response cycle. Through a hybrid (event-based and indicator-based) surveillance approach, we aim to provide processes, methods, and technology tools for streamlining collaboration between domain experts and machine learning algorithms. By synthesizing a wide variety of health-related event indications into a consolidated picture, RNA is anticipated to help the user community to:
  1. Rapidly identify, characterize, localize, and track health-related events
  2. Maintain a global awareness of the situation
  3. Integrate and analyze data relating to human health, animal, plant, food, and environment
  4. Disseminate alerts
Humans are an essential part of every step of the life cycle of an event. Humans understand the meanings of information, languages, images, etc. better than machines alone and can make contextually relevant collections of information. Each collection can build the power of the knowledge network in order to corroborate or refute different hypotheses, especially at the early and sketchy stages of an event.

RNA will consist of several high-level modules, including:
  1. Data processing
  2. Automatic feature extraction, data classification and tagging
  3. Human input, hypotheses generation and review
  4. Predictions and alerts output
  5. Field confirmation and feedback
The data processing module allows users to assimilate, broker, and/or collect information from several sources: SMS messages (e.g., Geo-Chat microformat), RSS feeds, email lists (e.g., ProMED or ProMED MBDS), documents, web pages, scholarly publications, electronic medical records, animal disease data, environmental feeds, remote sensing, VoIP, alerts, etc. A number of ontologies (geographical and disease) will be supported in multiple languages, including English, Spanish, French, Japanese, Vietnamese, Thai, Korean, Chinese, Arabic, Russian, and Khmer. Disease ontologies will include Medical Subject Headings (MeSH), Systematized Nomenclature of Medicine--Clinical Terms (SNOMED-CT), Logical Observation Identifiers Names and Codes (LOINC®), and the International Classification of Diseases (ICD-9/10).

The automatic feature extraction, data classification and tagging module is extensible, allowing the introduction of machine learning algorithms (e.g., Bayesian). The components of this module can extract and augment features (or metadata) from multiple data streams. For example: 1) at the earliest stages of a disease outbreak, they extract source and target geo-location, time, and route of transmission (e.g., person-to-person, waterborne); 2) at the later stages of a disease outbreak, they detect and suggest tags for new sources based on the evidence from other sources using Support Vector Machines (SVM) (as will be discussed below). Additional feature extraction includes "data decorators": for example, extracting features such as soil moisture, temperature, and mosquito density from a reference NASA remote sensing database during a suspected Dengue fever outbreak investigation following heavy rainfall and flooding associated with the landfall of a hurricane in the Lower Rio Grande Delta. In addition, these components help detect relationships between extracted features within a collaborative space or across different collaborative spaces (e.g., Riff, the ProMED MBDS network). We plan on using a smaller but more comprehensive set of event classifications, as follows:
  1. Large aerosol release
  2. Building/vessel contamination
  3. Small release or contamination
  4. Continuous or intermittent release of an agent
  5. Contagious person-to-person
  6. Commercially distributed products
  7. Waterborne
  8. Vector/host-borne
  9. Sexual or parenteral transmission
This high-dimensional threat space classification can be further reduced to:
  1. Single or focus event
  2. Continuous/sustained event
  3. Distributing/disseminating event
With human input, this module can help suggest possible classes (or a combination of classes) depending on time, space, and where we are in the life cycle of an event. Possible classifications include the following (Note: All tags/classifications can follow a hierarchical construct):
  1. Syndromes (e.g., dermatological, gastrointestinal, musculoskeletal, neurological, respiratory)
  2. Symptoms (e.g., fever, cough, sore throat, diarrhea)
  3. Routes of transmission (e.g., person-to-person, waterborne, foodborne, aerosol)
  4. Diseases (e.g., TB, HIV, Influenza, Influenza/Avian Influenza)
  5. Hosts (e.g., human, animal, plant, multiple)
  6. Time and geo-location (spatio-temporal features)
For example, at the earliest stages of an infectious disease outbreak, classification of an event could indicate that “there is an unknown respiratory event, transmitted person-to-person, detected in location X, and spreading with a Y spatio-temporal pattern across regions Z1, Z2, and Zn.”
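To make the example above concrete, here is a minimal sketch of how such a classification might be held as a record, with hierarchical tags written as "/"-separated paths (as in "Influenza/Avian Influenza" in the list above). The class name, field names, and the tag_matches helper are all hypothetical, chosen for illustration only; they are not RNA's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class EventClassification:
    """Hypothetical record for one classified event (field names are illustrative)."""
    syndrome: str = "unknown"
    route_of_transmission: str = "unknown"
    disease: str = "unknown"
    host: str = "unknown"
    location: str = "unknown"
    spread_regions: list = field(default_factory=list)

    def summary(self) -> str:
        """Render the record as a sentence similar to the example in the text."""
        regions = ", ".join(self.spread_regions) or "no other regions yet"
        return (f"{self.disease} {self.syndrome} event, transmitted "
                f"{self.route_of_transmission}, detected in {self.location}, "
                f"spreading across {regions}")


def tag_matches(tag: str, query: str) -> bool:
    """True if `tag` equals `query` or sits below it in a '/'-separated hierarchy."""
    return tag == query or tag.startswith(query + "/")


# Early-stage event: the disease is still unknown, other features are filled in.
event = EventClassification(
    syndrome="respiratory",
    route_of_transmission="person-to-person",
    location="X",
    spread_regions=["Z1", "Z2"],
)
print(event.summary())
print(tag_matches("Influenza/Avian Influenza", "Influenza"))
```

Leaving unknown fields at a default value lets the same record be refined as the event evolves, and a search for a parent tag such as "Influenza" also picks up items tagged with a more specific child.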

The human input and review module is exposed as a set of features that allow users to comment, tag, and rank the elements (positive, neutral, or negative). Additionally, users (or groups, such as communities of interests) can generate and test multiple hypotheses in parallel, further collect and rank sets of related items (evidence), and model against baseline information (for cyclical or known events).

Bringing that information together, the field confirmation and feedback module helps maintain and record the history of a list of ongoing possible threats. Components in this module allow domain experts to focus their field investigation and information gathering in order to confirm or refute hypotheses under consideration. Feedback information is then fed into the virtual collaborative network to update (increase or decrease) the reliability of the sources and the credibility of the users in light of their inferences or decisions. By analyzing the factors contributing to the identification of various events, it will be possible to help future epidemiologic investigations through play-back or retrospective analysis. Possible information we anticipate recording includes:
  1. Which automated systems generated the most reliable alerts, and for what types of conditions?
  2. Which human users were the most effective in identifying conditions?
  3. Which indications are the most effective in identifying a health event?
  4. What factors help to minimize or aggravate a health event?
  5. Which elements of the biosurveillance life cycle require the most time and/or collaboration?
The network history can provide a common point of evaluation (for the overall situational awareness and the individual processes of the biosurveillance network) for a variety of surveillance and response techniques.
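One simple way to picture the "increase or decrease" of source reliability described above is to count how often each source's indications are later confirmed or refuted in the field. The sketch below is only an assumption about how this could work: the ReliabilityTracker class and the smoothed ratio it uses are ours for illustration, not a description of RNA's scoring.

```python
class ReliabilityTracker:
    """Hypothetical tracker of field-confirmation outcomes per source or user."""

    def __init__(self):
        self._confirmed = {}  # source -> number of confirmed indications
        self._refuted = {}    # source -> number of refuted indications

    def record(self, source: str, confirmed: bool) -> None:
        """Store one field-confirmation outcome for `source`."""
        counts = self._confirmed if confirmed else self._refuted
        counts[source] = counts.get(source, 0) + 1

    def reliability(self, source: str) -> float:
        """Smoothed share of confirmed indications, between 0 and 1.

        Assumption: add-one smoothing, so a source with no history starts
        at 0.5 and moves up or down as feedback arrives.
        """
        c = self._confirmed.get(source, 0)
        r = self._refuted.get(source, 0)
        return (c + 1) / (c + r + 2)


tracker = ReliabilityTracker()
for outcome in (True, True, True, False):
    tracker.record("news-feed-A", outcome)
print(tracker.reliability("news-feed-A"))
print(tracker.reliability("never-seen-source"))
```

Because every outcome is kept as a count rather than overwritten, the same history can be replayed later to answer questions like those in the list above, for example which sources were most reliable for which kinds of conditions.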

Over the summer, the Humanitarian FOSS (HFOSS) Project Summer Institute 2008 (May '08 - July '08) carried out an internship project mentored by InSTEDD and a number of HFOSS faculty. During this internship, Juan Pablo Mendoza and Qianqian Lin developed the ALPACA Light Parsing And Classifying Application (ALPACA) to:
  1. Transform raw unstructured documents (e.g., news reports, ProMED mail, etc.) into machine-readable and analyzable data using a text parsing module
  2. Categorize documents by: a) classifying them, with an SVM classifier built on libSVM, into a predetermined (user-defined) list of categories as described above (syndromes, symptoms, routes of transmission, diseases, etc.), and b) suggesting additional tags and/or topics using a Naive Bayes classifier, given existing topics and the monitoring of human input and review. This is especially helpful with new (emerging) threats, or with known threats that we experience at a much bigger scale than usual (e.g., a far more virulent flu virus than we’ve experienced over the past few years)
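The tag-suggestion step in item 2b can be pictured with a small multinomial Naive Bayes classifier written from scratch. This is a minimal sketch and not ALPACA's code: the class name, the tokenizer, the add-one smoothing, and the toy training sentences are all assumptions made for illustration, and it does not use libSVM.

```python
import math
from collections import Counter, defaultdict

PUNCTUATION = ".,;:()\"'"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


class NaiveBayesTagger:
    """Hypothetical tag suggester: multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per tag
        self.word_counts = defaultdict(Counter)  # word frequencies per tag
        self.vocab = set()                       # all words seen in training

    def train(self, text, tag):
        """Add one human-tagged document to the model."""
        words = tokenize(text)
        self.doc_counts[tag] += 1
        self.word_counts[tag].update(words)
        self.vocab.update(words)

    def scores(self, text):
        """Return log P(tag) + sum of log P(word | tag) for every known tag."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        result = {}
        for tag, n_docs in self.doc_counts.items():
            counts = self.word_counts[tag]
            # The extra +1 reserves probability mass for unseen words.
            denom = sum(counts.values()) + len(self.vocab) + 1
            log_prob = math.log(n_docs / total_docs)
            for w in words:
                log_prob += math.log((counts[w] + 1) / denom)
            result[tag] = log_prob
        return result

    def suggest(self, text, top_n=1):
        """Return the `top_n` most likely tags, best first."""
        ranked = sorted(self.scores(text).items(), key=lambda kv: kv[1], reverse=True)
        return [tag for tag, _ in ranked[:top_n]]


tagger = NaiveBayesTagger()
tagger.train("patients with fever and cough in the village", "respiratory")
tagger.train("severe cough and sore throat reported", "respiratory")
tagger.train("diarrhea cases linked to contaminated water", "gastrointestinal")
tagger.train("vomiting and diarrhea after flooding", "gastrointestinal")
print(tagger.suggest("cluster of cough and fever cases"))
```

Each call to train stands in for a user tagging an item during human review, so the suggestions shift as more tagged items accumulate.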
We tested ALPACA against two widely accepted early sources of information in the public health community: Reuters news and ProMED mail. Results are shown here:

ALPACA is extensible through a plug-in mechanism that provides a simple way to add parsers and classifiers to the application. We are continuously adding and testing additional algorithms, and we welcome your contributions to help us better calibrate existing classifiers and parsers as well as introduce additional ones (you can visit our collaborative space here).
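For readers unfamiliar with plug-in designs, the general idea can be sketched as a registry that maps a source type to a parser function. Everything below is hypothetical: the registry, the decorator, the source type names, and the assumption that a mail body starts after the first blank line are ours for illustration and do not reflect ALPACA's actual plug-in interface.

```python
PARSERS = {}  # hypothetical registry: source type -> parser function


def register_parser(source_type):
    """Decorator that adds a parser function to the registry."""
    def decorator(func):
        PARSERS[source_type] = func
        return func
    return decorator


@register_parser("plain_text")
def parse_plain_text(raw):
    """Turn raw text into a list of lowercase tokens."""
    return raw.lower().split()


@register_parser("promed_mail")
def parse_promed_mail(raw):
    """Drop header lines, then tokenize the body.

    Assumption for illustration: the body follows the first blank line.
    If there is no blank line, the whole message is treated as the body.
    """
    _, _, body = raw.partition("\n\n")
    return (body or raw).lower().split()


def parse(source_type, raw):
    """Look up the parser registered for `source_type` and apply it."""
    try:
        parser = PARSERS[source_type]
    except KeyError:
        raise ValueError(f"no parser registered for {source_type!r}") from None
    return parser(raw)


print(parse("promed_mail", "Subject: update\n\nCough cases reported"))
```

With this shape, adding support for a new source only requires writing one function and registering it; the code that calls parse does not change.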

With RNA we hope to provide the user community with a ubiquitous capability that enables detection, prediction and response to health-related events through a collaborative environment that combines data exploration, integration, search and inference—providing more complex analysis and deeper insight. We've demonstrated RNA's initial capabilities (feature extraction, classification, and tagging and item clustering with a spatio-temporal context) as part of Riff and leveraging mesh4x during a demonstration for the MBDS in SE Asia last week. In the future we also plan to offer RNA as a service that can be integrated with other platforms and networks.

The analytical and collaborative support does not end at the early detection of an event; we envision RNA providing rich and flexible functionality during and after an event for maintaining situational awareness, effective response planning, and evaluation. We plan on providing RNA's libraries, tools and applications on Google Code soon. In the meantime, we look forward to your feedback and contributions.

Some Definitions
  • Public health: "is the study and practice of managing threats to the health of a community. The field pays special attention to the social context of disease and health, and focuses on improving health through society-wide measures like vaccinations, the fluoridation of drinking water, or through policies like seatbelt and non-smoking laws. The goal of public health is to improve lives through the prevention and treatment of disease. The United Nations' World Health Organization defines health as "a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity." In 1920, C.E.A. Winslow defined public health as "the science and art of preventing disease, prolonging life and promoting health through the organized efforts and informed choices of society, organizations, public and private, communities and individuals." The public-health approach can be applied to a population of just a handful of people or to the whole human population. Public health is typically divided into epidemiology, biostatistics and health services. Environmental, social, behavioral, and occupational health are also important subfields." [Source:]
  • Epidemiology: "is the study of factors affecting the health and illness of populations, and serves as the foundation and logic of interventions made in the interest of public health and preventive medicine. It is considered a cornerstone methodology of public health research, and is highly regarded in evidence-based medicine for identifying risk factors for disease and determining optimal treatment approaches to clinical practice. In the work of communicable and non-communicable diseases, the work of epidemiologists range from outbreak investigation to study design, data collection and analysis including the development of statistical models to test hypotheses and the documentation of results for submission to peer-reviewed journals. Epidemiologists may draw on a number of other scientific disciplines such as biology in understanding disease processes and social science disciplines including sociology and philosophy in order to better understand proximate and distal risk factors." [Source:]
  • Outbreak: "is a classification used in epidemiology to describe a small, localized group of people or organisms infected with a disease. Such groups are often confined to a village or a small area. Two linked cases of an infectious disease are usually sufficient to constitute an outbreak. Outbreaks may also refer to epidemics, which affect a region in a country or a group of countries, or pandemics, which describe global disease outbreaks." [Source:]
  • EID: "Emerging infectious diseases (EIDs) are caused by pathogens that have increased in incidence, geographic or host range, have changed pathogenesis, or are newly-evolved or newly-recognized. Over three-quarters of emerging infectious diseases are a result of zoonotic pathogens. Evidence suggests that emerging diseases are driven largely by anthropogenic environmental changes and/or changes in human demographics and behavior. In certain areas, these factors act on a background of high pathogen biodiversity and will alter host-parasite dynamics driving the emergence of known and unknown pathogens." [Source:]
  • Biosurveillance
