I am a computational linguist with over 10 years of experience in Natural Language Processing (NLP) and Machine Learning (ML), 20+ peer-reviewed publications, and more than 5 years as a professional software developer responsible for designing, implementing, and deploying custom NLP solutions.
My expertise centers on information extraction and assembling knowledge graphs from unstructured text. While I favor neuro-symbolic approaches that are both explainable and directly editable by humans, I believe in choosing the right tool for the job and pride myself on identifying simple and effective solutions.
I have a track record of multidisciplinary collaborative research as a tenure-track professor, with funding from agencies such as DARPA, NSF, and the CDC. I also bring years of applied research experience in industry as both a co-founder of a bootstrapped startup and an individual contributor at a large corporation, developing new IP and production-quality code in languages like Python and Scala.
As an Applied Scientist with the Product Graph team within Personalization, I work on automatically discovering novel product dimensions and enriching the Amazon Catalog to help customers make informed purchase decisions.
- Designed and implemented a neuro-symbolic system to discover novel attributes and compatibility information from product profiles in support of a multi-team knowledge graph initiative.
- Designed and implemented, in less than one week, a scalable system for weakly labeling sections of e-commerce web pages for document layout analysis and targeted information extraction.
- Implemented and deployed a series of scheduled Amazon-scale ETL tasks using Apache Spark and PySpark to improve automation and prevent data drift.
University of Arizona
Assistant Professor (tenure-track) of Computational Linguistics at the University of Arizona and founding director of both the online MS in Human Language Technology (HLT) and the Graduate Certificate Program in Natural Language Processing (NLP).
I design and teach graduate-level courses in statistical natural language processing (NLP) that cover both "classical" machine learning (ML) and deep learning (DL) methods for NLP.
My research is funded by agencies such as DARPA, NSF, and the CDC. I also hold appointments in the Cognitive Science GIDP and the Computational Social Science Graduate Certificate Program.
- Investigator who helped secure over $9M in grant-based funding from multiple federal agencies.
- Managed and trained remote and in-person teams of junior NLP researchers and software developers.
- Developed, documented, and deployed CI pipelines and NLP software for multiple federal agencies on AWS and AWS GovCloud.
- Within a year of being hired during the University of Arizona's largest budget cuts, I designed a graduate curriculum, oversaw the development of its courses, and launched a fully online MS program in Human Language Technology that has attracted a global body of students.
I am co-founder of Lum AI, a small bootstrapped NLP startup focusing on large-scale machine reading and rapid text annotation/data labeling.
In addition to product development, I manage our AWS deployments (mostly ECS, Terraform, and GitHub Actions).
- Designed and implemented a resilient actor-based distributed version of the Odinson information extraction system that requires less than 30% of the resources of an Elasticsearch cluster.
- Designed, implemented, and deployed horizontally-scalable containerized services for information extraction on AWS with rolling deployments triggered via changes to the default branch of a Git repository.
- Managed a small, distributed team of software developers and applied scientists over multiple projects.
PhD (Computational Linguistics)
MS (Human Language Technology)
MA (Applied Linguistics)
Projects and Software
A web-based platform (hosted solution) for rapid text annotation and data labeling for NLP.
A fast and highly scalable language and runtime system for information extraction that supports patterns composed of graph traversals and token-level constraints. The successor to Odin. Odinson is
- IDE design, development, and deployment (closed source)
- a REST API and companion Python library
- language features and testing
- development of a distributed version for web-scale information extraction using Akka (development of this component was funded by DARPA's Causal Exploration program).
A platform for literature-based discovery that incorporates multi-domain extractions of causal interactions into a single searchable knowledge graph. Originally developed to support the Bill and Melinda Gates Foundation's efforts to improve child and maternal health. Create conceptual models (interest maps) by searching for direct and indirect influence relations, merging concepts, injecting your own expertise, and collaboratively editing models.
- system architecture and deployment (AWS)
- open domain machine reader and assembly system
- incorporation and alignment of citation graph (MAG) information and clinical trials (this component was funded by the Bill and Melinda Gates Foundation as part of their KI Platform Prototype)
Information extraction system for BioNLP that includes components for event extraction, NER, domain-specific coreference resolution, causal event ordering, and grounding. Reach was the most precise and highest-throughput machine reading system in DARPA's Big Mechanism program, and has been used by biologists to discover novel and plausible biological hypotheses for multiple cancers.
- broad coverage and extensible information extraction of biomolecular statements described in scholarly documents. These statements often describe complex nested relations (e.g., a positive regulation involving a particular post-translational modification)
- assembly and causal ordering of model fragments of cell signaling pathways
- coreference resolution tailored to the biomedical domain ("how can we automatically determine the antecedent of an expression like 'the protein'?")
Grants and Contracts
ADHS-CDC COVID Disparities Initiative
Agency: CDC & Arizona Department of Health Services
Address COVID-19 health disparities among underserved and high-risk populations in Arizona, including racial and ethnic minorities as well as rural communities.
- personalized question-answering and semantic search systems for different audiences (community health workers, patients, etc.) that operate over curated document collections
- automatic speech recognition (ASR)
- monitoring trusted information sources to detect policy changes
- machine translation, summarization, and customized message generation
- modernizing cyberinfrastructure for health communication (e.g., telemedicine systems)
Democratizing machine reading for non-experts
Democratizing machine reading for non-experts: Easy and interpretable methods to extract structured information from text
This work aims to democratize machine reading technology to make it accessible to subject matter experts (e.g., molecular biologists) who may be entirely unfamiliar with natural language processing and machine learning. In an effort to hybridize symbolic and statistical approaches, my collaborators and I are leveraging neural methods for program synthesis and reinforcement learning to generate executable, human-editable rules for rapid information extraction.
- Determining research direction
- System design and implementation
- Deployment of GPU-accelerated software demos on AWS
Supply chain Quantification Using Imperfect Data (SQUID)
Role: PI (Phase I subcontract through Raytheon BBN)
Improve the efficiency of the military supply chain by constructing operational process models from fragmented data.
- information and event extraction related to logistics processes (supply chain events) and event ordering
- query parsing and intent understanding to power a chatbot interface for logisticians
- document layout analysis for PDFs