
Tuesday, October 26

10:00am CDT

Emu: Species-Level Microbial Community Profiling for Full-Length Nanopore 16S Reads
Technical Presentations Group 1: Algorithms, Foundations, Visualizations, and Engineering Applications

16S rRNA-based analysis is the established standard for elucidating microbial community composition. However, because short-read data cover only a portion of the 16S gene, such analysis is limited to genus-level resolution at best. Species-level accuracy is imperative: two bacterial species within the same genus can have drastically different effects on their community and on human health. Full-length 16S sequences have the potential to provide species-level resolution, yet taxonomic identification algorithms designed for previous-generation sequencers are not optimized for the increased read length and error rate of Oxford Nanopore Technologies (ONT). Here, we present Emu, a novel approach that employs an Expectation-Maximization (EM) algorithm to generate a taxonomic abundance profile from full-length 16S rRNA reads. We demonstrate accurate sample composition estimates by our new software through analysis of two mock communities and one simulated data set. We also show that Emu yields fewer false positives and false negatives than previous methods on both short- and long-read data. Finally, we illustrate a real-world application of Emu by processing vaginal microbiome samples from women with and without vaginosis, where we observe distinct species-level differences in microbial composition between the two groups that are fully concordant with prior research in this important area. In summary, full-length 16S ONT sequences, paired with Emu, open a new realm of microbiome research possibilities. Emu shows that, with the appropriate method, increased accuracy can be obtained from nanopore long reads despite their higher error rate. Our novel software tool allows researchers to further leverage the portable, real-time sequencing provided by ONT for accurate, efficient, and low-cost characterization of microbial communities.
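The expectation-maximization loop at the heart of this kind of abundance estimation is compact enough to sketch. The toy below is an illustration, not Emu's implementation: `read_probs` is a hypothetical matrix of alignment likelihoods P(read | species), and each iteration alternates an E-step (posterior species assignment per read) with an M-step (re-estimated community abundances).

```python
# Toy EM for estimating species abundances from read-alignment likelihoods.
# read_probs[r][s] = P(read r | species s); rows need not be normalized.

def em_abundance(read_probs, n_iter=100):
    n_species = len(read_probs[0])
    abund = [1.0 / n_species] * n_species          # uniform initial profile
    for _ in range(n_iter):
        counts = [0.0] * n_species
        for probs in read_probs:
            # E-step: posterior that this read came from each species
            joint = [a * p for a, p in zip(abund, probs)]
            total = sum(joint)
            for s in range(n_species):
                counts[s] += joint[s] / total
        # M-step: abundances = expected fraction of reads per species
        abund = [c / len(read_probs) for c in counts]
    return abund

# Two hypothetical species; three reads favor species 0, one is ambiguous.
reads = [[0.9, 0.1], [0.8, 0.2], [0.9, 0.1], [0.5, 0.5]]
profile = em_abundance(reads)
```

With the likelihoods above, the estimated profile concentrates on species 0, showing how ambiguous reads are apportioned by the community-wide abundances rather than counted twice.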

Author: Kristen Curry (Rice University)


Kristen Curry

Rice University

Tuesday October 26, 2021 10:00am - 10:15am CDT

10:15am CDT

Co-Manifold Learning
Technical Presentations Group 1: Algorithms, Foundations, Visualizations, and Engineering Applications

Representation learning is typically applied to only one mode of a data matrix, either its rows or columns. Yet in many applications, there is an underlying geometry to both the rows and the columns. We propose utilizing this coupled structure to perform co-manifold learning: uncovering the underlying geometry of both the rows and the columns of a given matrix. Our framework is based on computing a multiresolution view of the data at different combinations of row and column smoothness by solving a collection of continuous optimization problems. We demonstrate our method’s ability to recover the underlying row and column geometry in simulated examples and real cheminformatics data.

Authors: Eric Chi (Rice University), Gal Mishne (University of California, San Diego), and Ronald Coifman (Yale University)


Eric Chi

Rice University

Tuesday October 26, 2021 10:15am - 10:30am CDT

10:30am CDT

Random-Walk Based Graph Representation Learning Revisited
Technical Presentations Group 1: Algorithms, Foundations, Visualizations, and Engineering Applications

Representation learning is a powerful framework for enabling the application of machine learning to complex data via vector representations. Here, we focus on representation learning for vertices of a graph using random walks. We introduce a framework for node embedding based on three dimensions: type of process, similarity metric, and embedding algorithm. Our framework not only covers many existing approaches but also motivates new ones. In particular, we apply it to produce new state-of-the-art results on link prediction.
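The first of those three dimensions, the random-walk process, is simple to make concrete. Below is a minimal sketch (not the authors' code) of generating uniform random walks from an adjacency list; a similarity metric and an embedding algorithm would then be built on top of the walk statistics.

```python
import random

def random_walks(adj, walk_len=5, walks_per_node=2, seed=0):
    """Generate uniform random walks over a graph given as an adjacency list."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_len:
                nbrs = adj[walk[-1]]
                if not nbrs:            # dead end: stop the walk early
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

# Tiny 4-cycle: 0-1-2-3-0
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
walks = random_walks(adj)
```

Swapping the uniform `rng.choice` for a biased transition rule is exactly the kind of "type of process" variation the framework parameterizes.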

Authors: Zexi Huang (UCSB), Arlei Silva (Rice University), and Ambuj Singh (UCSB)


Arlei Silva

Rice University

Tuesday October 26, 2021 10:30am - 10:45am CDT

10:45am CDT

Tuesday October 26, 2021 10:45am - 11:00am CDT

11:00am CDT

Multi-Task Deep Learning Framework for Sales Forecasting and Product Recommendation
Technical Presentations Group 2: AI for Good + Business Impact/Industry

Sales forecasting and product recommendation are important tasks for business-to-business (B2B) companies, particularly as more business transactions occur through digital channels (eCommerce). Transaction data contain both explicit signals (price, revenue, ratings) and implicit signals (product purchases, user clicks). Sales prediction, based on explicit signals, and product recommendation, based on implicit signals, are commonly handled by separate machine learning models. We propose a new multi-task learning framework that performs a joint optimization to carry out the prediction and recommendation tasks simultaneously. This multi-task deep learning model captures and predicts seasonality in the data and has an effective sampling mechanism to improve implicit feedback for the recommendation task. Our experiments on real B2B transaction datasets show that the multi-task model achieves comparable or better performance on both tasks than single-task models (around 40% lower mean absolute percentage error and a 30% improvement in Diversity@K, the percentage of all items captured in the top-K recommendations). In addition, the multi-task model enables better solutions to problems such as cold start and collaborative filtering.

Authors: Wenshen Song (PROS Inc.), Yan Xu (PROS Inc.), Faruk Sengul (PROS Inc.), and Justin Silver (PROS Inc.)


Tuesday October 26, 2021 11:00am - 11:15am CDT

11:15am CDT

Using Visual Feature Space as a Pivot Across Languages
Technical Presentations Group 2: AI for Good + Business Impact/Industry

People can create image descriptions in thousands of languages, but these languages share only one visual space. The aim of this work is to leverage visual feature space to pass information across languages. We show that models trained to generate textual captions in more than one language, conditioned on an input image, can leverage their jointly trained feature space during inference to pivot across languages. In particular, we demonstrate improved quality of a caption generated from an input image when leveraging a caption in a second language. More importantly, we demonstrate that even without conditioning on any visual input, the model has implicitly learned to perform, to some extent, machine translation from one language to another through the shared visual feature space, even though the multilingual captions used for training were created independently.

Authors: Ziyan Yang (Rice University), Leticia Pinto-Alva (University of Southern California), Franck Dernoncourt (Adobe Research), and Vicente Ordóñez (Rice University)


Ziyan Yang

Rice University

Tuesday October 26, 2021 11:15am - 11:30am CDT

11:30am CDT

Math Word Problem Generation with Mathematical Consistency and Problem Context Constraints
Technical Presentations Group 2: AI for Good + Business Impact/Industry

We study the problem of generating arithmetic math word problems (MWPs) given a math equation that specifies the mathematical computation and a context that specifies the problem scenario. Existing approaches are prone to generating MWPs that are either mathematically invalid or have unsatisfactory language quality. They also either ignore the context or require manual specification of a problem template, which compromises the diversity of the generated MWPs. In this paper, we develop a novel MWP generation approach that leverages i) pre-trained language models and a context keyword selection model to improve the language quality of the generated MWPs and ii) an equation consistency constraint for math equations to improve the mathematical validity of the generated MWPs. Extensive quantitative and qualitative experiments on three real-world MWP datasets demonstrate the superior performance of our approach compared to various baselines.

Authors: Zichao Wang (Rice University), Andrew Lan (University of Massachusetts Amherst), and Richard Baraniuk (Rice University)


Zichao (Jack) Wang

Rice University

Tuesday October 26, 2021 11:30am - 11:45am CDT

11:45am CDT

Multi-Task Learning for Demand Prediction Through a Hyper-Network
Technical Presentations Group 2: AI for Good + Business Impact/Industry

Demand for consumer goods depends on a variety of factors such as price, seasonality, competitor prices, geographic location, and demographic data. A common practice is to use some features, such as geographic and demographic data, to segment the market and build an individual model for each segment. However, with this approach we lose potentially valuable information that could be learned across segments. Hence, we propose a method for simultaneously learning multiple demand models so that they borrow knowledge from one another and improve accuracy, especially for models with sparser data. Specifically, we propose using a neural network as a hyper-network to estimate the parameters of each demand model. Our approach leads to knowledge sharing across models, as opposed to independent model fitting for each task, while producing a model that is computationally tractable. Results of applying the proposed method to large-scale real data show improved prediction accuracy and price elasticity estimates compared with the common two-step approach of clustering followed by independent models.
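A hyper-network in this sense is simply a model whose output is the parameters of another model. The sketch below is an illustration of the idea, not the paper's architecture: a fixed linear "hyper-network" (all weights here are made up) maps a segment's feature vector to the (intercept, price-slope) pair of that segment's linear demand model, so every segment's demand model is generated from one shared set of hyper-network weights.

```python
def hyper_params(seg_features, W, b):
    """Linear hyper-network: segment features -> demand-model parameters."""
    return [sum(w_i * x for w_i, x in zip(row, seg_features)) + b_i
            for row, b_i in zip(W, b)]

def demand(price, params):
    """Toy linear demand model whose parameters a hyper-network produced."""
    intercept, slope = params
    return intercept - slope * price

# Hypothetical hyper-network weights, shared by every market segment.
W = [[50.0, 5.0],   # row producing the intercept
     [2.0, 0.1]]    # row producing the price slope
b = [10.0, 0.5]

urban = hyper_params([1.0, 0.0], W, b)   # segment features: [urban, income]
rural = hyper_params([0.0, 1.0], W, b)
```

In the real method the hyper-network is trained jointly over all segments, so gradient updates from data-rich segments improve the parameter estimates of sparse ones.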

Authors: Manu Chaudhary (PROS), Yanyan Hu (University of Houston), and Shahin Boluki (PROS)


Tuesday October 26, 2021 11:45am - 12:00pm CDT

12:00pm CDT

Lunch + Networking
Tuesday October 26, 2021 12:00pm - 1:00pm CDT

12:00pm CDT

ML for Energy Transition
Talk 1: End-to-End Approaches to Enhance CO2 Capture - Cécile Pereira
Nanoporous materials can be used as solid adsorbents to capture CO2 from combustion flue gases or directly from the air using what is called a temperature swing adsorption (TSA) process. In this process, the gas containing the CO2 is injected into a gas (CO2 source) / solid (sorbent material) contactor, where the pores of the material selectively adsorb the CO2. Once the adsorbent is saturated, a highly enriched CO2 gas stream is recovered by purging the contactor with a combination of heat and steam. If the right nanoporous material can be found, a cost-effective approach to CO2 capture may be achievable. ACO2RDS (Adsorptive CO2 Removal from Dilute Sources) is a multi-year project to develop transformative solid-sorbent-based technologies for CO2 capture from dilute sources, specifically natural gas combined cycle (NGCC) power plant flue gas and atmospheric CO2 via direct air capture (DAC). In this presentation, we introduce the ACO2RDS project and review key state-of-the-art publications on the topic.

Talk 2: A Deep Learning-Accelerated Data Assimilation and Forecasting Workflow for Commercial-Scale Geologic Carbon Storage - Hewei Tang
Fast assimilation of monitoring data to forecast the transport of materials in heterogeneous media has many important applications, including the management of CO2 migration in geologic carbon storage reservoirs. It is often critical to assimilate emerging data and make forecasts in a timely manner. However, the high computational cost of data assimilation with a high-dimensional parameter space undermines our ability to achieve this goal.

In the context of geologic carbon storage, we propose to combine physical understanding of porous-medium flow behavior with deep learning techniques to develop a fast history-matching and reservoir-response-forecasting workflow. Applying an Ensemble Smoother with Multiple Data Assimilation (ES-MDA) framework, the workflow updates geologic properties and predicts reservoir performance, with quantified uncertainty, from observed pressure and CO2 plumes. As the most computationally expensive component of such a workflow is reservoir simulation, we developed surrogate models to predict dynamic pressure and CO2 plume extents under multi-well injection. The surrogate models employ deep convolutional neural networks, specifically a wide residual network and a residual U-Net. Intelligent treatments are applied to bridge between quantities in a true 3D reservoir and the single-layer model underlying the workflow. The workflow can complete history matching and reservoir forecasting with uncertainty quantification in less than one hour on a mainstream personal workstation.
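For intuition, the ES-MDA update at the heart of such a workflow can be sketched in a one-parameter toy. This is an illustration with a made-up linear "simulator", not the paper's reservoir workflow: each ensemble member is repeatedly nudged toward parameter values whose simulated data match the observation, with the observation-error variance inflated by a factor alpha in each of the assimilation steps so that the inflations sum to one full update.

```python
import random

def es_mda_scalar(forward, m_ens, d_obs, obs_var, n_assim=4, seed=0):
    """One-parameter ES-MDA: repeatedly nudge an ensemble of parameter
    guesses toward values whose simulated data match the observation."""
    rng = random.Random(seed)
    alpha = n_assim                     # equal inflation: sum(1/alpha) = 1
    for _ in range(n_assim):
        d_ens = [forward(m) for m in m_ens]
        m_mean = sum(m_ens) / len(m_ens)
        d_mean = sum(d_ens) / len(d_ens)
        # Sample cross- and auto-covariances of parameters and data
        c_md = sum((m - m_mean) * (d - d_mean)
                   for m, d in zip(m_ens, d_ens)) / (len(m_ens) - 1)
        c_dd = sum((d - d_mean) ** 2 for d in d_ens) / (len(d_ens) - 1)
        gain = c_md / (c_dd + alpha * obs_var)
        m_ens = [m + gain * (d_obs + rng.gauss(0, (alpha * obs_var) ** 0.5) - d)
                 for m, d in zip(m_ens, d_ens)]
    return m_ens

# Hypothetical linear "simulator": data = 2 * parameter; observed d = 6,
# so the ensemble should concentrate near the true parameter value 3.
posterior = es_mda_scalar(lambda m: 2.0 * m, [0.0, 1.0, 2.0, 4.0, 6.0], 6.0, 0.01)
```

In the actual workflow the scalar `forward` is replaced by the CNN surrogate simulators, which is what makes the repeated ensemble evaluations affordable.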

Talk 3: Monitoring of Microseismic for CO2 Sequestration - Bob Clapp
Monitoring of microseismic events is going to play an important role in evaluating CO2 reservoirs during injection. Since DAS fibers are installed down wells and are thus close to the microseismic events, they hold vast potential for high-resolution analysis of their continuously-recorded data.

However, accurately detecting microseismic signals in continuous data is challenging and time-consuming. DAS acquisitions generate substantial data volumes, and microseismic events have a low signal-to-noise ratio in individual DAS channels.

Herein we design, train, and deploy a machine learning model to automatically detect microseismic events in DAS data acquired inside a proxy for an injection well in an unconventional reservoir. We create a curated dataset of 6,786 manually picked microseismic events. The machine learning model achieves an accuracy of 98.6% on the benchmark dataset and even detects low-amplitude events missed during manual picking. Our methodology detects over 100,000 events, allowing us to accurately reconstruct the spatio-temporal fracture development.


Cécile Pereira

Data Science & AI Research Scientist, TotalEnergies
Cécile Pereira is a research scientist in the digital domain, working for Total CSE, Data Science & AI team. Her current research focuses on the development of new products and materials. She is strongly involved in the computational chemistry project, and she is co-supervising the...

Hewei Tang

Postdoctoral Staff Member, Lawrence Livermore National Laboratory (LLNL)
Dr. Hewei Tang is currently a postdoctoral staff member in Lawrence Livermore National Laboratory’s Atmospheric, Earth, and Energy Division. She holds a Ph.D. degree in Petroleum Engineering from Texas A&M University. Dr. Tang serves as an Associate Editor of Journal of Petroleum...

Bob Clapp

Technical Director, Stanford Center for Computational Earth and Environmental Science
Dr. Robert “Bob” Clapp is Technical Director of the Stanford Center for Computational Earth and Environmental Science. He has been at Stanford University for two decades, during which time he has published dozens of articles and presented talks on a wide range of geophysical and...

Mauricio Araya

Senior R&D Manager HPC & ML, TotalEnergies
Mauricio Araya is a Senior Computer Scientist and lead researcher working at TotalEnergies EP R&T USA. He is also a lecturer with the Professional Science Master’s Program at the Weiss School of Natural Science of Rice University, where he teaches computational...

Tuesday October 26, 2021 12:00pm - 1:00pm CDT

1:00pm CDT

ShiftAddNet: A Hardware-Inspired Deep Network
Technical Presentations Group 3: Algorithms, Foundations, Visualizations, and Engineering Applications

Multiplication (e.g., convolution) is arguably a cornerstone of modern deep neural networks (DNNs). However, intensive multiplications incur expensive resource costs that challenge DNNs' deployment on resource-constrained edge devices, driving several attempts at multiplication-less deep networks. This paper presents ShiftAddNet, whose main inspiration is drawn from a common practice in energy-efficient hardware implementation: multiplication can instead be performed with additions and logical bit-shifts. We leverage this idea to explicitly parameterize deep networks accordingly, yielding a new type of network that involves only bit-shift and additive weight layers. This hardware-inspired ShiftAddNet immediately leads to both energy-efficient inference and training, without compromising expressive capacity compared to standard DNNs. The two complementary operation types (bit-shift and add) additionally enable finer-grained control of the model's learning capacity, leading to a more flexible trade-off between accuracy and efficiency, as well as improved robustness to quantization and pruning. We conduct extensive experiments and ablation studies, all backed up by our FPGA-based ShiftAddNet implementation and energy measurements. Compared to existing DNNs and other multiplication-less models, ShiftAddNet reduces the hardware-quantified energy cost of DNN training and inference by over 80% while offering comparable or better accuracy.
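The hardware trick ShiftAddNet builds on is elementary: multiplication by a power of two is a single bit-shift, and a general multiplication decomposes into shifts and adds. A minimal integer illustration of that decomposition (the primitive, not the network itself):

```python
def shift_add_mul(a, b):
    """Multiply two non-negative integers using only shifts and adds,
    the primitive operations ShiftAddNet-style layers are built from."""
    result = 0
    while b:
        if b & 1:            # lowest bit of b set: add the shifted multiplicand
            result += a
        a <<= 1              # a * 2 via a logical left shift
        b >>= 1              # move to the next bit of b
    return result
```

On hardware, each shift is essentially free and each add is far cheaper than a general multiply, which is where the energy savings originate.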

Authors: Haoran You (Rice University), Xiaohan Chen (The University of Texas at Austin), Yongan Zhang (Rice University), Chaojian Li (Rice University), Sicheng Li (Alibaba DAMO Academy), Zihao Liu (Alibaba DAMO Academy), Zhangyang Wang (The University of Texas at Austin), and Yingyan Lin (Rice University)


Haoran You

Rice University

Tuesday October 26, 2021 1:00pm - 1:15pm CDT

1:15pm CDT

Neural Architecture Search for Inversion
Technical Presentations Group 3: Algorithms, Foundations, Visualizations, and Engineering Applications

Over the years, deep learning has been applied to inversion problems, including frameworks that learn the relationship between the recorded wavefield and velocity (Yang et al., 2016). Here we extend this work in two directions. The first is deriving a more appropriate loss function: pixel-to-pixel comparison may not be the best choice for characterizing image structure, and we elaborate on how to construct a cost function that captures high-level features to enhance model performance. The second is searching for a more appropriate neural architecture, which can be viewed as a subproblem of hyperparameter optimization, itself part of an even bigger picture: automated machine learning, or AutoML. Several well-known networks, such as U-Net, ResNet (He et al., 2016), and DenseNet (Huang et al., 2017), achieve phenomenal results on certain problems, yet it is hard to argue they are the best for inversion without thoroughly searching a given space. Here we present our architecture search results for inversion.

Authors: Xin Zhao (CGG), Licheng Zhang (University of Houston) and Cheng Zhan (Microsoft)


Tuesday October 26, 2021 1:15pm - 1:30pm CDT

1:30pm CDT

Generalized Zero-Shot Learning via Normalizing Flows
Technical Presentations Group 3: Algorithms, Foundations, Visualizations, and Engineering Applications

Generalized Zero-Shot Learning (GZSL) in computer vision refers to the task of recognizing images whose classes are not available during training, but for which other data, such as textual descriptions, are available for all classes. The idea is to leverage the information in these language descriptions to recognize both seen and unseen classes by transferring knowledge across modalities. This setup poses a more realistic scenario for image classification, where it is not always possible to manually collect and annotate images for a specific class, but it is viable to use natural language descriptions. In this work, we explore normalizing flows to generate features from a shared latent space that aligns the image and textual representations. The features synthetically generated by our model are then used to enlarge the training set, so that aligned representations for all seen and unseen classes can be used to train a classifier in a supervised manner. For this purpose, we simultaneously train two invertible neural networks, one for the image representation and the other for the textual description. Our aim is for the features encoded in the forward pass to serve as data embeddings, which we align so that they share the same feature space. In the reverse pass, both networks are required to reconstruct their corresponding input, providing a supervised signal for each modality. With this approach, we outperform previous generative models based on variational autoencoders and generative adversarial networks on the CUB dataset by significant margins.
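The property that makes invertible networks usable in both directions, as this abstract describes, is exact reconstruction in the reverse pass. A minimal sketch of an affine coupling layer, the standard building block of such flows (the tiny `s` and `t` functions below are hypothetical stand-ins for learned sub-networks, not the paper's model):

```python
import math

def couple_forward(x1, x2, s, t):
    """Affine coupling on a 2-D input: x1 passes through unchanged and
    conditions the scale s(x1) and shift t(x1) applied to x2."""
    return x1, x2 * math.exp(s(x1)) + t(x1)

def couple_inverse(y1, y2, s, t):
    """Exact inverse: undo the shift, then the scale, using y1 = x1."""
    return y1, (y2 - t(y1)) * math.exp(-s(y1))

# Hypothetical tiny "networks" for the scale and shift functions.
s = lambda x: 0.5 * x
t = lambda x: x + 1.0

y1, y2 = couple_forward(0.3, -1.2, s, t)
x1, x2 = couple_inverse(y1, y2, s, t)   # recovers the original (0.3, -1.2)
```

Because the inverse is exact regardless of what `s` and `t` compute, the reverse pass can be supervised to reconstruct each modality's input, as the abstract describes.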

Authors: Paola Cascante-Bonilla (University of Virginia), Yanjun Qi (University of Virginia) and Vicente Ordonez (Rice University)


Paola Cascante-Bonilla

University of Virginia

Tuesday October 26, 2021 1:30pm - 1:45pm CDT

1:45pm CDT

Tuesday October 26, 2021 1:45pm - 2:00pm CDT

2:00pm CDT

Explainable Deep Learning Approaches to Predict Development of Brain Metastases in Patients with Lung Cancer Using Electronic Health Records
Technical Presentations Group 4: Healthcare

Brain metastases (BM) from lung cancer account for the majority of BM cases. Brain metastases cause neurological morbidity and affect quality of life, as they can be associated with brain edema; therefore, early detection and prompt treatment are needed to achieve optimal control. In this study, we employed the RNN-based RETAIN model to predict the risk of developing BM among patients diagnosed with lung cancer based on electronic health record (EHR) data. We also extended the feature attribution method Kernel SHAP to structured EHR data to interpret the decision process. The deep learning models utilize the longitudinal information across patient encounters to obtain explainable predictions for BM. Through a series of well-defined cohort construction and case-control matching criteria, the best AUC on the test set was obtained by RETAIN, reaching 0.825, a 3.7% improvement over the baseline model. The high-contribution features identified by RETAIN and Kernel SHAP were strongly related to BM development, especially to higher lung cancer stages. Moreover, a sensitivity analysis demonstrated that both RETAIN and Kernel SHAP can recognize unrelated features and assign more contribution to the important ones.

Authors: Zhao Li (UTHealth), Ping Zhu (UTHealth), Rongbin Li (UTHealth), Yoshua Esquenazi (UTHealth) and W. Jim Zheng (UTHealth)


Tuesday October 26, 2021 2:00pm - 2:15pm CDT

2:15pm CDT

Deep Learning-Based Blood Glucose Predictors In Type 1 Diabetes
Technical Presentations Group 4: Healthcare

Objectives: In this work, we present short-term predictions of blood glucose (BG) levels in people with type 1 diabetes (T1D) obtained with a deep learning-based architecture applied to a multivariate physiological dataset of real T1D patients. Methods: Stacks of convolutional neural network (CNN) and long short-term memory (LSTM) units are proposed to predict BG levels for 30-, 60- and 90-minute prediction horizons (PH), given historical glucose measurements, meal information, and insulin intakes. Predictive capability was evaluated on two real patient datasets, Replace-BG and DIAdvisor. Findings: For the 90-minute PH, our model obtained a mean absolute error (MAE) of 17.30 ± 2.07 and 18.23 ± 2.97 mg/dl, a root mean squared error (RMSE) of 23.45 ± 3.18 and 25.12 ± 4.65 mg/dl, a coefficient of determination (R2) of 84.13 ± 4.22 and 82.34 ± 4.54%, and, in terms of continuous glucose-error grid analysis (CG-EGA), 94.71 ± 3.89% and 91.71 ± 4.32% accurate predictions (AP), 1.81 ± 1.06% and 2.51 ± 0.86% benign errors (BE), and 3.47 ± 1.12% and 5.78 ± 1.72% erroneous predictions (EP), for the Replace-BG and DIAdvisor datasets, respectively. Conclusion: Our investigation demonstrated that our method achieves superior glucose forecasting performance compared to existing approaches in the literature, showing its potential for application in decision support systems for diabetes management.

Authors: Mehrad Jaloli (University of Houston) and Marzia Cescon (University of Houston)


Marzia Cescon

University of Houston

Tuesday October 26, 2021 2:15pm - 2:30pm CDT

2:30pm CDT

MLPrE – A Tool for Preprocessing Data and Conducting Exploratory Data Analysis Prior to Machine Learning Model Construction
Technical Presentations Group 4: Healthcare

Data preparation is one of the less glamorous aspects of data science. Combined with exploratory data analysis (EDA), preparation consumes a significant percentage of a data scientist's time, yet it is critical to do correctly. Starting from data that can exist in multiple formats, the data are manipulated until they match the input requirements of a machine learning (ML) model. The modifications may be guided by EDA, the process of understanding the data: calculating the percentage of NULL values, computing basic statistics, and looking for correlations with other columns. Notebooks such as Jupyter and Zeppelin are great tools for these exercises, but their integration within larger processing pipelines such as Apache Airflow may not be ideal. Our tool, MLPrE, evolved out of the need for early-stage data preparation and analysis that is consistent and repeatable by other data science team members. This is accomplished through a dataframe storage mechanism with stages that describe stepwise changes to that dataframe; the stages are expressed in JavaScript Object Notation (JSON), parsed, and used to direct the code to perform steps in a specific order. Currently there are approximately fifty stages covering input/output, filtering, basic statistics, feature engineering, and EDA. MLPrE is built on Apache Spark and uses Python as its development language.
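The stage-driven design described above can be sketched in miniature. This is an illustrative toy, not MLPrE's code: the stage names and arguments below are made up, and plain lists of dicts stand in for Spark dataframes, but the shape is the same: a JSON document lists stages, and each stage is parsed and applied to the dataframe in order.

```python
import json

def stage_filter_rows(table, column, min_value):
    """Keep only rows whose value in `column` is at least `min_value`."""
    return [row for row in table if row[column] >= min_value]

def stage_null_pct(table, column):
    """EDA-style stage: report the NULL percentage, pass data through."""
    pct = 100.0 * sum(row[column] is None for row in table) / len(table)
    print(f"{column}: {pct:.1f}% NULL")
    return table

STAGES = {"filter_rows": stage_filter_rows, "null_pct": stage_null_pct}

def run_pipeline(table, config_json):
    """Parse a JSON list of stages and apply them to the table in order."""
    for stage in json.loads(config_json):
        fn = STAGES[stage.pop("stage")]
        table = fn(table, **stage)      # remaining keys are stage arguments
    return table

config = '[{"stage": "filter_rows", "column": "age", "min_value": 18}]'
data = [{"age": 25}, {"age": 12}, {"age": 40}]
result = run_pipeline(data, config)     # keeps the two rows with age >= 18
```

Because the whole pipeline is a JSON document, the same preparation can be re-run verbatim by another team member or scheduled from an orchestrator such as Airflow.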

Authors: David Maxwell (University of Texas M D Anderson Cancer Center), Ya Zhang (University of Texas M D Anderson Cancer Center), James Lomax III (University of Texas M D Anderson Cancer Center), Robert Brown (University of Texas M D Anderson Cancer Center), Brian Dyrud (University of Texas M D Anderson Cancer Center), Melody Page (University of Texas M D Anderson Cancer Center), Mary McGuire (University of Texas M D Anderson Cancer Center), Daniel Wang (University of Texas M D Anderson Cancer Center), and Caroline Chung (University of Texas M D Anderson Cancer Center)


David Maxwell

University of Texas M D Anderson Cancer Center

Tuesday October 26, 2021 2:30pm - 2:45pm CDT

2:45pm CDT

Diabetes Management in Underserved Communities: Data-Driven Insights from Continuous Glucose Monitoring
Technical Presentations Group 4: Healthcare


Continuous glucose monitoring (“CGM”) has proven beneficial for people with diabetes, providing real-time feedback and clear glucose targets. Unfortunately, the supporting evidence comes almost exclusively from well-educated White individuals living with type 1 diabetes who can afford health insurance. There is limited understanding of CGM's utility for people with type 2 diabetes (“T2D”). This is a massive gap, given that T2D accounts for 90-95% of all diabetes cases. The gap is further intensified by two factors. First, there are negligible studies on CGM use in underserved communities, including racial/ethnic minorities who bear a disproportionate burden of the disease. Second, current CGM guidelines are based on summary statistics that smooth out the effect of potentially prognostic glucose patterns observed at different times of the day. We propose a fine-grained analysis of CGM data to discover clinically meaningful physiological and behavioral insights into T2D. These insights can then help design more effective and affordable treatments, which can significantly benefit underserved communities.


CGMs capture glucose readings every 15 minutes, providing high-resolution temporal information that may detect diabetes onset and progression. Based on prior clinical research on T2D progression, we hypothesize that increasing diabetes risk is associated with: (i) increased glucose abnormalities with distinct patterns during the day vs. overnight, and (ii) bigger glucose surges after meals, most clearly observable after breakfast.


We analyzed 2 weeks of CGM data from 119 participants from an underserved community in Santa Barbara, CA (predominantly Hispanic/Latino females, 54.4 ± 12.1 years old), stratified into three groups of increasing diabetes risk: (i) 35 normal but at risk of T2D (“at-risk”), (ii) 49 with prediabetes (“pre-T2D”), and (iii) 35 with T2D.

Overnight vs. rest of the day analysis: T2D participants spent significantly higher time in the elevated glucose range of 140-180 mg/dL throughout the day than at-risk and pre-T2D individuals (p<0.0001). Pre-T2D participants, interestingly, spent higher time between 140-180 mg/dL compared to at-risk individuals during the day (p<0.01) but not overnight.

Breakfast analysis: T2D participants had more prominent and more prolonged glucose peaks than the other two groups, with significantly greater height and duration of breakfast glucose peaks than at-risk and pre-T2D participants (p<0.0001 and p<0.01, respectively).


We observed a distinct progression of glucose abnormality across a cohort of predominantly Hispanic/Latino individuals at risk of T2D, those with pre-T2D, and those with T2D. Our results suggest that: (i) disease progression is initially associated with greater glucose excursions during the day and then eventually overnight; and (ii) glucose peaks after breakfast become taller and take longer to attain with increasing diabetes severity. Both sets of results provide a CGM-based approach to monitoring diabetes progression at home. In the future, we need to validate our findings in longer-duration studies and in other populations. Nevertheless, the proposed data-driven measures have the potential to detect diabetes onset early and offer opportunities for new pharmacological and non-pharmacological diabetes treatment regimens that can better benefit underserved communities disproportionately burdened with the disease.
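The day-versus-overnight stratification used in this analysis is straightforward to express in code. A minimal sketch with made-up readings (the only detail taken from the study is the 140-180 mg/dL elevated band; the sample values and the 0-6 am overnight window are illustrative assumptions):

```python
def pct_time_in_range(readings, lo=140.0, hi=180.0, overnight=(0, 6)):
    """readings: list of (hour_of_day, glucose_mg_dl) CGM samples.
    Returns the % of daytime and overnight samples in the [lo, hi] band."""
    def pct(samples):
        return 100.0 * sum(lo <= g <= hi for _, g in samples) / len(samples)
    night = [(h, g) for h, g in readings if overnight[0] <= h < overnight[1]]
    day = [(h, g) for h, g in readings if not (overnight[0] <= h < overnight[1])]
    return pct(day), pct(night)

# Hypothetical samples: elevated after breakfast, normal overnight.
samples = [(2, 110), (4, 118), (8, 165), (9, 172), (13, 150), (20, 128)]
day_pct, night_pct = pct_time_in_range(samples)
```

Comparing `day_pct` against `night_pct` per participant group is the kind of time-stratified measure the abstract argues standard whole-day summaries smooth away.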

Authors: Souptik Barua (Rice University), Namino Glantz (Sansum Diabetes Research Institute), David Kerr (Sansum Diabetes Research Institute) and Ashutosh Sabharwal (Rice University)


Souptik Barua

Rice University

Tuesday October 26, 2021 2:45pm - 3:00pm CDT

3:00pm CDT

Closing Remarks

Angela Wilkins

Executive Director, The Ken Kennedy Institute
Angela Wilkins is the Executive Director of the Ken Kennedy Institute. Angela is responsible for the development and implementation of Ken Kennedy Institute’s programs in the computational sciences. After earning a Ph.D. in theoretical physics from Lehigh University, she shifted...

Tuesday October 26, 2021 3:00pm - 3:05pm CDT

4:30pm CDT

Outdoor Networking Reception with Sponsors
Join us for an outdoor networking reception with sponsors on Tuesday, October 26th at Holman Draft Hall from 4:30 to 6:30 pm.

Thank you to our sponsor, DDN, for making this reception possible!

Tuesday October 26, 2021 4:30pm - 6:30pm CDT
Holman Draft Hall 820 Holman St, Houston, TX 77002
Wednesday, October 27

9:00am CDT

Conference Registration, Breakfast, Networking
Wednesday October 27, 2021 9:00am - 9:45am CDT

10:00am CDT

Wednesday October 27, 2021 10:00am - 10:05am CDT

10:05am CDT

Automatic Machine Learning with AutoGluon - Algorithms, Domains, Applications
AutoML is the ultimate challenge for machine learning algorithms. After all, design choices need to be made automatically, and tools need to work reliably all the time within a given budget of computation and time. This poses exciting (and many unsolved) problems in terms of model selection, calibration, optimization, adaptive design of priors, and data detection. In this talk I give an overview of the associated scientific problems and the current state of the art in terms of what goes into AutoGluon.


Alex Smola

VP and Distinguished Scientist, Amazon Web Services
Alex Smola studied physics at the University of Technology, Munich and at AT&T Research in Holmdel. He received a doctoral degree in computer science from the University of Technology Berlin in 1998. He worked at the Fraunhofer Gesellschaft (1996-1999), NICTA (2004-2008...

Wednesday October 27, 2021 10:05am - 10:50am CDT

10:50am CDT

Scalable and Sustainable AI Acceleration for Everyone: Hashing Algorithms Train Billion-parameter AI Models on a Commodity CPU faster than Hardware Accelerators
Current Deep Learning (DL) architectures are growing larger to learn from complex datasets. Training and tuning astronomical-sized models is time- and energy-consuming and stalls progress in AI. Industries are increasingly investing in specialized hardware and deep learning accelerators like TPUs and GPUs to scale up the process. It is taken for granted that commodity CPUs are incapable of outperforming powerful accelerators such as GPUs in a head-to-head comparison of training large DL models. However, GPUs come with additional concerns: expensive infrastructure changes that few can afford, difficulty of virtualization, main-memory limitations, and chip shortages. Furthermore, the energy consumption of current AI training is prohibitively expensive. An article from MIT Technology Review noted that training one deep learning model generates a larger carbon footprint than five cars over their lifetimes.

In this talk, I will demonstrate the first algorithmic progress that exponentially reduces the computation cost associated with training neural networks by mimicking the brain's sparsity. We will show how data structures, particularly hash tables, can be used to design an efficient "associative memory" that reduces the number of multiplications associated with training neural networks. Implementation of this algorithm challenges the common knowledge prevailing in the community that specialized processors like GPUs are significantly superior to CPUs for training large neural networks. The resulting algorithm is orders of magnitude cheaper and more energy-efficient. Our careful implementations can train billion-parameter recommendation models on refurbished older-generation CPUs significantly faster than top-of-the-line TensorFlow alternatives on the most potent A100 GPU clusters. In the end, I will discuss the current and future state of this line of work, along with a brief discussion of planned extensions.
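The hash-table "associative memory" can be pictured with a small locality-sensitive hashing sketch: random hyperplanes (SimHash) bucket neurons by weight direction, so an input retrieves only the neurons likely to have large dot products with it, and only that subset participates in the forward/backward pass. The table sizes, SimHash as the concrete scheme, and all names below are illustrative assumptions, not the talk's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, NEURONS, BITS, TABLES = 16, 1000, 8, 4
weights = rng.standard_normal((NEURONS, DIM))       # one weight vector per neuron
planes = rng.standard_normal((TABLES, BITS, DIM))   # random hyperplanes per table

def simhash(v, t):
    # Sign pattern of v against table t's hyperplanes, packed into a bucket id.
    bits = (planes[t] @ v > 0).astype(int)
    return int("".join(map(str, bits)), 2)

# Build: hash every neuron's weight vector into each table once.
tables = [{} for _ in range(TABLES)]
for n in range(NEURONS):
    for t in range(TABLES):
        tables[t].setdefault(simhash(weights[n], t), set()).add(n)

def active_neurons(x):
    # Union of the buckets the input falls into across all tables.
    hits = set()
    for t in range(TABLES):
        hits |= tables[t].get(simhash(x, t), set())
    return hits

x = rng.standard_normal(DIM)
active = active_neurons(x)
print(len(active), "of", NEURONS, "neurons selected")
```

Because only a tiny fraction of neurons is touched per input, the number of multiplications drops from O(NEURONS) to roughly the retrieved-set size.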


Anshumali Shrivastava

Associate Professor Computer Science, Electrical and Computer Engineering, Statistics, and Founder of ThirdAI Corp., Rice University
Anshumali Shrivastava's research focuses on Large Scale Machine Learning, Scalable and Sustainable Deep Learning, Randomized Algorithms for Big-Data and Graph Mining.

Wednesday October 27, 2021 10:50am - 11:35am CDT

11:35am CDT

Lunch + Networking
Wednesday October 27, 2021 11:35am - 12:30pm CDT

12:30pm CDT

Democratizing Deep Learning with Commodity Hardware: How to Train Large Deep Learning Models on CPU Efficiently with Sparsity
GPUs are expensive, require premium infrastructure, and are hard to virtualize. Furthermore, our models and data are growing faster than GPU memory. The communication cost of distributing the models over GPUs is prohibitively expensive for most workloads.

Wouldn't it be nice if we could train large models with commodity CPUs faster than GPUs? CPUs are cheap, well-understood, and ubiquitous hardware. The main memory of a CPU server can easily reach terabytes (TB) with minimal investment. For very large models, we can fit both the model and the data in CPU RAM.

This tutorial will focus on a new emerging paradigm of deep learning training using sparsity and hash tables. We will introduce the idea of selectively identifying parameters and sparsity patterns during training. We will demonstrate the integration of these algorithms into existing Python code. As a result, we demonstrate significantly superior deep learning capabilities on CPUs, making them competitive with (or even better than) state-of-the-art packages on some of the best GPUs. If time permits, we will briefly discuss the multi-node implementation and some thoughts on how to train outrageously large models (tens of billions of parameters or more) on small commodity clusters.


Anshumali Shrivastava

Associate Professor Computer Science, Electrical and Computer Engineering, Statistics, and Founder of ThirdAI Corp., Rice University
Anshumali Shrivastava's research focuses on Large Scale Machine Learning, Scalable and Sustainable Deep Learning, Randomized Algorithms for Big-Data and Graph Mining.

Nicholas Meisburger

Rice University

Shabnam Daghaghi

Rice University

Minghao Yan

Rice University

Wednesday October 27, 2021 12:30pm - 2:30pm CDT

12:30pm CDT

How to Deal with Volume and Velocity Associated with Hundreds of Terabytes (and Beyond) of Genomics Data?
Whole-genome shotgun sequencing (WGS) has enabled numerous breakthroughs in large-scale comparative genomics research. However, the size of genomic datasets has grown exponentially over the last few years.

This tutorial will focus on two new emerging techniques to handle the challenges associated with Volume and Velocity.

1. Repeated and Merged Bloom Filters (RAMBO) for processing hundreds of terabytes of sequence data. We will see how we indexed 170 TB of bacterial and viral sequences in less than 14 hours on a shared cluster at Rice, enabling searches for similar or anomalous sequences in a few milliseconds.

2. How to subsample high-velocity metagenomics data while keeping its diversity intact. We will discuss how to handle data generated at a very high rate and present an efficient sampling scheme roughly as fast as random sampling (RS) that, unlike RS, preserves the diversity of the genomic pool. We will also discuss how these techniques can be pushed to the edge thanks to their tiny memory requirements.

Hands-on experience with both techniques will be provided.
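The RAMBO idea can be sketched in miniature: datasets are assigned to random groups in several independent repetitions, each group gets one Bloom filter, and a query's candidate datasets are the intersection of the matching groups across repetitions, which shrinks false positives geometrically. The `Bloom` class, the group counts, and the dataset/k-mer names below are all made up for illustration and are not the tutorial's code.

```python
import hashlib
import random

class Bloom:
    """Minimal Bloom filter over an integer bitset (illustrative only)."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0
    def _idx(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m
    def add(self, item):
        for j in self._idx(item):
            self.bits |= 1 << j
    def __contains__(self, item):
        return all((self.bits >> j) & 1 for j in self._idx(item))

random.seed(0)
R, B = 3, 4                                      # repetitions x groups each
datasets = {f"d{i}": {f"kmer{i}{j}" for j in range(5)} for i in range(8)}
assign = [{d: random.randrange(B) for d in datasets} for _ in range(R)]
grids = [[Bloom() for _ in range(B)] for _ in range(R)]
for r in range(R):
    for d, kmers in datasets.items():
        for km in kmers:
            grids[r][assign[r][d]].add(km)

def query(kmer):
    # Candidates must land in a matching group in *every* repetition.
    cands = set(datasets)
    for r in range(R):
        hit = {b for b in range(B) if kmer in grids[r][b]}
        cands &= {d for d in datasets if assign[r][d] in hit}
    return cands

print(query("kmer23"))  # "kmer23" belongs to d2 by construction
```

Bloom filters have no false negatives, so the true dataset always survives the intersection; repetitions exist only to knock out false positives.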


Ben Coleman

Rice University

Gaurav Gupta

Rice University

Josh Engels

Rice University

Benito Geordie

Rice University

Alan Ji

Rice University

Junyan Zhang (Henry)

Rice University

Wednesday October 27, 2021 12:30pm - 2:30pm CDT
Room 280

2:30pm CDT

Afternoon Break and Networking
Wednesday October 27, 2021 2:30pm - 3:00pm CDT

3:00pm CDT

SeqScreen: Accurate and Sensitive Functional Screening of Pathogenic Sequences via Ensemble Learning
Modern benchtop DNA synthesis techniques and increased concern about emerging pathogens have elevated the importance of screening oligonucleotides for pathogens of concern. However, accurate and sensitive characterization of oligonucleotides is an open challenge for many of the current techniques and ontology-based tools. To address this gap, we have developed a novel software tool, SeqScreen, that can accurately and sensitively characterize short DNA sequences using a set of curated Functions of Sequences of Concern (FunSoCs), novel functional labels specific to microbial pathogenesis that describe the pathogenic potential of individual proteins. SeqScreen uses ensemble machine learning models encompassing multi-stage Neural Networks and Support Vector Classifiers that can label query sequences with FunSoCs via an imbalanced multi-class, multi-label classification task with high accuracy. In summary, SeqScreen represents a first step towards a novel paradigm of functionally informed pathogen characterization from genomic and metagenomic datasets. SeqScreen is open-source and freely available for download at: www.gitlab.com/treangenlab/seqscreen
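One way to picture the multi-label setup is a one-vs-rest arrangement in which each label gets its own binary classifier, so a single sequence can carry several labels at once. The sketch below uses plain logistic regression on synthetic features, and the label names are invented for illustration; it is not the SeqScreen ensemble or its real FunSoC vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATS, LABELS = 8, ["adhesion", "toxin", "secretion"]   # hypothetical labels
X = rng.standard_normal((200, N_FEATS))
true_w = rng.standard_normal((len(LABELS), N_FEATS))
Y = (X @ true_w.T > 0).astype(float)      # synthetic multi-label targets

def train_binary(X, y, lr=0.5, steps=300):
    # Plain logistic regression by gradient descent, one label at a time.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

W = np.array([train_binary(X, Y[:, i]) for i in range(len(LABELS))])

def predict(x):
    # A sequence receives every label whose classifier fires.
    probs = 1 / (1 + np.exp(-W @ x))
    return [lab for lab, p in zip(LABELS, probs) if p > 0.5]

acc = np.mean((X @ W.T > 0).astype(float) == Y)
print("train accuracy:", round(acc, 3), "labels for X[0]:", predict(X[0]))
```

The key property this illustrates is that label decisions are independent per class, which is what lets one protein be flagged with multiple functions of concern.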

Authors: Advait Balaji, Bryce Kille, Anthony Kappell, Gene Godbold, Madeline Diep, R. A Leo Elworth, Zhiqin Qian, Dreycey Albin, Daniel Nasko, Nidhi Shah, Mihai Pop, Santiago Segarra, Krista Ternus, and Todd Treangen


Todd Treangen

Rice University

Wednesday October 27, 2021 3:00pm - 3:15pm CDT

3:15pm CDT

Parallel RRT Algorithm for Robotic Motion Planning
The advent of autonomous technology ranging from self-driving cars to robotic surgery has propelled motion planning algorithms to the forefront of research. The Rapidly-exploring Random Tree (RRT) algorithm is one such example that is used by robots to find a suitable path between two points while avoiding obstacles. It does this by building a search tree rooted at the start point and then growing the tree by randomly generating and connecting nodes in the search space. It then verifies each connection to ensure no collision has taken place. The algorithm terminates when the goal region is reached, returning a valid path through the tree.
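The sequential loop just described can be sketched in a few lines. This is a minimal 2-D illustration with a single circular obstacle and invented parameters, not the paper's parallel implementation.

```python
import math
import random

random.seed(1)
START, GOAL, STEP, GOAL_R = (0.0, 0.0), (9.0, 9.0), 0.5, 0.7
OBSTACLE = (5.0, 5.0, 1.0)           # circular obstacle: (cx, cy, radius)

def collides(p):
    cx, cy, r = OBSTACLE
    return math.hypot(p[0] - cx, p[1] - cy) < r

def steer(a, b):
    # Move from a toward b by at most STEP.
    d = math.hypot(b[0] - a[0], b[1] - a[1])
    if d <= STEP:
        return b
    return (a[0] + STEP * (b[0] - a[0]) / d, a[1] + STEP * (b[1] - a[1]) / d)

parent = {START: None}
for _ in range(5000):
    sample = (random.uniform(0, 10), random.uniform(0, 10))
    nearest = min(parent, key=lambda p: math.hypot(p[0] - sample[0],
                                                   p[1] - sample[1]))
    new = steer(nearest, sample)
    if collides(new):
        continue                      # reject edges that hit the obstacle
    parent[new] = nearest
    if math.hypot(new[0] - GOAL[0], new[1] - GOAL[1]) < GOAL_R:
        break                         # goal region reached

# Walk parent pointers back to the start to recover the path.
path, node = [], new
while node is not None:
    path.append(node)
    node = parent[node]
print("path length:", len(path))
```

The sample/nearest-neighbor/steer/collision-check steps in this loop are exactly the computations the talk's parallel algorithm batches across threads.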

Traditionally, RRT is designed to run sequentially on a single thread. Increasing the speed and efficiency of the algorithm would facilitate its use in highly complex realistic scenarios. With the advent of powerful computing machines, it is an opportune time to enhance the performance of these algorithms. This paper presents a novel parallel-RRT motion planning algorithm that performs computationally intensive steps in batches simultaneously on multiple threads. This increases the number of nodes created and collision checked per second hence finding paths faster.

To test the novel algorithm, we recorded the time taken for a car in a two-dimensional space to navigate from a start to a goal point while avoiding obstacles in unknown environments. Results showed that the algorithm successfully utilized the additional threads to compute paths faster and more efficiently. In terms of speed, the algorithm showed a 2x speedup when using 2 threads and a 2.35x speedup when using 3 threads. In terms of efficiency, reflected by the number of connections added to the search tree per second, the algorithm showed a 2.25x increase using 2 threads and a 3x increase using 3 threads.

These preliminary results show promise for leveraging parallel implementations of motion planning algorithms. The use of novel parallel algorithms such as that utilized in this paper heralds the progression into a new era of motion planning capabilities and would invigorate current development efforts in robotics and automation.

Authors: Mantej Singh, Rahul Shome, and Lydia Kavraki


Mantej Singh

Rice University

Wednesday October 27, 2021 3:15pm - 3:30pm CDT

3:30pm CDT

MaGNET: Uniform Sampling from Deep Generative Network Manifolds without Retraining
Deep Generative Networks (DGNs) are extensively employed in Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and their variants to approximate the manifold structure and the distribution of a training dataset. However, the samples from the data manifold used to train a DGN are often obtained based on preferences, costs, or convenience such that they favor certain modes (cf. the large fraction of smiling faces in the CelebA dataset or the large fraction of dark-haired individuals in FFHQ). These inconsistencies will be reproduced in any data sampled from the trained DGN, which has far-reaching potential implications for fairness, data augmentation, anomaly detection, domain adaptation, and beyond. In response, we develop a differential-geometry-based technique that, given a trained DGN, adapts its generative process so that the distribution on the data generating manifold is uniform. We prove theoretically and validate experimentally that our technique can be used to produce a uniform distribution on the manifold regardless of the training set distribution.
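The differential-geometry idea can be pictured on a toy generator with a 1-D latent: reweighting samples by the local volume element sqrt(det(JᵀJ)) of the generator's Jacobian uniformizes the output distribution on the manifold (here, arc length along a curve). The toy map, the finite-difference Jacobian, and the uniform latent are simplifying assumptions for illustration, not the paper's method for deep networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "generator": 1-D latent -> curve in 2-D. In MaGNET this role is
# played by a trained DGN.
def g(z):
    z = np.asarray(z, dtype=float)
    return np.stack([z, np.sin(3 * z)], axis=-1)

def volume_element(z, eps=1e-4):
    # sqrt(det(J^T J)) by finite differences; for a 1-D latent this is
    # just the norm of dg/dz, i.e. how much g stretches latent space locally.
    J = (g(z + eps) - g(z - eps)) / (2 * eps)
    return np.sqrt(np.sum(J**2, axis=-1))

# With uniform latents, resampling with weights proportional to the volume
# element yields points uniformly distributed along the curve itself.
# (For a non-uniform latent density p(z), the weight would be vol(z) / p(z).)
z = rng.uniform(-np.pi, np.pi, size=20000)
w = volume_element(z)
w = w / w.sum()
idx = rng.choice(len(z), size=5000, replace=True, p=w)
uniformized = g(z[idx])
print(uniformized.shape)
```

The same reweighting applies in higher dimensions with the full Jacobian, which is what lets the technique work on an already-trained DGN without retraining.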

Authors: Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk


Ahmed Imtiaz Humayun

Rice University

Wednesday October 27, 2021 3:30pm - 3:45pm CDT

3:45pm CDT

Magnified Convolutional Enrichment Representation Model
Feature representation mathematically characterizes domain entities, which is crucial in machine learning. We designed a dynamic deep model that represents human diseases by evaluating the over-representation of diseases and genes as a controlled vocabulary, leveraging contextual information from word embeddings together with global enrichment information. The model has been evaluated and demonstrates good performance for predicting associations of complex diseases.
Authors: Guocai Chen, Herbert Chen, Yuntao Yang, Abhisek Mukherjee, Shervin Assassi, Claudio Soto, and Wenjin Zheng


Wednesday October 27, 2021 3:45pm - 4:00pm CDT

4:00pm CDT

PipeGCN: Efficient Full-Graph Training of Graph Convolutional Networks with Pipelined Feature Communication
Graph Convolutional Networks (GCNs) are the state-of-the-art method for learning from graph-structured data. Training large-scale GCNs requires distributed training across multiple accelerators such that each accelerator is able to hold a partitioned subgraph. However, distributed GCN training incurs the prohibitive overhead of communicating node features and gradients among partitions for every GCN layer in each training iteration, limiting the achievable training efficiency and model scalability. To this end, we propose PipeGCN, a simple-yet-effective scheme that hides the communication overhead by pipelining inter-partition communication with intra-partition computation. It is non-trivial to pipeline communication for efficient GCN training, as the communicated node features/gradients become stale and can thus harm convergence, negating the benefit of pipelining. Notably, little is known regarding the convergence rate of GCN training with stale features. This work not only provides a theoretical convergence guarantee but also finds the convergence rate of PipeGCN to be close to that of vanilla distributed GCN training without pipelining. Furthermore, we develop a smoothing method to further improve PipeGCN's convergence. Extensive experiments show that PipeGCN can largely boost training throughput (up to 2.2×) while achieving the same accuracy as its vanilla counterpart, and that PipeGCN also outperforms existing full-graph training methods.
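Stripped of everything GCN-specific, the pipelining idea is to compute with the stale boundary features fetched during the previous iteration while the next fetch runs in the background. A minimal sketch, where the sleeps stand in for real inter-partition communication and intra-partition computation (the function names are invented for illustration, not PipeGCN's API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_boundary_features(it):
    time.sleep(0.05)                 # stand-in for inter-partition communication
    return f"features@{it}"

def local_gcn_step(feats):
    time.sleep(0.05)                 # stand-in for intra-partition computation
    return f"updated({feats})"

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(fetch_boundary_features, 0)
    for it in range(1, 4):
        stale = pending.result()                          # features from it-1
        pending = pool.submit(fetch_boundary_features, it)  # overlap next fetch
        out = local_gcn_step(stale)                       # compute with stale copy
        print(it, out)
```

Because the fetch and the compute run concurrently, each iteration costs roughly max(communication, computation) instead of their sum; the price is the one-iteration staleness whose effect on convergence the paper analyzes.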

Authors: Cheng Wan, Youjie Li, Cameron Wolfe, Anastasios Kyrillidis, Nam Kim, and Yingyan Lin


Cheng Wan

Rice University

Wednesday October 27, 2021 4:00pm - 4:15pm CDT

4:15pm CDT

Quantification of Myxococcus Xanthus Aggregation and Rippling Behaviors: Deep-Learning Transformation of Phase-Contrast into Fluorescence Microscopy Images
Myxococcus xanthus bacteria are a model system for understanding pattern formation and collective cell behaviors. When starving, cells aggregate into fruiting bodies to form metabolically inert spores. During predation, cells self-organize into traveling cell-density waves termed ripples. Both phase-contrast and fluorescence microscopy are used to observe these patterns, but each has its limitations. Phase-contrast images have higher contrast, but the resulting image intensities lose their correlation with cell density. The intensities of fluorescence microscopy images, on the other hand, are well-correlated with cell density, enabling better segmentation of aggregates and better visualization of streaming patterns in between aggregates; however, fluorescence microscopy requires the engineering of cells to express fluorescent proteins and can be phototoxic to cells. To combine the advantages of both imaging methodologies, we develop a generative adversarial network that converts phase-contrast images into synthesized fluorescence images. By adding a histogram-equalized output to the state-of-the-art pix2pixHD algorithm, our model generates accurate images of aggregates and streams, enabling the estimation of aggregate positions and sizes, albeit with small shifts of their boundaries. Further training on ripple patterns enables accurate estimation of the rippling wavelength. Our methods are thus applicable to many other phenotypic behaviors and pattern-formation studies.

Authors: Jiangguo Zhang, Jessica Comstock, Christopher Cotter, Patrick Murphy, Weili Nie, Roy Welch, Ankit Patel, and Oleg Igoshin


Jiangguo Zhang

Rice University

Wednesday October 27, 2021 4:15pm - 4:30pm CDT

4:30pm CDT

Posters and Networking
COVID-19 Chest X-Ray Image Classification Using Deep Learning: Soumava Dey (American International Group Inc.), Gunther Correia Bacellar (Microsoft), Mallikarjuna Chandrappa (Bank Of America) and Rajlakshman Kulkarni (Bank Of America)

Localization for Autonomous Underwater Vehicles Inside GPS-Denied Environments: Issam Ben Moallem (Rice University), Ashesh Chattopadhyay (Rice University), Pedram Hassanzadeh (Rice University) and Fathi H. Ghorbel (Rice University)

An Open-Data Driven Risk Assessment Metric for Covid-19 in Texas by County: A Correlation Study Among Possible Risk Factors and an Elementary Unsupervised Machine Learning Analysis: Archita Singh (Cypress Falls High School), Swapnil Shaurya (University of Texas at Austin) and Antony Adair (MD Anderson Cancer Center/UT Health Graduate School of Biomedical Sciences)

Denoising the Fast Monte Carlo Voxel Level Dose Distributions in Proton Beam Radiation Therapy: A Study to Decrease the Computation Time Required: Sanat Dubey (Westwood High School), Antony Adair (MD Anderson Cancer Center/UT Health Graduate School of Biomedical Sciences) and Pablo Yepes (Rice University)

Wednesday October 27, 2021 4:30pm - 5:30pm CDT
Tuesday, November 30

10:00am CST

Data4Good & Responsible AI-Automated Data Collection: How Do You Ensure Data4good?



Web data gathering, especially smart, AI-driven data retrieval, cleansing, normalization, and aggregation solutions, can significantly reduce the amount of time and resources that organizations have to invest in data collection and preparation.

Though web data collection has existed for a long time, the use of AI for web data gathering has become a game-changer.

As methods for the data domain multiply, so too do calls to ensure that these methods are used “for good,” or at the very least, ethically. But how do we know if we are achieving “good”?

Microsoft Israel R&D Centre's Chief Scientist Dr. Tomer Simon alongside Bright Data's CEO Or Lenchner will explore the different questions raised when approaching data at a mammoth scale. They will also discuss interesting cases that use and leverage data for battling climate change, fighting social injustice, and even saving lives.

The focus of this workshop is to champion a "do no harm" approach when accessing and working with data using AI, and to take a closer look at the ethical, compliance-driven processes and questions one must address when doing so, even for what is considered "public domain data."

During this workshop, we will also introduce a set of questions we created that can assist any organization in the process and provide different real-life examples of data being used and tested for good.


Dr. Tomer Simon

Chief Scientist, Microsoft

Or Lenchner

CEO, Bright Data

Tuesday November 30, 2021 10:00am - 11:30am CST