Hi! My name is Nathan Kallus.
I am a post-doctoral associate at MIT and a visiting scholar at USC, and I will soon be starting as an assistant professor of operations research and information engineering at Cornell University's NYC Cornell Tech campus.
My research interests include data-driven decision-making, optimization, statistics, and the analytical capacities and challenges of unstructured and large-scale data.
If you want to know more about me I suggest you look at my C.V. below or email me.
My email is nathan.kallus at gmail.com.
Service and affiliations
Papers published or submitted
Learning Preferences from Assortment Choices in a Heterogeneous Population. With M. Udell.
Abstract: We consider the problem of learning the preferences of a heterogeneous customer population by observing their choices from an assortment of products, ads, or other offerings. Our observation model takes a form common in assortment planning: each arriving customer chooses from an assortment of offerings consisting of a subset of all possibilities. One-size-fits-all choice modeling can fit heterogeneous populations quite poorly, and misses the opportunity for assortment customization in online retail. On the other hand, time, revenue, and inventory targets rule out exploring the preferences of every customer or segment. In this paper we propose a mixture choice model with a natural underlying low-dimensional structure, and show how to estimate its parameters. In our model, the preferences of each customer or segment follow a separate parametric choice model, but the underlying structure of these parameters over all the models has low dimension. We show that a nuclear-norm regularized maximum likelihood estimator can learn the preferences of all customers using a number of observations much smaller than the number of item-customer combinations. This result shows the potential for structural assumptions to speed up learning and improve revenues in assortment planning and customization.
On the Predictive Power of Web Intelligence and Social Media. To appear as a chapter in a volume of Lecture Notes in Computer Science, Springer 2015.
Abstract: With more information becoming widely accessible and new content created and shared on today's web, more are turning to harvesting such data and analyzing it to extract insights. But the relevance of such data to see beyond the present is not clear. We present efforts to predict future events based on web intelligence -- data harvested from the web -- with specific emphasis on social media data and on timed event mentions, thereby quantifying the predictive power of such data. We focus on predicting crowd actions such as large protests and coordinated acts of cyber activism -- predicting their occurrence, specific timeframe, and location. Using natural language processing, statements about events are extracted from content collected from hundreds of thousands of open-content web sources. Attributes extracted include event type, entities involved and their role, sentiment and tone, and -- most crucially -- the reported timeframe for the occurrence of the event discussed -- whether it be in the past, present, or future. Tweets (Twitter posts) that mention an event reported to occur in the future prove to be important predictors. These signals are enhanced by cross-referencing with the fragility of the situation as inferred from more traditional media, allowing us to sift out the social media trends that fizzle out before materializing as crowds on the ground.
From Predictive to Prescriptive Analytics. With D. Bertsimas.
Abstract: In this paper, we combine ideas from machine learning (ML) and operations research and management science (OR/MS) in developing a framework, along with specific methods, for using data to prescribe decisions in OR/MS problems. In a departure from other work on data-driven optimization and reflecting our practical experience with the data available in applications of OR/MS, we consider data consisting, not only of observations of quantities with direct effect on costs/revenues, such as demand or returns, but predominantly of observations of associated auxiliary quantities. The main problem of interest is a conditional stochastic optimization problem, given imperfect observations, where the joint probability distributions that specify the problem are unknown. We demonstrate that our proposed solution methods are generally applicable to a wide range of decision problems. We prove that they are computationally tractable and asymptotically optimal under mild conditions even when data is not independent and identically distributed (iid) and even for censored observations. As an analogue to the coefficient of determination R², we develop a metric P termed the coefficient of prescriptiveness to measure the prescriptive content of data and the efficacy of a policy from an operations perspective. To demonstrate the power of our approach in a real-world setting we study an inventory management problem faced by the distribution arm of an international media conglomerate, which ships an average of 1 billion units per year. We leverage both internal data and public online data harvested from IMDb, Rotten Tomatoes, and Google to prescribe operational decisions that outperform baseline measures. Specifically, the data we collect, leveraged by our methods, accounts for an 88% improvement as measured by our coefficient of prescriptiveness.
Robust SAA. Winner of the Best Student Paper Award, MIT Operations Research Center 2013. With D. Bertsimas and V. Gupta.
Abstract: Sample average approximation (SAA) is possibly the most popular approach to modeling decision making under uncertainty in data-driven settings. In SAA, one approximates a true, unknown probability distribution by the empirical distribution defined by the data. Under mild assumptions, as the amount of data grows, the solutions of SAA-based optimization problems converge asymptotically to the solutions that would be obtained if the underlying distribution were known. In this paper, we propose a general-purpose modification of SAA that retains this asymptotic convergence, but also enjoys a strong finite-sample performance guarantee. The key idea is to define a suitable robust optimization over the set of distributions which are close to the empirical distribution using tools from statistical hypothesis testing. The resulting optimization problem is computationally tractable and solvable using off-the-shelf solvers. We illustrate the approach by studying some specific inventory models in data-driven settings. Computational evidence confirms that our approach significantly outperforms other data-driven approaches to such problems.
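For readers unfamiliar with plain SAA, here is a minimal sketch on a single-period inventory (newsvendor) problem. It illustrates only the baseline SAA idea of optimizing against the empirical distribution; it is not the robust modification proposed in the paper. The function names, the cost model, and the numbers in the usage note are illustrative assumptions.

```python
import math


def average_cost(order_qty, demands, unit_cost, unit_price):
    """Sample-average cost: purchase cost minus revenue, averaged over observed demands."""
    return sum(
        unit_cost * order_qty - unit_price * min(order_qty, d) for d in demands
    ) / len(demands)


def saa_newsvendor(demands, unit_cost, unit_price):
    """Order quantity that minimizes the sample-average newsvendor cost.

    Assumes 0 < unit_cost < unit_price. Under the empirical distribution,
    a minimizer is the smallest observed demand whose empirical CDF reaches
    the critical ratio (price - cost) / price.
    """
    if not demands:
        raise ValueError("need at least one demand observation")
    critical_ratio = (unit_price - unit_cost) / unit_price
    ordered = sorted(demands)
    # Index of the empirical quantile at the critical ratio (1-based k).
    k = max(1, math.ceil(critical_ratio * len(ordered)))
    return ordered[k - 1]
```

With the hypothetical data `saa_newsvendor([10, 20, 30, 40, 50], unit_cost=1, unit_price=4)`, the critical ratio is 0.75 and the function returns the fourth smallest observation, 40, which also minimizes `average_cost` over the observed demand values.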
Data-Driven Robust Optimization. Finalist, INFORMS Nicholson Paper Competition 2013. With D. Bertsimas and V. Gupta.
Abstract: The last decade has seen an explosion in the availability of data for operations research applications as part of the Big Data revolution. Motivated by this data-rich paradigm, we propose a novel schema for utilizing data to design uncertainty sets for robust optimization using statistical hypothesis tests. The approach is flexible and widely applicable, and robust optimization problems built from our new sets are computationally tractable, both theoretically and practically. Furthermore, optimal solutions to these problems enjoy a strong, finite-sample probabilistic guarantee. We also propose concrete guidelines for practitioners and illustrate our approach with applications in portfolio management and queueing. Computational evidence confirms that our data-driven sets significantly outperform conventional robust optimization techniques whenever data is available.
Predicting Crowd Behavior with Big Public Data. In the Proceedings of the 23rd International Conference on World Wide Web. Winner, INFORMS Social Media Analytics Best Paper Competition 2015.
Abstract: With public information becoming widely accessible and shared on today's web, greater insights are possible into crowd actions by citizens and non-state actors such as large protests and cyber activism. Turning public data into Big Data, the company Recorded Future continually scans over 300,000 open content web sources in 7 languages from all over the world, ranging from mainstream news to government publications to blogs and social media. We study the predictive power of this massive public data in forecasting crowd actions such as large protests and cyber campaigns before they occur. Using natural language processing, event information is extracted from content, such as the type of event, what entities are involved and in what role, sentiment and tone, and the occurrence time range of the event discussed. The amount of information is staggering and trends can be seen clearly in sheer numbers. In the first half of this paper we show how we use this data to predict large protests in a selection of 19 countries and 37 cities in Asia, Africa, and Europe with high accuracy using standard learning machines. In the second half we delve into predicting the perpetrators and targets of political cyber attacks with a novel application of the naïve Bayes classifier to high-dimensional sequence mining in massive datasets.
Optimal A Priori Balance in the Design of Controlled Experiments.
Abstract: We develop a unified theory of designs for controlled experiments that balance baseline covariates a priori (before treatment and before randomization) using the framework of minimax variance. We establish a "no free lunch" theorem that indicates that, without structural information on the dependence of potential outcomes on baseline covariates, complete randomization is optimal. Restricting the structure of dependence, either parametrically or non-parametrically, leads directly to imbalance metrics and optimal designs. Certain choices of this structure recover known imbalance metrics and designs previously developed ad hoc, including randomized block designs, pairwise-matched designs, and re-randomization. New choices of structure based on reproducing kernel Hilbert spaces lead to new methods, both parametric and non-parametric.
The Power of Optimization Over Randomization in Designing Experiments Involving Small Samples. With D. Bertsimas and M. Johnson. To appear in Operations Research. OR articles in advance. Code.
Abstract: Random assignment, typically seen as the standard in controlled trials, aims to make experimental groups statistically equivalent before treatment. However, with a small sample, which is a practical reality in many disciplines, randomized groups are often too dissimilar to be useful. We propose an approach based on discrete linear optimization to create groups whose discrepancy in their means and variances is several orders of magnitude smaller than with randomization. We provide theoretical and computational evidence that groups created by optimization have exponentially lower discrepancy than those created by randomization.
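To give a feel for the idea of choosing groups by optimization instead of by chance, here is a toy sketch that exhaustively searches all equal-size splits of a small sample and keeps the one with the smallest gap in group means. It is an assumption-laden illustration only: it uses brute-force enumeration in place of the discrete linear optimization formulation in the paper, it balances means but not variances, and the function name and data are hypothetical.

```python
from itertools import combinations


def best_balanced_split(values):
    """Return (indices of group A, |mean_A - mean_B|) for the equal-size split
    with the smallest difference in group means, found by exhaustive search."""
    n = len(values)
    if n < 2 or n % 2:
        raise ValueError("need an even number of at least two subjects")
    half = n // 2
    total = sum(values)
    best_group, best_gap = None, float("inf")
    # Keep subject 0 in group A so each split is examined only once.
    for rest in combinations(range(1, n), half - 1):
        group = (0,) + rest
        sum_a = sum(values[i] for i in group)
        # mean_A - mean_B = (2 * sum_A - total) / half for equal-size groups.
        gap = abs(2 * sum_a - total) / half
        if gap < best_gap:
            best_group, best_gap = group, gap
    return best_group, best_gap
```

For the hypothetical baseline measurements `[1, 2, 3, 4, 5, 7]`, the search returns group A as indices `(0, 2, 5)` (values 1, 3, 7) with a mean gap of 0. Enumeration grows combinatorially with the sample size, so this sketch is only practical for very small samples.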
Scheduling, Revenue Management, and Fairness in an Academic-Hospital Division: An Optimization Approach. With D. Bertsimas and R. Baum. Academic Radiology, Volume 21, Issue 10, October 2014, Pages 1322-1330. PDF. Editorial comment (D. Avrin).
Abstract: Physician staff of academic hospitals today practice in several geographic locations including their main hospital, referred to as the extended campus. With extended campuses expanding, the growing complexity of a single division's schedule means that a naïve approach to scheduling compromises revenue and can fail to consider physician over-exertion. Moreover, it may provide an unfair allocation of individual revenue, desirable or burdensome assignments, and the extent to which the preferences of each individual are met. This has adverse consequences on incentivization and employee satisfaction and is simply against business policy. We identify the daily scheduling of physicians in this context as an operational problem that incorporates scheduling, revenue management, and fairness. Noting previous success of operations management and optimization in each of these disciplines, we propose a simple, unified optimization formulation of this scheduling problem using mixed integer optimization (MIO). Through a study of implementing the approach at the Division of Angiography and Interventional Radiology at the Brigham and Women's Hospital, which is directed by one of the authors, we exemplify the flexibility of the model to adapt to specific applications, the tractability of solving the model in practical settings, and the significant impact of the approach, most notably in significantly increasing revenue while only improving fairness and objectivity.