Understanding Human Behavior through Data Science

With massive amounts of data currently available and being collected, obtaining access to data is seldom the concern. Information is being produced and stored at an unprecedented rate, and increasingly, much of the big data being collected is about human behavior. Our behavior is captured in the information that we provide from our smartphones, computers, televisions, smart speakers, credit cards, or tracking devices from insurance companies. Sifting through this data and deriving insights on human behavior enables organizations to make more effective decisions and develop stronger policies.

Discovering Causality within Complexity

However, with all of this available data, it’s possible for data scientists to feed data into a machine learning model and get strange or surprising results. Consider the following examples.

  • Personal credit scores predict not only who is likely to default on a loan but also the likelihood of being in a car accident and, as such, are often used in insurance quotes.
  • Lifestyle, habits, and purchase preferences predict not only future purchase behavior but also the likelihood of diseases such as diabetes and hypertension.
Human behavior is complex, and while we may be able to accurately predict behavior, it may be more important to understand why that behavior occurs and how to change it.

As human behavior is considered more carefully as a part of business strategy, there is value in thinking about causal inference and the methods and research design that help to demonstrate causality.

Figure: Credit score distribution in the U.S.

Using Methods from Psychology and Social Science to Understand Behavior

Figure: Sample results of an A/B test

Experimentation and Causality

Sometimes prediction is good enough: we’re interested only in finding the best prediction given some historical observations. We can attempt to predict:

  • The closing price of a stock given past data.
  • Who is most likely to benefit from some procedure.
  • Whether or not someone will buy a particular product.
However, prediction is limited in that it’s based on past experience. If we want to identify an underlying behavioral model or understand why people are behaving the way they are, we have to think about identification strategies to demonstrate causality.

Social scientists and psychologists, in particular, use experimental design to demonstrate causation by randomizing subjects into different conditions. Now standard in many companies, A/B testing checks whether version “B”, the treatment, improves some metric of interest relative to version “A”, the control (usually the current version). A/B tests are useful for learning how to optimize whichever KPI or aspect of operations a company wants to improve: click-through rates, time users spend on a site, sales, or repeat usage. Any company that has at least a few thousand active users can conduct these tests. With a large customer sample, it’s feasible to automatically collect large amounts of data on users’ behavior and to run concurrent experiments in order to evaluate many ideas quickly and with high precision.
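As a rough illustration of how the result of such a test might be evaluated, the sketch below compares conversion rates between a control and a treatment group with a two-proportion z-test. The function name and the counts are hypothetical and chosen only for illustration; a real analysis would also consider sample-size planning and corrections for running many tests at once.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value

# Hypothetical counts: 5,000 users per group (not real data).
lift, z, p_value = two_proportion_z_test(conv_a=500, n_a=5000,
                                         conv_b=570, n_b=5000)
print(f"lift={lift:.3f}, z={z:.2f}, p={p_value:.4f}")
```

Because users are randomized into the two groups, a difference that is unlikely under the null hypothesis can be read as an effect of the change itself rather than of who happened to see it.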

Combining Machine Learning and Causal Inference

While A/B tests and experimental designs are standard practice for establishing causality, these methods still have limitations. Noise may have crept into the data. Or there could have been not one but hundreds of different tests run to determine which features affect users’ behavior. Both cases present a problem (and an opportunity) in that the data form a high-dimensional set of potential controls and predictors. Machine learning techniques, such as LASSO (least absolute shrinkage and selection operator) or newer methods like Double Machine Learning (Chernozhukov et al., 2016), can be helpful in reducing dimensionality and estimating causal effects.
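To make the LASSO idea concrete, here is a minimal sketch of the selection step only, written as plain coordinate descent so that it runs without external libraries. The simulated data, the penalty value `alpha`, and the function names are assumptions made for illustration, and this is not the Double Machine Learning procedure from the cited paper. In practice one would more likely use a library implementation such as scikit-learn's `Lasso`.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/(2n)) * ||y - X b||^2 + alpha * ||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            # Covariance of feature j with the residual that ignores feature j.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            scale = sum(c * c for c in col) / n
            new_beta = soft_threshold(rho, alpha) / scale
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
            beta[j] = new_beta
    return beta

# Simulated data (hypothetical): 10 candidate features, only 0 and 3 matter.
random.seed(0)
n, p = 200, 10
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[3] + random.gauss(0, 0.5) for row in X]

beta = lasso_coordinate_descent(X, y, alpha=0.3)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected features:", selected)
```

The L1 penalty sets the coefficients of weak predictors exactly to zero, which is what makes LASSO useful for narrowing a large set of candidate controls. The surviving coefficients are shrunk toward zero, so they should not be read directly as causal effect sizes.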

Another possibility is that A/B tests or experiments couldn’t be run. While designing and running experiments is almost always preferable to inferring causality from observational data, there are many situations where running an A/B test or experiment is not possible. This is where large collections of data and different statistical techniques can improve the causal inferences we make. Statistical techniques, such as regression discontinuity design, difference-in-differences, fixed effects modeling, and instrumental variables modeling, are used to account for confounding variables and model causal relationships.
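Of the techniques listed above, difference-in-differences is the simplest to show in a few lines. The sketch below uses made-up weekly sales figures for a region that received a change (treated) and one that did not (control). The estimate is only credible under the parallel-trends assumption, that is, that both groups would have moved alike had the change not happened.

```python
from statistics import mean

def difference_in_differences(treated_pre, treated_post,
                              control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical weekly sales before and after a change (not real data).
treated_pre = [100, 102, 98, 100]
treated_post = [115, 117, 113, 115]
control_pre = [90, 92, 88, 90]
control_post = [95, 97, 93, 95]

effect = difference_in_differences(treated_pre, treated_post,
                                   control_pre, control_post)
print("estimated effect:", effect)
```

Subtracting the control group's change removes whatever affected both groups over the same period, such as seasonality, leaving the part of the change that is specific to the treated group.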

Figure: Sample LASSO regression learning rate

Surveys and Crowdsourcing Data

While observational data may be widely available, understanding and gaining insight into users’ thoughts and attitudes may also be important. Surveys provide insight into fundamental questions about people’s internal states (emotions, expectations, opinions, etc.) that often cannot be learned from observational data alone. For example, a company may be interested in why people are not clicking on a newly launched section of its website. It has observational data on its users’ click-through behavior, time spent on the website, and demographic variables (e.g., age, region, gender, income, education) but is unable to gain insight into what its users are feeling or thinking while browsing the website. This is where surveys or focus groups can help in gathering qualitative and quantitative feedback on users’ internal states.

Big data can help in understanding what people do, but surveys can be useful in uncovering why people do certain things or how people are feeling. The utility of focus groups underscores how small but valuable data can be equally or more important than big data.



According to John Wanamaker, “Half the money I spend on advertising is wasted; the trouble is that I don’t know which half.” With the power of A/B testing and causal inference, the effect of a promotion or marketing effort can be quantified. For example, a company is interested in whether personalisation increases retention rates on its website. It can launch an A/B test using a personalisation algorithm for half of its users (the treatment group) and the original website for the other half (the control group). The company can run the experiment, quickly iterate, and pivot if the results are not positive.
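One common way to split users into two halves is to hash each user ID, so that the same user always lands in the same group without the assignment having to be stored. The sketch below is one possible approach, not a description of any particular company's system; the experiment name, the function name, and the integer user IDs are hypothetical.

```python
import hashlib

def assign_group(user_id, experiment="personalisation_v1", treatment_share=0.5):
    """Deterministically assign a user to 'treatment' or 'control'."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 8 hex digits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"

# Hypothetical user IDs 0..9999.
groups = [assign_group(user_id) for user_id in range(10_000)]
treatment_fraction = groups.count("treatment") / len(groups)
print(f"treatment share: {treatment_fraction:.3f}")
```

Including the experiment name in the hashed key means that a user's group in one experiment does not determine their group in another, which matters when several experiments run concurrently.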

Further Reading

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., & Newey, W. K. (2016). Double machine learning for treatment and causal parameters (No. CWP49/16). cemmap working paper.

Medical Research

While medical research commonly relies on randomized controlled studies, medicine and healthcare remain a promising area for implementing data science solutions. Smart devices are becoming more common, and patients’ behavioral data can be used to create highly customized programs. Experiments, together with data from smart devices, could be used to design effective programs aimed at changing patients’ lifestyles.


Behavioral modeling outside academic contexts is interdisciplinary by nature. We welcome the opportunity to explore a unique application of social-scientific methods to your industry.