When Data Science Fails Clean Technology
Is data science infallible? If all we had to go on were the breathlessly excited articles published in business magazines and the highly polished press releases from startups and large tech companies, it would certainly seem so. Think of the articles published this year with titles like "Artificial Intelligence (AI) to replace all jobs by such and such a year", "Machine learning solves problem faster than humans", "Data science shows promise to end world hunger soon" - and so on.
Data science is a relatively new field, but one that combines elements from disciplines that have been around for a while - computer science, statistics, and linear algebra, for example. The difference right now is the sheer power and availability of computational resources like the cloud, which let people build and run different models and experiment at a scale we haven't seen before. In high-tech companies, we're also seeing an explosion in the availability of data that allows us to understand systems better and thus build better models.
If we didn't have many thousands of images of cats on the Internet, Google's image recognition algorithms would be hard pressed to tell a cat from a horse! Having so much data available to train algorithms and build better models has made it seem that AI, machine learning, data science - take your pick - is better at doing things than humans, and that most problems can be easily solved if you can just build the right algorithm.
But is that true in all fields - especially clean technology?
Let’s look at the different components needed to build successful data science tools in clean technology.
First, the data. Software companies like Google and Facebook and mobile apps like Uber, DoorDash and so on hoover up huge amounts of data every day. That means their machine learning models have millions of data points to train on and plenty of data to make predictions from. The data are also relatively predictable: if people are using your website or app, you can design an experiment fairly easily by changing certain features (adding a button, changing its size, removing links).
Is that what happens in clean technology sectors like water, air, energy and agriculture?
To start with, we have data that has historically been collected far less frequently than clicks on an app, and that data is distributed across space as well as time.
Let's take the example of air quality monitoring stations, given that poor air quality has been in the news in many countries lately. If we're monitoring air quality in a city, we typically have stations with sensors scattered at fixed points across it - spatial and temporal data. The data from these stations may be collected every day, week, or month (or even less frequently in some cases), and the number of stations is likely to be in the tens or hundreds at most, not thousands or millions. This means smaller datasets and data of varying quality.
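To make the contrast concrete, here is a minimal sketch of what such a dataset might look like in pandas. The station IDs, coordinates, and readings are invented for illustration; the point is the shape of the data, not the numbers.

```python
import pandas as pd

# Hypothetical daily PM2.5 readings from a handful of fixed monitoring stations.
# Note the scale: a few stations reporting once a day, not millions of events per hour.
readings = pd.DataFrame({
    "station_id": ["S01", "S01", "S02", "S02", "S03"],
    "lat":        [28.61, 28.61, 28.70, 28.70, 28.55],
    "lon":        [77.21, 77.21, 77.10, 77.10, 77.25],
    "timestamp":  pd.to_datetime([
        "2019-11-01", "2019-11-02", "2019-11-01", "2019-11-02", "2019-11-01",
    ]),
    "pm25_ugm3":  [180.0, 210.5, 95.2, 130.8, 160.1],
})

# The same data viewed as a space-time grid: rows are days, columns are stations.
# Gaps (NaN) are already common - S03 has no reading on Nov 2.
grid = readings.pivot_table(index="timestamp", columns="station_id", values="pm25_ugm3")
print(grid)
```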
In the last few years, there’s been an explosion of interest in building sensors, robots and apps to collect data more frequently over space and time. However, the availability and quality of data is still an issue in many clean tech fields.
And then, since these are natural systems, there is much less control over the type and quality of data than with a website or mobile app. To go back to the air monitoring example, what happens if an unexpected storm knocks your station or its sensors offline? How do you deal with the missing data, and how do you account for other variables that may have crept into your model as a result?
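As one illustration of the kind of workaround involved, the sketch below (with invented readings) simulates a sensor outage, interpolates across at most a few consecutive missing hours, and flags what remains missing - so that downstream models see the gap explicitly instead of silently training on fabricated values.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly PM2.5 record from one station; a storm knocks the sensor
# offline for most of Nov 2, so those readings never arrive.
idx = pd.date_range("2019-11-01", periods=72, freq="h")
pm25 = pd.Series(120 + 30 * np.sin(np.arange(72) / 6.0), index=idx)
pm25.loc["2019-11-02 03:00":"2019-11-02 20:00"] = np.nan   # the outage

# Interpolate across at most 3 consecutive missing hours; the remainder of a
# longer outage stays NaN and is flagged so the modeller can handle it
# explicitly (e.g. exclude it, or impute it from neighbouring stations).
filled = pm25.interpolate(limit=3)
outage_flag = filled.isna()

print(f"hours still missing after filling short gaps: {outage_flag.sum()}")
```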
These are problems that need someone who is an expert in air quality to solve or work around - not problems where algorithms can be applied blindly without a deep understanding of the underlying system. That brings me to the second component of our system.
Second, the model. Most machine learning algorithms have been developed for computer science and software applications. The successful applications are ones where the algorithm has a lot of data, the data is relevant to the outcome being predicted, and the algorithm can be trained and tested successfully on that data. Web search, recommendations for shows and purchases, and social graphs are all examples of extremely successful machine learning applications.
But what happens to algorithms in the clean tech space where some of these assumptions and data requirements do not hold?
Fortunately, machine learning algorithms do not have to be built from scratch for the clean tech sector! However, we do need to adapt existing algorithms to deal with the vagaries of data from the clean tech space. In some cases, that may be as simple as building a custom error function into an existing optimization model. In others, it may involve integrating existing physical or statistical models of the clean technology system with machine learning models, or building a hybrid of the two, as sketched below.
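Here is a hedged sketch of what "hybrid" can mean in practice: suppose we already have a simple physical or statistical baseline (here just a stand-in function) that predicts PM2.5 from a few inputs, and we train a small ML model only on its residuals, so the learned component corrects the baseline rather than replacing it. The baseline, features, and data below are placeholders, not a real air quality model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Placeholder features: wind speed (m/s), temperature (C), traffic index.
X = rng.uniform([0, 5, 0], [10, 40, 1], size=(500, 3))

def physical_baseline(X):
    # Stand-in for an existing physical/statistical air quality model:
    # pollution roughly decreases with wind and increases with traffic.
    wind, temp, traffic = X[:, 0], X[:, 1], X[:, 2]
    return 150 * traffic / (1 + wind) + 0.5 * temp

# Synthetic "observed" PM2.5: the baseline plus structure it doesn't capture.
y_obs = physical_baseline(X) + 20 * np.sin(X[:, 1] / 5) + rng.normal(0, 5, 500)

# Hybrid step: fit an ML model to the residuals of the physical baseline.
residuals = y_obs - physical_baseline(X)
correction = GradientBoostingRegressor().fit(X, residuals)

# Final prediction = physics + learned correction.
y_pred = physical_baseline(X) + correction.predict(X)
print("baseline RMSE:", np.sqrt(np.mean((y_obs - physical_baseline(X)) ** 2)))
print("hybrid   RMSE:", np.sqrt(np.mean((y_obs - y_pred) ** 2)))
```

The design choice here is deliberate: the physics stays in charge of the overall behaviour, and the learned model only has to explain what the physics misses, which tends to be more robust when data is scarce.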
Let's look at what happens when the existing air monitoring stations in our example are integrated with data from an app where people can rank visibility and report health impacts from different locations. Now we have two different datasets, and we need to figure out how to highlight areas with poor air quality right now as well as over the coming week. That means combining the station data with an air quality model that predicts into the future, and a classification algorithm that uses the app data to identify areas with poor air quality as they occur.
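One plausible (and much simplified) way to combine the two sources: join the latest station reading for each area with aggregated app reports from that area, then train a classifier that flags areas currently experiencing poor air quality. All column names, values, and labels here are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical latest station readings, one row per city area.
stations = pd.DataFrame({
    "area": ["north", "south", "east", "west"],
    "pm25_ugm3": [220.0, 80.0, 150.0, 60.0],
})

# Hypothetical crowd-sourced app reports aggregated by area:
# average visibility rank (1 = clear, 5 = can't see across the street)
# and the share of users reporting symptoms.
app_reports = pd.DataFrame({
    "area": ["north", "south", "east", "west"],
    "mean_visibility_rank": [4.6, 2.1, 3.8, 1.5],
    "symptom_report_rate": [0.40, 0.05, 0.22, 0.02],
})

data = stations.merge(app_reports, on="area")

# Labels would come from a health-based threshold or past expert review;
# here they are simply made up for the sketch.
data["poor_air_now"] = [1, 0, 1, 0]

features = ["pm25_ugm3", "mean_visibility_rank", "symptom_report_rate"]
clf = LogisticRegression().fit(data[features], data["poor_air_now"])
data["flagged"] = clf.predict(data[features])
print(data[["area", "flagged"]])
```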
In all these cases, if the person building the models is not intimately familiar with the system and its underlying issues, you will get odd corner cases, results that don't meet the user's requirements, or results that don't stand up over time. And that's where the need for the expert comes in - someone who understands the sector and the limitations of the algorithms, and who can figure out how to compensate for them or build something that will meet most of the goals.
And that brings me to the third and final component: storytelling - understanding the user and the questions that solve the user's problem.
Building models, finding data, creating impressive visualizations - all these are important components of using data science to solve problems in clean technology. However, the most important component in building effective solutions is understanding what questions need to be answered, whom the data and results impact and what the limitations and pitfalls of the solution are.
To go back to our air pollution example - what questions are we trying to answer and for whom?
One audience is people living in the city who need to understand how the air quality is changing and whether children, older people, and people with existing health conditions in certain areas are being affected that day. Simple updates, a clear map of conditions across the city, and resources for monitoring or mitigating health impacts would be most useful for them.
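For that audience, much of the work is translation rather than modelling. A minimal sketch: map raw PM2.5 readings onto plain-language categories and advice before they ever reach a map. The bands below are illustrative placeholders, not an official air quality index.

```python
import pandas as pd

# Illustrative PM2.5 bands (ug/m3) and plain-language advice; a real system
# should use the official index for its country, not these placeholder cut-offs.
bands = [
    (0,   50,  "Good",      "No precautions needed."),
    (50,  100, "Moderate",  "Sensitive groups should limit prolonged outdoor exertion."),
    (100, 200, "Unhealthy", "Children, older people and those with health conditions should stay indoors."),
    (200, float("inf"), "Hazardous", "Everyone should avoid outdoor activity."),
]

def advise(pm25):
    """Return a (category, advice) pair for a PM2.5 reading."""
    for low, high, label, advice in bands:
        if low <= pm25 < high:
            return label, advice
    return "Unknown", "No data available."

readings = pd.DataFrame({"area": ["north", "south"], "pm25_ugm3": [220.0, 80.0]})
readings["category"], readings["advice"] = zip(*readings["pm25_ugm3"].map(advise))
print(readings)
```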
Another audience would be policy makers and government officials trying to understand how to solve the problem. Here is where the causes identified by the model, people's reactions, and predictions of how the situation could be mitigated or improved come in. In such a system, it would be as important to showcase what the model can't do as it is to highlight what it can.
And this is where you need someone who can communicate all the complexities of these situations - someone who understands which questions are essential, which are merely important, and which are easy to solve but do not actually help the users. And that's where having an expert in the clean tech sector is essential.
So, when does data science fail clean technology?
Data science fails clean technology when any of these three components are missing.
If you have people who can build the algorithms, write the code, and create complicated software systems, but who do not understand how data from the clean tech field differs from what they've built with before - your system will break.
If you treat algorithms as black boxes and do not consider how they need to be modified to fit the clean tech sector, your system will break.
If you do not understand your audience - their questions, what problem they want solved, the kinds of solutions that are actually helpful - your system will break.
But if you can combine all these components into something useful, then you’re well on your way to helping solve the Earth’s problems!