Go wide or go deep? Data and models in clean technology
When faced with a question about agriculture, water, energy, air or another clean technology system, how do you decide how to model it? Do you dive deep into the subject matter and try to figure out what would work? Or do you go wide, look at the interactions between different systems and see how different types of data and models can be combined? Or do you do a mix of both?
Like many other challenges in the data science and clean technology fields, it really depends on the question you’re trying to answer and the data that are available. If the question involves a relatively well-defined process and sufficient data, then it makes sense to start by diving deep into the subject and looking at the processes and interactions within a relatively narrow field. For example, let’s say that we’re trying to understand the chemical interactions in a water treatment plant and whether there’s a problem with how effectively the treatment process removes a certain contaminant like lead. That’s a very specific question, and the data are local, targeted and can be used to answer it. In this case, a deep model of the process and the water treatment plant can be built and model results obtained relatively quickly.
However, questions and data that are so specific are quite rare in the real world. As interactions between Earth systems and humans become increasingly complex, understanding and modeling them requires a broad and deep understanding of the different processes, how they interact, and the data associated with them.
Let’s take a look at predicting crop yield in an agricultural system. A typical process model would look at the plant, farm management practices, and interactions with the weather, soil, water and pests, and use those to predict how the crop will perform in a given season. That’s already several systems interacting with each other! Further, each of those systems needs to be understood in some detail to build an accurate and reliable model, which means that we need to go both deep and wide.
The level of complexity of the model will depend on how broadly the system is defined and where we set the system boundaries. For example, if we were to build a model for the future, we might incorporate a climate change model to explore what the impacts on crop productivity in the region would be. Or we could look at pollinator behavior over several fields to see how farm management practices are impacting yield and ecosystems, since crop yield depends on pollination. Or we could look at farmer profitability and crop prices in the market and explore how they are impacting farm management practices and ecosystems.
That’s just a look at the systems involved in one of the popular areas in clean technology and how they are defined and modeled. But what about the data needed for the models? Once again, we can get data at different scales. Local or site-specific data are useful for building deep, specific process models. We can also get data at larger scales from satellites, drones and robots. Sometimes the data are specific enough to solve the problem. But most often, we need to transform data or combine different data sources in order to answer the question at hand.
And what about data for machine learning models? Machine learning models came out of computer science and the high tech industry, so the kind of data that they expect, and that makes them most effective, is high volume, high velocity big data. But data in clean technology are often smaller and more limited in spatial and temporal extent. For example, crop yield is measured once or at most twice a year. That’s very different from Google search, where millions of data points are collected every hour! That often means that data from different systems need to be combined (e.g. water data in addition to crop data, or ecosystem, economic and farm data) and that machine learning models need to be adapted to work in clean technology systems.
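To make the idea of combining data at different time scales a little more concrete, here is a minimal sketch in Python. It assumes a handful of made-up daily weather readings and made-up annual yield values; the names `seasonal_features`, `join_with_yield`, `total_rain_mm`, `mean_temp_c` and `yield_t_ha` are hypothetical and chosen only for illustration. The sketch collapses the frequent weather data into one row per year so that it lines up with the once-a-year yield measurement.

```python
from collections import defaultdict
from statistics import mean

# Illustrative, made-up daily weather rows: (year, rainfall in mm, mean temperature in C).
daily_weather = [
    (2018, 12.0, 20.0), (2018, 0.0, 24.0), (2018, 3.0, 22.0),
    (2019, 0.0, 26.0), (2019, 1.0, 28.0),
    (2020, 8.0, 21.0), (2020, 10.0, 19.0),
]

# Illustrative, made-up annual crop yield in tonnes per hectare.
# 2020 is missing on purpose, as if the harvest has not been recorded yet.
annual_yield = {2018: 3.1, 2019: 2.4}


def seasonal_features(daily_rows):
    """Collapse daily weather rows into one feature row per year."""
    by_year = defaultdict(list)
    for year, rain_mm, temp_c in daily_rows:
        by_year[year].append((rain_mm, temp_c))

    features = {}
    for year, rows in by_year.items():
        features[year] = {
            "total_rain_mm": sum(rain for rain, _ in rows),
            "mean_temp_c": mean(temp for _, temp in rows),
        }
    return features


def join_with_yield(features, yields):
    """Keep only the years that appear in both the weather and yield data."""
    shared_years = sorted(features.keys() & yields.keys())
    return [
        {"year": year, **features[year], "yield_t_ha": yields[year]}
        for year in shared_years
    ]


table = join_with_yield(seasonal_features(daily_weather), annual_yield)
for row in table:
    print(row)
```

Note how small the resulting table is: several daily readings shrink to just two usable rows, one per harvested year. That is the kind of data limitation that machine learning models in clean technology have to work with, and one reason adding data from other systems can help.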
That’s why we at Ecoformatics look at the entire Earth System. We don’t focus on only agricultural technology or water tech or climate change or energy - we look at all the systems. That way, we can do two things: go deep to understand each system, and go wide to build effective machine learning models and generate new insights by looking at interactions between multiple systems. And that’s what we teach in our courses and workshops as well.
So you'll see a range of topics in our workshops, newsletters, and blog posts - because we believe that the most innovative solutions come from exploring as widely as possible - and becoming an expert in the subject.
If you’re interested in this approach and in getting started with machine learning for clean technology, join us here for our live, hands-on, virtual workshop and online course on Sunday, July 19th at 11 am Pacific Time.