AI and Underground Monitoring

Edwardov

Member
This is a showcase example of using AI to predict interesting things related to caves. I'm working on a few other projects, but I thought I would share this one.

I have trained an AI to study weather and temporal patterns to predict CO2 levels inside Poole's Cavern. The CO2 data came from the BCSC, and I extracted hourly weather data from a numerical model output (https://rda.ucar.edu/datasets/ds094.1/), which matched the BCSC surface weather data but had better coverage throughout the time series and extended to the present. The entire dataset was resampled to 10-minute resolution to match the cave logger data, which meant interpolating the hourly weather data.
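
A minimal sketch of that resampling step, assuming pandas (the file names and column layout here are hypothetical stand-ins for the BCSC and reanalysis data):

Code:
import pandas as pd

# Hourly weather from the reanalysis output, indexed by timestamp
weather = pd.read_csv("weather_hourly.csv", parse_dates=["time"], index_col="time")

# 10-minute cave logger CO2 data from the BCSC
cave = pd.read_csv("cave_co2_10min.csv", parse_dates=["time"], index_col="time")

# Upsample the hourly weather onto a 10-minute grid, filling the gaps
# by linear interpolation so both sources share one time index
weather_10min = weather.resample("10min").interpolate(method="linear")

# Join on the shared index; rows missing either side are dropped
dataset = weather_10min.join(cave, how="inner").dropna()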

Essentially I matched data describing the scenario (external weather and temporal patterns) to the result (cave CO2 data), and trained the AI to learn a function relating them. When tested, the trained AI performed very well at predicting the CO2 levels from a sample of the October 2018-October 2019 data that I had held back, for which I obviously knew the CO2 levels. I then trained the AI on the entire October 2018-October 2019 dataset and predicted CO2 levels for November 2019 to the present, for which I had the required weather data. So see this as a prediction of what the CO2 dataset might look like when it gets released by the BCSC (they are currently working on this). On the held-back test set the algorithm achieved a mean squared error of 445, i.e. roughly ±21, which is quite good!

Though this assumes that the underlying relationship between external weather and internal cave CO2 levels is the same for November 2019 to the present as it was in the October 2018-October 2019 data from which the AI learned. Another assumption is that external weather and temporal patterns are the best predictors of CO2 for this problem.
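
A minimal sketch of that evaluate-then-retrain workflow, continuing from the dataset frame above (the CO2 column name, split fraction and model settings are stand-ins, not my exact values):

Code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X = dataset.drop(columns=["co2"])  # weather + temporal features
y = dataset["co2"]                 # hypothetical CO2 column name

# Hold back the most recent slice for testing; no shuffling, as the
# observations are ordered in time
split = int(len(dataset) * 0.8)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X.iloc[:split], y.iloc[:split])
print("test MSE:", mean_squared_error(y.iloc[split:], model.predict(X.iloc[split:])))

# Once the generalisation error looks acceptable, retrain on the whole
# Oct 2018-Oct 2019 dataset and predict forward for the unreleased period
model.fit(X, y)
predicted_co2 = model.predict(X_future)  # X_future: same features for Nov 2019 onwards (hypothetical)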

I used the following parameters as inputs to the algorithm (a sketch of building the temporal ones follows the list):
atmospheric pressure
wind speed and direction
dew point temperature
temperature
soil moisture content
precipitation
relative humidity
sensible and latent heat flux
hour of day
day of week
week of year
month of year
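
A minimal sketch of building the temporal features above from a pandas DatetimeIndex (the weather variables come straight from the reanalysis output, so only the calendar ones need constructing):

Code:
import pandas as pd

index = pd.date_range("2018-10-01", "2019-10-01", freq="10min")
features = pd.DataFrame(index=index)
features["hour_of_day"] = index.hour
features["day_of_week"] = index.dayofweek
features["week_of_year"] = index.isocalendar().week.to_numpy()
features["month_of_year"] = index.month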

Just remember that while I can predict far into the future, things will obviously have changed from 2019 to 2020 because no visitors have been present in the showcave, plus whatever else might have changed that affects CO2. So I would expect my prediction to fall over at some point unless the AI is retrained on recent data to learn any changes.

This is just an example of what you can do with AI and some data engineering in Python. I use this sort of thing in my day job to solve all sorts of problems, so I thought I'd give it a go for caves. If anyone has any interesting problems, I'd love to take a look. Or if anyone has any questions, just give me a shout.

Cheers
Ed

[Image: Training CO2 data]

[Image: Predicted Future CO2]
 

mrodoc

Well-known member
It would be interesting to know how few of those parameters would give just a crude prediction of likely levels for a specific cave, expressed as a CO2 percentage. Is it possible to ascertain which have the most effect on levels? Then it could become a useful tool.
 

Edwardov

Member
mrodoc said:
It would be interesting to know how few of those parameters would give just a crude prediction of likely levels for a specific cave, expressed as a CO2 percentage. Is it possible to ascertain which have the most effect on levels? Then it could become a useful tool.

This model might not predict well for other caves; it needs to be trained for each cave, with external weather data and internal logging data, to learn a function relating them that is then used for prediction.

Yes, it is possible to do that, but the "importance" of each variable in the AI's "mind" is not quite what you want. Importance, as the AI defines it, is how well that variable can be used to split the data into regions where the AI knows the result from past experience (training). So while a variable might have a high importance in the prediction process, that does not mean the variable is important in terms of a causal relationship with the outcome. This is in addition to the "correlation does not mean causation" problem, which also has to be taken into account with machine learning.

Aside from that, the more "important" the variable, the better it allows the AI to segment the data space into areas where it knows the result from past experience. Removing less important variables degrades accuracy, but you can still get a good enough answer from the remaining, more important ones. There is a point where you have removed just enough variables to leave the optimal amount of information for the best accuracy, and any further removal makes things worse. There is a balance to be maintained.
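
A minimal sketch of reading those importances out of a scikit-learn forest, using synthetic stand-in data (the real feature table would be the weather and temporal variables listed above):

Code:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: the result depends on pressure and temperature only
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 4)),
                 columns=["pressure", "wind_speed", "temperature", "rainfall"])
y = 2 * X["pressure"] - X["temperature"] + rng.normal(size=1000)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, highest first; dropping the tail trades
# a little accuracy for a simpler model, so retest after each removal
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))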
 

Edwardov

Member
dmcfarlane said:
Can we see both data sets plotted on the same axes for easier comparison?

I'm not sure what you would gain from that; the plots cover two different time periods. Overlaying them might give the wrong impression that one is a prediction of the other, which isn't the case. The CO2 data I have predicted have not yet been released, so we can't yet see how well the prediction did "in the wild".
 

Bob Mehew

Well-known member
I take your point about "correlation does not mean causation": precipitation is one potential parameter, but stream flow within the cave and drip rate might provide better correlation, if not indeed causation. (I suspect a lot of the CO2 comes from degassing of water which has percolated through the limestone; see https://caves.org/pub/journal/PDF/V68/v68n1-Baldini.pdf for example.)

Can you suggest any 'simple' background reading material for this AI process?
 

Edwardov

Member
Bob Mehew said:
I take your point about "correlation does not mean causation": precipitation is one potential parameter, but stream flow within the cave and drip rate might provide better correlation, if not indeed causation. (I suspect a lot of the CO2 comes from degassing of water which has percolated through the limestone; see https://caves.org/pub/journal/PDF/V68/v68n1-Baldini.pdf for example.)

Can you suggest any 'simple' background reading material for this AI process?

This is a great book that I thoroughly recommend for covering the basics and seeing how these things work:

https://www.amazon.co.uk/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646/ref=pd_lpo_14_t_0/259-3590597-3385646?_encoding=UTF8&pd_rd_i=1492032646&pd_rd_r=b3c1e0c2-554d-42ea-93a8-1ae1aca6e893&pd_rd_w=uSdw1&pd_rd_wg=wsEHi&pf_rd_p=7b8e3b03-1439-4489-abd4-4a138cf4eca6&pf_rd_r=5ANXD16GD9C71M4KJCVK&psc=1&refRID=5ANXD16GD9C71M4KJCVK
 

Chocolate fireguard

Active member
Edwardov said:
So while a variable might have a high importance in the prediction process, that does not mean the variable is important in terms of a causal relationship with the outcome. This is in addition to the "correlation does not mean causation" problem, which also has to be taken into account with machine learning.

I don't understand that.
It seems to me that you are saying the same thing in both of those sentences, i.e. that just because there is a mathematical correlation between 2 things it does not mean that either plays any part in causing the other.
It's not unusual for me to misunderstand stuff, so would you explain and give examples?

Also, I did wonder if you could miss out dew point from your list of parameters - with temperature, relative humidity and atmospheric pressure I think you have all the things that fix the dew point, so it's not giving any information, while presumably taking up "space".
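
For illustration, the Magnus approximation (a standard formula with the usual 17.62/243.12 coefficients, nothing from the model above) pins the dew point down from temperature and relative humidity alone:

Code:
import math

def dew_point_c(temp_c: float, rel_humidity_pct: float) -> float:
    """Approximate dew point (deg C) via the Magnus formula."""
    b, c = 17.62, 243.12  # Magnus coefficients over water
    gamma = math.log(rel_humidity_pct / 100.0) + b * temp_c / (c + temp_c)
    return c * gamma / (b - gamma)

print(dew_point_c(15.0, 80.0))  # roughly 11.6 deg C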
 

Edwardov

Member
Chocolate fireguard said:
Edwardov said:
So while a variable might have a high importance in the prediction process, that does not mean the variable is important in terms of a causal relationship with the outcome. This is in addition to the "correlation does not mean causation" problem, which also has to be taken into account with machine learning.

I don't understand that.
It seems to me that you are saying the same thing in both of those sentences, i.e. that just because there is a mathematical correlation between 2 things it does not mean that either plays any part in causing the other.
It's not unusual for me to misunderstand stuff, so would you explain and give examples?

Also, I did wonder if you could miss out dew point from your list of parameters - with temperature, relative humidity and atmospheric pressure I think you have all the things that fix the dew point, so it's not giving any information, while presumably taking up "space".

So with machine learning there are two things you need to take into account:

Correlation and causation - just because something correlates with the result does not mean that it causes the result. In fact, if a single variable correlates strongly with the result, that can be a warning sign that we are cheating, because it is more likely that a combination of variables, none of which individually correlates with the result, explains it. Confused yet? :p

Variable importance - just because a variable is important for predicting the result does not mean that the variable is important to (or contributes the most to) the causation of the result, or even that it correlates with the result.

So regarding your point about removing features because they are technically represented by combinations of other variables: we still want to keep them. Essentially we want to generate as many candidate variables as possible, as long as none of them leaks the result (the "cheating" problem above). Within the whole machine learning workflow there is a step devoted to "trimming off the fat", where we remove variables that don't contribute much information to the prediction. But how would we know whether something was important unless we created it and tried it? So we create lots of combinations and see what sticks.
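
A small synthetic illustration of that last point - two variables that individually show almost no correlation with the result can, in combination, predict it well:

Code:
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=5000), rng.normal(size=5000)
y = x1 * x2  # the result is a pure interaction of the two variables

print(np.corrcoef(x1, y)[0, 1])  # near zero
print(np.corrcoef(x2, y)[0, 1])  # near zero

X = np.column_stack([x1, x2])
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:4000], y[:4000])
print(model.score(X[4000:], y[4000:]))  # well above zero: together they explain the result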
 

2xw

Active member
Any more info on your methods? Which AI did you use, and what did you undertake the analysis in (C, Python, R, what)?

Thanks, this is interesting. I presume that as the amount of data increases, the model becomes better trained.
 

Edwardov

Member
Benfool said:
Going by the book the OP mentioned, I'm guessing an LSTM, using Keras and/or TensorFlow within Python.

B

2xw said:
Any more info on your methods? Which AI did you use, and what did you undertake the analysis in (C, Python, R, what)?

Thanks, this is interesting. I presume that as the amount of data increases, the model becomes better trained.

This is in Python, using a random forest regressor from scikit-learn. Neural networks are only useful when tackling unstructured data (think images, sound, video, etc.), and people love to use them where they aren't needed just for the hype of using neural networks. This is structured data - numeric tables - where typical machine learning algorithms generally win over neural networks.

And yes, you are correct: once enough data have been collected to produce a viable solution (validated through testing), the entire dataset is used to train the algorithm, which can only help it generalise better. If things are subject to change, then retraining is needed on the previous dataset (if it still applies) plus any newer data that needs to be learned. Think about learning house prices from the attributes of a house. Maybe in the past the number of rooms was important. What if suddenly the size of the bathroom matters more? We would need to add this attribute and retrain the AI to learn the new relationship, because buying behaviour has changed.
 

Benfool

Member
Ah ha! You're using my sort of ML then! I have over 10 years' experience in data science and have never trained a neural network - certainly for tabular data there are much better algorithms out there. However, as you have what looks like time series data, I thought an LSTM would be a perfect solution, although I'm no expert.

Have you tried XGBoost instead of a random forest? I personally find that XGBoost, which is an implementation of gradient boosting, outperforms every other machine learning algorithm on the market for this type of data. Over the last 4 years or so it's basically the only algorithm I've used in a production environment for the problems I've solved.
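
A minimal sketch, assuming the xgboost package and hypothetical training arrays (not my production settings):

Code:
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=500,    # boosting rounds
    learning_rate=0.05,  # shrinkage applied to each round
    max_depth=6,         # depth of each individual tree
)
model.fit(X_train, y_train)          # hypothetical feature/target arrays
predictions = model.predict(X_test)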

Also, I'd be very wary of using all your data to train the model and relying on n-fold cross-validation for testing. I always hold data back for verification, after using cross-validation for initial testing; that way I know the model I've trained will generalise. I like to use my most recent data for this testing, so I can be confident that nothing has changed.
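
A minimal sketch of that validation scheme, assuming scikit-learn and hypothetical X/y arrays ordered oldest to newest:

Code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

holdout = int(len(X) * 0.9)             # keep the newest 10% untouched
X_dev, y_dev = X[:holdout], y[:holdout]

# Time-ordered cross-validation: each fold trains on the past only
cv = TimeSeriesSplit(n_splits=5)
model = RandomForestRegressor(n_estimators=200, random_state=0)
print(cross_val_score(model, X_dev, y_dev, cv=cv, scoring="neg_mean_squared_error"))

# Final verification on the most recent, never-touched slice
model.fit(X_dev, y_dev)
print(model.score(X[holdout:], y[holdout:]))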

B
 

Edwardov

Member
Benfool said:
Ah ha! You're using my sort of ML then! I have over 10 years' experience in data science and have never trained a neural network - certainly for tabular data there are much better algorithms out there. However, as you have what looks like time series data, I thought an LSTM would be a perfect solution, although I'm no expert.

Have you tried XGBoost instead of a random forest? I personally find that XGBoost, which is an implementation of gradient boosting, outperforms every other machine learning algorithm on the market for this type of data. Over the last 4 years or so it's basically the only algorithm I've used in a production environment for the problems I've solved.

Also, I'd be very wary of using all your data to train the model and relying on n-fold cross-validation for testing. I always hold data back for verification, after using cross-validation for initial testing; that way I know the model I've trained will generalise. I like to use my most recent data for this testing, so I can be confident that nothing has changed.

B

Yes, for some problems gradient boosted trees perform better, but the training time can be very costly, which is why I avoided them at first - this is a proof of concept rather than a final product.

I never said that I used cross-validation for testing. You are right that cross-validation alone does not show how well an algorithm generalises - that can only be known by testing against a held-out test set. The model was tested for its generalisation error on the test set, then trained on the entire learning dataset (train and test) for deployment. You are right that we then can't know how well this final algorithm does (since we no longer have a test set to check against), but in this case we can take the chance that more data will improve it. You noted holding out the newest data for testing (perfectly correct, honouring the time order of the observations). So why not then learn from this newest data, along with all of the previous data, to predict the next data to be collected? That is why, in certain circumstances, once you have an idea of how well the algorithm generalises, you train on the entire dataset: newer data might contain patterns that the previous training data lacked, and those patterns matter for predicting newer data precisely because everything is ordered in time, as you rightly noted.
 