Sunday, August 01, 2010

Data horribilia: the HARRY_READ_ME.txt file

(Resurrected from the DK archives, this post was originally posted on 23/11/2009 at the height of CRUdgate—and gained over 24,000 unique visitors in one day. It was neatly followed up by The Pedant-General's remorseless deconstruction of AGW logic.)

With the CRU emails now well picked over, it seems that some people—mainly techies—are really starting to dig into the data files. These are, as far as we can tell, temperature data, modelling results and other such useful material: the working files produced and used by the CRU teams, along with considerable amounts of information on—and code from—the actual computer modelling programmes themselves.

In other words, these are the guts of CRU's actual computer models—the data, the code and the applications.

And they are, by all accounts, a total bloody mess.

++++ START INSERT ++++

So, come with me on a wonderful journey as the CRU team realise that not only have they lost great chunks of data but also that their application suites and algorithms are total crap; join your humble Devil and Asimov as we dive into the HARRY_READ_ME.txt file (thanks to The Englishman) and follow the trials and tribulations of Ian "Harry" Harris as he tries to recreate the published data because he has nothing else to go on!

Thrill as he "glosses over" anomalies; let your heart sing as he gets some results to within 0.5 degrees; rejoice as Harry points out that everything is undocumented and that, generally speaking, he hasn't got the first clue as to what's going on with the data!

Chuckle as one of CRU's own admits that much of the centre's data and applications are undocumented, bug-ridden, riddled with holes, missing, uncatalogued and, in short, utterly worthless.

And wonder as you realise that this was v2.10 and that, after this utter fiasco, CRU used the synthetic data and wonky algorithms to produce v3.0!

You'll laugh! You'll cry! You won't wonder why CRU never wanted to release the data! You will wonder why we are even contemplating restructuring the world economy and wasting trillions of dollars on the say-so of data this bad.

++++ END INSERT ++++

Via the ever-prolific Tom Nelson, Soylent Green has picked up some geek reports on this material.
Got this from reader, Glenn. I’m out of my depth trying to read the code—and apparently so were several folks at CRU. If what he, and the techies at the links, say is true, it’s no wonder they had to spin this for 10 years—it’s all absolute bullshit.

Here’s Glenn’s take with links:
The hacked e-mails were damning, but the problems they had handling their own data at CRU are a dagger to the heart of the global warming “theory.” There is a large file of comments by a programmer at CRU called HARRY_READ_ME documenting that their data processing and modeling functions were completely out of control.

They fudged so much that NOTHING that came out of CRU can have ANY believability. If the word can be gotten out on this and understood, it is the end of the global warming myth. This is much bigger than the e-mails. For techie takes on this see:

Link 1

Link 2

To base a re-making of the global economy (i.e. cap-and-trade) on disastrously and hopelessly messed up data like this would be insanity.

Now, this stuff really is beyond me, but I have looked at the links given above and, from what little I can decipher, there do seem to be some issues.

The main issue is that the techies at CRU don't seem to have been able to tell what the hell was going on with the code, let alone anything else.

I shall quote user Asimov—from the top of the page of Link 1 [above]—to give you a flavour of the confusion that seems to have been rife at CRU.
There's a very disturbing "HARRY_READ_ME.txt" file in documents that APPEARS to be somebody trying to fit existing results to data and much of it is about the code that's here. I think there's something very very wrong here...

This file is 15,000 lines of comments, much of it copy/pastes of code or output by somebody (who's harry?) trying to make sense of it all....

Here's two particularly interesting bits, one from early in the file and one from way down:
7. Removed 4-line header from a couple of .glo files and loaded them into Matlab. Reshaped to 360r x 720c and plotted; looks OK for global temp (anomalies) data. Deduce that .glo files, after the header, contain data taken row-by-row starting with the Northernmost, and presented as '8E12.4'. The grid is from -180 to +180 rather than 0 to 360.

This should allow us to deduce the meaning of the co-ordinate pairs used to describe each cell in a .grim file (we know the first number is the lon or column, the second the lat or row - but which way up are the latitudes? And where do the longitudes break?

There is another problem: the values are anomalies, wheras the 'public' .grim files are actual values. So Tim's explanations (in _READ_ME.txt) are incorrect...

8. Had a hunt and found an identically-named temperature database file which did include normals lines at the start of every station. How handy - naming two different files with exactly the same name and relying on their location to differentiate! Aaarrgghh!! Re-ran anomdtb:

Uhm... So they don't even KNOW WHAT THE ****ING DATA MEANS?!?!?!?!

What dumbass names **** that way?!

Talk about cluster****. This whole file is a HUGE ASS example of it. If they deal with data this way, there's no ****ing wonder they've lost **** along the way. This is just unbelievable.
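
For the non-techies following along: the ".glo" format Harry is having to reverse-engineer is, by his own description, nothing more exotic than a whole-planet grid dumped to text: a few header lines, then 360 rows of 720 half-degree cells written from the northernmost band downwards, eight Fortran-formatted values per line, longitudes running -180 to +180. A rough sketch of reading one might look like this (the filename, the four-line header and the variable names are simply my own guesses from his notes; this is an illustration, not CRU's code):

import numpy as np

# Rough sketch of reading a .glo file as Harry describes it: skip the short
# header, then read the 360 x 720 global field row by row, starting from the
# northernmost band. Values are written eight per line in Fortran E12.4 format,
# so a simple whitespace split will usually do.
# Filename, header length and names are assumptions made for illustration only.
def read_glo(path, header_lines=4, nrows=360, ncols=720):
    with open(path) as f:
        lines = f.readlines()[header_lines:]
    values = [float(v) for line in lines for v in line.split()]
    return np.array(values).reshape(nrows, ncols)   # row 0 = northernmost band

# grid = read_glo("some_anomalies.glo")   # hypothetical filename
# grid.shape == (360, 720), columns running from longitude -180 to +180

The format itself is trivial, in other words; the point is that none of it was written down, so Harry has to deduce it by loading the thing into Matlab and squinting at the plot.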

And it's not just one instance of not knowing what the hell is going on either:
The deduction so far is that the DTR-derived CLD is waaay off. The DTR looks OK, well OK in the sense that it doesn;t have prominent bands! So it's either the factors and offsets from the regression, or the way they've been applied in dtr2cld.

Well, dtr2cld is not the world's most complicated program. Wheras cloudreg is, and I immediately found a mistake! Scanning forward to 1951 was done with a loop that, for completely unfathomable reasons, didn't include months! So we read 50 grids instead of 600!!! That may have had something to do with it. I also noticed, as I was correcting THAT, that I reopened the DTR and CLD data files when I should have been opening the bloody station files!! I can only assume that I was being interrupted continually when I was writing this thing. Running with those bits fixed improved matters somewhat, though now there's a problem in that one 5-degree band (10S to 5S) has no stations! This will be due to low station counts in that region, plus removal of duplicate values.

I've only actually read about 1000 lines of this, but started skipping through it to see if it was all like that when I found that second quote above somewhere way down in the file....

CLUSTER.... ****. This isn't science, it's gradeschool for people with big data sets.

Now, I'm no climate modeller or even a professional coder, but it does seem to me that there is just a teeny weeny bit of confusion evidenced in the HARRY_READ_ME.txt file. I mean "teeny weeny" in the sense that whoever wrote this file obviously hadn't got a fucking clue what was going on—and not for want of trying.
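
To spell out that "scanning forward" bug in terms even I can follow: the monthly grids are stored one after another, so if your skip-ahead loop counts years and forgets the twelve months inside each of them, you read a twelfth of the data you meant to: fifty grids instead of six hundred. A toy sketch of the mistake (nothing to do with the actual Fortran, and the year range is my own invention):

# Toy illustration of the class of bug Harry describes: monthly grids stored
# sequentially, but the scan-forward loop steps by year and forgets the months.
# The year range is invented purely to show the arithmetic.
def grids_to_skip_buggy(start_year, target_year):
    skipped = 0
    for year in range(start_year, target_year):
        skipped += 1                  # one grid per year -- wrong
    return skipped

def grids_to_skip_fixed(start_year, target_year):
    skipped = 0
    for year in range(start_year, target_year):
        for month in range(12):       # twelve monthly grids per year
            skipped += 1
    return skipped

print(grids_to_skip_buggy(1901, 1951))   # 50  -- what the code actually read
print(grids_to_skip_fixed(1901, 1951))   # 600 -- what it should have read

An error of a factor of twelve in which grids get read at all, found only because Harry happened to go looking.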

But there's more—here's another taster, a few posts down from the one above, from Asimov's analysis of the HARRY_READ_ME.txt (I'm trying to give you a hint about what's going on: the paydirt's coming soon!).
Christ. It gets better.
So.. we don't have the coefficients files (just .eps plots of something). But what are all those monthly files? DON'T KNOW, UNDOCUMENTED. Wherever I look, there are data files, no info about what they are other than their names. And that's useless.. take the above example, the filenames in the _mon and _ann directories are identical, but the contents are not. And the only difference is that one directory is apparently 'monthly' and the other 'annual' - yet both contain monthly files.

Lets ignore the smoking gun in a legal sense, and think about the scientific method for just a moment....

I do believe this is more than one gun and there's some opaque mist coming from the "fun" end. I won't claim it's smoke, but holy ****, this is incredible.

I think that we are all starting to get an impression of what is going on here, right? Piles and piles of undocumented and inconsistent datasets, and the techies at CRU utterly baffled by all of it.

But what are they actually trying to do—what is this HARRY_READ_ME.txt all about...? Yep, it's over to Asimov, a few posts down again (what can I say: I like the man's style!)...
I'm just absolutely STUNNED by this ****. **** the legal stuff. RIGHT HERE is the fraud.
These are very promising. The vast majority in both cases are within 0.5 degrees of the published data. However, there are still plenty of values more than a degree out.

He's trying to fit the results of his programs and data to PREVIOUS results.

Yup, somewhere along the way, some stuff has got lost or corrupted. Badly.

This programmer—Ian "Harry" Harris—is attempting to recreate... What? The data? The applications and algorithms that ran the original data? It seems to be the latter, because Harry carries on.
TMP has a comforting 95%+ within half a degree, though one still wonders why it isn't 100% spot on..

DTR fares perhaps even better, over half are spot-on, though about 7.5% are outside a half.

The percentages below are percentages of accuracy:
However, it's not such good news for precip (PRE):
Percentages: 13.93 25.65 11.23 49.20

21. A little experimentation goes a short way..

I tried using the 'stn' option of anomdtb.for. Not completely sure what it's supposed to do, but no matter as it didn't work:

Oh yea, don't forget. He's getting 0.5 and 1 degree differences in results... while they are predicting temperatures to a supposed accuracy of tenths...

Unless I find something MUCH WORSE than what I've already posted, I'll leave the file for you to read and stop spamming the thread with this.
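
(A quick aside on what those percentages are measuring: Harry is regenerating the gridded values and counting, cell by cell, how many land close to the numbers CRU has already published. The check itself is about as simple as it gets; something along these lines, where the band boundaries are my guess at what his four columns mean, because the file never says:)

import numpy as np

# A sketch of the comparison Harry describes: regenerate a grid, difference it
# against the published one, and report what fraction of cells agree.
# The band boundaries (spot on, within 0.5, within 1.0, worse) are assumptions;
# HARRY_READ_ME never explains its four percentage columns.
def agreement(regenerated, published):
    diff = np.abs(np.asarray(regenerated) - np.asarray(published))
    total = diff.size
    return {
        "spot on":       100.0 * np.sum(diff == 0.0) / total,
        "within 0.5":    100.0 * np.sum((diff > 0.0) & (diff <= 0.5)) / total,
        "within 1.0":    100.0 * np.sum((diff > 0.5) & (diff <= 1.0)) / total,
        "more than 1.0": 100.0 * np.sum(diff > 1.0) / total,
    }

# e.g. agreement(regenerated_pre, published_pre) -> four percentages summing to 100

The point being that he is marking his own reconstruction against the already-published answer, and it still doesn't match.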

Needless to say, worse is to come...
Ok, one last bit to finish that last one off:
..knowing how long it takes to debug this suite - the experiment endeth here. The option (like all the anomdtb options) is totally undocumented so we'll never know what we lost.

22. Right, time to stop pussyfooting around the niceties of Tim's labyrinthine software suites - let's have a go at producing CRU TS 3.0! since failing to do that will be the definitive failure of the entire project..

I eagerly await more reading to find the results of that.

Oh, same here, Asimov: same here. Shall we see some more? Why not...
You'd think that where data was coming from would be important to them... You know, the whole accuracy thing..
The IDL gridding program calculates whether or not a station contributes to a cell, using.. graphics. Yes, it plots the station sphere of influence then checks for the colour white in the output. So there is no guarantee that the station number files, which are produced *independently* by anomdtb, will reflect what actually happened!!

Well I've just spent 24 hours trying to get Great Circle Distance calculations working in Fortran, with precisely no success. I've tried the simple method (as used in Tim O's, and the more complex and accurate method found elsewhere (wiki and other places). Neither give me results that are anything near reality. FFS.

Worked out an algorithm from scratch. It seems to give better answers than the others, so we'll go with that.

The problem is, really, the huge numbers of cells potentially involved in one station, particularly at high latitudes.

out of malicious interest, I dumped the first station's coverage to a text file and counted up how many cells it 'influenced'. The station was at 10.6E, 61.0N.

The total number of cells covered was a staggering 476!

Keep in mind how climate models work. They split the world up into cells and treat each cell as a single object... (Complexity thing, only way to get any results at all in reasonable times, even with supercomputers.)

Seriously, this really isn't good.
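
For the curious, the "Great Circle Distance" that ate Harry's twenty-four hours is just the standard how-far-apart-are-two-points-on-a-sphere calculation; the textbook haversine version looks roughly like this (a generic formula, not CRU's code, and the second point in the example is an arbitrary spot I picked purely to show the call):

from math import radians, sin, cos, asin, sqrt

# The standard haversine formula for great-circle distance: the textbook
# calculation Harry says he spent 24 hours failing to get working in Fortran.
# Generic formula, not CRU's code; 6371 km is a conventional mean Earth radius.
def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2.0) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2.0) ** 2
    return 2.0 * radius_km * asin(sqrt(a))

# Harry's example station sits at 10.6E, 61.0N; the second point is arbitrary.
print(great_circle_km(61.0, 10.6, 60.0, 15.0))   # roughly 265 km

That is the level of wheel being reinvented here: "Worked out an algorithm from scratch. It seems to give better answers than the others, so we'll go with that."
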
Bit more to add to the last, then off to bed, so I'll stop spamming. :P
Back to the gridding. I am seriously worried that our flagship gridded data product is produced by Delaunay triangulation - apparently linear as well.

As far as I can see, this renders the station counts totally meaningless.

It also means that we cannot say exactly how the gridded data is arrived at from a statistical perspective - since we're using an off-the-shelf product that isn't documented sufficiently to say that.

Why this wasn't coded up in Fortran I don't know - time pressures perhaps? Was too much effort expended on homogenisation, that there wasn't enough time to write a gridding procedure? Of course, it's too late for me to fix it too. Meh.

"too late for me to fix it"

I guess it doesn't matter that we're talking about data that's basically determining the way the WHOLE ****ING HUMAN RACE IS GOING TO LIVE for the next few CENTURIES?
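
And for anyone wondering what "Delaunay triangulation - apparently linear as well" means in practice: it is the bog-standard, off-the-shelf way of turning scattered station values into a grid. Triangulate the station locations, then read each grid point's value off the flat plane fitted over whichever triangle it falls in. Something like this, with invented station data (the point being that it is a generic black box, which is exactly Harry's complaint about not being able to say how the gridded numbers are arrived at):

import numpy as np
from scipy.interpolate import LinearNDInterpolator

# What "linear interpolation over a Delaunay triangulation" amounts to:
# triangulate the scattered station locations, then take each grid point's
# value from the plane fitted across the triangle it falls inside.
# The station positions and anomaly values below are invented for illustration.
stations = np.array([[10.6, 61.0],    # lon, lat -- first is Harry's example station
                     [15.0, 60.0],
                     [ 5.0, 58.5],
                     [12.0, 63.2]])
anomalies = np.array([0.3, -0.1, 0.6, 0.2])

interp = LinearNDInterpolator(stations, anomalies)   # builds the triangulation internally

lon, lat = np.meshgrid(np.arange(5.0, 15.5, 0.5), np.arange(58.5, 63.5, 0.5))
grid = interp(lon, lat)                              # NaN outside the stations' convex hull

Nothing wrong with the technique as such; the problem, as Harry says, is that with an undocumented off-the-shelf product nobody can say, from a statistical perspective, exactly how the gridded data is arrived at or what it does to the station counts.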



Like Asimov, I too must retire to bed—it's much too late, or much too early. I shall continue trawling both these threads and the data—I might try to find and post the entire HARRY_READ_ME.txt file for starters—but I would just like to add one quick comment...

I have tried to keep my language moderate throughout all of these CRU articles—the subject matter is way too important for these posts to be written off as being "too sweary"—but there really is only one response to all of this.

Fucking. Hellski.

UPDATE: there's more from the HARRY_READ_ME.txt file—posted by Asimov again.
The problem is that the synthetics are incorporated at 2.5-degrees, NO IDEA why, so saying they affect particular 0.5-degree cells is harder than it should be. So we'll just gloss over that entirely ;0)

ARGH. Just went back to check on synthetic production. Apparently - I have no memory of this at all - we're not doing observed rain days! It's all synthetic from 1990 onwards. So I'm going to need conditionals in the update program to handle that. And separate gridding before 1989. And what TF happens to station counts?

OH **** THIS. It's Sunday evening, I've worked all weekend, and just when I thought it was done I'm hitting yet another problem that's based on the hopeless state of our databases. There is no uniform data integrity, it's just a catalogue of issues that continues to grow as they're found.

Let me just repeat that final line:
There is no uniform data integrity, it's just a catalogue of issues that continues to grow as they're found.
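
(And just to spell out why "incorporated at 2.5-degrees" is such a headache: the published product is built on half-degree cells, so every one of those coarse synthetic boxes sits on top of a five-by-five block of them; one made-up number standing in for twenty-five cells, which is presumably why he would rather "gloss over that entirely". As one line of arithmetic:)

# The 2.5-degree vs 0.5-degree mismatch Harry mentions: each coarse synthetic
# box covers a 5 x 5 block of the half-degree cells the published grid uses.
coarse_step, fine_step = 2.5, 0.5
print(int(coarse_step / fine_step) ** 2)   # 25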

And I will just sign off with another comment from Asimov...
My god people, have you just been skipping over everything I've posted from that HARRY_READ_ME.txt file!?!?

The data itself is a HUGE unknown, even to the researchers themselves as they attempt to decode what's gone before.

Sure, the emails indicate the possibility (and certainty in some cases) of fraud. That one file PROVES HOW UNRELIABLE THE DATA ITSELF IS!!

They "lost" the original data?? I believe it now. v2.10 was run with a ****ton of code that was undocumented, made no sense and was FULL of bugs. Is v3.0 better when half the data from 1980 on is SYNTHETIC?!? Or when it used the output from the buggy 2.10 version (which is all they had) to produce NEW data?!?!

This is a ****ing joke. The emails are FAR from the most damning thing in this. I can't wait for somebody familiar with the code to start going over it and seeing how many "So we'll just gloss over that entirely ;0)" instances exist.

What the hell has been going on over at CRU...? No wonder they didn't want to release their data...

I shall try to find some time to make a more succinct posting at some point over the next few days but, believe me, the main upshot is that none of the CRU data is worth a single shiny penny.
