Tuesday, December 06, 2011

A back channel confirms that I'm right... sort of

Through a little circuitous back channel I received an unofficial follow-up to a blog post about the unused machine code in the GCHQ Code Challenge part 2. The follow-up assured me that the unused code was left over after some clean-up was done, and that the rest of the data in the file was random filler (as I'd already heard).

But as I'd already determined that it was not actually random (at least at one level), I sent a carrier pigeon out with a question about my follow-up blog post.

Listening for secret messages transmitted during The Archers I received a reply indicating that I had indeed 'broken' the encryption on that stream of ASCII data (by guessing the algorithm and reverse engineering the key stream), but that the underlying data was actually random ASCII generated and encrypted using the following Python code:
b = ''.join([chr(randint(0x20, 0x7f)) for i in range(0, 16 * 7)])
c = codec(b, 0x1f, s=5)
The first part generates 112 bytes of random ASCII text and the codec function is apparently the encryption function with the key I had identified (see the 0x1f and 5).

There are a couple of oddities about this code. The randint call could generate 0x7F, which is non-printable (and doesn't appear in the decrypted text), and the code generates precisely 112 bytes of data (whereas part 2 actually contains two blocks of 112 and one of 102).
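
The first oddity can be put into rough numbers. The sketch below reproduces only the generation line from the quoted snippet (the body of codec was not given, so it is left out) and estimates how likely it is that 0x7F never turns up. It assumes every byte is drawn independently and uniformly by randint(0x20, 0x7f); the names random_filler and prob_no_del, and the seed, are mine and purely illustrative.

```python
import random

LO, HI = 0x20, 0x7F  # bounds passed to randint in the quoted snippet


def random_filler(n=16 * 7, seed=None):
    """Generate n characters the way the quoted line does (randint is inclusive)."""
    rng = random.Random(seed)  # seed is only here to make runs repeatable
    return ''.join(chr(rng.randint(LO, HI)) for _ in range(n))


def prob_no_del(n):
    """Chance that n independent uniform draws from 0x20..0x7F never hit 0x7F."""
    values = HI - LO + 1  # 96 possible values, 0x7F (DEL) included
    return ((values - 1) / values) ** n


block = random_filler(seed=2011)
print(len(block), block.count('\x7f'))  # block length, number of DEL characters
print(round(prob_no_del(112), 3))       # a single 112-byte block
print(round(prob_no_del(326), 3))       # 112 + 112 + 102 bytes, as seen in part 2
```

Under those assumptions a single block without a 0x7F is unremarkable, while all three blocks together lacking one is a good deal less likely. That proves nothing on its own, but it is consistent with the feeling that the quoted code is not quite the whole story.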

I questioned that by leaving a note in a tree in St. James's Park and received a response by exchanging briefcases at King's Cross Station to the effect that the 102 byte block had simply been truncated to make the code look interesting.

Of course, that could all be completely non-suspicious, but having been told first that it was random filler and then that it was actually blocks of encrypted random printable ASCII filler, it wouldn't surprise me if the truth was even more complex (and perhaps rather mundane).

But why go to all this trouble on 'random filler' text? And why update the web site to say "The Challenge Continues"?

I am left slightly baffled, which I'd imagine is just how people working in secret places would like it to be.


ste9an05 said...

Some of us are considering possible steganography in images/codebreaker.jpg as it has symmetry, which is a little unexpected

Steve said...

I thought I'd drop you a comment on your interesting blog entries.

First up, I was bothered too about the unused data in part 2. I've independently verified that the "h75 h10 h01" sequence at h0132 makes sense, as it allows decryption to continue beyond the end of the segment should the message have been longer than the remaining 4 segments (64 bytes). I was a little more perplexed by "hCC" at h0140. Perhaps that invalid byte is there to mark a segment that should never be executed, but if so, the zero bytes that follow it make no sense. Similarly, it makes little sense that the data it decodes should start at h01c0, as the unencrypted byte-code in the first two segments decodes only segments h10-h14.

I concur with your view that the remaining sections (h0150-h01bf and h0200-h02ff) contain non-random data. In addition to the cyclic top-bit pattern you've documented, I note:

* The 'premature' end to non-zero data at h02d6 is reminiscent of the end-of-message marker that is preserved in memory from h01f3-h01ff.
* I'm suspicious about the fact that there are 26 trailing zeros, for two reasons: firstly because it matches the equal number of top-bit-clear;top-bit-set bits in your analysis of the three sections, and also because (simply) 26 is the number of letters in our alphabet, possibly hinting at some alphabetic code.
* The vast majority of the data has no two successive bytes equal. But, in the last 11 non-zero bytes there are three adjacent equal pairs... h9e at h02cb-h02cc and h2f at h02d2-h02d3 and h4e at h02d4-h02d5. It is also unexpected that the non-zero data terminates with two repeated pairs. This observation makes it very hard for me to conclude that a random source was encoded and truncated arbitrarily as 'filler'.
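
The repeated-pair observation is easy to check mechanically. The sketch below finds adjacent equal bytes and, for comparison, gives the number of such pairs one would expect if every byte were uniform and independent (an assumption, and not quite the same thing as encrypted printable ASCII). The function names are hypothetical, and the sample buffer only places the three pairs mentioned above at their stated offsets; the five bytes in between are made-up placeholders, not the real data.

```python
def find_equal_pairs(data: bytes, base: int = 0):
    """Return (address, value) for every position where a byte equals its successor."""
    return [(base + i, data[i])
            for i in range(len(data) - 1)
            if data[i] == data[i + 1]]


def expected_equal_pairs(n: int) -> float:
    """Expected adjacent equal pairs in n uniform, independent bytes."""
    return max(n - 1, 0) / 256


# Placeholder tail: only the 9e/2f/4e pairs come from the comment above;
# 0x01..0x05 stand in for the five bytes whose values are not quoted.
tail = bytes([0x9E, 0x9E, 0x01, 0x02, 0x03, 0x04, 0x05, 0x2F, 0x2F, 0x4E, 0x4E])

for address, value in find_equal_pairs(tail, base=0x02CB):
    print(f"h{value:02x} repeated at h{address:04x}")
print(expected_equal_pairs(len(tail)))
```

Under the uniform assumption, 11 bytes would be expected to contain far less than one such pair, which is the sense in which three of them look unexpected.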

An avenue I explored today was to note, as you did, that the byte code decodes by exclusive or with a sequence which could be generated from init+step*i, where i indicates the index of the byte and init and step are parameter bytes... the first decode can be parametrised init=hAA step=1 and the second decode init=0 and step=3. I brute-force searched this space looking for strings with significant sequences of printable ASCII... but found nothing interesting. I concluded that, if these sections do contain messages that can be decoded, I don't think it uses the same encryption scheme.
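
For anyone wanting to repeat that search, here is a minimal sketch of the kind of brute force described. It assumes the keystream byte at index i is (init + step*i) reduced modulo 256, treats 0x20-0x7E as printable, and ranks candidates by their longest printable run; the function names and the threshold are my own choices, not anything taken from the challenge.

```python
PRINTABLE = frozenset(range(0x20, 0x7F))  # 0x20..0x7E, i.e. excluding DEL


def xor_linear(data: bytes, init: int, step: int) -> bytes:
    """XOR each byte with (init + step*i) mod 256; applying it twice restores the input."""
    return bytes(b ^ ((init + step * i) & 0xFF) for i, b in enumerate(data))


def longest_printable_run(data: bytes) -> int:
    """Length of the longest run of consecutive printable ASCII bytes."""
    best = run = 0
    for b in data:
        run = run + 1 if b in PRINTABLE else 0
        best = max(best, run)
    return best


def brute_force(data: bytes, min_run: int = 16):
    """Try all 256*256 (init, step) pairs; return hits sorted by longest run first."""
    hits = []
    for init in range(256):
        for step in range(256):
            plain = xor_linear(data, init, step)
            run = longest_printable_run(plain)
            if run >= min_run:
                hits.append((run, init, step, plain))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits


if __name__ == "__main__":
    secret = b"hypothetical test message"      # stand-in plaintext, not challenge data
    cipher = xor_linear(secret, 0xAA, 1)        # same parameters as the first decode
    for run, init, step, plain in brute_force(cipher, min_run=len(secret))[:3]:
        print(hex(init), step, plain)
```

A longest-run score is a crude filter. It will miss plaintext that is not ASCII, and on short inputs it can let through accidental hits, so an empty result only rules out this particular scheme.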

Like yourself, I'm a bit baffled. I don't think the data is random; it seems odd to include these sections when all the rest of the information in stages one and two is used by the end of stage three. I can't believe that the data is there to make it harder to identify the position of the message, as the location of the message can be immediately obtained by identifying the first string of zero bytes.

Etienne de L'Amour said...

Thanks John. Bit disappointed at the challenge, in a way -- but at least, like you, I can now get some sleep and return to the day job. :).

Was hoping it might offer a number of different routes, rather than a "one track" approach, with differing resultant keywords, to make it more inter-disciplinary and to sort out the high fliers from the 99.9% of candidates who -- like me -- were "also rans".

Got totally the wrong skill set for them. vb.net helped a bit; php, js/ajax, mysql, firebird, apache .... Not much use here.

Was also hoping that the exe itself (which can be overwritten, of course, to get it to work, or fetched from localhost) would manipulate the "supposed" keyword sent in the clear ... Nada.

Steve said...

I previously said: "I'm suspicious about the fact that there are 26 trailing zeros" - but, evidently, I can't count - there are 42. This completely undermines my 'alphabetic cypher' hypothesis - but the other points, I think, were valid.

Etienne de L'Amour said...

Scrub that last comment. Got the stage 2 VM working in PHP on my server, and ste9an05 found a great link to a graphical implementation in JS. Don't know why that idea totally passed me by? :)

Junk said...

Hello guys,
I'm amazed by your technical skills. It's a really beautiful example of hacker thinking. See what caught my eye...

What can these quests tell us about the author?
I watched the videos from Dr Gareth Owen, and the knowledge needed to crack this is far beyond the thinking of an ordinary technical person.
- Use of everything from low-level assembler to high-level Java programming. (Such wide programming language skills are very rare)
- The whole contest is in English. (No Chinese, no Russian. Seeking an English person)
- Use of a VM, which is quite high tech
- No brute force needed to solve the quests
- No cloud quest (Interesting, since there is a trend towards using it)
- Presence of Facebook, Twitter, Google+. (The contest is set up to be spread)
- All quests are connected through the web (which is a major fault)

Some of you may have seen the movie Mercury Rising. This quest may not really be about getting a job, but about finding out how many people can crack it.

Different approach to get to the end

As I described above, there is a major fault in that all quests are linked via the canyoucrackit.co.uk web site. This means that compromising this server would get you to the end, even without solving the quests.
- By entering a dummy code on http://canyoucrackit.co.uk you get an /index.asp hint. This tells you that the server is running Microsoft ASP scripting. And we all know that it's hackable.
- The next step might be to run an eEye Retina scan on canyoucrackit.co.uk, find vulnerabilities and get into the web server.

The next flaw is that the results on the web are static and all solutions lead to one link.

By putting http://canyoucrackit.co.uk into the W3C validator http://validator.w3.org/ you will find that the code is not valid... To my surprise!

And even more: by doing a Google search for "site:canyoucrackit.co.uk" you can find links to all quests
http://canyoucrackit.co.uk/soyoudidit.asp (1st page)
http://www.canyoucrackit.co.uk/15b436de1f9107f3778aad525e5d0b20.js (1st page)
http://canyoucrackit.co.uk/hqDTK7b8K2rvw/a3bfc2af/d2ab1f05/da13f110/key.txt (3rd page)
- No matter how stupid this is... The easiest solution is often the best.

It is also interesting that there is a robots.txt, yet the site is indexed by Google. Possibly they added it later.

This leads me to think that there are 2 teams working on this task.
a) Quest team - a very knowledgeable assembler guy, Java guy, VM guy, Wireshark (LAN) guy
b) Web team - which is very sloppy

And finally, you don't even need to complete the quest or do the Google search to get to the end.
- On the GCHQ site http://www.gchq.gov.uk click on "Careers" then "Click here to visit our recruitment portal"
- Then "Jobs" - "Cyber Security Specialists" and you are there!

The upshot is that this is all just PR... so sad :-(

NivagSwerdna said...

Having solved this myself over the weekend I noted with interest your observation that there might be further data. Along the same lines as Steve I note that the encryption is a simple XOR with a linear increase to the key; by scanning the bytes in the VM memory and looking for suitable candidate combinations of 3 letters I find only the previously discovered Part 2 plaintext. The absence of a 42 42 42 42 signature implies that it doesn't warrant a return to the deadbeef either.
I think I have to conclude that there is no further message. I hope I'm proven wrong in a few days.