Lossy Text Compression - BitSqueeze algorithm - Silly! ^^

Discuss everything else here

Moderator: ArcWolf

Post Reply
User avatar
Frooxius
Posts: 311
Joined: Tue Jul 28, 2009 6:00 pm
Contact:

Lossy Text Compression - BitSqueeze algorithm - Silly! ^^

Post by Frooxius »

Prequel
OIg*OAABqIYPZn!ZdsQHu%KOdKIls!rcyGAgT/aMAPoqwoRgn#ZdPPXoQos!aDzrlfS$qAiQ+B
Hello there, recently I've been working day/night on some serious projects, so I decided to make something funny/silly instead and inspiration was something I said in one IRC before: that I can write seven letter word with six characters.

Well this thought lead to something that I call Lossy Text compression. I created algorithm called BitSqueeze, because it literally squeezes an extra bit from every character written. It's capable of shortening a bit even short sentences or even words. It also works only with printable characters, making it suitable for easy copying, pasting and spreading of the compressed text.

NOTE: This is just silly, non-serious project made in a few hours, don't take it too seriously, though it can still have quite educational value as proof of concept and so far, people say it's fun to play with :D

The principle

The idea behind this is that when I can loose some information from the original text - mostly capitalization and special characters, I will get a reduced input set, however I can express it using the larger output set - more characters, therefore compressing it a little by loosing some information and then using information that's normally lost, to encode it.

Algorithm translates input string to just 32 characters - 5 bits, that are imputed into a bitstream. Once there's at lest 6 bits in the bitstream, then one of 64 characters from the output set is selected. Decompression works in an opposite way: it converts the input characters into a bit stream, each character adding 6 bits to the stream and then new characters are taken from this stream by using only 5 bits. Thus this algorithm saves 1 bit per character, while loosing capitalization and special characters.

However, there's a problem: I didn't squeeze whole alphabet and numbers into 32 characters, so 32th character is actually special extensive character, that's followed by 3 additional bits, which encode another 8 characters.

I have ideas for even better algorithms, but this was done for fun and as proof of concept, so don't take it too seriously :3

Download Executable + GUI

I made a GUI version of this, so you can play with it easily and have some fun while compressing your text :D You'll need .NET framework. The GUI is written in C++ too.

DOWNLOAD

Image

C++ sourcecode

I've written the algorithm in C++ language, the source (including command-line test of the algorithm) is freely available, feel free to study it and play with it if you want. It's not really very well polished, nor object oriented, because it was done without planning and quickly, just for fun x3

Code: Select all

/*	********************************************
	Lossy Text Compression - BitSqueeze Method
	Written by Tomas "Frooxius" Mariancik September 4 2011
	This is done simply for fun as proof of concept that
	I can write seven letter word with six characters x3
	Don't take this too seriously :3
	Website: frooxius.solirax.org
	E-mail: [email protected]
  ********************************************* */

#include <iostream>
using namespace std;

// For x86 abd some others
typedef unsigned short int UINT16;
#define CHECK_PLATFORM

/* ****** Method BitSqueeze ****** */
struct BitBuffer;
void LossyTextCompressBS(char *input, char *outbuf, unsigned int outbuf_size);
bool BitBufferWrite(UINT16 value, unsigned int bits, BitBuffer &bitbuf, char *&outbuf);
void LossyTextDecompressBS(char *input, char *outbuf, unsigned int outbuf_size);
UINT16 BitBufferRead(unsigned int bits, BitBuffer &bitbuf, char *&input);
UINT16 ReverseBits(UINT16 value, unsigned int bits);
const char *BITSQUEEZETABLE = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!#$%&()*+/<>";
template<class X> inline X Min(X a, X b) { return (a<b)?a:b; }

int main()
{
	char *reference = "This is a testing message 1337 hax0r! LOLOL. OMG >.< Are you sure this will work? Some information will be lost.";
	//char *reference = "abcdefghi";
	char cBitSqueeze[255], dBitSqueeze[255];
	LossyTextCompressBS(reference, cBitSqueeze, 255);
	LossyTextDecompressBS(cBitSqueeze, dBitSqueeze, 255);
	cout << reference << endl << "Length: " << strlen(reference) << "\n\n";
	cout << cBitSqueeze << endl << "Length: " << strlen(cBitSqueeze) << "\n\n";
	cout << dBitSqueeze << endl << "Length: " << strlen(dBitSqueeze) << "\n\n";
	cin.get();
}

struct BitBuffer
{
	UINT16 bitbuf;
	unsigned short bitnum;
	BitBuffer() { bitbuf = bitnum = 0; }
};

void LossyTextCompressBS(char *input, char *outbuf, unsigned int outbuf_size)
{
#ifdef CHECK_PLATFORM
	// Can't check this with the preprocessor -_-, throws a simple expression, feel free to alter the error reporting the way you want
	if(sizeof(UINT16) < 2)	// well not strictly UINT16, but meh >.>
		throw "UINT16 HAS WRONG SIZE";
#endif
	BitBuffer bitbuf;
	bool wasspace, skipspace = false;

	while(*input && outbuf_size)
	{
		wasspace = false;
		// encode only certain characters using 5 bits and insert them into the bitbuffer
		if(*input >= 'a' && *input <= 'z')
			outbuf_size -= BitBufferWrite(*input - 'a', 5, bitbuf, outbuf);
		else if (*input >= 'A' && *input <= 'Z')
			outbuf_size -= BitBufferWrite(*input - 'A', 5, bitbuf, outbuf);
		else if (*input >= '0' && *input <= '3')
			outbuf_size -= BitBufferWrite(*input - '0' + 'z' - 'a' + 2, 5, bitbuf, outbuf);
		else if (*input >= '4' && *input <= '9')
			outbuf_size -= BitBufferWrite(31, 5, bitbuf, outbuf) | BitBufferWrite(*input - '4', 3, bitbuf, outbuf);
		else if(*input == '.' || *input == '!' || *input == '?')
			outbuf_size -= BitBufferWrite(31, 5, bitbuf, outbuf) | BitBufferWrite('9'-'4'+1, 3, bitbuf, outbuf);
		else if(*input == '\n' || *input == '\r')
			outbuf_size -= BitBufferWrite(31, 5, bitbuf, outbuf) | BitBufferWrite('9'-'4'+2, 3, bitbuf, outbuf);
		else
		{
			wasspace = true; // indicate that 
			if(!skipspace)
				outbuf_size -= BitBufferWrite('z'+1 - 'a', 5, bitbuf, outbuf); // anythig else is converted to space
		}
		skipspace = wasspace;
		input++;
	}
	// finalize it - insert null character at the end
	if(outbuf_size)
	{
		BitBufferWrite('z'+1 - 'a', 6-bitbuf.bitnum, bitbuf, outbuf);
		*outbuf = 0;
	}
	else
		*(outbuf-1) = 0;
}

// bit buffer takes bits and once there's enough of them to print a character from the squeeze table it stores it in the output
bool BitBufferWrite(UINT16 value, unsigned int bits, BitBuffer &bitbuf, char *&outbuf)
{
	// the value sanitization isn't strictly needed for this example, but in case you wanted to use this elsewhere, it's there ;)
	bitbuf.bitbuf |= (value & ~(0xFFFFU << bits)) << bitbuf.bitnum; // add the new value to the buffer
	// if there's enough to generate a character
	if((bitbuf.bitnum += bits) >= 6)
	{
		*(outbuf++) = *(BITSQUEEZETABLE + ( bitbuf.bitbuf & 0x3FU)) ;
		bitbuf.bitbuf >>= 6;
		bitbuf.bitnum -= 6;
		return true;
	}
	return false;
}

void LossyTextDecompressBS(char *input, char *outbuf, unsigned int outbuf_size)
{
#ifdef CHECK_PLATFORM
	// Can't check this with the preprocessor -_-, throws a simple expression, feel free to alter the error reporting the way you want
	if(sizeof(UINT16) < 2)	// well not strictly UINT16, but meh >.>
		throw "UINT16 HAS WRONG SIZE";
#endif

	BitBuffer bitbuf;
	UINT16 temp;
	while((temp = BitBufferRead(5, bitbuf, input)) != 0xFFFFU && outbuf_size--)
	{
		if(temp <= ('z'-'a'))
			*(outbuf++) = 'a'+temp;
		else if(temp == ('z'-'a'+1))
			*(outbuf++) = ' ';
		else if(temp == 31U)
			*(outbuf++) = "456789.\n"[Min((int)BitBufferRead(3, bitbuf, input), 7)];
		else *(outbuf++) = '0'-('z'-'a'+2)+temp;
	}
	// finish it
	if(outbuf_size)
		*outbuf = 0;
	else
		*(outbuf-1) = 0;
}

UINT16 BitBufferRead(unsigned int bits, BitBuffer &bitbuf, char *&input)
{
	if(bitbuf.bitnum < bits)
	{
		if(*input == 0)
		return 0xFFFFU;

		UINT16 newvalue;
		for(newvalue = 0; (*input != *(BITSQUEEZETABLE+newvalue)) && newvalue < 64; ++newvalue); // not most efficient, can be replaced with generated conversion table
		bitbuf.bitbuf |= newvalue << bitbuf.bitnum;
		bitbuf.bitnum += 6;
		input++;
	}
	UINT16 ret = bitbuf.bitbuf & ~(0xFFFFU << bits);
	bitbuf.bitnum -= bits;
	bitbuf.bitbuf >>= bits;
	return ret;
}

UINT16 ReverseBits(UINT16 value, unsigned int bits)
{
	UINT16 reversed = 0;
	for(unsigned int i = 0; i < bits; ++i)
		reversed |= ((bool)(value & (1<<i))) << (bits-i-1);
	return reversed;
}
Conclusion

So I hope you like this at least a little bit and find it interesting. I'm not sure how useful it might be, but at least we can learn something new about compression I guess :D For an average text, the compression ratio is around 85 %, this also means the longer the text is, the more space will be saved. It's more suitable for written text, don't compress something with many special characters like source codes for example, because these will be lost. For example, here is this paragraph in compressed form:

Code: Select all

sNgPpUhrJDu/cPi)(bj#GPEraYPgOxO(Cr!b!udAdDb#gA!uJAtsKiNOM(BnsZ!n%ulPrI<&SAQKuOlnAQziZzQdeDGo#GPEraYP$s!cqtViGytlDmY+GAgDJy#GGJo#ch*fjsjY#!iTHsKsDazmrutxCrnOPbvseyi)j(o#ZdrlCSxKiLixthbtKtJKAq!rBd/N>ONhKudwsNMranPEEi)LRzir/+q!tYDQrs/+q!mxKOLpGGOTO#shiAjOsgFRrtTuZCrIwsrtqyr!fxuBJO(K!!tYDQhUMEQfo$lsKsT!XiZdQz!w!+OzGgwlFebcS!IdicetsuQxifrlD!OGOfUbrQlUOMCbS%IOdebGkj)(HijAlAT!bIE(KZV#fxutUa$Br!hsKOrs/+GK)hicmrW*OrnTWXErIusg)IlY!A

The original text of the paragraph has 508 characters, the compressed version has 426 characters, thus it's 83.85 % of the original text :3


I hope you'll find this interesting,
yours LOZDruP ;-)
Image
hug_overflow.asm wrote: LOOP: HUG #Peanut, #Grape
PUSH ACC
JMP LOOP
User avatar
IceKitsune
Posts: 5111
Joined: Mon Apr 26, 2010 1:35 pm
Location: Ohio

Re: Lossy Text Compression - BitSqueeze algorithm - Silly! ^

Post by IceKitsune »

Well this is very interesting Frooxius, I think I'll keep this around on my computer for fun.

Edit: I also like how the second window will bounce around the screen first when you open it :P that's cute
User avatar
Sleet
Bringing Foxy Back
Posts: 17291
Joined: Thu Apr 29, 2010 1:32 am
Location: Nephelokokkygia
Contact:

Re: Lossy Text Compression - BitSqueeze algorithm - Silly! ^

Post by Sleet »

Ooh, this is pretty cool. And I actually understand it, minus the C++ details.
Image
Questions? Comments? Concerns? Friendly banter? Feel free to click the "PM" button below!
Cm4F
Posts: 514
Joined: Thu Jul 21, 2011 12:11 pm
Location: Rock Hill, South Carolina
Contact:

Re: Lossy Text Compression - BitSqueeze algorithm - Silly! ^

Post by Cm4F »

Ok, I'm good at computers, but what is all that??????!!!!!!?!??!?????!!??!!!!?!??!!!1
ImageImage

I fought dyslexia, and I won!
User avatar
yehoshua
Posts: 1984
Joined: Wed May 25, 2011 7:32 pm
Location: Canananada

Re: Lossy Text Compression - BitSqueeze algorithm - Silly! ^

Post by yehoshua »

Hmm, not TOO different from Javascript, I was able to sort of understand the code you have there, pretty cool! Meanwhile, I'm making some cool pages like this homepage I made for myself. What do you think?
Sent from my conifer.
Post Reply