Valarm and big, fat JSON and KML (Jason and his Camel?)
The Valarm Tools web application is a typical Java web app running on Tomcat servers. That said, we do quite a lot of RESTful ajax/JSON style data retrieval: in the interface, for public APIs, or for data export (KML, CSV). Although JSON is much lighter-weight and generally more “efficient” than XML for our purposes, it’s still extremely redundant and heavyweight as a format for transporting a ton of data. Of course, we could use a more proprietary “packed” data format but the ease and rapidity (not to mention the easier human-readability and trouble-shooting) offered by JSON would be too much to lose.
GZIP to the rescue!
Luckily, GZIP can compress away the repeated keys and tags in our JSON and KML data streams, along with all the other text coming from our servers. Sure, running GZIP on our servers comes at a non-trivial computational cost. But frankly, CPU time is vastly cheaper than bandwidth these days, and the user experience is greatly improved too, especially for larger data sets. In other words, it’s well worth the extra CPU load, especially with very repetitive streams like JSON or any flavor of XML. (Some simple benchmarks are after the how-to…)
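To get a feel for how well repetitive JSON compresses, here is a minimal sketch using only the JDK's own `java.util.zip.GZIPOutputStream`. It is not Valarm code: the record layout and field names (`timestamp`, `latitude`, and so on) are made up for illustration, and the exact ratio will vary with your data.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class GzipRatioDemo {

    // Builds a repetitive JSON array. The field names are hypothetical,
    // chosen only to resemble sensor-style records.
    static String sampleJson(int records) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < records; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"timestamp\":").append(1350000000L + i)
              .append(",\"latitude\":34.05,\"longitude\":-118.25")
              .append(",\"temperature\":").append(20 + (i % 10))
              .append('}');
        }
        return sb.append(']').toString();
    }

    // Compresses a byte array in memory with the JDK's GZIP implementation.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sampleJson(10_000).getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        double savedPercent = 100.0 * (1.0 - (double) packed.length / raw.length);
        System.out.printf("raw=%d bytes, gzipped=%d bytes (%.1f%% smaller)%n",
                raw.length, packed.length, savedPercent);
    }
}
```

A servlet filter does essentially the same thing, except that it wraps the response's output stream instead of an in-memory buffer.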
GZIP on Tomcat (or anywhere!)
Pretty much all modern web servers offer a way to easily enable GZIP. But Valarm runs on almost any standards-compliant Java application server, including Tomcat, Jetty, Glassfish, and Resin, just to name a few. Given we are early in our life as a technology company, we’re keen to keep our deployment options open. At the same time, we really don’t want to deal with configuring each app server to handle GZIP (each is configured in its own way). And yet, we want our app to ALWAYS have GZIP output, as it provides such a clear benefit. Fortunately, satisfying this “want” is very easy in the Java webapp world: we’ll use a GZIP ServletFilter.
A quick Google search turns up a bunch of options. Apparently ehCache offers a nice one, but I couldn’t find any up-to-date documentation. I did find several references to a GZIP Servlet Filter available in Jetty, but they were all quite out of date as well. Thus, this blog entry. I like Jetty quite a lot, and it didn’t take but a few moments to configure the Jetty GZIP ServletFilter in our own webapp.
Installing Jetty GZIP ServletFilter
First, download the Jetty distro: http://docs.codehaus.org/display/JETTY/Downloading+Jetty
I used a Jetty 8.1.x package from the Eclipse Distribution.
Unzip the archive and find the /lib directory. Jetty is hugely modular. So you’ll be adding four jars to your webapp lib:
Then edit your WEB-INF/web.xml and add a block for the GZIP Filter:
You’ll also need to add at least one filter-mapping element. You can add a single element with url-pattern of /* and leave it to the GZIP Filter to decide what to handle. There are some additional configuration options, for example filtering by mime type: see this doc. If you use /* you’ll definitely want to set the mimeTypes option or you might end up needlessly compressing jpgs, etc. Alternatively, you can explicitly add a mapping for each path or file-extension you’d like to have compressed. Since Valarm produces a lot of dynamic JSON, KML, and CSV, I like adding them explicitly. This way, even if the mime content-type isn’t properly set by the programmer who wrote an export module or API/service, the GzipFilter will pick it up. Here’s what our filter-mappings look like:
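<filter-mapping><filter-name>GzipFilter</filter-name><url-pattern>*.html</url-pattern></filter-mapping>
<filter-mapping><filter-name>GzipFilter</filter-name><url-pattern>*.js</url-pattern></filter-mapping>
<filter-mapping><filter-name>GzipFilter</filter-name><url-pattern>*.css</url-pattern></filter-mapping>
<filter-mapping><filter-name>GzipFilter</filter-name><url-pattern>*.json</url-pattern></filter-mapping>
<filter-mapping><filter-name>GzipFilter</filter-name><url-pattern>*.kml</url-pattern></filter-mapping>
<filter-mapping><filter-name>GzipFilter</filter-name><url-pattern>*.csv</url-pattern></filter-mapping>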
Seriously: JSON and XML (KML in our case) are grossly repetitive. Here’s what our bandwidth reduction looks like.
A KML export without GZIP:
Time taken for tests: 2.792 seconds
Complete requests: 20
Failed requests: 0
Write errors: 0
Total transferred: 38946200 bytes
HTML transferred: 38940000 bytes
Requests per second: 7.16 [#/sec] (mean)
Time per request: 139.586 [ms] (mean)
Time per request: 139.586 [ms] (mean, across all concurrent requests)
Transfer rate: 13623.60 [Kbytes/sec] received
A KML export WITH GZIP:
Time taken for tests: 3.117 seconds
Complete requests: 20
Failed requests: 0
Write errors: 0
Total transferred: 1078940 bytes
HTML transferred: 1072260 bytes
Requests per second: 6.42 [#/sec] (mean)
Time per request: 155.867 [ms] (mean)
Time per request: 155.867 [ms] (mean, across all concurrent requests)
Transfer rate: 338.00 [Kbytes/sec] received
That’s a bandwidth savings of 97% (1072260 bytes instead of 38940000)! This comes at a CPU/performance cost: the run took 3.117 seconds instead of 2.792, or roughly 12% longer. Not trivial, but certainly worth it!
Here are some numbers for JSON. Watch out: the bandwidth savings is awesome, but the CPU usage is bewildering!
JSON with NO GZIP:
Time taken for tests: 1.894 seconds
Complete requests: 20
Failed requests: 0
Write errors: 0
Total transferred: 12815560 bytes
HTML transferred: 12811420 bytes
Requests per second: 10.56 [#/sec] (mean)
Time per request: 94.678 [ms] (mean)
Time per request: 94.678 [ms] (mean, across all concurrent requests)
Transfer rate: 6609.33 [Kbytes/sec] received
JSON WITH GZIP:
Time taken for tests: 7.351 seconds
Complete requests: 20
Failed requests: 0
Write errors: 0
Total transferred: 1154000 bytes
HTML transferred: 1149380 bytes
Requests per second: 2.72 [#/sec] (mean)
Time per request: 367.531 [ms] (mean)
Time per request: 367.531 [ms] (mean, across all concurrent requests)
Transfer rate: 153.31 [Kbytes/sec] received
Well, this isn’t so great. Bandwidth is beautifully conserved: 1149380 bytes instead of 12811420 (91% savings!), but at a terrible CPU expense: the run took 7.351 seconds instead of 1.894, or 288% longer.
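For anyone who wants to re-check the percentages, here is a small sketch of the arithmetic. It uses only the "HTML transferred" and "Time taken for tests" figures from the ab runs above; the class and method names are my own, purely for illustration.

```java
class GzipSavings {

    // Percentage of bytes saved by compressing: 100 * (1 - gzipped / plain).
    static double bytesSavedPercent(long plainBytes, long gzippedBytes) {
        return 100.0 * (1.0 - (double) gzippedBytes / plainBytes);
    }

    // Percentage of extra wall-clock time with compression enabled:
    // 100 * (gzipped / plain - 1).
    static double extraTimePercent(double plainSeconds, double gzippedSeconds) {
        return 100.0 * (gzippedSeconds / plainSeconds - 1.0);
    }

    public static void main(String[] args) {
        // KML export: figures taken from the two KML ab runs.
        System.out.printf("KML:  %.1f%% fewer bytes, %.1f%% more time%n",
                bytesSavedPercent(38_940_000L, 1_072_260L),
                extraTimePercent(2.792, 3.117));

        // JSON export: figures taken from the two JSON ab runs.
        System.out.printf("JSON: %.1f%% fewer bytes, %.1f%% more time%n",
                bytesSavedPercent(12_811_420L, 1_149_380L),
                extraTimePercent(1.894, 7.351));
    }
}
```

Note that "more time" here is total elapsed test time on localhost, which lumps together response generation, compression, and the ab client's own work. It is a rough proxy for CPU cost, not a profile.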
Ouch! Now what?
One possible reason for the horrible bandwidth/CPU tradeoff when compressing JSON: both the KML and JSON tests were performed with relatively “typical” datasets for our application. Our JSON responses are typically much smaller than KML, and JSON is inherently more efficient than KML as a format, so there is less work per request to begin with and GZIP compression accounts for a significantly larger share of the total. Nevertheless, I’m doubtful this explains the whole story. The next step would be to profile the GZIP ServletFilter to see if there’s some easy optimization to be had.
User experience trumps CPU costs (as long as we retain profitability!) and bandwidth trumps CPU by far. So for this scenario we’ll be leaving GZIP enabled. Yes, latency is significantly higher, but transfer time across the internet is so much lower that it’s easily palpable in real-life usage.
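To make that tradeoff concrete, here is a rough back-of-the-envelope sketch. It estimates end-to-end time as server time plus bytes divided by link speed, using the per-request JSON numbers from the benchmark (total bytes divided by 20 requests, and the mean time per request). The 5 Mbit/s and 1 Gbit/s link speeds are assumptions of mine, and the model ignores network latency, TCP behavior, and client-side decompression.

```java
class TransferEstimate {

    // Estimated seconds to serve and deliver one response:
    // time spent on the server plus time to push the bytes through the link.
    static double totalSeconds(double serverSeconds, long bytes, double linkBitsPerSecond) {
        return serverSeconds + (bytes * 8.0) / linkBitsPerSecond;
    }

    public static void main(String[] args) {
        // Per-request figures derived from the JSON ab runs (20 requests each).
        long plainBytes = 12_811_420L / 20;    // 640571 bytes per response
        long gzippedBytes = 1_149_380L / 20;   // 57469 bytes per response
        double plainServerSeconds = 0.094678;  // mean time per request, no GZIP
        double gzipServerSeconds = 0.367531;   // mean time per request, with GZIP

        // Hypothetical link speeds, NOT measured: a modest broadband
        // connection and a gigabit LAN.
        double[] linksBitsPerSecond = {5_000_000.0, 1_000_000_000.0};

        for (double link : linksBitsPerSecond) {
            System.out.printf("link=%.0f bit/s: plain=%.3f s, gzipped=%.3f s%n",
                    link,
                    totalSeconds(plainServerSeconds, plainBytes, link),
                    totalSeconds(gzipServerSeconds, gzippedBytes, link));
        }
    }
}
```

Under these assumptions the gzipped response wins comfortably on the slower link, while on a very fast link the uncompressed response comes out ahead, which matches what the localhost benchmark shows.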
Test notes: all tests were performed using ab (aka ApacheBench) on my notebook, entirely on localhost. It’s a MacBook Pro with a quad-core i7 and 16GB of RAM. Tomcat was running inside Eclipse (Juno), with all default configuration and no SSL.