December 9, 2017

How to reduce JSON file size

When I started writing this article, it had quite a different title. To be precise: “Uploading large JSON files to Zoho Reports.” But since reducing JSON size can be useful in so many areas, I decided not to limit it to Zoho and changed the title.

In short, I’ll explain how I managed to reduce and split JSON files of several gigabytes down to the size allowed by the API, with the help of a few tools.

Introduction

Zoho Reports is a business intelligence tool by Zoho, the company behind many, many different SaaS products. It lets you import, analyze and then visualize data by creating many kinds of reports. It works well, has plenty of features, and is much cheaper than its competitors. On the other hand, the design is very outdated and the fact that it’s an Indian product might turn some people away.

Zoho Reports exposes APIs which we use to import data from our database (datastore) into it. The problem is that the API is limited to a file size of 100MB and a data size of 200MB. Exporting all of our data turned out to be quite troublesome, but after some processing, I managed to upload files weighing several gigabytes.

I’d heard of Zoho plenty of times before joining my new company but had never used it extensively. In our company, we use Zoho Reports for reporting (reports provided to customers in our dashboard). We have lots of meaningful data which, when represented correctly and in an accessible way, provides great value to our customers.

Earlier, we did most of the calculations on the backend and then exported the calculated data into Zoho as a single table. Then, in Zoho’s GUI, we created reports and embedded them as iframes in our web application.

The positive side of this approach was that reports were generated faster, as no queries or calculations were done on Zoho’s side. Also, as software developers, we prefer doing various calculations in our language of choice (in this case Golang) instead of being reliant on 3rd party software.

The major drawback was that lots of data was missing from Zoho, so we were unable to create new reports without changing the code. The product owner didn’t like this, as nearly every new report required some modifications to the code, which turned out to be troublesome.

So instead, we decided to export all our data as-is from our database (datastore), and then do the necessary calculations in Zoho, which provides an SQL-like syntax. With all our relevant data available there, I was able to create any report we wanted with ease.

After creating a few query tables (by joining several tables with SQL syntax), even our marketing people were able to create new reports.

Besides reports being generated a little more slowly (to my surprise, the difference wasn’t that big), we now had much more data to export, which led us to another problem: the API limits.

Exporting large data

When we moved to the new reporting method, I had to export all our data - again. Normally this wouldn’t be a problem; the query and export would just take a little longer. The catch was that Zoho has two limits on API usage:

  1. The data size being uploaded (in this case JSON) has a limit of 200MB.
  2. The size of the file being uploaded (in our case .zip) has a limit of 100MB.

The latter wasn’t a big problem: if a JSON file weighs less than 200MB, the zipped content usually weighs no more than about 10MB. Our JSON files weighed several GBs, so I had a problem to solve. It took me a few steps, but I learned a lot from it.

(note: Zoho does provide a GUI/CLI tool for importing .csv files larger than 200MB)

I did the following things to reduce our JSON files from 4GB to <200MB:

1. Remove unnecessary data

Even though the number of columns doesn’t count toward Zoho’s limits, having too many columns, especially unnecessary ones, makes the files much bigger.

If you have data that you are 100% sure will never be used in any report, don’t export it - especially if it’s an array.

With Golang I had a problem because default (zero) values in Go are not null/nil. Instead, they depend on the type you are using:

  • Empty string ("") for string

  • 0 for int (and other number types)

  • false for boolean (bool)

  • 0001-01-01 00:00:00 +0000 UTC for dates

package main

import (
    "fmt"
    "time"
)

func main() {
    // Each variable is declared but never assigned, so it holds its type's zero value.
    var s string
    var i int
    var b bool
    var t time.Time

    fmt.Printf("Default string: %s Default int: %d Default bool: %v Default time: %v\n", s, i, b, t)
}
Result:

Default string:  Default int: 0 Default bool: false Default time: 0001-01-01 00:00:00 +0000 UTC

In Go, this can be prevented by declaring these fields as pointers instead of values (e.g. var s *string). But our structs were already defined with value types, including many different date fields, and changing them (and querying the datastore with the changed structs) would have been another mission to tackle.

So I settled on setting arrays/structs we didn’t need to nil/null.
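
To illustrate, here’s a minimal sketch with a made-up Record struct: slices and pointers set to nil, combined with the omitempty tag, disappear from the marshaled output entirely, while a zero time.Time is still written because encoding/json never treats structs as empty - which is part of why the date fields were the annoying ones.

package main

import (
    "encoding/json"
    "fmt"
    "time"
)

// Record is a hypothetical export row, just for illustration.
type Record struct {
    ID      int       `json:"id"`
    Name    string    `json:"name"`
    Created time.Time `json:"created,omitempty"` // structs are never "empty", so this is always written
    Tags    []string  `json:"tags,omitempty"`    // nil slices are dropped from the output
    Details *string   `json:"details,omitempty"` // nil pointers are dropped as well
}

func main() {
    r := Record{ID: 1, Name: "John", Tags: []string{"internal", "debug"}}

    // Blank out data we know no report will ever need.
    r.Tags = nil
    r.Details = nil

    out, _ := json.Marshal(r)
    fmt.Println(string(out))
    // Output: {"id":1,"name":"John","created":"0001-01-01T00:00:00Z"}
}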

2. Determine size per row

Depending on your needs, you may or may not need to determine the size per line of your JSON. In my case, I needed my JSON files to be less than 200MB, but not too far below it, as smaller files would mean more API calls, and each call took about 5 minutes.

I used jq, a lightweight and flexible command-line JSON processor, to determine the number of elements in my JSON file by executing the following command:

jq '. | length' File.json

It returns the number of elements - 375,199, for example. Since each element sits on its own line, some simple math gives an approximate number of lines per 200MB: divide the total file size by the line count to get the average size per line, then see how many of those lines fit under the limit. For my files that worked out to roughly 8-9k lines per file.
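
The arithmetic is trivial, but here’s a tiny sketch of the estimate for completeness. The headroom factor is my own addition - line sizes vary, and a rejected upload costs far more time than a slightly smaller file:

package main

import "fmt"

func main() {
    const (
        totalBytes  = 4 << 30   // the ~4GB JSON export
        totalLines  = 375199    // reported by jq '. | length'
        targetBytes = 200 << 20 // Zoho's 200MB data limit
        headroom    = 0.5       // my own safety factor: stay well below the limit
    )

    bytesPerLine := float64(totalBytes) / float64(totalLines)
    linesPerFile := int(float64(targetBytes) / bytesPerLine * headroom)

    fmt.Printf("~%.0f bytes per line, aim for ~%d lines per file\n", bytesPerLine, linesPerFile)
}

With my numbers this prints about 11,400 bytes per line and just over 9,000 lines per file - the same ballpark as the 8,500 I ended up using.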

3. Splitting into many smaller files

Perhaps this step was possible with jq too, but I didn’t investigate much. Instead, I used the split command-line utility to split my large JSON files into many smaller ones:

split -l 8500 File.json

This splits one large file into many smaller ones, each containing 8500 lines (as calculated earlier). You can manually check the file sizes at this point, and if they are too far from what you need (either too large or too small), you can repeat the process with more or fewer lines per file, as it takes little to no time.

Once you have the smaller files, there is another problem to solve:

4. Fixing JSON brackets

If you split a large JSON file into many small ones, and the tool you use doesn’t know it is dealing with JSON, you will get chunks that are missing their brackets. Take this array, for example:

[{"id":1,"name":"John"},
{"id":2,"name":"Mark"},
{"id":3,"name":"Tom"},
{"id":4,"name":"Nick"},
{"id":5,"name":"Richard"},
{"id":6,"name":"George"}]

If you split this file on every 2nd line, you would get three files (xaa, xab and xac, with split’s default naming):

[{"id":1,"name":"John"},
{"id":2,"name":"Mark"},

{"id":3,"name":"Tom"},
{"id":4,"name":"Nick"},

{"id":5,"name":"Richard"},
{"id":6,"name":"George"}]

Only the first chunk has an opening bracket and only the last one has a closing bracket (and every chunk except the last ends with a trailing comma), so none of them is valid JSON on its own.

This problem could have been avoided by using a splitting tool that understands JSON, like jq, but I moved on from here.
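
For what it’s worth, a JSON-aware splitter doesn’t take much code either. Below is a minimal Go sketch of the idea (the chunk file names are my own choice, and it reads the whole file into memory, so for very large exports a streaming json.Decoder would be kinder to RAM). It decodes the top-level array and writes every 8,500 elements back out as a complete, valid JSON array:

package main

import (
    "encoding/json"
    "fmt"
    "os"
)

func main() {
    // Read the whole export into memory.
    data, err := os.ReadFile("File.json")
    if err != nil {
        panic(err)
    }

    // Decode the top-level array into raw, untouched elements.
    var rows []json.RawMessage
    if err := json.Unmarshal(data, &rows); err != nil {
        panic(err)
    }

    const linesPerFile = 8500 // from the earlier estimate

    for i, n := 0, 0; i < len(rows); i, n = i+linesPerFile, n+1 {
        end := i + linesPerFile
        if end > len(rows) {
            end = len(rows)
        }

        // Re-marshal the slice; json.Marshal produces a complete array, brackets included.
        chunk, err := json.Marshal(rows[i:end])
        if err != nil {
            panic(err)
        }

        name := fmt.Sprintf("chunk_%03d.json", n)
        if err := os.WriteFile(name, chunk, 0644); err != nil {
            panic(err)
        }
    }
}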

As the files were still too large to open in most text editors, especially VSCode (Electron, duh), I used a free hex editor named Hex Fiend.

Hex Fiend was able to open files as large as 5GB in a matter of seconds and save the changes in under 10 seconds. After adding the brackets manually, I was able to send the data to Zoho’s API, as my files were now a bit smaller than 200MB.

2018 © Emir Ribic - Some rights reserved; please attribute properly and link back. Code snippets are MIT Licensed
