An alternative approach to reading JSON in Pentaho’s Data Integration

Reading JSON via JavaScript in Java is slow

Probably no surprise there, right? But if you’ve ever worked with the JSONInput step in Kettle, you know that it’s anything but a time saver.
After a lot of research and extensive benchmarking, we identified Kettle’s JSONInput step as our largest bottleneck. The step’s performance troubles are documented in JIRA issues dating back as far as 2012 (PDI-8809: Investigate JSON parsing performance improvement), along with multiple requests for a streaming solution (PDI-9785), the ability to handle large datasets (PDI-10858), and a rewrite of the step to use a native library (PDI-10344).

Reading JSON via Java in Java is faster
I just wanted to point that out one last time. But seriously, that is how much faster a native Java library can make things. We couldn’t be happier to see the performance improvements in this FastJSON implementation. By decreasing both the runtime and the memory consumption needed to parse JSON and process it through Kettle, we have ensured that our ETL processes will stay performant and reliable while keeping our Product Managers’ development time low and (relatively) pain-free.

Urs Hölzle: Google

Urs Hölzle (German pronunciation: [ˈʊrs ˈhœltslɛ]) is a Swiss software engineer and technology executive. He is the senior vice president of technical infrastructure and Google Fellow at Google. As Google’s eighth employee and its first VP of Engineering, he has shaped much of Google’s development processes and infrastructure.

.. Before joining Google, he was an Associate Professor of Computer Science at the University of California, Santa Barbara. He received a master’s degree in computer science from ETH Zurich in 1988 and was awarded a Fulbright scholarship that same year. In 1994, he earned a Ph.D. from Stanford University, where his research focused on programming languages and their efficient implementation. Via a startup founded by Hölzle, David Griswold, and Lars Bak (see Strongtalk), that work then evolved into a high-performance Java VM named HotSpot, which was acquired by Sun’s JavaSoft unit in 1997 and became Sun’s premier JVM implementation.[2]

He led the design of Google’s very efficient data centers, which are said to use less than half the power of a conventional data center.[3] In 2014 he received The Economist’s Innovation Award for his datacenter efficiency work.[4] With Luiz Barroso, he wrote The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines.[5] In June 2007, together with Pat Gelsinger, he introduced the Climate Savers Computing Initiative, which aims to halve the power consumption of desktop computers and servers.

Also in 2007, he and Luiz Barroso wrote “The Case for Energy Proportional Computing,” which argued that servers should be designed to use power in proportion to their current load, because they spend much of their time being only partially loaded. This paper is often credited with spurring CPU manufacturers to make their designs much more energy efficient.[6] Today, energy proportional computing has become a standard goal for both server and mobile uses.
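The argument can be made concrete with a little arithmetic. The sketch below is only an illustration, not taken from the paper: it assumes a simple linear power model, and the wattages (a 200 W peak, and a conventional server that idles at half of peak) are hypothetical numbers chosen to keep the math easy. It compares such a server with an idealized energy-proportional one that draws nothing at zero load.

```java
public class EnergyProportionality {

    // Linear power model (an assumption for illustration): the server draws
    // idleWatts at zero load and rises linearly to peakWatts at full load.
    public static double powerWatts(double idleWatts, double peakWatts, double utilization) {
        if (utilization < 0.0 || utilization > 1.0) {
            throw new IllegalArgumentException("utilization must be in [0, 1]");
        }
        return idleWatts + (peakWatts - idleWatts) * utilization;
    }

    // Work per watt at the given load, relative to work per watt at full load.
    // A value of 1.0 means the server is as efficient as it is at peak.
    public static double relativeEfficiency(double idleWatts, double peakWatts, double utilization) {
        if (utilization == 0.0) {
            return 0.0; // no work is done at zero load
        }
        double workPerWatt = utilization / powerWatts(idleWatts, peakWatts, utilization);
        double peakWorkPerWatt = 1.0 / peakWatts;
        return workPerWatt / peakWorkPerWatt;
    }

    public static void main(String[] args) {
        double peak = 200.0;             // hypothetical peak draw in watts
        double conventionalIdle = 100.0; // hypothetical: idles at half of peak
        double proportionalIdle = 0.0;   // idealized energy-proportional server

        for (double load : new double[] {0.1, 0.3, 0.5, 1.0}) {
            System.out.printf(
                "load %.0f%%: conventional %.0f W (efficiency %.2f), proportional %.0f W (efficiency %.2f)%n",
                load * 100,
                powerWatts(conventionalIdle, peak, load),
                relativeEfficiency(conventionalIdle, peak, load),
                powerWatts(proportionalIdle, peak, load),
                relativeEfficiency(proportionalIdle, peak, load));
        }
    }
}
```

Under these assumptions, the conventional server at 30% load draws 130 W and does less than half as much work per watt as it does at full load, while the idealized proportional server draws 60 W and stays at full efficiency. Real servers are not perfectly linear, so the model should be read as a rough picture of why partial load is the case that matters.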

In 2011, Hölzle announced a shift in Google’s alternative energy investment strategy, dropping development of “solar thermal” electricity (for example with BrightSource Energy) because solar thermal was not keeping pace with the rapid price decline of another solar technology – photovoltaics.[7]

In 2012, Hölzle introduced “the G-Scale Network” on which Google had begun managing its petabyte-scale internal data flow via OpenFlow, an open source software system jointly devised by scientists at Stanford and UC Berkeley and promoted by the Open Networking Foundation. The internal data flow, or network, is distinct from the one that connects users to Google services (Search, Gmail, YouTube, etc.). In the process of describing the new network, Hölzle also confirmed more about Google’s making of its own networking equipment, such as routers and switches, for G-Scale. He said the company wanted, by being open about the changes, to “encourage the industry – hardware, software and ISP’s – to look down this path and say, ‘I can benefit from this.’” He said network utilization was nearing 100% of capacity, a dramatic efficiency improvement.[8]

He is credited with creating Google Gulp for April Fools’ Day in 2005.

He is a member of the National Academy of Engineering,[9] and a Fellow of the Association for Computing Machinery (2009),[10] the AAAS (2017),[11] and the Swiss Academies of Arts and Sciences.[12] He is also a board member of the US World Wildlife Fund.[13]

The Oracle-Google Case Will Decide the Future of Software

But since the appeals court has already ruled that APIs are subject to copyright, that could open a whole new frontier of lawsuits aimed at startups and open source projects that have copied APIs in order to ensure their products are compatible with popular commercial products.

For example, several companies have built open source software that works with various cloud services in an attempt to make it easier for customers to move their applications from, say, Amazon to their own data centers. Basho and SwiftStack, to name just two, each offer storage products that are compatible with Amazon’s cloud storage service S3. Since APIs are subject to copyright, Amazon could in theory go after both companies for copyright violations.

Meanwhile, many open source operating systems, such as FreeBSD and those based on Linux, use a standard API called POSIX, which is based on the API of AT&T’s Unix operating system. Under the appeals court’s ruling, AT&T could go after the makers of POSIX operating systems.

“Both of those scenarios are more likely after Oracle v. Google, regardless of how the jury decides,” says Mitch Stoltz, a senior staff attorney at the Electronic Frontier Foundation.

.. Many newer development platforms, including Google’s Go language and Apple’s Swift, are licensed under more liberal terms than Java and allow for-profit companies to use and modify them.