Cut Out the Middle Tier: Generating JSON Directly from Postgres

Easy JSON using row_to_json

The simplest JSON generator is row_to_json(), which takes in a tuple value and returns the equivalent JSON dictionary.

SELECT row_to_json(employees)
FROM employees
WHERE employee_id = 1;

The resulting JSON uses the column names for keys, so you get a neat dictionary.

{
  "employee_id": 1,
  "department_id": 1,
  "name": "Paul",
  "start_date": "2018-09-02",
  "fingers": 10,
  "geom": {
    "type": "Point",
    "coordinates": [
      -123.329773,
      48.407326
    ]
  }
}

And look what happens to the geometry column! Because PostGIS includes a cast from geometry to JSON, the geometry column is automatically mapped into GeoJSON in the conversion. This is a useful trick with any custom type: define a cast to JSON and you automatically integrate with the native PostgreSQL JSON generators.

Full result sets using json_agg

Turning a single row into a dictionary is fine for basic record access, but queries frequently require multiple rows to be converted.

Fortunately, there’s an aggregate function for that, json_agg, which carries out the JSON conversion and converts the multiple results into a JSON list.

SELECT json_agg(e) 
FROM (
    SELECT employee_id, name 
    FROM employees
    WHERE department_id = 1
    ) e;

Note that in order to strip down the data in the record, we use a subquery to make a narrower input to json_agg.

[
  {
    "employee_id": 1,
    "name": "Paul"
  },
  {
    "employee_id": 2,
    "name": "Martin"
  }
]

Use JSON Input step to process uneven data

I’m trying to process the following with a JSON Input step:

{"address":[
  {"AddressId":"1_1","Street":"A Street"},
  {"AddressId":"1_101","Street":"Another Street"},
  {"AddressId":"1_102","Street":"One more street", "Locality":"Buenos Aires"},
  {"AddressId":"1_102","Locality":"New York"}
]}

However, this does not seem to be possible:

Json Input.0 - ERROR (version 4.2.1-stable, build 15952 from 2011-10-25 15.27.10 by buildguy) :
The data structure is not the same inside the resource!
We found 1 values for json path [$..Locality], which is different that the number retourned for path [$..Street] (3509 values).
We MUST have the same number of values for all paths.

The step provides an Ignore Missing Path flag, but it only works if all the rows are missing the same path. In that case the step acts as expected and fills the missing values with null.

This limits the power of this step to read uneven data, which was really one of my priorities.
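To make the error above more concrete, here is a minimal sketch in plain JavaScript of why collecting each path independently breaks on uneven data. It is not Kettle's actual implementation; collect is a hypothetical helper that only approximates what a recursive path such as $..Street selects.

```javascript
// Sample document from the question.
const doc = {
  address: [
    { AddressId: "1_1", Street: "A Street" },
    { AddressId: "1_101", Street: "Another Street" },
    { AddressId: "1_102", Street: "One more street", Locality: "Buenos Aires" },
    { AddressId: "1_102", Locality: "New York" }
  ]
};

// Hypothetical helper: gather every value stored under `key` anywhere in the
// document, roughly what a recursive path like $..Street would select.
function collect(node, key, out = []) {
  if (Array.isArray(node)) {
    node.forEach(child => collect(child, key, out));
  } else if (node !== null && typeof node === "object") {
    for (const [k, v] of Object.entries(node)) {
      if (k === key) out.push(v);
      collect(v, key, out);
    }
  }
  return out;
}

const streets = collect(doc, "Street");
const localities = collect(doc, "Locality");

// The two lists have different lengths, so they cannot be lined up
// position by position to rebuild the original rows.
console.log(streets.length, localities.length); // 3 2
```

Once the values are pulled out per path, the link to the element they came from is gone, which is presumably why the step insists on equal counts for every path.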

My step Fields are defined as follows:

[Screenshot: JSON Input Fields definition]

Am I missing something? Is this the correct behavior?

 

What I ended up doing is using the JSON Input step with the path $.address[*] to read the full map of each element into a jsonRow field, e.g.:

{"address":[
    {"AddressId":"1_1","Street":"A Street"},
    {"AddressId":"1_101","Street":"Another Street"},
    {"AddressId":"1_102","Street":"One more street", "Locality":"Buenos Aires"},
    {"AddressId":"1_102","Locality":"New York"}
]}

This results in 4 jsonRows, one for each element, e.g. jsonRow = {"AddressId":"1_101","Street":"Another Street"}. Then, in a JavaScript step, I map my values like this:

var AddressId = getFromMap('AddressId', jsonRow);
var Street = getFromMap('Street', jsonRow);
var Locality = getFromMap('Locality', jsonRow);

In a second script tab I inserted minified JSON parse code from https://github.com/douglascrockford/JSON-js and the getFromMap function:

function getFromMap(key, jsonRow) {
  var map;
  try {
    // Parse the JSON text of the current element.
    map = JSON.parse(jsonRow);
  } catch (e) {
    // Send unparsable rows to the step's error handling and skip them.
    var message = "Unparsable JSON: " + jsonRow + " Desc: " + e.message;
    var nr_errors = 1;
    var field = "jsonRow";
    var errcode = "JSON_PARSE";
    _step_.putError(getInputRowMeta(), row, nr_errors, message, field, errcode);
    trans_Status = SKIP_TRANSFORMATION;
    return null;
  }

  // A key missing from this element yields null instead of failing.
  if (map[key] == undefined) {
    return null;
  }
  trans_Status = CONTINUE_TRANSFORMATION;
  return map[key];
}
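One thing to note is that getFromMap parses jsonRow again on every call, so each row is parsed three times. A possible variation, sketched here in plain JavaScript under the assumption that parsing once per row is acceptable, returns all requested keys in one pass. extractFields is a hypothetical name, and the Kettle-specific error handling (putError, trans_Status) is deliberately left out.

```javascript
// Hypothetical helper: parse one jsonRow string once and return the requested
// keys, using null for any key the element does not have.
function extractFields(jsonRow, keys) {
  // JSON.parse throws on invalid input; the caller decides how to report it.
  var map = JSON.parse(jsonRow);
  var result = {};
  for (var i = 0; i < keys.length; i++) {
    var key = keys[i];
    result[key] = Object.prototype.hasOwnProperty.call(map, key) ? map[key] : null;
  }
  return result;
}

var fields = extractFields(
  '{"AddressId":"1_102","Locality":"New York"}',
  ["AddressId", "Street", "Locality"]
);
console.log(fields.AddressId); // 1_102
console.log(fields.Street);    // null
```

Inside a Kettle script step the try/catch from getFromMap would still be needed around the call so that unparsable rows go to error handling.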

An alternative approach to reading JSON in Pentaho’s Data Integration

Reading JSON via JavaScript in Java is slow

Probably no surprise there, right? But if you’ve ever worked with the JSONInput step in Kettle, you know that it’s anything but a time saver.
After a lot of research and extensive benchmarking, we identified Kettle’s JSONInput step as our largest bottleneck. Its performance troubles are documented in JIRA issues dating back as far as 2012 (PDI-8809: Investigate JSON parsing performance improvement), as well as in multiple other requests for a streaming solution (PDI-9785), for the ability to handle large datasets (PDI-10858), and for rewriting the step to use a native library (PDI-10344).

Reading JSON via Java in Java is faster
I just wanted to point that out one last time. But seriously, that is how much faster a native Java library can make things. We couldn’t be happier to see the performance improvements in this FastJSON implementation. By decreasing both the runtime and the memory consumption needed to parse JSON and process it through Kettle, we have ensured that our ETL processes will stay performant and reliable while keeping our Product Managers’ development time low and (relatively) pain-free.

Merging many JSON files into one

jq solution:

jq -s '{ attributes: map(.attributes[0]) }' file*.json
  • -s (--slurp) – instead of running the filter for each JSON object in the input, read the entire input stream into a large array and run the filter just once.

Sample output:

{
  "attributes": [
    {
      "name": "Node",
      "value": "test"
    },
    {
      "name": "version",
      "value": "11.1"
    }
  ]
}
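For readers less familiar with jq, the filter can be read as the following rough JavaScript equivalent. This is only a sketch: the two input objects are assumed stand-ins for the parsed files (the originals are not shown here), and mergeAttributes is a hypothetical name.

```javascript
// Assumed stand-ins for the parsed contents of two input files.
const inputs = [
  { attributes: [{ name: "Node", value: "test" }] },
  { attributes: [{ name: "version", value: "11.1" }] }
];

// With -s, jq hands the filter one array holding every input document.
// The filter { attributes: map(.attributes[0]) } then takes the first
// attribute of each document and wraps the results in a single object.
function mergeAttributes(docs) {
  // jq yields null when .attributes is missing or empty; mirror that here.
  return { attributes: docs.map(doc => doc.attributes?.[0] ?? null) };
}

const merged = mergeAttributes(inputs);
console.log(JSON.stringify(merged, null, 2));
```

Note that only the first element of each file's attributes array is kept, exactly as in the jq filter; files with several attributes would need a different filter.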