Use JSON Input step to process uneven data

I’m trying to process the following with a JSON Input step:

{"address":[
  {"AddressId":"1_1","Street":"A Street"},
  {"AddressId":"1_101","Street":"Another Street"},
  {"AddressId":"1_102","Street":"One more street", "Locality":"Buenos Aires"},
  {"AddressId":"1_102","Locality":"New York"}
]}

However, this does not seem to be possible:

Json Input.0 - ERROR (version 4.2.1-stable, build 15952 from 2011-10-25 15.27.10 by buildguy) : 
The data structure is not the same inside the resource! 
We found 1 values for json path [$..Locality], which is different that the number retourned for path [$..Street] (3509 values). 
We MUST have the same number of values for all paths.

The step provides an Ignore missing path flag, but it only works if all of the rows are missing the same path. In that case the step acts as expected and fills the missing values with null.
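
For instance, this hypothetical input parses fine with Ignore missing path checked, because the Locality path returns no values at all and that field is simply filled with null for every row:

{"address":[
  {"AddressId":"1_1","Street":"A Street"},
  {"AddressId":"1_101","Street":"Another Street"}
]}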

This limits the step's power to read uneven data, which was really one of my priorities.

My step Fields are defined as follows:

[screenshot: JSON Input Fields definition]

Am I missing something? Is this the correct behavior?

 

What I have done is use the JSON Input step with the path $.address[*] to read the full map of each element into a jsonRow field, e.g.:

{"address":[
    {"AddressId":"1_1","Street":"A Street"},  
    {"AddressId":"1_101","Street":"Another Street"},  
    {"AddressId":"1_102","Street":"One more street", "Locality":"Buenos Aires"},   
    {"AddressId":"1_102","Locality":"New York"} 
]}

This results in 4 jsonRows, one for each element, e.g. jsonRow = {"AddressId":"1_101","Street":"Another Street"}. Then, using a JavaScript step, I map my values with this:

var AddressId = getFromMap('AddressId', jsonRow);
var Street = getFromMap('Street', jsonRow);
var Locality = getFromMap('Locality', jsonRow);

In a second script tab I inserted the minified JSON parsing code from https://github.com/douglascrockford/JSON-js and the getFromMap function:

function getFromMap(key, jsonRow) {
  var map;
  try {
    // Parse the JSON string with the json2.js parser loaded in the other script tab.
    map = JSON.parse(jsonRow);
  }
  catch(e) {
    // Send unparsable rows to the step's error handling and skip them.
    var message = "Unparsable JSON: " + jsonRow + " Desc: " + e.message;
    var nr_errors = 1;
    var field = "jsonRow";
    var errcode = "JSON_PARSE";
    _step_.putError(getInputRowMeta(), row, nr_errors, message, field, errcode);
    trans_Status = SKIP_TRANSFORMATION;
    return null;
  }

  // A missing attribute simply becomes null instead of failing the step.
  if (map[key] == undefined) {
    return null;
  }
  trans_Status = CONTINUE_TRANSFORMATION;
  return map[key];
}

An alternative approach to reading JSON in Pentaho Data Integration

Reading JSON via JavaScript in Java is slow

Probably no surprise there, right? But if you’ve ever worked with the JSONInput step in Kettle, you know that it’s anything but a time saver.
After a lot of research and extensive benchmarking, we were able to identify Kettle's JSONInput step as our largest bottleneck. Plagued by performance troubles, the JSONInput step has JIRA issues dating back as far as 2012 (PDI-8809: Investigate JSON parsing performance improvement), as well as multiple other requests for a streaming solution (PDI-9785), the ability to handle large datasets (PDI-10858), and a rewrite of the step to use a native library (PDI-10344).

Reading JSON via Java in Java is faster
I just wanted to point that out one last time. But seriously, that is how much faster a native Java library can make things. We couldn't be happier to see the performance improvements in this FastJSON implementation. By decreasing both the runtime and the memory consumption needed to parse JSON and process it through Kettle, we have ensured that our ETL processes will stay performant and reliable while keeping our Product Managers' development time low and (relatively) pain-free.
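
To make that concrete, here is a minimal sketch of parsing the uneven address data from the first question with a native Java JSON library. The Alibaba fastjson API (com.alibaba.fastjson) is assumed here purely for illustration; the actual step implementation may well use a different parser:

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;

public class AddressParseSketch {
  public static void main(String[] args) {
    // Hypothetical input: same uneven shape as the address data above.
    String doc = "{\"address\":["
        + "{\"AddressId\":\"1_1\",\"Street\":\"A Street\"},"
        + "{\"AddressId\":\"1_102\",\"Locality\":\"New York\"}]}";

    // Parse once and walk the array; a missing attribute simply yields null
    // instead of tripping a "same number of values for all paths" check.
    JSONArray addresses = JSON.parseObject(doc).getJSONArray("address");
    for (int i = 0; i < addresses.size(); i++) {
      JSONObject row = addresses.getJSONObject(i);
      String addressId = row.getString("AddressId");
      String street = row.getString("Street");     // null when absent
      String locality = row.getString("Locality"); // null when absent
      System.out.println(addressId + " | " + street + " | " + locality);
    }
  }
}

Doing the parsing in Java keeps it out of the embedded JavaScript engine entirely, which is the part this section identifies as slow.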

Merging many JSON files into one

jq solution:

jq -s '{ attributes: map(.attributes[0]) }' file*.json
  • -s (--slurp) – instead of running the filter for each JSON object in the input, read the entire input stream into a large array and run the filter just once.
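
For example, given two hypothetical input files whose contents are inferred from the output below:

file1.json:
{"attributes": [{"name": "Node", "value": "test"}]}

file2.json:
{"attributes": [{"name": "version", "value": "11.1"}]}

the filter picks the first element of each file's attributes array and collects the results into a single attributes array.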

Sample output:

{
  "attributes": [
    {
      "name": "Node",
      "value": "test"
    },
    {
      "name": "version",
      "value": "11.1"
    }
  ]
}

How can I import a JSON file into PostgreSQL?

You can feed the JSON into a SQL statement that extracts the information and inserts it into the table. If the JSON attributes have exactly the same names as the table columns, you can do something like this:

with customer_json (doc) as (
   values 
    ('[
      {
        "id": 23635,
        "name": "Jerry Green",
        "comment": "Imported from facebook."
      },
      {
        "id": 23636,
        "name": "John Wayne",
        "comment": "Imported from facebook."
      }
    ]'::json)
)
insert into customer (id, name, comment)
select p.*
from customer_json l
  cross join lateral json_populate_recordset(null::customer, doc) as p
on conflict (id) do update 
  set name = excluded.name, 
      comment = excluded.comment;

New customers will be inserted, existing ones will be updated. The “magic” part is json_populate_recordset(null::customer, doc), which generates a relational representation of the JSON objects using the column names and types of the customer table. Note that the on conflict (id) clause requires a primary key or unique constraint on the id column.