JSONReader

A simple JSON data loader with various options. It either parses the entire string, cleans it, and treats each line as an embedding, or performs a recursive depth-first traversal that yields JSON paths. It supports streaming of large JSON data using @discoveryjs/json-ext.

Usage

import { JSONReader } from "llamaindex";

const file = "../../PATH/TO/FILE";
const content = new TextEncoder().encode("JSON_CONTENT");

const reader = new JSONReader({ levelsBack: 0, collapseLength: 100 });
// Both loaders return promises, so await their results.
const docsFromFile = await reader.loadData(file);
const docsFromContent = await reader.loadDataAsContent(content);

Options

Basic:

  • streamingThreshold?: The size threshold, in MB of JSON data, for switching to streaming mode. The reader estimates the character count from bytes as (streamingThreshold * 1024 * 1024) / 2 and compares it against the .length of the JSON string. Set to undefined to disable streaming, or to 0 to always use streaming. Default is 50 MB.

  • ensureAscii?: Whether to ensure that only ASCII characters are present in the output by converting non-ASCII characters to their Unicode escape sequences. Default is false.

  • isJsonLines?: Whether the JSON is in JSON Lines format. If true, the reader splits the input into lines, removes empty ones, and parses each line as JSON. Note: this uses a custom streaming parser that is most likely less robust than json-ext. Default is false.

  • cleanJson?: Whether to clean the JSON by filtering out structural characters ({}, [], and ,). If set to false, it will just parse the JSON, not removing structural characters. Default is true.

  • logger?: A placeholder for a custom logger function.
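The streamingThreshold heuristic described above can be sketched as follows. This is a minimal illustration of the documented formula, not the library's own code: the helper name shouldStream is hypothetical, and the use of ">=" (so that 0 always streams, even for an empty string) is an assumption.

```typescript
// Hypothetical helper that mirrors the documented heuristic; not the library's own code.
function shouldStream(jsonLength: number, streamingThresholdMB?: number): boolean {
  // undefined disables streaming entirely.
  if (streamingThresholdMB === undefined) return false;
  // MB -> bytes, then halve to estimate characters (assumes about 2 bytes per character).
  const thresholdChars = (streamingThresholdMB * 1024 * 1024) / 2;
  // Assumption: ">=" so that a threshold of 0 always selects streaming.
  return jsonLength >= thresholdChars;
}

console.log(shouldStream(1000, 50)); // false: far below the 26,214,400-character estimate
console.log(shouldStream(1000, 0)); // true: 0 means always stream
```

With the default of 50 MB, the estimate works out to 26,214,400 characters, so only quite large JSON strings would take the streaming path.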

Depth-First-Traversal:

  • levelsBack?: Specifies how many levels up the JSON structure to include in the output. When it is set, cleanJson is ignored. If set to 0, all levels are included. If undefined, the reader parses the entire JSON, treats each line as an embedding, and creates a document per top-level array. Default is undefined.

  • collapseLength?: The maximum length of a node's JSON string representation for that node to be collapsed into a single line. Only applicable when levelsBack is set. Default is undefined.

Examples

Input:

{"a": {"1": {"key1": "value1"}, "2": {"key2": "value2"}}, "b": {"3": {"k3": "v3"}, "4": {"k4": "v4"}}}

Default options:

levelsBack = undefined & cleanJson = true

Output:

"a": {
"1": {
"key1": "value1"
"2": {
"key2": "value2"
"b": {
"3": {
"k3": "v3"
"4": {
"k4": "v4"
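One way to reproduce the cleaned output above is to pretty-print the JSON and drop every line that consists only of structural characters. The sketch below assumes that approach; cleanJsonLines is a hypothetical name, and the reader's actual implementation may differ.

```typescript
// Hypothetical helper: pretty-print, trim each line, and drop lines that
// contain nothing but structural characters ({, }, [, ], and ,).
function cleanJsonLines(data: unknown): string[] {
  return JSON.stringify(data, null, 2)
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => !/^[{}\[\],]*$/.test(line));
}

const cleanSample: unknown = JSON.parse(
  '{"a": {"1": {"key1": "value1"}, "2": {"key2": "value2"}}, "b": {"3": {"k3": "v3"}, "4": {"k4": "v4"}}}',
);
console.log(cleanJsonLines(cleanSample).join("\n"));
```

Lines such as "a": { are kept because they carry a key; only lines like }, or } are removed.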

Depth-First Traversal all levels:

levelsBack = 0

Output:

a 1 key1 value1
a 2 key2 value2
b 3 k3 v3
b 4 k4 v4
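The traversal behind this output can be sketched as a recursive walk that records the key path to every leaf value. The code below is an illustrative sketch, not the reader's source: collectPaths and JsonValue are hypothetical names, and treating array items as sharing their parent's path (no index added) is an assumption.

```typescript
type JsonValue = string | number | boolean | null | JsonValue[] | { [key: string]: JsonValue };

// Hypothetical helper: collects one "key key ... value" line per leaf, depth-first.
function collectPaths(node: JsonValue, path: string[] = [], out: string[] = []): string[] {
  if (Array.isArray(node)) {
    // Assumption: array items inherit the parent's path without an index.
    for (const item of node) collectPaths(item, path, out);
  } else if (node !== null && typeof node === "object") {
    for (const key of Object.keys(node)) collectPaths(node[key], path.concat(key), out);
  } else {
    // Leaf value: emit the full key path followed by the value.
    out.push(path.concat(String(node)).join(" "));
  }
  return out;
}

const dfsSample: JsonValue = JSON.parse(
  '{"a": {"1": {"key1": "value1"}, "2": {"key2": "value2"}}, "b": {"3": {"k3": "v3"}, "4": {"k4": "v4"}}}',
);
console.log(collectPaths(dfsSample).join("\n"));
```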

Depth-First Traversal and Collapse:

levelsBack = 0 & collapseLength = 35

Output:

a 1 {"key1":"value1"}
a 2 {"key2":"value2"}
b {"3":{"k3":"v3"},"4":{"k4":"v4"}}
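Collapsing can be pictured as a check made before descending into a node: if the node's compact JSON string is no longer than collapseLength, it is emitted whole. The following is a sketch under that assumption; collectCollapsed and JsonNode are hypothetical names, and the "<=" comparison is inferred from the example above rather than confirmed.

```typescript
type JsonNode = string | number | boolean | null | JsonNode[] | { [key: string]: JsonNode };

// Hypothetical helper: depth-first walk that emits short subtrees as a single line.
function collectCollapsed(
  node: JsonNode,
  collapseLength: number,
  path: string[] = [],
  out: string[] = [],
): string[] {
  if (node !== null && typeof node === "object") {
    const serialized = JSON.stringify(node);
    if (serialized.length <= collapseLength) {
      // Short enough: emit the whole subtree on one line and stop descending.
      out.push(path.concat(serialized).join(" "));
    } else if (Array.isArray(node)) {
      for (const item of node) collectCollapsed(item, collapseLength, path, out);
    } else {
      for (const key of Object.keys(node)) {
        collectCollapsed(node[key], collapseLength, path.concat(key), out);
      }
    }
  } else {
    out.push(path.concat(String(node)).join(" "));
  }
  return out;
}

const collapseSample: JsonNode = JSON.parse(
  '{"a": {"1": {"key1": "value1"}, "2": {"key2": "value2"}}, "b": {"3": {"k3": "v3"}, "4": {"k4": "v4"}}}',
);
console.log(collectCollapsed(collapseSample, 35).join("\n"));
```

In the sample, the value under "b" serializes to 33 characters and is collapsed, while the value under "a" serializes to 45 characters and is traversed further.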

Depth-First Traversal limited levels:

levelsBack = 2

Output:

1 key1 value1
2 key2 value2
3 k3 v3
4 k4 v4
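Limiting levels amounts to keeping only the last levelsBack keys of each path before appending the value. A minimal sketch, assuming that reading of the option (trimPath is a hypothetical helper, not part of the library):

```typescript
// Hypothetical helper: keep only the last `levelsBack` keys; 0 keeps the full path.
function trimPath(path: string[], levelsBack: number): string[] {
  return levelsBack === 0 ? path.slice() : path.slice(-levelsBack);
}

const fullPath = ["a", "1", "key1"];
console.log(trimPath(fullPath, 2).concat("value1").join(" ")); // 1 key1 value1
console.log(trimPath(fullPath, 0).concat("value1").join(" ")); // a 1 key1 value1
```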

Uncleaned JSON:

levelsBack = undefined & cleanJson = false

Output:

{"a":{"1":{"key1":"value1"},"2":{"key2":"value2"}},"b":{"3":{"k3":"v3"},"4":{"k4":"v4"}}}

ASCII-Conversion:

Input:

{ "message": "こんにちは世界" }

Output:

"message": "\u3053\u3093\u306b\u3061\u306f\u4e16\u754c"
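The conversion shown above can be approximated by replacing every UTF-16 code unit outside the ASCII range with its \uXXXX escape. This is an illustrative sketch only; escapeNonAscii is a hypothetical name and the reader may implement the option differently.

```typescript
// Hypothetical helper: replace each non-ASCII UTF-16 code unit with a \uXXXX escape.
function escapeNonAscii(text: string): string {
  return text.replace(/[^\x00-\x7F]/g, (ch) => {
    const hex = ch.charCodeAt(0).toString(16);
    // Left-pad to four hex digits, e.g. "e9" becomes "00e9".
    return "\\u" + ("0000" + hex).slice(-4);
  });
}

console.log(escapeNonAscii('"message": "こんにちは世界"'));
```

Because the pattern has no u flag, characters outside the Basic Multilingual Plane are escaped as two surrogate code units, which matches how JSON escapes represent them.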

JSON Lines Format:

Input:

{"tweet": "Hello world"}\n{"tweet": "こんにちは世界"}

Output:

"tweet": "Hello world"

"tweet": "こんにちは世界"
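The JSON Lines handling described under isJsonLines (split into lines, drop empty ones, parse each line) can be sketched without streaming as follows. parseJsonLines is a hypothetical helper for illustration; the reader itself uses a custom streaming parser.

```typescript
// Hypothetical helper: non-streaming version of the documented JSON Lines steps.
function parseJsonLines(text: string): unknown[] {
  return text
    .split(/\r?\n/) // one JSON value per line
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // remove empty lines
    .map((line) => JSON.parse(line) as unknown);
}

const jsonLinesRows = parseJsonLines('{"tweet": "Hello world"}\n\n{"tweet": "こんにちは世界"}\n');
console.log(jsonLinesRows.length); // 2
```

A malformed line makes JSON.parse throw here; a production reader would likely want to catch and report the offending line instead.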

API Reference