Query Dataframes
Controlled node
Overview
The Query Dataframes node allows you to execute SQL queries against one or more dataframes (Parquet or CSV files) from your library. This node is useful for filtering, joining, aggregating, and transforming tabular data using standard SQL syntax.
The node accepts multiple input files, assigns them table names (either automatically or via the Table Names input), executes the provided SQL query, and outputs the results as a new Parquet file in your library.
While the node primarily works with Parquet files (.parquet, .pqt), it also supports CSV files (.csv) which are automatically handled during query execution.
Inputs
| Input | Type | Description | Default |
|---|---|---|---|
| Run | Event | Triggers the node to execute the SQL query against the provided files. | - |
| Files | Data | One or more files to query. Accepts Parquet or CSV files from the library. Supports up to 1000 files. | - |
| Table Names | Text | Comma-separated list of table names to use in the SQL query (e.g., "users, orders"). If not provided, files are automatically named df0, df1, etc. based on their order. | df0, df1, ... |
| Query | Code | The SQL query to execute against the dataframes. Use the table names defined in Table Names (or the defaults) to reference the files. | - |
Outputs
| Output | Type | Description |
|---|---|---|
| Done | Event | Fires when the query has completed successfully. |
| Output | Data | A library file object containing the query results as a Parquet file. The object includes id, name, mimeType, dir, and size properties. |
Runtime Behavior and Defaults
When the Query Dataframes node runs:
-
File Validation: The node validates that at least one file is provided and that all files are in a supported format (Parquet or CSV).
-
Table Naming:
- If Table Names is provided as a comma-separated string (e.g.,
"customers, transactions"), these names are used in the SQL query. - If Table Names is provided as an array, those values are used.
- If not provided, files are automatically assigned names
df0,df1,df2, etc., corresponding to their order in the Files input.
- If Table Names is provided as a comma-separated string (e.g.,
-
Query Execution: The SQL query is executed against the dataframes using the specified table names. The query runs in a temporary context and does not modify the original files.
-
Output Generation: The results are saved as a new Parquet file in the library under the
__temporaryFiles__directory. The file is namedquery_result_{timestamp}.parquet. -
Error Handling: If no files are provided, no query is provided, or the query fails, the Output will contain an error object with a descriptive message.
The Output is always a Parquet file, regardless of whether the input files were CSV or Parquet. The output file object follows the standard Intellectible library format:
{
"id": "unique-file-id",
"name": "query_result_1699123456789.parquet",
"mimeType": "application/vnd.apache.parquet",
"dir": "directory-id",
"size": 1024
}
Example
Basic Query on a Single File
-
Connect a Read Parquet node (or CSV to Parquet node) to the Files input of the Query Dataframes node.
-
Leave Table Names empty (or set it to
"sales"). -
In the Query input, enter:
SELECT * FROM df0 WHERE amount > 1000(or
FROM salesif you named the table). -
Trigger the Run event.
-
The Output will contain a new Parquet file with only the rows where the amount exceeds 1000.
Joining Multiple Dataframes
- Connect two Read Parquet nodes to the Files input (combine them using a Combine Data node or connect both to the same input).
- Set Table Names to
"users, orders". - In the Query input, enter:
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.status = 'completed' - The Output will contain the joined results as a new Parquet file.
The node supports up to 1000 input files. Exceeding this limit will result in an error.