
Query Dataframes

Controlled node

Overview

The Query Dataframes node allows you to execute SQL queries against one or more dataframes (Parquet or CSV files) from your library. This node is useful for filtering, joining, aggregating, and transforming tabular data using standard SQL syntax.

The node accepts multiple input files, assigns them table names (either automatically or via the Table Names input), executes the provided SQL query, and outputs the results as a new Parquet file in your library.

Supported File Formats

While the node primarily works with Parquet files (.parquet, .pqt), it also supports CSV files (.csv) which are automatically handled during query execution.

Inputs

| Input | Type | Description | Default |
| --- | --- | --- | --- |
| Run | Event | Triggers the node to execute the SQL query against the provided files. | - |
| Files | Data | One or more files to query. Accepts Parquet or CSV files from the library. Supports up to 1000 files. | - |
| Table Names | Text | Comma-separated list of table names to use in the SQL query (e.g., "users, orders"). If not provided, files are automatically named df0, df1, etc. based on their order. | df0, df1, ... |
| Query | Code | The SQL query to execute against the dataframes. Use the table names defined in Table Names (or the defaults) to reference the files. | - |
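The naming rule described for Table Names can be sketched as follows. This is an illustrative helper, not the node's actual implementation; the function name and signature are assumptions:

```python
def resolve_table_names(names, file_count):
    """Sketch of the documented naming rule: explicit names win,
    otherwise files get df0, df1, ... in input order."""
    if not names:
        return [f"df{i}" for i in range(file_count)]
    if isinstance(names, list):
        return [str(n).strip() for n in names]
    # Comma-separated string, e.g. "users, orders"
    return [n.strip() for n in names.split(",")]
```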

Outputs

| Output | Type | Description |
| --- | --- | --- |
| Done | Event | Fires when the query has completed successfully. |
| Output | Data | A library file object containing the query results as a Parquet file. The object includes id, name, mimeType, dir, and size properties. |

Runtime Behavior and Defaults

When the Query Dataframes node runs:

  1. File Validation: The node validates that at least one file is provided and that all files are in a supported format (Parquet or CSV).

  2. Table Naming:

    • If Table Names is provided as a comma-separated string (e.g., "customers, transactions"), these names are used in the SQL query.
    • If Table Names is provided as an array, those values are used.
    • If not provided, files are automatically assigned names df0, df1, df2, etc., corresponding to their order in the Files input.
  3. Query Execution: The SQL query is executed against the dataframes using the specified table names. The query runs in a temporary context and does not modify the original files.

  4. Output Generation: The results are saved as a new Parquet file in the library under the __temporaryFiles__ directory. The file is named query_result_{timestamp}.parquet.

  5. Error Handling: If no files are provided, no query is provided, or the query fails, the Output will contain an error object with a descriptive message.
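The run sequence above can be sketched in Python. The node's actual SQL engine is not documented here, so this sketch uses an in-memory SQLite database as a stand-in, takes tables as lists of row dicts instead of library files, and assumes each table has at least one row; the function name and error shapes mirror the steps above but are illustrative:

```python
import sqlite3

def query_dataframes(tables, sql):
    """Sketch of the run sequence: validate inputs, load each table
    under its resolved name, run the query in a temporary context,
    and return rows or an error object."""
    if not tables:
        return {"error": "No files provided"}
    if not sql:
        return {"error": "No query provided"}
    con = sqlite3.connect(":memory:")  # temporary context; originals untouched
    try:
        for name, rows in tables.items():
            cols = list(rows[0].keys())
            con.execute(f"CREATE TABLE {name} ({', '.join(cols)})")
            con.executemany(
                f"INSERT INTO {name} VALUES ({', '.join('?' for _ in cols)})",
                [tuple(r[c] for c in cols) for r in rows],
            )
        cur = con.execute(sql)
        names = [d[0] for d in cur.description]
        return [dict(zip(names, row)) for row in cur.fetchall()]
    except sqlite3.Error as e:
        return {"error": str(e)}
    finally:
        con.close()
```

In the real node the result rows would then be written out as the `query_result_{timestamp}.parquet` file rather than returned directly.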

Output Format

The Output is always a Parquet file, regardless of whether the input files were CSV or Parquet. The output file object follows the standard Intellectible library format:

{
  "id": "unique-file-id",
  "name": "query_result_1699123456789.parquet",
  "mimeType": "application/vnd.apache.parquet",
  "dir": "directory-id",
  "size": 1024
}

Examples

Basic Query on a Single File

  1. Connect a Read Parquet node (or CSV to Parquet node) to the Files input of the Query Dataframes node.

  2. Leave Table Names empty (or set it to "sales").

  3. In the Query input, enter:

    SELECT * FROM df0 WHERE amount > 1000

    (or FROM sales if you named the table).

  4. Trigger the Run event.

  5. The Output will contain a new Parquet file with only the rows where the amount exceeds 1000.
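The filtering step can be reproduced with any SQL engine; the snippet below uses an in-memory SQLite table as a stand-in for the df0 dataframe, with made-up sample rows:

```python
import sqlite3

# Stand-in for the single-file example: the input file becomes a table
# named df0 (the default for the first file) and the query filters on amount.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df0 (id INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO df0 VALUES (?, ?)",
    [(1, 500.0), (2, 1500.0), (3, 2500.0)],
)
rows = con.execute("SELECT * FROM df0 WHERE amount > 1000").fetchall()
# Only the rows with amount above 1000 remain.
```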

Joining Multiple Dataframes

  1. Connect two Read Parquet nodes to the Files input (combine them using a Combine Data node or connect both to the same input).
  2. Set Table Names to "users, orders".
  3. In the Query input, enter:
    SELECT u.name, o.total 
    FROM users u
    JOIN orders o ON u.id = o.user_id
    WHERE o.status = 'completed'
  4. The Output will contain the joined results as a new Parquet file.
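The same join can be checked against an in-memory SQLite database; the users and orders rows below are made-up sample data:

```python
import sqlite3

# The two input files become the tables named in Table Names: users, orders.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
con.execute("CREATE TABLE orders (user_id INTEGER, total REAL, status TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Bob")])
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 99.0, "completed"), (2, 10.0, "pending")],
)
rows = con.execute(
    "SELECT u.name, o.total FROM users u "
    "JOIN orders o ON u.id = o.user_id "
    "WHERE o.status = 'completed'"
).fetchall()
# Only completed orders survive the join.
```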

File Limits

The node supports up to 1000 input files. Exceeding this limit will result in an error.
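A guard for this limit might look like the following; the function name and error message are illustrative, not the node's actual code:

```python
MAX_FILES = 1000  # documented input limit

def validate_file_count(files):
    """Reject the run before any query executes if too many files are given."""
    if len(files) > MAX_FILES:
        raise ValueError(f"Too many files: {len(files)} (max {MAX_FILES})")
    return files
```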