
Input

Load data from files into your workflow.

Sockets

| Socket | Direction | Description |
| ------ | --------- | ----------- |
| output | Output    | Loaded data as a LazyFrame |

Supported Formats

CSV

Comma-separated values and other delimited text files.

Reading Modes:

| Mode    | Description                                   | Best For |
| ------- | --------------------------------------------- | -------- |
| Scan    | Lazy loading, streams data                    | Large files, memory efficiency |
| Read    | Eager loading, loads the entire file          | Small files, when you need all data upfront |
| Batched | Processes the file in chunks, similar to Dask | Very large files |

CSV Options:

| Option              | Default    | Description |
| ------------------- | ---------- | ----------- |
| Source              | (required) | Path to the CSV file |
| Separator           | ,          | Field delimiter (comma, tab, pipe, etc.) |
| Has Header          | true       | First row contains column names |
| Quote Character     | "          | Character used to quote fields |
| Encoding            | utf-8      | File encoding |
| Skip Rows           | 0          | Number of rows to skip at the start |
| Infer Schema Length | 100        | Rows to sample for type inference |
| Try Parse Dates     | false      | Attempt to parse date columns |
| Ignore Errors       | false      | Skip rows with parsing errors |
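
Under the hood these options map onto Polars' CSV readers. The sketch below shows rough equivalents for the three modes; the file path is hypothetical and keyword names can vary slightly between Polars versions.

```python
import polars as pl

# Scan mode: lazy; nothing is read until .collect()
lf = pl.scan_csv(
    "data/sales.csv",          # hypothetical path
    separator=",",
    has_header=True,
    quote_char='"',
    skip_rows=0,
    infer_schema_length=100,
    try_parse_dates=False,
    ignore_errors=False,
)
df = lf.collect()

# Read mode: eager; the entire file is loaded at once
df = pl.read_csv("data/sales.csv", separator=",")

# Batched mode: process the file chunk by chunk
reader = pl.read_csv_batched("data/sales.csv", batch_size=50_000)
while (batches := reader.next_batches(5)) is not None:
    for batch in batches:
        ...  # each batch is a regular DataFrame
```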

Excel

Microsoft Excel spreadsheet files (.xlsx, .xls, .xlsb).

Reading Mode:

Excel files only support eager reading (the entire file is loaded into memory). This is a limitation of the Excel format - unlike CSV or Parquet, Excel files cannot be lazily streamed.

Excel Options:

| Option              | Default    | Description |
| ------------------- | ---------- | ----------- |
| Source              | (required) | Path to the Excel file |
| Sheet Number        | 1          | Sheet to read (1 = first sheet, 0 = all sheets) |
| Sheet Name          | (none)     | Alternative: specify sheet by name |
| Has Header          | true       | First row contains column names |
| Infer Schema Length | 100        | Rows to sample for type inference |
| Drop Empty Rows     | true       | Omit completely empty rows |
| Drop Empty Columns  | true       | Omit empty columns with no headers |
| Error if Empty      | true       | Raise error if sheet contains no data |

Sheet Selection

You can select a sheet either by number (1-indexed) or by name, but not both. If neither is specified, the first sheet is read.
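
For reference, the rough Polars equivalent is sketched below (the path is hypothetical, and read_excel's exact keyword set varies across Polars versions):

```python
import polars as pl

# Excel is always eager: the whole sheet is loaded into memory
df = pl.read_excel(
    "data/report.xlsx",       # hypothetical path
    sheet_id=1,               # 1 = first sheet, 0 = all sheets
    # sheet_name="Summary",   # alternative: select the sheet by name (not both)
    has_header=True,
    infer_schema_length=100,
)

lf = df.lazy()  # wrap the eager frame so downstream tools see a LazyFrame
```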

Parquet

Columnar binary format with embedded schema.

Reading Modes:

| Mode | Description                      | Best For |
| ---- | -------------------------------- | -------- |
| Scan | Lazy loading, predicate pushdown | Large files, memory efficiency |
| Read | Eager loading                    | Small files |

Parquet Options:

| Option         | Default    | Description |
| -------------- | ---------- | ----------- |
| Source         | (required) | Path to the Parquet file |
| N Rows         | (all)      | Limit number of rows to read |
| Columns        | (all)      | Specific columns to load |
| Use Statistics | true       | Use file statistics for optimization |
| Parallel       | auto       | Parallel reading strategy |
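
A rough Polars equivalent, with a hypothetical path; Scan mode is where predicate and projection pushdown pay off:

```python
import polars as pl

# Scan mode: lazy; only the row groups and columns the query
# actually needs are decoded
lf = pl.scan_parquet("data/events.parquet", parallel="auto")
out = lf.filter(pl.col("status") == "ok").select("id", "status").collect()

# Read mode: eager, with an explicit column subset and row limit
df = pl.read_parquet(
    "data/events.parquet",
    columns=["id", "status"],
    n_rows=1_000,
)
```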

IPC / Arrow

Apache Arrow's native binary format (.arrow, .ipc, .feather). Designed for zero-copy data sharing and maximum performance.

Reading Modes:

| Mode | Description                 | Best For |
| ---- | --------------------------- | -------- |
| Scan | Lazy loading, memory-mapped | Large files, memory efficiency |
| Read | Eager loading               | Small files |

IPC Options:

| Option     | Default      | Description |
| ---------- | ------------ | ----------- |
| Source     | (required)   | Path to the Arrow file |
| N Rows     | (all)        | Limit number of rows to read |
| Memory Map | true         | Memory-map the file for efficient access |
| Cache      | true         | Cache scan result (scan mode only) |
| Rechunk    | false / true | Rechunk to contiguous memory (default differs by mode) |
| Retries    | 2            | Number of retries on I/O errors |

Advanced Options:

| Option            | Default | Description |
| ----------------- | ------- | ----------- |
| Row Index Name    | (none)  | Add a row index column with this name |
| Row Index Offset  | 0       | Start row index at this value |
| Hive Partitioning | false   | Infer partition columns from directory structure |
| Parse Hive Dates  | true    | Try to parse date columns from Hive partitions |

Performance

IPC/Arrow is the fastest format for read/write operations. Use it for intermediate files in your workflow or for sharing data between applications that support Apache Arrow (DuckDB, Spark, R, Julia, etc.).
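
A rough Polars equivalent (hypothetical path; some keywords, such as the row index options, vary by Polars version):

```python
import polars as pl

# Scan mode: lazy and memory-mapped
lf = pl.scan_ipc(
    "data/stage.arrow",       # hypothetical path
    cache=True,               # cache the scan result (scan mode only)
    rechunk=False,
    row_index_name="row_nr",  # optional row index column
    row_index_offset=0,
)

# Read mode: eager
df = pl.read_ipc("data/stage.arrow", memory_map=True)
```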

Avro

Apache Avro binary format (.avro). A row-based format commonly used in big data ecosystems like Hadoop, Kafka, and Spark.

Reading Mode:

Avro files only support eager reading (the entire file is loaded into memory). Unlike IPC or Parquet, Avro does not support lazy scanning in Polars.

Avro Options:

| Option  | Default    | Description |
| ------- | ---------- | ----------- |
| Source  | (required) | Path to the Avro file |
| N Rows  | (all)      | Limit number of rows to read |
| Columns | (all)      | Specific columns to load (by name or index) |

When to Use Avro

Avro is ideal when:

  • Integrating with Hadoop, Kafka, or Spark pipelines
  • You need schema evolution support
  • Working with row-based data patterns
  • Interoperability with Java/JVM ecosystems is important

For pure performance within Sigilweaver, consider IPC/Arrow or Parquet instead.
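
A rough Polars equivalent (hypothetical path and column names):

```python
import polars as pl

# Avro is always eager; Polars has no scan_avro
df = pl.read_avro(
    "data/clicks.avro",         # hypothetical path
    columns=["user_id", "ts"],  # or positional indices, e.g. [0, 1]
    n_rows=10_000,              # optional row limit
)

lf = df.lazy()  # hand downstream tools a LazyFrame
```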

ODS

OpenDocument Spreadsheet (.ods) files from LibreOffice and OpenOffice. Similar to Excel but uses an open standard format.

Reading Mode:

ODS files only support eager reading (the entire file is loaded into memory). Polars does not support lazy scanning or writing for the ODS format - it is input-only.

ODS Options:

| Option              | Default    | Description |
| ------------------- | ---------- | ----------- |
| Source              | (required) | Path to the ODS file |
| Sheet ID            | 1          | Which sheet to read (1-indexed, 0 = all sheets) |
| Sheet Name          | (none)     | Alternative to Sheet ID - read by name |
| Has Header          | true       | First row contains column names |
| Columns             | (all)      | Specific columns to load (by name or index) |
| Infer Schema Length | 100        | Number of rows to scan for type inference |
| Drop Empty Rows     | true       | Remove completely empty rows |
| Drop Empty Cols     | true       | Remove completely empty columns |
| Raise If Empty      | true       | Error if the sheet is empty |

When to Use ODS

ODS is ideal when:

  • Working with LibreOffice or OpenOffice spreadsheets
  • You need an open standard alternative to Excel
  • Sharing data between open source office suites
  • Compatibility with government or academic systems using open formats

Note: ODS is input-only - use Excel Output or convert to CSV/Parquet for saving data.
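
A rough Polars equivalent (hypothetical path; the empty-row/column keywords are only available in newer Polars versions):

```python
import polars as pl

# ODS is input-only and always eager
df = pl.read_ods(
    "data/budget.ods",    # hypothetical path
    sheet_id=1,           # 1-indexed; 0 = all sheets
    has_header=True,
    drop_empty_rows=True,
    drop_empty_cols=True,
    raise_if_empty=True,
)
```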

Configuration

  1. Add an Input tool to the canvas
  2. Select the data source type (CSV, Excel, Parquet, IPC/Arrow, Avro, or ODS)
  3. Choose the file using the file picker
  4. Configure format-specific options as needed

Examples

Loading a CSV

  1. Drag Input tool to canvas
  2. Select "CSV" as the data source
  3. Click "Browse" and select your file
  4. Adjust separator if not comma-delimited
  5. Wire the output to downstream tools

Loading an Excel File

  1. Drag Input tool to canvas
  2. Select "Excel" as the data source
  3. Click "Browse" and select your .xlsx, .xls, or .xlsb file
  4. Select the sheet to read (by number or name)
  5. Wire the output to downstream tools

Loading a Parquet File

  1. Drag Input tool to canvas
  2. Select "Parquet" as the data source
  3. Click "Browse" and select your file
  4. Wire the output to downstream tools

Loading an IPC/Arrow File

  1. Drag Input tool to canvas
  2. Select "IPC / Arrow" as the data source
  3. Click "Browse" and select your .arrow, .ipc, or .feather file
  4. Wire the output to downstream tools

Loading an Avro File

  1. Drag Input tool to canvas
  2. Select "Avro" as the data source
  3. Click "Browse" and select your .avro file
  4. Optionally limit rows with the "N Rows" option
  5. Wire the output to downstream tools

Loading an ODS File

  1. Drag Input tool to canvas
  2. Select "ODS" as the data source
  3. Click "Browse" and select your .ods file
  4. Select the sheet to read (by number or name)
  5. Configure options like drop empty rows/columns
  6. Wire the output to downstream tools

Notes

  • Lazy loading (Scan mode) is recommended for CSV, Parquet, and IPC/Arrow when working with large files
  • Excel, Avro, and ODS files are always loaded entirely into memory - consider converting large files to CSV, Parquet, or IPC for better performance
  • IPC/Arrow is the fastest format and ideal for intermediate data or sharing between Arrow-compatible applications
  • Avro is best for interoperability with big data ecosystems (Hadoop, Kafka, Spark)
  • ODS is read-only in Sigilweaver (Polars limitation) - use Excel Output or convert to other formats for saving
  • Schema inference may not always be accurate - use Select to cast types if needed (see the sketch below)
  • File paths must be accessible from where Sigilweaver is running
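
For the schema inference point above, this is the kind of cast a Select tool would apply after loading, sketched in Polars with hypothetical column names:

```python
import polars as pl

# Fix columns that inference guessed wrong, e.g. a zip code
# loaded as an integer and a date left as a string
lf = pl.scan_csv("data/customers.csv").with_columns(
    pl.col("zip").cast(pl.Utf8),
    pl.col("signup_date").str.to_date("%Y-%m-%d"),
)
```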