Text Processing

Text processing components handle splitting, parsing, and formatting of text content for various workflow needs.

Split Text

This component splits text into chunks based on specified criteria. It's ideal for chunking data to be tokenized and embedded into vector databases.

The Split Text component outputs Chunks or DataFrame. The Chunks output returns a list of individual text chunks. The DataFrame output returns a structured data format, with additional text and metadata columns applied.

  1. To use this component in a flow, connect a component that outputs Data or DataFrame to the Split Text component's Data port. This example uses the URL component, which fetches JSON placeholder data.

  2. In the Split Text component, define your data splitting parameters.

This example splits incoming JSON data at the separator },, so each chunk contains one JSON object.

The order of precedence is Separator, then Chunk Size, and then Chunk Overlap. If any segment after separator splitting is longer than chunk_size, it is split again to fit within chunk_size.

Once segments fit within chunk_size, Chunk Overlap is applied between adjacent chunks to maintain context.

  3. Connect a Chat Output component to the Split Text component's DataFrame output to view its output.

  4. Click Playground, and then click Run Flow. The output contains a table of JSON objects split at },.

{
"userId": 1,
"id": 1,
"title": "Introduction to Artificial Intelligence",
"body": "Learn the basics of Artificial Intelligence and its applications in various industries.",
"link": "https://example.com/article1",
"comment_count": 8
},
{
"userId": 2,
"id": 2,
"title": "Web Development with React",
"body": "Build modern web applications using React.js and explore its powerful features.",
"link": "https://example.com/article2",
"comment_count": 12
},
  5. Clear the Separator field, and then run the flow again. Instead of JSON objects, the output contains 50-character lines of text with 10 characters of overlap.

First chunk: "title": "Introduction to Artificial Intelligence""
Second chunk: "elligence", "body": "Learn the basics of Artif"
Third chunk: "s of Artificial Intelligence and its applications"
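The precedence described in step 2 (Separator, then Chunk Size, then Chunk Overlap) can be sketched in plain Python. This is a simplified illustration of the documented behavior, not the component's actual implementation: the function name split_text is hypothetical, and dropping the separator and stripping whitespace from each segment are assumptions made for the example.

```python
def split_text(
    text: str,
    separator: str = "\n",
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
) -> list[str]:
    """Illustrative splitter: separator first, then size, then overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # 1. Separator: an empty separator means the whole text is one segment.
    #    (Assumption: the separator itself is not kept in the chunks.)
    segments = text.split(separator) if separator else [text]

    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    for segment in segments:
        segment = segment.strip()
        if not segment:
            continue
        # Segments that already fit are emitted unchanged.
        if len(segment) <= chunk_size:
            chunks.append(segment)
            continue
        # 2. Chunk Size: cut oversized segments into windows of chunk_size.
        # 3. Chunk Overlap: each window starts chunk_overlap characters
        #    before the previous window ended.
        start = 0
        while True:
            chunks.append(segment[start:start + chunk_size])
            if start + chunk_size >= len(segment):
                break
            start += step
    return chunks


if __name__ == "__main__":
    # No separator: fixed-size windows of 10 characters with 3 of overlap.
    print(split_text("abcdefghijklmnopqrstuvwxyz", "", 10, 3))
    # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

With the separator cleared, as in step 5, only the size and overlap settings shape the output, which is why the chunks become fixed-width windows that share their boundary characters.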

Inputs

| Name | Display Name | Info |
|------|--------------|------|
| data_inputs | Input Documents | The data to split. The component accepts Data or DataFrame objects. |
| chunk_overlap | Chunk Overlap | The number of characters to overlap between chunks. Default: 200. |
| chunk_size | Chunk Size | The maximum number of characters in each chunk. Default: 1000. |
| separator | Separator | The character to split on. Default: newline. |
| text_key | Text Key | The key to use for the text column (advanced). Default: text. |

Outputs

| Name | Display Name | Info |
|------|--------------|------|
| chunks | Chunks | List of split text chunks as Data objects. |
| dataframe | DataFrame | List of split text chunks as DataFrame objects. |

Parser

This component formats DataFrame or Data objects into text using templates, with an option to convert inputs directly to strings using stringify.

To use this component, create variables for values in the template the same way you would in a Prompt component. For DataFrames, use column names, for example Name: {Name}. For Data objects, use {text}.

To use the Parser component with a Structured Output component, do the following:

  1. Connect a Structured Output component's DataFrame output to the Parser component's DataFrame input.

  2. Connect the File component to the Structured Output component's Message input.

  3. Connect the OpenAI model component's Language Model output to the Structured Output component's Language Model input.

The flow looks like this:

[Figure: A Parser component connected to OpenAI and Structured Output components]

  4. In the Structured Output component, click Open Table. This opens a pane for structuring your table. The table contains the columns Name, Description, Type, and Multiple.

  5. Create a table that maps to the data you're loading from the File component. For example, to create a table for employees, you might have the rows id, name, and email, all of type string.

  6. In the Template field of the Parser component, enter a template for parsing the Structured Output component's DataFrame output into structured text. Create variables for values in the template the same way you would in a Prompt component. For example, to present a table of employees in Markdown:

# Employee Profile
## Personal Information
- **Name:** {name}
- **ID:** {id}
- **Email:** {email}
  7. To run the flow, run the Parser component.

  8. To view your parsed text, inspect the Parser component's output.

  9. Optionally, connect a Chat Output component, and open the Playground to see the output.

For an additional example of using the Parser component to format a DataFrame from a Structured Output component, see the Market Research template flow.
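To make the template step concrete, the following sketch shows roughly how curly-bracket variables map to DataFrame columns, using Python's built-in str.format. It is an approximation under stated assumptions, not the Parser component's code: the function render_rows and the sample employee rows are hypothetical, and each row is modeled as a plain dictionary keyed by column name.

```python
EMPLOYEE_TEMPLATE = """# Employee Profile
## Personal Information
- **Name:** {name}
- **ID:** {id}
- **Email:** {email}"""


def render_rows(rows: list[dict], template: str, sep: str = "\n") -> str:
    """Fill the template once per row and join the results with sep."""
    # Each {variable} in the template is looked up by column name.
    # A variable with no matching column raises KeyError.
    return sep.join(template.format(**row) for row in rows)


if __name__ == "__main__":
    # Made-up rows standing in for the Structured Output DataFrame.
    employees = [
        {"id": "1", "name": "Ada Example", "email": "ada@example.com"},
        {"id": "2", "name": "Lin Example", "email": "lin@example.com"},
    ]
    print(render_rows(employees, EMPLOYEE_TEMPLATE))
```

The sep argument plays the role of the Separator input: it is the string placed between the rendered text of consecutive rows.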

Inputs

| Name | Display Name | Info |
|------|--------------|------|
| mode | Mode | Tab selection between "Parser" and "Stringify" modes. "Stringify" converts input to a string instead of using a template. |
| pattern | Template | Template for formatting using variables in curly brackets. For DataFrames, use column names, such as Name: {Name}. For Data objects, use {text}. |
| input_data | Data or DataFrame | The input to parse. Accepts either a DataFrame or a Data object. |
| sep | Separator | String used to separate rows/items. Default: newline. |
| clean_data | Clean Data | When stringify is enabled, cleans data by removing empty rows and lines. |

Outputs

| Name | Display Name | Info |
|------|--------------|------|
| parsed_text | Parsed Text | The resulting formatted text as a Message object. |
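As a rough illustration of the Stringify mode and the Clean Data option described in the inputs, the sketch below converts rows to a string without a template and optionally removes empty rows and blank lines. Everything here is an assumption for illustration: the function stringify_rows, the "key: value" line layout, and the dictionary-based rows are hypothetical, and the component's real string output may look different.

```python
def stringify_rows(rows: list[dict], clean_data: bool = True) -> str:
    """Convert rows to plain text, optionally dropping empty content."""
    if clean_data:
        # Drop rows in which every value is missing or whitespace-only.
        rows = [
            row for row in rows
            if any(str(v).strip() for v in row.values() if v is not None)
        ]

    # Hypothetical layout: one line per row, "column: value" pairs.
    text = "\n".join(
        ", ".join(f"{key}: {value}" for key, value in row.items())
        for row in rows
    )

    if clean_data:
        # Drop blank lines, including those embedded inside values.
        text = "\n".join(line for line in text.splitlines() if line.strip())
    return text


if __name__ == "__main__":
    sample = [
        {"id": "1", "name": "Ada"},
        {"id": "", "name": None},  # empty row, removed when cleaning
    ]
    print(stringify_rows(sample))
```

Stringify is the simpler choice when the downstream component only needs the raw content; the template mode is better suited when the text must follow a specific layout.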

Usage Notes

  • Intelligent Chunking: Split text into optimal chunks for vector storage and processing

  • Template Formatting: Convert structured data into readable text using custom templates

  • Context Preservation: Maintain context between chunks with overlap settings

  • Flexible Output: Generate both individual chunks and structured DataFrames

  • Variable Support: Use template variables for dynamic content formatting

  • Multiple Modes: Choose between template-based parsing and simple string conversion
