fabricatio_plot.capabilities.synthesize_data

Module for synthesizing data using LLM capabilities in a concurrent and batched manner.

Classes

SynthesizeData

Abstract base class for synthesizing structured data based on natural language requirements.

Module Contents

class fabricatio_plot.capabilities.synthesize_data.SynthesizeData(/, **data: Any)

Bases: fabricatio_core.capabilities.usages.UseLLM, abc.ABC

Abstract base class for synthesizing structured data based on natural language requirements.

Builds on UseLLM for LLM access and is declared abstract via ABC. Provides methods to generate column headers, CSV-formatted data, and large datasets assembled from batches.

async generate_header(requirement: str | List[str], **kwargs: Unpack[fabricatio_core.models.kwargs_types.ListingKwargs[str]]) → None | List[str] | List[List[str] | None]

Generate appropriate column headers based on the given requirement(s).

Parameters:

requirement – A single or list of natural language descriptions of the required data.
**kwargs – Additional keyword arguments passed to the underlying LLM processing.

Returns:

For a single requirement, a list of column headers, or None if generation fails. For a list of requirements, a list with one entry per requirement, where each entry is a header list or None.
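Because the return shape depends on whether one requirement or several were passed, callers may want to normalize it before use. The following is a minimal sketch, not part of the library: `headers_by_requirement` is a hypothetical helper, and it assumes (from the type signature only) that a list input yields one entry per requirement, or None if the whole call fails.

```python
from typing import Dict, List, Optional, Union

# Mirrors the documented return annotation of generate_header.
HeaderResult = Union[None, List[str], List[Optional[List[str]]]]


def headers_by_requirement(
    requirement: Union[str, List[str]],
    result: HeaderResult,
) -> Dict[str, Optional[List[str]]]:
    """Map each requirement to its header list (or None on failure).

    Hypothetical helper: dispatches on the *input* type, since the
    result alone can be ambiguous.
    """
    if isinstance(requirement, str):
        # Single requirement: result is a flat header list or None.
        return {requirement: result}
    if result is None:
        # Assumed case: the whole batched call failed.
        return {r: None for r in requirement}
    # Batched case: pair each requirement with its own entry.
    return dict(zip(requirement, result))
```

With this shape, a caller can check each requirement's entry for None before passing the header on to data generation.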

async generate_csv_data(requirement: str, header: List[str] | None, rows: int = 100, **kwargs: Unpack[fabricatio_core.models.kwargs_types.ValidateKwargs[str]]) → pandas.DataFrame | None

Generate CSV-formatted synthetic data matching the specified requirement and header.

Parameters:

requirement – Natural language description of the required dataset characteristics.
header – Column names to use, or None to auto-generate them (no default; must be passed explicitly).
rows – Number of data rows to generate (default: 100).
**kwargs – Additional validation-aware keyword arguments for LLM processing.

Returns:

A pandas DataFrame containing the synthesized data if successful, or None if parsing or generation fails.
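The "None if parsing fails" contract can be illustrated with a small, dependency-free sketch. This is not the library's implementation: `parse_csv_rows` is a hypothetical function, the validation rules (exact header match, uniform row width) are assumptions, and it uses the standard-library `csv` module rather than pandas. In a pandas setting, the analogous step would be `pandas.read_csv` on an `io.StringIO`.

```python
import csv
import io
from typing import Dict, List, Optional


def parse_csv_rows(text: str, header: List[str]) -> Optional[List[Dict[str, str]]]:
    """Parse CSV text; return None when the header or any row does not match."""
    reader = csv.reader(io.StringIO(text.strip()))
    rows = [row for row in reader if row]  # drop blank lines
    if not rows or [cell.strip() for cell in rows[0]] != header:
        # Empty response, or the first line is not the expected header.
        return None
    body = rows[1:]
    if any(len(row) != len(header) for row in body):
        # A row with the wrong number of fields counts as a parse failure.
        return None
    return [dict(zip(header, row)) for row in body]
```

Returning None rather than raising lets a caller discard a bad response and keep the results of other calls.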

async synthesize_data(requirement: str, header: List[str] | None = None, rows: int = 1000, batch_size: int = 100, **kwargs: Unpack[fabricatio_core.models.kwargs_types.ValidateKwargs[str]]) → pandas.DataFrame | None

Synthesize large datasets efficiently by parallel batch generation and concatenation.

Parameters:

requirement – Natural language specification of the desired dataset.
header – Optional explicit column header list; if omitted, auto-generated.
rows – Total number of rows to generate (default: 1000).
batch_size – Number of rows per parallel batch (default: 100).
**kwargs – Validation-aware keyword arguments passed to LLM processing.

Returns:

A unified DataFrame containing all successfully generated data, or None if no batches succeed.
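The batching behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `plan_batches`, `gather_batches`, and `fake_generate` are hypothetical names, the handling of a remainder batch is assumed, and plain lists stand in for DataFrames (with pandas, the merge step would typically be `pandas.concat`).

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

Rows = List[List[str]]


def plan_batches(rows: int, batch_size: int) -> List[int]:
    """Split `rows` into batch sizes, with a smaller final batch if needed."""
    if rows <= 0 or batch_size <= 0:
        return []
    full, remainder = divmod(rows, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


async def gather_batches(
    generate: Callable[[int], Awaitable[Optional[Rows]]],
    rows: int = 1000,
    batch_size: int = 100,
) -> Optional[Rows]:
    """Run one `generate(n)` call per batch concurrently and merge successes."""
    results = await asyncio.gather(
        *(generate(n) for n in plan_batches(rows, batch_size))
    )
    succeeded = [batch for batch in results if batch is not None]
    if not succeeded:
        return None  # no batch succeeded
    return [row for batch in succeeded for row in batch]


async def fake_generate(n: int) -> Optional[Rows]:
    # Hypothetical stand-in for an LLM call; fails for batches under 100 rows.
    return None if n < 100 else [["x"]] * n


if __name__ == "__main__":
    merged = asyncio.run(gather_batches(fake_generate, rows=250, batch_size=100))
    print(len(merged) if merged is not None else None)
```

Note that in this sketch a failed batch is dropped rather than retried, so the merged result can contain fewer rows than requested.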