API Reference¶
For key concepts like delimiter-based matching, capture groups, and pattern syntax, see Key Concepts.
Parser¶
Parser(delimiters: str = ' \\t\\r\\n:,!;%@/()[]', backend: str | None = None)
¶
High-level parser for extracting structured data from unstructured log messages.
The Parser uses a schema-based approach to identify patterns, extract variables, and generate log types from raw log text. It compiles patterns into a DFA (Deterministic Finite Automaton) for efficient single-pass matching.
Key Features¶
- Named capture groups: Use `(?<name>pattern)` to extract specific values
- Priority-based matching: Control which patterns are tried first
- Log type generation: Automatically creates templates from matched patterns
- Streaming parsing: Efficiently process large log files
- Multiple input sources: Parse from strings, files, or streams
Delimiter-Based Matching¶
Unlike standard regex, log-surgeon uses delimiter-based matching where `.` matches any character except delimiters (spaces, tabs, colons, etc.).
This is important for pattern design:
# With default delimiters, "." stops at spaces
parser = Parser()
parser.add_var("token", r"(?<match>d.*)")
parser.compile()
event = parser.parse_event("abc def ghi")
print(event['match'])  # "def" (NOT "def ghi")

# To match across spaces, use an explicit character class that includes the space
parser = Parser()
parser.add_var("multi", r"(?<match>d[a-z ]*i)")
parser.compile()
Workflow¶
1. Create a Parser instance with optional custom delimiters
2. Add variable patterns using `add_var()`
3. Call `compile()` to build the DFA
4. Parse logs using `parse()` or `parse_event()`
Example¶
from log_surgeon import Parser, PATTERN
parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/[^ ]+)")
parser.add_var("status", rf"status=(?<code>{PATTERN.INT})")
parser.compile()
event = parser.parse_event("GET /api/users status=200")
print(event["method"]) # "GET"
print(event["path"]) # "/api/users"
print(event["code"]) # "200"
print(event.get_log_type()) # "<method> <path> status=<code>"
See Also¶
- JsonParser : For parsing JSON-formatted logs.
- Query : For exporting parsed events to DataFrames.
- PATTERN : Pre-built regex patterns for common log elements.
Initialize the parser with optional custom delimiters.
Delimiters define token boundaries for log-surgeon's matching engine.
The . metacharacter in patterns will NOT match delimiter characters,
which affects how patterns are written.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `delimiters` | `str` | String of delimiter characters for tokenization. | `' \\t\\r\\n:,!;%@/()[]'` |
| `backend` | `str \| None` | Backend engine to use for parsing. Either `"cpp"` (uses the C++ log-surgeon library) or `"rust"` (uses the Rust log-mechanic library via cffi). If not specified, reads from the … | `None` |
Note¶
Delimiter choice significantly affects pattern matching. For example,
with default delimiters, a pattern like (?<ip>.*) will stop at the
first space. To match an IP address, use explicit character classes:
(?<ip>[0-9.]+) or the pre-built PATTERN.IPV4.
Example¶
# Default delimiters
parser = Parser()
# Custom delimiters for file path matching
parser = Parser(delimiters=r" \t\r\n:,!;%@()[]") # Removed "/"
# Minimal delimiters for maximum token length
parser = Parser(delimiters=r" \t\r\n")
# Use Rust backend
parser = Parser(backend="rust")
add_var(name: str, regex: str, priority: int = 0) -> Parser
¶
Add a variable pattern to the parser's schema.
Patterns must include at least one named capture group using (?<name>...)
syntax. The captured values are accessible on parsed LogEvent objects using
dictionary-style access: event["name"].
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Unique identifier for this variable pattern. Used for schema organization but not directly accessible on LogEvent (use capture group names instead). | required |
| `regex` | `str` | Regular expression pattern with named capture groups. Use `(?<name>...)` syntax for captures. | required |
| `priority` | `int` | Controls pattern matching order (higher = tried first). Default is 0. | `0` |
Returns:

| Type | Description |
|---|---|
| `Parser` | Self for method chaining. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If the pattern has no named capture groups, or if capture group names contain delimiter characters. |
| `AttributeError` | If a variable with the same name already exists. |
Note¶
Priority Example: If you have patterns for IP addresses and integers, give IP higher priority so "192.168.1.1" matches as an IP, not four integers.
Example¶
from log_surgeon import Parser, PATTERN
parser = Parser()
# High priority for specific patterns
parser.add_var("ip_address", rf"(?<ip>{PATTERN.IPV4})", priority=10)
# Default priority for normal patterns
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
# Low priority for generic catch-all patterns
parser.add_var("number", rf"(?<num>{PATTERN.INT})", priority=-1)
parser.compile()
add_timestamp(name: str, regex: str) -> Parser
¶
Add a timestamp pattern to the parser's schema.
Timestamps are special patterns that log-surgeon uses for log event boundary detection. When a timestamp pattern matches at the start of a line, it signals the beginning of a new log event.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Unique identifier for this timestamp pattern. Multiple timestamp patterns can be added for logs with varying timestamp formats. | required |
| `regex` | `str` | Regular expression pattern for matching timestamps. Should match the complete timestamp format used in your logs. | required |
Returns:

| Type | Description |
|---|---|
| `Parser` | Self for method chaining. |
Note¶
Timestamp patterns help log-surgeon correctly handle multi-line log events (e.g., stack traces). Without timestamps, each line is treated as a separate event.
Example¶
parser = Parser()
# ISO 8601 format: 2024-01-15T10:30:00
parser.add_timestamp("iso8601", r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
# Common log format: 15/Jan/2024:10:30:00
parser.add_timestamp("clf", r"\d{2}/[a-zA-Z]{3}/\d{4}:\d{2}:\d{2}:\d{2}")
# Unix timestamp: 1705312200
parser.add_timestamp("unix", r"\d{10}")
parser.compile()
compile(enable_debug_logs: bool = False) -> None
¶
Compile the schema and initialize the parser for use.
This method builds a DFA (Deterministic Finite Automaton) from the configured patterns and prepares the parser for log processing. Must be called after adding all variables and timestamps, and before any parsing operations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `enable_debug_logs` | `bool` | If True, output debug information to stderr during compilation and parsing. Useful for troubleshooting pattern issues. | `False` |
Raises:

| Type | Description |
|---|---|
| `RuntimeError` | If schema compilation fails due to invalid patterns or conflicting configurations. |
Warning¶
After calling compile(), the parser's schema is fixed. Adding new
variables or timestamps will not affect the compiled parser. Create
a new Parser instance if you need different patterns.
Example¶
parse(source: str | TextIO | BinaryIO | io.StringIO | io.BytesIO) -> Generator[LogEvent, None, None]
¶
Parse log events from an input source.
Generator that yields LogEvent objects for each parsed event. Supports multiple input types for flexibility in how log data is provided.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `str \| TextIO \| BinaryIO \| StringIO \| BytesIO` | Input data to parse. Accepts raw strings, text or binary file objects, and in-memory `StringIO`/`BytesIO` buffers. | required |
Yields:

| Name | Type | Description |
|---|---|---|
| LogEvent | `LogEvent` | Parsed event with extracted variables accessible via dictionary-style access (e.g., `event["name"]`). |
Raises:

| Type | Description |
|---|---|
| `RuntimeError` | If `compile()` has not been called. |
| `TypeError` | If source type is not supported. |
Note¶
For file objects, the entire content is read into memory before parsing. For very large files, consider reading and parsing in chunks.
Example¶
parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
parser.compile()
# Parse from multi-line string
logs = '''GET /api/users
POST /api/login
GET /api/status'''
for event in parser.parse(logs):
print(f"{event['method']} {event['path']}")
# Parse from file
with open("access.log") as f:
for event in parser.parse(f):
print(event['path'])
# Parse from BytesIO (e.g., from network response)
import io
data = io.BytesIO(b"GET /health\nGET /ready")
for event in parser.parse(data):
print(event['path'])
parse_event(payload: str) -> LogEvent | None
¶
Parse a single log event from a string.
Convenience method for parsing a single log message. For multiple
events or streaming parsing, use parse() instead.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `payload` | `str` | Log message string to parse. Can be a single line or multi-line string (e.g., containing a stack trace). | required |
Returns:

| Type | Description |
|---|---|
| `LogEvent \| None` | LogEvent containing extracted variables and metadata, or None if no patterns matched. |
Raises:

| Type | Description |
|---|---|
| `RuntimeError` | If `compile()` has not been called. |
Note¶
This method creates a new stream for each call, which has overhead.
For batch processing, use parse() with all log data at once.
Example¶
LogEvent¶
LogEvent()
¶
Represents a parsed log event with extracted variables and metadata.
LogEvent is the result of parsing a log message with Parser. It contains:
- The original log message
- A log type (template with placeholders for matched variables)
- Extracted variables accessible via dictionary-style indexing
Accessing Variables¶
Variables are accessed using dictionary-style indexing with the capture group name defined in your patterns (e.g., `event["method"]`). For patterns that match multiple times, the value is a list; for single matches, the value is unwrapped to a scalar.
Log Types¶
Log types are template strings where matched portions are replaced with placeholder names (e.g., "user=123" becomes "user=<user_id>"). They are useful for:
- Log clustering and deduplication
- Pattern frequency analysis
- Anomaly detection (new log types indicate new behavior)
Attributes¶
Note: These are internal attributes. Use the public methods to access data.

- `_var_dict` (dict): Internal dictionary mapping capture group names to values.
Example¶
parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+) (?<code>\d+)")
parser.compile()
event = parser.parse_event("GET /api/users 200")
# Access extracted variables
print(event["method"]) # "GET"
print(event["path"]) # "/api/users"
print(event["code"]) # "200"
# Access metadata
print(event.get_log_message()) # "GET /api/users 200"
print(event.get_log_type()) # "<method> <path> <code>"
# Get all variables as a dictionary
print(event.get_resolved_dict())
# {"method": "GET", "path": "/api/users", "code": "200"}
See Also¶
- Parser.parse : Parse multiple events from a source.
- Parser.parse_event : Parse a single event from a string.
Initialize an empty LogEvent.
get_log_message() -> str
¶
Get the original log message text.
Returns the unmodified log message as it was parsed, including any whitespace, newlines, or special characters.
Returns:

| Type | Description |
|---|---|
| `str` | The complete original log message string. |
Example¶
get_log_type() -> str
¶
Get the log type (template) for this event.
The log type is the original message with matched variables replaced
by placeholder names in angle brackets (e.g., <variable_name>).
This creates a template that represents the message's structure.
Returns:

| Type | Description |
|---|---|
| `str` | Template string with placeholders for extracted variables. |
Raises:

| Type | Description |
|---|---|
| `TypeError` | If log type is not available (internal error). |
Note¶
Log types are useful for:
- Clustering: Group similar log messages together
- Frequency analysis: Count occurrences of each pattern
- Anomaly detection: New log types may indicate unusual behavior
Example¶
parser = Parser()
parser.add_var("req", r"(?<method>GET|POST) (?<path>/\S+) (?<code>\d+)")
parser.compile()
# Same pattern, different values
e1 = parser.parse_event("GET /users 200")
e2 = parser.parse_event("POST /login 401")
print(e1.get_log_type()) # "<method> <path> <code>"
print(e2.get_log_type()) # "<method> <path> <code>"
# Both have the same log type despite different values
assert e1.get_log_type() == e2.get_log_type()
get_capture_group(name: str, raw_output: bool = False) -> str | list[str | int | float] | None
¶
Get the value of a capture group by name.
Retrieves the extracted value(s) for a named capture group. By default, single values are unwrapped from their list container for convenience.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the capture group to retrieve. Special names: `"@log_type"` (the log type template) and `"@log_message"` (the original log message). | required |
| `raw_output` | `bool` | If True, always return values as a list, even for single matches. If False (default), single-element lists are unwrapped to scalar values. | `False` |
Returns:

| Type | Description |
|---|---|
| `str \| list[str \| int \| float] \| None` | The captured value(s): a scalar for a single match, a list for multiple matches, or None if the group did not match. |
Note¶
Use raw_output=True when you need consistent list handling, such as
when iterating over potentially multi-value captures.
Example¶
# Pattern that matches multiple times
parser.add_var("errors", r"error: (?<error>[a-zA-Z0-9_]+)")
event = parser.parse_event("error: timeout error: disconnect")
# Default: single values unwrapped, multiple values as list
event.get_capture_group("error") # ["timeout", "disconnect"]
# With raw_output: always a list
event.get_capture_group("error", raw_output=True) # ["timeout", "disconnect"]
# Single match example
event2 = parser.parse_event("error: timeout")
event2.get_capture_group("error") # "timeout" (unwrapped)
event2.get_capture_group("error", raw_output=True) # ["timeout"]
# Special names
event.get_capture_group("@log_type") # "error: <error> error: <error>"
event.get_capture_group("@log_message") # "error: timeout error: disconnect"
get_resolved_dict() -> dict[str, str | list[str | int | float]]
¶
Get all extracted variables as a dictionary.
Returns a clean dictionary of all capture groups with their values. Single-element lists are unwrapped to scalar values for convenience. Internal fields like "@LogType" are excluded.
Returns:

| Type | Description |
|---|---|
| `dict[str, str \| list[str \| int \| float]]` | Dictionary mapping capture group names to their extracted values. Single-element lists are unwrapped to scalars, and internal fields like "@LogType" are excluded. |
Note¶
This method is useful for:
- Converting log events to JSON or other formats
- Passing extracted data to downstream processing
- Debugging to see all extracted values at once
Example¶
parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
parser.add_var("status", r"status=(?<code>\d+)")
parser.compile()
event = parser.parse_event("GET /api/users status=200")
result = event.get_resolved_dict()
print(result)
# {
# "method": "GET",
# "path": "/api/users",
# "code": "200"
# }
# Can be easily serialized
import json
print(json.dumps(result))
__getitem__(name: str) -> str | list[str | int | float]
¶
Access a capture group value using dictionary-style indexing.
This is the primary way to access extracted variables from a LogEvent. Single values are automatically unwrapped from lists for convenience.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the capture group as defined in the pattern's `(?<name>...)` syntax. | required |
Returns:

| Type | Description |
|---|---|
| `str \| list[str \| int \| float]` | The captured value(s). Returns a scalar for single matches, or a list for multiple matches of the same capture group. |
Raises:

| Type | Description |
|---|---|
| `KeyError` | If the capture group name does not exist or was not matched. |
Example¶
parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
parser.compile()
event = parser.parse_event("GET /api/users")
# Access like a dictionary
print(event["method"]) # "GET"
print(event["path"]) # "/api/users"
# KeyError for missing fields
try:
event["nonexistent"]
except KeyError as e:
print(e) # "Capture group 'nonexistent' not found"
Query¶
Query(parser: Parser | JsonParser)
¶
Query builder for parsing log events into structured data formats.
Query provides a fluent interface for extracting, filtering, and exporting log data to pandas DataFrames or PyArrow Tables. It works with both Parser (for text logs) and JsonParser (for JSON logs).
Workflow¶
1. Create a Query with a compiled Parser or JsonParser
2. Select fields to extract using `select()`
3. Optionally filter events using `filter()`
4. Set the input source using `from_()`
5. Export using `to_dataframe()` or `to_arrow()`
Key Features¶
- Field selection: Choose specific fields or use `"*"` for all
- Filtering: Apply lambda predicates to select events
- Multiple exports: DataFrame, Arrow Table, or raw rows
- Log type analysis: Get unique log types and their counts
Special Fields¶
In addition to capture group names, you can select:
- `"@log_type"`: The log type template
- `"@log_message"`: The original log message
- `"*"`: All capture groups (Parser only)
Example¶
from log_surgeon import Parser, Query, PATTERN
parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
parser.add_var("status", rf"(?<code>{PATTERN.INT})")
parser.compile()
# Basic query
df = (
Query(parser)
.select(["method", "path", "code"])
.from_(log_file)
.to_dataframe()
)
# With filtering
errors_df = (
Query(parser)
.select(["@log_message", "code"])
.filter(lambda e: int(e["code"]) >= 400)
.from_(log_file)
.to_dataframe()
)
# Log type analysis
query = Query(parser).from_(log_file)
for log_type, count in query.get_log_type_counts().items():
print(f"{count:5d} {log_type}")
JsonParser Example¶
json_parser = JsonParser(parser).target_fields(["message"])
df = (
Query(json_parser)
.select(["extracted.user_id", "extracted.action"])
.from_(ndjson_file)
.to_dataframe()
)
See Also¶
- Parser : For creating text log parsers.
- JsonParser : For creating JSON log parsers.
Initialize a query builder with a parser.
Creates a new Query instance that will use the given parser to process log data. The parser must be compiled before use.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `parser` | `Parser \| JsonParser` | A compiled Parser or JsonParser instance. The parser's patterns determine what fields can be selected and filtered. | required |
Example¶
# With Parser (text logs)
parser = Parser()
parser.add_var("metric", r"value=(?<value>\d+)")
parser.compile()
query = Query(parser)
# With JsonParser (JSON logs)
json_parser = JsonParser(parser).target_fields(["message"])
query = Query(json_parser)
select(fields: list[str]) -> Query
¶
Select fields to include in the output.
Specifies which extracted variables and metadata to include when exporting to DataFrame or Arrow Table. Fields appear as columns in the order specified.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `fields` | `list[str]` | List of field names to extract. Supports capture group names (from your patterns), the `"*"` wildcard (Parser only), metadata fields (`"@log_type"`, `"@log_message"`), and JsonParser fields with dot-notation for nested access (e.g., `"extracted.user_id"`). | required |
Returns:

| Type | Description |
|---|---|
| `Query` | Self for method chaining. |
Note¶
The "*" wildcard only works with Parser, not JsonParser. For JsonParser,
explicitly list the fields you want from the enriched JSON structure.
Example¶
# Select specific capture groups
query.select(["method", "path", "code"])
# Select all capture groups (Parser only)
query.select(["*"])
# Include metadata with capture groups
query.select(["@log_type", "@log_message", "method", "path"])
# Combine wildcard with metadata
query.select(["@log_type", "*"])
# JsonParser: access nested extracted fields
query.select(["extracted.user_id", "extracted.action", "message"])
filter(predicate: Callable[[LogEvent], bool] | Callable[[dict[str, Any]], bool]) -> Query
¶
Filter log events using a predicate function.
Applies a filter to include only events where the predicate returns True. Events where the predicate returns False are excluded from the output.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `predicate` | `Callable[[LogEvent], bool] \| Callable[[dict[str, Any]], bool]` | Function that receives an event and returns a boolean. Return True to include the event, False to exclude it. | required |
Returns:

| Type | Description |
|---|---|
| `Query` | Self for method chaining. |
Warning¶
Only one filter can be active. Calling filter() multiple times
replaces the previous predicate. Combine conditions in a single
predicate using and/or operators.
Note¶
Handle missing fields gracefully with try/except or .get() to avoid
errors when a capture group does not match in some events.
Example¶
# Simple value filter
query.filter(lambda e: int(e["status_code"]) >= 400)
# Multiple conditions
query.filter(lambda e: e["method"] == "POST" and int(e["code"]) != 200)
# Safe filter for optional fields
def is_error(event):
try:
return int(event["code"]) >= 500
except (KeyError, ValueError):
return False
query.filter(is_error)
# JsonParser filter (receives dict)
query.filter(lambda obj: obj.get("level") == "ERROR")
# Access nested JsonParser fields
query.filter(lambda obj: obj.get("extracted", {}).get("user_id") == "admin")
from_(source: str | TextIO | BinaryIO | io.StringIO | io.BytesIO) -> Query
¶
Set the input source containing log data.
Specifies where to read log data from. Must be called before any
export method (to_dataframe(), to_arrow(), etc.).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `str \| TextIO \| BinaryIO \| StringIO \| BytesIO` | Input data to parse. Accepts raw strings, text or binary file objects, and in-memory `StringIO`/`BytesIO` buffers. | required |
Returns:

| Type | Description |
|---|---|
| `Query` | Self for method chaining. |
Raises:

| Type | Description |
|---|---|
| `TypeError` | If source type is not supported. |
Note¶
For file objects, the entire content is read into memory. For very large files, consider processing in chunks.
Example¶
query = Query(parser).select(["method", "path"])
# From string
df = query.from_("GET /api\nPOST /login").to_dataframe()
# From file
with open("access.log") as f:
df = query.from_(f).to_dataframe()
# From BytesIO (e.g., from HTTP response)
import io
data = io.BytesIO(response.content)
df = query.from_(data).to_dataframe()
validate_query() -> Query
¶
Validate that the query is properly configured.
Returns:

| Type | Description |
|---|---|
| `Query` | Self for method chaining. |
Raises:

| Type | Description |
|---|---|
| `AttributeError` | If fields or stream are not set. |
to_dataframe() -> pd.DataFrame
¶
Export parsed events to a pandas DataFrame.
Parses all events from the configured source, applies any filter, and returns a DataFrame with the selected fields as columns.
Returns:

| Type | Description |
|---|---|
| `DataFrame` | pandas DataFrame with one row per event and one column per selected field. Column order matches the order in `select()`. |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If pandas is not installed. Install with: `pip install pandas`. |
| `AttributeError` | If `select()` or `from_()` has not been called. |
Example¶
to_arrow() -> pa.Table
¶
Export parsed events to a PyArrow Table.
Parses all events from the configured source, applies any filter, and returns a PyArrow Table with the selected fields as columns. Arrow Tables are memory-efficient and ideal for large datasets.
Returns:

| Type | Description |
|---|---|
| `Table` | PyArrow Table with one row per event and one column per selected field. Column order matches the order in `select()`. |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If pyarrow is not installed. Install with: `pip install pyarrow`. |
| `AttributeError` | If `select()` or `from_()` has not been called. |
Note¶
Arrow Tables use columnar storage, which is more memory-efficient than row-based formats for large datasets. They also integrate well with Parquet files and other data processing tools.
Example¶
get_rows() -> list[list[str]]
¶
Extract raw rows of field values from parsed events.
Lower-level method that returns parsed data as a list of lists.
Each inner list represents one event with values in the order
specified by select().
Returns:

| Type | Description |
|---|---|
| `list[list[str]]` | List of rows, where each row is a list of string values. Row order matches event order; column order matches `select()`. |
Note¶
This method is useful when you need raw data without pandas/pyarrow
dependencies. For most use cases, prefer to_dataframe() or to_arrow().
Example¶
get_log_types() -> Generator[str, None, None]
¶
Get unique log types from parsed events.
Yields each distinct log type template exactly once, in the order first encountered. Useful for discovering the different message patterns in your logs.
Yields:

| Type | Description |
|---|---|
| `str` | Unique log type strings (templates with variable placeholders). |
Note¶
- If a filter is set, only matching events contribute log types
- For JsonParser, requires `include_log_type(True)` to be set
- Log types are yielded in first-seen order, not sorted
Example¶
get_log_type_counts() -> dict[str, int]
¶
Count occurrences of each log type.
Counts how many times each distinct log type pattern appears in the log data. Useful for understanding log composition and identifying frequent vs. rare patterns.
Returns:

| Type | Description |
|---|---|
| `dict[str, int]` | Dictionary mapping log type templates to their occurrence counts. Not sorted; use `sorted()` on the items if you need an ordering. |
Note¶
- If a filter is set, only matching events are counted
- For JsonParser, requires `include_log_type(True)` to be set
Example¶
query = Query(parser).from_(log_data)
counts = query.get_log_type_counts()
# Print sorted by frequency (most common first)
for log_type, count in sorted(counts.items(), key=lambda x: -x[1]):
print(f"{count:6d} {log_type}")
# Output:
# 15432 <method> <path> <code>
# 2341 Connection from <ip>:<port>
# 17 Error: <error_message>
# Find rare patterns (potential anomalies)
rare = [lt for lt, c in counts.items() if c < 10]
print(f"Rare patterns: {len(rare)}")
JsonParser¶
JsonParser(parser: Parser)
¶
Parser for JSON-formatted logs that extracts variables from string fields.
JsonParser wraps an existing Parser and applies its extraction rules to JSON string fields, then merges the extracted variables back into the original JSON. This enables structured extraction from JSON logs while preserving the original JSON structure.
Key Features¶
- Flexible field targeting: Extract from all strings or specific fields
- Nested field support: Access nested JSON fields using dot-notation
- Multiple formats: Parse NDJSON or JSON array formats with auto-detection
- Conflict resolution: Configure how extracted keys merge with existing JSON
- Streaming support: Efficiently process large NDJSON files line by line
Default Behavior¶
By default, JsonParser extracts from all string fields in the JSON object,
recursively traversing nested objects and arrays. Use target_fields() to
limit extraction to specific fields for better performance and precision.
Workflow¶
1. Create a Parser with extraction patterns and compile it
2. Create a JsonParser wrapping the Parser
3. Optionally configure target fields and conflict strategy
4. Parse JSON logs using `parse()` or `parse_one()`
Note¶
Extraction only works on string-type fields. Non-string fields (numbers, booleans, nested objects, arrays) are skipped unless they contain strings.
Example¶
from log_surgeon import JsonParser, Parser, ConflictStrategy
# Step 1: Create and configure underlying parser
parser = Parser()
parser.add_var("user_info", r"user=(?<user_id>\d+)")
parser.add_var("action", r"action=(?<action>[a-zA-Z0-9_]+)")
parser.compile()
# Step 2: Create JSON parser with field targeting
json_parser = (
JsonParser(parser)
.target_fields(["message", "context.detail"]) # Only parse these fields
.on_conflict(ConflictStrategy.NEST, key="extracted")
)
# Step 3: Parse JSON logs
input_json = '{"ts": "2024-01-01", "message": "user=123 action=login"}'
result = json_parser.parse_one(input_json)
print(result)
# {
# 'ts': '2024-01-01',
# 'message': 'user=123 action=login',
# 'extracted': {'user_id': '123', 'action': 'login'}
# }
See Also¶
- Parser : For creating extraction patterns.
- ConflictStrategy : For configuring conflict resolution.
- Query : For exporting JsonParser results to DataFrames.
Initialize the JSON parser with an underlying Parser.
Creates a JsonParser that applies the given Parser's extraction patterns to JSON string fields. By default, extracts from all string fields.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `parser` | `Parser` | A compiled Parser instance with extraction patterns defined. Must have `compile()` called before being wrapped. | required |
Note¶
Default configuration:

- Extracts from all string fields (equivalent to `target_fields("*")`)
- Uses NEST conflict strategy with key "extracted"
- Does not include log type in output

Use the fluent API methods to customize behavior:

- `target_fields()`: Limit which JSON fields are parsed
- `on_conflict()`: Configure conflict resolution strategy
- `include_log_type()`: Include log type templates in output
Example¶
# Create and configure the underlying parser
parser = Parser()
parser.add_var("metric", r"value=(?<value>\d+)")
parser.compile()
# Create JSON parser with default settings
json_parser = JsonParser(parser)
# Or customize with fluent API
json_parser = (
JsonParser(parser)
.target_fields(["message"])
.on_conflict(ConflictStrategy.NEST, key="data")
.include_log_type(True)
)
target_fields(fields: list[str] | str) -> JsonParser
¶
Configure which JSON fields to target for variable extraction.
By default, JsonParser extracts from all string fields in the JSON object. Use this method to limit extraction to specific fields for better performance and to avoid unintended matches in other fields.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `fields` | `list[str] \| str` | Field specification. Accepts a single field name, a list of field names, or `"*"` to target all string fields. Dot-notation is supported for nested fields (e.g., `"error.details"`). | required |
Returns:

| Type | Description |
|---|---|
| `JsonParser` | Self for method chaining. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If the field specification is invalid. |
Note¶
Targeting specific fields is recommended for production use:
- Performance: Avoids parsing irrelevant fields
- Precision: Prevents false matches in metadata fields
- Clarity: Makes extraction intent explicit
Example¶
json_parser = JsonParser(parser)
# Target single field
json_parser.target_fields("message")
# Target multiple fields (including nested)
json_parser.target_fields(["message", "error.details", "context.info"])
# Reset to all string fields
json_parser.target_fields("*")
Nested Field Example¶
on_conflict(strategy: ConflictStrategy, prefix: str = 'extracted.', key: str = 'extracted') -> JsonParser
¶
Configure how to handle key conflicts between extracted and existing JSON keys.
When an extracted variable name matches an existing key in the JSON object, this setting determines how the conflict is resolved.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `strategy` | `ConflictStrategy` | The conflict resolution strategy to use. See `ConflictStrategy` for the available options. | required |
| `prefix` | `str` | Prefix for extracted keys when using `ConflictStrategy.PREFIX`. | `'extracted.'` |
| `key` | `str` | Nesting key when using `ConflictStrategy.NEST`. | `'extracted'` |
Returns:

| Type | Description |
|---|---|
| `JsonParser` | Self for method chaining. |
Example¶
from log_surgeon import JsonParser, ConflictStrategy
# NEST (default): All extracted values under a nested key
json_parser.on_conflict(ConflictStrategy.NEST, key="parsed")
# Input: {"message": "user=123"}
# Output: {"message": "user=123", "parsed": {"user_id": "123"}}
# PREFIX: Add prefix to each extracted key
json_parser.on_conflict(ConflictStrategy.PREFIX, prefix="log_")
# Input: {"message": "user=123"}
# Output: {"message": "user=123", "log_user_id": "123"}
# OVERWRITE: Replace existing keys (use with caution)
json_parser.on_conflict(ConflictStrategy.OVERWRITE)
# Input: {"user_id": "old", "message": "user=123"}
# Output: {"user_id": "123", "message": "user=123"} # Warning printed
# RAISE: Fail on conflict (for development/testing)
json_parser.on_conflict(ConflictStrategy.RAISE)
# Raises KeyError if extracted key exists in JSON
include_log_type(include: bool = True) -> JsonParser
¶
Configure whether to include log type templates in the output.
Log types are template strings where matched variables are replaced with
placeholders (e.g., "user=123" becomes "user=<user_id>").
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| include | bool | If True, include "@log_type" in the extracted variables. | True |
Returns:
| Type | Description |
|---|---|
| JsonParser | Self for method chaining. |
Note¶
When enabled, each parsed field contributes its log type. If multiple fields are parsed, log types are aggregated.
Example¶
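A minimal usage sketch; the output shape shown assumes the default NEST conflict strategy and the placeholder style produced by get_log_type():

```python
json_parser.include_log_type(True)
result = json_parser.parse_one('{"message": "user=123"}')
# Assumed output shape:
# result["extracted"]["user_id"]   -> "123"
# result["extracted"]["@log_type"] -> "user=<user_id>"
```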
parse(source: str | TextIO | BinaryIO | io.StringIO | io.BytesIO) -> Generator[dict[str, Any], None, None]
¶
Parse JSON logs from an input source.
Generator that yields enriched JSON dictionaries with extracted variables merged in. Supports both NDJSON (newline-delimited JSON) and JSON array formats, with automatic format detection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str \| TextIO \| BinaryIO \| StringIO \| BytesIO | Input data to parse. Accepts a string of JSON content, a text or binary file object, or an in-memory StringIO/BytesIO buffer. | required |
Yields:
| Name | Type | Description |
|---|---|---|
| dict | dict[str, Any] | Enriched JSON object with extracted variables merged according to the configured conflict strategy. |
Raises:
| Type | Description |
|---|---|
| JSONDecodeError | If the input contains invalid JSON. |
| TypeError | If the source type is not supported. |
Format Detection¶
The format is auto-detected by checking the first non-whitespace character:
- Starts with [: parsed as a JSON array (entire content loaded)
- Otherwise: parsed as NDJSON (streamed line by line for file objects)
Note¶
For NDJSON files, parsing is streamed line-by-line to minimize memory usage. JSON arrays must be fully loaded into memory for parsing.
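The auto-detection rule can be sketched in plain Python (illustrative only, not the library's implementation, which streams file objects rather than materializing lists):

```python
import json

def detect_and_parse(text: str):
    """Auto-detect JSON array vs. NDJSON by the first non-whitespace char."""
    stripped = text.lstrip()
    if stripped.startswith("["):
        # JSON array: the whole document must be loaded at once.
        return json.loads(stripped)
    # NDJSON: one JSON object per non-empty line.
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]

print(detect_and_parse('[{"a": 1}, {"a": 2}]'))  # [{'a': 1}, {'a': 2}]
print(detect_and_parse('{"a": 1}\n{"a": 2}\n'))  # [{'a': 1}, {'a': 2}]
```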
Example¶
# Parse NDJSON from string
ndjson = '''{"message": "user=123"}
{"message": "user=456"}'''
for result in json_parser.parse(ndjson):
print(result["extracted"]["user_id"])
# 123
# 456
# Parse JSON array
json_array = '[{"message": "user=123"}, {"message": "user=456"}]'
for result in json_parser.parse(json_array):
print(result["extracted"]["user_id"])
# Stream from file
with open("logs.ndjson") as f:
for result in json_parser.parse(f):
print(result)
parse_one(json_line: str) -> dict[str, Any]
¶
Parse a single JSON object and return the enriched result.
Convenience method for parsing a single JSON log entry. For multiple
entries, use parse() instead.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| json_line | str | A single JSON object as a string. Must be a valid JSON object (starts with {). | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Enriched JSON dictionary with the original fields plus extracted variables merged according to the configured conflict strategy. |
Raises:
| Type | Description |
|---|---|
| JSONDecodeError | If the input is not valid JSON. |
Example¶
result = json_parser.parse_one('{"ts": "2024-01-01", "message": "user=123"}')
# Access original fields
print(result["ts"]) # "2024-01-01"
print(result["message"]) # "user=123"
# Access extracted fields (with default NEST strategy)
print(result["extracted"]["user_id"]) # "123"
# Full result structure
# {
# "ts": "2024-01-01",
# "message": "user=123",
# "extracted": {"user_id": "123"}
# }
ConflictStrategy¶
ConflictStrategy
¶
Bases: Enum
Strategy for handling conflicts when extracted keys match existing JSON keys.
When JsonParser extracts variables from JSON fields, the extracted key names might conflict with keys already present in the JSON object. This enum defines how such conflicts are resolved.
Attributes¶
NEST : enum member
Place all extracted variables under a nested key (default: "extracted").
This is the safest option as it completely avoids conflicts.
Result: {"message": "...", "extracted": {"user_id": "123"}}
PREFIX : enum member
Add a prefix to extracted variable names (default: "extracted.").
Warns to stderr if the prefixed key still conflicts.
Result: {"message": "...", "extracted.user_id": "123"}
OVERWRITE : enum member
Replace existing keys with extracted values. Prints a warning to stderr
when overwriting occurs. Use with caution as original data is lost.
Result: {"user_id": "123"} (original user_id overwritten)
RAISE : enum member
Raise KeyError when a conflict is detected. Useful for development and testing to catch unexpected conflicts early.
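The four strategies can be sketched in plain Python (illustrative only; the hypothetical `merge` helper below omits the stderr warnings the library prints for PREFIX and OVERWRITE):

```python
def merge(original: dict, extracted: dict, strategy: str,
          prefix: str = "extracted.", key: str = "extracted") -> dict:
    """Merge extracted variables into a JSON object per conflict strategy."""
    result = dict(original)
    if strategy == "NEST":
        # Safest: all extracted values live under one nested key.
        result[key] = dict(extracted)
    elif strategy == "PREFIX":
        for k, v in extracted.items():
            result[prefix + k] = v
    elif strategy == "OVERWRITE":
        # Original values for colliding keys are lost.
        result.update(extracted)
    elif strategy == "RAISE":
        for k in extracted:
            if k in result:
                raise KeyError(f"extracted key {k!r} already exists")
        result.update(extracted)
    return result

original = {"message": "user=123"}
print(merge(original, {"user_id": "123"}, "NEST"))
# {'message': 'user=123', 'extracted': {'user_id': '123'}}
print(merge(original, {"user_id": "123"}, "PREFIX"))
# {'message': 'user=123', 'extracted.user_id': '123'}
```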
Example¶
from log_surgeon import JsonParser, ConflictStrategy
# Default: nest under "extracted" key
json_parser = JsonParser(parser)
# Custom nest key
json_parser.on_conflict(ConflictStrategy.NEST, key="parsed")
# Result: {"message": "...", "parsed": {"user_id": "123"}}
# Use prefix instead
json_parser.on_conflict(ConflictStrategy.PREFIX, prefix="log_")
# Result: {"message": "...", "log_user_id": "123"}
# Fail on conflict (for testing)
json_parser.on_conflict(ConflictStrategy.RAISE)
SchemaCompiler¶
SchemaCompiler(delimiters: str = DEFAULT_DELIMITERS)
¶
Compiler for constructing log-surgeon schema definitions.
SchemaCompiler provides a fluent interface for building schema definitions used by the log-surgeon parsing engine. It handles variable registration, pattern validation, and schema serialization.
Key Responsibilities¶
- Register variable patterns with named capture groups
- Track capture group names for validation
- Manage variable priority for pattern ordering
- Generate hidden variable names for internal use
- Compile the final schema string
Schema Format¶
The compiled schema is a text format with sections for delimiters, timestamps, and variables:
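An illustrative fragment of what the compiled schema text might look like (the exact line syntax here is an assumption based on the section names above, not verbatim compiler output):

```
delimiters: \t\r\n:,!;%@/()[]
timestamp:\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2}
ip:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
```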
Priority System¶
Variables are ordered in the schema by:
1. Priority (descending): higher priority = appears first
2. Insertion order (ascending): earlier added = appears first
This ordering affects which pattern is tried first when multiple patterns could match the same text.
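The two-level ordering can be expressed as a sort key (a sketch; variables are modeled here as plain tuples rather than the library's Variable objects):

```python
# Sort variables by (priority descending, insertion order ascending).
# Each variable is modeled as a (name, priority, insertion_index) tuple.
variables = [
    ("int", -1, 2),     # low priority, added third
    ("ip", 10, 0),      # high priority, added first
    ("request", 0, 1),  # default priority, added second
]
ordered = sorted(variables, key=lambda v: (-v[1], v[2]))
print([name for name, _, _ in ordered])  # ['ip', 'request', 'int']
```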
Note¶
Most users should use the Parser class, which provides a simpler interface and handles schema compilation automatically. Use SchemaCompiler directly only for advanced use cases.
Example¶
from log_surgeon.schema_compiler import SchemaCompiler
compiler = SchemaCompiler()
# Add patterns with priority
compiler.add_var("ip", r"(?<ip>[0-9.]+)", priority=10)
compiler.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
compiler.add_var("int", r"(?<num>\d+)", priority=-1) # Low priority
# Add timestamp for multi-line event detection
compiler.add_timestamp("iso", r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
# Compile to schema string
schema = compiler.compile()
See Also¶
Parser : High-level interface for log parsing. Variable : Data class representing a variable definition.
Initialize a schema compiler.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| delimiters | str | String of delimiter characters for tokenization. Default includes space, tab, newline, and common punctuation. | DEFAULT_DELIMITERS |
add_var(name: str, regex: str, priority: int = 0) -> SchemaCompiler
¶
Add a variable pattern to the schema.
Patterns must include at least one named capture group using
(?<name>...) syntax. The capture group names become the keys
for accessing extracted values from parsed events.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Unique identifier for this variable pattern. | required |
| regex | str | Regular expression with named capture groups. Use (?<name>...) syntax to define them. | required |
| priority | int | Pattern ordering priority. Higher values are tried first during matching. | 0 |
Returns:
| Type | Description |
|---|---|
| SchemaCompiler | Self for method chaining. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the pattern has no capture groups, or if names contain delimiter characters. |
| AttributeError | If a variable with this name already exists. |
Example¶
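A usage sketch mirroring the class-level example above:

```python
from log_surgeon.schema_compiler import SchemaCompiler

compiler = SchemaCompiler()
compiler.add_var("ip", r"(?<ip>[0-9.]+)", priority=10)  # tried first
compiler.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
compiler.add_var("int", r"(?<num>\d+)", priority=-1)    # tried last
```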
add_timestamp(name: str, regex: str) -> SchemaCompiler
¶
Add a timestamp pattern to the schema.
Timestamps help log-surgeon detect log event boundaries. When a timestamp pattern matches at the start of a line, it signals a new log event, enabling correct handling of multi-line events like stack traces.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Unique identifier for this timestamp pattern. | required |
| regex | str | Regular expression for matching timestamp formats. | required |
Returns:
| Type | Description |
|---|---|
| SchemaCompiler | Self for method chaining. |
Example¶
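A usage sketch based on the class-level example:

```python
# Mark ISO-8601-style lines as event starts so multi-line events
# (e.g., stack traces) attach to the preceding timestamped line.
compiler.add_timestamp("iso", r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
```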
remove_var(var_name: str) -> SchemaCompiler
¶
Remove a variable from the schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| var_name | str | Name of the variable to remove (or its original name if hidden). | required |
Returns:
| Type | Description |
|---|---|
| SchemaCompiler | Self for method chaining. |
get_var(var_name: str) -> Variable
¶
Get a variable by name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| var_name | str | Variable name. | required |
Returns:
| Type | Description |
|---|---|
| Variable | The Variable object. |
compile() -> str
¶
Compile the schema to a string for the log-surgeon engine.
Generates the final schema definition that includes delimiters, timestamps, and variables ordered by priority. This string is passed to the log-surgeon C++ library for DFA compilation.
Returns:
| Type | Description |
|---|---|
| str | Schema definition string in log-surgeon format. |
Note¶
Variables are ordered by:
1. Priority (descending): higher priority patterns first
2. Insertion order (ascending): earlier added patterns first
This ordering determines which pattern is tried first when multiple patterns could match the same text.
Example¶
compiler = SchemaCompiler()
compiler.add_var("ip", r"(?<ip>[0-9.]+)", priority=10)
compiler.add_var("number", r"(?<num>\d+)", priority=-1)
compiler.add_timestamp("ts", r"\d{4}-\d{2}-\d{2}")
schema = compiler.compile()
# Returns formatted schema string with sections for
# delimiters, timestamps, and variables
PATTERN¶
The PATTERN class provides pre-built regex patterns optimized for log parsing.
PATTERN
¶
Collection of pre-built regex patterns for common log elements.
PATTERN provides ready-to-use regex patterns optimized for log-surgeon's
delimiter-based matching. Use these patterns with add_var() to extract
common log elements without writing complex regex manually.
Categories¶
- Network Patterns: UUID, IPV4, PORT
- Numeric Patterns: INT, FLOAT
- File System Patterns: LINUX_FILE_NAME, LINUX_FILE_PATH
- Character Sets: JAVA_IDENTIFIER, LOG_LINE, LOG_LINE_NO_WHITE_SPACE
- Java Patterns: JAVA_CLASS_NAME, JAVA_FULLY_QUALIFIED_CLASS_NAME, JAVA_STACK_LOCATION
Usage¶
Embed patterns in your regex using f-strings:
parser.add_var("ip", rf"(?<ip>{PATTERN.IPV4})")
parser.add_var("value", rf"val=(?<v>{PATTERN.INT})")
Note¶
These patterns use log-surgeon regex syntax, which differs slightly from
Python regex. Notably, . matches any character except delimiters.
Example¶
from log_surgeon import Parser, PATTERN
parser = Parser()
# Network patterns
parser.add_var("connection", rf"(?<ip>{PATTERN.IPV4}):(?<port>{PATTERN.PORT})")
parser.add_var("request_id", rf"id=(?<uuid>{PATTERN.UUID})")
# Numeric patterns
parser.add_var("metric", rf"(?<value>{PATTERN.FLOAT})")
parser.add_var("count", rf"n=(?<n>{PATTERN.INT})")
# File patterns
parser.add_var("file", rf"(?<path>{PATTERN.LINUX_FILE_PATH})")
# Java patterns
parser.add_var("class", rf"(?<class>{PATTERN.JAVA_FULLY_QUALIFIED_CLASS_NAME})")
parser.compile()
See Also¶
Parser.add_var : Method for adding patterns to a parser.