
API Reference

For key concepts like delimiter-based matching, capture groups, and pattern syntax, see Key Concepts.

Parser

Parser(delimiters: str = ' \\t\\r\\n:,!;%@/()[]', backend: str | None = None)

High-level parser for extracting structured data from unstructured log messages.

The Parser uses a schema-based approach to identify patterns, extract variables, and generate log types from raw log text. It compiles patterns into a DFA (Deterministic Finite Automaton) for efficient single-pass matching.

Key Features
  • Named capture groups: Use (?<name>pattern) to extract specific values
  • Priority-based matching: Control which patterns are tried first
  • Log type generation: Automatically creates templates from matched patterns
  • Streaming parsing: Efficiently process large log files
  • Multiple input sources: Parse from strings, files, or streams
Delimiter-Based Matching

Unlike standard regex, log-surgeon uses delimiter-based matching where . matches any character except delimiters (spaces, tabs, colons, etc.). This is important for pattern design:

# With default delimiters, "." stops at spaces
parser.add_var("token", r"(?<match>d.*)")
event = parser.parse_event("abc def ghi")
print(event['match'])  # "def" (NOT "def ghi")

# To match across spaces, use explicit character classes
parser.add_var("multi", r"(?<match>d[a-z ]*i)")  # Includes space
Workflow
  1. Create a Parser instance with optional custom delimiters
  2. Add variable patterns using add_var()
  3. Call compile() to build the DFA
  4. Parse logs using parse() or parse_event()
Example
from log_surgeon import Parser, PATTERN

parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/[^ ]+)")
parser.add_var("status", rf"status=(?<code>{PATTERN.INT})")
parser.compile()

event = parser.parse_event("GET /api/users status=200")
print(event["method"])  # "GET"
print(event["path"])    # "/api/users"
print(event["code"])    # "200"
print(event.get_log_type())  # "<method> <path> status=<code>"
See Also

JsonParser : For parsing JSON-formatted logs.
Query : For exporting parsed events to DataFrames.
PATTERN : Pre-built regex patterns for common log elements.

Initialize the parser with optional custom delimiters.

Delimiters define token boundaries for log-surgeon's matching engine. The . metacharacter in patterns will NOT match delimiter characters, which affects how patterns are written.

Parameters:

Name Type Description Default
delimiters str

String of delimiter characters for tokenization. Default: " \t\r\n:,!;%@/()[]" (space, tab, newline, and common punctuation).

Common customizations:

  • Remove : to match timestamps like "10:30:00" as single tokens
  • Remove / to match file paths as single tokens
  • Add = to treat key=value pairs as separate tokens
' \\t\\r\\n:,!;%@/()[]'
backend str | None

Backend engine to use for parsing. Either "cpp" (uses the C++ log-surgeon library) or "rust" (uses the Rust log-mechanic library via cffi). If not specified, reads from the LOG_SURGEON_BACKEND environment variable, defaulting to "cpp".

None
Note

Delimiter choice significantly affects pattern matching. For example, with default delimiters, a pattern like (?<ip>.*) will stop at the first space. To match an IP address, use explicit character classes: (?<ip>[0-9.]+) or the pre-built PATTERN.IPV4.

Example
# Default delimiters
parser = Parser()

# Custom delimiters for file path matching
parser = Parser(delimiters=r" \t\r\n:,!;%@()[]")  # Removed "/"

# Minimal delimiters for maximum token length
parser = Parser(delimiters=r" \t\r\n")

# Use Rust backend
parser = Parser(backend="rust")

add_var(name: str, regex: str, priority: int = 0) -> Parser

Add a variable pattern to the parser's schema.

Patterns must include at least one named capture group using (?<name>...) syntax. The captured values are accessible on parsed LogEvent objects using dictionary-style access: event["name"].

Parameters:

Name Type Description Default
name str

Unique identifier for this variable pattern. Used for schema organization but not directly accessible on LogEvent (use capture group names instead).

required
regex str

Regular expression pattern with named capture groups. Use (?<name>pattern) syntax to define extractable fields. The . metacharacter matches any character except delimiters.

required
priority int

Controls pattern matching order (higher = tried first). Default is 0.

  • Use positive values for specific patterns (e.g., IP addresses)
  • Use negative values for generic patterns (e.g., catch-all integers)
  • Variables with equal priority maintain insertion order
0

Returns:

Type Description
Parser

Self for method chaining.

Raises:

Type Description
ValueError

If the pattern has no named capture groups, or if capture group names contain delimiter characters.

AttributeError

If a variable with the same name already exists.

Note

Priority Example: If you have patterns for IP addresses and integers, give IP higher priority so "192.168.1.1" matches as an IP, not four integers.

Example
from log_surgeon import Parser, PATTERN

parser = Parser()

# High priority for specific patterns
parser.add_var("ip_address", rf"(?<ip>{PATTERN.IPV4})", priority=10)

# Default priority for normal patterns
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")

# Low priority for generic catch-all patterns
parser.add_var("number", rf"(?<num>{PATTERN.INT})", priority=-1)

parser.compile()

add_timestamp(name: str, regex: str) -> Parser

Add a timestamp pattern to the parser's schema.

Timestamps are special patterns that log-surgeon uses for log event boundary detection. When a timestamp pattern matches at the start of a line, it signals the beginning of a new log event.

Parameters:

Name Type Description Default
name str

Unique identifier for this timestamp pattern. Multiple timestamp patterns can be added for logs with varying timestamp formats.

required
regex str

Regular expression pattern for matching timestamps. Should match the complete timestamp format used in your logs.

required

Returns:

Type Description
Parser

Self for method chaining.

Note

Timestamp patterns help log-surgeon correctly handle multi-line log events (e.g., stack traces). Without timestamps, each line is treated as a separate event.

Example
parser = Parser()

# ISO 8601 format: 2024-01-15T10:30:00
parser.add_timestamp("iso8601", r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")

# Common log format: 15/Jan/2024:10:30:00
parser.add_timestamp("clf", r"\d{2}/[a-zA-Z]{3}/\d{4}:\d{2}:\d{2}:\d{2}")

# Unix timestamp: 1705312200
parser.add_timestamp("unix", r"\d{10}")

parser.compile()

compile(enable_debug_logs: bool = False) -> None

Compile the schema and initialize the parser for use.

This method builds a DFA (Deterministic Finite Automaton) from the configured patterns and prepares the parser for log processing. Must be called after adding all variables and timestamps, and before any parsing operations.

Parameters:

Name Type Description Default
enable_debug_logs bool

If True, output debug information to stderr during compilation and parsing. Useful for troubleshooting pattern issues. Default is False.

False

Raises:

Type Description
RuntimeError

If schema compilation fails due to invalid patterns or conflicting configurations.

Warning

After calling compile(), the parser's schema is fixed. Adding new variables or timestamps will not affect the compiled parser. Create a new Parser instance if you need different patterns.

Example
parser = Parser()
parser.add_var("metric", r"value=(?<value>\d+)")
parser.add_var("status", r"status=(?<status>[a-zA-Z0-9_]+)")

# Compile once all patterns are defined
parser.compile()

# Now ready for parsing
for event in parser.parse(log_file):
    print(event["value"])

parse(source: str | TextIO | BinaryIO | io.StringIO | io.BytesIO) -> Generator[LogEvent, None, None]

Parse log events from an input source.

Generator that yields LogEvent objects for each parsed event. Supports multiple input types for flexibility in how log data is provided.

Parameters:

Name Type Description Default
source str | TextIO | BinaryIO | StringIO | BytesIO

Input data to parse. Accepts:

  • str: String containing log data (one or more lines)
  • TextIO: File opened in text mode (open("file.log", "r"))
  • BinaryIO: File opened in binary mode (open("file.log", "rb"))
  • io.StringIO: In-memory text stream
  • io.BytesIO: In-memory binary stream
required

Yields:

Name Type Description
LogEvent LogEvent

Parsed event with extracted variables accessible via dictionary-style access (e.g., event["field_name"]).

Raises:

Type Description
RuntimeError

If compile() has not been called.

TypeError

If source type is not supported.

Note

For file objects, the entire content is read into memory before parsing. For very large files, consider reading and parsing in chunks.
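For the chunked approach, one option is a small stdlib-only helper that feeds the parser a bounded number of lines at a time. This is a sketch, not part of the log-surgeon API; `parse_in_chunks` is a hypothetical name, and it assumes each event fits on one line (a multi-line event such as a stack trace could be split at a chunk boundary).

```python
from itertools import islice

def parse_in_chunks(parser, path, lines_per_chunk=10_000):
    """Yield events from `path` without loading the whole file.

    Hypothetical helper: batches lines and hands each batch to
    parser.parse(). Assumes one event per line; a multi-line event
    could be split at a chunk boundary.
    """
    with open(path) as f:
        while True:
            chunk = list(islice(f, lines_per_chunk))
            if not chunk:
                break
            yield from parser.parse("".join(chunk))
```

Tune lines_per_chunk to trade memory for per-call overhead; very small chunks approach the per-call cost of parse_event().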

Example
parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
parser.compile()

# Parse from multi-line string
logs = '''GET /api/users
POST /api/login
GET /api/status'''

for event in parser.parse(logs):
    print(f"{event['method']} {event['path']}")

# Parse from file
with open("access.log") as f:
    for event in parser.parse(f):
        print(event['path'])

# Parse from BytesIO (e.g., from network response)
import io
data = io.BytesIO(b"GET /health\nGET /ready")
for event in parser.parse(data):
    print(event['path'])

parse_event(payload: str) -> LogEvent | None

Parse a single log event from a string.

Convenience method for parsing a single log message. For multiple events or streaming parsing, use parse() instead.

Parameters:

Name Type Description Default
payload str

Log message string to parse. Can be a single line or multi-line string (e.g., containing a stack trace).

required

Returns:

Type Description
LogEvent | None

LogEvent containing extracted variables and metadata, or None if no patterns matched.

Raises:

Type Description
RuntimeError

If compile() has not been called.

Note

This method creates a new stream for each call, which has overhead. For batch processing, use parse() with all log data at once.

Example
parser = Parser()
parser.add_var("metric", r"value=(?<value>\d+)")
parser.compile()

event = parser.parse_event("Processing value=42")
if event:
    print(event["value"])        # "42"
    print(event.get_log_type())  # "Processing value=<value>"

LogEvent

LogEvent()

Represents a parsed log event with extracted variables and metadata.

LogEvent is the result of parsing a log message with Parser. It contains:

  • The original log message
  • A log type (template with placeholders for matched variables)
  • Extracted variables accessible via dictionary-style indexing
Accessing Variables

Variables are accessed using dictionary-style indexing with the capture group name defined in your patterns:

event["field_name"]  # Returns the captured value

For patterns that match multiple times, the value is a list. For single matches, the value is unwrapped to a scalar.
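The single-vs-list convention can be mirrored in downstream code that needs uniform handling. A stdlib-only sketch of the same rule (illustrative only, not the library's implementation):

```python
def unwrap(values):
    """Collapse a single-element list to its scalar, as LogEvent access does."""
    if isinstance(values, list) and len(values) == 1:
        return values[0]
    return values

unwrap(["timeout"])                 # "timeout"
unwrap(["timeout", "disconnect"])   # ["timeout", "disconnect"] (unchanged)
```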

Log Types

Log types are template strings where matched portions are replaced with placeholder names (e.g., "user=<user_id>"). They are useful for:

  • Log clustering and deduplication
  • Pattern frequency analysis
  • Anomaly detection (new log types indicate new behavior)
Attributes

Note: These are internal attributes. Use the public methods to access data.

str

The original log message text.

_var_dict : dict

Internal dictionary mapping capture group names to values.

Example
parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+) (?<code>\d+)")
parser.compile()

event = parser.parse_event("GET /api/users 200")

# Access extracted variables
print(event["method"])  # "GET"
print(event["path"])    # "/api/users"
print(event["code"])    # "200"

# Access metadata
print(event.get_log_message())  # "GET /api/users 200"
print(event.get_log_type())     # "<method> <path> <code>"

# Get all variables as a dictionary
print(event.get_resolved_dict())
# {"method": "GET", "path": "/api/users", "code": "200"}
See Also

Parser.parse : Parse multiple events from a source.
Parser.parse_event : Parse a single event from a string.

Initialize an empty LogEvent.

get_log_message() -> str

Get the original log message text.

Returns the unmodified log message as it was parsed, including any whitespace, newlines, or special characters.

Returns:

Type Description
str

The complete original log message string.

Example
event = parser.parse_event("2024-01-01 INFO Processing complete")
print(event.get_log_message())
# "2024-01-01 INFO Processing complete"

get_log_type() -> str

Get the log type (template) for this event.

The log type is the original message with matched variables replaced by placeholder names in angle brackets (e.g., <variable_name>). This creates a template that represents the message's structure.

Returns:

Type Description
str

Template string with placeholders for extracted variables.

Raises:

Type Description
TypeError

If log type is not available (internal error).

Note

Log types are useful for:

  • Clustering: Group similar log messages together
  • Frequency analysis: Count occurrences of each pattern
  • Anomaly detection: New log types may indicate unusual behavior
Example
parser = Parser()
parser.add_var("req", r"(?<method>GET|POST) (?<path>/\S+) (?<code>\d+)")
parser.compile()

# Same pattern, different values
e1 = parser.parse_event("GET /users 200")
e2 = parser.parse_event("POST /login 401")

print(e1.get_log_type())  # "<method> <path> <code>"
print(e2.get_log_type())  # "<method> <path> <code>"

# Both have the same log type despite different values
assert e1.get_log_type() == e2.get_log_type()

get_capture_group(name: str, raw_output: bool = False) -> str | list[str | int | float] | None

Get the value of a capture group by name.

Retrieves the extracted value(s) for a named capture group. By default, single values are unwrapped from their list container for convenience.

Parameters:

Name Type Description Default
name str

Name of the capture group to retrieve. Special names:

  • "@log_type": Returns the log type template
  • "@log_message": Returns the original log message
required
raw_output bool

If True, always return values as a list, even for single matches. If False (default), single-element lists are unwrapped to scalar values.

False

Returns:

Type Description
str | list[str | int | float] | None

The captured value(s):

  • None if the capture group wasn't matched
  • str for single values (when raw_output=False)
  • list for multiple values or when raw_output=True
Note

Use raw_output=True when you need consistent list handling, such as when iterating over potentially multi-value captures.

Example
# Pattern that matches multiple times
parser.add_var("errors", r"error: (?<error>[a-zA-Z0-9_]+)")
event = parser.parse_event("error: timeout error: disconnect")

# Default: single values unwrapped, multiple values as list
event.get_capture_group("error")  # ["timeout", "disconnect"]

# With raw_output: always a list
event.get_capture_group("error", raw_output=True)  # ["timeout", "disconnect"]

# Single match example
event2 = parser.parse_event("error: timeout")
event2.get_capture_group("error")  # "timeout" (unwrapped)
event2.get_capture_group("error", raw_output=True)  # ["timeout"]

# Special names
event.get_capture_group("@log_type")    # "error: <error> error: <error>"
event.get_capture_group("@log_message") # "error: timeout error: disconnect"

get_resolved_dict() -> dict[str, str | list[str | int | float]]

Get all extracted variables as a dictionary.

Returns a clean dictionary of all capture groups with their values. Single-element lists are unwrapped to scalar values for convenience. Internal fields like "@LogType" are excluded.

Returns:

Type Description
dict[str, str | list[str | int | float]]

Dictionary mapping capture group names to their extracted values.

Processing applied:

  • @LogType is excluded (use get_log_type() instead)
  • Timestamp variants are consolidated under the "timestamp" key
  • Single-value lists are unwrapped to scalar values
  • Multi-value captures remain as lists
Note

This method is useful for:

  • Converting log events to JSON or other formats
  • Passing extracted data to downstream processing
  • Debugging to see all extracted values at once
Example
parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
parser.add_var("status", r"status=(?<code>\d+)")
parser.compile()

event = parser.parse_event("GET /api/users status=200")
result = event.get_resolved_dict()

print(result)
# {
#     "method": "GET",
#     "path": "/api/users",
#     "code": "200"
# }

# Can be easily serialized
import json
print(json.dumps(result))

__getitem__(name: str) -> str | list[str | int | float]

Access a capture group value using dictionary-style indexing.

This is the primary way to access extracted variables from a LogEvent. Single values are automatically unwrapped from lists for convenience.

Parameters:

Name Type Description Default
name str

Name of the capture group as defined in the pattern's (?<name>...) syntax.

required

Returns:

Type Description
str | list[str | int | float]

The captured value(s). Returns a scalar for single matches, or a list for multiple matches of the same capture group.

Raises:

Type Description
KeyError

If the capture group name does not exist or was not matched.

Example
parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
parser.compile()

event = parser.parse_event("GET /api/users")

# Access like a dictionary
print(event["method"])  # "GET"
print(event["path"])    # "/api/users"

# KeyError for missing fields
try:
    event["nonexistent"]
except KeyError as e:
    print(e)  # "Capture group 'nonexistent' not found"

Query

Query(parser: Parser | JsonParser)

Query builder for parsing log events into structured data formats.

Query provides a fluent interface for extracting, filtering, and exporting log data to pandas DataFrames or PyArrow Tables. It works with both Parser (for text logs) and JsonParser (for JSON logs).

Workflow
  1. Create a Query with a compiled Parser or JsonParser
  2. Select fields to extract using select()
  3. Optionally filter events using filter()
  4. Set the input source using from_()
  5. Export using to_dataframe() or to_arrow()
Key Features
  • Field selection: Choose specific fields or use "*" for all
  • Filtering: Apply lambda predicates to select events
  • Multiple exports: DataFrame, Arrow Table, or raw rows
  • Log type analysis: Get unique log types and their counts
Special Fields

In addition to capture group names, you can select:

  • "@log_type": The log type template
  • "@log_message": The original log message
  • "*": All capture groups (Parser only)
Example
from log_surgeon import Parser, Query, PATTERN

parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
parser.add_var("status", rf"(?<code>{PATTERN.INT})")
parser.compile()

# Basic query
df = (
    Query(parser)
    .select(["method", "path", "code"])
    .from_(log_file)
    .to_dataframe()
)

# With filtering
errors_df = (
    Query(parser)
    .select(["@log_message", "code"])
    .filter(lambda e: int(e["code"]) >= 400)
    .from_(log_file)
    .to_dataframe()
)

# Log type analysis
query = Query(parser).from_(log_file)
for log_type, count in query.get_log_type_counts().items():
    print(f"{count:5d} {log_type}")
JsonParser Example
json_parser = JsonParser(parser).target_fields(["message"])

df = (
    Query(json_parser)
    .select(["extracted.user_id", "extracted.action"])
    .from_(ndjson_file)
    .to_dataframe()
)
See Also

Parser : For creating text log parsers.
JsonParser : For creating JSON log parsers.

Initialize a query builder with a parser.

Creates a new Query instance that will use the given parser to process log data. The parser must be compiled before use.

Parameters:

Name Type Description Default
parser Parser | JsonParser

A compiled Parser or JsonParser instance. The parser's patterns determine what fields can be selected and filtered.

required
Example
# With Parser (text logs)
parser = Parser()
parser.add_var("metric", r"value=(?<value>\d+)")
parser.compile()
query = Query(parser)

# With JsonParser (JSON logs)
json_parser = JsonParser(parser).target_fields(["message"])
query = Query(json_parser)

select(fields: list[str]) -> Query

Select fields to include in the output.

Specifies which extracted variables and metadata to include when exporting to DataFrame or Arrow Table. Fields appear as columns in the order specified.

Parameters:

Name Type Description Default
fields list[str]

List of field names to extract. Supports:

Capture groups (from your patterns): ["user_id", "path", "status_code"]

Wildcard (Parser only): ["*"] - Selects all capture groups

Metadata fields: - "@log_type" - The log type template - "@log_message" - The original log message

JsonParser fields: Dot-notation for nested access: ["extracted.user_id"]

required

Returns:

Type Description
Query

Self for method chaining.

Note

The "*" wildcard only works with Parser, not JsonParser. For JsonParser, explicitly list the fields you want from the enriched JSON structure.

Example
# Select specific capture groups
query.select(["method", "path", "code"])

# Select all capture groups (Parser only)
query.select(["*"])

# Include metadata with capture groups
query.select(["@log_type", "@log_message", "method", "path"])

# Combine wildcard with metadata
query.select(["@log_type", "*"])

# JsonParser: access nested extracted fields
query.select(["extracted.user_id", "extracted.action", "message"])

filter(predicate: Callable[[LogEvent], bool] | Callable[[dict[str, Any]], bool]) -> Query

Filter log events using a predicate function.

Applies a filter to include only events where the predicate returns True. Events where the predicate returns False are excluded from the output.

Parameters:

Name Type Description Default
predicate Callable[[LogEvent], bool] | Callable[[dict[str, Any]], bool]

Function that receives an event and returns a boolean.

  • For Parser: receives a LogEvent object
  • For JsonParser: receives a dict (the enriched JSON)

Return True to include the event, False to exclude it.

required

Returns:

Type Description
Query

Self for method chaining.

Warning

Only one filter can be active. Calling filter() multiple times replaces the previous predicate. Combine conditions in a single predicate using and/or operators.

Note

Handle missing fields gracefully with try/except or .get() to avoid errors when a capture group does not match in some events.

Example
# Simple value filter
query.filter(lambda e: int(e["status_code"]) >= 400)

# Multiple conditions
query.filter(lambda e: e["method"] == "POST" and int(e["code"]) != 200)

# Safe filter for optional fields
def is_error(event):
    try:
        return int(event["code"]) >= 500
    except (KeyError, ValueError):
        return False
query.filter(is_error)

# JsonParser filter (receives dict)
query.filter(lambda obj: obj.get("level") == "ERROR")

# Access nested JsonParser fields
query.filter(lambda obj: obj.get("extracted", {}).get("user_id") == "admin")

from_(source: str | TextIO | BinaryIO | io.StringIO | io.BytesIO) -> Query

Set the input source containing log data.

Specifies where to read log data from. Must be called before any export method (to_dataframe(), to_arrow(), etc.).

Parameters:

Name Type Description Default
source str | TextIO | BinaryIO | StringIO | BytesIO

Input data to parse. Accepts:

  • str: String containing log data
  • TextIO: File opened in text mode
  • BinaryIO: File opened in binary mode
  • io.StringIO: In-memory text stream
  • io.BytesIO: In-memory binary stream
required

Returns:

Type Description
Query

Self for method chaining.

Raises:

Type Description
TypeError

If source type is not supported.

Note

For file objects, the entire content is read into memory. For very large files, consider processing in chunks.

Example
query = Query(parser).select(["method", "path"])

# From string
df = query.from_("GET /api\nPOST /login").to_dataframe()

# From file
with open("access.log") as f:
    df = query.from_(f).to_dataframe()

# From BytesIO (e.g., from HTTP response)
import io
data = io.BytesIO(response.content)
df = query.from_(data).to_dataframe()

validate_query() -> Query

Validate that the query is properly configured.

Returns:

Type Description
Query

Self for method chaining.

Raises:

Type Description
AttributeError

If select() or from_() was not called.
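Since validate_query() raises before any data is read, it can serve as a fail-fast guard ahead of a long export. A hedged sketch (`safe_export` is a hypothetical wrapper, not part of the API):

```python
def safe_export(query):
    """Validate the query's configuration before the potentially slow parse.

    Hypothetical wrapper: relies only on validate_query() raising
    AttributeError when select() or from_() was skipped.
    """
    try:
        query.validate_query()
    except AttributeError as err:
        raise ValueError(f"query misconfigured: {err}") from err
    return query.to_dataframe()
```

This surfaces configuration mistakes with a clear message up front instead of failing partway through an export.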

to_dataframe() -> pd.DataFrame

Export parsed events to a pandas DataFrame.

Parses all events from the configured source, applies any filter, and returns a DataFrame with the selected fields as columns.

Returns:

Type Description
DataFrame

pandas DataFrame with one row per event and one column per selected field. Column order matches the order in select().

Raises:

Type Description
ImportError

If pandas is not installed. Install with: pip install 'log-surgeon-ffi[dataframe]' or pip install pandas

AttributeError

If select() or from_() was not called.

Example
df = (
    Query(parser)
    .select(["method", "path", "code"])
    .from_(log_file)
    .to_dataframe()
)

print(df.head())
#   method       path code
# 0    GET     /users  200
# 1   POST     /login  401
# 2    GET  /products  200

# Use pandas for analysis
print(df["code"].value_counts())
# 200    150
# 404     23
# 500      5

to_arrow() -> pa.Table

Export parsed events to a PyArrow Table.

Parses all events from the configured source, applies any filter, and returns a PyArrow Table with the selected fields as columns. Arrow Tables are memory-efficient and ideal for large datasets.

Returns:

Type Description
Table

PyArrow Table with one row per event and one column per selected field. Column order matches the order in select().

Raises:

Type Description
ImportError

If pyarrow is not installed. Install with: pip install 'log-surgeon-ffi[arrow]' or pip install pyarrow

AttributeError

If select() or from_() was not called.

Note

Arrow Tables use columnar storage, which is more memory-efficient than row-based formats for large datasets. They also integrate well with Parquet files and other data processing tools.

Example
table = (
    Query(parser)
    .select(["method", "path", "code"])
    .from_(log_file)
    .to_arrow()
)

# Write to Parquet file
import pyarrow.parquet as pq
pq.write_table(table, "logs.parquet")

# Convert to pandas if needed
df = table.to_pandas()

get_rows() -> list[list[str]]

Extract raw rows of field values from parsed events.

Lower-level method that returns parsed data as a list of lists. Each inner list represents one event with values in the order specified by select().

Returns:

Type Description
list[list[str]]

List of rows, where each row is a list of string values. Row order matches event order; column order matches select() order.

Note

This method is useful when you need raw data without pandas/pyarrow dependencies. For most use cases, prefer to_dataframe() or to_arrow().

Example
query = Query(parser).select(["method", "code"]).from_(log_data)
rows = query.get_rows()

for row in rows:
    method, code = row
    print(f"{method} -> {code}")

get_log_types() -> Generator[str, None, None]

Get unique log types from parsed events.

Yields each distinct log type template exactly once, in the order first encountered. Useful for discovering the different message patterns in your logs.

Yields:

Type Description
str

Unique log type strings (templates with variable placeholders).

Note
  • If a filter is set, only matching events contribute log types
  • For JsonParser, requires include_log_type(True) to be set
  • Log types are yielded in first-seen order, not sorted
Example
query = Query(parser).from_(log_data)

# Discover all message patterns
print("Log patterns found:")
for log_type in query.get_log_types():
    print(f"  {log_type}")

# Output:
#   <method> <path> <code>
#   Connection from <ip>:<port>
#   Error: <error_message>

get_log_type_counts() -> dict[str, int]

Count occurrences of each log type.

Counts how many times each distinct log type pattern appears in the log data. Useful for understanding log composition and identifying frequent vs. rare patterns.

Returns:

Type Description
dict[str, int]

Dictionary mapping log type templates to their occurrence counts. Not sorted; use sorted() if ordering is needed.

Note
  • If a filter is set, only matching events are counted
  • For JsonParser, requires include_log_type(True) to be set
Example
query = Query(parser).from_(log_data)
counts = query.get_log_type_counts()

# Print sorted by frequency (most common first)
for log_type, count in sorted(counts.items(), key=lambda x: -x[1]):
    print(f"{count:6d}  {log_type}")

# Output:
#  15432  <method> <path> <code>
#   2341  Connection from <ip>:<port>
#     17  Error: <error_message>

# Find rare patterns (potential anomalies)
rare = [lt for lt, c in counts.items() if c < 10]
print(f"Rare patterns: {len(rare)}")

JsonParser

JsonParser(parser: Parser)

Parser for JSON-formatted logs that extracts variables from string fields.

JsonParser wraps an existing Parser and applies its extraction rules to JSON string fields, then merges the extracted variables back into the original JSON. This enables structured extraction from JSON logs while preserving the original JSON structure.

Key Features
  • Flexible field targeting: Extract from all strings or specific fields
  • Nested field support: Access nested JSON fields using dot-notation
  • Multiple formats: Parse NDJSON or JSON array formats with auto-detection
  • Conflict resolution: Configure how extracted keys merge with existing JSON
  • Streaming support: Efficiently process large NDJSON files line by line
Default Behavior

By default, JsonParser extracts from all string fields in the JSON object, recursively traversing nested objects and arrays. Use target_fields() to limit extraction to specific fields for better performance and precision.

Workflow
  1. Create a Parser with extraction patterns and compile it
  2. Create a JsonParser wrapping the Parser
  3. Optionally configure target fields and conflict strategy
  4. Parse JSON logs using parse() or parse_one()
Note

Extraction only works on string-type fields. Non-string fields (numbers, booleans, nested objects, arrays) are skipped unless they contain strings.

Example
from log_surgeon import JsonParser, Parser, ConflictStrategy

# Step 1: Create and configure underlying parser
parser = Parser()
parser.add_var("user_info", r"user=(?<user_id>\d+)")
parser.add_var("action", r"action=(?<action>[a-zA-Z0-9_]+)")
parser.compile()

# Step 2: Create JSON parser with field targeting
json_parser = (
    JsonParser(parser)
    .target_fields(["message", "context.detail"])  # Only parse these fields
    .on_conflict(ConflictStrategy.NEST, key="extracted")
)

# Step 3: Parse JSON logs
input_json = '{"ts": "2024-01-01", "message": "user=123 action=login"}'
result = json_parser.parse_one(input_json)
print(result)
# {
#     'ts': '2024-01-01',
#     'message': 'user=123 action=login',
#     'extracted': {'user_id': '123', 'action': 'login'}
# }
See Also

Parser : For creating extraction patterns.
ConflictStrategy : For configuring conflict resolution.
Query : For exporting JsonParser results to DataFrames.

Initialize the JSON parser with an underlying Parser.

Creates a JsonParser that applies the given Parser's extraction patterns to JSON string fields. By default, extracts from all string fields.

Parameters:

Name Type Description Default
parser Parser

A compiled Parser instance with extraction patterns defined. Must have compile() called before use. The parser's patterns will be applied to targeted JSON string fields.

required
Note

Default configuration:

  • Extracts from all string fields (equivalent to target_fields("*"))
  • Uses NEST conflict strategy with key "extracted"
  • Does not include log type in output

Use the fluent API methods to customize behavior:

  • target_fields(): Limit which JSON fields are parsed
  • on_conflict(): Configure conflict resolution strategy
  • include_log_type(): Include log type templates in output
Example
# Create and configure the underlying parser
parser = Parser()
parser.add_var("metric", r"value=(?<value>\d+)")
parser.compile()

# Create JSON parser with default settings
json_parser = JsonParser(parser)

# Or customize with fluent API
json_parser = (
    JsonParser(parser)
    .target_fields(["message"])
    .on_conflict(ConflictStrategy.NEST, key="data")
    .include_log_type(True)
)

target_fields(fields: list[str] | str) -> JsonParser

Configure which JSON fields to target for variable extraction.

By default, JsonParser extracts from all string fields in the JSON object. Use this method to limit extraction to specific fields for better performance and to avoid unintended matches in other fields.

Parameters:

Name Type Description Default
fields list[str] | str

Field specification. Accepts:

  • str: Single field name (e.g., "message")
  • list[str]: Multiple field names (e.g., ["message", "context.detail"])
  • "*" or ["*"]: Explicitly target all string fields

Dot-notation is supported for nested fields (e.g., "context.message" accesses {"context": {"message": "..."}}).

required

Returns:

Type Description
JsonParser

Self for method chaining.

Raises:

Type Description
ValueError

If fields is an empty list.

Note

Targeting specific fields is recommended for production use:

  • Performance: Avoids parsing irrelevant fields
  • Precision: Prevents false matches in metadata fields
  • Clarity: Makes extraction intent explicit
Example
json_parser = JsonParser(parser)

# Target single field
json_parser.target_fields("message")

# Target multiple fields (including nested)
json_parser.target_fields(["message", "error.details", "context.info"])

# Reset to all string fields
json_parser.target_fields("*")
Nested Field Example
json_data = '{"context": {"message": "user=123"}, "other": "ignored"}'

json_parser = JsonParser(parser).target_fields(["context.message"])
result = json_parser.parse_one(json_data)
# Only "context.message" is parsed; "other" is ignored
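Dot-notation lookup itself is simple to reason about; a minimal sketch (hypothetical helper, not the library's code) walks nested dicts one key at a time:

```python
# Sketch of dot-notation resolution as used by target_fields():
# "context.message" walks nested dicts one key per segment.
def resolve_dot_path(obj, path):
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None  # path does not exist
        obj = obj[key]
    return obj

doc = {"context": {"message": "user=123"}, "other": "ignored"}
print(resolve_dot_path(doc, "context.message"))  # "user=123"
print(resolve_dot_path(doc, "context.missing"))  # None
```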

on_conflict(strategy: ConflictStrategy, prefix: str = 'extracted.', key: str = 'extracted') -> JsonParser

Configure how to handle key conflicts between extracted and existing JSON keys.

When an extracted variable name matches an existing key in the JSON object, this setting determines how the conflict is resolved.

Parameters:

Name Type Description Default
strategy ConflictStrategy

The conflict resolution strategy to use. See ConflictStrategy for available options.

required
prefix str

Prefix for extracted keys when using ConflictStrategy.PREFIX. Default: "extracted.". Only used with PREFIX strategy.

'extracted.'
key str

Nesting key when using ConflictStrategy.NEST. Default: "extracted". Only used with NEST strategy.

'extracted'

Returns:

Type Description
JsonParser

Self for method chaining.

Example
from log_surgeon import JsonParser, ConflictStrategy

# NEST (default): All extracted values under a nested key
json_parser.on_conflict(ConflictStrategy.NEST, key="parsed")
# Input:  {"message": "user=123"}
# Output: {"message": "user=123", "parsed": {"user_id": "123"}}

# PREFIX: Add prefix to each extracted key
json_parser.on_conflict(ConflictStrategy.PREFIX, prefix="log_")
# Input:  {"message": "user=123"}
# Output: {"message": "user=123", "log_user_id": "123"}

# OVERWRITE: Replace existing keys (use with caution)
json_parser.on_conflict(ConflictStrategy.OVERWRITE)
# Input:  {"user_id": "old", "message": "user=123"}
# Output: {"user_id": "123", "message": "user=123"}  # Warning printed

# RAISE: Fail on conflict (for development/testing)
json_parser.on_conflict(ConflictStrategy.RAISE)
# Raises KeyError if extracted key exists in JSON
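The four behaviors above can be summarized with a plain-Python merge sketch (this is an illustration of the documented semantics, not the library's implementation, and it omits the stderr warnings):

```python
# Plain-Python sketch of the four conflict-resolution behaviors.
def merge(original, extracted, strategy, prefix="extracted.", key="extracted"):
    result = dict(original)
    if strategy == "NEST":
        result[key] = dict(extracted)       # all values under one nested key
    elif strategy == "PREFIX":
        for k, v in extracted.items():
            result[prefix + k] = v          # prefixed flat keys
    elif strategy == "OVERWRITE":
        result.update(extracted)            # extracted values win
    elif strategy == "RAISE":
        for k in extracted:
            if k in result:
                raise KeyError(f"extracted key {k!r} already exists")
        result.update(extracted)
    return result

row = {"message": "user=123"}
print(merge(row, {"user_id": "123"}, "NEST", key="parsed"))
# {'message': 'user=123', 'parsed': {'user_id': '123'}}
```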

include_log_type(include: bool = True) -> JsonParser

Configure whether to include log type templates in the output.

Log types are template strings where matched variables are replaced with placeholders (e.g., "user=123" becomes "user=<user_id>"). They are useful for log clustering, pattern analysis, and anomaly detection.

Parameters:

Name Type Description Default
include bool

If True, include "@log_type" in extracted variables. Default: True when this method is called.

True

Returns:

Type Description
JsonParser

Self for method chaining.

Note

When enabled, each parsed field contributes its log type. If multiple fields are parsed, log types are aggregated.

Example
json_parser = JsonParser(parser).include_log_type(True)

result = json_parser.parse_one('{"message": "user=123 action=login"}')
print(result["extracted"]["@log_type"])
# "user=<user_id> action=<action>"
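The template-generation idea can be illustrated in plain Python. This sketch uses Python's `(?P<name>...)` syntax (log-surgeon uses `(?<name>...)`) and is an analogue of the concept, not the library's implementation:

```python
import re

# Derive a log-type template: replace each named capture with <name>.
def to_log_type(pattern, text):
    def repl(m):
        out, last = [], m.start()
        # Walk named groups left to right, splicing in placeholders.
        for name in sorted(m.groupdict(), key=lambda n: m.start(n)):
            out.append(text[last:m.start(name)] + f"<{name}>")
            last = m.end(name)
        return "".join(out) + text[last:m.end()]
    return re.sub(pattern, repl, text)

print(to_log_type(r"user=(?P<user_id>\d+)", "login user=123 ok"))
# "login user=<user_id> ok"
```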

parse(source: str | TextIO | BinaryIO | io.StringIO | io.BytesIO) -> Generator[dict[str, Any], None, None]

Parse JSON logs from an input source.

Generator that yields enriched JSON dictionaries with extracted variables merged in. Supports both NDJSON (newline-delimited JSON) and JSON array formats, with automatic format detection.

Parameters:

Name Type Description Default
source str | TextIO | BinaryIO | StringIO | BytesIO

Input data to parse. Accepts:

  • str: String containing JSON data (NDJSON or JSON array)
  • TextIO: File opened in text mode
  • BinaryIO: File opened in binary mode
  • io.StringIO: In-memory text stream
  • io.BytesIO: In-memory binary stream
required

Yields:

Name Type Description
dict dict[str, Any]

Enriched JSON object with extracted variables merged according to the configured conflict strategy.

Raises:

Type Description
JSONDecodeError

If the input contains invalid JSON.

TypeError

If the source type is not supported.

Format Detection

The format is auto-detected by checking the first non-whitespace character:

  • Starts with [: Parsed as JSON array (entire content loaded)
  • Otherwise: Parsed as NDJSON (streamed line by line for file objects)
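The detection rule above amounts to a one-character check. A minimal sketch of the same logic in plain Python (an illustration, not the library's code):

```python
import json

# Auto-detect format: a leading "[" means a JSON array, otherwise NDJSON.
def parse_any(text):
    stripped = text.lstrip()
    if stripped.startswith("["):
        return json.loads(stripped)  # whole array loaded at once
    # NDJSON: one JSON object per non-empty line
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]

print(parse_any('[{"a": 1}, {"a": 2}]'))  # [{'a': 1}, {'a': 2}]
print(parse_any('{"a": 1}\n{"a": 2}'))    # [{'a': 1}, {'a': 2}]
```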
Note

For NDJSON files, parsing is streamed line-by-line to minimize memory usage. JSON arrays must be fully loaded into memory for parsing.

Example
# Parse NDJSON from string
ndjson = '''{"message": "user=123"}
{"message": "user=456"}'''

for result in json_parser.parse(ndjson):
    print(result["extracted"]["user_id"])
# 123
# 456

# Parse JSON array
json_array = '[{"message": "user=123"}, {"message": "user=456"}]'
for result in json_parser.parse(json_array):
    print(result["extracted"]["user_id"])

# Stream from file
with open("logs.ndjson") as f:
    for result in json_parser.parse(f):
        print(result)

parse_one(json_line: str) -> dict[str, Any]

Parse a single JSON object and return the enriched result.

Convenience method for parsing a single JSON log entry. For multiple entries, use parse() instead.

Parameters:

Name Type Description Default
json_line str

A single JSON object as a string. Must be a valid JSON object (starts with {), not an array or primitive.

required

Returns:

Type Description
dict[str, Any]

Enriched JSON dictionary with the original fields plus extracted variables merged according to the configured conflict strategy.

Raises:

Type Description
JSONDecodeError

If the input is not valid JSON.

Example
result = json_parser.parse_one('{"ts": "2024-01-01", "message": "user=123"}')

# Access original fields
print(result["ts"])  # "2024-01-01"
print(result["message"])  # "user=123"

# Access extracted fields (with default NEST strategy)
print(result["extracted"]["user_id"])  # "123"

# Full result structure
# {
#     "ts": "2024-01-01",
#     "message": "user=123",
#     "extracted": {"user_id": "123"}
# }

ConflictStrategy

ConflictStrategy

Bases: Enum

Strategy for handling conflicts when extracted keys match existing JSON keys.

When JsonParser extracts variables from JSON fields, the extracted key names might conflict with keys already present in the JSON object. This enum defines how such conflicts are resolved.

Attributes

NEST : enum member
Place all extracted variables under a nested key (default: "extracted"). This is the safest option as it completely avoids conflicts. Result: {"message": "...", "extracted": {"user_id": "123"}}

PREFIX : enum member
Add a prefix to extracted variable names (default: "extracted."). Warns to stderr if the prefixed key still conflicts. Result: {"message": "...", "extracted.user_id": "123"}

OVERWRITE : enum member
Replace existing keys with extracted values. Prints a warning to stderr when overwriting occurs. Use with caution as original data is lost. Result: {"user_id": "123"} (original user_id overwritten)

RAISE : enum member
Raise KeyError when a conflict is detected. Useful for development and testing to catch unexpected conflicts early.

Example
from log_surgeon import JsonParser, ConflictStrategy

# Default: nest under "extracted" key
json_parser = JsonParser(parser)

# Custom nest key
json_parser.on_conflict(ConflictStrategy.NEST, key="parsed")
# Result: {"message": "...", "parsed": {"user_id": "123"}}

# Use prefix instead
json_parser.on_conflict(ConflictStrategy.PREFIX, prefix="log_")
# Result: {"message": "...", "log_user_id": "123"}

# Fail on conflict (for testing)
json_parser.on_conflict(ConflictStrategy.RAISE)

SchemaCompiler

SchemaCompiler(delimiters: str = DEFAULT_DELIMITERS)

Compiler for constructing log-surgeon schema definitions.

SchemaCompiler provides a fluent interface for building schema definitions used by the log-surgeon parsing engine. It handles variable registration, pattern validation, and schema serialization.

Key Responsibilities
  • Register variable patterns with named capture groups
  • Track capture group names for validation
  • Manage variable priority for pattern ordering
  • Generate hidden variable names for internal use
  • Compile the final schema string
Schema Format

The compiled schema is a text format with sections for delimiters, timestamps, and variables:

delimiters: \t\r\n:,!;%@/()[]
timestamp:<pattern>
VariableName:<pattern>
Priority System

Variables are ordered in the schema by:

  1. Priority (descending): higher priority = appears first
  2. Insertion order (ascending): earlier added = appears first

This ordering affects which pattern is tried first when multiple patterns could match the same text.
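Because Python's sort is stable, this ordering can be sketched as a single sort over the insertion-ordered variable list (an illustration of the rule, not the library's internals):

```python
# (name, priority) pairs in insertion order.
vars_in_insertion_order = [
    ("request", 0),
    ("ip", 10),
    ("int", -1),
]

# Stable sort by descending priority; ties keep insertion order.
ordered = sorted(vars_in_insertion_order, key=lambda v: -v[1])
print([name for name, _ in ordered])  # ['ip', 'request', 'int']
```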

Note

Most users should use the Parser class, which provides a simpler interface and handles schema compilation automatically. Use SchemaCompiler directly only for advanced use cases.

Example
from log_surgeon.schema_compiler import SchemaCompiler

compiler = SchemaCompiler()

# Add patterns with priority
compiler.add_var("ip", r"(?<ip>[0-9.]+)", priority=10)
compiler.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
compiler.add_var("int", r"(?<num>\d+)", priority=-1)  # Low priority

# Add timestamp for multi-line event detection
compiler.add_timestamp("iso", r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")

# Compile to schema string
schema = compiler.compile()
See Also

Parser : High-level interface for log parsing.
Variable : Data class representing a variable definition.

Initialize a schema compiler.

Parameters:

Name Type Description Default
delimiters str

String of delimiter characters for tokenization. Default includes space, tab, newline, and common punctuation.

DEFAULT_DELIMITERS

add_var(name: str, regex: str, priority: int = 0) -> SchemaCompiler

Add a variable pattern to the schema.

Patterns must include at least one named capture group using (?<name>...) syntax. The capture group names become the keys for accessing extracted values from parsed events.

Parameters:

Name Type Description Default
name str

Unique identifier for this variable pattern.

required
regex str

Regular expression with named capture groups. Use (?<name>pattern) syntax for extraction.

required
priority int

Pattern ordering priority. Higher values are tried first during matching. Default is 0.

  • Use positive values for specific patterns (IP, UUID)
  • Use negative values for generic catch-alls (INT, FLOAT)
0

Returns:

Type Description
SchemaCompiler

Self for method chaining.

Raises:

Type Description
ValueError

If pattern has no capture groups, or if names contain delimiter characters.

AttributeError

If a variable with this name already exists.

Example
compiler = SchemaCompiler()
compiler.add_var("ip", r"(?<ip>[0-9.]+)", priority=10)
compiler.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
compiler.add_var("int", r"(?<num>\d+)", priority=-1)

add_timestamp(name: str, regex: str) -> SchemaCompiler

Add a timestamp pattern to the schema.

Timestamps help log-surgeon detect log event boundaries. When a timestamp pattern matches at the start of a line, it signals a new log event, enabling correct handling of multi-line events like stack traces.

Parameters:

Name Type Description Default
name str

Unique identifier for this timestamp pattern.

required
regex str

Regular expression for matching timestamp formats.

required

Returns:

Type Description
SchemaCompiler

Self for method chaining.

Example
compiler = SchemaCompiler()
compiler.add_timestamp("iso", r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
compiler.add_timestamp("unix", r"\d{10}")

remove_var(var_name: str) -> SchemaCompiler

Remove a variable from the schema.

Parameters:

Name Type Description Default
var_name str

Name of the variable to remove (or its original name if hidden)

required

Returns:

Type Description
SchemaCompiler

Self for method chaining

get_var(var_name: str) -> Variable

Get a variable by name.

Parameters:

Name Type Description Default
var_name str

Variable name

required

Returns:

Type Description
Variable

The Variable object

compile() -> str

Compile the schema to a string for the log-surgeon engine.

Generates the final schema definition that includes delimiters, timestamps, and variables ordered by priority. This string is passed to the log-surgeon C++ library for DFA compilation.

Returns:

Type Description
str

Schema definition string in log-surgeon format.

Note

Variables are ordered by:

  1. Priority (descending): higher priority patterns first
  2. Insertion order (ascending): earlier added patterns first

This ordering determines which pattern is tried first when multiple patterns could match the same text.

Example
compiler = SchemaCompiler()
compiler.add_var("ip", r"(?<ip>[0-9.]+)", priority=10)
compiler.add_var("number", r"(?<num>\d+)", priority=-1)
compiler.add_timestamp("ts", r"\d{4}-\d{2}-\d{2}")

schema = compiler.compile()
# Returns formatted schema string with sections for
# delimiters, timestamps, and variables

PATTERN

The PATTERN class provides pre-built regex patterns optimized for log parsing.

PATTERN

Collection of pre-built regex patterns for common log elements.

PATTERN provides ready-to-use regex patterns optimized for log-surgeon's delimiter-based matching. Use these patterns with add_var() to extract common log elements without writing complex regex manually.

Categories

Network Patterns UUID, IPV4, PORT

Numeric Patterns INT, FLOAT

File System Patterns LINUX_FILE_NAME, LINUX_FILE_PATH

Character Sets JAVA_IDENTIFIER, LOG_LINE, LOG_LINE_NO_WHITE_SPACE

Java Patterns JAVA_CLASS_NAME, JAVA_FULLY_QUALIFIED_CLASS_NAME, JAVA_STACK_LOCATION

Usage

Embed patterns in your regex using f-strings:

parser.add_var("ip", rf"(?<ip>{PATTERN.IPV4})")
parser.add_var("value", rf"val=(?<v>{PATTERN.INT})")
Note

These patterns use log-surgeon regex syntax, which differs slightly from Python regex. Notably, . matches any character except delimiters.

Example
from log_surgeon import Parser, PATTERN

parser = Parser()

# Network patterns
parser.add_var("connection", rf"(?<ip>{PATTERN.IPV4}):(?<port>{PATTERN.PORT})")
parser.add_var("request_id", rf"id=(?<uuid>{PATTERN.UUID})")

# Numeric patterns
parser.add_var("metric", rf"(?<value>{PATTERN.FLOAT})")
parser.add_var("count", rf"n=(?<n>{PATTERN.INT})")

# File patterns
parser.add_var("file", rf"(?<path>{PATTERN.LINUX_FILE_PATH})")

# Java patterns
parser.add_var("class", rf"(?<class>{PATTERN.JAVA_FULLY_QUALIFIED_CLASS_NAME})")

parser.compile()
See Also

Parser.add_var : Method for adding patterns to a parser.