API Reference¶
For key concepts like delimiter-based matching, capture groups, and pattern syntax, see Key Concepts.
Parser¶
Parser(delimiters: str = ' \\t\\r\\n:,!;%@/()[]', backend: str | None = None)
¶
High-level parser for extracting structured data from unstructured log messages.
The Parser uses a schema-based approach to identify patterns, extract variables, and generate log types from raw log text. It compiles patterns into a DFA (Deterministic Finite Automaton) for efficient single-pass matching.
Key Features¶
- Named capture groups: Use `(?<name>pattern)` to extract specific values
- Priority-based matching: Control which patterns are tried first
- Log type generation: Automatically creates templates from matched patterns
- Streaming parsing: Efficiently process large log files
- Multiple input sources: Parse from strings, files, or streams
Delimiter-Based Matching¶
Unlike standard regex, log-surgeon uses delimiter-based matching where `.` matches any character except delimiters (spaces, tabs, colons, etc.).
This is important for pattern design:
# With default delimiters, "." stops at spaces
parser = Parser()
parser.add_var("token", r"(?<match>d.*)")
parser.compile()
event = parser.parse_event("abc def ghi")
print(event['match'])  # "def" (NOT "def ghi")

# To match across spaces, use an explicit character class that includes the space
parser = Parser()
parser.add_var("multi", r"(?<match>d[a-z ]*i)")
parser.compile()
Workflow¶
1. Create a Parser instance with optional custom delimiters
2. Add variable patterns using `add_var()`
3. Call `compile()` to build the DFA
4. Parse logs using `parse()` or `parse_event()`
Example¶
from log_surgeon import Parser, PATTERN
parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/[^ ]+)")
parser.add_var("status", rf"status=(?<code>{PATTERN.INT})")
parser.compile()
event = parser.parse_event("GET /api/users status=200")
print(event["method"]) # "GET"
print(event["path"]) # "/api/users"
print(event["code"]) # "200"
print(event.get_log_type()) # "<method> <path> status=<code>"
See Also¶
- JsonParser : For parsing JSON-formatted logs.
- Query : For exporting parsed events to DataFrames.
- PATTERN : Pre-built regex patterns for common log elements.
Initialize the parser with optional custom delimiters.
Delimiters define token boundaries for log-surgeon's matching engine.
The . metacharacter in patterns will NOT match delimiter characters,
which affects how patterns are written.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `delimiters` | `str` | String of delimiter characters for tokenization. | `' \\t\\r\\n:,!;%@/()[]'` |
| `backend` | `str \| None` | Backend engine to use for parsing. Either `"cpp"` (uses the C++ log-surgeon library) or `"rust"` (uses the Rust log-mechanic library via cffi). If not specified, reads from the … | `None` |
Note¶
Delimiter choice significantly affects pattern matching. For example,
with default delimiters, a pattern like (?<ip>.*) will stop at the
first space. To match an IP address, use explicit character classes:
(?<ip>[0-9.]+) or the pre-built PATTERN.IPV4.
Example¶
# Default delimiters
parser = Parser()
# Custom delimiters for file path matching
parser = Parser(delimiters=r" \t\r\n:,!;%@()[]") # Removed "/"
# Minimal delimiters for maximum token length
parser = Parser(delimiters=r" \t\r\n")
# Use Rust backend
parser = Parser(backend="rust")
add_var(name: str, regex: str, priority: int = 0) -> Parser
¶
Add a variable pattern to the parser's schema.
Patterns must include at least one named capture group using (?<name>...)
syntax. The captured values are accessible on parsed LogEvent objects using
dictionary-style access: event["name"].
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Unique identifier for this variable pattern. Used for schema organization but not directly accessible on LogEvent (use capture group names instead). | required |
| `regex` | `str` | Regular expression pattern with named capture groups. Use `(?<name>...)` syntax for captures. | required |
| `priority` | `int` | Controls pattern matching order (higher = tried first). Default is 0. | `0` |
Returns:

| Type | Description |
|---|---|
| `Parser` | Self for method chaining. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If the pattern has no named capture groups, or if capture group names contain delimiter characters. |
| `AttributeError` | If a variable with the same name already exists. |
Note¶
Priority Example: If you have patterns for IP addresses and integers, give IP higher priority so "192.168.1.1" matches as an IP, not four integers.
Example¶
from log_surgeon import Parser, PATTERN
parser = Parser()
# High priority for specific patterns
parser.add_var("ip_address", rf"(?<ip>{PATTERN.IPV4})", priority=10)
# Default priority for normal patterns
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
# Low priority for generic catch-all patterns
parser.add_var("number", rf"(?<num>{PATTERN.INT})", priority=-1)
parser.compile()
add_timestamp(name: str, regex: str) -> Parser
¶
Add a timestamp pattern to the parser's schema.
Timestamps are special patterns that log-surgeon uses for log event boundary detection. When a timestamp pattern matches at the start of a line, it signals the beginning of a new log event.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Unique identifier for this timestamp pattern. Multiple timestamp patterns can be added for logs with varying timestamp formats. | required |
| `regex` | `str` | Regular expression pattern for matching timestamps. Should match the complete timestamp format used in your logs. | required |
Returns:

| Type | Description |
|---|---|
| `Parser` | Self for method chaining. |
Note¶
Timestamp patterns help log-surgeon correctly handle multi-line log events (e.g., stack traces). Without timestamps, each line is treated as a separate event.
Example¶
parser = Parser()
# ISO 8601 format: 2024-01-15T10:30:00
parser.add_timestamp("iso8601", r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
# Common log format: 15/Jan/2024:10:30:00
parser.add_timestamp("clf", r"\d{2}/[a-zA-Z]{3}/\d{4}:\d{2}:\d{2}:\d{2}")
# Unix timestamp: 1705312200
parser.add_timestamp("unix", r"\d{10}")
parser.compile()
compile(enable_debug_logs: bool = False) -> None
¶
Compile the schema and initialize the parser for use.
This method builds a DFA (Deterministic Finite Automaton) from the configured patterns and prepares the parser for log processing. Must be called after adding all variables and timestamps, and before any parsing operations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `enable_debug_logs` | `bool` | If True, output debug information to stderr during compilation and parsing. Useful for troubleshooting pattern issues. | `False` |
Raises:

| Type | Description |
|---|---|
| `RuntimeError` | If schema compilation fails due to invalid patterns or conflicting configurations. |
Warning¶
After calling compile(), the parser's schema is fixed. Adding new
variables or timestamps will not affect the compiled parser. Create
a new Parser instance if you need different patterns.
Example¶
parse(source: str | TextIO | BinaryIO | io.StringIO | io.BytesIO) -> Generator[LogEvent, None, None]
¶
Parse log events from an input source.
Generator that yields LogEvent objects for each parsed event. Supports multiple input types for flexibility in how log data is provided.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `str \| TextIO \| BinaryIO \| StringIO \| BytesIO` | Input data to parse. Accepts raw strings, text or binary file objects, and in-memory `StringIO`/`BytesIO` buffers. | required |
Yields:

| Name | Type | Description |
|---|---|---|
| LogEvent | `LogEvent` | Parsed event with extracted variables accessible via dictionary-style access (e.g., `event["name"]`). |
Raises:

| Type | Description |
|---|---|
| `RuntimeError` | If `compile()` has not been called. |
| `TypeError` | If source type is not supported. |
Note¶
For file objects, the entire content is read into memory before parsing. For very large files, consider reading and parsing in chunks.
Example¶
parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
parser.compile()
# Parse from multi-line string
logs = '''GET /api/users
POST /api/login
GET /api/status'''
for event in parser.parse(logs):
print(f"{event['method']} {event['path']}")
# Parse from file
with open("access.log") as f:
for event in parser.parse(f):
print(event['path'])
# Parse from BytesIO (e.g., from network response)
import io
data = io.BytesIO(b"GET /health\nGET /ready")
for event in parser.parse(data):
print(event['path'])
parse_event(payload: str) -> LogEvent | None
¶
Parse a single log event from a string.
Convenience method for parsing a single log message. For multiple
events or streaming parsing, use parse() instead.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `payload` | `str` | Log message string to parse. Can be a single line or multi-line string (e.g., containing a stack trace). | required |
Returns:

| Type | Description |
|---|---|
| `LogEvent \| None` | LogEvent containing extracted variables and metadata, or None if no patterns matched. |
Raises:

| Type | Description |
|---|---|
| `RuntimeError` | If `compile()` has not been called. |
Note¶
This method creates a new stream for each call, which has overhead.
For batch processing, use parse() with all log data at once.
Example¶
LogEvent¶
LogEvent()
¶
Represents a parsed log event with extracted variables and metadata.
LogEvent is the result of parsing a log message with Parser. It contains:
- The original log message
- A log type (template with placeholders for matched variables)
- Extracted variables accessible via dictionary-style indexing
Accessing Variables¶
Variables are accessed using dictionary-style indexing with the capture group name defined in your patterns (e.g., `event["method"]`). For patterns that match multiple times, the value is a list; for single matches, the value is unwrapped to a scalar.
Log Types¶
Log types are template strings where matched portions are replaced with placeholder names (e.g., "user=123" becomes "user=<user_id>"). They are useful for:
- Log clustering and deduplication
- Pattern frequency analysis
- Anomaly detection (new log types indicate new behavior)
Attributes¶
Note: These are internal attributes. Use the public methods to access data.

- `_var_dict` (dict): Internal dictionary mapping capture group names to values.
Example¶
parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+) (?<code>\d+)")
parser.compile()
event = parser.parse_event("GET /api/users 200")
# Access extracted variables
print(event["method"]) # "GET"
print(event["path"]) # "/api/users"
print(event["code"]) # "200"
# Access metadata
print(event.get_log_message()) # "GET /api/users 200"
print(event.get_log_type()) # "<method> <path> <code>"
# Get all variables as a dictionary
print(event.get_resolved_dict())
# {"method": "GET", "path": "/api/users", "code": "200"}
See Also¶
- Parser.parse : Parse multiple events from a source.
- Parser.parse_event : Parse a single event from a string.
Initialize an empty LogEvent.
get_log_message() -> str
¶
Get the original log message text.
Returns the unmodified log message as it was parsed, including any whitespace, newlines, or special characters.
Returns:

| Type | Description |
|---|---|
| `str` | The complete original log message string. |
Example¶
get_log_type() -> str
¶
Get the log type (template) for this event.
The log type is the original message with matched variables replaced
by placeholder names in angle brackets (e.g., <variable_name>).
This creates a template that represents the message's structure.
Returns:

| Type | Description |
|---|---|
| `str` | Template string with placeholders for extracted variables. |
Raises:

| Type | Description |
|---|---|
| `TypeError` | If log type is not available (internal error). |
Note¶
Log types are useful for:
- Clustering: Group similar log messages together
- Frequency analysis: Count occurrences of each pattern
- Anomaly detection: New log types may indicate unusual behavior
Example¶
parser = Parser()
parser.add_var("req", r"(?<method>GET|POST) (?<path>/\S+) (?<code>\d+)")
parser.compile()
# Same pattern, different values
e1 = parser.parse_event("GET /users 200")
e2 = parser.parse_event("POST /login 401")
print(e1.get_log_type()) # "<method> <path> <code>"
print(e2.get_log_type()) # "<method> <path> <code>"
# Both have the same log type despite different values
assert e1.get_log_type() == e2.get_log_type()
get_capture_group(name: str, raw_output: bool = False) -> str | list[str | int | float] | None
¶
Get the value of a capture group by name.
Retrieves the extracted value(s) for a named capture group. By default, single values are unwrapped from their list container for convenience.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the capture group to retrieve. Special names: `"@log_type"` (the log type template) and `"@log_message"` (the original log message). | required |
| `raw_output` | `bool` | If True, always return values as a list, even for single matches. If False (default), single-element lists are unwrapped to scalar values. | `False` |
Returns:

| Type | Description |
|---|---|
| `str \| list[str \| int \| float] \| None` | The captured value(s): a scalar for a single match, a list for multiple matches, or None if the group did not match. |
Note¶
Use raw_output=True when you need consistent list handling, such as
when iterating over potentially multi-value captures.
Example¶
# Pattern that matches multiple times
parser.add_var("errors", r"error: (?<error>[a-zA-Z0-9_]+)")
event = parser.parse_event("error: timeout error: disconnect")
# Default: single values unwrapped, multiple values as list
event.get_capture_group("error") # ["timeout", "disconnect"]
# With raw_output: always a list
event.get_capture_group("error", raw_output=True) # ["timeout", "disconnect"]
# Single match example
event2 = parser.parse_event("error: timeout")
event2.get_capture_group("error") # "timeout" (unwrapped)
event2.get_capture_group("error", raw_output=True) # ["timeout"]
# Special names
event.get_capture_group("@log_type") # "error: <error> error: <error>"
event.get_capture_group("@log_message") # "error: timeout error: disconnect"
get_resolved_dict() -> dict[str, str | list[str | int | float]]
¶
Get all extracted variables as a dictionary.
Returns a clean dictionary of all capture groups with their values. Single-element lists are unwrapped to scalar values for convenience. Internal fields like "@LogType" are excluded.
Returns:

| Type | Description |
|---|---|
| `dict[str, str \| list[str \| int \| float]]` | Dictionary mapping capture group names to their extracted values. Single-element lists are unwrapped to scalars, and internal fields like "@LogType" are excluded. |
Note¶
This method is useful for:
- Converting log events to JSON or other formats
- Passing extracted data to downstream processing
- Debugging to see all extracted values at once
Example¶
parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
parser.add_var("status", r"status=(?<code>\d+)")
parser.compile()
event = parser.parse_event("GET /api/users status=200")
result = event.get_resolved_dict()
print(result)
# {
# "method": "GET",
# "path": "/api/users",
# "code": "200"
# }
# Can be easily serialized
import json
print(json.dumps(result))
__getitem__(name: str) -> str | list[str | int | float]
¶
Access a capture group value using dictionary-style indexing.
This is the primary way to access extracted variables from a LogEvent. Single values are automatically unwrapped from lists for convenience.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the capture group as defined in the pattern's `(?<name>...)` syntax. | required |
Returns:

| Type | Description |
|---|---|
| `str \| list[str \| int \| float]` | The captured value(s). Returns a scalar for single matches, or a list for multiple matches of the same capture group. |
Raises:

| Type | Description |
|---|---|
| `KeyError` | If the capture group name does not exist or was not matched. |
Example¶
parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
parser.compile()
event = parser.parse_event("GET /api/users")
# Access like a dictionary
print(event["method"]) # "GET"
print(event["path"]) # "/api/users"
# KeyError for missing fields
try:
event["nonexistent"]
except KeyError as e:
print(e) # "Capture group 'nonexistent' not found"
Query¶
Query(parser: Parser | JsonParser)
¶
Query builder for parsing log events into structured data formats.
Query provides a fluent interface for extracting, filtering, and exporting log data to pandas DataFrames or PyArrow Tables. It works with both Parser (for text logs) and JsonParser (for JSON logs).
Workflow¶
1. Create a Query with a compiled Parser or JsonParser
2. Select fields to extract using `select()`
3. Optionally filter events using `filter()`
4. Set the input source using `from_()`
5. Export using `to_dataframe()` or `to_arrow()`
Key Features¶
- Field selection: Choose specific fields or use `"*"` for all
- Filtering: Apply lambda predicates to select events
- Multiple exports: DataFrame, Arrow Table, or raw rows
- Log type analysis: Get unique log types and their counts
Special Fields¶
In addition to capture group names, you can select:
- `"@log_type"`: The log type template
- `"@log_message"`: The original log message
- `"*"`: All capture groups (Parser only)
Example¶
from log_surgeon import Parser, Query, PATTERN
parser = Parser()
parser.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
parser.add_var("status", rf"(?<code>{PATTERN.INT})")
parser.compile()
# Basic query
df = (
Query(parser)
.select(["method", "path", "code"])
.from_(log_file)
.to_dataframe()
)
# With filtering
errors_df = (
Query(parser)
.select(["@log_message", "code"])
.filter(lambda e: int(e["code"]) >= 400)
.from_(log_file)
.to_dataframe()
)
# Log type analysis
query = Query(parser).from_(log_file)
for log_type, count in query.get_log_type_counts().items():
print(f"{count:5d} {log_type}")
JsonParser Example¶
json_parser = JsonParser(parser).target_fields(["message"])
df = (
Query(json_parser)
.select(["extracted.user_id", "extracted.action"])
.from_(ndjson_file)
.to_dataframe()
)
See Also¶
- Parser : For creating text log parsers.
- JsonParser : For creating JSON log parsers.
Initialize a query builder with a parser.
Creates a new Query instance that will use the given parser to process log data. The parser must be compiled before use.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `parser` | `Parser \| JsonParser` | A compiled Parser or JsonParser instance. The parser's patterns determine what fields can be selected and filtered. | required |
Example¶
# With Parser (text logs)
parser = Parser()
parser.add_var("metric", r"value=(?<value>\d+)")
parser.compile()
query = Query(parser)
# With JsonParser (JSON logs)
json_parser = JsonParser(parser).target_fields(["message"])
query = Query(json_parser)
select(fields: list[str]) -> Query
¶
Select fields to include in the output.
Specifies which extracted variables and metadata to include when exporting to DataFrame or Arrow Table. Fields appear as columns in the order specified.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `fields` | `list[str]` | List of field names to extract. Supports capture group names (from your patterns), the `"*"` wildcard (Parser only), metadata fields (`"@log_type"`, `"@log_message"`), and JsonParser fields with dot-notation for nested access (e.g., `"extracted.user_id"`). | required |
Returns:

| Type | Description |
|---|---|
| `Query` | Self for method chaining. |
Note¶
The "*" wildcard only works with Parser, not JsonParser. For JsonParser,
explicitly list the fields you want from the enriched JSON structure.
Example¶
# Select specific capture groups
query.select(["method", "path", "code"])
# Select all capture groups (Parser only)
query.select(["*"])
# Include metadata with capture groups
query.select(["@log_type", "@log_message", "method", "path"])
# Combine wildcard with metadata
query.select(["@log_type", "*"])
# JsonParser: access nested extracted fields
query.select(["extracted.user_id", "extracted.action", "message"])
filter(predicate: Callable[[LogEvent], bool] | Callable[[dict[str, Any]], bool]) -> Query
¶
Filter log events using a predicate function.
Applies a filter to include only events where the predicate returns True. Events where the predicate returns False are excluded from the output.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `predicate` | `Callable[[LogEvent], bool] \| Callable[[dict[str, Any]], bool]` | Function that receives an event and returns a boolean. Return True to include the event, False to exclude it. | required |
Returns:

| Type | Description |
|---|---|
| `Query` | Self for method chaining. |
Warning¶
Only one filter can be active. Calling filter() multiple times
replaces the previous predicate. Combine conditions in a single
predicate using and/or operators.
Note¶
Handle missing fields gracefully with try/except or .get() to avoid
errors when a capture group does not match in some events.
Example¶
# Simple value filter
query.filter(lambda e: int(e["status_code"]) >= 400)
# Multiple conditions
query.filter(lambda e: e["method"] == "POST" and int(e["code"]) != 200)
# Safe filter for optional fields
def is_error(event):
try:
return int(event["code"]) >= 500
except (KeyError, ValueError):
return False
query.filter(is_error)
# JsonParser filter (receives dict)
query.filter(lambda obj: obj.get("level") == "ERROR")
# Access nested JsonParser fields
query.filter(lambda obj: obj.get("extracted", {}).get("user_id") == "admin")
from_(source: str | TextIO | BinaryIO | io.StringIO | io.BytesIO) -> Query
¶
Set the input source containing log data.
Specifies where to read log data from. Must be called before any
export method (to_dataframe(), to_arrow(), etc.).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `str \| TextIO \| BinaryIO \| StringIO \| BytesIO` | Input data to parse. Accepts raw strings, text or binary file objects, and in-memory `StringIO`/`BytesIO` buffers. | required |
Returns:

| Type | Description |
|---|---|
| `Query` | Self for method chaining. |
Raises:

| Type | Description |
|---|---|
| `TypeError` | If source type is not supported. |
Note¶
For file objects, the entire content is read into memory. For very large files, consider processing in chunks.
Example¶
query = Query(parser).select(["method", "path"])
# From string
df = query.from_("GET /api\nPOST /login").to_dataframe()
# From file
with open("access.log") as f:
df = query.from_(f).to_dataframe()
# From BytesIO (e.g., from HTTP response)
import io
data = io.BytesIO(response.content)
df = query.from_(data).to_dataframe()
validate_query() -> Query
¶
Validate that the query is properly configured.
Returns:

| Type | Description |
|---|---|
| `Query` | Self for method chaining. |
Raises:

| Type | Description |
|---|---|
| `AttributeError` | If fields or stream are not set. |
to_dataframe() -> pd.DataFrame
¶
Export parsed events to a pandas DataFrame.
Parses all events from the configured source, applies any filter, and returns a DataFrame with the selected fields as columns.
Returns:

| Type | Description |
|---|---|
| `DataFrame` | pandas DataFrame with one row per event and one column per selected field. Column order matches the order in `select()`. |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If pandas is not installed. Install with: `pip install pandas`. |
| `AttributeError` | If `select()` or `from_()` has not been called. |
Example¶
to_arrow() -> pa.Table
¶
Export parsed events to a PyArrow Table.
Parses all events from the configured source, applies any filter, and returns a PyArrow Table with the selected fields as columns. Arrow Tables are memory-efficient and ideal for large datasets.
Returns:

| Type | Description |
|---|---|
| `Table` | PyArrow Table with one row per event and one column per selected field. Column order matches the order in `select()`. |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If pyarrow is not installed. Install with: `pip install pyarrow`. |
| `AttributeError` | If `select()` or `from_()` has not been called. |
Note¶
Arrow Tables use columnar storage, which is more memory-efficient than row-based formats for large datasets. They also integrate well with Parquet files and other data processing tools.
Example¶
get_rows() -> list[list[str]]
¶
Extract raw rows of field values from parsed events.
Lower-level method that returns parsed data as a list of lists.
Each inner list represents one event with values in the order
specified by select().
Returns:

| Type | Description |
|---|---|
| `list[list[str]]` | List of rows, where each row is a list of string values. Row order matches event order; column order matches `select()`. |
Note¶
This method is useful when you need raw data without pandas/pyarrow
dependencies. For most use cases, prefer to_dataframe() or to_arrow().
Example¶
get_log_types() -> Generator[str, None, None]
¶
Get unique log types from parsed events.
Yields each distinct log type template exactly once, in the order first encountered. Useful for discovering the different message patterns in your logs.
Yields:

| Type | Description |
|---|---|
| `str` | Unique log type strings (templates with variable placeholders). |
Note¶
- If a filter is set, only matching events contribute log types
- For JsonParser, requires `include_log_type(True)` to be set
- Log types are yielded in first-seen order, not sorted
Example¶
get_log_type_counts() -> dict[str, int]
¶
Count occurrences of each log type.
Counts how many times each distinct log type pattern appears in the log data. Useful for understanding log composition and identifying frequent vs. rare patterns.
Returns:

| Type | Description |
|---|---|
| `dict[str, int]` | Dictionary mapping log type templates to their occurrence counts. Not sorted; use `sorted()` on the items if you need an ordering. |
Note¶
- If a filter is set, only matching events are counted
- For JsonParser, requires `include_log_type(True)` to be set
Example¶
query = Query(parser).from_(log_data)
counts = query.get_log_type_counts()
# Print sorted by frequency (most common first)
for log_type, count in sorted(counts.items(), key=lambda x: -x[1]):
print(f"{count:6d} {log_type}")
# Output:
# 15432 <method> <path> <code>
# 2341 Connection from <ip>:<port>
# 17 Error: <error_message>
# Find rare patterns (potential anomalies)
rare = [lt for lt, c in counts.items() if c < 10]
print(f"Rare patterns: {len(rare)}")
JsonParser¶
JsonParser(parser: Parser)
¶
Parser for JSON-formatted logs that extracts variables from string fields.
JsonParser wraps an existing Parser and applies its extraction rules to JSON string fields, then merges the extracted variables back into the original JSON. This enables structured extraction from JSON logs while preserving the original JSON structure.
Key Features¶
- Flexible field targeting: Extract from all strings or specific fields
- Nested field support: Access nested JSON fields using dot-notation
- Multiple formats: Parse NDJSON or JSON array formats with auto-detection
- Conflict resolution: Configure how extracted keys merge with existing JSON
- Streaming support: Efficiently process large NDJSON files line by line
Default Behavior¶
By default, JsonParser extracts from all string fields in the JSON object,
recursively traversing nested objects and arrays. Use target_fields() to
limit extraction to specific fields for better performance and precision.
Workflow¶
1. Create a Parser with extraction patterns and compile it
2. Create a JsonParser wrapping the Parser
3. Optionally configure target fields and conflict strategy
4. Parse JSON logs using `parse()` or `parse_one()`
Note¶
Extraction only works on string-type fields. Non-string fields (numbers, booleans, nested objects, arrays) are skipped unless they contain strings.
Example¶
from log_surgeon import JsonParser, Parser, ConflictStrategy
# Step 1: Create and configure underlying parser
parser = Parser()
parser.add_var("user_info", r"user=(?<user_id>\d+)")
parser.add_var("action", r"action=(?<action>[a-zA-Z0-9_]+)")
parser.compile()
# Step 2: Create JSON parser with field targeting
json_parser = (
JsonParser(parser)
.target_fields(["message", "context.detail"]) # Only parse these fields
.on_conflict(ConflictStrategy.NEST, key="extracted")
)
# Step 3: Parse JSON logs
input_json = '{"ts": "2024-01-01", "message": "user=123 action=login"}'
result = json_parser.parse_one(input_json)
print(result)
# {
# 'ts': '2024-01-01',
# 'message': 'user=123 action=login',
# 'extracted': {'user_id': '123', 'action': 'login'}
# }
See Also¶
- Parser : For creating extraction patterns.
- ConflictStrategy : For configuring conflict resolution.
- Query : For exporting JsonParser results to DataFrames.
Initialize the JSON parser with an underlying Parser.
Creates a JsonParser that applies the given Parser's extraction patterns to JSON string fields. By default, extracts from all string fields.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `parser` | `Parser` | A compiled Parser instance with extraction patterns defined. Must have `compile()` called before being wrapped. | required |
Note¶
Default configuration:

- Extracts from all string fields (equivalent to `target_fields("*")`)
- Uses NEST conflict strategy with key "extracted"
- Does not include log type in output

Use the fluent API methods to customize behavior:

- `target_fields()`: Limit which JSON fields are parsed
- `on_conflict()`: Configure conflict resolution strategy
- `include_log_type()`: Include log type templates in output
Example¶
# Create and configure the underlying parser
parser = Parser()
parser.add_var("metric", r"value=(?<value>\d+)")
parser.compile()
# Create JSON parser with default settings
json_parser = JsonParser(parser)
# Or customize with fluent API
json_parser = (
JsonParser(parser)
.target_fields(["message"])
.on_conflict(ConflictStrategy.NEST, key="data")
.include_log_type(True)
)
target_fields(fields: list[str] | str) -> JsonParser
¶
Configure which JSON fields to target for variable extraction.
By default, JsonParser extracts from all string fields in the JSON object. Use this method to limit extraction to specific fields for better performance and to avoid unintended matches in other fields.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `fields` | `list[str] \| str` | Field specification. Accepts a single field name, a list of field names, or `"*"` to target all string fields. Dot-notation is supported for nested fields (e.g., `"error.details"`). | required |
Returns:

| Type | Description |
|---|---|
| `JsonParser` | Self for method chaining. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If the field specification is invalid. |
Note¶
Targeting specific fields is recommended for production use:
- Performance: Avoids parsing irrelevant fields
- Precision: Prevents false matches in metadata fields
- Clarity: Makes extraction intent explicit
Example¶
json_parser = JsonParser(parser)
# Target single field
json_parser.target_fields("message")
# Target multiple fields (including nested)
json_parser.target_fields(["message", "error.details", "context.info"])
# Reset to all string fields
json_parser.target_fields("*")
Nested Field Example¶
on_conflict(strategy: ConflictStrategy, prefix: str = 'extracted.', key: str = 'extracted') -> JsonParser
¶
Configure how to handle key conflicts between extracted and existing JSON keys.
When an extracted variable name matches an existing key in the JSON object, this setting determines how the conflict is resolved.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `strategy` | `ConflictStrategy` | The conflict resolution strategy to use. See `ConflictStrategy` for the available options. | required |
| `prefix` | `str` | Prefix for extracted keys when using `ConflictStrategy.PREFIX`. | `'extracted.'` |
| `key` | `str` | Nesting key when using `ConflictStrategy.NEST`. | `'extracted'` |
Returns:

| Type | Description |
|---|---|
| `JsonParser` | Self for method chaining. |
Example¶
from log_surgeon import JsonParser, ConflictStrategy
# NEST (default): All extracted values under a nested key
json_parser.on_conflict(ConflictStrategy.NEST, key="parsed")
# Input: {"message": "user=123"}
# Output: {"message": "user=123", "parsed": {"user_id": "123"}}
# PREFIX: Add prefix to each extracted key
json_parser.on_conflict(ConflictStrategy.PREFIX, prefix="log_")
# Input: {"message": "user=123"}
# Output: {"message": "user=123", "log_user_id": "123"}
# OVERWRITE: Replace existing keys (use with caution)
json_parser.on_conflict(ConflictStrategy.OVERWRITE)
# Input: {"user_id": "old", "message": "user=123"}
# Output: {"user_id": "123", "message": "user=123"} # Warning printed
# RAISE: Fail on conflict (for development/testing)
json_parser.on_conflict(ConflictStrategy.RAISE)
# Raises KeyError if extracted key exists in JSON
include_log_type(include: bool = True) -> JsonParser
¶
Configure whether to include log type templates in the output.
Log types are template strings where matched variables are replaced with
placeholders (e.g., "user=123" becomes "user=<user_id>").
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| include | bool | If True, include "@log_type" in the extracted variables. | True |
Returns:
| Type | Description |
|---|---|
| JsonParser | Self for method chaining. |
Note¶
When enabled, each parsed field contributes its log type. If multiple fields are parsed, log types are aggregated.
Example¶
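A minimal usage sketch; the output shape shown assumes the default NEST conflict strategy and the placeholder style produced by get_log_type():

```python
json_parser.include_log_type(True)
result = json_parser.parse_one('{"message": "user=123"}')
# Assumed output shape:
# result["extracted"]["user_id"]   -> "123"
# result["extracted"]["@log_type"] -> "user=<user_id>"
```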
parse(source: str | TextIO | BinaryIO | io.StringIO | io.BytesIO) -> Generator[dict[str, Any], None, None]
¶
Parse JSON logs from an input source.
Generator that yields enriched JSON dictionaries with extracted variables merged in. Supports both NDJSON (newline-delimited JSON) and JSON array formats, with automatic format detection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str \| TextIO \| BinaryIO \| StringIO \| BytesIO | Input data to parse. Accepts a string of JSON content, a text or binary file object, or an in-memory StringIO/BytesIO buffer. | required |
Yields:
| Name | Type | Description |
|---|---|---|
| dict | dict[str, Any] | Enriched JSON object with extracted variables merged according to the configured conflict strategy. |
Raises:
| Type | Description |
|---|---|
| JSONDecodeError | If the input contains invalid JSON. |
| TypeError | If the source type is not supported. |
Format Detection¶
The format is auto-detected by checking the first non-whitespace character:
- Starts with [: parsed as a JSON array (entire content loaded)
- Otherwise: parsed as NDJSON (streamed line by line for file objects)
Note¶
For NDJSON files, parsing is streamed line-by-line to minimize memory usage. JSON arrays must be fully loaded into memory for parsing.
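The auto-detection rule can be sketched in plain Python (illustrative only, not the library's implementation, which streams file objects rather than materializing lists):

```python
import json

def detect_and_parse(text: str):
    """Auto-detect JSON array vs. NDJSON by the first non-whitespace char."""
    stripped = text.lstrip()
    if stripped.startswith("["):
        # JSON array: the whole document must be loaded at once.
        return json.loads(stripped)
    # NDJSON: one JSON object per non-empty line.
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]

print(detect_and_parse('[{"a": 1}, {"a": 2}]'))  # [{'a': 1}, {'a': 2}]
print(detect_and_parse('{"a": 1}\n{"a": 2}\n'))  # [{'a': 1}, {'a': 2}]
```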
Example¶
# Parse NDJSON from string
ndjson = '''{"message": "user=123"}
{"message": "user=456"}'''
for result in json_parser.parse(ndjson):
print(result["extracted"]["user_id"])
# 123
# 456
# Parse JSON array
json_array = '[{"message": "user=123"}, {"message": "user=456"}]'
for result in json_parser.parse(json_array):
print(result["extracted"]["user_id"])
# Stream from file
with open("logs.ndjson") as f:
for result in json_parser.parse(f):
print(result)
parse_one(json_line: str) -> dict[str, Any]
¶
Parse a single JSON object and return the enriched result.
Convenience method for parsing a single JSON log entry. For multiple
entries, use parse() instead.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| json_line | str | A single JSON object as a string. Must be a valid JSON object (starts with {). | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Enriched JSON dictionary with the original fields plus extracted variables merged according to the configured conflict strategy. |
Raises:
| Type | Description |
|---|---|
| JSONDecodeError | If the input is not valid JSON. |
Example¶
result = json_parser.parse_one('{"ts": "2024-01-01", "message": "user=123"}')
# Access original fields
print(result["ts"]) # "2024-01-01"
print(result["message"]) # "user=123"
# Access extracted fields (with default NEST strategy)
print(result["extracted"]["user_id"]) # "123"
# Full result structure
# {
# "ts": "2024-01-01",
# "message": "user=123",
# "extracted": {"user_id": "123"}
# }
ConflictStrategy¶
ConflictStrategy
¶
Bases: Enum
Strategy for handling conflicts when extracted keys match existing JSON keys.
When JsonParser extracts variables from JSON fields, the extracted key names might conflict with keys already present in the JSON object. This enum defines how such conflicts are resolved.
Attributes¶
NEST : enum member
Place all extracted variables under a nested key (default: "extracted").
This is the safest option as it completely avoids conflicts.
Result: {"message": "...", "extracted": {"user_id": "123"}}
PREFIX : enum member
Add a prefix to extracted variable names (default: "extracted.").
Warns to stderr if the prefixed key still conflicts.
Result: {"message": "...", "extracted.user_id": "123"}
OVERWRITE : enum member
Replace existing keys with extracted values. Prints a warning to stderr
when overwriting occurs. Use with caution as original data is lost.
Result: {"user_id": "123"} (original user_id overwritten)
RAISE : enum member
Raise KeyError when a conflict is detected. Useful for development and testing to catch unexpected conflicts early.
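The four strategies can be sketched in plain Python (illustrative only; the hypothetical `merge` helper below omits the stderr warnings the library prints for PREFIX and OVERWRITE):

```python
def merge(original: dict, extracted: dict, strategy: str,
          prefix: str = "extracted.", key: str = "extracted") -> dict:
    """Merge extracted variables into a JSON object per conflict strategy."""
    result = dict(original)
    if strategy == "NEST":
        # Safest: all extracted values live under one nested key.
        result[key] = dict(extracted)
    elif strategy == "PREFIX":
        for k, v in extracted.items():
            result[prefix + k] = v
    elif strategy == "OVERWRITE":
        # Original values for colliding keys are lost.
        result.update(extracted)
    elif strategy == "RAISE":
        for k in extracted:
            if k in result:
                raise KeyError(f"extracted key {k!r} already exists")
        result.update(extracted)
    return result

original = {"message": "user=123"}
print(merge(original, {"user_id": "123"}, "NEST"))
# {'message': 'user=123', 'extracted': {'user_id': '123'}}
print(merge(original, {"user_id": "123"}, "PREFIX"))
# {'message': 'user=123', 'extracted.user_id': '123'}
```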
Example¶
from log_surgeon import JsonParser, ConflictStrategy
# Default: nest under "extracted" key
json_parser = JsonParser(parser)
# Custom nest key
json_parser.on_conflict(ConflictStrategy.NEST, key="parsed")
# Result: {"message": "...", "parsed": {"user_id": "123"}}
# Use prefix instead
json_parser.on_conflict(ConflictStrategy.PREFIX, prefix="log_")
# Result: {"message": "...", "log_user_id": "123"}
# Fail on conflict (for testing)
json_parser.on_conflict(ConflictStrategy.RAISE)
SchemaCompiler¶
SchemaCompiler(delimiters: str = DEFAULT_DELIMITERS)
¶
Compiler for constructing log-surgeon schema definitions.
SchemaCompiler provides a fluent interface for building schema definitions used by the log-surgeon parsing engine. It handles variable registration, pattern validation, and schema serialization.
Key Responsibilities¶
- Register variable patterns with named capture groups
- Track capture group names for validation
- Manage variable priority for pattern ordering
- Generate hidden variable names for internal use
- Compile the final schema string
Schema Format¶
The compiled schema is a text format with sections for delimiters, timestamps, and variables:
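An illustrative fragment of what the compiled schema text might look like (the exact line syntax here is an assumption based on the section names above, not verbatim compiler output):

```
delimiters: \t\r\n:,!;%@/()[]
timestamp:\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2}
ip:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
```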
Priority System¶
Variables are ordered in the schema by:
1. Priority (descending): higher priority = appears first
2. Insertion order (ascending): earlier added = appears first
This ordering affects which pattern is tried first when multiple patterns could match the same text.
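The two-level ordering can be expressed as a sort key (a sketch; variables are modeled here as plain tuples rather than the library's Variable objects):

```python
# Sort variables by (priority descending, insertion order ascending).
# Each variable is modeled as a (name, priority, insertion_index) tuple.
variables = [
    ("int", -1, 2),     # low priority, added third
    ("ip", 10, 0),      # high priority, added first
    ("request", 0, 1),  # default priority, added second
]
ordered = sorted(variables, key=lambda v: (-v[1], v[2]))
print([name for name, _, _ in ordered])  # ['ip', 'request', 'int']
```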
Note¶
Most users should use the Parser class, which provides a simpler interface and handles schema compilation automatically. Use SchemaCompiler directly only for advanced use cases.
Example¶
from log_surgeon.schema_compiler import SchemaCompiler
compiler = SchemaCompiler()
# Add patterns with priority
compiler.add_var("ip", r"(?<ip>[0-9.]+)", priority=10)
compiler.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
compiler.add_var("int", r"(?<num>\d+)", priority=-1) # Low priority
# Add timestamp for multi-line event detection
compiler.add_timestamp("iso", r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
# Compile to schema string
schema = compiler.compile()
See Also¶
Parser : High-level interface for log parsing. Variable : Data class representing a variable definition.
Initialize a schema compiler.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| delimiters | str | String of delimiter characters for tokenization. Default includes space, tab, newline, and common punctuation. | DEFAULT_DELIMITERS |
add_var(name: str, regex: str, priority: int = 0) -> SchemaCompiler
¶
Add a variable pattern to the schema.
Patterns must include at least one named capture group using
(?<name>...) syntax. The capture group names become the keys
for accessing extracted values from parsed events.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Unique identifier for this variable pattern. | required |
| regex | str | Regular expression with named capture groups. Use (?<name>...) syntax to define them. | required |
| priority | int | Pattern ordering priority. Higher values are tried first during matching. | 0 |
Returns:
| Type | Description |
|---|---|
| SchemaCompiler | Self for method chaining. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the pattern has no capture groups, or if names contain delimiter characters. |
| AttributeError | If a variable with this name already exists. |
Example¶
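A usage sketch mirroring the class-level example above:

```python
from log_surgeon.schema_compiler import SchemaCompiler

compiler = SchemaCompiler()
compiler.add_var("ip", r"(?<ip>[0-9.]+)", priority=10)  # tried first
compiler.add_var("request", r"(?<method>GET|POST) (?<path>/\S+)")
compiler.add_var("int", r"(?<num>\d+)", priority=-1)    # tried last
```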
add_timestamp(name: str, regex: str) -> SchemaCompiler
¶
Add a timestamp pattern to the schema.
Timestamps help log-surgeon detect log event boundaries. When a timestamp pattern matches at the start of a line, it signals a new log event, enabling correct handling of multi-line events like stack traces.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Unique identifier for this timestamp pattern. | required |
| regex | str | Regular expression for matching timestamp formats. | required |
Returns:
| Type | Description |
|---|---|
| SchemaCompiler | Self for method chaining. |
Example¶
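A usage sketch based on the class-level example:

```python
# Mark ISO-8601-style lines as event starts so multi-line events
# (e.g., stack traces) attach to the preceding timestamped line.
compiler.add_timestamp("iso", r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
```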
remove_var(var_name: str) -> SchemaCompiler
¶
Remove a variable from the schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| var_name | str | Name of the variable to remove (or its original name if hidden). | required |
Returns:
| Type | Description |
|---|---|
| SchemaCompiler | Self for method chaining. |
get_var(var_name: str) -> Variable
¶
Get a variable by name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| var_name | str | Variable name. | required |
Returns:
| Type | Description |
|---|---|
| Variable | The Variable object. |
compile() -> str
¶
Compile the schema to a string for the log-surgeon engine.
Generates the final schema definition that includes delimiters, timestamps, and variables ordered by priority. This string is passed to the log-surgeon C++ library for DFA compilation.
Returns:
| Type | Description |
|---|---|
| str | Schema definition string in log-surgeon format. |
Note¶
Variables are ordered by:
1. Priority (descending): higher priority patterns first
2. Insertion order (ascending): earlier added patterns first
This ordering determines which pattern is tried first when multiple patterns could match the same text.
Example¶
compiler = SchemaCompiler()
compiler.add_var("ip", r"(?<ip>[0-9.]+)", priority=10)
compiler.add_var("number", r"(?<num>\d+)", priority=-1)
compiler.add_timestamp("ts", r"\d{4}-\d{2}-\d{2}")
schema = compiler.compile()
# Returns formatted schema string with sections for
# delimiters, timestamps, and variables
PATTERN¶
The PATTERN class provides pre-built regex patterns optimized for log parsing.
PATTERN
¶
Collection of pre-built regex patterns for common log elements.
PATTERN provides ready-to-use regex patterns optimized for log-surgeon's
delimiter-based matching. Use these patterns with add_var() to extract
common log elements without writing complex regex manually.
Categories¶
- Network Patterns: UUID, IPV4, PORT
- Numeric Patterns: INT, FLOAT
- File System Patterns: LINUX_FILE_NAME, LINUX_FILE_PATH
- Character Sets: JAVA_IDENTIFIER, LOG_LINE, LOG_LINE_NO_WHITE_SPACE
- Java Patterns: JAVA_CLASS_NAME, JAVA_FULLY_QUALIFIED_CLASS_NAME, JAVA_STACK_LOCATION
Usage¶
Embed patterns in your regex using f-strings:
parser.add_var("ip", rf"(?<ip>{PATTERN.IPV4})")
parser.add_var("value", rf"val=(?<v>{PATTERN.INT})")
Note¶
These patterns use log-surgeon regex syntax, which differs slightly from
Python regex. Notably, . matches any character except delimiters.
Example¶
from log_surgeon import Parser, PATTERN
parser = Parser()
# Network patterns
parser.add_var("connection", rf"(?<ip>{PATTERN.IPV4}):(?<port>{PATTERN.PORT})")
parser.add_var("request_id", rf"id=(?<uuid>{PATTERN.UUID})")
# Numeric patterns
parser.add_var("metric", rf"(?<value>{PATTERN.FLOAT})")
parser.add_var("count", rf"n=(?<n>{PATTERN.INT})")
# File patterns
parser.add_var("file", rf"(?<path>{PATTERN.LINUX_FILE_PATH})")
# Java patterns
parser.add_var("class", rf"(?<class>{PATTERN.JAVA_FULLY_QUALIFIED_CLASS_NAME})")
parser.compile()
See Also¶
Parser.add_var : Method for adding patterns to a parser.