sketchkit.datasets package

Submodules

sketchkit.datasets.controlsketch module

class sketchkit.datasets.controlsketch.ControlSketch(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

ControlSketch dataset loader (SketchDataset-style).

Directory structure (after download & extract):
controlsketch_sketches/

├── train/<category>/*.svg
├── validation/<category>/*.svg
└── test/<category>/*.svg

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

Returns:

True if the dataset integrity is verified, False otherwise.

Return type:

bool
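As a rough illustration of an MD5-based integrity check, the sketch below hashes a single file in chunks and compares the digest with an expected value. The helper names `file_md5` and `matches_checksum` are hypothetical, and hashing one archive file is an assumption: this page does not state exactly what sketchkit hashes (some loaders describe a checksum of the extracted directory).

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def file_md5(path: str | Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str | Path, expected_md5: str) -> bool:
    """Return True if the file exists and its MD5 equals expected_md5 (case-insensitive)."""
    path = Path(path)
    return path.is_file() and file_md5(path) == expected_md5.lower()
```

Reading in chunks keeps memory use flat even for multi-gigabyte archives.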

_download()[source]

Download and extract the ControlSketch dataset.

Downloads the dataset archive from CISLAB CDN or Google Drive, extracts it into the dataset root, and removes the source file if successful.

_get_single(idx: int) Sketch[source]

Get one sketch by index.

Parameters:

idx (int) – Global index of the sketch.

Returns:

A Sketch object containing parsed paths.

Return type:

Sketch

Raises:

IndexError – If the index is out of range.

_load_all()[source]

Load all sketch data into memory if load_all is enabled.

Concatenates the sketch data from every category and split into a single NumPy array for faster access. Only loads if self.load_all is True.

_load_items_metadata()[source]

Build metadata index for dataset items.

Scans the extracted dataset directory for train/validation/test splits, collects all SVG files with their category and split, and saves results into a cached Parquet file.

Raises:

FileNotFoundError – If the dataset directory is not found.
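The directory scan described above can be sketched roughly as follows. This is an illustration under assumptions, not the loader's actual code: `build_index` is a hypothetical helper, the way `id` and `sub_id` are assigned is guessed, and the rows are returned as plain dictionaries instead of being written to Parquet.

```python
from __future__ import annotations

from pathlib import Path

SPLITS = ("train", "validation", "test")


def build_index(dataset_dir: str | Path) -> list[dict]:
    """Collect one metadata row per SVG file under <split>/<category>/."""
    dataset_dir = Path(dataset_dir)
    if not dataset_dir.is_dir():
        raise FileNotFoundError(f"Dataset directory not found: {dataset_dir}")

    rows: list[dict] = []
    for split in SPLITS:
        split_dir = dataset_dir / split
        if not split_dir.is_dir():
            continue  # tolerate a missing split in this sketch
        for category_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            for sub_id, svg in enumerate(sorted(category_dir.glob("*.svg"))):
                rows.append(
                    {
                        "id": len(rows),  # assumed: running global index
                        "sub_id": sub_id,  # assumed: index within category and split
                        "category": category_dir.name,
                        "split": split,
                        "filepath": str(svg.relative_to(dataset_dir)),
                    }
                )
    return rows
```

If pandas and a Parquet engine are installed, the rows could then be cached with `pandas.DataFrame(rows).to_parquet(...)`.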

extra_repr() str[source]

Return dataset summary.

Returns:

A string with the number of categories and the sample counts for the train/validation/test splits.

Return type:

str

md5_sum = 'c45dad0c08988df3d4036e85e5363e8a'
metadata = ['id', 'sub_id', 'category', 'split', 'filepath']
sketchkit.datasets.controlsketch.parse_svg(svg_path: str) list[Path][source]

Parse an SVG file into a list of Path objects.

Parameters:

svg_path (str) – Path to the SVG file to parse.

Returns:

Parsed sketch represented as a list of Path objects.

Return type:

list[Path]

Raises:
  • AssertionError – If a transform attribute is encountered.

  • Exception – If an arc segment is encountered.

sketchkit.datasets.creative_sketch module

class sketchkit.datasets.creative_sketch.CreativeSketch(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

The Creative Sketch dataset contains vector sketches of birds and creatures with detailed part annotations.

md5_sum

MD5 checksum for dataset integrity verification.

Type:

str

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset.

Returns:

True if the dataset integrity is verified, False otherwise.

Return type:

bool

_download()[source]

Download the Creative Sketch dataset.

Downloads the dataset from Google Drive and extracts it to the root directory.

Raises:

Exception – If download fails for any reason.

_get_single(idx: int) Sketch[source]

Get a sketch by index.

If the sketch is not in memory, it is loaded from disk.

Parameters:

idx (int) – Index of the sketch to retrieve.

Returns:

A Sketch object containing the drawing data as paths.

Return type:

Sketch

Raises:

IndexError – If index is out of range.

_load_all()[source]

Load all sketch data into memory if load_all is enabled.

Loads all sketch data from the JSON files into memory for faster access. Only loads if self.load_all is True.

_load_items_metadata()[source]

Load and cache metadata for all items in the dataset.

Creates a parquet file containing metadata for all sketches including category, sub_id, global ID, and whether the sample is good. If metadata cache exists, loads from the cached file.

_load_sketch_data(idx)[source]

Load a single sketch data by index.

Parameters:

idx (int) – Index of the sketch to load.

Returns:

List of paths for the sketch.

Return type:

list

extra_repr() str[source]

Return extra information for the string representation.

Returns:

Additional information to include in __repr__.

Return type:

str

md5_sum = '84c57a01499321bd6080b1f76754d709'
metadata = ['id', 'category', 'sub_id', 'is_good']

sketchkit.datasets.differsketching module

class sketchkit.datasets.differsketching.DifferSketching(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

The DifferSketching dataset contains vector sketches with pressure and timing data.

This dataset includes multiple drawing types (original, global, stroke, reg) across various categories, with each sketch containing stroke sequences with pressure and timing information.

md5_sum

MD5 checksum for dataset integrity verification.

Type:

str

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

Returns:

True if the dataset integrity is verified, False otherwise.

Return type:

bool

_download()[source]

Download the DifferSketching dataset.

Downloads the dataset archive from either Google Drive or CISLAB CDN mirror. Creates the root directory if it doesn’t exist.

Raises:

Exception – If download fails.

_get_single(idx: int) Sketch[source]

Get a sketch by index.

Parameters:

idx (int) – Index of the sketch to retrieve.

Returns:

A Sketch object containing the drawing data as paths.

Return type:

Sketch

Raises:

IndexError – If index is out of range.

_load_all()[source]

Load all sketch data into memory if load_all is enabled.

_load_items_metadata()[source]

Load and cache metadata for all items in the dataset.

Creates a parquet file containing metadata for all sketches including category, type, file path, and global ID. If metadata cache exists, loads from the cached file.

The metadata DataFrame contains columns:
  • id: Global unique identifier across all sketches

  • file_path: Path to the JSON file

  • category: Drawing category name

  • type: Drawing type (original, global, stroke, reg)

_load_sketch(idx: int) Sketch[source]

Load a single sketch by index.

extra_repr() str[source]

Return extra information for the string representation.

md5_sum = 'ab0d9202dbaca58339b0b8ae6c1be0f3'
metadata = ['id', 'file_path', 'category', 'type']
sketchkit.datasets.differsketching.load_differsketching_json(json_path: str) Sketch[source]

Load a JSON sketch and convert it to a Sketch instance.

Parameters:

json_path (str) – Path to a DifferSketching JSON file.

Returns:

A Sketch object representing the vector sketch.

Return type:

Sketch

sketchkit.datasets.fscoco module

class sketchkit.datasets.fscoco.FSCOCO(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

FS-COCO dataset loader for vector sketches stored in .npy format.

This dataset class can download, verify, index, and load FS-COCO vector sketches. Each sketch is represented as polyline points and converted to cubic Bézier curves.

md5_sum

MD5 checksum for verifying dataset integrity.

Type:

str

metadata

Column names used in the metadata table.

Type:

list[str]

_abc_impl = <_abc._abc_data object>
static _array_to_sketch(data: ndarray) Sketch[source]

Convert a (N, 3) array to a Sketch with multiple `Path`s.

Each continuous stroke (sequence of points until a pen-up/end flag) is converted into one `Path`. Adjacent point pairs are mapped to cubic Bézier `Curve`s using `line_to_cubic`.

Parameters:

data – Array of shape (N, 3) with columns [x, y, flag]. Flag meanings (common convention): `0`=pen-down/continue, `1`=pen-up, `2`=end-of-drawing.

Returns:

A sketch composed of zero or more paths.

Return type:

Sketch

Raises:

ValueError – If data is not a 2D array with exactly 3 columns.
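The stroke splitting and line-to-cubic mapping described above might look roughly like the following. The helpers `split_strokes`, `line_as_cubic`, and `strokes_to_cubics` are hypothetical and work on plain tuples instead of `Sketch`, `Path`, and `Curve` objects. Two assumptions are made: a point flagged 1 or 2 is the last point of its stroke, and a straight segment becomes a cubic with control points at one third and two thirds of the way along it.

```python
Point = tuple[float, float]
Cubic = tuple[Point, Point, Point, Point]


def line_as_cubic(p0: Point, p1: Point) -> Cubic:
    """Express the straight segment p0 -> p1 as an equivalent cubic Bezier curve."""
    (x0, y0), (x1, y1) = p0, p1
    c1 = (x0 + (x1 - x0) / 3.0, y0 + (y1 - y0) / 3.0)
    c2 = (x0 + 2.0 * (x1 - x0) / 3.0, y0 + 2.0 * (y1 - y0) / 3.0)
    return (p0, c1, c2, p1)


def split_strokes(rows) -> list[list[Point]]:
    """Group [x, y, flag] rows into strokes; flag 1 ends a stroke, flag 2 ends the drawing."""
    strokes: list[list[Point]] = []
    current: list[Point] = []
    for x, y, flag in rows:
        current.append((float(x), float(y)))
        if flag in (1, 2):
            strokes.append(current)
            current = []
            if flag == 2:
                break
    if current:  # trailing points with no closing flag still form a stroke
        strokes.append(current)
    return strokes


def strokes_to_cubics(rows) -> list[list[Cubic]]:
    """Map every adjacent point pair of every stroke to one cubic segment."""
    return [
        [line_as_cubic(a, b) for a, b in zip(stroke, stroke[1:])]
        for stroke in split_strokes(rows)
        if len(stroke) >= 2  # a single point has no drawable segment
    ]
```

Placing the control points on the segment itself keeps the curve geometrically identical to the original polyline.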

_check_integrity() bool[source]

Check on-disk dataset integrity.

Returns:

True if the dataset directory exists and its MD5 matches the expected checksum; False otherwise.

Return type:

bool

_download()[source]

Download and extract FS-COCO if the integrity check fails.

The archive is fetched from the CISLAB mirror or the official host, extracted into self.root, and a checksum is printed for verification.

_get_single(idx: int) Sketch[source]

Get a single sketch by index.

Parameters:

idx – Index of the sketch to retrieve.

Returns:

A Sketch object built from the underlying array.

Return type:

Sketch

Raises:

IndexError – If idx is out of bounds.

_load_all()[source]

Preload all .npy arrays into memory.

This is useful when load_all=True is requested and you want to avoid disk I/O during iteration.

_load_items_metadata()[source]

Build or load the items metadata table.

Scans the dataset directory to create a table with the following columns:
  • id: A globally unique integer identifier for each sketch.

  • shard: The shard index (1..100).

  • file_path: Relative path to the .npy file (relative to self.root).

The table is cached in .metadata.parquet for faster subsequent loads.

_load_single_array(idx: int) ndarray[source]

Load a single .npy array for the given index.

Parameters:

idx – Row index in the metadata table.

Returns:

Array of shape (N, 3) containing [x, y, flag].

Return type:

np.ndarray

extra_repr() str[source]

Return a short human-readable summary for debugging.

Returns:

Summary string containing file count and shard coverage.

Return type:

str

md5_sum: str = '5e5c08c2e6877b164b5d04f3ce8b9f89'
metadata = ['id', 'shard', 'file_path']

sketchkit.datasets.gmu_sketch_cleanup module

class sketchkit.datasets.gmu_sketch_cleanup.GMUSketchCleanup(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

GMU Rough Sketch Cleanup dataset (SVG parsing version).

Directory structure (after extraction):
Benchmark_Dataset/

├── GT/
├── Rough/JPG/
├── Rough/PNG/
├── Rough/SVG/
└── sketch_tags.csv

  • Only the id column is mandatory; additional columns like split (GT/Rough) and file_path (SVG path) are optional.

  • _get_single: Reads from raw_data; if missing, parses the SVG from disk and caches it.

  • Supports downloading from the official website or CISLAB mirror.
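The read-from-cache-or-parse behaviour noted for _get_single above can be sketched as a small lazy cache. `LazyItemCache` and its `loader` callback are hypothetical names used only for illustration; the real class keeps its data in raw_data and parses SVG files, which is stubbed out here.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class LazyItemCache(Generic[T]):
    """Load items on first access and keep them in memory afterwards."""

    def __init__(self, n_items: int, loader: Callable[[int], T]) -> None:
        self._n_items = n_items
        self._loader = loader  # e.g. a function that parses one SVG from disk
        self._raw_data: dict[int, T] = {}

    def get(self, idx: int) -> T:
        if not 0 <= idx < self._n_items:
            raise IndexError(f"index {idx} out of range for {self._n_items} items")
        if idx not in self._raw_data:
            # Cache miss: load once, then serve later requests from memory.
            self._raw_data[idx] = self._loader(idx)
        return self._raw_data[idx]
```

With this pattern, preloading everything amounts to calling `get` for every index up front.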

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Checks the integrity of the cached dataset by comparing MD5 checksums.

_download()[source]

Downloads and extracts the GMU-Sketch-Cleanup dataset.

_get_single(idx: int) Sketch[source]

Retrieves a single item by index.

Parameters:

idx – Index of the item.

Returns:

A Sketch object.

Raises:

IndexError – If the index is out of range.

_load_all()[source]

Preloads all items into memory.

_load_items_metadata()[source]

Loads or generates metadata for the dataset.

_load_one(idx: int)[source]

Loads a single item from disk.

Parameters:

idx – Index of the item.

Returns:

Parsed SVG data.

extra_repr() str[source]

Returns additional string representation of the dataset.

md5_sum: str = '8dbea6e9cd42810c80f41595a633ff33'
metadata = ['id', 'split', 'file_path']
sketchkit.datasets.gmu_sketch_cleanup.parse_gmu_svg(svg_file: str) tuple[tuple[int, int], list[list[ndarray]], int][source]

Parses a GMU SVG file into (width, height), path_list, and total_segment_num.

Parameters:

svg_file – Path to the SVG file.

Returns:

A tuple containing:
  • (width, height): Dimensions of the SVG canvas.

  • path_list: A list of paths, where each path contains multiple cubic segments.

  • total_segment_num: Total number of cubic segments in the SVG.

Return type:

tuple[tuple[int, int], list[list[ndarray]], int]

sketchkit.datasets.hzy_sketch module

sketchkit.datasets.hzy_sketch.download_and_extract_data(output_folder, remove_sourcefile=True, cislab_source=True)[source]
class sketchkit.datasets.hzy_sketch.hzySketch(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

The hzy dataset, which contains the drawing process of high-quality anime line art.

Parameters:

root – The directory in which the dataset is stored. The dataset is downloaded automatically if it does not exist.

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

Returns:

True if the dataset integrity is verified, False otherwise.

Return type:

bool

_download()[source]

Download the hzy dataset.

Downloads the categories.txt file and all category .npz files from either the original Google storage or CISLAB CDN mirror. Creates the root directory if it doesn’t exist.

Raises:

Exception – If download fails for any file.

_get_single(idx: int) Sketch[source]

Retrieves a sketch by its index.

Parameters:

idx – The index of the sketch to retrieve.

Returns:

The parsed sketch as a Sketch object.

_load_all()[source]

Load all dataset items into memory.

This method should load all dataset items into memory for faster access. It’s called when load_all is True during initialization or when explicitly requested.

Note

The loaded data should be stored in self.raw_data for later access by __getitem__.

_load_items_metadata()[source]

Load metadata for all items in the dataset into a pandas DataFrame.

This method should populate the items_metadata DataFrame with information about all available items in the dataset. The metadata is used to provide quick access to item information without loading the actual data.

Note

This method is called after successful integrity check and should populate self.items_metadata as a pandas DataFrame with appropriate columns for dataset-specific metadata.

md5_sum = 'bb0d0fdc6eefa2e2ab41f990cc8cd5f4'
sketchkit.datasets.hzy_sketch.load_hzy_sketch_json(drawing_path)[source]

Load a vector sketch from a .json file and convert to SketchVG format.

Parameters:

drawing_path (str) – Path to the .json file.

Returns:

A SketchVG object representing the sketch.

Return type:

Sketch

sketchkit.datasets.opensketch module

class sketchkit.datasets.opensketch.OpenSketch(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

The OpenSketch dataset contains vector sketches represented with polyline curves.

md5_sum

MD5 checksum for dataset integrity verification.

Type:

str

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

Returns:

True if the dataset integrity is verified, False otherwise.

Return type:

bool

_download()[source]

Download the OpenSketch dataset.

Raises:

Exception – If download fails for any file.

_get_single(idx: int) Sketch[source]

Get a sketch by index.

If the sketch is not in memory, all sketches in the same category are loaded from disk.

Parameters:

idx (int) – Index of the sketch to retrieve.

Returns:

A Sketch object containing the drawing data as paths.

Return type:

Sketch

Raises:

IndexError – If index is out of range.

_load_all()[source]

Load all sketch data into memory if load_all is enabled.

Concatenates the sketch data from every category and split into a single NumPy array for faster access. Only loads if self.load_all is True.

_load_items_metadata()[source]

Load and cache metadata for all items in the dataset.

Creates a parquet file containing metadata for all sketches including category, split (train/valid/test), global ID, and sub ID within category. If metadata cache exists, loads from the cached file.

The metadata DataFrame contains columns:
  • category: Drawing category name

  • split: Data split (train/valid/test)

  • id: Global unique identifier across all sketches

gdrive_id = '1wZf3lkqSsqYTIdGqT0wrdryCcfl4f731'
md5_sum = 'e1e5ec89ac1345c87ae5d92ab90a0725'
metadata = ['id', 'file_name']
sketchkit.datasets.opensketch.parse_single_path(path_str: str) list[tuple[float, float]] | None[source]

Parses an SVG path ‘d’ attribute string into Bézier control points.

Converts various SVG path commands (Move, CubicBezier, Line) into a standardized list of control points for cubic Bézier curves.

Parameters:

path_str – The string from the ‘d’ attribute of an SVG <path> element.

Returns:

A list of (x, y) control points, or None if the path is invalid or empty.

sketchkit.datasets.opensketch.parse_svg(svg_file: str) list[list[list[tuple[float, float]]]][source]

Parses an entire SVG file to extract all path data.

Reads an SVG file, finds all <path> elements, and uses parse_single_path to convert them into a list of curves, where each curve is represented by its four control points.

Parameters:

svg_file – The file path to the SVG file.

Returns:

A nested list representing the sketch. The structure is: list of paths -> list of curves -> list of 4 control points. Returns an empty list if the file cannot be parsed.

Return type:

list[list[list[tuple[float, float]]]]
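The first step of this function, finding every <path> element and reading its ‘d’ attribute, can be sketched with the standard library alone. `extract_path_strings` is a hypothetical helper that takes SVG text instead of a file path, and it is not taken from the sketchkit source. Returning an empty list on a parse failure mirrors the behaviour documented above.

```python
import xml.etree.ElementTree as ET


def extract_path_strings(svg_text: str) -> list[str]:
    """Return the 'd' attribute of every <path> element, in document order."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return []  # unparseable input yields an empty result

    path_strings = []
    for elem in root.iter():
        # Tags look like '{http://www.w3.org/2000/svg}path'; drop the namespace part.
        local_name = elem.tag.rsplit("}", 1)[-1]
        d = elem.get("d")
        if local_name == "path" and d:
            path_strings.append(d)
    return path_strings
```

Each returned string would then be handed to a path parser such as parse_single_path.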

sketchkit.datasets.photosketching module

class sketchkit.datasets.photosketching.PhotoSketching(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

The PhotoSketching dataset loader.

PhotoSketching is a dataset for photo-to-sketch generation, containing paired images and their corresponding sketch representations. The dataset includes:
  • 1,000 outdoor photos

  • 5,000 SVG format sketches (5 sketches per photo, with stroke timestamps)

  • 15,000 PNG format rendered sketches (each SVG sketch rendered with 3 different line widths)

md5_sum

MD5 checksum for dataset integrity verification.

Type:

str

metadata

List of metadata column names for the dataset.

Type:

list

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

_download()[source]

Download the PhotoSketching dataset.

Downloads the all-in-one.zip archive from either Google Drive or CISLAB CDN mirror. Extracts the archive to the root directory.

_get_single(idx: int) Sketch[source]

Get a sketch by index.

_load_all()[source]

Load all sketch data into memory if load_all is enabled.

_load_items_metadata()[source]

Load and cache metadata for all items in the dataset.

Creates a pickle file containing metadata for all SVG sketches including file paths and unique IDs. If metadata cache exists, loads from the cached file.

_load_single_sketch(idx: int) Sketch[source]

Load a single sketch from disk.

extra_repr() str[source]

Return extra information for the string representation.

md5_sum = '8f1b44299a3023276f169b658793a1ab'
metadata = ['id', 'svg_path', 'png_path', 'photo_path']
sketchkit.datasets.photosketching.parse_photosketching_svg(svg_file)[source]

Parse an SVG file into width, height, path list, total segment number, and path attributes.

Parameters:

svg_file – The path to the SVG file.

Returns:

A tuple containing:
  • (width, height): The dimensions of the SVG canvas.

  • path_list: A list of paths, where each path contains strokes.

  • total_segment_num: The total number of segments in the SVG file (a segment here corresponds to a curve in sketchkit).

  • path_attributes: A list of dictionaries containing path attributes.

Return type:

tuple

Raises:

Exception – If unsupported elements like transforms or arcs are encountered.

sketchkit.datasets.quickdraw module

class sketchkit.datasets.quickdraw.QuickDraw(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

QuickDraw dataset loader and interface.

The QuickDraw dataset contains millions of drawings across 345 categories, collected from the Quick, Draw! game. Each drawing is represented as a sequence of strokes in stroke-3 format.

md5_sum

MD5 checksum for dataset integrity verification.

Type:

str

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

Returns:

True if the dataset integrity is verified, False otherwise.

Return type:

bool

_download()[source]

Download the QuickDraw dataset.

Downloads the categories.txt file and all category .npz files from either the original Google storage or CISLAB CDN mirror. Creates the root directory if it doesn’t exist.

Raises:

Exception – If download fails for any file.

_get_single(idx: int) Sketch[source]

Get a sketch by index.

If the sketch is not in memory, all sketches in the same category are loaded from disk.

Parameters:

idx (int) – Index of the sketch to retrieve.

Returns:

A Sketch object containing the drawing data as paths.

Return type:

Sketch

Raises:

IndexError – If index is out of range.

_load_all()[source]

Load all sketch data into memory if load_all is enabled.

Concatenates the sketch data from every category and split into a single NumPy array for faster access. Only loads if self.load_all is True.

_load_items_metadata()[source]

Load and cache metadata for all items in the dataset.

Creates a parquet file containing metadata for all sketches including category, split (train/valid/test), global ID, and sub ID within category. If metadata cache exists, loads from the cached file.

The metadata DataFrame contains columns:
  • category: Drawing category name

  • split: Data split (train/valid/test)

  • id: Global unique identifier across all sketches

  • sub_id: Identifier within the category and split

extra_repr() str[source]

Return extra information for the string representation.

This method can be overridden by subclasses to provide additional dataset-specific information in the string representation.

Returns:

Additional information to include in __repr__. Empty by default.

Return type:

str

md5_sum = '4d3c05094288833fccc0d28af45af4cd'
metadata = ['id', 'sub_id', 'category', 'split']
sketchkit.datasets.quickdraw.parse_stroke3_format(stroke_data)[source]

Convert stroke-3 format in QuickDraw into SketchKit’s Sketch format.

The stroke-3 format consists of points with (dx, dy, pen_state) where:
  • dx, dy: relative displacement from the previous point

  • pen_state: 0 for drawing (pen down), 1 for lifting (pen up)

This function converts the format into absolute coordinates and groups continuous drawing segments into paths, with each stroke represented as a cubic Bezier curve.

Parameters:

stroke_data (np.ndarray) – Array of shape (N_point, 3), where each point is in stroke-3 format (dx, dy, pen_state). Pen state is 0 (drawing) or 1 (lifting).

Returns:

A tuple containing:
  • path_list (list[Path]): List of Path objects, each containing stroke segments as cubic Bezier curves.

  • total_segment_num (int): Total number of stroke segments across all paths.

Return type:

tuple
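To make the stroke-3 conversion concrete, here is a minimal sketch of its first half: accumulating the relative offsets into absolute coordinates and cutting a new stroke after every pen-up point. `stroke3_to_polylines` is a hypothetical helper that returns plain point lists, not `Path` objects. It assumes the drawing starts at the origin and that a point with pen_state 1 is the last point of its stroke.

```python
def stroke3_to_polylines(stroke_data) -> list[list[tuple[float, float]]]:
    """Convert (dx, dy, pen_state) rows into polylines in absolute coordinates."""
    polylines: list[list[tuple[float, float]]] = []
    current: list[tuple[float, float]] = []
    x = y = 0.0  # assumed starting position
    for dx, dy, pen_state in stroke_data:
        x += dx
        y += dy
        current.append((x, y))
        if pen_state == 1:  # pen lifts after this point: close the stroke
            polylines.append(current)
            current = []
    if current:  # keep a final stroke that lacks a pen-up marker
        polylines.append(current)
    return polylines


def count_segments(polylines) -> int:
    """Count straight segments, i.e. adjacent point pairs, across all polylines."""
    return sum(max(len(p) - 1, 0) for p in polylines)
```

Each adjacent point pair in a polyline can then be turned into one cubic Bézier segment, which is what the segment count refers to.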

sketchkit.datasets.sketchx_pris module

class sketchkit.datasets.sketchx_pris.SketchXPRIS(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

SketchX-PRIS-Dataset loader and interface.

The SketchX-PRIS-Dataset contains 20,000 drawings across 25 categories, collected by SketchX (http://sketchx.eecs.qmul.ac.uk/) and PRIS (http://www.pris.net.cn/). Each drawing is represented as a sequence of strokes in stroke-3 format.

md5_sum

MD5 checksum for dataset integrity verification.

Type:

str

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

Returns:

True if the dataset integrity is verified, False otherwise.

Return type:

bool

_download(remove_zip: bool = True)[source]

Download SketchX-PRIS Dataset from GitHub and unzip it.

Raises:

Exception – If download fails for any file.

_get_single(idx)[source]

Retrieve a sketch item by index.

Parameters:

idx (int) – Index of the item to retrieve. Must be in range [0, len(dataset)).

Returns:

The sketch object at the specified index.

Return type:

Sketch

Raises:

IndexError – If idx is out of bounds.

_load_all()[source]

Load all sketch data into memory if load_all is enabled.

Concatenates the sketch data from every category and split into a single NumPy array for faster access. Only loads if self.load_all is True.

_load_items_metadata()[source]

Load and cache metadata for all items in the dataset.

Creates a parquet file containing metadata for all sketches including category, group ID, global ID, and sub ID within category. If metadata cache exists, loads from the cached file.

The metadata DataFrame contains columns:
  • category: Drawing category name

  • id: Global unique identifier across all sketches

  • sub_id: Identifier within the category and split

md5_sum = '67ac28f552b857fc84de4f60775a13a0'
metadata = ['id', 'sub_id', 'category']
sketchkit.datasets.sketchx_pris.parse_stroke3_format(stroke_data)[source]

Convert stroke-3 format in QuickDraw into SketchKit’s Sketch format.

The stroke-3 format consists of points with (dx, dy, pen_state) where:
  • dx, dy: relative displacement from the previous point

  • pen_state: 0 for drawing (pen down), 1 for lifting (pen up)

This function converts the format into absolute coordinates and groups continuous drawing segments into paths, with each stroke represented as a cubic Bezier curve.

Parameters:

stroke_data (np.ndarray) – Array of shape (N_point, 3), where each point is in stroke-3 format (dx, dy, pen_state). Pen state is 0 (drawing) or 1 (lifting).

Returns:

A tuple containing:
  • path_list (list[Path]): List of Path objects, each containing stroke segments as cubic Bezier curves.

  • total_segment_num (int): Total number of stroke segments across all paths.

Return type:

tuple

sketchkit.datasets.sketchy module

class sketchkit.datasets.sketchy.Sketchy(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

The Sketchy Database (https://sketchy.eye.gatech.edu/), SVG subset.

This loader expects the archive sketches-06-04.7z to extract into a sketchy/ directory whose immediate subfolders are category names, each containing multiple .svg files. Any checked.txt or invalid.txt files present in category folders are ignored.

After extraction, SVGs listed in per-class invalid.txt files are automatically deleted (this behavior can be disabled via prune_invalid=False in download_sketchy_database).

The data is represented using cubic Bézier curves consistent with SketchKit.

Example

>>> data = Sketchy()
>>> len(data)
20000
>>> s = data[0]
>>> isinstance(s, Sketch)
True
Parameters:
  • root – Root directory where the dataset will be stored.

  • load_all – Whether to load all data into memory at initialization.

  • prune – Whether to prune invalid SVGs during download.

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

Returns:

True if the dataset integrity is verified, False otherwise.

Return type:

bool

_download()[source]

Download and extract The Sketchy Database (SVG version).

_get_single(idx: int) Sketch[source]

Load a single sketch as a Sketch instance.

Parameters:

idx – Index into the dataset.

Returns:

A Sketch object composed of cubic Bézier `Path`/`Curve` objects.

_load_all()[source]

Load all dataset items into memory.

_load_items_metadata()[source]

Load metadata for all items in the dataset into a pandas DataFrame.

_load_sketch(idx: int) Sketch[source]

Load a single sketch by index.

filename: str = 'sketches-06-04.7z'
gdrive_id: str = '1Qr8HhjRuGqgDONHigGszyHG_awCstivo'
md5_sum = 'b0277b8d9413d7914971f1dae909f324'
metadata = ['id', 'pruned', 'class_name', 'filename', 'path', 'sketch_name']
sketchkit.datasets.sketchy.parse_single_path(path_str: str)[source]

Parse a single SVG path ‘d’ string into cubic Bézier control points.

The returned list follows SketchKit’s cubic Bézier convention: [start] + N * [ctrl1, ctrl2, end] where N is the number of segments.

Parameters:

path_str – The ‘d’ attribute string from an SVG <path>.

Returns:

A list of (x, y) control points if the path contains drawable segments. Returns None if the path is a trivial Move with no drawable segment.

Raises:

Exception – If unsupported SVG commands (e.g., Arc) are encountered.
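The control-point convention [start] + N * [ctrl1, ctrl2, end] means that consecutive segments share an endpoint. A small sketch of unpacking such a flat list into N four-point segments follows. `control_points_to_segments` is a hypothetical helper, not part of the module.

```python
Point = tuple[float, float]


def control_points_to_segments(points: list[Point]) -> list[tuple[Point, ...]]:
    """Split [start, c1, c2, end, c1, c2, end, ...] into 4-point cubic segments."""
    if len(points) < 4 or (len(points) - 1) % 3 != 0:
        raise ValueError("expected 1 + 3 * N control points with N >= 1")
    # Step by 3 so the end point of one segment is reused as the next start point.
    return [tuple(points[i : i + 4]) for i in range(0, len(points) - 1, 3)]
```

For example, 7 control points yield 2 segments, and the fourth point of the first segment equals the first point of the second.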

sketchkit.datasets.sketchy.parse_svg(svg_file: str)[source]

Parse an SVG file into SketchKit-compatible cubic Bézier curves.

The parser assumes:
  • No transforms on <path> elements.

  • The root <svg> includes a ‘viewBox’ with x=y=0.

Parameters:

svg_file – Path to an SVG file.

Returns:

(sketch_width, sketch_height), path_list, total_segment_num where:

  • (sketch_width, sketch_height) is derived from the viewBox.

  • path_list is a list of curves: list of (N_curves, 4, 2).

  • total_segment_num is the total number of cubic segments.

Raises:

AssertionError – If unsupported shapes or transforms are encountered.
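The viewBox assumption can be illustrated with a short standard-library sketch that reads the canvas size from the root element and checks that the origin is (0, 0). `viewbox_size` is a hypothetical helper operating on SVG text. Raising `ValueError` for a missing viewBox is a choice made here and is not documented behaviour of parse_svg.

```python
import xml.etree.ElementTree as ET


def viewbox_size(svg_text: str) -> tuple[float, float]:
    """Return (width, height) from the root viewBox, requiring min-x = min-y = 0."""
    root = ET.fromstring(svg_text)
    view_box = root.get("viewBox")
    if view_box is None:
        raise ValueError("root <svg> element has no viewBox attribute")
    # viewBox values may be separated by whitespace and/or commas.
    min_x, min_y, width, height = (float(v) for v in view_box.replace(",", " ").split())
    assert min_x == 0 and min_y == 0, "viewBox origin must be (0, 0)"
    return width, height
```

Requiring a zero origin means path coordinates can be used directly without translating them first.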

sketchkit.datasets.tracing_vs_freehand module

class sketchkit.datasets.tracing_vs_freehand.TracingVsFreehand(root: Path | str | None = None, load_all: bool = False, cislab_source: bool = True)[source]

Bases: SketchDataset

Tracing-vs-Freehand dataset loader.

This dataset contains sketches categorized as freehand drawings, registered drawings, and tracings, stored in SVG format. This loader handles automatic downloading, integrity checking, and parsing of these SVG files.

md5_sum

MD5 checksum for the extracted dataset directory.

Type:

str

URL

Download URL for the dataset zip file.

Type:

str

URL = 'https://cislab.hkust-gz.edu.cn/projects/sketchkit/datasets/TracingVsFreehand/sketch.zip'
_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Checks if the dataset is present and uncorrupted.

Returns:

True if the dataset’s integrity is verified, False otherwise.

Return type:

bool

_download()[source]

Downloads and extracts the dataset from the source URL.

Raises:

RuntimeError – If the download or extraction process fails.

_get_single(idx: int) Sketch[source]

Retrieves a single sketch from the dataset by its index.

Parameters:

idx (int) – The index of the sketch to retrieve.

Returns:

The sketch object at the specified index.

Return type:

Sketch

Raises:

IndexError – If the index is out of bounds.

_load_all()[source]

Loads all sketch data into memory if load_all is True.

_load_items_metadata()[source]

Scans the dataset directory to create and cache metadata.

This method generates a pandas DataFrame with metadata for each sketch and caches it as a .parquet file for fast subsequent loading.

Raises:

FileNotFoundError – If no SVG files are found after scanning.

extra_repr() str[source]

Returns a string with extra information about the dataset.

Returns:

A string containing dataset statistics by sketch type.

Return type:

str

md5_sum = 'd069ddc535281d50e271ee8bcbcd091e'
metadata = ['id', 'file_path', 'sketch_type', 'category', 'file_id']
sketchkit.datasets.tracing_vs_freehand._parse_svg_to_sketch(svg_path: str) Sketch[source]

Parses an SVG file and converts its content into a Sketch object.

Parameters:

svg_path (str) – The file path to the SVG file to be parsed.

Returns:

An object representing the parsed sketch. Returns an empty Sketch if the file cannot be parsed or is not found.

Return type:

Sketch

sketchkit.datasets.tu_berlin module

class sketchkit.datasets.tu_berlin.TUBerlin(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

The TU-Berlin dataset contains vector sketches represented with cubic Bézier curves across 250 categories. Each category contains 800 sketches.

md5_sum

MD5 checksum for dataset integrity verification.

Type:

str

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

Returns:

True if the dataset integrity is verified, False otherwise.

Return type:

bool
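
An MD5-based integrity check of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: file_md5 and check_integrity are hypothetical names, and the sketch hashes a single file, whereas the loader may hash a different artifact.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_integrity(path: Path, expected_md5: str) -> bool:
    """True only if the file exists and its digest matches expected_md5."""
    return path.is_file() and file_md5(path) == expected_md5
```

Reading in chunks keeps memory use constant for large archives. MD5 detects accidental corruption but is not suitable as a defence against deliberate tampering.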

_download()[source]

Download the TU-Berlin dataset.

Downloads the categories.txt file and all category .npz files from either the original Google storage or CISLAB CDN mirror. Creates the root directory if it doesn’t exist.

Raises:

Exception – If download fails for any file.

_get_single(idx: int) Sketch[source]

Get a sketch by index.

If the sketch is not in memory, all sketches in the same category are loaded from disk.

Parameters:

idx (int) – Index of the sketch to retrieve.

Returns:

A Sketch object containing the drawing data as paths.

Return type:

Sketch

Raises:

IndexError – If index is out of range.
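
The load-on-first-access behaviour described above can be sketched as a small cache keyed by category. LazyCategoryStore and its arguments are hypothetical and only illustrate the pattern; the real loader's internal state is not documented here.

```python
from typing import Callable, Dict, List, Tuple


class LazyCategoryStore:
    """Hypothetical helper: loads a whole category the first time it is needed."""

    def __init__(self, index: List[Tuple[str, int]], load_category: Callable[[str], list]):
        self.index = index                  # global idx -> (category, position in category)
        self.load_category = load_category  # reads every item of one category from disk
        self.cache: Dict[str, list] = {}

    def get(self, idx: int):
        if not 0 <= idx < len(self.index):
            raise IndexError(f"index {idx} out of range for {len(self.index)} items")
        category, pos = self.index[idx]
        if category not in self.cache:
            # First request for this category: load all of its items at once.
            self.cache[category] = self.load_category(category)
        return self.cache[category][pos]
```

Loading a full category per miss trades memory for fewer disk reads, which suits sequential access within one category.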

_load_all()[source]

Load all sketch data into memory if load_all is enabled.

Concatenates all sketch data from all categories and splits into a single numpy array for faster access. Only loads if self.load_all is True.

_load_items_metadata()[source]

Load and cache metadata for all items in the dataset.

Creates a parquet file containing metadata for all sketches including category, split (train/valid/test), global ID, and sub ID within category. If metadata cache exists, loads from the cached file.

The metadata DataFrame contains columns:

  • category: Drawing category name

  • split: Data split (train/valid/test)

  • id: Global unique identifier across all sketches

  • sub_id: Identifier within the category and split
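
The relationship between the global id and the per-group sub_id can be sketched as below. build_metadata is a hypothetical helper that returns plain dictionaries; the real loader stores the result in a pandas DataFrame.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_metadata(items: Iterable[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Assign a global id and a per-(category, split) sub_id to each item.

    `items` yields (category, split) pairs in dataset order.
    """
    counters: Dict[Tuple[str, str], int] = defaultdict(int)
    rows: List[Dict[str, object]] = []
    for global_id, (category, split) in enumerate(items):
        key = (category, split)
        rows.append({
            "id": global_id,          # unique across the whole dataset
            "sub_id": counters[key],  # restarts at 0 for each category/split pair
            "category": category,
            "split": split,
        })
        counters[key] += 1
    return rows
```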

extra_repr() str[source]

Return extra information for the string representation.

This method can be overridden by subclasses to provide additional dataset-specific information in the string representation.

Returns:

Additional information to include in __repr__. Empty by default.

Return type:

str

md5_sum = 'aa7d8ae9c8bf5f5cb6d28cee9741737c'
metadata = ['id', 'sub_id', 'category', 'split']
sketchkit.datasets.tu_berlin.parse_single_path(path_str)[source]
sketchkit.datasets.tu_berlin.parse_svg(svg_file)[source]
sketchkit.datasets.tu_berlin.parse_transform(svg_transform)[source]

Module contents

class sketchkit.datasets.ControlSketch(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

ControlSketch dataset loader (SketchDataset-style).

Directory structure (after download & extract):

controlsketch_sketches/
├── train/<category>/*.svg
├── validation/<category>/*.svg
└── test/<category>/*.svg

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

Returns:

True if the dataset integrity is verified, False otherwise.

Return type:

bool

_download()[source]

Download and extract the ControlSketch dataset.

Downloads the dataset archive from CISLAB CDN or Google Drive, extracts it into the dataset root, and removes the source file if successful.

_get_single(idx: int) Sketch[source]

Get one sketch by index.

Parameters:

idx (int) – Global index of the sketch.

Returns:

A Sketch object containing parsed paths.

Return type:

Sketch

Raises:

IndexError – If the index is out of range.

_load_all()[source]

Load all sketch data into memory if load_all is enabled.

Concatenates all sketch data from all categories and splits into a single numpy array for faster access. Only loads if self.load_all is True.

_load_items_metadata()[source]

Build metadata index for dataset items.

Scans the extracted dataset directory for train/validation/test splits, collects all SVG files with their category and split, and saves results into a cached Parquet file.

Raises:

FileNotFoundError – If the dataset directory is not found.
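
The directory scan described above can be sketched with pathlib. scan_svg_tree is a hypothetical function; it assumes the train/validation/test layout documented for this class, returns plain dictionaries, and leaves out the Parquet caching step.

```python
from pathlib import Path
from typing import Dict, List

SPLITS = ("train", "validation", "test")


def scan_svg_tree(base: Path) -> List[Dict[str, object]]:
    """Collect one metadata row per SVG under base/<split>/<category>/."""
    if not base.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {base}")
    rows: List[Dict[str, object]] = []
    for split in SPLITS:
        split_dir = base / split
        if not split_dir.is_dir():
            continue  # tolerate a missing split in this sketch
        for category_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            # Sorting makes ids reproducible across runs and platforms.
            for sub_id, svg in enumerate(sorted(category_dir.glob("*.svg"))):
                rows.append({
                    "id": len(rows),
                    "sub_id": sub_id,
                    "category": category_dir.name,
                    "split": split,
                    "filepath": str(svg.relative_to(base)),
                })
    return rows
```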

extra_repr() str[source]

Return dataset summary.

Returns:

A string with the number of categories and the sample counts for the train/validation/test splits.

Return type:

str
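
A summary string of this shape can be derived from metadata rows as sketched below. The summarize function and its output format are assumptions for illustration; the string actually produced by extra_repr is not documented here.

```python
from collections import Counter
from typing import Dict, List


def summarize(rows: List[Dict[str, str]]) -> str:
    """Build a one-line summary: category count plus per-split sample counts."""
    categories = {row["category"] for row in rows}
    per_split = Counter(row["split"] for row in rows)
    counts = ", ".join(
        f"{split}={per_split.get(split, 0)}"
        for split in ("train", "validation", "test")
    )
    return f"categories={len(categories)}, {counts}"
```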

md5_sum = 'c45dad0c08988df3d4036e85e5363e8a'
metadata = ['id', 'sub_id', 'category', 'split', 'filepath']
class sketchkit.datasets.GMUSketchCleanup(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

GMU Rough Sketch Cleanup dataset (SVG parsing version).

Directory structure (after extraction):

Benchmark_Dataset/
├── GT/
├── Rough/JPG/
├── Rough/PNG/
├── Rough/SVG/
└── sketch_tags.csv

  • Only the id column is mandatory; additional columns like split (GT/Rough) and file_path (SVG path) are optional.

  • _get_single: Reads from raw_data; if missing, parses the SVG from disk and caches it.

  • Supports downloading from the official website or CISLAB mirror.

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Checks the integrity of the cached dataset by comparing MD5 checksums.

_download()[source]

Downloads and extracts the GMU-Sketch-Cleanup dataset.

_get_single(idx: int) Sketch[source]

Retrieves a single item by index.

Parameters:

idx – Index of the item.

Returns:

A Sketch object.

Raises:

IndexError – If the index is out of range.

_load_all()[source]

Preloads all items into memory.

_load_items_metadata()[source]

Loads or generates metadata for the dataset.

_load_one(idx: int)[source]

Loads a single item from disk.

Parameters:

idx – Index of the item.

Returns:

Parsed SVG data.

extra_repr() str[source]

Returns additional string representation of the dataset.

md5_sum: str = '8dbea6e9cd42810c80f41595a633ff33'
metadata = ['id', 'split', 'file_path']
class sketchkit.datasets.OpenSketch(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

The OpenSketch dataset contains vector sketches represented with polyline curves.

md5_sum

MD5 checksum for dataset integrity verification.

Type:

str

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

Returns:

True if the dataset integrity is verified, False otherwise.

Return type:

bool

_download()[source]

Download the OpenSketch dataset.

Raises:

Exception – If download fails for any file.

_get_single(idx: int) Sketch[source]

Get a sketch by index.

If the sketch is not in memory, all sketches in the same category are loaded from disk.

Parameters:

idx (int) – Index of the sketch to retrieve.

Returns:

A Sketch object containing the drawing data as paths.

Return type:

Sketch

Raises:

IndexError – If index is out of range.

_load_all()[source]

Load all sketch data into memory if load_all is enabled.

Concatenates all sketch data from all categories and splits into a single numpy array for faster access. Only loads if self.load_all is True.

_load_items_metadata()[source]

Load and cache metadata for all items in the dataset.

Creates a parquet file containing metadata for all sketches including category, split (train/valid/test), global ID, and sub ID within category. If metadata cache exists, loads from the cached file.

The metadata DataFrame contains columns:

  • category: Drawing category name

  • split: Data split (train/valid/test)

  • id: Global unique identifier across all sketches

gdrive_id = '1wZf3lkqSsqYTIdGqT0wrdryCcfl4f731'
md5_sum = 'e1e5ec89ac1345c87ae5d92ab90a0725'
metadata = ['id', 'file_name']
class sketchkit.datasets.PhotoSketching(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

The PhotoSketching dataset loader.

PhotoSketching is a dataset for photo-to-sketch generation, containing paired images and their corresponding sketch representations. The dataset includes:

  • 1,000 outdoor photos

  • 5,000 SVG format sketches (5 sketches per photo, with stroke timestamps)

  • 15,000 PNG format rendered sketches (each SVG sketch rendered with 3 different line widths)

md5_sum

MD5 checksum for dataset integrity verification.

Type:

str

metadata

List of metadata column names for the dataset.

Type:

list

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

_download()[source]

Download the PhotoSketching dataset.

Downloads the all-in-one.zip archive from either Google Drive or CISLAB CDN mirror. Extracts the archive to the root directory.

_get_single(idx: int) Sketch[source]

Get a sketch by index.

_load_all()[source]

Load all sketch data into memory if load_all is enabled.

_load_items_metadata()[source]

Load and cache metadata for all items in the dataset.

Creates a pickle file containing metadata for all SVG sketches including file paths and unique IDs. If metadata cache exists, loads from the cached file.
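
The load-from-cache-or-rebuild pattern described above can be sketched with the standard pickle module. load_or_build and its build callback are hypothetical names, and the cache location is left to the caller.

```python
import pickle
from pathlib import Path
from typing import Callable, Dict, List

Rows = List[Dict[str, object]]


def load_or_build(cache_path: Path, build: Callable[[], Rows]) -> Rows:
    """Return cached metadata if present; otherwise build it and write the cache."""
    if cache_path.is_file():
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    rows = build()  # the expensive step, e.g. scanning the dataset directory
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_path, "wb") as fh:
        pickle.dump(rows, fh)
    return rows
```

Unpickling can execute arbitrary code, so a cache like this should only be loaded from a directory that the same program wrote.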

_load_single_sketch(idx: int) Sketch[source]

Load a single sketch from disk.

extra_repr() str[source]

Return extra information for the string representation.

md5_sum = '8f1b44299a3023276f169b658793a1ab'
metadata = ['id', 'svg_path', 'png_path', 'photo_path']
class sketchkit.datasets.QuickDraw(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

QuickDraw dataset loader and interface.

The QuickDraw dataset contains millions of drawings across 345 categories, collected from the Quick, Draw! game. Each drawing is represented as a sequence of strokes in stroke-3 format.
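
For readers unfamiliar with the stroke-3 format: each row is (dx, dy, pen_lifted), where dx and dy are offsets from the previous point and pen_lifted is 1 when the pen leaves the paper after that point. The sketch below converts such rows into absolute-coordinate polylines. The function name and the choice of the origin as starting point are assumptions for illustration, and the conversion sketchkit performs internally may differ.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def stroke3_to_polylines(stroke3: Sequence[Sequence[float]]) -> List[List[Point]]:
    """Convert (dx, dy, pen_lifted) rows into absolute-coordinate polylines."""
    polylines: List[List[Point]] = []
    current: List[Point] = []
    x = y = 0.0  # assumed starting position
    for dx, dy, pen_lifted in stroke3:
        x += dx
        y += dy
        current.append((x, y))
        if pen_lifted:  # the pen leaves the paper after this point
            polylines.append(current)
            current = []
    if current:  # trailing stroke without a final pen-up flag
        polylines.append(current)
    return polylines
```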

md5_sum

MD5 checksum for dataset integrity verification.

Type:

str

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

Returns:

True if the dataset integrity is verified, False otherwise.

Return type:

bool

_download()[source]

Download the QuickDraw dataset.

Downloads the categories.txt file and all category .npz files from either the original Google storage or CISLAB CDN mirror. Creates the root directory if it doesn’t exist.

Raises:

Exception – If download fails for any file.

_get_single(idx: int) Sketch[source]

Get a sketch by index.

If the sketch is not in memory, all sketches in the same category are loaded from disk.

Parameters:

idx (int) – Index of the sketch to retrieve.

Returns:

A Sketch object containing the drawing data as paths.

Return type:

Sketch

Raises:

IndexError – If index is out of range.

_load_all()[source]

Load all sketch data into memory if load_all is enabled.

Concatenates all sketch data from all categories and splits into a single numpy array for faster access. Only loads if self.load_all is True.

_load_items_metadata()[source]

Load and cache metadata for all items in the dataset.

Creates a parquet file containing metadata for all sketches including category, split (train/valid/test), global ID, and sub ID within category. If metadata cache exists, loads from the cached file.

The metadata DataFrame contains columns:

  • category: Drawing category name

  • split: Data split (train/valid/test)

  • id: Global unique identifier across all sketches

  • sub_id: Identifier within the category and split

extra_repr() str[source]

Return extra information for the string representation.

This method can be overridden by subclasses to provide additional dataset-specific information in the string representation.

Returns:

Additional information to include in __repr__. Empty by default.

Return type:

str

md5_sum = '4d3c05094288833fccc0d28af45af4cd'
metadata = ['id', 'sub_id', 'category', 'split']
class sketchkit.datasets.SketchXPRIS(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

SketchX-PRIS-Dataset loader and interface.

The SketchX-PRIS-Dataset contains 20,000 drawings across 25 categories, collected by SketchX (http://sketchx.eecs.qmul.ac.uk/) and PRIS (http://www.pris.net.cn/). Each drawing is represented as a sequence of strokes in stroke-3 format.

md5_sum

MD5 checksum for dataset integrity verification.

Type:

str

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

Returns:

True if the dataset integrity is verified, False otherwise.

Return type:

bool

_download(remove_zip: bool = True)[source]

Download SketchX-PRIS Dataset from GitHub and unzip it.

Raises:

Exception – If download fails for any file.

_get_single(idx)[source]

Retrieve a sketch item by index.

Parameters:

idx (int) – Index of the item to retrieve. Must be in range [0, len(dataset)).

Returns:

The sketch object at the specified index.

Return type:

Sketch

Raises:

IndexError – If idx is out of bounds.

_load_all()[source]

Load all sketch data into memory if load_all is enabled.

Concatenates all sketch data from all categories and splits into a single numpy array for faster access. Only loads if self.load_all is True.

_load_items_metadata()[source]

Load and cache metadata for all items in the dataset.

Creates a parquet file containing metadata for all sketches including category, group ID, global ID, and sub ID within category. If metadata cache exists, loads from the cached file.

The metadata DataFrame contains columns:

  • category: Drawing category name

  • id: Global unique identifier across all sketches

  • sub_id: Identifier within the category

md5_sum = '67ac28f552b857fc84de4f60775a13a0'
metadata = ['id', 'sub_id', 'category']
class sketchkit.datasets.Sketchy(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

The Sketchy Database (https://sketchy.eye.gatech.edu/), SVG subset.

This loader expects the archive sketches-06-04.7z to extract into a sketchy/ directory whose immediate subfolders are category names, each containing multiple .svg files. Any checked.txt or invalid.txt files present in category folders are ignored.

After extraction, SVGs listed in per-class invalid.txt files are automatically deleted (this behavior can be disabled via prune_invalid=False in download_sketchy_database).
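
The pruning step described above can be sketched as follows. This is an illustration under loud assumptions: prune_invalid is a hypothetical helper, and the format of invalid.txt (one sketch name per line, with or without the .svg extension) is assumed rather than taken from the Sketchy documentation.

```python
from pathlib import Path
from typing import List


def prune_invalid(category_dir: Path, list_name: str = "invalid.txt") -> List[str]:
    """Delete SVGs named in category_dir/invalid.txt and return the removed file names.

    Assumed format: one sketch name per line, extension optional.
    """
    listing = category_dir / list_name
    if not listing.is_file():
        return []
    removed: List[str] = []
    for line in listing.read_text().splitlines():
        name = line.strip()
        if not name:
            continue  # skip blank lines
        svg = category_dir / f"{Path(name).stem}.svg"
        if svg.is_file():
            svg.unlink()
            removed.append(svg.name)
    return removed
```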

The data is represented using cubic Bézier curves consistent with SketchKit.

Example

>>> data = Sketchy()
>>> len(data)
20000
>>> s = data[0]
>>> isinstance(s, Sketch)
True
Parameters:
  • root – Root directory where the dataset will be stored.

  • load_all – Whether to load all data into memory at initialization.

  • prune – Whether to prune invalid SVGs during download.

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

Returns:

True if the dataset integrity is verified, False otherwise.

Return type:

bool

_download()[source]

Download and extract The Sketchy Database (SVG version).

_get_single(idx: int) Sketch[source]

Load a single sketch as a Sketch instance.

Parameters:

idx – Index into the dataset.

Returns:

A Sketch object composed of cubic Bézier Path/Curve objects.

_load_all()[source]

Load all dataset items into memory.

_load_items_metadata()[source]

Load metadata for all items in the dataset into a pandas DataFrame.

_load_sketch(idx: int) Sketch[source]

Load a single sketch by index.

filename: str = 'sketches-06-04.7z'
gdrive_id: str = '1Qr8HhjRuGqgDONHigGszyHG_awCstivo'
md5_sum = 'b0277b8d9413d7914971f1dae909f324'
metadata = ['id', 'pruned', 'class_name', 'filename', 'path', 'sketch_name']
class sketchkit.datasets.TUBerlin(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

The TU-Berlin dataset contains vector sketches represented with cubic Bézier curves across 250 categories. Each category contains 800 sketches.

md5_sum

MD5 checksum for dataset integrity verification.

Type:

str

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

Returns:

True if the dataset integrity is verified, False otherwise.

Return type:

bool

_download()[source]

Download the TU-Berlin dataset.

Downloads the categories.txt file and all category .npz files from either the original Google storage or CISLAB CDN mirror. Creates the root directory if it doesn’t exist.

Raises:

Exception – If download fails for any file.

_get_single(idx: int) Sketch[source]

Get a sketch by index.

If the sketch is not in memory, all sketches in the same category are loaded from disk.

Parameters:

idx (int) – Index of the sketch to retrieve.

Returns:

A Sketch object containing the drawing data as paths.

Return type:

Sketch

Raises:

IndexError – If index is out of range.

_load_all()[source]

Load all sketch data into memory if load_all is enabled.

Concatenates all sketch data from all categories and splits into a single numpy array for faster access. Only loads if self.load_all is True.

_load_items_metadata()[source]

Load and cache metadata for all items in the dataset.

Creates a parquet file containing metadata for all sketches including category, split (train/valid/test), global ID, and sub ID within category. If metadata cache exists, loads from the cached file.

The metadata DataFrame contains columns:

  • category: Drawing category name

  • split: Data split (train/valid/test)

  • id: Global unique identifier across all sketches

  • sub_id: Identifier within the category and split

extra_repr() str[source]

Return extra information for the string representation.

This method can be overridden by subclasses to provide additional dataset-specific information in the string representation.

Returns:

Additional information to include in __repr__. Empty by default.

Return type:

str

md5_sum = 'aa7d8ae9c8bf5f5cb6d28cee9741737c'
metadata = ['id', 'sub_id', 'category', 'split']
class sketchkit.datasets.TracingVsFreehand(root: Path | str | None = None, load_all: bool = False, cislab_source: bool = True)[source]

Bases: SketchDataset

Tracing-vs-Freehand dataset loader.

This dataset contains sketches categorized as freehand drawings, registered drawings, and tracings, stored in SVG format. This loader handles automatic downloading, integrity checking, and parsing of these SVG files.

md5_sum

MD5 checksum for the extracted dataset directory.

Type:

str

URL

Download URL for the dataset zip file.

Type:

str

URL = 'https://cislab.hkust-gz.edu.cn/projects/sketchkit/datasets/TracingVsFreehand/sketch.zip'
_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Checks if the dataset is present and uncorrupted.

Returns:

True if the dataset’s integrity is verified, False otherwise.

Return type:

bool

_download()[source]

Downloads and extracts the dataset from the source URL.

Raises:

RuntimeError – If the download or extraction process fails.
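
A download-and-extract step that reports failures as RuntimeError can be sketched with the standard library. download_and_extract, the archive_name default, and the choice to delete the archive afterwards are assumptions for illustration, not the loader's actual code.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url: str, root: Path, archive_name: str = "sketch.zip") -> None:
    """Fetch a zip archive into root, unpack it there, then delete the archive."""
    root.mkdir(parents=True, exist_ok=True)
    archive = root / archive_name
    try:
        # Stream the response to disk instead of holding it in memory.
        with urllib.request.urlopen(url) as response, open(archive, "wb") as out:
            shutil.copyfileobj(response, out)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(root)
    except (OSError, zipfile.BadZipFile) as exc:
        # urllib.error.URLError is a subclass of OSError, so network errors land here too.
        raise RuntimeError(f"download or extraction failed: {exc}") from exc
    finally:
        archive.unlink(missing_ok=True)  # remove the archive on success and on failure
```

Chaining with `from exc` keeps the original network or archive error visible in the traceback while giving callers a single exception type to handle.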

_get_single(idx: int) Sketch[source]

Retrieves a single sketch from the dataset by its index.

Parameters:

idx (int) – The index of the sketch to retrieve.

Returns:

The sketch object at the specified index.

Return type:

Sketch

Raises:

IndexError – If the index is out of bounds.

_load_all()[source]

Loads all sketch data into memory if load_all is True.

_load_items_metadata()[source]

Scans the dataset directory to create and cache metadata.

This method generates a pandas DataFrame with metadata for each sketch and caches it as a .parquet file for fast subsequent loading.

Raises:

FileNotFoundError – If no SVG files are found after scanning.

extra_repr() str[source]

Returns a string with extra information about the dataset.

Returns:

A string containing dataset statistics by sketch type.

Return type:

str

md5_sum = 'd069ddc535281d50e271ee8bcbcd091e'
metadata = ['id', 'file_path', 'sketch_type', 'category', 'file_id']
class sketchkit.datasets.hzySketch(root: str | Path | None = None, load_all: bool = False, cislab_source: bool = False, skip_integrity_check: bool = False)[source]

Bases: SketchDataset

The hzy dataset, which contains the drawing process of high-quality anime line art.

Parameters:

data_base – The directory in which the dataset is stored. The dataset is downloaded automatically if it does not exist.

_abc_impl = <_abc._abc_data object>
_check_integrity() bool[source]

Check the integrity of the cached dataset using MD5 checksum.

Returns:

True if the dataset integrity is verified, False otherwise.

Return type:

bool

_download()[source]

Download the hzy dataset.

Downloads the categories.txt file and all category .npz files from either the original Google storage or CISLAB CDN mirror. Creates the root directory if it doesn’t exist.

Raises:

Exception – If download fails for any file.

_get_single(idx: int) Sketch[source]

Retrieves a sketch by its index.

Parameters:

idx – The index of the sketch to retrieve.

Returns:

The parsed sketch as a Sketch object.

_load_all()[source]

Load all dataset items into memory.

This method should load all dataset items into memory for faster access. It’s called when load_all is True during initialization or when explicitly requested.

Note

The loaded data should be stored in self.raw_data for later access by __getitem__.

_load_items_metadata()[source]

Load metadata for all items in the dataset into a pandas DataFrame.

This method should populate the items_metadata DataFrame with information about all available items in the dataset. The metadata is used to provide quick access to item information without loading the actual data.

Note

This method is called after successful integrity check and should populate self.items_metadata as a pandas DataFrame with appropriate columns for dataset-specific metadata.

md5_sum = 'bb0d0fdc6eefa2e2ab41f990cc8cd5f4'