.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/string_handling.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials_string_handling.py:


Handling strings and dates with skrub
=====================================

This example demonstrates how skrub can be used to preprocess datasets
with strings or dates for TabICL to make better predictions.

.. GENERATED FROM PYTHON SOURCE LINES 10-27

Preparing the dataset
-------------------------------------

Real-world datasets often contain complex heterogeneous data that benefits from
more sophisticated preprocessing. For these scenarios, we recommend
`skrub `_, a powerful library designed specifically for advanced tabular data
preparation.

Why use skrub?

- Handles diverse data types (numerical, categorical, text, datetime, etc.)
- Provides robust preprocessing for dirty data
- Offers sophisticated feature engineering capabilities
- Supports multi-table integration and joins.

To install skrub, use ``pip install -U skrub``.

TabICL can handle numerical and categorical columns natively,
but will treat string columns as categorical.
Here, we show how using skrub can provide TabICL with better string encoding.
We use the "open payments" dataset.

.. GENERATED FROM PYTHON SOURCE LINES 27-47

.. code-block:: Python

    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    import skrub.datasets
    from skrub import TableVectorizer, DatetimeEncoder, StringEncoder
    import pandas as pd
    import time
    import numpy as np
    from tabicl import TabICLClassifier

    data = skrub.datasets.fetch_open_payments()
    X, y = data.X, data.y
    rng = np.random.RandomState(0)
    subset_indices = rng.permutation(y.shape[0])[:600]
    X, y = X.iloc[subset_indices], y[subset_indices]  # subsample for fast experiments
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)
    X

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Downloading 'open_payments' from https://github.com/skrub-data/skrub-data-files/raw/refs/heads/main/open_payments.zip (attempt 1/3)

.. raw:: html

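    <table>
      <thead>
        <tr>
          <th></th>
          <th>Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name</th>
          <th>Dispute_Status_for_Publication</th>
          <th>Name_of_Associated_Covered_Device_or_Medical_Supply1</th>
          <th>Name_of_Associated_Covered_Drug_or_Biological1</th>
          <th>Physician_Specialty</th>
        </tr>
      </thead>
      <tbody>
        <tr><th>17550</th><td>Stryker Corporation</td><td>No</td><td>KNEES</td><td>NaN</td><td>Podiatric Medicine &amp; Surgery Service Providers...</td></tr>
        <tr><th>60387</th><td>Pfizer Inc.</td><td>No</td><td>NaN</td><td>TOVIAZ</td><td>Allopathic &amp; Osteopathic Physicians|Internal M...</td></tr>
        <tr><th>58324</th><td>Sanofi Pasteur Inc.</td><td>No</td><td>NaN</td><td>SKLICE</td><td>Allopathic &amp; Osteopathic Physicians|Orthopaedi...</td></tr>
        <tr><th>4438</th><td>Cornerstone Therapeutics Inc.</td><td>No</td><td>NaN</td><td>CARDENE IV 40mg Sodium Chloride</td><td>Allopathic &amp; Osteopathic Physicians|Physical M...</td></tr>
        <tr><th>10539</th><td>Covidien LP</td><td>No</td><td>Flow Diversion</td><td>NaN</td><td>Allopathic &amp; Osteopathic Physicians|Psychiatry...</td></tr>
        <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
        <tr><th>31423</th><td>Terumo Cardiovascular Systems Corporation</td><td>No</td><td>CARDIOVASCULAR PRODUCT</td><td>NaN</td><td>Allopathic &amp; Osteopathic Physicians|Surgery</td></tr>
        <tr><th>38147</th><td>Smith &amp; Nephew, Inc.</td><td>No</td><td>Negative Pressure Wound Therapy</td><td>NaN</td><td>Allopathic &amp; Osteopathic Physicians|Anesthesio...</td></tr>
        <tr><th>21113</th><td>Mylan Specialty L.P.</td><td>No</td><td>NaN</td><td>Perforomist</td><td>Chiropractic Providers|Chiropractor</td></tr>
        <tr><th>25422</th><td>AstraZeneca Pharmaceuticals LP</td><td>No</td><td>NaN</td><td>NaN</td><td>Allopathic &amp; Osteopathic Physicians|Obstetrics...</td></tr>
        <tr><th>7764</th><td>Celgene Corporation</td><td>No</td><td>NaN</td><td>Revlimid</td><td>Allopathic &amp; Osteopathic Physicians|Obstetrics...</td></tr>
      </tbody>
    </table>
    <p>600 rows × 5 columns</p>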
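
As a rough illustration of what treating string columns as categorical implies,
the sketch below assigns one integer code per distinct string.
The helper ``ordinal_codes`` and the variant spelling ``"Pfizer Inc"`` are hypothetical
and only serve the illustration; this is not TabICL's internal encoding.

.. code-block:: Python

    def ordinal_codes(values):
        """Assign an integer code to each distinct string, in order of first appearance."""
        mapping = {}
        codes = []
        for value in values:
            if value not in mapping:
                mapping[value] = len(mapping)
            codes.append(mapping[value])
        return codes, mapping


    # Two names are taken from the table above; "Pfizer Inc" (no trailing period)
    # is a made-up near-duplicate added for the illustration.
    names = ["Pfizer Inc.", "Pfizer Inc", "Stryker Corporation", "Pfizer Inc."]
    codes, mapping = ordinal_codes(names)
    print(codes)

Under such an exact-match encoding, the two spellings of the same company
receive unrelated codes, just like two entirely different companies.
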

.. GENERATED FROM PYTHON SOURCE LINES 48-56

TabICL without skrub
-------------------------------------

When string columns are used with TabICL directly, TabICL will interpret them as categorical columns.
This means that TabICL doesn't know which strings are similar; it only knows which strings are identical.
It can still learn from them if they repeat in the dataset,
but may struggle when the number of different strings is high.
Note that the runtime here might include the time for downloading the checkpoint.

.. GENERATED FROM PYTHON SOURCE LINES 56-62

.. code-block:: Python

    clf = TabICLClassifier(n_estimators=1, device="cpu")  # 1 estimator for speed
    start_time = time.time()
    scores = cross_val_score(clf, X, y, cv=2, scoring="roc_auc_ovr")
    print(f"ROC AUC without skrub: {scores.mean():.3f} (+/- {scores.std():.3f}), time: {time.time()-start_time:.1f} s")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    INFO: You are downloading 'tabicl-classifier-v2-20260212.ckpt', the latest best-performing version, used in our TabICLv2 paper.
    Checkpoint 'tabicl-classifier-v2-20260212.ckpt' not cached.
    Downloading from Hugging Face Hub (jingang/TabICL).
    ROC AUC without skrub: 0.606 (+/- 0.079), time: 108.0 s

.. GENERATED FROM PYTHON SOURCE LINES 63-77

TabICL with skrub
-------------------------------------

With skrub, we can embed high-cardinality string columns into numerical features
using semantics-aware methods.
The `TableVectorizer `_ applies different conversions to columns of a dataframe.
Here, for efficiency reasons, we use the StringEncoder with lower-dimensional embeddings
for all string columns with at least 10 distinct values.
For lower-cardinality string columns, we use "passthrough", so they are
directly forwarded to TabICL, which then treats them as categoricals.
(Without "passthrough", they would be one-hot encoded by default,
which is not the recommended way to handle categoricals for TabICL.)
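
To illustrate why a similarity-aware encoding can help, the following minimal sketch
compares strings by their shared character trigrams (Jaccard similarity).
It is a simplified stand-in written in plain Python, not skrub's ``StringEncoder``;
the helpers ``char_ngrams`` and ``jaccard`` and the variant spelling ``"Pfizer Inc"``
are hypothetical.

.. code-block:: Python

    def char_ngrams(text, n=3):
        """Return the set of lowercase character n-grams of a string."""
        text = text.lower()
        return {text[i:i + n] for i in range(len(text) - n + 1)}


    def jaccard(a, b, n=3):
        """Jaccard similarity between the character n-gram sets of two strings."""
        grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
        union = grams_a | grams_b
        if not union:
            return 0.0  # both strings are shorter than n characters
        return len(grams_a & grams_b) / len(union)


    # "Pfizer Inc" (no trailing period) is a made-up near-duplicate spelling.
    near_duplicate = jaccard("Pfizer Inc.", "Pfizer Inc")
    unrelated = jaccard("Pfizer Inc.", "Stryker Corporation")
    print(f"near-duplicate: {near_duplicate:.2f}, unrelated: {unrelated:.2f}")

Near-duplicate spellings share most of their trigrams and score much higher
than unrelated names, whereas exact matching sees both pairs as simply different.
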
We also provide advanced settings for the DatetimeEncoder,
even though our example dataset here does not contain dates.

.. GENERATED FROM PYTHON SOURCE LINES 77-91

.. code-block:: Python

    pipeline = make_pipeline(
        TableVectorizer(
            low_cardinality="passthrough",  # let TabICL handle low-cardinality categories
            cardinality_threshold=10,
            high_cardinality=StringEncoder(n_components=10),  # fewer components for speed
            datetime=DatetimeEncoder(add_weekday=True, add_day_of_year=True, periodic_encoding='circular'),
        ),
        TabICLClassifier(n_estimators=1, device="cpu")  # 1 estimator for speed
    )
    start_time = time.time()
    scores = cross_val_score(pipeline, X, y, cv=2, scoring="roc_auc_ovr")
    print(f"ROC AUC with skrub: {scores.mean():.3f} (+/- {scores.std():.3f}), time: {time.time()-start_time:.1f} s")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ROC AUC with skrub: 0.702 (+/- 0.084), time: 2.2 s

.. GENERATED FROM PYTHON SOURCE LINES 92-94

Overall, skrub preprocessing helps TabICL achieve a higher ROC AUC on this dataset (0.702 versus 0.606).
Encoding strings into multiple columns adds features and can increase the runtime.
The two timings above are not directly comparable, because the first run
may include the time for downloading the checkpoint.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (1 minutes 50.755 seconds)


.. _sphx_glr_download_tutorials_string_handling.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: string_handling.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: string_handling.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: string_handling.zip `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_