.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/string_handling.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials_string_handling.py:


Handling strings and dates with skrub
=====================================

This example demonstrates how skrub can be used to preprocess datasets with strings or dates
for TabICL to make better predictions.

.. GENERATED FROM PYTHON SOURCE LINES 10-27

Preparing the dataset
-------------------------------------

Real-world datasets often contain complex heterogeneous data that benefits from more
sophisticated preprocessing. For these scenarios, we recommend `skrub `_,
a powerful library designed specifically for advanced tabular data preparation.

Why use skrub?

- Handles diverse data types (numerical, categorical, text, datetime, etc.)
- Provides robust preprocessing for dirty data
- Offers sophisticated feature engineering capabilities
- Supports multi-table integration and joins.

To install skrub, use ``pip install -U skrub``.

TabICL can handle numerical and categorical columns natively,
but will treat string columns as categorical.
Here, we show how using skrub can provide TabICL with better string encoding.
We use the "open payments" dataset.

.. GENERATED FROM PYTHON SOURCE LINES 27-47

.. code-block:: Python

    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    import skrub.datasets
    from skrub import TableVectorizer, DatetimeEncoder, StringEncoder
    import pandas as pd
    import time
    import numpy as np
    from tabicl import TabICLClassifier

    data = skrub.datasets.fetch_open_payments()
    X, y = data.X, data.y
    rng = np.random.RandomState(0)
    subset_indices = rng.permutation(y.shape[0])[:600]
    X, y = X.iloc[subset_indices], y[subset_indices]  # subsample for fast experiments
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)
    X

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Downloading 'open_payments' from https://github.com/skrub-data/skrub-data-files/raw/refs/heads/main/open_payments.zip (attempt 1/3)

.. raw:: html

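    <div>
    <table>
      <thead>
        <tr>
          <th></th>
          <th>Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name</th>
          <th>Dispute_Status_for_Publication</th>
          <th>Name_of_Associated_Covered_Device_or_Medical_Supply1</th>
          <th>Name_of_Associated_Covered_Drug_or_Biological1</th>
          <th>Physician_Specialty</th>
        </tr>
      </thead>
      <tbody>
        <tr><th>17550</th><td>Stryker Corporation</td><td>No</td><td>KNEES</td><td>NaN</td><td>Podiatric Medicine &amp; Surgery Service Providers...</td></tr>
        <tr><th>60387</th><td>Pfizer Inc.</td><td>No</td><td>NaN</td><td>TOVIAZ</td><td>Allopathic &amp; Osteopathic Physicians|Internal M...</td></tr>
        <tr><th>58324</th><td>Sanofi Pasteur Inc.</td><td>No</td><td>NaN</td><td>SKLICE</td><td>Allopathic &amp; Osteopathic Physicians|Orthopaedi...</td></tr>
        <tr><th>4438</th><td>Cornerstone Therapeutics Inc.</td><td>No</td><td>NaN</td><td>CARDENE IV 40mg Sodium Chloride</td><td>Allopathic &amp; Osteopathic Physicians|Physical M...</td></tr>
        <tr><th>10539</th><td>Covidien LP</td><td>No</td><td>Flow Diversion</td><td>NaN</td><td>Allopathic &amp; Osteopathic Physicians|Psychiatry...</td></tr>
        <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
        <tr><th>31423</th><td>Terumo Cardiovascular Systems Corporation</td><td>No</td><td>CARDIOVASCULAR PRODUCT</td><td>NaN</td><td>Allopathic &amp; Osteopathic Physicians|Surgery</td></tr>
        <tr><th>38147</th><td>Smith &amp; Nephew, Inc.</td><td>No</td><td>Negative Pressure Wound Therapy</td><td>NaN</td><td>Allopathic &amp; Osteopathic Physicians|Anesthesio...</td></tr>
        <tr><th>21113</th><td>Mylan Specialty L.P.</td><td>No</td><td>NaN</td><td>Perforomist</td><td>Chiropractic Providers|Chiropractor</td></tr>
        <tr><th>25422</th><td>AstraZeneca Pharmaceuticals LP</td><td>No</td><td>NaN</td><td>NaN</td><td>Allopathic &amp; Osteopathic Physicians|Obstetrics...</td></tr>
        <tr><th>7764</th><td>Celgene Corporation</td><td>No</td><td>NaN</td><td>Revlimid</td><td>Allopathic &amp; Osteopathic Physicians|Obstetrics...</td></tr>
      </tbody>
    </table>
    <p>600 rows × 5 columns</p>
    </div>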
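
To see why treating strings as plain categories can lose information, the
following minimal sketch compares an exact-equality check with a simple
character n-gram overlap score. It is a pure-Python illustration only, not how
skrub or TabICL encode strings. The helper names ``char_ngrams`` and
``ngram_similarity`` and the near-duplicate spelling ``"Pfizer Inc"`` are
made up for this example.

.. code-block:: Python

    def char_ngrams(text, n=3):
        """Return the set of lowercase character n-grams of ``text``."""
        padded = f" {text.lower()} "  # pad so word boundaries form n-grams too
        return {padded[i:i + n] for i in range(len(padded) - n + 1)}


    def ngram_similarity(a, b, n=3):
        """Jaccard similarity between the character n-gram sets of two strings."""
        grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
        union = grams_a | grams_b
        if not union:
            return 0.0
        return len(grams_a & grams_b) / len(union)


    # "Pfizer Inc" is a hypothetical near-duplicate spelling of "Pfizer Inc."
    names = ["Pfizer Inc.", "Pfizer Inc", "Stryker Corporation"]

    # Categorical view: only exact equality is visible.
    print(names[0] == names[1])

    # n-gram view: near-duplicates score higher than unrelated names.
    print(ngram_similarity(names[0], names[1]))
    print(ngram_similarity(names[0], names[2]))

A categorical encoding treats the two Pfizer spellings as unrelated values,
whereas an overlap-based score places them close together.
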



.. GENERATED FROM PYTHON SOURCE LINES 48-56

TabICL without skrub
-------------------------------------

When string columns are used with TabICL directly,
TabICL will interpret them as categorical columns.
This means that TabICL does not know which strings are similar;
it only knows which strings are identical.
It can still learn from them if they repeat in the dataset,
but may struggle when the number of distinct strings is high.

Note that the runtime here might include the time for downloading the checkpoint.

.. GENERATED FROM PYTHON SOURCE LINES 56-62

.. code-block:: Python

    clf = TabICLClassifier(n_estimators=1, device="cpu")  # 1 estimator for speed
    start_time = time.time()
    scores = cross_val_score(clf, X, y, cv=2, scoring="roc_auc_ovr")
    print(f"ROC AUC without skrub: {scores.mean():.3f} (+/- {scores.std():.3f}), time: {time.time()-start_time:.1f} s")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    INFO: You are downloading 'tabicl-classifier-v2-20260212.ckpt', the latest best-performing version, used in our TabICLv2 paper.
    Checkpoint 'tabicl-classifier-v2-20260212.ckpt' not cached.
    Downloading from Hugging Face Hub (jingang/TabICL).
    ROC AUC without skrub: 0.606 (+/- 0.079), time: 108.0 s

.. GENERATED FROM PYTHON SOURCE LINES 63-77

TabICL with skrub
-------------------------------------

With skrub, we can embed high-cardinality string columns
into numerical features using semantics-aware methods.
The `TableVectorizer `_ applies different conversions
to the columns of a dataframe.
Here, for efficiency reasons, we use the StringEncoder with
lower-dimensional embeddings for all string columns with at least 10 distinct values.
For lower-cardinality string columns, we use "passthrough",
so they are forwarded directly to TabICL, which then treats them as categoricals.
(Without "passthrough", they would be one-hot encoded by default,
which is not the recommended way to handle categoricals for TabICL.)
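
The routing rule described above can be sketched in plain Python as follows.
This is an illustration under the assumption stated in the text (a column with
at least 10 distinct values counts as high-cardinality). The function
``split_by_cardinality`` and the toy columns are hypothetical, and skrub's
actual column selection may differ in details such as the handling of
missing values.

.. code-block:: Python

    def split_by_cardinality(columns, threshold=10):
        """Split column names into (low, high) lists by number of distinct values.

        ``columns`` maps a column name to a list of values; ``None`` is
        treated as missing and ignored when counting distinct values.
        """
        low, high = [], []
        for name, values in columns.items():
            n_distinct = len({v for v in values if v is not None})
            if n_distinct >= threshold:
                high.append(name)  # would be embedded (StringEncoder in the text)
            else:
                low.append(name)   # would be passed through as a categorical
        return low, high


    # Toy columns (hypothetical data, not the open payments dataset).
    toy = {
        "dispute_status": ["No", "No", "Yes", "No"] * 5,
        "company_name": [f"Company {i}" for i in range(20)],
    }
    low, high = split_by_cardinality(toy, threshold=10)
    print("passthrough:", low)
    print("embedded:", high)

Keeping low-cardinality columns as categoricals avoids expanding them into
many columns, while only the columns with many distinct strings are embedded.
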
We also pass non-default settings to the DatetimeEncoder for illustration,
even though this example dataset does not contain dates.

.. GENERATED FROM PYTHON SOURCE LINES 77-91

.. code-block:: Python

    pipeline = make_pipeline(
        TableVectorizer(
            low_cardinality="passthrough",  # let TabICL handle low-cardinality categories
            cardinality_threshold=10,
            high_cardinality=StringEncoder(n_components=10),  # fewer components for speed
            datetime=DatetimeEncoder(add_weekday=True, add_day_of_year=True, periodic_encoding='circular'),
        ),
        TabICLClassifier(n_estimators=1, device="cpu")  # 1 estimator for speed
    )
    start_time = time.time()
    scores = cross_val_score(pipeline, X, y, cv=2, scoring="roc_auc_ovr")
    print(f"ROC AUC with skrub: {scores.mean():.3f} (+/- {scores.std():.3f}), time: {time.time()-start_time:.1f} s")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ROC AUC with skrub: 0.702 (+/- 0.084), time: 2.2 s

.. GENERATED FROM PYTHON SOURCE LINES 92-94

Overall, skrub preprocessing helps TabICL achieve a higher ROC AUC on this dataset.
Encoding each string column into multiple numerical columns can increase the runtime,
but the two timings above are not directly comparable,
because the first run includes downloading the checkpoint.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (1 minutes 50.755 seconds)


.. _sphx_glr_download_tutorials_string_handling.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: string_handling.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: string_handling.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: string_handling.zip `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_