.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/regression_heteroscedastic_1D.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_tutorials_regression_heteroscedastic_1D.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials_regression_heteroscedastic_1D.py:


Quantile regression with TabICL
===============================

This example shows the TabICL predicted quantiles on a simple 1D regression
problem with heteroscedastic noise.

.. GENERATED FROM PYTHON SOURCE LINES 10-18

.. code-block:: Python


    import numpy as np
    from scipy.special import expit
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from scipy.stats import norm
    from tabicl import TabICLRegressor


.. GENERATED FROM PYTHON SOURCE LINES 19-33

Generate heteroscedastic data
-----------------------------

We first generate a simple one-dimensional regression task to illustrate the
natural ability of TabICL to model predictive uncertainty in the presence of
heteroscedastic noise.

`Heteroscedasticity <https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity>`_
means that the variance of the target random variable `y` is not constant over
the feature space: there are regions of  `x` for which `y` is much harder to
predict than for other.

The following data generating process is heteroscedastic because `true_y_std`
is not constant: it is defined as a function of `x`.

.. GENERATED FROM PYTHON SOURCE LINES 33-54

.. code-block:: Python

    rng = np.random.default_rng(0)
    n_samples = int(3e3)
    x = rng.uniform(low=-3, high=3, size=n_samples)
    X = x.reshape((n_samples, 1))


    def true_y_mean(x):
        return expit(x) - 0.5 - 0.1 * x


    def true_y_std(x):
        return 0.07 * np.exp(-((x - 0.5) ** 2) / 0.9)


    y = rng.normal(loc=true_y_mean(x), scale=true_y_std(x))

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1 / 2, random_state=0
    )


.. GENERATED FROM PYTHON SOURCE LINES 55-67

Fit TabICL and estimate quantiles
---------------------------------

The TabICL model is fitted with `n_estimators=1` to disable ensembling to speed-up
the execution (at the cost of slightly worse predictions).

The model then estimates quantiles of the distribution of Y|X for each test data
point. If the model is good, we expect 80% of the observed values of y to lie
between the predicted lower (0.1) and upper (0.9) quantile values.

The 0.5 quantile prediction estimates the median of Y|X: we expect 50% of the
observation to lie above and the remaining 50% to lie below. 

.. GENERATED FROM PYTHON SOURCE LINES 68-75

.. code-block:: Python

    tabicl = TabICLRegressor(n_estimators=1)
    tabicl.fit(X_train, y_train)

    alphas = [0.10, 0.5, 0.90]
    quantiles = tabicl.predict(X_test, output_type="quantiles", alphas=alphas)


.. GENERATED FROM PYTHON SOURCE LINES 76-89

Plot the quantiles predicted by TabICL
--------------------------------------

The predictions of the lower and upper quantiles are sorted by increasing
value of `x` to define the boarder of a shaded area that represents the
predictive uncertainty of the model on this dataset. We can observe that the
(vertical) width of the prediction interval is much larger for `x` values
between -0.5 and 1.5 than elsewhere and naturally adapts to the noise level of
`y` given `x`.

For `x` values below -1. or above 2., the TabICL model is very confident (narrow
prediction intervals): indeed we can observe that `y` is nearly a deterministic
(noise free) function of `x` in those regions.

.. GENERATED FROM PYTHON SOURCE LINES 89-124

.. code-block:: Python


    def plot_data_and_quantiles(X_test, y_test, quantiles):

        _, ax = plt.subplots(figsize=(5, 4), constrained_layout=True)

        # Plot test data points
        ax.scatter(
            x=X_test,
            y=y_test,
            alpha=0.15,
            color="gray"
        )

        # sort test points for a clean line plot (optional)
        order = np.argsort(X_test[:, 0])
        x_sorted = X_test[order, 0]

        # Plot median
        ax.plot(x_sorted, quantiles[order, 1],
                color='darkgreen', lw=3, label="median")

        # Plot 10-90% interval
        ax.fill_between(x_sorted, quantiles[order, 0],
                        quantiles[order, 2], alpha=0.18,
                        color="green", label="10–90% interval")

        ax.legend(frameon=False)
        ax.set(xlabel="Input feature x", ylabel="Target variable y")
        ax.spines[['top', 'right']].set_visible(False)
        ax.set_title("TabICL predicted quantiles")


    plot_data_and_quantiles(X_test, y_test, quantiles)


.. image-sg:: /tutorials/images/sphx_glr_regression_heteroscedastic_1D_001.png
   :alt: TabICL predicted quantiles
   :srcset: /tutorials/images/sphx_glr_regression_heteroscedastic_1D_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 4.315 seconds)


.. _sphx_glr_download_tutorials_regression_heteroscedastic_1D.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: regression_heteroscedastic_1D.ipynb <regression_heteroscedastic_1D.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: regression_heteroscedastic_1D.py <regression_heteroscedastic_1D.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: regression_heteroscedastic_1D.zip <regression_heteroscedastic_1D.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_