# Generating Synthetic Data

MLJ provides a set of functions for generating synthetic data sets: make_blobs, make_circles, make_moons and make_regression, which closely resemble the scikit-learn functions of the same names. They are useful for testing machine learning models, for example user-defined composite models (see Composing Models).

## Generating Gaussian blobs

MLJBase.make_blobs (function)
X, y = make_blobs(n=100, p=2; kwargs...)

Generate Gaussian blobs for clustering and classification problems.

Return value

By default, a table X with p columns (features) and n rows (observations), together with a corresponding vector of n Multiclass target observations y, indicating blob membership.

Keyword arguments

• shuffle=true: whether to shuffle the resulting points,

• centers=3: either a number of centers or a c x p matrix with c pre-determined centers,

• cluster_std=1.0: the standard deviation(s) of each blob,

• center_box=(-10. => 10.): the limits of the p-dimensional cube within which the cluster centers are drawn if they are not provided,

• eltype=Float64: machine type of points (any subtype of AbstractFloat).

• rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).

• as_table=true: whether to return the points as a table (true) or a matrix (false). If false the target y has integer element type.

Example

using MLJ, DataFrames
X, y = make_blobs(100, 3; centers=2, cluster_std=[1.0, 3.0])
dfBlobs = DataFrame(X)
dfBlobs.y = y
first(dfBlobs, 3)

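3 rows × 4 columns

| Row | x1 | x2 | x3 | y |
|-----|----|----|----|---|
|  | Float64 | Float64 | Float64 | Cat… |
| 1 | 4.61979 | -11.081 | 2.54577 | 2 |
| 2 | 9.34523 | -4.58449 | -2.06145 | 2 |
| 3 | -2.10652 | 5.65 | -3.04018 | 1 |

using VegaLite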
dfBlobs |> @vlplot(:point, x=:x1, y=:x2, color = :"y:n") 

dfBlobs |> @vlplot(:point, x=:x1, y=:x3, color = :"y:n") 
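The centers, rng and as_table keywords listed above can be combined to get reproducible blobs in plain matrix form. The following is a minimal sketch, assuming MLJBase is installed (the center coordinates and the seed 123 are arbitrary illustrative values, not taken from the documentation):

```julia
using MLJBase  # assumption: MLJBase is installed; `using MLJ` exposes the same function

# Two pre-determined centers, given as a c x p matrix (here c = 2, p = 2).
# These coordinates are arbitrary illustrative values.
centers = [0.0 0.0;
           5.0 5.0]

# An integer `rng` seeds the generator, so repeated calls return identical data.
X1, y1 = make_blobs(50, 2; centers=centers, cluster_std=0.5, rng=123, as_table=false)
X2, y2 = make_blobs(50, 2; centers=centers, cluster_std=0.5, rng=123, as_table=false)

@show size(X1)              # matrix with n rows and p columns
@show eltype(y1)            # integer labels, because as_table=false
@show X1 == X2 && y1 == y2  # same seed, same data
```

Passing as_table=false is convenient when the data feeds code that expects a raw matrix; the default table output is usually the better fit for MLJ models.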

## Generating concentric circles

MLJBase.make_circles (function)
X, y = make_circles(n=100; kwargs...)

Generate n labeled points close to two concentric circles for classification and clustering models.

Return value

By default, a table X with 2 columns and n rows (observations), together with a corresponding vector of n Multiclass target observations y. The target is either 0 or 1, indicating membership of the smaller or larger circle, respectively.

Keyword arguments

• shuffle=true: whether to shuffle the resulting points,

• noise=0: standard deviation of the Gaussian noise added to the data,

• factor=0.8: ratio of the smaller radius over the larger one,

• eltype=Float64: machine type of points (any subtype of AbstractFloat).

• rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).

• as_table=true: whether to return the points as a table (true) or a matrix (false). If false the target y has integer element type.

Example

X, y = make_circles(100; noise=0.5, factor=0.3)
using MLJ, DataFrames
X, y = make_circles(100; noise=0.05, factor=0.3)
dfCircles = DataFrame(X)
dfCircles.y = y
first(dfCircles, 3)

3 rows × 3 columns

| Row | x1 | x2 | y |
|-----|----|----|---|
|  | Float64 | Float64 | Cat… |
| 1 | 0.116784 | -0.303861 | 0 |
| 2 | 0.853029 | 0.479028 | 1 |
| 3 | 0.0128085 | -0.363899 | 0 |

using VegaLite
dfCircles |> @vlplot(:circle, x=:x1, y=:x2, color = :"y:n") 
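To see what the factor keyword controls, one can generate noise-free circles and compare the radii of the two classes. This is a rough sketch, assuming MLJBase is installed and assuming both circles are centered at the origin (the documentation above does not state the center explicitly); the seed is arbitrary:

```julia
using MLJBase           # assumption: MLJBase is installed
using Statistics: mean  # standard library

# noise defaults to 0, so every point lies exactly on one of the two circles.
X, y = make_circles(200; factor=0.3, rng=1, as_table=false)

# Distance of each point from the origin (assumed to be the common center).
radii = sqrt.(X[:, 1] .^ 2 .+ X[:, 2] .^ 2)

r_small = mean(radii[y .== 0])  # label 0: smaller circle
r_large = mean(radii[y .== 1])  # label 1: larger circle

@show r_small / r_large         # should be close to `factor`
```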

## Sampling from two interleaved half-circles

MLJBase.make_moons (function)
    make_moons(n::Int=100; kwargs...)

Generate labeled two-dimensional points lying close to two interleaved semi-circles, for use with classification and clustering models.

Return value

By default, a table X with 2 columns and n rows (observations), together with a corresponding vector of n Multiclass target observations y. The target is either 0 or 1, indicating membership of the left or right semi-circle, respectively.

Keyword arguments

• shuffle=true: whether to shuffle the resulting points,

• noise=0.1: standard deviation of the Gaussian noise added to the data,

• xshift=1.0: horizontal translation of the second center with respect to the first one.

• yshift=0.3: vertical translation of the second center with respect to the first one.

• eltype=Float64: machine type of points (any subtype of AbstractFloat).

• rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).

• as_table=true: whether to return the points as a table (true) or a matrix (false). If false the target y has integer element type.

Example

X, y = make_moons(100; noise=0.5)
using MLJ, DataFrames
X, y = make_moons(100; noise=0.05)
dfHalfCircles = DataFrame(X)
dfHalfCircles.y = y
first(dfHalfCircles, 3)

3 rows × 3 columns

| Row | x1 | x2 | y |
|-----|----|----|---|
|  | Float64 | Float64 | Cat… |
| 1 | 1.45093 | -0.602786 | 1 |
| 2 | 2.01151 | 0.245156 | 1 |
| 3 | 1.99726 | -0.0988589 | 1 |

using VegaLite
dfHalfCircles |> @vlplot(:circle, x=:x1, y=:x2, color = :"y:n") 
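The eltype and as_table keywords described above also apply here. A small sketch, assuming MLJBase is installed (the seed 42 is an arbitrary choice), that requests single-precision points as a matrix:

```julia
using MLJBase  # assumption: MLJBase is installed

# Request Float32 points in matrix form with integer labels.
X, y = make_moons(100; noise=0.05, eltype=Float32, rng=42, as_table=false)

@show eltype(X)        # element type requested through the `eltype` keyword
@show size(X)          # n rows, 2 columns
@show sort(unique(y))  # the two class labels
```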

## Regression data generated from noisy linear models

MLJBase.make_regression (function)
make_regression(n, p; kwargs...)

Generate Gaussian input features and a linear response with Gaussian noise, for use with regression models.

Return value

By default, a table X with p columns and n rows (observations), together with a corresponding vector of n Continuous target observations y.

Keyword arguments

• intercept=true: whether to generate data from a model with intercept,

• sparse=0: proportion of the generating weight vector that is sparse (set to zero),

• noise=0.1: standard deviation of the Gaussian noise added to the response,

• outliers=0: proportion of the response vector to turn into outliers by adding a random quantity with high variance (only applied if binary is false),

• binary=false: whether the target should be binarized (via a sigmoid).

• eltype=Float64: machine type of points (any subtype of AbstractFloat).

• rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).

• as_table=true: whether to return the points as a table (true) or a matrix (false).

Example

using MLJ, DataFrames
X, y = make_regression(100, 5; noise=0.5, sparse=0.2, outliers=0.1)
dfRegression = DataFrame(X)
dfRegression.y = y
first(dfRegression, 3)

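3 rows × 6 columns

| Row | x1 | x2 | x3 | x4 | x5 | y |
|-----|----|----|----|----|----|---|
|  | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.731212 | -1.44803 | -0.0551643 | -0.473598 | -1.14843 | -1.48917 |
| 2 | 0.471442 | -0.779514 | 0.893356 | 0.623297 | -1.50227 | -0.0975662 |
| 3 | 1.10919 | 0.776157 | -2.81249 | 1.05283 | 0.077798 | 1.1108 |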
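
Because the response is described above as linear in the features plus Gaussian noise, switching the noise off should give data that ordinary least squares fits exactly. The sketch below checks this, assuming MLJBase is installed and that sparse and outliers are left at their defaults of 0 (the seed 7 and the sizes are arbitrary):

```julia
using MLJBase  # assumption: MLJBase is installed

# noise=0.0 removes the Gaussian noise, leaving a purely linear response.
X, y = make_regression(200, 3; noise=0.0, rng=7, as_table=false)

# Append a column of ones so the fit can recover the intercept (intercept=true by default).
A = hcat(X, ones(size(X, 1)))

coefs = A \ y               # ordinary least-squares solution
residuals = y .- A * coefs

@show maximum(abs, residuals)  # expected to be at floating-point rounding level
```

With a positive noise value, the same residuals give a rough check that their spread is on the order of the requested standard deviation.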