Generating Synthetic Data

Here synthetic data means artificially generated data, with no reference to a "real world" data set. It is not to be confused with "fake data" obtained by resampling from a distribution fit to some actual real data.

MLJ provides a set of functions for generating synthetic data sets: make_blobs, make_circles, make_moons and make_regression, which closely resemble the scikit-learn functions of the same names. They are useful for testing machine learning models (e.g., user-defined composite models; see Composing Models).

Generating Gaussian blobs

MLJBase.make_blobs — Function

X, y = make_blobs(n=100, p=2; kwargs...)

Generate Gaussian blobs for clustering and classification problems.

Return value

By default, a table X with p columns (features) and n rows (observations), together with a corresponding vector of n Multiclass target observations y, indicating blob membership.

Keyword arguments

shuffle=true: whether to shuffle the resulting points.
centers=3: either a number of centers or a c × p matrix with c pre-determined centers.
cluster_std=1.0: the standard deviation(s) of each blob.
center_box=(-10. => 10.): the limits of the p-dimensional cube within which the cluster centers are drawn if they are not provided.
eltype=Float64: machine type of points (any subtype of AbstractFloat).
rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).
as_table=true: whether to return the points as a table (true) or a matrix (false). If false the target y has integer element type.

Example

X, y = make_blobs(100, 3; centers=2, cluster_std=[1.0, 3.0])

using MLJ, DataFrames
X, y = make_blobs(100, 3; centers=2, cluster_std=[1.0, 3.0])
dfBlobs = DataFrame(X)
dfBlobs.y = y
first(dfBlobs, 3)

3×4 DataFrame

Row	x1	x2	x3	y
	Float64	Float64	Float64	Cat…
1	-4.33288	3.80021	7.68147	1
2	-5.09418	6.49195	9.23594	1
3	3.70995	-3.05888	-6.18936	2

using VegaLite
dfBlobs |> @vlplot(:point, x=:x1, y=:x2, color = :"y:n")

dfBlobs |> @vlplot(:point, x=:x1, y=:x3, color = :"y:n")

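
As a minimal sketch of the centers, rng and as_table keywords described above: passing a matrix of pre-determined centers together with an integer seed should make two calls return identical data. The particular centers, seed and variable names are illustrative assumptions, not part of the original example.

```julia
using MLJ

# Two pre-determined centers in the plane: a c x p matrix with c = 2, p = 2.
# (These coordinates are arbitrary, chosen only for illustration.)
centers = [0.0 0.0;
           5.0 5.0]

# An integer `rng` seeds a MersenneTwister, so both calls draw the same points.
# `as_table=false` asks for a plain matrix and an integer-valued target.
Xa, ya = make_blobs(50, 2; centers=centers, cluster_std=0.5, rng=123, as_table=false)
Xb, yb = make_blobs(50, 2; centers=centers, cluster_std=0.5, rng=123, as_table=false)

size(Xa)                 # one row per observation, one column per feature
Xa == Xb && ya == yb     # the two seeded calls are expected to agree
```

Seeding this way is convenient in tests, where the same synthetic data set must be regenerated on every run.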

Generating concentric circles

MLJBase.make_circles — Function

X, y = make_circles(n=100; kwargs...)

Generate n labeled points close to two concentric circles for classification and clustering models.

Return value

By default, a table X with 2 columns and n rows (observations), together with a corresponding vector of n Multiclass target observations y. The target is either 0 or 1, corresponding to membership in the smaller or larger circle, respectively.

Keyword arguments

shuffle=true: whether to shuffle the resulting points.
noise=0: standard deviation of the Gaussian noise added to the data.
factor=0.8: ratio of the smaller radius over the larger one.
eltype=Float64: machine type of points (any subtype of AbstractFloat).
rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).
as_table=true: whether to return the points as a table (true) or a matrix (false). If false the target y has integer element type.

Example

X, y = make_circles(100; noise=0.5, factor=0.3)

using MLJ, DataFrames
X, y = make_circles(100; noise=0.05, factor=0.3)
dfCircles = DataFrame(X)
dfCircles.y = y
first(dfCircles, 3)

3×3 DataFrame

Row	x1	x2	y
	Float64	Float64	Cat…
1	0.229766	-0.176036	0
2	-0.153854	-0.284216	0
3	-0.297306	-0.0669254	0

using VegaLite
dfCircles |> @vlplot(:circle, x=:x1, y=:x2, color = :"y:n")

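
The meaning of factor can be checked numerically. The sketch below switches the noise off and compares the average distance from the origin of the two classes; it assumes the circles are centered at the origin, which is consistent with the sample output above but is not stated in the docstring. The seed and variable names are arbitrary.

```julia
using MLJ

# With `noise=0` the points lie exactly on the two circles.
X, y = make_circles(200; noise=0, factor=0.3, rng=42, as_table=false)

# Distance of each point from the origin (assumed to be the common center).
radii = sqrt.(X[:, 1] .^ 2 .+ X[:, 2] .^ 2)

# Average radius per class: label 0 is the smaller circle, label 1 the larger.
inner = sum(radii[y .== 0]) / count(==(0), y)
outer = sum(radii[y .== 1]) / count(==(1), y)

inner / outer   # should be close to `factor`
```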

Sampling from two interleaved half-circles

MLJBase.make_moons — Function

make_moons(n::Int=100; kwargs...)

Generates labeled two-dimensional points lying close to two interleaved semi-circles, for use with classification and clustering models.

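Return value

By default, a table X with 2 columns and n rows (observations), together with a corresponding vector of n Multiclass target observations y.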

Keyword arguments

shuffle=true: whether to shuffle the resulting points.
noise=0.1: standard deviation of the Gaussian noise added to the data.
xshift=1.0: horizontal translation of the second center with respect to the first one.
yshift=0.3: vertical translation of the second center with respect to the first one.
eltype=Float64: machine type of points (any subtype of AbstractFloat).
rng=Random.GLOBAL_RNG: any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).
as_table=true: whether to return the points as a table (true) or a matrix (false). If false the target y has integer element type.

Example

X, y = make_moons(100; noise=0.5)

using MLJ, DataFrames
X, y = make_moons(100; noise=0.05)
dfHalfCircles = DataFrame(X)
dfHalfCircles.y = y
first(dfHalfCircles, 3)

3×3 DataFrame

Row	x1	x2	y
	Float64	Float64	Cat…
1	1.06963	-0.693786	1
2	-0.483621	0.831705	0
3	1.55299	-0.493024	1

using VegaLite
dfHalfCircles |> @vlplot(:circle, x=:x1, y=:x2, color = :"y:n")

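
The eltype and as_table keywords can be combined when a plain single-precision matrix is more convenient than a table. This is a small sketch under that assumption; the seed is arbitrary and the comments describe what each expression inspects rather than quoting verified output.

```julia
using MLJ

# Request single-precision points as a plain matrix; the seed is arbitrary.
X, y = make_moons(100; noise=0.05, rng=1, eltype=Float32, as_table=false)

size(X)            # observations by features
eltype(X)          # machine type of the generated points
sort(unique(y))    # the integer labels of the two semi-circles
```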

Regression data generated from noisy linear models

MLJBase.make_regression — Function

make_regression(n, p; kwargs...)

Generate Gaussian input features and a linear response with Gaussian noise, for use with regression models.

Return value

By default, a tuple (X, y) where table X has p columns and n rows (observations), together with a corresponding vector of n Continuous target observations y.

Keyword arguments

intercept=true: Whether to generate data from a model with intercept.
n_targets=1: Number of columns in the target.
sparse=0: Proportion of entries in the generating weight vector that are set to zero.
noise=0.1: Standard deviation of the Gaussian noise added to the response (target).
outliers=0: Proportion of the response vector to make as outliers by adding a random quantity with high variance. (Only applied if binary is false.)
as_table=true: Whether X (and y, if n_targets > 1) should be a table (true) or a matrix (false).
eltype=Float64: Element type (machine type) for X and y. Must be a subtype of AbstractFloat.
binary=false: Whether the target should be binarized (via a sigmoid).
rng=Random.GLOBAL_RNG: Any AbstractRNG object, or integer to seed a MersenneTwister (for reproducibility).
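
As a brief illustration of how n_targets and as_table interact, the sketch below requests two target columns as plain arrays. The seed and sizes are arbitrary choices for the example.

```julia
using MLJ

# Two target columns, returned as matrices rather than tables.
X, y = make_regression(100, 5; n_targets=2, as_table=false, rng=7)

size(X)   # observations by features
size(y)   # observations by targets
```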

Example

X, y = make_regression(100, 5; noise=0.5, sparse=0.2, outliers=0.1)

using MLJ, DataFrames
X, y = make_regression(100, 5; noise=0.5, sparse=0.2, outliers=0.1)
dfRegression = DataFrame(X)
dfRegression.y = y
first(dfRegression, 3)

3×6 DataFrame

Row	x1	x2	x3	x4	x5	y
	Float64	Float64	Float64	Float64	Float64	Float64
1	-0.0620149	0.957933	-1.18308	-1.7924	-1.23835	0.549218
2	-0.415197	-0.75425	-0.984602	-2.46851	-0.666179	0.159293
3	0.376288	-1.96122	0.244017	-0.0472903	-0.348247	16.6205
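
Because the response is linear in the features, the generating weights can be recovered by ordinary least squares once the noise is switched off. The following is a minimal sketch, assuming noise=0 is accepted and using no intercept and no outliers; the seed and the name theta_hat are illustrative only.

```julia
using MLJ

# Noise-free data from a model without intercept, returned as plain arrays.
X, y = make_regression(100, 3; noise=0, intercept=false, rng=2024, as_table=false)

# Ordinary least squares via the backslash operator.
theta_hat = X \ y

# With no noise and no outliers, the fit should reproduce y up to rounding error.
maximum(abs.(X * theta_hat .- y))
```

Raising noise or outliers and repeating the fit gives a quick feel for how far a least-squares estimate drifts from the noise-free solution.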