Change detection based Thompson sampling algorithm for non-stationary bandits

Zaid, Kunwar; Ghatak, Gourab (Advisor)

Please use this identifier to cite or link to this item: http://repository.iiitd.edu.in/xmlui/handle/123456789/868

Title:	Change detection based Thompson sampling algorithm for non-stationary bandits
Authors:	Zaid, Kunwar; Ghatak, Gourab (Advisor)
Keywords:	Kolmogorov-Smirnov test, Stochastic Bandits, Thompson Sampling
Issue Date:	2020
Publisher:	IIIT-Delhi
Abstract:	The stationary multi-armed bandit (MAB) framework is a well-studied problem in the literature, with many rigorous mathematical treatments and optimal solutions. However, for a non-stationary environment, i.e., when the reward distribution changes over time, the MAB problem is notoriously difficult to analyze. In general, to address non-stationary bandit problems, researchers have proposed two approaches: i) passively adaptive techniques that are analytically tractable, or ii) actively adaptive techniques that keep track of the environment and adapt as soon as changes are detected. Consequently, researchers have come up with variants of bandit algorithms that are based on classical solutions, e.g., sliding-window upper-confidence bound (SW-UCB), dynamic UCB (d-UCB), discounted UCB (D-UCB), discounted Thompson sampling (DTS), etc. In this regard, we consider the piecewise stationary environment, where the reward distribution remains stationary for a random time and changes at an unknown instant. We propose a class of change-detection based, actively adaptive TS algorithms for this framework, named TS-CD. In particular, the non-stationarity of the environment is modeled as a Poisson arrival process, which changes the reward distribution on each arrival. To detect a change, we employ i) mean-estimation based methods, and ii) goodness-of-fit tests, namely the Kolmogorov-Smirnov test (KS-test) and the Anderson-Darling test (AD-test). Once a change is detected, the TS algorithm either refreshes the parameters or discounts the past rewards. To assess the performance of the proposed algorithm, we have tested it for edge-control of i) multi-connectivity and ii) RAT selection in a wireless network. We have compared the TS-CD algorithms with other bandit algorithms that are designed for non-stationary environments, such as D-UCB, DTS, and change detection based UCB (CD-UCB). With extensive simulations, we establish the superior performance of the proposed TS-CD in the considered applications.
URI:	http://repository.iiitd.edu.in/xmlui/handle/123456789/868
Appears in Collections:	Year-2020
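
The abstract describes TS-CD only at a high level, so the following Python sketch is an illustration under stated assumptions and not the thesis implementation. It assumes Bernoulli rewards with Beta posteriors, a fixed per-arm window split into an older and a newer half, a two-sample KS statistic compared against the asymptotic 5% critical value (coefficient 1.358), and a full parameter refresh on detection. The names `ks_statistic`, `change_detected`, `ts_cd`, `reward_fn` and `window` are hypothetical.

```python
import math
import random
from collections import deque


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every observation <= x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def change_detected(old, new, c_alpha=1.358):
    """Flag a change when the KS statistic exceeds the asymptotic
    critical value (c_alpha = 1.358 corresponds to a 5% level)."""
    n, m = len(old), len(new)
    return ks_statistic(old, new) > c_alpha * math.sqrt((n + m) / (n * m))


def ts_cd(reward_fn, n_arms, horizon, window=50, seed=0):
    """Thompson sampling with a KS-based change detector (assumed design).

    reward_fn(arm, t, rng) must return 0 or 1.
    Returns the list of chosen arms and the time steps of detected changes.
    """
    rng = random.Random(seed)
    alpha = [1.0] * n_arms  # Beta posterior: 1 + number of successes
    beta = [1.0] * n_arms   # Beta posterior: 1 + number of failures
    history = [deque(maxlen=2 * window) for _ in range(n_arms)]
    choices, resets = [], []
    for t in range(horizon):
        # Thompson step: sample a mean for each arm, play the best sample.
        samples = [rng.betavariate(alpha[k], beta[k]) for k in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        r = reward_fn(arm, t, rng)
        alpha[arm] += r
        beta[arm] += 1 - r
        history[arm].append(r)
        choices.append(arm)
        # Change detection on the played arm once its window is full.
        if len(history[arm]) == 2 * window:
            obs = list(history[arm])
            if change_detected(obs[:window], obs[window:]):
                # "Refresh" variant: forget all past rewards.
                alpha = [1.0] * n_arms
                beta = [1.0] * n_arms
                for h in history:
                    h.clear()
                resets.append(t)
    return choices, resets


def piecewise_reward(arm, t, rng):
    """Toy piecewise-stationary environment: arm 0 degrades at t = 500."""
    p = [0.9, 0.5] if t < 500 else [0.1, 0.5]
    return 1 if rng.random() < p[arm] else 0
```

Calling `ts_cd(piecewise_reward, n_arms=2, horizon=1000)` runs the sketch on the toy environment. Two limitations of this simplified design are worth noting: the detector only sees arms that are actually played, so an arm that improves while rarely sampled can go unnoticed, and the discounting variant mentioned in the abstract is not shown.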

Files in This Item:

File	Description	Size	Format
MT18164_Kunwar Zaid.pdf		830.4 kB	Adobe PDF
