Change detection based Thompson sampling algorithm for non-stationary bandits

Zaid, Kunwar; Ghatak, Gourab (Advisor)

Please use this identifier to cite or link to this item: http://repository.iiitd.edu.in/xmlui/handle/123456789/868

Full metadata record

DC Field	Value	Language
dc.contributor.author	Zaid, Kunwar
dc.contributor.author	Ghatak, Gourab (Advisor)
dc.date.accessioned	2021-03-26T06:53:37Z
dc.date.available	2021-03-26T06:53:37Z
dc.date.issued	2020
dc.identifier.uri	http://repository.iiitd.edu.in/xmlui/handle/123456789/868
dc.description.abstract	The stationary multi-armed bandit (MAB) framework is a well-studied problem in the literature, with many rigorous mathematical treatments and optimal solutions. However, for a non-stationary environment, i.e., when the reward distribution changes over time, the MAB problem is notoriously difficult to analyze. In general, to address non-stationary bandit problems, researchers have proposed two approaches: i) passively adaptive techniques, which are analytically tractable, and ii) actively adaptive techniques, which keep track of the environment and adapt as soon as changes are detected. Consequently, researchers have come up with variants of bandit algorithms that are based on classical solutions, e.g., sliding-window upper-confidence bound (SW-UCB), dynamic UCB (d-UCB), discounted UCB (D-UCB), discounted Thompson sampling (DTS), etc. In this regard, we consider the piecewise stationary environment, where the reward distribution remains stationary for a random time and changes at an unknown instant. We propose a class of change-detection based, actively adaptive Thompson sampling (TS) algorithms for this framework, named TS-CD. In particular, the non-stationarity of the environment is modeled as a Poisson arrival process, which changes the reward distribution on each arrival. For detecting the change we employ i) mean-estimation based methods, and ii) goodness-of-fit tests, namely the Kolmogorov-Smirnov test (KS-test) and the Anderson-Darling test (AD-test). Once a change is detected, the TS algorithm either refreshes the parameters or discounts the past rewards. To assess the performance of the proposed algorithm, we have tested it for edge-control of i) multi-connectivity and ii) radio access technology (RAT) selection in a wireless network. We have compared the TS-CD algorithms with other bandit algorithms that are designed for non-stationary environments, such as D-UCB, DTS and change detection based UCB (CD-UCB). With extensive simulations, we establish the superior performance of the proposed TS-CD in the considered applications.	en_US
dc.language.iso	en_US	en_US
dc.publisher	IIIT-Delhi	en_US
dc.subject	Kolmogorov-Smirnov test, Stochastic Bandits, Thompson Sampling	en_US
dc.title	Change detection based Thompson sampling algorithm for non-stationary bandits	en_US
dc.type	Thesis	en_US
Appears in Collections:	Year-2020
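
The abstract above describes Thompson sampling combined with a goodness-of-fit change detector. The following is a minimal illustrative sketch of that idea, not the thesis implementation. It assumes Bernoulli rewards with Beta(1, 1) priors, a two-sample KS test that compares the older and newer halves of each arm's recent reward history, and a "refresh" that resets the arm to its prior. The names TSCD, ks_statistic, change_detected and simulate, and the window, alpha and rate values, are all hypothetical choices made for this example.

```python
import math
import random
from collections import deque


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def change_detected(reference, recent, alpha=0.01):
    """Flag a change when D exceeds the asymptotic KS critical value."""
    n, m = len(reference), len(recent)
    if n == 0 or m == 0:
        return False
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return ks_statistic(reference, recent) > c_alpha * math.sqrt((n + m) / (n * m))


class TSCD:
    """Hypothetical sketch: Beta-Bernoulli Thompson sampling with a KS detector."""

    def __init__(self, n_arms, window=50, alpha=0.01):
        self.window = window
        self.alpha = alpha
        self.successes = [1.0] * n_arms  # Beta(1, 1) prior (assumption)
        self.failures = [1.0] * n_arms
        self.history = [deque(maxlen=2 * window) for _ in range(n_arms)]

    def select_arm(self):
        # Draw one posterior sample per arm and play the arm with the largest draw.
        draws = [random.betavariate(s, f)
                 for s, f in zip(self.successes, self.failures)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, reward):
        """Update the posterior; return True if a change was detected on this arm."""
        self.successes[arm] += reward
        self.failures[arm] += 1 - reward
        hist = self.history[arm]
        hist.append(reward)
        if len(hist) == 2 * self.window:
            old, new = list(hist)[:self.window], list(hist)[self.window:]
            if change_detected(old, new, self.alpha):
                # "Refresh" variant (assumption): reset this arm to its prior.
                self.successes[arm] = 1.0
                self.failures[arm] = 1.0
                hist.clear()
                return True
        return False


def simulate(horizon=2000, n_arms=2, rate=0.002, seed=0):
    """Piecewise-stationary run; change points arrive as a Poisson process."""
    random.seed(seed)
    agent = TSCD(n_arms)
    means = [random.random() for _ in range(n_arms)]
    next_change = random.expovariate(rate)  # exponential inter-arrival time
    total, detections = 0, 0
    for t in range(horizon):
        if t >= next_change:
            means = [random.random() for _ in range(n_arms)]
            next_change += random.expovariate(rate)
        arm = agent.select_arm()
        reward = 1 if random.random() < means[arm] else 0
        detections += agent.update(arm, reward)
        total += reward
    return total, detections
```

Calling simulate() returns the cumulative reward and the number of detections for one run. The sketch only tests arms that are actually played, and it covers only the KS-based "refresh" variant; the mean-estimation detectors, the AD-test and the discounting variant mentioned in the abstract are not shown.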

Files in This Item:

File	Description	Size	Format
MT18164_Kunwar Zaid.pdf		830.4 kB	Adobe PDF
