Ph.D. course in Statistical Science
The Graduate University for Advanced Studies
About Me
Ph.D. candidate in Statistical Science
I am a researcher and engineer working in computer science and machine learning.
Transactions on Machine Learning Research
The Wasserstein Weisfeiler-Lehman (WWL) graph kernel is a popular and efficient approach, used in various kernel-based machine learning frameworks for practical applications with graph data. It incorporates optimal transport geometry into the Weisfeiler-Lehman graph kernel to mitigate the information loss inherent in the aggregation strategies of graph kernels. While the WWL graph kernel demonstrates superior performance in many applications, it suffers from a drawback in its computational complexity, i.e., at least $\mathcal{O}(n_{1} n_{2})$, where $n_{1}, n_{2}$ denote the numbers of vertices in the input graphs. This hinders the practical applicability of the WWL graph kernel, especially in large-scale settings. In this paper, we propose the \emph{Tree Wasserstein Weisfeiler-Lehman} (TWWL) algorithm, which leverages a \emph{tree structure} to scale up the exact computation of the WWL graph kernel for graph data with categorical node labels. In particular, the computational complexity of the TWWL algorithm is $\mathcal{O}(n_{1} + n_{2})$, which enables its application to large-scale graphs. Numerical experiments demonstrate that the performance of the proposed algorithm compares favorably with baseline kernels, while its computation is several orders of magnitude faster than that of the classic WWL graph kernel. This paves the way for applications to large-scale datasets on which the WWL kernel is computationally prohibitive.
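The linear-time claim rests on a well-known closed form: the 1-Wasserstein distance between two distributions supported on the nodes of a tree metric is a weighted sum, over edges, of the absolute difference in subtree mass. The sketch below illustrates that closed form only; it is not the authors' implementation, and the function name and array layout are illustrative assumptions.

```python
import numpy as np

def tree_wasserstein(parent, weight, mu, nu):
    """Exact 1-Wasserstein distance between distributions mu and nu
    supported on the nodes of a tree, computed in O(n) time.

    parent[v] is the parent of node v (parent[root] = -1);
    weight[v] is the length of the edge (parent[v], v);
    nodes are assumed topologically ordered so that parent[v] < v.

    Closed form on trees: W1 = sum over edges of
    edge_length * |mu-mass below the edge - nu-mass below the edge|.
    """
    n = len(parent)
    diff = np.asarray(mu, dtype=float) - np.asarray(nu, dtype=float)
    # One bottom-up pass: children (larger indices) are processed first,
    # so diff[v] accumulates the full subtree mass difference at v.
    for v in range(n - 1, 0, -1):
        diff[parent[v]] += diff[v]
    return sum(weight[v] * abs(diff[v]) for v in range(1, n))
```

For example, on a path 2-0-1-3 with unit edge weights, moving a unit of mass from node 2 to node 3 costs 3, and a single linear pass suffices regardless of the number of support points.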
Journal of Complex Networks
Graph representation matrices are essential tools in graph data analysis. Recently, Hermitian adjacency matrices have been proposed to investigate directed graph structures. Previous studies have demonstrated that these matrices can extract valuable information for clustering. In this paper, we propose the complex non-backtracking matrix, which integrates the properties of the Hermitian adjacency matrix and the non-backtracking matrix. The proposed matrix has properties similar to those of the non-backtracking matrix of undirected graphs. We reveal relationships between the complex non-backtracking matrix and the Hermitian adjacency matrix, and provide intriguing insights showing that this matrix representation holds cluster information, particularly for sparse directed graphs.
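As background, a common convention for the Hermitian adjacency matrix encodes each one-way directed edge with the imaginary unit and each reciprocated pair with 1, which makes the matrix Hermitian. A minimal NumPy sketch under that convention (the function name and edge-list format are illustrative, not from the paper):

```python
import numpy as np

def hermitian_adjacency(n, edges):
    """Hermitian adjacency matrix of a directed graph on n vertices.

    For each pair (u, v): the entry is 1 if both u->v and v->u exist,
    i if only u->v exists, -i if only v->u exists, and 0 otherwise
    (a common convention; others exist in the literature).
    """
    H = np.zeros((n, n), dtype=complex)
    E = set(edges)
    for (u, v) in E:
        if (v, u) in E:
            H[u, v] = H[v, u] = 1      # reciprocated edge
        else:
            H[u, v] = 1j               # one-way edge u -> v
            H[v, u] = -1j
    return H
```

By construction `H` equals its conjugate transpose, so its eigenvalues are real, which is what makes spectral methods on directed graphs tractable with this representation.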
Neural Computation
Principal component analysis (PCA) is a widely used method for data processing tasks such as dimension reduction and visualization. Standard PCA is known to be sensitive to outliers, and various robust PCA methods have been proposed. It has been shown that the robustness of many statistical methods can be improved by using mode estimation instead of mean estimation, because mode estimation is not significantly affected by the presence of outliers. This study therefore proposes modal principal component analysis (MPCA), a robust PCA method based on mode estimation. The proposed method finds the minor component by estimating the mode of the projected data points. As theoretical contributions, the probabilistic convergence property, the influence function, and the finite-sample breakdown point together with its lower bound are derived for the proposed MPCA. The experimental results show that the proposed method has advantages over conventional methods.
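The robustness argument hinges on the mode being insensitive to outliers. One standard way to estimate a mode is Gaussian-kernel mean-shift iteration; the toy sketch below illustrates why a mode estimate stays put while the mean is dragged by outliers (an illustration of the general principle, not the paper's exact estimator):

```python
import numpy as np

def kde_mode(x, bandwidth, n_iter=100):
    """Estimate the mode of 1-D data by mean-shift fixed-point
    iterations with a Gaussian kernel (a simple textbook scheme)."""
    m = np.median(x)  # robust starting point
    for _ in range(n_iter):
        # Kernel weights: points far from the current estimate
        # (e.g. outliers) get exponentially small weight.
        w = np.exp(-0.5 * ((x - m) / bandwidth) ** 2)
        m_new = np.sum(w * x) / np.sum(w)
        if abs(m_new - m) < 1e-10:
            break
        m = m_new
    return m
```

On data with nine points at 0 and one outlier at 100, the sample mean is 10, while the mode estimate remains at 0 because the outlier receives negligible kernel weight.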
Information Geometry
Modal linear regression (MLR) models the conditional mode of a response as a linear predictor of explanatory variables. It is an effective approach for dealing with response variables that have a multimodal distribution or are contaminated by outliers. Because of the semiparametric nature of MLR, constructing a statistical model manifold is difficult with the conventional approach. To overcome this difficulty, we first consider the information-geometric perspective of the modal expectation–maximization (EM) algorithm. Based on this perspective, model manifolds for MLR are constructed from observations, and a data manifold is constructed from the empirical distribution. In this paper, the em algorithm of MLR, a geometric formulation of the EM algorithm, is shown to be equivalent to the conventional EM algorithm of MLR. The robustness of the MLR model is also discussed in terms of the influence function and information geometry.
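The modal EM iteration referenced here alternates two simple steps: an E-step that assigns each observation a Gaussian-kernel weight based on its residual, and an M-step that refits the coefficients by weighted least squares. A minimal sketch under those standard assumptions (the function name is illustrative, and this is not the authors' implementation):

```python
import numpy as np

def modal_linear_regression(X, y, bandwidth, n_iter=50):
    """Modal EM for modal linear regression (sketch).

    E-step: weight each point by a Gaussian kernel on its residual,
    so points far from the current modal line are down-weighted.
    M-step: weighted least-squares update of the coefficients.
    """
    Xb = np.column_stack([np.ones(len(y)), X])     # add intercept
    beta = np.linalg.lstsq(Xb, y, rcond=None)[0]   # OLS initialization
    for _ in range(n_iter):
        r = y - Xb @ beta
        w = np.exp(-0.5 * (r / bandwidth) ** 2)            # E-step
        WX = w[:, None] * Xb
        beta = np.linalg.solve(Xb.T @ WX, Xb.T @ (w * y))  # M-step
    return beta
```

On noiseless data generated as y = 2 + 3x the iteration recovers the coefficients (2, 3) exactly, and on contaminated data the kernel weights steer the fit toward the conditional mode rather than the conditional mean.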
Proceedings of the 25th International Conference on Neural Information Processing (ICONIP 2018)
Modal linear regression (MLR) is a standard method for modeling the conditional mode of a response variable using a linear combination of explanatory variables. It is effective for response variables with an asymmetric, multimodal distribution. Because of the nonparametric nature of MLR, it is difficult to construct a statistical model manifold in the sense of information geometry. In this work, a model manifold is constructed using observations instead of explicit parametric models. We also propose a method for constructing a data manifold based on an empirical distribution. The em algorithm of MLR, a geometric formulation of the EM algorithm, is shown to be equivalent to the conventional EM algorithm of MLR.
The Interim Conference of the Asian Regional Section of the International Association for Statistical Computing
Graph representation matrices are essential tools in graph data analysis. Recently, Hermitian adjacency matrices have been proposed to study the structure of directed graphs. Previous studies have demonstrated that these matrices can extract valuable information for clustering. In this paper, we introduce the complex non-backtracking (CNBT) matrix corresponding to a Hermitian adjacency matrix. The proposed matrix has properties similar to those of the non-backtracking matrix of undirected graphs. We reveal relationships between the complex non-backtracking matrix and the Hermitian adjacency matrix. Furthermore, we experimentally show that our spectral clustering method based on the CNBT matrix outperforms an existing method, particularly on sparse directed graphs, which are common in real-world data, where the number of edges is only a few times the number of vertices.
The 28th Workshop on Information-Based Induction Sciences (IBIS)
The Wasserstein Weisfeiler-Lehman (WWL) kernel, which introduces optimal transport into the Weisfeiler-Lehman (WL) kernel based on vertex labels of graphs, shows excellent performance, but its high computational cost makes it difficult to apply to large-scale graph datasets. This work proposes a speed-up that exploits the label structure generated by the WL algorithm, and shows that the exact Wasserstein distance of the WWL kernel for discrete labels can be obtained in time proportional to the number of vertices.
The 27th Workshop on Information-Based Induction Sciences (IBIS)
We propose the complex non-backtracking (CNBT) matrix as a non-backtracking (NBT) matrix corresponding to the Hermitian adjacency matrix, one of the matrix representations of directed graphs. We show that several relations that hold between the adjacency matrix and the NBT matrix of undirected graphs also hold between the Hermitian adjacency matrix and the CNBT matrix, and demonstrate through numerical experiments that, for node clustering on sparse graphs, the proposed method enables more robust estimation than an existing method.
The 39th Technical Meeting on Information-Based Induction Sciences and Machine Learning (IBISML)
Principal component analysis is widely used as a data analysis method for tasks such as dimension reduction and visualization. However, it is known to be vulnerable to outliers, and various methods have been proposed to make it robust. For statistical methods in general, approaches that incorporate the concept of mode estimation have been proposed to improve robustness against outliers, but they had not previously been applied to principal component analysis. In this paper, we propose modal principal component analysis, which incorporates mode estimation into principal component analysis by focusing on the minor component direction rather than the principal component direction. We evaluate the theoretical properties of the proposed estimator in terms of convergence, the influence function, and the breakdown point, and confirm through numerical experiments that the proposed method performs comparably to existing methods and has superior performance in some situations.
The 32nd Technical Meeting on Information-Based Induction Sciences and Machine Learning (IBISML)
Modal linear regression (MLR) is a regression method based on a model in which the mode of the response is expressed as a linear predictor of the explanatory variables. One feature of MLR, unlike linear regression, is that it imposes weaker constraints in the sense that no parametric distribution is assumed. On the other hand, the absence of a parametric distributional assumption makes the construction of the model manifold and the data manifold nontrivial when viewing MLR in the framework of information geometry. In this paper, we propose using the observed data, which are generally used to construct the empirical distribution, to construct the model manifold instead. Through a study of the MEM algorithm, we also propose a way to construct the empirical distribution using the modal assumption on the error distribution in MLR. We obtain the result that the $em$ algorithm on the proposed model and data manifolds coincides with the E-step and M-step computations obtained when MLR is viewed in the framework of the EM algorithm.
The Graduate University for Advanced Studies
University of Tsukuba
University of Tsukuba
National Institute for Environmental Studies
Global AI Innovations Laboratory
KDDI CORPORATION
Added an article for the SOKENDAI Statistical Science Course Advent Calendar 2025
Fine-tuned the design, such as the spacing of each element
Fixed a bug with the SNS links in the footer
Added an article for the SOKENDAI Advent Calendar 2025
Organized information and migrated content
Initial setup of the Astro project