Background. Encoder-based language models have achieved strong performance on natural language understanding tasks. Many studies adopt the standard fine-tuning approach, which uses only the embeddings of the last layer. However, prior work has found that different layers capture different kinds of linguistic knowledge, suggesting that the last layer may not be optimal for every downstream task.
Methods. In this paper, we propose a new fine-tuning method: a layer-wise attention mechanism based on a pivot layer. The pivot layer is used to compute attention scores over the encoder layers, and we define three types of pivot layers. We also examine four attention functions and demonstrate experimentally that the choice of attention function plays an important role in layer-wise attention for fine-tuning.
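To make the mechanism concrete, the following is a minimal PyTorch sketch of layer-wise attention with a pivot layer. It is an illustration under assumptions, not the paper's exact formulation: the pivot here is simply the last encoder layer (`pivot_index=-1`), and a scaled dot product stands in for the attention function, which is only one of the four functions the paper examines; the names `LayerwiseAttention`, `pivot_index`, and `layer_states` are hypothetical.

```python
import torch
import torch.nn as nn


class LayerwiseAttention(nn.Module):
    """Attend over per-layer [CLS] embeddings using a pivot layer as the query.

    Hypothetical sketch: `pivot_index` selects the encoder layer acting as
    the pivot; scaled dot product is assumed as the attention function.
    """

    def __init__(self, hidden_size: int, pivot_index: int = -1):
        super().__init__()
        self.pivot_index = pivot_index
        self.query_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (batch, num_layers, hidden), e.g. the [CLS]
        # embedding taken from every encoder layer.
        pivot = layer_states[:, self.pivot_index, :]      # (batch, hidden)
        query = self.query_proj(pivot).unsqueeze(1)       # (batch, 1, hidden)
        # Scaled dot-product score between the pivot query and each layer.
        scores = (query * layer_states).sum(-1) / layer_states.size(-1) ** 0.5
        weights = torch.softmax(scores, dim=-1)           # (batch, num_layers)
        # Attention-weighted sum of layer embeddings -> pooled representation.
        return (weights.unsqueeze(-1) * layer_states).sum(dim=1)


# Usage: pool per-layer states from a 12-layer encoder with hidden size 768.
pooled = LayerwiseAttention(hidden_size=768, pivot_index=-1)(
    torch.randn(2, 12, 768))
print(pooled.shape)  # torch.Size([2, 768])
```

The pooled representation would then feed a task-specific classification head during fine-tuning, in place of the last layer's [CLS] embedding alone.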
Results. Our proposed mechanism outperforms the standard fine-tuning method and another recent method on the General Language Understanding Evaluation (GLUE) benchmark. By visualizing the attention distributions, we find that the last layer is not always preferable for every GLUE task, and that differences in the attention distribution affect task performance.