Java Agent Exploration — appendToSystemClassLoaderSearch Problems
This article describes the exploration of strange error reporting in Java Agent, including Java Agent error reporting, JVM principle, glibc thread safety, and pthread tls.
Background
Java agents are enabled in multiple Alibaba Cloud products for users. Thus, the overall Java Agent duration increases, and each agent stores each other in scenarios where multiple Java agents are used together, resulting in increased memory usage and resource consumption.
Microservices Engine (MSE) initiated a one-java-agent project that can collaborate with various Java agents. It supports more efficient and convenient bytecode injection.
Among them, each Java agent, as the plugin of the one-java-agent, is loaded through multi-thread startup in the premain phase, thus reducing the startup speed from O(n) to O(1) and the overall loading time of Java Agent.
Questions
However, the following errors were found during the one-java-agent premain phase of the new version of Agent verification:
2022-06-15 06:22:47 [oneagent plugin arms-agent start] ERROR c.a.o.plugin.PluginManagerImpl -start plugin error, name: arms-agent
com.alibaba.oneagent.plugin.PluginException: start error, agent jar::/home/admin/.opt/ArmsAgent/plugins/ArmsAgent/arms-bootstrap-1.7.0-SNAPSHOT.jar
at com.alibaba.oneagent.plugin.TraditionalPlugin.start(TraditionalPlugin.java:113)
at com.alibaba.oneagent.plugin.PluginManagerImpl.startOnePlugin(PluginManagerImpl.java:294)
at com.alibaba.oneagent.plugin.PluginManagerImpl.access$200(PluginManagerImpl.java:22)
at com.alibaba.oneagent.plugin.PluginManagerImpl$2.run(PluginManagerImpl.java:325)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.InternalError: null
at sun.instrument.InstrumentationImpl.appendToClassLoaderSearch0(Native Method)
at sun.instrument.InstrumentationImpl.appendToSystemClassLoaderSearch(InstrumentationImpl.java:200)
at com.alibaba.oneagent.plugin.TraditionalPlugin.start(TraditionalPlugin.java:100)
... 4 common frames omitted
2022-06-16 09:51:09 [oneagent plugin ahas-java-agent start] ERROR c.a.o.plugin.PluginManagerImpl -start plugin error, name: ahas-java-agent
com.alibaba.oneagent.plugin.PluginException: start error, agent jar::/home/admin/.opt/ArmsAgent/plugins/ahas-java-agent/ahas-java-agent.jar
at com.alibaba.oneagent.plugin.TraditionalPlugin.start(TraditionalPlugin.java:113)
at com.alibaba.oneagent.plugin.PluginManagerImpl.startOnePlugin(PluginManagerImpl.java:294)
at com.alibaba.oneagent.plugin.PluginManagerImpl.access$200(PluginManagerImpl.java:22)
at com.alibaba.oneagent.plugin.PluginManagerImpl$2.run(PluginManagerImpl.java:325)
at java.lang.Thread.run(Thread.java:855)
Caused by: java.lang.IllegalArgumentException: null
at sun.instrument.InstrumentationImpl.appendToClassLoaderSearch0(Native Method)
at sun.instrument.InstrumentationImpl.appendToSystemClassLoaderSearch(InstrumentationImpl.java:200)
at com.alibaba.oneagent.plugin.TraditionalPlugin.start(TraditionalPlugin.java:100)
... 4 common frames omitted
Those familiar with Java Agent may notice that this is an Instrumentation.appendToSystemClassLoaderSearch
call that reported an error.
First of all, the appendToSystemClassLoaderSearch
path does exist. Secondly, the real reason for this error is in the C++ part, which is difficult to troubleshoot.
Anyway, we still need to investigate why this mistake occurred.
First of all, let’s sort out the specific call process. The following analysis is based on this:
- Instrumentation.appendToSystemClassLoaderSearch (java)
- appendToClassLoaderSearch0 (JNI)
`- appendToClassLoaderSearch
|- AddToSystemClassLoaderSearch
| `-create_class_path_zip_entry
| `-stat
`-convertUft8ToPlatformString
`- iconv
Input the Logs and Confirm the Scene
Since this problem has a 10% probability of occurring in the container environment and is easy to reproduce, the latest code of dragonwell8 is used to add logs to confirm the scene.
First, add the log at the actual entrance of JNI, which is the appendToClassLoaderSearch
method entrance.
After adding the log, I found that the problem was even more confusing.
- When no error is reported, the
appendToClassLoaderSearch
entry will output. - When there is an error, the appendToClassLoaderSearch entry does not output. Does that mean no commands are executed here?
This is not the same as the error log. Is the stacktrace information fooling us?
After a tough night, I consulted colleagues from the Dragonwell team the next day. Here was what they did:
- tty->print_cr(“internal error”)
- If this is not available, use printf(“xxxn”);fflush(stdout)
After that, our logs can be output.
This was the first difficulty we met. The printf needs to add fflush to ensure successful output.
Code Analysis
After that, logs are added continuously. Finally, we found that create_class_path_zip_entry returned NULL.
What If I Can’t Find the Corresponding jar File?
Continue the troubleshooting and find that the stat error is reported. Return No such file or directory. As we mentioned earlier, the path of jarFile exists. Isn’t stat thread-safe?
After checking the document [1], we found that stat is thread-safe.
Looking back, I noticed that the path of stat was abnormal. Sometimes, the path was empty, and other times, the path was /home/admin/.opt/ArmsAgent/plugins/ahas-java-agent/ahas-java-agent.jarSHOT.jar. As seen at the end of the character, it was caused by two characters written into the same memory. Moreover, the length of the corresponding string has become an irregular number.
Then, the problem is clear to see. Let’s start looking for the generation of this string. This character is generated by the convertUft8ToPlatformString.
Is There a Problem with the Character Encoding Conversion?
Therefore, the logic of utf8ToPlatform is debugged. gdb is directly run on ECS to debug jvm and avoid frequent log addition and container restart.
The results show that under Linux, utf8ToPlatform is the direct memcpy, and the target address of memcpy is on the stack.
It seems the thread safety problem doesn’t exist.
After careful investigation, it was found it was related to the environment variables. The environment variable related to coding on ECS is LANG=en_US.UTF-8. The centos:7 does not have this environment variable by default on the container. As such, jvm reads ANSI_X3.4–1968.
This is the second difficulty. Environmental variables will affect the local encoding conversion.
With the preceding phenomena and codes, it is found that it is still necessary to go through iconv to switch from UTF-8 to ANSI_X3.4–1968 coding in the container environment.
It can also be inferred here that if LANG=en_US.UTF-8 is manually set in the container. This problem will not occur again. Additional verification confirms this.
Then, add the log. When finally confirm that it is iconv, the target string breaks down.
Does that mean the iconv is not thread-safe?
Iconv Is Not Thread-Safe.
We checked iconv’s documentation and found that it is not completely thread-safe.
Generally speaking, before iconv, you need to use iconv_open to open an iconv_t, and this iconv_t does not support the simultaneous use of multiple threads.
At this point, the problem has been clearly located since jvm has written iconv_t as a global variable. When multiple threads append, it is possible to call iconv at the same time, resulting in competition problems.
Here is the third difficulty. Iconv is not thread-safe.
How Can We Fix It?
Fix the one-java-agent First
It is easy to modify the Java code. You only need to add a lock.
However, there is a design problem here. The instrument objects have been scattered everywhere in the code. Now, it is suddenly necessary to add a lock. Almost all the places used need to be changed, and the cost of code transformation is high.
Finally, we solve it through the proxy class.
As such, other places only need to use InstrumentationWrapper, and such a problem will not be triggered.
Whether to Repair jvm
Then, we analyzed the code on the jvm side and found that the appendToClassLoaderSearch 0 method was not threaded-safe because iconv_t was not thread-safe. Can it be solved gracefully?
If it is a Java program, it can be solved by directly using ThreadLoal to store iconv_t.
However, on the cpp side, although C++ 11 supports thread_local, jdk8 does not use C++ 11 (you can refer to JEP). Besides, C++ 11 only supports thread_local set and get. Lifecycle management (such as initialization and destruction of thread_local) is not supported yet. For example, it is impossible to automatically recycle iconv_t resources at the end of the thread.
Let’s fall back to pthread. Since pthread provides thread-specific data, you can do similar things.
- Pthread_key_create creates the thread-local storage area.
- Pthread_setspecific is used to put values into thread-local storage.
- Pthread_getspecific is used to retrieve values from the thread-local storage.
- Most importantly, pthread_once meets the requirement that pthread_key_t can only be initialized once.
- In addition, the second parameter of pthread_once is the callback at the end of the thread, which can be used to close iconv_t to avoid resource leakage.
In short, pthread provides the full lifecycle management of thread_local. Therefore, the final code is listed below. The thread-local storage is initialized with make_key.
Therefore, after compiling the JDK, inputting mirroring and restarting pods several times in batches did not cause the problems mentioned at the beginning of the article.
Summary
There are many problems we have come across throughout this process, including Java, JNI/JVMTi, glibc, and pthread.
- Printf must be added with fflush to ensure successful output.
- Environment variables affect local character encoding conversions.
- Iconv is not thread-safe.
- Use pthread thread-local storage to implement full lifecycle management of thread-local variables
From this case, we restored the problem and proposed a solution according to the call stack and code. I hope this article help you learn more about Java/JVM.
References
[1] Document
https://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_09.html
[2] Links that one-java-agent fixes:
https://github.com/alibaba/one-java-agent/issues/31
[3] Links that Dragonwell fixes:
https://github.com/alibaba/dragonwell8/pull/346
[4] The one-java-agent brings us a convenient and non-intrusive microservice governance method:
https://www.alibabacloud.com/product/microservices-engine