For the past ten years, the open-source movement has been one of the key drivers of the IT industry's development, and its crucial component. The role of open-source projects is becoming more and more prominent not only in terms of quantity but also in terms of quality, which changes the very concept of how they are positioned on the IT market in general. Our courageous PVS-Studio team is not sitting idly and is taking an active part in strengthening the presence of open-source software by finding hidden bugs in the enormous depths of codebases and offering free license options to the authors of such projects. This article is just another piece of that activity! Today we are going to talk about Apache Hive. I've got the report — and there are things worth looking at.
About PVS-Studio
The static code analyzer PVS-Studio, which has been around for more than 10 years now, is a multi-functional and easy-to-integrate software solution. At present, it supports C, C++, C#, and Java and runs on Windows, Linux, and macOS.
PVS-Studio is a paid B2B solution used by numerous teams in a number of companies. If you want to try the analyzer, visit this page to download the distribution and request a trial key.
If you are an open-source geek or, say, a student, you can take advantage of one of our free license options.
About Apache Hive
The amount of data has been growing at an enormous rate over the past years. The standard databases can no longer cope with this rapid growth, which is where the term Big Data comes from along with other related notions (such as processing, storage, and other operations on big data).
Apache Hadoop is currently thought to be one of the pioneering Big Data technologies. Its primary tasks are storing, processing, and managing large amounts of data. The main components comprising the framework are Hadoop Common, HDFS, Hadoop MapReduce, and Hadoop YARN. Over time, a large ecosystem of related projects and technologies has developed around Hadoop, many of which originally started as part of the project and then budded off to become independent. Apache Hive is one of them.
Apache Hive is a distributed data warehouse. It manages the data stored in HDFS and provides the query language based on SQL (HiveQL) to handle that data. More detail on the project can be found here.
Running the analysis
It didn't take much effort or time to start the analysis. Here's my algorithm:
- Downloaded Apache Hive from GitHub;
- Read the guide on starting the Java analyzer and launched the analysis;
- Got the analyzer's report, studied it, and wrote out the most interesting cases.
The analysis results are as follows: 1456 warnings of High and Medium levels (602 and 854 respectively) on 6500+ files.
Not all warnings refer to genuine bugs. That's quite normal; you have to tweak the analyzer's settings before you start using it regularly. After that, you typically expect a fairly low rate of false positives (example).
I left out the 407 warnings (177 High- and 230 Medium-level) triggered by the test files. I also ignored the V6022 diagnostic (since you can't reliably distinguish between faulty and correct fragments when you are not familiar with the code), which was triggered as many as 482 times. Neither did I examine the 179 warnings generated by the V6021 diagnostic.
In the end, I still had enough warnings to go with, and since I didn't tweak the settings, there is still some percentage of false positives among them. There's just no point including too many warnings in an article like this :). So we'll just talk about what caught my eye and looked curious enough.
Predetermined conditions
Among the diagnostics examined for this analysis, V6007 holds a record for the number of issued warnings. A little over 200 messages!!! Some look harmless, others suspicious, and some others are genuine bugs after all! Let's take a look at some of them.
V6007 Expression 'key.startsWith(«hplsql.»)' is always true. Exec.java(675)
void initOptions()
{
....
if (key == null || value == null || !key.startsWith("hplsql.")) { // <=
continue;
}
else if (key.compareToIgnoreCase(Conf.CONN_DEFAULT) == 0) {
....
}
else if (key.startsWith("hplsql.conn.init.")) {
....
}
else if (key.startsWith(Conf.CONN_CONVERT)) {
....
}
else if (key.startsWith("hplsql.conn.")) {
....
}
else if (key.startsWith("hplsql.")) { // <=
....
}
}
That's quite a lengthy if-else-if construct! The analyzer doesn't like the condition in the last if (key.startsWith(«hplsql.»)) because if execution reaches it, it will mean it's true. Indeed, if you look at the first line of this whole if-else-if construct, you'll see that it already contains the opposite check, so if the string doesn't begin with «hplsql.», execution will immediately skip over to the next iteration.
V6007 Expression 'columnNameProperty.length() == 0' is always false. OrcRecordUpdater.java(238)
private static
TypeDescription getTypeDescriptionFromTableProperties(....)
{
....
if (tableProperties != null) {
final String columnNameProperty = ....;
final String columnTypeProperty = ....;
if ( !Strings.isNullOrEmpty(columnNameProperty)
&& !Strings.isNullOrEmpty(columnTypeProperty))
{
List<String> columnNames = columnNameProperty.length() == 0
? new ArrayList<String>()
: ....;
List<TypeInfo> columnTypes = columnTypeProperty.length() == 0
? new ArrayList<TypeInfo>()
: ....;
....
}
}
}
....
}
The comparison of the columnNameProperty string's length with zero will always return false. This happens because this comparison follows the !Strings.isNullOrEmpty(columnNameProperty) check. So if execution reaches our condition, it will mean that the columnNameProperty string is surely neither null nor empty.
The same is true for the columnTypeProperty string one line later:
- V6007 Expression 'columnTypeProperty.length() == 0' is always false. OrcRecordUpdater.java(239)
V6007 Expression 'colOrScalar1.equals(«Column»)' is always false. GenVectorCode.java(3469)
private void
generateDateTimeArithmeticIntervalYearMonth(String[] tdesc) throws Exception
{
....
String colOrScalar1 = tdesc[4];
....
String colOrScalar2 = tdesc[6];
....
if (colOrScalar1.equals("Col") && colOrScalar1.equals("Column")) // <=
{
....
} else if (colOrScalar1.equals("Col") && colOrScalar1.equals("Scalar"))
{
....
} else if (colOrScalar1.equals("Scalar") && colOrScalar1.equals("Column"))
{
....
}
}
The good old copy-paste. From the viewpoint of the current logic, the string colOrScalar1 might have two different values at once, which is impossible. Obviously, the checks should have the variable colOrScalar1 on the left and colOrScalar2 on the right.
Similar warnings a few lines below:
- V6007 Expression 'colOrScalar1.equals(«Scalar»)' is always false. GenVectorCode.java(3475)
- V6007 Expression 'colOrScalar1.equals(«Column»)' is always false. GenVectorCode.java(3486)
As a result, this if-else-if construct will never do anything.
A few more V6007 warnings:
- V6007 Expression 'characters == null' is always false. RandomTypeUtil.java(43)
- V6007 Expression 'writeIdHwm > 0' is always false. TxnHandler.java(1603)
- V6007 Expression 'fields.equals("*")' is always true. Server.java(983)
- V6007 Expression 'currentGroups != null' is always true. GenericUDFCurrentGroups.java(90)
- V6007 Expression 'this.wh == null' is always false. New returns not-null reference. StorageBasedAuthorizationProvider.java(93), StorageBasedAuthorizationProvider.java(92)
- and so on...
NPE
V6008 Potential null dereference of 'dagLock'. QueryTracker.java(557), QueryTracker.java(553)
private void handleFragmentCompleteExternalQuery(QueryInfo queryInfo)
{
if (queryInfo.isExternalQuery())
{
ReadWriteLock dagLock = getDagLock(queryInfo.getQueryIdentifier());
if (dagLock == null) {
LOG.warn("Ignoring fragment completion for unknown query: {}",
queryInfo.getQueryIdentifier());
}
boolean locked = dagLock.writeLock().tryLock();
.....
}
}
A null object is caught, logged, and… the program just keeps running. As a result, the check is followed by a null pointer dereference. Ouch!
The developers must have actually wanted the program to exit the function or throw some special exception in the case of getting a null reference.
V6008 Null dereference of 'buffer' in function 'unlockSingleBuffer'. MetadataCache.java(410), MetadataCache.java(465)
private boolean lockBuffer(LlapBufferOrBuffers buffers, ....)
{
LlapAllocatorBuffer buffer = buffers.getSingleLlapBuffer();
if (buffer != null) { // <=
return lockOneBuffer(buffer, doNotifyPolicy);
}
LlapAllocatorBuffer[] bufferArray = buffers.getMultipleLlapBuffers();
for (int i = 0; i < bufferArray.length; ++i) {
if (lockOneBuffer(bufferArray[i], doNotifyPolicy)) continue;
for (int j = 0; j < i; ++j) {
unlockSingleBuffer(buffer, true); // <=
}
....
}
....
}
....
private void unlockSingleBuffer(LlapAllocatorBuffer buffer, ....) {
boolean isLastDecref = (buffer.decRef() == 0); // <=
if (isLastDecref) {
....
}
}
Another potential NPE. If execution reaches the unlockSingleBuffer method, it will mean the buffer object is null. Suppose that's what has happened! If you look at the unlockSingleBuffer method, you'll notice how our object is dereferenced right in the first line. Gotcha!
A shift gone wild
V6034 Shift by the value of 'bitShiftsInWord — 1' could be inconsistent with the size of type: 'bitShiftsInWord — 1' = [-1… 30]. UnsignedInt128.java(1791)
private void shiftRightDestructive(int wordShifts,
int bitShiftsInWord,
boolean roundUp)
{
if (wordShifts == 0 && bitShiftsInWord == 0) {
return;
}
assert (wordShifts >= 0);
assert (bitShiftsInWord >= 0);
assert (bitShiftsInWord < 32);
if (wordShifts >= 4) {
zeroClear();
return;
}
final int shiftRestore = 32 - bitShiftsInWord;
// check this because "123 << 32" will be 123.
final boolean noRestore = bitShiftsInWord == 0;
final int roundCarryNoRestoreMask = 1 << 31;
final int roundCarryMask = (1 << (bitShiftsInWord - 1)); // <=
....
}
This is a potential shift by -1. If the method is called with, say, wordShifts == 3 and bitShiftsInWord == 0, the reported line will end up with 1 << -1. Is that a planned behavior?
V6034 Shift by the value of 'j' could be inconsistent with the size of type: 'j' = [0… 63]. IoTrace.java(272)
public void logSargResult(int stripeIx, boolean[] rgsToRead)
{
....
for (int i = 0, valOffset = 0; i < elements; ++i, valOffset += 64) {
long val = 0;
for (int j = 0; j < 64; ++j) {
int ix = valOffset + j;
if (rgsToRead.length == ix) break;
if (!rgsToRead[ix]) continue;
val = val | (1 << j); // <=
}
....
}
....
}
In the reported line, the j variable can have a value within the range [0… 63]. Because of that, calculation of the value of val in the loop may run in an unexpected way. In the (1 << j) expression, the value 1 is of type int, so shifting it by 32 bits and more takes us beyond the limits of the type's range. This can be fixed by writing ((long)1 << j).
Getting carried away by logging
V6046 Incorrect format. A different number of format items is expected. Arguments not used: 1, 2. StatsSources.java(89)
private static
ImmutableList<PersistedRuntimeStats> extractStatsFromPlanMapper (....) {
....
if (stat.size() > 1 || sig.size() > 1)
{
StringBuffer sb = new StringBuffer();
sb.append(String.format(
"expected(stat-sig) 1-1, got {}-{} ;", // <=
stat.size(),
sig.size()
));
....
}
....
if (e.getAll(OperatorStats.IncorrectRuntimeStatsMarker.class).size() > 0)
{
LOG.debug(
"Ignoring {}, marked with OperatorStats.IncorrectRuntimeStatsMarker",
sig.get(0)
);
continue;
}
....
}
When writing the code to format the string using String.format(), the developer used wrong syntax. As a result, the passed parameters never made it to the resulting string. My guess is that the developer had been working on logging before writing this, which is where they borrowed the syntax from.
A stolen exception
V6051 The use of the 'return' statement in the 'finally' block can lead to the loss of unhandled exceptions. ObjectStore.java(9080)
private List<MPartitionColumnStatistics>
getMPartitionColumnStatistics(....)
throws NoSuchObjectException, MetaException
{
boolean committed = false;
try {
.... /*some actions*/
committed = commitTransaction();
return result;
}
catch (Exception ex)
{
LOG.error("Error retrieving statistics via jdo", ex);
if (ex instanceof MetaException) {
throw (MetaException) ex;
}
throw new MetaException(ex.getMessage());
}
finally
{
if (!committed) {
rollbackTransaction();
return Lists.newArrayList();
}
}
}
Returning anything from the finally block is a very bad practice, and this example vividly shows why.
In the try block, the program is forming a request and accessing the storage. The committed variable has the value false by default and changes its state only after all the previous actions in the try block have been successfully executed. It means that if an exception is raised, that variable will always be false. The catch block will catch the exception, adjust it a bit, and throw it on. So when it's the turn of the finally block, execution will enter the condition from which an empty list will be returned. What does this return cost us? Well, it costs us preventing any caught exception from being thrown on to the outside where it could be properly handled. None of the exceptions specified in the method's signature will ever be thrown; they are simply misleading.
A similar diagnostic message:
- V6051 The use of the 'return' statement in the 'finally' block can lead to the loss of unhandled exceptions. ObjectStore.java(808)
Miscellaneous
V6009 Function 'compareTo' receives an odd argument. An object 'o2.getWorkerIdentity()' is used as an argument to its own method. LlapFixedRegistryImpl.java(244)
@Override
public List<LlapServiceInstance> getAllInstancesOrdered(....) {
....
Collections.sort(list, new Comparator<LlapServiceInstance>() {
@Override
public int compare(LlapServiceInstance o1, LlapServiceInstance o2) {
return o2.getWorkerIdentity().compareTo(o2.getWorkerIdentity()); // <=
}
});
....
}
There could be a number of causes leading to such a silly mistake: copy-paste, carelessness, hurry, and so on. We often see errors like that in open-source projects and even have a whole article about that.
V6020 Divide by zero. The range of the 'divisor' denominator values includes zero. SqlMathUtil.java(265)
public static long divideUnsignedLong(long dividend, long divisor) {
if (divisor < 0L) {
/*some comments*/
return (compareUnsignedLong(dividend, divisor)) < 0 ? 0L : 1L;
}
if (dividend >= 0) { // Both inputs non-negative
return dividend / divisor; // <=
} else {
....
}
}
This one is quite trivial. A series of checks was helpless to avert the division by zero.
A few more warnings:
- V6020 Mod by zero. The range of the 'divisor' denominator values includes zero. SqlMathUtil.java(309)
- V6020 Divide by zero. The range of the 'divisor' denominator values includes zero. SqlMathUtil.java(276)
- V6020 Divide by zero. The range of the 'divisor' denominator values includes zero. SqlMathUtil.java(312)
V6030 The method located to the right of the '|' operator will be called regardless of the value of the left operand. Perhaps, it is better to use '||'. OperatorUtils.java(573)
public static Operator<? extends OperatorDesc> findSourceRS(....)
{
....
List<Operator<? extends OperatorDesc>> parents = ....;
if (parents == null | parents.isEmpty()) {
// reached end e.g. TS operator
return null;
}
....
}
The programmer wrote the bitwise operator | instead of the logical ||. It means the right part will be executed no matter the result of the left one. If parents == null, this typo will end up with an NPE right in the next logical subexpression.
V6042 The expression is checked for compatibility with type 'A' but is cast to type 'B'. VectorColumnAssignFactory.java(347)
public static
VectorColumnAssign buildObjectAssign(VectorizedRowBatch outputBatch,
int outColIndex,
PrimitiveCategory category)
throws HiveException
{
VectorColumnAssign outVCA = null;
ColumnVector destCol = outputBatch.cols[outColIndex];
if (destCol == null) {
....
}
else if (destCol instanceof LongColumnVector)
{
switch(category) {
....
case LONG:
outVCA = new VectorLongColumnAssign() {
....
} .init(.... , (LongColumnVector) destCol);
break;
case TIMESTAMP:
outVCA = new VectorTimestampColumnAssign() {
....
}.init(...., (TimestampColumnVector) destCol); // <=
break;
case DATE:
outVCA = new VectorLongColumnAssign() {
....
} .init(...., (LongColumnVector) destCol);
break;
case INTERVAL_YEAR_MONTH:
outVCA = new VectorLongColumnAssign() {
....
}.init(...., (LongColumnVector) destCol);
break;
case INTERVAL_DAY_TIME:
outVCA = new VectorIntervalDayTimeColumnAssign() {
....
}.init(...., (IntervalDayTimeColumnVector) destCol);// <=
break;
default:
throw new HiveException(....);
}
}
else if (destCol instanceof DoubleColumnVector) {
....
}
....
else {
throw new HiveException(....);
}
return outVCA;
}
We are interested in the classes LongColumnVector extends ColumnVector and TimestampColumnVector extends ColumnVector. The check that the destCol object is an instance of LongColumnVector explicitly suggests that it is an object of this class that will be dealt with in the body of the conditional statement. Despite that, however, it is still cast to TimestampColumnVector! As you can see, these are different classes except that they are derived from the same parent. As a result, we get a ClassCastException.
The same is true in the case of casting to IntervalDayTimeColumnVector:
- V6042 The expression is checked for compatibility with type 'A' but is cast to type 'B'. VectorColumnAssignFactory.java(390)
V6060 The 'var' reference was utilized before it was verified against null. Var.java(402), Var.java(395)
@Override
public boolean equals(Object obj)
{
if (getClass() != obj.getClass()) { // <=
return false;
}
Var var = (Var)obj;
if (this == var)
{
return true;
}
else if (var == null || // <=
var.value == null ||
this.value == null)
{
return false;
}
....
}
Here you see a strange check of the var object for null after the dereference has already occurred. In this context, var and obj are the same object (var = (Var)obj). The presence of the null check implies that the passed object may be null. So calling equals(null) will result in an NPE, instead of the expected false, right at the first line. Yes, the check is there, but, sadly, it's in the wrong place.
A few other similar cases, where an object is used before the check:
- V6060 The 'value' reference was utilized before it was verified against null. ParquetRecordReaderWrapper.java(168), ParquetRecordReaderWrapper.java(166)
- V6060 The 'defaultConstraintCols' reference was utilized before it was verified against null. HiveMetaStore.java(2539), HiveMetaStore.java(2530)
- V6060 The 'projIndxLst' reference was utilized before it was verified against null. RelOptHiveTable.java(683), RelOptHiveTable.java(682)
- V6060 The 'oldp' reference was utilized before it was verified against null. ObjectStore.java(4343), ObjectStore.java(4339)
- and so on...
Conclusion
If you ever interested yourself in Big Data if only a bit, then you could hardly be unaware of how significant Apache Hive is. This is a popular project, and quite a big one, comprised of over 6500 source files (*.java). Many developers have been writing it over many years, which means there are a lot of things for a static analyzer to find there. It just proves one more time that static analysis is extremely important and useful when developing medium and large projects!
Note. Single-time checks like the one I did here are good for showcasing the analyzer's capabilities but are a totally improper scenario of using it. This idea is elaborated on here and here. Static analysis is to be used regularly!
This check of Hive revealed quite a few defects and suspicious fragments. If the authors of Apache Hive come across this article, we'll be glad to help with the hard job of improving the project.
You can't imagine Apache Hive without Apache Hadoop, so the Unicorn from PVS-Studio may well pay a visit to that one too. But that's all for today. Meanwhile, I invite you to download the analyzer and check your own projects.