ACCURACY ASSESSMENT OF DIFFERENT SOIL DATABASES CONCERNING WRB REFERENCE SOIL GROUPS

As a result of international cooperation, the conditions of data access and data usage have improved significantly during the last two decades. The establishment of web-based geoinformatic infrastructure has also allowed researchers to share their results with the international scientific community more efficiently. The aim of this study is to investigate the accuracy of databases with different spatial resolutions, using the reference profiles of the LUCAS topsoil database. We investigated the accuracy of the World Reference Base for Soil Resources (WRB) Reference Soil Groups (RSG) stored in freely accessible soil databases (European Soil Database (ESDB), International Soil Reference and Information Centre (ISRIC)) in Hungary. We used the Kappa Index of Agreement (KIA) to evaluate accuracy. The European and the international databases showed KIA values of 0.9643 and 0.3968, respectively, which indicates that the continental scale database tends to be more accurate. Considering the results, we can conclude that spatial resolution has a relevant impact on the accuracy of databases; however, the study should be extended to the national level and the indices should be assessed together.


Introduction
Recently the role of online and offline regional and continental spatial soil information systems has become increasingly important. During the past decade, due to the significant increase in European soil data, both national and international soil research produced substantial results (Panagos, 2006; Szabó - Czellér, 2009; Pásztor et al., 2010; Várallyay, 2012; Botos et al., 2015). The thematic mapping workflows, which were based on soil surveys, database harmonization and available data in the European Union, provided valuable information for soil and other environmental investigations on the regional and higher levels (Reuter et al., 2008; Grunwald et al., 2011; Panagos et al., 2013). The European Soil Data Centre (ESDAC) and the International Soil Reference and Information Centre (ISRIC) contributed significantly to this process. They archive and publish soil data with various contents and levels of detail, documents, data-based applications and digitally archived maps (Tóth, 2013).
The differences and connections between the World Reference Base for Soil Resources (hereafter referred to as WRB) and the Hungarian soil classification system were first summarized by Michéli et al. (2006) and Krasilnikov et al. (2009), who also established correlation keys primarily based on field experience and the definitions of the classification units. However, they pointed out that the classes of the two systems cannot be matched directly due to their different approaches and methodologies. Fehér et al. (2006) investigated the relationship between the Hungarian classification and WRB for soils formed on volcanic rocks, primarily in the context of micromorphological, chemical and physical parameters. Based on their results, they also emphasized that the Hungarian and the WRB classification systems cannot be matched directly; in several cases, soils that belong to the same national soil group were assigned to different WRB reference groups based on the classification criteria. Barta et al. (2009) arrived at a similar conclusion based on the investigation of five Hungarian rendzina soil profiles. Michéli et al. (2011) developed simplified WRB algorithms which can be applied to every Soil Information and Monitoring System (TIM) database with the purpose of identifying the soil units of the national areas. The TIM is an independent subsystem of the Integrated Environmental Information and Monitoring System (KIM). Based on physiographical-soil-ecological units, 1200 representative observation points have been selected: 800 points on agricultural land, 200 points in forests and 200 points in environmentally threatened 'hot spot' regions (Várallyay, 2002).
The aim of this study is to investigate the accuracy of European open-access soil databases with different spatial resolutions, using the reference profiles of the LUCAS topsoil database. We investigated the correspondence of the World Reference Base for Soil Resources (WRB) Reference Soil Groups (RSG) stored in the European Soil Database (ESDB) and the International Soil Reference and Information Centre (ISRIC) database in a study area covering Hungary. We attempted to answer how accurate the information is that the selected databases provide on the WRB RSGs stored in them.

Study area
The study area covers Hungary, which is located in the Carpathian Basin. Soils of Hungary are very diverse due to the array of soil forming factors in the different geographic areas (Stefanovits, 1963; Szabolcs, 1966). The elevation of more than half of the country is less than 200 m, and only 2% lies above 400 m above sea level. Most of the current topography is a result of neo-tectonic activities and periglacial processes during the Quaternary period. The low elevation areas are mainly covered by aeolian and alluvial materials and the higher areas derive from older sedimentary and volcanic rocks (Michéli et al., 2006).
Figure 1: Location of study area and investigated soil points
The Hungarian Soil Classification System (HSCS) is based on the genetic principles of Dokuchaev (Stefanovits, 1963; Szabolcs, 1966). The HSCS was developed in the 1960s. The major soil types are the highest categories and they are based on climatic, geographical and genetic features. Subtypes and varieties are distinguished according to the assumed dominance of soil forming processes and observable/measurable morphogenetic properties (Stefanovits, 1999).
The WRB is the international standard soil classification system endorsed by the International Union of Soil Sciences. It was developed by an international collaboration coordinated by the IUSS Working Group. The WRB is based on a diagnostic approach. A total of 32 Reference Soil Groups (RSG) are defined by a key based on the presence, sequence or exclusion of diagnostic horizons, properties and/or materials. The lower levels are defined by qualifiers added to the names of the reference soil groups for specific soil characteristics (IUSS WRB 2014). The European Commission also selected the WRB as the correlation scheme for harmonized soil maps and databases for Europe (Láng et al., 2010; Fuchs et al., 2011; Láng et al., 2013; Michéli et al., 2014). Of the 32 reference groups of WRB, the Cryosol, Plinthosol, Nitisol, Ferralsol, Gypsisol, Durisol, Acrisol and Lixisol soils are extremely rare or cannot be found at all in Hungary or in the Carpathian Basin (Novák, 2013).

Main characteristics of the examined soil databases
The European soil databases can be classified based on their contents. These include databases and maps containing general soil data (e.g. ESDB v2.0, LUCAS topsoil database, BioSoil database), hydropedological thematic databases (European Hydropedological Data Inventory) and maps with thematic soil features (e.g. Updated Map of Salt Affected Soils in the European Union, Soil pH in Europe, Maps of total heavy metal contents in European topsoils). These databases are suitable not only for the comparison of soil features or digital maps, but also provide important information for soil and other environmental investigations on the regional and higher levels (Tóth, 2013).
During our investigation we used the WRB reference groups archived in the LUCAS, the ESDB and the ISRIC databases. Since the investigated soil databases archive data with the same content but in different data structures, it is necessary to describe their most important characteristics.
The LUCAS program was the first soil survey in Europe which was conducted according to standardized sampling principles. During the project the researchers collected topsoil (0-20 cm) samples from approximately 22000 locations in collaboration with the European Union and Iceland (Tóth et al., 2013a). The sampling covered all of the major land use types, usually in proportion to the regional distribution of each land use type within the member countries, except for the soil samples collected from plough lands, which intentionally have a higher ratio than their regional percentage in each country. From Hungary, 497 samples have been entered into the database. Of these samples, 314 were collected from plough lands, 6 from vineyards and orchards, 60 from forests, 9 from shrubland, 104 from grasslands and 4 from areas with other land cover types (Nocita et al., 2013; Panagos et al., 2012b; Tóth et al., 2013b). Data are available about land use/land cover and soil types of the sampling locations as well (LUCAS, 2012).
The European Soil Database (ESDB) is one of the most commonly used European soil databases (Panagos, 2006; Panagos et al., 2012a). It consists of four separate databases: the Soil Geographical Database of Eurasia at scale 1:10000000 (SGDBE) (Lambert et al., 2003), the Soil Profile Analytical Database of Europa (SPADBE) (Van Liedekerke - Panagos, 2009), the Pedotransfer Rules Database (PTRDB) and the Database of Hydraulic Properties of European Soils (HYPRES) (Wöstena et al., 1999).
The SGDBE is a map database, originally at continental scale (but compiled primarily from national maps), which provides a full layer for geographical Europe both in terms of soil units and other features. The areas of the SGDBE maps with the same characteristics, the so-called soil mapping units (SMU), are described by soil typological units (STU). A soil typological unit is uniquely defined by the taxonomic classification of the given soil and some of its major characteristics, the typical land use and geographical location, and the codes that represent them. According to the data available for the European areas of the SGDBE, 23 of the 30 possible reference groups of the 1998 version of WRB (FAO, 1998) can be found on the continent, and 16 of them are represented in Hungary (Panagos et al., 2012a).
The International Soil Reference and Information Centre (ISRIC) has a mission to serve the international community with information about the world's soil resources to help address major global issues. ISRIC maintains a collection of monoliths with morphological and analytical data that represent the main soil reference groups of the World Reference Base for Soil Resources (WRB) (Batjes, 1995; ISRIC, 2013).
SoilGrids1km, which is a product of ISRIC, is the first approximation of predictions of soil properties and soil classes for a global soil mask using automated global soil mapping at a resolution of 1 km. SoilGrids contains 3D predictions and associated prediction accuracies of basic soil properties, following the GlobalSoilMap specifications: organic carbon, pH, texture fractions, coarse fragments, bulk density, depth to bedrock (R horizon) and cation exchange capacity (CEC) at six standard depths, and predictions for soil types based on the FAO's World Reference Base classes and USDA's Soil Taxonomy classes (Hengl et al., 2014). These maps will be updated on a regular basis and improved using additional contributed data (soil profiles and covariate layers). The predictions available for download at www.soilgrids.org are characterized by limited thematic and spatial accuracy and contain artefacts and missing pixels (Montanarella - Vargas, 2012). SoilGrids is a collection of updatable soil property and class maps of the world, initially at a resolution of 1 km, produced using state-of-the-art model-based statistical methods.

As Table 1. shows, the examined soil databases are available in different formats and scales. LUCAS was used as a reference, because the database stores data as point objects associated with general soil characteristics and coordinates. The ESDB database is available in vector format, and RSGs with WRB qualifiers are stored as polygon features. In the case of ISRIC we used a kml file that contained a raster image (map) of the RSGs all around the world. Thus all three databases store WRB RSG features, but with different spatial accuracy.

Workflow
Although the databases are freely available, they are stored in different formats, thus we had to convert the data to a uniform format in order to compare them. They were imported to ArcGIS and converted to shapefiles with point geometry, and an attribute field was added which contained a simple code for the soil type according to the WRB classification. The ISRIC data was special in the sense that it is available in .kml format, but in the form of a ground overlay (the GroundOverlay tag used in kml places a raster image on the map), which cannot be converted to a shapefile; hence we had to gather the WRB classification data by visual interpretation. As a result, 3 shapefiles with point geometry were produced. They were imported as Idrisi vector files and then converted to raster, because crosstabulation can be applied to raster images. Fig. 2. delineates the described workflow in detail.
In order to estimate the agreement of the examined soil databases we applied the method of evaluating error matrices so as to determine derived indicators of correspondence between the reference data and the formerly mentioned databases. We assigned the LUCAS database as a reference due to the fact that it is based on the results of field surveys.
Fig. 2. Steps of the GIS and statistical process of the investigation
According to Congalton (1991), an error matrix is a square array of numbers set out in rows and columns that represent the number of sample units, which are pixels characterized by WRB categories in this case. The crosstabulation results in a square matrix, whose columns usually represent the reference data while rows indicate the comparison data (Gopal - Woodcock, 1994); in our case ISRIC and ESDB served as comparison data in two separate error matrices. We used Idrisi Selva for setting up the matrices and then computed overall accuracy, user's/producer's accuracy (UA/PA) and the Kappa Agreement Index in Microsoft Excel; Idrisi is also capable of computing the Kappa Agreement Index, thus it could be used for checking our calculations.
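The crosstabulation step can be illustrated with a minimal Python sketch. It is not the Idrisi procedure used in the study: the function name `error_matrix` and the six sample labels are hypothetical and only show the layout described above (rows: comparison data, columns: reference data).

```python
from collections import Counter


def error_matrix(reference, comparison, categories=None):
    """Cross-tabulate paired class labels into a square error matrix.

    Rows hold the comparison (map) classes and columns hold the
    reference classes.
    """
    if len(reference) != len(comparison):
        raise ValueError("label sequences must have the same length")
    if categories is None:
        # Use every class that occurs in either data set.
        categories = sorted(set(reference) | set(comparison))
    counts = Counter(zip(comparison, reference))
    matrix = [[counts[(row, col)] for col in categories] for row in categories]
    return categories, matrix


# Hypothetical labels for six sampling points (not data from the study).
lucas = ["Chernozem", "Chernozem", "Gleysol", "Gleysol", "Arenosol", "Arenosol"]
mapped = ["Chernozem", "Chernozem", "Gleysol", "Chernozem", "Arenosol", "Gleysol"]

cats, m = error_matrix(lucas, mapped)
for name, row in zip(cats, m):
    print(f"{name:<10}", row)
```

Including the classes of both data sets keeps the matrix square even when a class is missing from one of them, which is the situation reported for some reference groups in the Results.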
Once the error matrix is set up, there are different descriptive and analytical methods for gaining information about the correspondence of the compared databases, of which the simplest is overall accuracy. It can be computed by dividing the total number of correct entries (i.e. the pixels classified correctly according to the reference data) by the total number of pixels (Congalton, 1991). Furthermore, for each category we can calculate UA by dividing the number of correct entries of the given category by its row total, and PA by dividing the number of correct entries of the given category by its column total. UA indicates the probability that a pixel labelled as a given category actually belongs to that category, while PA indicates the probability that a pixel which is known to belong to a given category is accurately labelled as that category (Story - Congalton, 1986).
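A minimal Python sketch of these three measures is given below. The function name `accuracy_measures` and the 3x3 matrix are hypothetical (not data from the study); rows are assumed to hold the comparison data and columns the reference data.

```python
def accuracy_measures(matrix):
    """Return overall accuracy, user's and producer's accuracies.

    matrix[i][j] counts the pixels mapped as class i (row) whose
    reference class is j (column).
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    diagonal = [matrix[i][i] for i in range(k)]
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    overall = sum(diagonal) / total
    # A class with an empty row or column has no defined accuracy.
    users = [d / t if t else None for d, t in zip(diagonal, row_totals)]
    producers = [d / t if t else None for d, t in zip(diagonal, col_totals)]
    return overall, users, producers


# Hypothetical error matrix for three classes.
example = [
    [1, 0, 0],
    [0, 2, 1],
    [1, 0, 1],
]
oa, ua, pa = accuracy_measures(example)
print(f"OA = {oa:.3f}")
print("UA =", ua)
print("PA =", pa)
```

Returning `None` for a class with an empty row or column is a design choice of this sketch; it avoids reporting a misleading accuracy value for a class that is absent from one of the data sets.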
The Kappa Agreement Index was introduced by Cohen (1960) and was later adopted for remote sensing accuracy assessment applications (Rosenfield and Fitzpatrick-Lins, 1986). It uses not only the diagonal, but all values of the error matrix, and it is based on the difference between the agreement that is actually present and the agreement that would be expected by chance alone. According to an interpretation of Kappa, agreement above a value of 0.61 can be regarded as substantial (Viera - Garrett, 2005).
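As a hedged illustration, Cohen's kappa can be computed from an error matrix as (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e the proportion expected by chance from the row and column totals. The function name `kappa_index` and the sample matrix below are hypothetical and do not reproduce the matrices of the study.

```python
def kappa_index(matrix):
    """Cohen's kappa for a square error matrix of counts."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    # Observed agreement: share of samples on the diagonal.
    p_observed = sum(matrix[i][i] for i in range(k)) / total
    # Chance agreement: product of the marginal proportions, summed over classes.
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2

    if p_expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical error matrix for three classes.
example = [
    [1, 0, 0],
    [0, 2, 1],
    [1, 0, 1],
]
print(f"kappa = {kappa_index(example):.3f}")
```

Because the chance term depends on the row and column totals, two matrices with the same overall accuracy can yield different kappa values.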

Results
During the investigation, the points of the LUCAS database were the references, since they originate from a consistent survey and include accurately determined WRB RSGs. We selected 433 points from the databases which are comparable to the reference database and are located in Hungary. During the selection we observed that there are differences between the WRB RSGs stored in the three databases. The Alisol and the Calcisol reference groups are missing from the LUCAS and ISRIC databases. The Histosol and Solonetz reference groups are missing from the ESDB database. The Phaeozem group is overrepresented in all the applied databases, while Histosol, Leptosol and Solonchak are underrepresented.
Fig. 3. Distribution of WRB RSGs in investigated databases
The overall accuracy of the ISRIC database showed a low value (Overall Accuracy [OA]=47.6%, KIA=0.39). There are remarkable differences between the indices calculated per reference group as well. The PA value shows the ratio of the correctly identified samples among the samples associated with a given reference group, i.e. it reflects how many of them have been miscategorised. This value was the lowest in the case of Arenosol, Chernozem and Vertisol. It was the highest in the case of Fluvisol, Gleysol, Histosol and Solonchak. We have to note that the high values of Alisol and Calcisol do not indicate correct classification, because these groups are not represented in the database. The values of UA also show great diversity. Here the lowest values are associated with the Arenosol, Cambisol and Phaeozem groups (Table 2.).
The overall accuracy of the ESDB database showed a high value (OA=96.99%, KIA=0.96). The indices which were calculated per reference group are also high. The PA value was low only in the case of Gleysol. The UA values were also high; the lowest value was as high as 91.11% (Phaeozem). However, Congalton (1991) pointed out that the two indices should be assessed together, otherwise the results may be misleading. For example, Gleysol showed a PA value of 51.8%, while its UA value is 100%. In other words, only 51.8% of the reference samples which belong to this WRB group were classified into the Gleysol category, whereas 100% of the samples labelled as Gleysol in the database are indeed Gleysol according to the reference (Table 3.).

Discussion
The accuracy investigation of databases with different spatial resolutions showed different results in the sample area. Considering the overall accuracy, the continental scale ESDB proved to be more accurate than the international ISRIC database. The KIA indices of the two databases confirmed this difference, which was the result of the different spatial resolutions. Based on the KIA index, the ESDB can be interpreted as showing almost perfect agreement, while the ISRIC database can be interpreted as showing fair to moderate agreement according to Viera and Garrett (2005). The reason for the difference is that the soil units of the LUCAS samples, which were used as reference, originate from Hungarian soil information and monitoring surveys. There are also great differences between the accuracies of the individual WRB RSGs. Here the differences resulted from the different WRB editions entered into the databases, and from the over- or underrepresentation of the WRB RSGs associated with the sampling points. The current WRB diagnostic system is in its fifth edition, therefore comparing the results from the soil classification and the reference group identification presents significant challenges. Initially the WRB nomenclature differentiated 28 groups; then, due to continuous clarifications, not only have new groups been introduced, but sometimes the order of the groups has been changed as well. This means that the WRB RSGs stored in the given databases based on different WRB editions may draw a misleading picture of the spatial distribution of soil types. Therefore, beyond the WRB harmonization of national soil classification systems, which has been in progress for almost 10 years, it is recommended to harmonize the previous WRB editions as well. The results of the accuracy assessment and the group indices of the RSGs (PA, UA) should be evaluated with reservations, since the LUCAS database used as a reference database does not cover every WRB RSG (e.g. Technosol, Anthrosol) which can be found in Hungary. At the same time, the statistical methods used for the accuracy assessment highlighted that the results and the indices can be interpreted correctly only in the case of a large number of reference points.
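The verbal interpretation of the two KIA values can be sketched as a simple lookup. The band limits below follow the commonly cited scale of Viera and Garrett (2005); treating each upper limit as inclusive, the label returned for values at or below zero and the function name `agreement_label` are assumptions of this sketch.

```python
# Upper limits of the agreement bands (treated as inclusive here).
BANDS = [
    (0.20, "slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
    (1.00, "almost perfect"),
]


def agreement_label(kappa):
    """Map a kappa value to a verbal agreement category."""
    if kappa <= 0:
        return "no better than chance"
    for upper, label in BANDS:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# KIA values reported in this study.
for name, value in [("ESDB", 0.9643), ("ISRIC", 0.3968)]:
    print(f"{name}: KIA = {value} -> {agreement_label(value)} agreement")
```

The ISRIC value of 0.3968 lies just below the 0.40 limit between the fair and moderate bands, which is why its agreement is described as fair to moderate.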

Conclusion
The freely accessible soil databases provide a large amount of soil data for earth science investigations, which can be processed and assessed by statistical methods. Our study focused on the accuracy investigation of various databases, and showed that spatial resolution has a significant impact on the accuracy of databases. However, because of the heterogeneity of soils, the investigations should be extended to the national level, otherwise the statistical indices may provide misleading results.
Table 2. Error matrix of LUCAS reference database and the ISRIC database (UA=User's Accuracy, PA=Producer's Accuracy, OA=Overall Accuracy, KIA=Kappa Index of Agreement)
Table 3. Error matrix of LUCAS reference database and the ESDB database (UA=User's Accuracy, PA=Producer's Accuracy, OA=Overall Accuracy, KIA=Kappa Index of Agreement)

Table 1. Main characteristics of selected databases