WEKA Experimenter Tutorial for Version 3-5-8

David Scuse
Peter Reutemann

July 14, 2008

© 2002-2008 David Scuse and University of Waikato

Contents

1 Introduction
2 Standard Experiments
  2.1 Simple
    2.1.1 New experiment
    2.1.2 Results destination
    2.1.3 Experiment type
    2.1.4 Datasets
    2.1.5 Iteration control
    2.1.6 Algorithms
    2.1.7 Saving the setup
    2.1.8 Running an Experiment
  2.2 Advanced
    2.2.1 Defining an Experiment
    2.2.2 Running an Experiment
    2.2.3 Changing the Experiment Parameters
    2.2.4 Other Result Producers
3 Remote Experiments
  3.1 Preparation
  3.2 Database Server Setup
  3.3 Remote Engine Setup
  3.4 Configuring the Experimenter
  3.5 Troubleshooting
4 Analysing Results
  4.1 Setup
  4.2 Saving the Results
  4.3 Changing the Baseline Scheme
  4.4 Statistical Significance
  4.5 Summary Test
  4.6 Ranking Test

1 Introduction

The Weka Experiment Environment enables the user to create, run, modify, and analyse experiments in a more convenient manner than is possible when processing the schemes individually. For example, the user can create an experiment that runs several schemes against a series of datasets and then analyse the results to determine if one of the schemes is (statistically) better than the other schemes.

The Experiment Environment can be run from the command line using the Simple CLI. For example, the following commands could be typed into the CLI to run the OneR scheme on the Iris dataset using a basic train and test process. (Note that the commands would be typed on one line into the CLI.)

java weka.experiment.Experiment -r -T data/iris.arff
  -D weka.experiment.InstancesResultListener
  -P weka.experiment.RandomSplitResultProducer --
  -W weka.experiment.ClassifierSplitEvaluator --
  -W weka.classifiers.rules.OneR

While commands can be typed directly into the CLI, this technique is not particularly convenient and the experiments are not easy to modify.

The Experimenter comes in two flavours: a simple interface that provides most of the functionality one needs for experiments, and an interface with full access to the Experimenter's capabilities. You can choose between the two with the Experiment Configuration Mode radio buttons:

• Simple
• Advanced

Both setups allow you to set up standard experiments, which run locally on a single machine, or remote experiments, which are distributed between several hosts. Distributing an experiment cuts down the time it takes to complete, but on the other hand the setup takes more time.

The next sections cover the standard experiments (both simple and advanced), followed by the remote experiments and finally the analysis of the results.
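In the command above, each "--" marks the end of the options for the current component; everything that follows is handed to the object named by the preceding -P, -D, or -W option (here: the result producer, then the split evaluator, then the classifier). The convention can be illustrated with a minimal splitter; this is a sketch of the idea, not Weka's own option parser:

```java
import java.util.Arrays;

public class OptionSplit {
    // Split an option array at the first "--": everything before it belongs
    // to the current component, everything after it is passed on verbatim
    // to the nested component.
    public static String[][] splitAtDoubleDash(String[] opts) {
        for (int i = 0; i < opts.length; i++) {
            if (opts[i].equals("--")) {
                return new String[][] {
                    Arrays.copyOfRange(opts, 0, i),
                    Arrays.copyOfRange(opts, i + 1, opts.length)
                };
            }
        }
        return new String[][] { opts, new String[0] };
    }

    public static void main(String[] args) {
        String[] opts = {
            "-P", "weka.experiment.RandomSplitResultProducer", "--",
            "-W", "weka.experiment.ClassifierSplitEvaluator", "--",
            "-W", "weka.classifiers.rules.OneR"
        };
        String[][] parts = splitAtDoubleDash(opts);
        System.out.println("this component: " + Arrays.toString(parts[0]));
        System.out.println("passed on:      " + Arrays.toString(parts[1]));
    }
}
```

Applying the same split again to the "passed on" portion peels off the evaluator's options and leaves the classifier's.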
This manual is also available online on the WekaDoc Wiki [7].

2 Standard Experiments

2.1 Simple

2.1.1 New experiment

After clicking New, default parameters for an experiment are defined.

2.1.2 Results destination

By default, an ARFF file is the destination for the results output, but you can choose between:

• ARFF file
• CSV file
• JDBC database

ARFF files and JDBC databases are discussed in detail in the following sections. CSV is similar to ARFF, but the file can also be loaded into an external spreadsheet application.

ARFF file

If the file name is left empty, a temporary file will be created in the TEMP directory of the system. To specify an explicit results file, click on Browse and choose a filename, e.g., Experiment1.arff. Click on Save and the name will appear in the edit field next to ARFF file.

The advantage of ARFF or CSV files is that they can be created without any additional classes besides the ones from Weka. The drawback is the inability to resume an experiment that was interrupted, e.g., due to an error or the addition of datasets or algorithms. Especially with time-consuming experiments, this behavior can be annoying.

JDBC database

With JDBC it is easy to store the results in a database. The necessary jar archives have to be in the CLASSPATH to make the JDBC functionality of a particular database available. After changing ARFF file to JDBC database, click on User... to specify the JDBC URL and user credentials for accessing the database.

After supplying the necessary data and clicking on OK, the URL in the main window will be updated. Note: at this point, the database connection is not tested; this is done when the experiment is started.

The advantage of a JDBC database is the possibility to resume an interrupted or extended experiment. Instead of re-running all the other algorithm/dataset combinations again, only the missing ones are computed.
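The resume behaviour can be pictured as a set difference over algorithm/dataset combinations: combinations with stored results are skipped, and only the rest are computed. The following is an illustrative sketch of that idea, not Weka's actual implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MissingRuns {
    // Determine which algorithm/dataset combinations still need to run,
    // given the set of combinations already stored in the database.
    // Keys like "algorithm/dataset" are an invented encoding for the sketch.
    public static List<String> missing(String[] algorithms, String[] datasets,
                                       Set<String> done) {
        List<String> todo = new ArrayList<>();
        for (String a : algorithms)
            for (String d : datasets) {
                String key = a + "/" + d;
                if (!done.contains(key)) todo.add(key);
            }
        return todo;
    }

    public static void main(String[] args) {
        // suppose the iris runs finished before the experiment was interrupted
        Set<String> done = new HashSet<>(Arrays.asList("ZeroR/iris", "J48/iris"));
        System.out.println(missing(new String[]{"ZeroR", "J48"},
                                   new String[]{"iris", "glass"}, done));
    }
}
```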
2.1.3 Experiment type

The user can choose between the following three different types:

• Cross-validation (default): performs stratified cross-validation with the given number of folds.
• Train/Test Percentage Split (data randomized): splits a dataset according to the given percentage into a train and a test file (one cannot specify explicit training and test files in the Experimenter), after the order of the data has been randomized and stratified.
• Train/Test Percentage Split (order preserved): because it is impossible to specify an explicit train/test file pair, one can abuse this type to un-merge previously merged train and test files into the two original files (one only needs to find out the correct percentage).

Additionally, one can choose between Classification and Regression, depending on the datasets and classifiers one uses. For decision trees like J48 (Weka's implementation of Quinlan's C4.5 [3]) and the iris dataset, Classification is necessary; for a numeric classifier like M5P, on the other hand, Regression. Classification is selected by default.

Note: if percentage splits are used, one has to make sure that the corrected paired T-Tester still produces sensible results with the given ratio [2].

2.1.4 Datasets

One can add dataset files either with an absolute path or with a relative one. The latter often makes it easier to run experiments on different machines, hence one should check Use relative paths before clicking on Add new....

In this example, open the data directory and choose the iris.arff dataset. After clicking Open the file will be displayed in the datasets list. If one selects a directory and hits Open, then all ARFF files will be added recursively. Files can be deleted from the list by selecting them and then clicking on Delete selected.

ARFF is not the only format one can load; any file that can be converted with Weka's "core converters" may be used.
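Returning to the note in section 2.1.3: the corrected paired T-Tester adjusts the variance of the ordinary resampled t-test to compensate for the overlap between the training sets of different runs (Nadeau and Bengio's correction, reference [2]). A plain-Java sketch of the corrected statistic, written from the published formula rather than from Weka's source:

```java
public class CorrectedTTest {
    /**
     * Corrected resampled t-statistic for k paired performance
     * differences d[i], where each run uses n1 training and n2 test
     * instances:  t = mean(d) / sqrt((1/k + n2/n1) * var(d)).
     * The n2/n1 term is the correction; without it this is the
     * standard paired t-test.
     */
    public static double correctedT(double[] d, int n1, int n2) {
        int k = d.length;
        double mean = 0;
        for (double v : d) mean += v;
        mean /= k;
        double var = 0;
        for (double v : d) var += (v - mean) * (v - mean);
        var /= (k - 1);                       // unbiased sample variance
        return mean / Math.sqrt((1.0 / k + (double) n2 / n1) * var);
    }

    public static void main(String[] args) {
        // illustrative accuracy differences from three runs of a 66%/34%
        // split of the 150-instance iris data (99 train, 51 test)
        double[] diffs = {1.0, 2.0, 3.0};
        System.out.println(correctedT(diffs, 99, 51));
    }
}
```

The larger the test-set fraction n2/n1, the stronger the variance inflation, which is why extreme split ratios can render the test uninformative.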
The following formats are currently supported:

• ARFF (+ compressed)
• C4.5
• CSV
• libsvm
• binary serialized instances
• XRFF (+ compressed)

By default, the class attribute is assumed to be the last attribute. But if a data format contains information about the class attribute, like XRFF or C4.5, this attribute will be used instead.

2.1.5 Iteration control

• Number of repetitions: in order to get statistically meaningful results, the default number of iterations is 10. In the case of 10-fold cross-validation, this means 100 calls of one classifier, each trained on training data and tested against test data.
• Data sets first/Algorithms first: as soon as one has more than one dataset and algorithm, it can be useful to switch from datasets being iterated over first to algorithms. This is the case if one stores the results in a database and wants to complete the results for all the datasets for one algorithm as early as possible.

2.1.6 Algorithms

New algorithms can be added via the Add new... button. When this dialog is opened for the first time, ZeroR is presented; otherwise, the classifier that was selected last. With the Choose button one can open the GenericObjectEditor and choose another classifier.

The Filter... button enables one to highlight classifiers that can handle certain attribute and class types. With the Remove filter button all the selected capabilities are cleared and the highlighting removed again.

Additional algorithms can be added again with the Add new... button, e.g., the J48 decision tree. After setting the classifier parameters, one clicks on OK to add it to the list of algorithms.

With the Load options... and Save options... buttons one can load and save the setup of a selected classifier from and to XML. This is especially useful for highly configured classifiers (e.g., nested meta-classifiers), where the manual setup takes quite some time, and which are used often.
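The Data sets first/Algorithms first choice of section 2.1.5 amounts to swapping two nested loops. The sketch below illustrates the resulting execution orders; the method and names are invented for illustration and are not Weka API:

```java
import java.util.ArrayList;
import java.util.List;

public class IterationOrder {
    public static List<String> order(String[] datasets, String[] algorithms,
                                     boolean datasetsFirst) {
        List<String> runs = new ArrayList<>();
        if (datasetsFirst) {
            // finish every algorithm on one dataset before moving on
            for (String d : datasets)
                for (String a : algorithms)
                    runs.add(a + " on " + d);
        } else {
            // finish one algorithm on every dataset as early as possible,
            // e.g., to get complete per-algorithm results into a database
            for (String a : algorithms)
                for (String d : datasets)
                    runs.add(a + " on " + d);
        }
        return runs;
    }

    public static void main(String[] args) {
        String[] ds = {"iris", "glass"};
        String[] algos = {"ZeroR", "J48"};
        System.out.println("datasets first:   " + order(ds, algos, true));
        System.out.println("algorithms first: " + order(ds, algos, false));
    }
}
```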
One can also paste classifier settings here by right-clicking (or Alt-Shift-left-clicking) and selecting the appropriate menu item from the popup menu, to either add a new classifier or replace the selected one with a new setup. This is rather useful for transferring a classifier setup from the Weka Explorer over to the Experimenter without having to set up the classifier from scratch.

2.1.7 Saving the setup

For future re-use, one can save the current setup of the experiment to a file by clicking on Save... at the top of the window. By default, the format of the experiment files is the binary format that Java serialization offers. The drawback of this format is the possible incompatibility between different versions of Weka. A more robust alternative to the binary format is the XML format. Previously saved experiments can be loaded again via the Open... button.

2.1.8 Running an Experiment

To run the current experiment, click the Run tab at the top of the Experiment Environment window. The current experiment performs 10 runs of 10-fold stratified cross-validation on the Iris dataset using the ZeroR and J48 schemes. Click Start to run the experiment.

If the experiment was defined correctly, the three messages shown above will be displayed in the Log panel. The results of the experiment are saved to the dataset Experiment1.arff.

2.2 Advanced

2.2.1 Defining an Experiment

When the Experimenter is started in Advanced mode, the Setup tab is displayed. Click New to initialize an experiment. This causes default parameters to be defined for the experiment.

To define the dataset to be processed by a scheme, first select Use relative paths in the Datasets panel of the Setup tab and then click on Add new... to open a dialog window. Double click on the data folder to view the available datasets or navigate to an alternate location. Select iris.arff and click Open to select the Iris dataset. The dataset name is now displayed in the Datasets panel of the Setup tab.
Saving the Results of the Experiment

To identify a dataset to which the results are to be sent, click on the InstancesResultListener entry in the Destination panel. The output file parameter is near the bottom of the window, beside the text outputFile. Click on this parameter to display a file selection window. Type the name of the output file, click Select, and then click close (x). The file name is displayed in the outputFile panel. Click on OK to close the window. The dataset name is displayed in the Destination panel of the Setup tab.

Saving the Experiment Definition

The experiment definition can be saved at any time. Select Save... at the top of the Setup tab. Type the file name with the extension exp (or select the name if the experiment definition file already exists) for binary files, or choose Experiment configuration files (*.xml) from the file types combobox (the XML files are robust with respect to version changes). The experiment can be restored by selecting Open in the Setup tab and then selecting Experiment1.exp in the dialog window.

2.2.2 Running an Experiment

To run the current experiment, click the Run tab at the top of the Experiment Environment window. The current experiment performs 10 randomized train and test runs on the Iris dataset, using 66% of the patterns for training and 34% for testing, and using the ZeroR scheme. Click Start to run the experiment.

If the experiment was defined correctly, the three messages shown above will be displayed in the Log panel. The results of the experiment are saved to the dataset Experiment1.arff. The first few lines in this dataset are shown below.
@relation InstanceResultListener

@attribute Key_Dataset {iris}
@attribute Key_Run {1,2,3,4,5,6,7,8,9,10}
@attribute Key_Scheme {weka.classifiers.rules.ZeroR,weka.classifiers.trees.J48}
@attribute Key_Scheme_options {,'-C 0.25 -M 2'}
@attribute Key_Scheme_version_ID {48055541465867954,-217733168393644444}
@attribute Date_time numeric
@attribute Number_of_training_instances numeric
@attribute Number_of_testing_instances numeric
@attribute Number_correct numeric
@attribute Number_incorrect numeric
@attribute Number_unclassified numeric
@attribute Percent_correct numeric
@attribute Percent_incorrect numeric
@attribute Percent_unclassified numeric
@attribute Kappa_statistic numeric
@attribute Mean_absolute_error numeric
@attribute Root_mean_squared_error numeric
@attribute Relative_absolute_error numeric
@attribute Root_relative_squared_error numeric
@attribute SF_prior_entropy numeric
@attribute SF_scheme_entropy numeric
@attribute SF_entropy_gain numeric
@attribute SF_mean_prior_entropy numeric
@attribute SF_mean_scheme_entropy numeric
@attribute SF_mean_entropy_gain numeric
@attribute KB_information numeric
@attribute KB_mean_information numeric
@attribute KB_relative_information numeric
@attribute True_positive_rate numeric
@attribute Num_true_positives numeric
@attribute False_positive_rate numeric
@attribute Num_false_positives numeric
@attribute True_negative_rate numeric
@attribute Num_true_negatives numeric
@attribute False_negative_rate numeric
@attribute Num_false_negatives numeric
@attribute IR_precision numeric
@attribute IR_recall numeric
@attribute F_measure numeric
@attribute Area_under_ROC numeric
@attribute Time_training numeric
@attribute Time_testing numeric
@attribute Summary {'Number of leaves: 3\nSize of the tree: 5\n','Number of leaves: 5\nSize of the tree: 9\n','Number of leaves: 4\nSize of the tree: 7\n'}
@attribute measureTreeSize numeric
@attribute measureNumLeaves numeric
@attribute measureNumRules numeric

@data
iris,1,weka.classifiers.rules.ZeroR,,48055541465867954,20051221.033,99,51,17,34,0,33.333333,66.666667,0,0,0.444444,0.471405,100,100,80.833088,80.833088,0,1.584963,1.584963,0,0,0,0,1,17,1,34,0,0,0,0,0.333333,1,0.5,0.5,0,0,?,?,?,?

2.2.3 Changing the Experiment Parameters

Changing the Classifier

The parameters of an experiment can be changed by clicking on the Result generator panel. The RandomSplitResultProducer performs repeated train/test runs. The number of instances (expressed as a percentage) used for training is given in the trainPercent box. (The number of runs is specified in the Runs panel in the Setup tab.) A small help file can be displayed by clicking More in the About panel.

Click on the splitEvaluator entry to display the SplitEvaluator properties. Click on the classifier entry (ZeroR) to display the scheme properties. This scheme has no modifiable properties (besides debug mode on/off) but most other schemes do have properties that can be modified by the user. The Capabilities button opens a small dialog listing all the attribute and class types this classifier can handle. Click on the Choose button to select a different scheme. The window below shows the parameters available for the J48 decision-tree scheme. If desired, modify the parameters and then click OK to close the window.

The name of the new scheme is displayed in the Result generator panel.

Adding Additional Schemes

Additional schemes can be added in the Generator properties panel. To begin, change the drop-down list entry from Disabled to Enabled in the Generator properties panel.

Click Select property and expand splitEvaluator so that the classifier entry is visible in the property list; click Select. The scheme name is displayed in the Generator properties panel.

To add another scheme, click on the Choose button to display the GenericObjectEditor window. The Filter... button enables one to highlight classifiers that can handle certain attribute and class types.
With the Remove filter button all the selected capabilities are cleared and the highlighting removed again. To change to a decision-tree scheme, select J48 (in subgroup trees).

The new scheme is added to the Generator properties panel. Click Add to add the new scheme. Now when the experiment is run, results are generated for both schemes. To add additional schemes, repeat this process. To remove a scheme, select the scheme by clicking on it and then click Delete.

Adding Additional Datasets

The scheme(s) may be run on any number of datasets at a time. Additional datasets are added by clicking Add new... in the Datasets panel. Datasets are deleted from the experiment by selecting the dataset and then clicking Delete Selected.

Raw Output

The raw output generated by a scheme during an experiment can be saved to a file and then examined at a later time. Open the ResultProducer window by clicking on the Result generator panel in the Setup tab. Click on rawOutput and select the True entry from the drop-down list. By default, the output is sent to the zip file splitEvaluatorOut.zip. The output file can be changed by clicking on the outputFile panel in the window. Now when the experiment is run, the result of each processing run is archived, as shown below.
The contents of the first run are:

ClassifierSplitEvaluator: weka.classifiers.trees.J48 -C 0.25 -M 2
(version -217733168393644444)

Classifier model:

J48 pruned tree
------------------

petalwidth <= 0.6: Iris-setosa (33.0)
petalwidth > 0.6
|   petalwidth <= 1.5: Iris-versicolor (31.0/1.0)
|   petalwidth > 1.5: Iris-virginica (35.0/3.0)

Number of Leaves  : 3
Size of the tree  : 5

Correctly Classified Instances          47               92.1569 %
Incorrectly Classified Instances         4                7.8431 %
Kappa statistic                          0.8824
Mean absolute error                      0.0723
Root mean squared error                  0.2191
Relative absolute error                 16.2754 %
Root relative squared error             46.4676 %
Total Number of Instances               51

measureTreeSize : 5.0
measureNumLeaves : 3.0
measureNumRules : 3.0

2.2.4 Other Result Producers

Cross-Validation Result Producer

To change from random train and test experiments to cross-validation experiments, click on the Result generator entry. At the top of the window, click on the drop-down list and select CrossValidationResultProducer. The window now contains parameters specific to cross-validation, such as the number of partitions/folds. The experiment performs 10-fold cross-validation ins
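The summary percentages in the raw output above follow directly from the instance counts: 47 correctly classified of 51 test instances gives the reported 92.1569 %. A quick check:

```java
public class SplitSummary {
    // Percentage of correctly (or incorrectly) classified instances,
    // as reported in the Experimenter's raw output.
    public static double percentCorrect(int correct, int total) {
        return 100.0 * correct / total;
    }

    public static void main(String[] args) {
        int correct = 47, total = 51;   // counts from the first run above
        System.out.printf("Correctly Classified Instances   %d   %.4f %%%n",
                correct, percentCorrect(correct, total));
        System.out.printf("Incorrectly Classified Instances %d   %.4f %%%n",
                total - correct, percentCorrect(total - correct, total));
    }
}
```

The 51 test instances are the 34 % remainder of the 150 iris instances left over after the 66 % (99-instance) training split.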