kb/data/en.wikipedia.org/wiki/Data_analysis-4.md

6.5 KiB
Raw Blame History

title chunk source category tags date_saved instance
Data analysis 5/5 https://en.wikipedia.org/wiki/Data_analysis reference science, encyclopedia 2026-05-05T09:53:29.395055+00:00 kb-cron

==== Exploratory and confirmatory approaches ==== In the main analysis phase, either an exploratory or confirmatory approach can be adopted. Usually the approach is decided before data is collected. In an exploratory analysis no clear hypothesis is stated before analysing the data, and the data is searched for models that describe the data well. In a confirmatory analysis, clear hypotheses about the data are tested. Exploratory data analysis should be interpreted carefully. When testing multiple models at once there is a high chance on finding at least one of them to be significant, but this can be due to a type 1 error. It is important to always adjust the significance level when testing multiple models with, for example, a Bonferroni correction. Also, one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. An exploratory analysis is used to find ideas for a theory, but not to test that theory as well. When a model is found exploratory in a dataset, then following up that analysis with a confirmatory analysis in the same dataset could simply mean that the results of the confirmatory analysis are due to the same type 1 error that resulted in the exploratory model in the first place. The confirmatory analysis therefore will not be more informative than the original exploratory analysis.

==== Stability of results ==== It is important to obtain some indication about how generalizable the results are. While this is often difficult to check, one can look at the stability of the results. Are the results reliable and reproducible? There are two main ways of doing that.

Cross-validation. By splitting the data into multiple parts, we can check if an analysis (like a fitted model) based on one part of the data generalizes to another part of the data as well. Cross-validation is generally inappropriate, though, if there are correlations within the data, e.g. with panel data. Hence other methods of validation sometimes need to be used. For more on this topic, see statistical model validation. Sensitivity analysis. A procedure to study the behavior of a system or model when global parameters are (systematically) varied. One way to do that is via bootstrapping.

== Free software for data analysis == Free software for data analysis include:

DevInfo A database system endorsed by the United Nations Development Group for monitoring and analyzing human development. ELKI Data mining framework in Java with data mining oriented visualization functions. KNIME The Konstanz Information Miner, a user friendly and comprehensive data analytics framework. Orange A visual programming tool featuring interactive data visualization and methods for statistical data analysis, data mining, and machine learning. Pandas Python library for data analysis. PAW FORTRAN/C data analysis framework developed at CERN. R A programming language and software environment for statistical computing and graphics. ROOT C++ data analysis framework developed at CERN. SciPy Python library for scientific computing. Julia A programming language well-suited for numerical analysis and computational science.

== Reproducible analysis == The typical data analysis workflow involves collecting data, running analyses, creating visualizations, and writing reports. However, this workflow presents challenges, including a separation between analysis scripts and data, as well as a gap between analysis and documentation. Often, the correct order of running scripts is only described informally or resides in the data scientist's memory. The potential for losing this information creates issues for reproducibility. To address these challenges, it is essential to document analysis script content and workflow. Additionally, overall documentation is crucial, as well as providing reports that are understandable by both machines and humans, and ensuring accurate representation of the analysis workflow even as scripts evolve.

== Data analysis contests == Different companies and organizations hold data analysis contests to encourage researchers to utilize their data or to solve a particular question using data analysis. A few examples of well-known international data analysis contests are:

Kaggle competitions; the Kaggle platform is owned and run by Google. LTPP data analysis contest held by FHWA and ASCE.

== See also ==

== References ==

=== Citations ===

=== Bibliography === Adèr, Herman J. (2008a). "Chapter 14: Phases and initial steps in data analysis". In Adèr, Herman J.; Mellenbergh, Gideon J.; Hand, David J (eds.). Advising on research methods : a consultant's companion. Huizen, Netherlands: Johannes van Kessel Pub. pp. 333356. ISBN 9789079418015. OCLC 905799857. Adèr, Herman J. (2008b). "Chapter 15: The main analysis phase". In Adèr, Herman J.; Mellenbergh, Gideon J.; Hand, David J (eds.). Advising on research methods : a consultant's companion. Huizen, Netherlands: Johannes van Kessel Pub. pp. 357386. ISBN 9789079418015. OCLC 905799857. Tabachnick, B.G. & Fidell, L.S. (2007). Chapter 4: Cleaning up your act. Screening data prior to analysis. In B.G. Tabachnick & L.S. Fidell (Eds.), Using Multivariate Statistics, Fifth Edition (pp. 60116). Boston: Pearson Education, Inc. / Allyn and Bacon.

== Further reading ==

Adèr, H.J. & Mellenbergh, G.J. (with contributions by D.J. Hand) (2008). Advising on Research Methods: A Consultant's Companion. Huizen, the Netherlands: Johannes van Kessel Publishing. ISBN 978-90-79418-01-5 Chambers, John M.; Cleveland, William S.; Kleiner, Beat; Tukey, Paul A. (1983). Graphical Methods for Data Analysis, Wadsworth/Duxbury Press. ISBN 0-534-98052-X Fandango, Armando (2017). Python Data Analysis, 2nd Edition. Packt Publishers. ISBN 978-1787127487 Juran, Joseph M.; Godfrey, A. Blanton (1999). Juran's Quality Handbook, 5th Edition. New York: McGraw Hill. ISBN 0-07-034003-X Lewis-Beck, Michael S. (1995). Data Analysis: an Introduction, Sage Publications Inc, ISBN 0-8039-5772-6 NIST/SEMATECH (2008) Handbook of Statistical Methods Pyzdek, T, (2003). Quality Engineering Handbook, ISBN 0-8247-4614-7 Richard Veryard (1984). Pragmatic Data Analysis. Oxford : Blackwell Scientific Publications. ISBN 0-632-01311-7 Tabachnick, B.G.; Fidell, L.S. (2007). Using Multivariate Statistics, 5th Edition. Boston: Pearson Education, Inc. / Allyn and Bacon, ISBN 978-0-205-45938-4