
About Me


Skills & Interests

  • Bioinformatics
  • Scientometric tools
  • Epigenetics
  • AI in Healthcare
  • Data Visualization
  • Academic Research
  • Open Source Software
  • Scientific Communication

Activity Calendar

[Contribution-style activity heatmap: one cell per day, months Jan–Dec, shaded from "Less" to "More" activity]

Content Analysis (Zipf's Law)

Word Frequency Distribution

Zipf's law, named after linguist George Kingsley Zipf (1902–1950), states that the frequency of any word is inversely proportional to its rank in the frequency table: if the most common word occurs n times, the second most common occurs roughly n/2 times, the third roughly n/3 times, and so on.

Mathematically expressed as: f(r) ∝ 1/r^α, where f(r) is the frequency of the word with rank r, and the exponent α is close to 1.

This visualization compares the actual vocabulary distribution (blue dots) against the ideal Zipf's Law distribution (dashed line). The phenomenon appears not only in language but across many natural and social systems, reflecting organizational principles of human behavior and information.
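As a minimal sketch of the comparison described above (not the site's actual implementation; the function name `zipf_table` is my own), the observed rank–frequency pairs and the ideal Zipf baseline can be computed with `collections.Counter`:

```python
from collections import Counter

def zipf_table(words, top=5):
    """Return (rank, word, observed freq, ideal Zipf freq) rows.

    The ideal frequency at rank r is n/r, where n is the count of the
    most common word -- the dashed-line baseline in the visualization.
    """
    counts = Counter(words).most_common(top)
    n = counts[0][1]  # frequency of the rank-1 word
    return [(rank, word, freq, n / rank)
            for rank, (word, freq) in enumerate(counts, start=1)]

# A toy corpus engineered to follow Zipf's law exactly:
corpus = ["data"] * 6 + ["gene"] * 3 + ["cell"] * 2
rows = zipf_table(corpus, top=3)
```

For this toy corpus the observed frequencies (6, 3, 2) match the ideal values n/1, n/2, n/3 exactly; real corpora only approximate the line.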

About the Cleaned Corpus

The "cleaned corpus" refers to the collection of words processed through several cleaning steps:

  1. Words are extracted from all posts (titles, descriptions, tags, categories)
  2. All words are converted to lowercase
  3. Punctuation and special characters are removed
  4. Very short words (2 characters or less) are filtered out
  5. Common stop words like "a", "an", "the", "and", etc. are removed

This cleaning process is important because it removes noise that would skew the frequency analysis, normalizes text to ensure word variations are counted as the same word, and excludes common words that occur frequently but don't add much meaning.
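The five cleaning steps above can be sketched in a few lines of Python; this is an illustration of the pipeline rather than the site's actual code, and the stop-word set here is a small illustrative subset of a real list:

```python
import re

# Illustrative subset -- a real stop-word list is much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "etc"}

def clean_corpus(texts):
    """Apply the cleaning steps: lowercase, strip punctuation and
    special characters, drop words of 2 characters or fewer, and
    remove stop words."""
    words = []
    for text in texts:
        # Lowercasing, then keeping only alphabetic runs, covers
        # steps 2-3 (case normalization + punctuation removal).
        for token in re.findall(r"[a-z]+", text.lower()):
            if len(token) > 2 and token not in STOP_WORDS:
                words.append(token)
    return words
```

For example, `clean_corpus(["The DNA, and AI!"])` keeps only `["dna"]`: "the" and "and" are stop words, and "ai" is filtered for being 2 characters or fewer.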

References:

  • Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley.
  • Piantadosi, S. T. (2014). Zipf's word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5), 1112-1130.
  • Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
  • Jäger, G. (2012). Power laws and other heavy-tailed distributions in linguistic typology. Advances in Complex Systems, 15(3).
  • Ferrer-i-Cancho, R., & Solé, R. V. (2003). Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences, 100(3), 788-791.

Word Frequency Analysis

Rank | Word | Freq (n/total) | Pr (%) | Ideal
Source: Analysis of content from titles, descriptions, tags, and categories across all posts in this knowledge base.

Topic Analysis (LDA)

Topic Distribution

Topic Details


This visualization uses Latent Dirichlet Allocation (LDA) to discover topics in the content. The analysis is performed using GPU acceleration via TensorFlow.js for optimal performance.
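The site runs LDA in the browser via TensorFlow.js; as a language-agnostic sketch of the underlying technique only (all names here are my own, and this is not the site's implementation), a minimal collapsed Gibbs sampler for LDA looks like this:

```python
import random

def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.

    docs: list of tokenized documents; k: number of topics.
    Returns (topics, doc_topic_counts) where topics[j] is the
    highest-count word for topic j.
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}
    z = [[rng.randrange(k) for _ in d] for d in docs]  # topic of each token
    ndk = [[0] * k for _ in docs]                      # doc-topic counts
    nkw = [[0] * V for _ in range(k)]                  # topic-word counts
    nk = [0] * k                                       # tokens per topic
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            t = z[d][n]
            ndk[d][t] += 1; nkw[t][widx[w]] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t, wi = z[d][n], widx[w]
                ndk[d][t] -= 1; nkw[t][wi] -= 1; nk[t] -= 1
                # Resample the token's topic proportional to the
                # collapsed posterior (doc-topic term * topic-word term).
                wts = [(ndk[d][j] + alpha) * (nkw[j][wi] + beta) / (nk[j] + V * beta)
                       for j in range(k)]
                r = rng.random() * sum(wts)
                t = 0
                while r > wts[t] and t < k - 1:
                    r -= wts[t]
                    t += 1
                z[d][n] = t
                ndk[d][t] += 1; nkw[t][wi] += 1; nk[t] += 1
    topics = [vocab[max(range(V), key=lambda i: nkw[j][i])] for j in range(k)]
    return topics, ndk
```

The `ndk` counts are what a "topic distribution" chart like the one above summarizes per document; production implementations add convergence checks and vectorized (e.g. GPU) sampling.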

On mobile: tap topics for details, use the toggle button to switch between chart and table views.