
Your Name
A short description about yourself.
About Me
Skills & Interests
- Bioinformatics
- Scientometric tools
- Epigenetics
- AI in Healthcare
- Data Visualization
- Academic Research
- Open Source Software
- Scientific Communication
Content Analysis (Zipf's Law)
Word Frequency Distribution
Zipf's law, named after linguist George Kingsley Zipf (1902-1950), states that the frequency of any word is inversely proportional to its rank in the frequency table. For example, if the most common word occurs n times, the second most common occurs n/2 times, the third most common n/3 times, etc.
Mathematically, this is expressed as f(r) ∝ 1/r^α, where f(r) is the frequency of the word with rank r and α is an exponent close to 1.
This visualization compares the actual vocabulary distribution (blue dots) against the ideal Zipf's Law distribution (dashed line). The phenomenon appears not only in language but across many natural and social systems, reflecting organizational principles of human behavior and information.
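The inverse-proportionality relation above can be checked numerically. A minimal sketch (the top count of 120 is a made-up illustration, not real site data):

```python
def zipf_frequency(top_count: float, rank: int, alpha: float = 1.0) -> float:
    """Ideal frequency under Zipf's law: f(r) = top_count / rank**alpha."""
    return top_count / rank ** alpha

# If the most common word occurs 120 times, the ideal curve with alpha = 1 predicts:
for rank in range(1, 5):
    print(rank, zipf_frequency(120, rank))
# 1 120.0
# 2 60.0
# 3 40.0
# 4 30.0
```

With α = 1 the rank-2 word occurs half as often as the rank-1 word, the rank-3 word a third as often, and so on, matching the n, n/2, n/3 pattern described above.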
About the Cleaned Corpus
The "cleaned corpus" refers to the collection of words processed through several cleaning steps:
- Words are extracted from all posts (titles, descriptions, tags, categories)
- All words are converted to lowercase
- Punctuation and special characters are removed
- Very short words (2 characters or less) are filtered out
- Common stop words like "a", "an", "the", "and", etc. are removed
This cleaning process matters because it removes noise that would skew the frequency analysis, normalizes text so that variant forms are counted as the same word, and excludes common words that occur frequently but carry little meaning.
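The cleaning steps above can be sketched as a small pipeline. This is a hedged illustration, not the site's actual implementation; the stop-word set here is deliberately abbreviated:

```python
import re

# Abbreviated stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is"}

def clean_corpus(texts):
    """Lowercase, strip punctuation, drop words of <= 2 characters and stop words."""
    words = []
    for text in texts:
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if len(token) > 2 and token not in STOP_WORDS:
                words.append(token)
    return words

# Hypothetical post titles, not real site content.
posts = ["The Rise of AI in Healthcare", "Epigenetics and data visualization"]
print(clean_corpus(posts))
# ['rise', 'healthcare', 'epigenetics', 'data', 'visualization']
```

Note how "AI" is dropped by the length filter and "The"/"and" by the stop-word filter, exactly the trade-offs the list above describes.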
References:
- Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley.
- Piantadosi, S. T. (2014). Zipf's word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5), 1112-1130.
- Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
- Jäger, G. (2012). Power laws and other heavy-tailed distributions in linguistic typology. Advances in Complex Systems, 15(3).
- Ferrer-i-Cancho, R., & Solé, R. V. (2003). Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences, 100(3), 788-791.
Word Frequency Analysis
| Rank | Word | Frequency (n/total) | Probability (%) | Ideal Zipf Frequency |
|---|---|---|---|---|
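Each column of the table can be derived from the cleaned word list. A minimal sketch with a hypothetical helper (not the site's actual code):

```python
from collections import Counter

def frequency_table(words, top_n=5, alpha=1.0):
    """Rows of (rank, word, frequency, probability %, ideal Zipf frequency)."""
    counts = Counter(words)
    total = sum(counts.values())
    top_freq = counts.most_common(1)[0][1]
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(top_n), start=1):
        rows.append((rank, word, freq, 100 * freq / total, top_freq / rank ** alpha))
    return rows

# Hypothetical cleaned corpus, not real site data.
corpus = ["data"] * 4 + ["model"] * 2 + ["topic"]
for row in frequency_table(corpus):
    print(row)
```

The "Ideal" column divides the top word's count by rank^α, so plotting it against the observed frequencies shows how closely the corpus follows Zipf's law.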
Topic Analysis (LDA)
Topic Distribution
Topic Details
This visualization uses Latent Dirichlet Allocation (LDA) to discover topics in the content. The analysis is performed using GPU acceleration via TensorFlow.js for optimal performance.
On mobile: tap topics for details, use the toggle button to switch between chart and table views.