Structured knowledge is important for many AI applications. Commonsense knowledge, which is crucial for robust human-centric AI, is covered by a small number of structured knowledge projects. However, they lack knowledge about human traits and behaviors conditioned on socio-cultural contexts, which is crucial for situative AI. In this project, we present Candle, an end-to-end methodology for extracting high-quality cultural commonsense knowledge (CCSK) at scale. Candle extracts CCSK assertions from a huge web corpus and organizes them into coherent clusters, for 3 domains of subjects (geography, religion, occupation) and several cultural facets (food, drinks, clothing, traditions, rituals, behaviors). Candle includes judicious techniques for classification-based filtering and scoring of interestingness. Experimental evaluations show the superiority of the Candle CCSK collection over prior works, and an extrinsic use case demonstrates the benefits of CCSK for the GPT-3 language model.
The output of Candle is a set of 1.1M CCSK assertions, organized into 60K coherent clusters. The set spans 3 domains of interest (geography, religion, occupation) with a total of 386 instances, referred to as subjects (or cultural groups). Per subject, the assertions cover 5 facets of culture: food, drinks, clothing, rituals, and either traditions (for geography and religion) or behaviors (for occupations). We also annotate each assertion with its salient concepts.
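To make the organization of the collection concrete, the following is a minimal sketch of how a cluster record and a simple lookup could be modelled. It is not the project's actual schema or code: the `Cluster` class, the `FACETS_BY_DOMAIN` table, and the `select_clusters` helper are hypothetical names, the records are placeholders rather than real CCSK assertions, and the sketch assumes a cluster matches only if it carries all of the requested concepts.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumption: four facets are shared by all domains; the fifth is
# "traditions" for geography/religion and "behaviors" for occupation.
SHARED_FACETS = {"food", "drinks", "clothing", "rituals"}
FACETS_BY_DOMAIN = {
    "geography": SHARED_FACETS | {"traditions"},
    "religion": SHARED_FACETS | {"traditions"},
    "occupation": SHARED_FACETS | {"behaviors"},
}


@dataclass
class Cluster:
    """Hypothetical record for one cluster of CCSK assertions."""

    domain: str
    subject: str
    facet: str
    representative: str  # assertion chosen to represent the cluster
    assertions: list[str] = field(default_factory=list)
    concepts: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # Reject domain/facet combinations the description does not allow.
        allowed = FACETS_BY_DOMAIN.get(self.domain)
        if allowed is None:
            raise ValueError(f"unknown domain: {self.domain!r}")
        if self.facet not in allowed:
            raise ValueError(
                f"facet {self.facet!r} is not used for domain {self.domain!r}"
            )


def select_clusters(clusters, subject=None, facet=None, concepts=None):
    """Return clusters matching the given subject, facet, and concepts.

    A filter left as None is ignored; `concepts` must all be present.
    """
    wanted = set(concepts or ())
    return [
        c
        for c in clusters
        if (subject is None or c.subject == subject)
        and (facet is None or c.facet == facet)
        and wanted <= c.concepts
    ]


# Placeholder records only; these are not real CCSK assertions.
toy = [
    Cluster("geography", "subject-A", "food", "placeholder assertion 1",
            ["placeholder assertion 1"], {"concept-x"}),
    Cluster("geography", "subject-A", "drinks", "placeholder assertion 2",
            ["placeholder assertion 2"], {"concept-y"}),
    Cluster("occupation", "subject-B", "behaviors", "placeholder assertion 3",
            ["placeholder assertion 3"], {"concept-x"}),
]
hits = select_clusters(toy, subject="subject-A", facet="food")
```

Validating the facet against the domain at construction time keeps invalid combinations, such as an occupation cluster labelled with traditions, out of the collection under the reading assumed here.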
This web interface allows you to browse the extracted CCSK assertions.
To explore what our CCSK collection captures, you can use the querying interface below, or try some of the following examples (domain — facet — subject):
Each result row lists a cluster's facet, domain, subject, and representative assertion, followed by its scores (tf, idf, noun_density, prob, combined_score) and its size.